Published June 2019 | Submitted + Published + Supplemental Material
Journal Article Open

Open Vocabulary Learning on Source Code with a Graph-Structured Cache

Abstract

Machine learning models that take computer program source code as input typically use Natural Language Processing (NLP) techniques. However, a major challenge is that code is written using an open, rapidly changing vocabulary due to, e.g., the coinage of new variable and method names. Reasoning over such a vocabulary is not something for which most NLP methods are designed. We introduce a Graph-Structured Cache to address this problem; this cache contains a node for each new word the model encounters with edges connecting each word to its occurrences in the code. We find that combining this graph-structured cache strategy with recent Graph-Neural-Network-based models for supervised learning on code improves the models' performance on a code completion task and a variable naming task — with over 100% relative improvement on the latter — at the cost of a moderate increase in computation time.
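The cache described in the abstract can be pictured as a small bipartite graph: one node per word, linked to every place that word occurs in the code. The following is a minimal sketch of that idea only, not the authors' implementation. The helper names (`split_name`, `build_cache`), the regex-based tokenizer, and the choice to split identifiers on camelCase and snake_case boundaries are all assumptions made for illustration; a real system would work on a parsed program graph rather than raw text.

```python
import re
from collections import defaultdict

# Assumption: any identifier-like token counts as an occurrence,
# including language keywords such as "int".
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# Assumption: names are split into words on camelCase / snake_case boundaries.
SUBWORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_name(name):
    """Split an identifier into lowercase words (hypothetical helper)."""
    return [w.lower() for w in SUBWORD.findall(name)]


def build_cache(source):
    """Build a toy graph-structured cache from raw source text.

    Returns (occurrences, edges):
      occurrences: list of (index, identifier) pairs, one per token occurrence
      edges: dict mapping each word node to the set of occurrence indices
             it is connected to
    """
    occurrences = []
    edges = defaultdict(set)
    for idx, match in enumerate(IDENTIFIER.finditer(source)):
        name = match.group()
        occurrences.append((idx, name))
        # A word seen for the first time gets a new node implicitly;
        # a word seen before just gains another edge.
        for word in split_name(name):
            edges[word].add(idx)
    return occurrences, dict(edges)


if __name__ == "__main__":
    code = "int itemCount = 0; itemCount = itemCount + max_item;"
    occurrences, edges = build_cache(code)
    for word, linked in sorted(edges.items()):
        print(word, sorted(linked))
```

In this sketch a previously unseen name such as `max_item` adds no out-of-vocabulary token: it reuses the existing `item` node and adds one new `max` node, which is the property that lets a graph-based model reason over an open vocabulary.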

Additional Information

© 2019 by the author(s). Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Many thanks to Miltos Allamanis and Hyokun Yun for their advice and useful conversations.

Attached Files

Published - cvitkovic19b.pdf

Submitted - 1810.08305.pdf

Supplemental Material - cvitkovic19b-supp.pdf

Files

Total size: 2.7 MB
md5:c0d0a6d8558146398beea2e7537e76aa (82.5 kB)
md5:659e781df72a534b4367519f753da453 (1.3 MB)
md5:0224fe27fa33ac98a48e93d151ee7561 (1.3 MB)

Additional details

Created: August 19, 2023
Modified: October 20, 2023