Representing source code

Source code representation is the task of converting source code into data that can be consumed by ML models. The conversion should minimize the loss of information.

Forms of representation

  • Text-only representations

    • Treating source code as natural language text
  • Structured representations

    • Using the AST and / or semantic information
  • Graph representations, using data/control flow information

  • Combined representations

Tokens

Similar to other NLP applications, textual representations treat source code as a sequence of tokens.

def add(x, y):
  return x + y

Tokens comprise:

  • language syntax: def, (, ), :, +, return
  • developer-defined identifiers: add, x, y

Tokens can be considered:

  • as a list / sequence (most frequently)
  • as a set / bag, when order is not important
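
As an illustration, the snippet above can be split into tokens with Python's built-in tokenize module (a minimal sketch; filtering out whitespace-only tokens is a choice made here, not part of the module):

import io
import tokenize

src = 'def add(x, y):\n  return x + y\n'

# generate_tokens yields (type, string, start, end, line) records
tokens = [t.string
          for t in tokenize.generate_tokens(io.StringIO(src).readline)
          if t.string.strip()]

print(tokens)  # ['def', 'add', '(', 'x', ',', 'y', ')', ':', 'return', 'x', '+', 'y']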

Learning with tokens

NLP approaches use a dictionary \(d\) to convert textual representations to numerical vectors: the dictionary maps each of the \(|D|\) unique tokens in a corpus to a unique number (index).

Given \(d\), we can construct a function \(\epsilon: D \rightarrow \mathbb{R}^k\), where:

  • \(D\) is a sequence / set of tokens representing our input
  • \(k\) is the number of dimensions

that will convert a token to a vector of \(k\) dimensions. The \(\epsilon\) function is called the embedding function.

One-hot encoding

The simplest \(\epsilon\) uses a vector \(V\) of size \(|D|\), i.e., equal to the vocabulary size, to encode each token.

All elements of \(V\) are 0, except the one at the index position of token \(t\), which is 1.

This is called the one-hot encoding.

Q: What is the biggest problem of one-hot encodings?

A: The representation is extremely sparse, so lots of space is wasted.
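
A minimal sketch of one-hot encoding; the vocabulary below is the token set of the add example above:

import numpy as np

vocab = ['def', 'add', '(', ')', ':', ',', 'return', '+', 'x', 'y']
index = {token: i for i, token in enumerate(vocab)}   # the dictionary d

def one_hot(token):
    v = np.zeros(len(vocab))   # |V| equals the vocabulary size
    v[index[token]] = 1        # only the token's own position is non-zero
    return v

print(one_hot('return'))  # [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]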

Token embeddings

More useful embedding \(\epsilon\) functions produce vectors of size \(s\) where \(s << |D|\).

Useful embedding functions can be learned by:

  • Pre-training with external models, such as Word2Vec or FastText
  • Training them jointly with the model they serve

D: What are the advantages of each approach?

A: With joint training, the embeddings fit the task perfectly, but cannot be reused for other tasks. Pre-training is task-agnostic, so the embeddings can be reused across tasks.

Embedding tokens: Word2Vec

Word2Vec builds embeddings by solving an auxiliary (fake) prediction problem:

  • Assume that \(t_i\) is the token we are currently processing
  • Tokens \(t_{i-2}\), \(t_{i-1}\), \(t_{i+1}\), \(t_{i+2}\) are its context
  • We give the context to a feedforward layer \(V\) and we ask it to predict \(t_i\) (CBOW)
    • Alternatively, we give \(t_i\) as input and ask it to predict the context (skip-gram)
  • We keep the learned weights of \(V\) as the embedding of \(t_i\)
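
A minimal sketch of training Word2Vec on tokenized code with the gensim library (parameter names assume gensim ≥ 4.0; the toy corpus is illustrative):

from gensim.models import Word2Vec

# each "sentence" is one tokenized code document
corpus = [['def', 'add', 'x', 'y', 'return', 'x', 'y'],
          ['def', 'sub', 'x', 'y', 'return', 'x', 'y']]

# sg=0 selects CBOW, sg=1 selects skip-gram; window=2 gives the +-2 token context
model = Word2Vec(corpus, vector_size=32, window=2, sg=0, min_count=1)

print(model.wv['add'].shape)  # (32,)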

Embedding tokens: FastText

Word2Vec is a closed vocabulary model: Tokens not encountered during training cannot be embedded during prediction.

FastText solves this by training on character n-grams (sub-word units). The idea is that unseen tokens can be represented by combining the learned n-gram embeddings.

In a recent study, FastText embeddings were found to better represent developers’ opinions on identifier similarity [1].
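
A sketch of the open-vocabulary behaviour, again with gensim (toy corpus and identifiers are illustrative):

from gensim.models import FastText

corpus = [['get', 'user', 'name'], ['set', 'user', 'name'], ['get', 'user', 'id']]

# min_n / max_n set the character n-gram lengths used for sub-word embeddings
model = FastText(corpus, vector_size=32, window=2, min_count=1, min_n=3, max_n=5)

# 'username' never appears in the corpus, but its character n-grams do
print(model.wv['username'].shape)  # (32,)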

Documents as vectors

A document is an abstract term we use to represent a sequence / bag of tokens.

In NLP, a document can be a sentence, a paragraph, or an entire text.

In ML4SE, a document is usually a function/method or an expression.

From token embeddings to document embeddings

Q: Given a source code document and an embedding model for its tokens, how can we obtain an embedding for the document?

Is sequence information important?

  • No: apply one of the max, sum, or avg functions to the embedded token bag (see the sketch below)
  • Yes: use a model that learns token sequences
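
A minimal pooling sketch; the token embeddings here are random placeholders standing in for the output of any embedding function:

import numpy as np

# assume each of the 7 document tokens has already been embedded (k = 32)
token_embeddings = np.random.rand(7, 32)

doc_mean = token_embeddings.mean(axis=0)   # average pooling
doc_max  = token_embeddings.max(axis=0)    # max pooling
doc_sum  = token_embeddings.sum(axis=0)    # sum pooling

print(doc_mean.shape)  # (32,) -- one vector for the whole document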

Naturalness

Software is a form of human communication; software corpora have similar statistical properties to natural language corpora; and these properties can be exploited to build better software engineering tools.
— Hindle et al.

The repetitive nature of code is called naturalness [2].

The seminal paper of the same name found that source code is less surprising than natural language text, so we can get good sequence compression.

The naturalness of language

Q: Fill in the following sentence: Legendary Dutch football player Johan …

Most probably: Cruyff

  • Probabilistic view: the probabilities are very skewed
    • \(P(Cruyff | dutch, football, legendary)\) = 0.90
    • \(P(Gousios | dutch, football, legendary)\) = 0.0001
  • With less context?
    • \(P(Cruyff | dutch)\) = 0.01
    • \(P(Cruyff | legendary)\) = 0.1

Language models

  • A language model assigns probabilities to sentences (or programs)
  • \(T\): set of possible tokens
  • \(S \subset T^*\): set of possible “sentences” (after applying a grammar)
  • \(p\): a probability distribution over \(s \in S\), such that

\(\forall s \in S \, [\, 0 \le p(s) \le 1 \,] \land \sum_{s \in S} p(s) = 1\)

  • Given a corpus \(C\), we can estimate \(p\)

Language models in practice

To exploit naturalness in code documents, we can build sequence embeddings using techniques such as:

  • n-grams
  • Sequence models

For applications where order does not matter, we can also represent documents with TF-IDF / BM25

N-grams: a simple language model

  • Probability of a sentence: \(s = a_1 a_2 a_3 … a_n\)

\(P(s) = P(a_1)\, P(a_2 | a_1)\, P(a_3 | a_1 a_2) \cdots P(a_n | a_1 \dots a_{n-1})\)

  • The N-gram model only cares about the last \(N - 1\) tokens. For a 4-gram model:

\(P(a_i | a_1 \dots a_{i-1}) \cong P(a_i | a_{i-3}\, a_{i-2}\, a_{i-1})\)

  • Estimation by counting over the training corpus

\(P(a_4|a_1a_2a_3) = \frac{count(a_1a_2a_3a_4)}{count(a_1a_2a_3)}\)
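
A minimal counting-based estimate over a whitespace-tokenized toy corpus (both the corpus and the query are illustrative):

from collections import Counter

corpus = 'if x is None : return 0 if y is None : return 0'.split()

four_grams  = Counter(zip(corpus, corpus[1:], corpus[2:], corpus[3:]))
three_grams = Counter(zip(corpus, corpus[1:], corpus[2:]))

# P(return | is None :) = count(is None : return) / count(is None :)
p = four_grams[('is', 'None', ':', 'return')] / three_grams[('is', 'None', ':')]
print(p)  # 1.0 -- in this toy corpus, 'is None :' is always followed by 'return'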

n-gram language model examples

n-gram models attempt to compute the probability \(P(token | history)\), i.e., the probability that a sequence of tokens (the history) is followed by a given token.

Given the document if True: print("🐔", "🥚") and ignoring punctuation tokens, we get the following n-grams for various \(n\).

n         n-grams
1-gram    if, True, print, 🐔, 🥚
bigram    if True, True print, print 🐔, 🐔 🥚
trigram   if True print, True print 🐔, print 🐔 🥚

Sequence modeling

  • n-grams have limited context

    • 3-grams over a vocabulary of 1k tokens already lead to \(10^9\) possible combinations
  • RNNs and their variations (notably LSTMs) are the basic building block for learning from sequences.

  • RNNs take a sequence of \(n\) (embedded) tokens and learn an embedding of the sequence

What does an RNN look like?

The basic RNN

How does an RNN learn?

RNN learning from a sequence

We feed the RNN one token at a time and ask it to predict the next token.

RNN: the equations

  • In \(i\) steps, an RNN takes in \([w_1, w_2, \dots, w_i]\) and outputs \([y_1, y_2, \dots, y_i]\)

  • Output vectors represent probability distribution over tokens

  • \(P(w_{i+1} | w_1 w_2 \dots w_i) \cong \mathrm{softmax}(y_i)[w_{i+1}]\)

RNN sequence embeddings

RNN hidden state weights

The final hidden state of the RNN, after it has consumed the input, is used as the input sequence embedding. Sequence embeddings can be fed to downstream models (see the sketch after this list), such as:

  • Feedforward layers, e.g. to predict labels
  • Other RNNs, to learn deeper sequence embeddings
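
A minimal PyTorch sketch of this setup (all sizes and names are illustrative): an LSTM consumes a sequence of embedded tokens, its final hidden state serves as the sequence embedding, and a linear layer over the per-step outputs gives next-token probabilities.

import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 64, 128

embed = nn.Embedding(vocab_size, embed_dim)
rnn   = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
head  = nn.Linear(hidden_dim, vocab_size)

tokens = torch.randint(0, vocab_size, (1, 10))       # one sequence of 10 token ids
outputs, (h_n, c_n) = rnn(embed(tokens))             # outputs: (1, 10, 128)

seq_embedding = h_n[-1]                              # (1, 128): the sequence embedding
next_token_probs = torch.softmax(head(outputs), -1)  # (1, 10, vocab_size)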

Sequence to Sequence models

A particularly interesting class of models translates between sequences:

  • Source code summarization
  • Auto completion
  • Type prediction
  • Language translation
  • Code search

How does seq2seq translation work?

  • A (stack of) RNN(s) encodes the input
  • The sequence embedding is fed to:
    • a (stack of) RNN(s) that predicts the output sequence (the decoder)
    • a parallel feed-forward net that learns to weight the sequence embedding cells based on the output (attention)
  • The loss is computed by comparing the ground truth with the decoder output
  • At prediction time, beam search is employed to find the most probable sequences, given the top-n predictions at each step

How does attention work?

Seq2Seq with attention

Image credit: [3]
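
A simplified dot-product attention sketch (not the exact formulation of any particular paper): the current decoder state scores every encoder state, the scores are normalized into weights, and the weighted sum becomes the context the decoder attends to.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

encoder_states = np.random.rand(10, 128)   # one state per input token
decoder_state  = np.random.rand(128)       # current decoder state

scores  = encoder_states @ decoder_state   # one score per input token
weights = softmax(scores)                  # attention weights, sum to 1
context = weights @ encoder_states         # weighted sum of encoder states

print(weights.shape, context.shape)        # (10,) (128,)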

Common problems

The following problems are particular to the application of NLP techniques to source code.

The vocabulary problem

There are only two hard things in Computer Science: cache invalidation and naming things.
— Phil Karlton
  • C standard library: strcpy
  • RabbitMQ: UnexpectedSuccessException
  • PyTorch: Transformer
  • Spring framework: SimpleBeanFactoryAwareAspectInstanceFactory

While software is more repetitive than natural language, software vocabulary is much more diverse, as developers can create new identifiers at will [4].

Vocabulary statistics

In 14k (deduplicated [5]) repositories, Babii et al. [4] found:

  • 11.5M unique tokens
  • After removing comments and whitespace: 8M
  • After splitting words (FooBar -> Foo, Bar): 975k

It is impossible for NLP models (e.g., n-grams, RNNs) to learn the long tail of the distribution, e.g., tokens appearing only once or twice in the corpus.

Vocabulary size grows with input

Vocabulary size grows linearly with input size

How to solve the vocabulary problem?

  • Top-N tokens: Keep the top x% of unique tokens, so that y% of all token occurrences are covered

    • x and y are usually ~20% and ~90%, respectively
    • The remaining tokens are replaced with an <unk> token
  • Byte Pair Encoding: Iteratively replace the most frequent pairs of adjacent symbols with new single symbols (see the sketch below)

    • Effective both in compressing the vocabulary and in allowing models to predict new combinations of the learned sub-word units (open vocabulary models)
    • In use by large NLP models, e.g. GPT-3 and BERT
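
A minimal BPE sketch: identifiers are split into characters, and the most frequent adjacent pair of symbols is repeatedly merged. The corpus, frequencies, and number of merges are illustrative.

from collections import Counter

def most_frequent_pair(words):
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(pair, words):
    a, b = pair
    return {word.replace(f'{a} {b}', f'{a}{b}'): freq
            for word, freq in words.items()}

# identifiers as space-separated characters, with their corpus frequencies
words = {'g e t N a m e': 10, 's e t N a m e': 8, 'g e t I d': 6}

for _ in range(5):   # learn 5 merges
    words = merge(most_frequent_pair(words), words)

print(words)  # frequently co-occurring characters are now merged into longer symbols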

How effective is BPE?

An exploration of applying BPE to source code corpora is presented in [6]. The authors find:

  • A 10k BPE vocabulary can represent an initial vocabulary of 11M items
  • 97% of the BPE tokens appear > 1000 times, which leads to better embeddings
  • The size of the corpus increases a bit, as more tokens are needed to represent infrequent words

The long range dependency problem

Language models can only capture local dependencies:

  • n-grams: they can only capture dependencies within the previous \(N - 1\) tokens
  • RNNs: unrolling over 512-token sequences is very slow to train, and such models tend to overfit

How can we model the code below so that we capture the type bug?

class Foo:
  def __init__(self):
    self.y: int = 0

  # 512 tokens later...
  def bar(self, x: str):
    self.y = x  # bug: y is int

Transformers to the rescue?

Transformers [7] are a recent innovation that transformed the NLP field. Transformers replace the time-step training of RNNs with fully parallel, feed-forward, attention-based layers.

The transformer architecture

Read more about transformers in the excellent article The illustrated transformer.

Modern language models

BERT: A self-supervised seq2seq, transformer-based model [8].

  • Bidirectional encoding: Uses both preceding and succeeding tokens as context
  • Self-supervision: Masks out random tokens and uses them as the labels to predict
input:   Have a  nice  day. Even though it rains.
masked:  Have a <MASK> day. Even <MASK> it rains.
label1: nice, label2: though

State-of-the-art results; hundreds of variations exist.
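
A sketch of masked-token prediction with a pretrained BERT, assuming the HuggingFace transformers library and the bert-base-uncased checkpoint are available:

from transformers import pipeline

fill_mask = pipeline('fill-mask', model='bert-base-uncased')

# the model ranks candidate tokens for the [MASK] position
for prediction in fill_mask('Have a [MASK] day.')[:3]:
    print(prediction['token_str'], round(prediction['score'], 3))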

Bibliography

[1]
Y. Wainakh, M. Rauf, and M. Pradel, “Evaluating semantic representations of source code,” arXiv preprint arXiv:1910.05177, 2019.
[2]
A. Hindle, E. T. Barr, Z. Su, M. Gabel, and P. Devanbu, “On the naturalness of software,” in Proceedings of the 34th International Conference on Software Engineering (ICSE), 2012, pp. 837–847.
[3]
T. Shi, Y. Keneshloo, N. Ramakrishnan, and C. K. Reddy, “Neural abstractive text summarization with sequence-to-sequence models,” arXiv preprint arXiv:1812.02303, 2018.
[4]
H. Babii, A. Janes, and R. Robbes, “Modeling vocabulary for big code machine learning,” arXiv preprint arXiv:1904.01873, 2019.
[5]
M. Allamanis, “The adverse effects of code duplication in machine learning models of code,” in Proceedings of the 2019 ACM SIGPLAN international symposium on new ideas, new paradigms, and reflections on programming and software, 2019, pp. 143–153.
[6]
R.-M. Karampatsis, H. Babii, R. Robbes, C. Sutton, and A. Janes, “Big code != big vocabulary: Open-vocabulary models for source code,” arXiv preprint arXiv:2003.07914, 2020.
[7]
A. Vaswani et al., “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
[8]
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[9]
V. J. Hellendoorn and P. Devanbu, “Are deep neural networks the best choice for modeling source code?” in Proceedings of the 2017 11th joint meeting on foundations of software engineering, 2017, pp. 763–773.