Representing source code

Source code representation is the task of converting source code into data that can be consumed by ML models. The goal is to minimize the loss of information during the conversion.

Forms of representation

  • Text-only representations
    • Treating source code as natural language text
  • Structured representations
    • Using the AST and / or semantic information
  • Combined representations

Tokens

Similar to other NLP applications, textual representations treat source code as a sequence of tokens.

def add(x, y):
  return x + y

Tokens comprise:

  • language syntax: def, (, ), :, +, return
  • developer-defined identifiers: add, x, y

Tokens can be considered:

  • as a list / sequence (most frequently)
  • as a set / bag, when order is not important
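
As a concrete illustration, the sketch below uses Python's standard tokenize module to produce the token stream of the add example above (the token names printed are specific to CPython's tokenizer):

import io
import tokenize

src = "def add(x, y):\n  return x + y\n"

# Print the type and text of every token in the snippet
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))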

Learning with tokens

NLP approaches use a dictionary \(d: D \rightarrow \mathbb{N}\) to convert textual representations to numerical vectors. The dictionary is built by mapping each of the \(|D|\) distinct tokens in a corpus to a unique number.

Given \(d\), we can construct a function \(\epsilon: D \rightarrow \mathbb{R}^k\), where:

  • \(D\) is a sequence / set of tokens representing our input
  • \(k\) is the number of dimensions

that will convert a token to a vector of \(k\) dimensions. The \(\epsilon\) function is called the embedding function.

One-hot encoding

The simplest \(\epsilon\) uses a vector \(V\) of size \(|D|\), i.e., equal to the vocabulary size, to encode tokens.

To encode a token \(t\), all elements of \(V\) are set to 0, except the element at the index position of \(t\), which is set to 1.

This is called the one-hot encoding.
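
A minimal sketch of one-hot encoding in Python; the vocabulary below is just the set of tokens of the add example, and the names d and one_hot are illustrative:

vocabulary = ["def", "add", "(", "x", ",", "y", ")", ":", "return", "+"]
d = {token: index for index, token in enumerate(vocabulary)}  # the dictionary d

def one_hot(token):
    # A vector of size |D|: all zeros, except a 1 at the token's index
    vector = [0] * len(d)
    vector[d[token]] = 1
    return vector

print(one_hot("add"))  # [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]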

Q: What is the biggest problem of one-hot encodings?

A: The representation is extremely sparse, so lots of space is wasted.

Token embeddings

More useful embedding functions \(\epsilon\) produce dense vectors of size \(s\), where \(s \ll |D|\).

Useful embedding functions can be learned by:

  • Pre-training with external models, such as Word2Vec or FastText
  • Training them jointly with the model they serve

Q: What are the advantages of each approach?

A: Jointly trained embeddings fit the task perfectly, but cannot be reused for other tasks. Pre-trained embeddings are task agnostic, so they may be reused across different tasks.

Embedding tokens: Word2Vec

Word2Vec builds embeddings by solving a fake problem:

  • Assume that \(t_i\) is the token we are currently processing
  • Tokens \(t_{i-2}\), \(t_{i-1}\), \(t_{i+1}\), \(t_{i+2}\) are its context
  • We give the context to a feedforward layer \(V\) and ask it to predict \(t_i\) (CBOW)
    • Alternatively, we give \(t_i\) as input and ask it to predict the context (skip-gram)
  • We keep the learned weights of \(V\) as the embedding of \(t_i\)
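
As a sketch, assuming the gensim library and a corpus of pre-tokenized documents (the toy corpus and parameters below are illustrative), a CBOW Word2Vec model could be trained like this:

from gensim.models import Word2Vec

# Each "sentence" is the token list of one source code document
corpus = [["def", "add", "(", "x", ",", "y", ")", ":", "return", "x", "+", "y"]]

# sg=0 selects CBOW; sg=1 would select skip-gram
model = Word2Vec(corpus, vector_size=100, window=2, min_count=1, sg=0)

print(model.wv["add"].shape)  # (100,): the embedding of the token "add"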

Embedding tokens: FastText

Word2Vec is a closed vocabulary model: Tokens not encountered during training cannot be embedded during prediction.

FastText solves this by training on character n-grams. The idea is that unseen tokens can be represented by combining the learned n-gram embeddings.

In a recent study, FastText embeddings were found to better represent developers' opinions on identifier similarity [1].
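
A sketch with gensim's FastText (again an assumed library, with an illustrative toy corpus), showing that an unseen token still receives an embedding built from its character n-grams:

from gensim.models import FastText

corpus = [["def", "add", "(", "x", ",", "y", ")", ":", "return", "x", "+", "y"]]
model = FastText(corpus, vector_size=100, window=2, min_count=1)

# "addAll" was never seen during training, but its character n-grams
# overlap with those of "add", so FastText can still embed it
print(model.wv["addAll"].shape)  # (100,)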

Documents as vectors

A document is an abstract term for a sequence / bag of tokens.

In NLP, a document can be a sentence, a paragraph, or a whole text.

In ML4SE, a document is usually a function/method or an expression.

From token embeddings to document embeddings

Q: Given a source code document and an embedding model for its tokens, how can we obtain an embedding for the document?

A: It depends on whether sequence information is important:

  • No: use one of the max, sum, or avg functions on the embedded token bag (see the sketch below)
  • Yes: use a model that learns token sequences
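
A minimal sketch of the order-agnostic case, assuming numpy and randomly generated token embeddings in place of a real embedding model:

import numpy as np

# One k-dimensional embedding per token of the document (here: 12 tokens, k = 100)
token_embeddings = np.random.rand(12, 100)

doc_avg = token_embeddings.mean(axis=0)  # average pooling
doc_max = token_embeddings.max(axis=0)   # max pooling
doc_sum = token_embeddings.sum(axis=0)   # sum pooling

print(doc_avg.shape)  # (100,)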

Naturalness

Software is a form of human communication; software corpora have similar statistical properties to natural language corpora; and these properties can be exploited to build better software engineering tools.
— Hindle et al.

The repetitive nature of code is called naturalness [2].

The seminal paper of the same name found that source code is even more repetitive than natural language text, so we can get good compression of its token sequences.

Language models

To exploit naturalness in code documents, we can build sequence embeddings using techniques such as:

  • n-grams
  • Sequence models

For applications where order does not matter, we can also represent documents with TF-IDF / BM25.
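
For the order-agnostic case, a sketch using scikit-learn's TfidfVectorizer (the library choice and the toy corpus are assumptions):

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "def add(x, y): return x + y",
    "def sub(x, y): return x - y",
]

# Each document becomes a sparse vector of TF-IDF weights over the corpus vocabulary
vectorizer = TfidfVectorizer(token_pattern=r"\S+")
X = vectorizer.fit_transform(documents)
print(X.shape)  # (2, vocabulary size)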

n-gram language models

n-gram models attempt to compute the probability \(P(token|history)\), i.e., the probability that a sequence of tokens (the history) is followed by a given token.

Given the document if True: print("🐔", "🥚") and ignoring punctuation tokens, we get the following for various \(n\).

n         n-grams
1-gram    if, True, print, 🐔, 🥚
bigram    if True, True print, print 🐔, 🐔 🥚
trigram   if True print, True print 🐔, print 🐔 🥚
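
A sketch of extracting n-grams and estimating \(P(token|history)\) from counts, in plain Python with no external libraries:

from collections import Counter

tokens = ["if", "True", "print", "🐔", "🥚"]

def ngrams(tokens, n):
    # Slide a window of size n over the token sequence
    return list(zip(*(tokens[i:] for i in range(n))))

unigrams = Counter(ngrams(tokens, 1))
bigrams = Counter(ngrams(tokens, 2))

# Maximum-likelihood estimate of P("True" | "if") = count(if True) / count(if)
print(bigrams[("if", "True")] / unigrams[("if",)])  # 1.0 in this tiny example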

Sequence modeling

RNNs and their variations (notably LSTMs) are the basic building block for learning from sequences.

RNNs take a sequence of \(n\) (embedded) tokens and learn an embedding of the sequence.

What does an RNN look like?

The basic RNN

How does an RNN learn?

RNN learning from a sequence

We feed the RNN one token at a time and ask it to predict the next token.

RNN sequence embeddings

RNN hidden state weights

The hidden states of the RNN are used as the input sequence embedding. These can be fed to downstream models, such as:

  • Feedforward layers, e.g. to predict labels
  • Other RNNs, to learn deeper sequence embeddings
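
A minimal PyTorch sketch (an assumed framework; all sizes are illustrative) in which the final hidden state of an LSTM serves as the sequence embedding and feeds a classification layer:

import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim, num_labels = 1000, 64, 128, 2

embedding = nn.Embedding(vocab_size, embed_dim)    # token embeddings
rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
classifier = nn.Linear(hidden_dim, num_labels)     # downstream feedforward layer

token_ids = torch.randint(0, vocab_size, (1, 12))  # one document of 12 tokens
embedded = embedding(token_ids)                    # shape: (1, 12, embed_dim)
_, (hidden, _) = rnn(embedded)                     # hidden: (1, 1, hidden_dim)

sequence_embedding = hidden[-1]                    # shape: (1, hidden_dim)
logits = classifier(sequence_embedding)            # shape: (1, num_labels)
print(logits.shape)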

Sequence to Sequence models

A particularly interesting class of models translates between sequences:

  • Source code summarization
  • Auto completion
  • Type prediction
  • Language translation
  • Code search

How does seq2seq translation work?

  • A (stack of) RNN(s) encodes the input
  • The sequence embedding is fed to:
    • a (stack of) RNN(s) that predicts the output sequence
    • a parallel feed-forward net that learns to weight the sequence embedding cells based on the output (attention)
  • The loss is computed by comparing the ground truth with the decoder output
  • At prediction time, beam search is employed to find the most probable output sequences, keeping the top-n predictions at each step (see the sketch below)
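
A simplified beam search sketch; the step function is hypothetical and stands in for one decoder step that returns candidate next tokens with their log-probabilities:

def beam_search(step, start_token, beam_width=3, max_len=10):
    # Each beam is a (sequence, cumulative log-probability) pair
    beams = [([start_token], 0.0)]
    for _ in range(max_len):
        candidates = []
        for sequence, score in beams:
            # step(sequence) -> [(next_token, log_prob), ...]  (hypothetical decoder call)
            for token, log_prob in step(sequence):
                candidates.append((sequence + [token], score + log_prob))
        # Keep only the beam_width most probable sequences so far
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams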

How does attention work?

Seq2Seq with attention

Image credit: [3]

Common problems

The following problems are particular to the application of NLP techniques to source code.

The vocabulary problem

There are only two hard things in Computer Science: cache invalidation and naming things.
— Phil Karlton

  • C standard library: strcpy
  • RabbitMQ: UnexpectedSuccessException
  • PyTorch: Transformer
  • Spring framework: SimpleBeanFactoryAwareAspectInstanceFactory

While software is more repetitive than natural language, software vocabulary is much more diverse, as developers can create new identifiers at will [4].

Vocabulary statistics

In 14k (deduplicated [5]) repositories, Babii et al. [4] found:

  • 11.5M unique tokens
  • After removing comments and whitespace: 8M
  • After splitting words (FooBar -> Foo, Bar): 975k

It is impossible for NLP models (e.g., n-grams, RNNs) to learn from such long tail distributions, e.g., tokens that appear only 1-2 times in the corpus.
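
The word splitting mentioned above (FooBar -> Foo, Bar) is one common way to shrink the vocabulary; below is a minimal regex-based sketch for camelCase and snake_case identifiers:

import re

def split_identifier(identifier):
    # Split on underscores, then on case transitions and digit boundaries
    words = []
    for part in identifier.split("_"):
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return words

print(split_identifier("FooBar"))    # ['Foo', 'Bar']
print(split_identifier("strcpy"))    # ['strcpy']
print(split_identifier("SimpleBeanFactoryAwareAspectInstanceFactory"))
# ['Simple', 'Bean', 'Factory', 'Aware', 'Aspect', 'Instance', 'Factory']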

Vocabulary size grows with input