Source code representation is the task of converting source code into data that can be consumed by ML models, while minimizing the loss of information during the conversion.
Kinds of representations:

- Text-only representations
- Structured representations
- Graph representations, using data/control flow information
- Combined representations
Similar to other NLP applications, textual representations treat source code as a sequence of tokens. Consider the following function:
```python
def add(x, y):
    return x + y
```
Tokens comprise keywords and syntax (`def`, `(`, `)`, `:`, `return`, `+`) as well as identifiers (`add`, `x`, `y`).
Tokens can be considered:
NLP approaches use a dictionary \(d: D \rightarrow \mathbb{N}\) to convert textual representations to numerical values. A dictionary is built by mapping each of the \(|D|\) unique tokens in a corpus to a unique index.
Given \(d\), we can construct a function \(\epsilon: D \rightarrow \mathbb{R}^k\) that converts a token to a vector of \(k\) dimensions. \(\epsilon\) is called the embedding function.
The simplest \(\epsilon\) uses a vector \(V\) of size \(|D|\), i.e., equal to the vocabulary size, to encode a token \(t\): all elements of \(V\) are 0, except the one at the index position of \(t\), which is 1. This is called one-hot encoding.
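A minimal sketch of one-hot encoding (the toy vocabulary below is illustrative):

```python
import numpy as np

vocab = ["def", "add", "(", "x", ",", "y", ")", ":", "return", "+"]
index = {token: i for i, token in enumerate(vocab)}   # the dictionary d

def one_hot(token):
    """Embed a token as a |D|-dimensional vector that is 1 at the token's index, 0 elsewhere."""
    v = np.zeros(len(vocab))
    v[index[token]] = 1.0
    return v

print(one_hot("return"))   # [0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
```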
Q: What is the biggest problem of one-hot encodings?
A: The representation is extremely sparse, so lots of space is wasted.
More useful embedding functions \(\epsilon\) produce dense vectors of size \(s\), where \(s \ll |D|\).
Useful embedding functions can be learned by:

- training them jointly with the downstream task, or
- pre-training them on a generic task and reusing them.
Q: What are the advantages of each approach?

A: In joint training, the embeddings fit the task perfectly, but cannot be reused for other tasks. Pre-training is task agnostic, so the embeddings may be reused across tasks.
Word2Vec builds embeddings by solving a fake (proxy) problem: predicting a token from its surrounding context (CBOW), or the context from a token (skip-gram).
Word2Vec is a closed vocabulary model: Tokens not encountered during training cannot be embedded during prediction.
FastText solves this by training on character n-grams. The idea is that unseen tokens can be represented by combining the learned n-gram embeddings.
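A small sketch of the difference, assuming the gensim library is available (the corpus and hyper-parameters are made up for illustration):

```python
from gensim.models import Word2Vec, FastText

corpus = [["def", "add", "x", "y", "return", "x", "y"],
          ["def", "sub", "x", "y", "return", "x", "y"]]

w2v = Word2Vec(sentences=corpus, vector_size=32, window=2, min_count=1)
ft = FastText(sentences=corpus, vector_size=32, window=2, min_count=1, min_n=3, max_n=5)

print(w2v.wv["add"].shape)      # (32,) -- embedding for a token seen during training
print(ft.wv["subtract"].shape)  # (32,) -- unseen token, embedded from its character n-grams
# w2v.wv["subtract"] would raise a KeyError: Word2Vec has a closed vocabulary
```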
In a recent study, FastText embeddings were found to better represent developers' opinions on identifier similarity [1].
A document is an abstract term we use for a sequence / bag of tokens. In NLP, a document can be a sentence, a paragraph, or an entire text.
In ML4SE, a document is usually a function/method or an expression.
Q: Given a source code document and an embedding model for its tokens, how can we obtain an embedding for the document?

A: It depends on whether sequence information is important. If it is not, we can apply max, sum, or avg pooling functions over the embedded token bag (see the sketch below).
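A minimal sketch of such order-free pooling, using made-up token embeddings:

```python
import numpy as np

# Hypothetical token embeddings for a small document (rows = tokens, columns = dimensions).
token_embeddings = np.random.rand(9, 32)   # e.g., the 9 tokens of the `add` function, k = 32

doc_avg = token_embeddings.mean(axis=0)    # average pooling
doc_max = token_embeddings.max(axis=0)     # max pooling
doc_sum = token_embeddings.sum(axis=0)     # sum pooling
print(doc_avg.shape)                       # (32,) -- one fixed-size vector per document
```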
"Software is a form of human communication; software corpora have similar statistical properties to natural language corpora; and these properties can be exploited to build better software engineering tools."
The repetitive nature of code is called naturalness [2]. The homonymous seminal paper found that source code is less surprising (more predictable) than natural language text, so we can achieve good sequence compression.
Q: Fill in the following sentence: Legendary Dutch football player Johan …
Most probably: Cruyff
A language model assigns a probability to every possible token sequence \(s\) in a set \(S\):

\(\forall s \in S \; [\, 0 \le p(s) \le 1 \,] \;\land\; \sum_{s \in S} p(s) = 1\)
To exploit naturalness in code documents, we can build sequence embeddings using techniques such as n-gram models, RNNs, and transformers (discussed below).
For applications where order does not matter, we can also represent documents with TF-IDF / BM25
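A small sketch assuming a recent scikit-learn is available; the documents are illustrative token strings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["def add x y return x y",        # documents as whitespace-separated token strings
        "def sub x y return x y",
        "class Foo pass"]
vectorizer = TfidfVectorizer(token_pattern=r"\S+")   # treat any non-space run as a token
X = vectorizer.fit_transform(docs)                   # sparse matrix: documents x vocabulary
print(X.shape, vectorizer.get_feature_names_out())
```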
A language model decomposes the probability of a token sequence with the chain rule; an n-gram model approximates each factor with a fixed-length history (here, the 3 preceding tokens) and estimates it by counting:

\(P(s) = P(a_1)\,P(a_2 \mid a_1)\,P(a_3 \mid a_1 a_2) \cdots P(a_n \mid a_1 \ldots a_{n-1})\)

\(P(a_i \mid a_1 \ldots a_{i-1}) \approx P(a_i \mid a_{i-3}\, a_{i-2}\, a_{i-1})\)

\(P(a_4 \mid a_1 a_2 a_3) = \frac{count(a_1 a_2 a_3 a_4)}{count(a_1 a_2 a_3)}\)
n-gram models attempt to compute the probability \(P(token \mid history)\), i.e., the probability that a sequence of tokens (the history) is followed by a given token.
Given the document `if True: print("🐔", "🥚")` and ignoring syntax tokens, we get the following n-grams for various \(n\).
| n | n-grams |
|---|---|
| 1-gram | `if`, `True`, `print`, `🐔`, `🥚` |
| bigram | `if True`, `True print`, `print 🐔`, `🐔 🥚` |
| trigram | `if True print`, `True print 🐔`, `print 🐔 🥚` |
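A minimal sketch of extracting n-grams and estimating \(P(token \mid history)\) by counting, on a made-up toy corpus:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["if", "True", "print", "🐔", "🥚"]
print(ngrams(tokens, 2))   # [('if', 'True'), ('True', 'print'), ('print', '🐔'), ('🐔', '🥚')]

# Maximum-likelihood estimate: P(token | history) = count(history + token) / count(history)
corpus = [tokens, ["if", "False", "print", "🥚"]]
tri = Counter(g for doc in corpus for g in ngrams(doc, 3))
bi = Counter(g for doc in corpus for g in ngrams(doc, 2))
history = ("True", "print")
p = tri[history + ("🐔",)] / bi[history]
print(p)                   # 1.0 in this toy corpus
```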
n-grams have limited context: they condition only on the previous \(n-1\) tokens.
RNNs and their variations (notably LSTMs) are the basic building block for learning from sequences.
RNNs take a sequence of \(n\) (embedded) tokens and learn an embedding of the sequence.
```python
import numpy as np

class RNN:
    def __init__(self, input_size, hidden_size, output_size):
        # weight matrices (small random values) and the initial hidden state
        self.W_xh = np.random.randn(hidden_size, input_size) * 0.01
        self.W_hh = np.random.randn(hidden_size, hidden_size) * 0.01
        self.W_hy = np.random.randn(output_size, hidden_size) * 0.01
        self.h = np.zeros(hidden_size)

    def forward(self, x):
        # update the hidden state
        self.h = np.tanh(np.dot(self.W_hh, self.h) + np.dot(self.W_xh, x))
        # compute the output vector
        y = np.dot(self.W_hy, self.h)
        return y
```
We feed the RNN one token at a time and ask it to predict the next token.
In \(i\) steps, an RNN takes in \([w_1, w_2, \ldots, w_i]\) and outputs \([y_1, y_2, \ldots, y_i]\).

Output vectors represent a probability distribution over tokens:

\(P(w_{i+1} \mid w_1 w_2 \ldots w_i) \approx \mathrm{softmax}(y_i)[w_{i+1}]\)
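A usage sketch building on the `RNN` class above, with one-hot token vectors; the tiny vocabulary is illustrative, and the untrained weights give arbitrary predictions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Hypothetical toy vocabulary with one-hot token vectors as the input embeddings.
vocab = ["def", "add", "(", "x", ",", "y", ")", ":", "return", "+"]
one_hot = {t: np.eye(len(vocab))[i] for i, t in enumerate(vocab)}

rnn = RNN(input_size=len(vocab), hidden_size=16, output_size=len(vocab))
for token in ["def", "add", "(", "x"]:
    y = rnn.forward(one_hot[token])   # scores for the next token, given the history so far
    probs = softmax(y)                # P(next token | history)
print(vocab[int(np.argmax(probs))])   # most likely next token (arbitrary, since untrained)
```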
The hidden state of the RNN is used as the embedding of the input sequence. This embedding can be fed to downstream models, such as:
A particularly interesting class of models translates between sequences: seq2seq models.

[Figure: How does seq2seq translation work? Image credit: [3]]
The following problems are particular to applying NLP techniques to source code.
There are only two hard things in Computer Science: cache invalidation and naming things.
Examples: `strcpy`, `UnexpectedSuccessException`, `Transformer`, `SimpleBeanFactoryAwareAspectInstanceFactory`
“While software is more repetitive than natural language, software vocabulary is much more diverse as developers can create new identifiers at will.”[4]
In 14k (deduplicated [5]) repositories, Babii et al. [4] found:
- After splitting identifiers (`FooBar` -> `Foo`, `Bar`): 975k unique tokens

It is impossible for NLP models (e.g., n-grams, RNNs) to learn long-tail distributions, e.g., tokens that appear only 1-2 times in the corpus.
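A minimal sketch of identifier splitting (the regular expression handles camelCase and snake_case; it is illustrative, not the exact procedure of [4]):

```python
import re

def split_identifier(name):
    """Split an identifier into subtokens on camelCase and snake_case boundaries."""
    parts = re.split(r"_|(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", name)
    return [p.lower() for p in parts if p]

print(split_identifier("FooBar"))   # ['foo', 'bar']
print(split_identifier("SimpleBeanFactoryAwareAspectInstanceFactory"))
# ['simple', 'bean', 'factory', 'aware', 'aspect', 'instance', 'factory']
```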
Two common mitigations:

- Top-N tokens: keep the top x% of unique tokens, so that y% of all tokens are covered; replace the rest with an `<unk>` token.
- Byte Pair Encoding (BPE): recursively replace the most frequent symbol pairs (bigrams) with single new symbols (see the sketch below).
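A toy sketch of learning BPE merges over a made-up token list (real implementations add end-of-word markers and other details):

```python
from collections import Counter

def bpe_merges(corpus_tokens, num_merges):
    """Learn BPE merge rules from a list of tokens (a toy sketch, not a full tokenizer)."""
    # Represent each token as a tuple of symbols (initially single characters).
    vocab = Counter(tuple(token) for token in corpus_tokens)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each token occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent pair
        merges.append(best)
        merged = best[0] + best[1]
        # Rewrite every token, replacing occurrences of the pair with the merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = bpe_merges(["get_name", "set_name", "get_value", "set_value"], num_merges=6)
print(merges)   # the learned merge rules, most frequent pairs first
```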
An exploration of applying BPE to source code corpora is presented in [6]. The authors find:
Language models can only capture local dependencies:
How can we model the code below so that we capture the type bug?
```python
class Foo:
    def __init__(self):
        self.y: int = 0

    # 512 tokens later...
    def bar(self, x: str):
        self.y = x  # bug: y is int
```
Transformers [7] are a recent innovation that transformed the NLP field. Transformers replace the time-step training of RNNs with fully parallel, feed-forward, attention-based layers.
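A minimal sketch of the core operation, scaled dot-product self-attention, on made-up vectors (a real transformer adds learned projections, multiple heads, and positional information):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: every position attends to every other position in parallel."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])                    # similarity between queries and keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over each row
    return weights @ V                                         # weighted sum of values

# 5 token embeddings of dimension 8; here Q, K, V are simply the embeddings themselves.
X = np.random.rand(5, 8)
print(attention(X, X, X).shape)    # (5, 8) -- one contextualized vector per token
```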
Read more about transformers in the excellent article The illustrated transformer.
BERT: A self-supervised, transformer-based (encoder-only) model [8].
input: Have a nice day. Even though it rains.
masked: Have a <MASK> day. Even <MASK> it rains.
label1: nice, label2: though
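A minimal sketch of constructing masked-LM training examples (the masking rate is an assumption; BERT additionally uses subword tokens and sometimes replaces tokens with random words):

```python
import random

MASK, MASK_RATE = "<MASK>", 0.15   # assumed masking rate, in the spirit of BERT-style pre-training

def mask_tokens(tokens, rate=MASK_RATE, seed=None):
    """Create a masked-LM training example: the masked input and the labels to predict."""
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < rate:
            masked.append(MASK)
            labels.append(token)   # the model must recover the original token
        else:
            masked.append(token)
    return masked, labels

tokens = "Have a nice day . Even though it rains .".split()
print(mask_tokens(tokens, seed=42))
```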
State-of-the-art results; hundreds of variations exist.
The course contents are copyrighted (c) 2018,2019,2020 - onwards by TU Delft and their respective authors and licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.