Representing source code as ASTs/paths

The two papers to read for this session are:

The code2vec approach, proposed by Uri Alon and colleagues [1].
The code2seq approach, the “next version” of code2vec, also proposed by Uri Alon and colleagues [2].

We suggest reading the papers in this order.

More on code2vec

As a complement, you can watch Uri Alon’s presentation at POPL 2019:

You may also look at the source code implementation of code2vec. Sometimes it is easier to understand the model by reading its actual code! In particular, the Code2VecModel#_create_keras_model method is the one you should care about: https://github.com/tech-srl/code2vec/blob/master/keras_model.py#L37
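
Before opening the repository, it may help to see the core aggregation step in isolation. The following is a minimal pure-Python sketch of the attention mechanism described in the code2vec paper: each path-context vector goes through a dense layer with tanh, is scored against a learned attention vector, and the code vector is the attention-weighted sum. It is not the repository's Keras code; the function names, the tiny dimensions, and the random weights are assumptions made purely for illustration.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def code_vector(contexts, W, a):
    """Aggregate path-context vectors into a single code vector.

    contexts: one vector per path-context (in the paper, the concatenation of
              source-token, path, and target-token embeddings).
    W:        weights of the dense layer (hypothetical values here).
    a:        the attention vector (hypothetical values here).
    """
    # Dense layer with tanh: one "combined" vector per path-context.
    combined = [[math.tanh(v) for v in matvec(W, c)] for c in contexts]
    # Attention score of each combined vector against the attention vector.
    scores = [sum(ci * ai for ci, ai in zip(c, a)) for c in combined]
    weights = softmax(scores)
    # The code vector is the attention-weighted sum of the combined vectors.
    vec = [sum(w * c[j] for w, c in zip(weights, combined)) for j in range(len(a))]
    return vec, weights


# Toy input: 3 path-contexts of size 6, projected to a code vector of size 4.
rng = random.Random(0)
d_in, d_out, n_contexts = 6, 4, 3
contexts = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(n_contexts)]
W = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_out)]
a = [rng.uniform(-1, 1) for _ in range(d_out)]

vec, weights = code_vector(contexts, W, a)
print("attention weights:", weights)
```

In the real model, the embeddings, the dense layer, and the attention vector are all learned jointly with the method-name prediction task; here they are random, so only the shapes and the flow of data are meaningful.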

More on code2seq

As a complement, you can watch Ashley Kelgard’s summary of the code2seq paper: https://www.youtube.com/watch?v=qO0E-otkJFI

If you are not familiar with seq2seq models, check out this very clear explanation by Jay Alammar: http://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/

Reflection

After reading the two papers, please prepare discussion points for our meeting (e.g., things you found interesting or curious, or things you did not fully understand).

Moreover, reflect on the following points. We will all share our perspectives during the lecture.

Regarding code2vec:

In the presentation above, Uri Alon shows a graph of learning effort versus analysis cost and concludes that AST paths are a good trade-off: they are largely language-independent, they capture some aspects of the programming language itself, and they are cheaper to compute than CFGs or DFGs. What do you think of that? Do you see any other alternatives?
In your opinion, what is the importance of the “arrows” in the path representation?
The network is trained end to end, i.e., together with the downstream task (in this case, predicting method names). Do you think the generated code vectors are too coupled to the downstream task, or can we reuse them for other tasks?
Regarding out-of-vocabulary (OoV) predictions, the paper states that “We thus believe that our efforts would be better spent on the prediction of complete names.” What do you think of that?
The “Large corpus, simple model” is an interesting section of that paper. Do you agree with this?
Interestingly, not having the tokens reduces the performance significantly. Can you think of reasons why that happens?
Code2vec randomly selects K paths from the code snippet. Do you think this is a good decision, or would a systematic way of extracting these paths work better?
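
To make the questions about arrows and random path selection more concrete, here is a small sketch of leaf-to-leaf path extraction. It assumes Python's built-in ast module as the parser and treats Name, Constant, and arg nodes as terminals; the papers work on other languages with their own extractors, so the node names and helper functions below are illustrative assumptions only.

```python
import ast
import random
from itertools import combinations

UP, DOWN = "\u2191", "\u2193"  # the up and down arrows of the path notation
TERMINALS = (ast.Name, ast.Constant, ast.arg)  # assumption: what counts as a leaf


def terminal_value(node):
    """Return the token attached to a terminal node."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.arg):
        return node.arg
    return repr(node.value)  # ast.Constant


def root_to_leaf_paths(node, prefix=()):
    """Yield, for every terminal, the tuple of nodes from the root down to it."""
    prefix = prefix + (node,)
    if isinstance(node, TERMINALS):
        yield prefix
        return
    for child in ast.iter_child_nodes(node):
        yield from root_to_leaf_paths(child, prefix)


def path_contexts(source):
    """Return (left token, path, right token) for every pair of terminals."""
    leaves = list(root_to_leaf_paths(ast.parse(source)))
    contexts = []
    for left, right in combinations(leaves, 2):
        # Length of the shared prefix; its last node is the lowest common ancestor.
        k = 0
        while k < min(len(left), len(right)) and left[k] is right[k]:
            k += 1
        up = [type(n).__name__ for n in reversed(left[k:])]
        top = type(left[k - 1]).__name__
        down = [type(n).__name__ for n in right[k:]]
        path = UP.join(up + [top]) + "".join(DOWN + d for d in down)
        contexts.append((terminal_value(left[-1]), path, terminal_value(right[-1])))
    return contexts


def sample_contexts(contexts, k, seed=None):
    """Randomly keep at most k path-contexts."""
    rng = random.Random(seed)
    return rng.sample(contexts, min(k, len(contexts)))


contexts = path_contexts("def add(a, b):\n    return a + b\n")
sampled = sample_contexts(contexts, k=3, seed=0)
```

For this two-line function, the two operands of a + b produce the context (a, Name↑BinOp↓Name, b). Try removing the arrows from the path string, or replacing sample_contexts with a deterministic selection rule, and consider what information is lost or gained.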

Regarding code2seq:

What are the differences and similarities with the code2vec paper? What are the differences from standard seq2seq models? (Note that in code2vec, the code vector is created through a “simple” dense layer; code2seq uses LSTMs. What kind of differences would you expect in this new architecture?)
In contrast to code2vec, code2seq uses subtokens to encode the terminal nodes. According to the ablation study, this seems to make an important difference. Why do you think the authors changed their minds from one paper to the other?
Transformers (the coolest thing now!) performed worse than code2seq. However, in that comparison the code was treated only as a flat input sequence. Would it be a good idea to try Transformers with AST paths? Can you see code2seq using more modern architectures?
This paper uses code2vec as one of the baselines. Interestingly, code2vec, which had an F1 score of 58-59% in its original paper, now reaches only a 42% F1 score. Why do you think that happened? What does that tell us?
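
When thinking about subtokens and about the F1 numbers, it can help to compute the metric by hand once. The sketch below splits identifiers into lowercase subtokens and computes precision, recall, and F1 over subtokens for a single predicted name. The splitting regex and both function names are assumptions for illustration, and the papers report scores aggregated over a whole test set, so treat this per-name version as a simplification.

```python
import re
from collections import Counter


def subtokens(name):
    """Split an identifier on underscores and camelCase boundaries, lowercased."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def subtoken_f1(predicted, actual):
    """F1 over subtokens for one prediction, ignoring case and order."""
    pred = Counter(subtokens(predicted))
    gold = Counter(subtokens(actual))
    true_positives = sum((pred & gold).values())
    if true_positives == 0:
        return 0.0
    precision = true_positives / sum(pred.values())
    recall = true_positives / sum(gold.values())
    return 2 * precision * recall / (precision + recall)


print(subtokens("parseHTTPResponse"))
print(subtoken_f1("getName", "getFileName"))
```

Predicting getName for a method called getFileName gets both predicted subtokens right (precision 1) but misses one of the three true subtokens (recall 2/3), giving an F1 of about 0.8. An exact-match metric would score the same prediction as simply wrong.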

Can I read more about code2vec and code2seq?

While not compulsory for this lecture, other interesting papers along the same lines are [3] and [4]. Feel free to read them!

References

[1] U. Alon, M. Zilberstein, O. Levy, and E. Yahav, “code2vec: Learning distributed representations of code,” Proc. ACM Program. Lang., vol. 3, no. POPL, pp. 40:1–40:29, Jan. 2019.

[2] U. Alon, S. Brody, O. Levy, and E. Yahav, “code2seq: Generating sequences from structured representations of code,” arXiv preprint arXiv:1808.01400, 2018.

[3] T. Hoang, H. J. Kang, J. Lawall, and D. Lo, “CC2Vec: Distributed representations of code changes,” arXiv preprint arXiv:2003.05620, 2020.

[4] R. Compton, E. Frank, P. Patros, and A. Koay, “Embedding Java classes with code2vec: Improvements from variable obfuscation [accepted],” in MSR 2020, 2020.