The two papers to read for this session are:

  1. The code2vec approach, proposed by Uri Alon and colleagues [1].
  2. The code2seq approach, the “next version” of code2vec, also proposed by Uri Alon and colleagues [2].

We suggest you read these papers in this order.

More on code2vec

As a complement, you can watch Uri Alon’s presentation at POPL 2019:

You may also look at the source code implementation of code2vec. Sometimes it is easier to understand the model by reading its actual code! In particular, the Code2VecModel#_create_keras_model method is the one you should care about:
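Before opening the repository, it may help to see the aggregation step from the code2vec paper [1] in isolation. The following is a minimal NumPy sketch, not the code from `_create_keras_model`: each path-context is the concatenation of three embeddings (source token, path, target token), it is passed through a `tanh` layer, and the resulting vectors are averaged using softmax attention weights. The function name `code_vector`, the vocabulary sizes, and the embedding dimension are hypothetical toy values, and all weights are random rather than trained.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over a 1-D array."""
    shifted = scores - np.max(scores)
    exp = np.exp(shifted)
    return exp / exp.sum()


def code_vector(source_ids, path_ids, target_ids, token_emb, path_emb, W, a):
    """Aggregate n path-contexts into one code vector (toy sketch)."""
    # Each context is [source token ; path ; target token], shape (n, 3d).
    contexts = np.concatenate(
        [token_emb[source_ids], path_emb[path_ids], token_emb[target_ids]],
        axis=1,
    )
    # Fully connected layer with tanh: combined context vectors, shape (n, d).
    combined = np.tanh(contexts @ W)
    # One attention score per context, normalised with softmax, shape (n,).
    weights = softmax(combined @ a)
    # The code vector is the attention-weighted sum of the combined vectors.
    return weights @ combined, weights


# Toy setup (assumed sizes): 50 tokens, 30 paths, embedding dimension 8.
rng = np.random.default_rng(0)
d = 8
token_emb = rng.normal(size=(50, d))
path_emb = rng.normal(size=(30, d))
W = rng.normal(size=(3 * d, d))
a = rng.normal(size=d)

# Three path-contexts for a hypothetical method, given as vocabulary indices.
vec, weights = code_vector(
    np.array([1, 4, 7]), np.array([2, 2, 9]), np.array([3, 5, 8]),
    token_emb, path_emb, W, a,
)
print(vec.shape, weights)
```

The attention weights sum to one, so the code vector is a convex combination of the combined context vectors. In the trained model these weights indicate which path-contexts contributed most to a prediction.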

More on code2seq

As a complement, you can watch Ashley Kelgard’s summary of the code2seq paper (the video does not allow embedding).

If you are not familiar with seq2seq models, check out this very clear explanation by Jay Alammar:


After reading the two papers, please prepare discussion points for our meeting (e.g., things you found interesting or curious, things you did not fully understand).

Moreover, reflect on the following points. We will all share our perspectives during the lecture.

Regarding code2vec:

Regarding code2seq:

Can I read more about code2vec and code2seq?

While not compulsory for this lecture, other interesting papers along the same lines are [3] and [4]. Feel free to read them!


[1] U. Alon, M. Zilberstein, O. Levy, and E. Yahav, “code2vec: Learning distributed representations of code,” Proc. ACM Program. Lang., vol. 3, no. POPL, pp. 40:1–40:29, Jan. 2019.
[2] U. Alon, S. Brody, O. Levy, and E. Yahav, “code2seq: Generating sequences from structured representations of code,” arXiv preprint arXiv:1808.01400, 2018.
[3] T. Hoang, H. J. Kang, J. Lawall, and D. Lo, “CC2Vec: Distributed representations of code changes,” arXiv preprint arXiv:2003.05620, 2020.
[4] R. Compton, E. Frank, P. Patros, and A. Koay, “Embedding Java classes with code2vec: Improvements from variable obfuscation [accepted],” in MSR 2020, 2020.