IN4334 - Machine Learning for Software Engineering

General description

Software repositories archive valuable software engineering data, such as source code, execution traces, historical code changes, mailing lists, and bug reports. This data contains a wealth of information about a project’s status and history. By applying data science to software repositories, researchers can gain an empirically grounded understanding of software development practices, and practitioners can better manage, maintain, and evolve complex software projects.

In recent years, advances in Machine Learning and AI technologies, as demonstrated by the successful application of Deep Neural Networks (DNNs) in various domains, have not gone unnoticed in the field of Software Engineering. Researchers have applied DNNs to problems such as automated program repair, code summarization, code completion, and code structure representation.

IN4334 is a seminar course that aims to give students a deep understanding of, and hands-on experience with, how deep neural networks and NLP techniques are used to represent knowledge and solve existing SE problems in novel ways.

Learning Objectives

This course will enable students to:

Understand and analyze related work in the area of machine learning for software engineering
Apply appropriate methods to represent source code for consumption by ML algorithms
Evaluate the applicability of results in the software analytics literature on practical problems
Analyze data from software repositories and extract new insights

Before you decide to join the course

We welcome all students who are willing to dive into cutting-edge ML research.
We expect students to have experience with ML already. We also expect them to have some interest in program analysis. This is not an introductory course.
Due to resource constraints, the course is strictly limited to 40 students. If we have more than 40 registrations, we will apply CV-based selection. Students with demonstrable experience (e.g., GitHub repos) in ML will be preferred.

Course Organization

5 ECTS: This means that each student needs to devote at least 140 hours of study to this course. Given that the course runs over a period of 7 weeks, the workload is around 20 hours per week.
Reading sessions: The course consists of lectures and reading sessions. Attendance is not required, but it is strongly encouraged. In each meeting, we will discuss 1 paper (presented either by the lecturer or by a team) in terms of techniques, insights, and impact.
Homework: Before each lecture, you must read and prepare questions about the papers that will be discussed during the lecture. You can find the list of papers to read at the beginning of each week’s lecture. You must also watch/read any material pointed to by the syllabus.

Please keep in mind that you are attending this course on a voluntary basis. Coming to the classroom unprepared will not be the best use of your time, so do your homework first!

Lecturers: The course is supervised by Georgios Gousios and Maliheh Izadi, who are responsible for the content and the assignments. Several people will provide help on topics within their expertise.
Course deliverables: To finish the course you will need to:
- Give a 30 min presentation of 1-2 papers
- Apply ML to a software engineering problem
- Give a final presentation about your project
- Peer review 2 papers from other groups
Groups: You will work in groups of 4-5 people. You are free to choose your group partners; if this is not possible, the course lecturers will assign you to a group.
Labs: Unsupervised, optional. 4 hours per week, designed to give you a place and time to work together. No feedback will be provided during lab hours by the lecturers.

The project

Lately, machine learning techniques have been successfully tailored to many software engineering problems. For instance, intelligent code completion helps developers finish their programming tasks faster and more efficiently by decreasing the typing effort, providing type-correct solutions, and enabling them to explore APIs. InCoder, UniXcoder, and Copilot are among the most recent deep learning-based solutions for an enhanced software development experience. In this project, we aim to tailor pre-trained language models for source code to solve software engineering tasks including code completion, type completion, and code summarization. Each group will fine-tune a pre-trained model for the specific task at hand. Then, you will evaluate your model on the provided test set. As for the dataset, you will use the benchmark datasets provided by CodeXGLUE, the General Language Understanding Evaluation benchmark for CODE. If you aim to use your models on more languages or data sources, you should use other publicly available datasets or scrape and preprocess the new data yourself.

You will implement different ML/DL models. You are required to use Python and, more specifically, PyTorch. Check our curated list of tutorials that might help you get started with different NLP, DL, and ML topics.
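
As an illustration of the evaluation step described above, the following is a minimal sketch and not the course's official evaluation script. It assumes that model predictions and references are plain strings (for example, completed lines of code) and computes two metrics that are often reported for code completion: exact match and character-level edit similarity. The function names `levenshtein`, `edit_similarity`, and `evaluate` are hypothetical, and the metrics actually required for your task may differ.

```python
from typing import Dict, List, Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance between two sequences (insert, delete, substitute: cost 1)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete item_a
                    current[j - 1] + 1,      # insert item_b
                    previous[j - 1] + cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


def edit_similarity(prediction: str, reference: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(prediction, reference) / longest


def evaluate(predictions: List[str], references: List[str]) -> Dict[str, float]:
    """Average exact match and edit similarity over a (hypothetical) test set."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("the test set must not be empty")
    # Assumption: surrounding whitespace is ignored when comparing strings.
    pairs = [(p.strip(), r.strip()) for p, r in zip(predictions, references)]
    exact = sum(p == r for p, r in pairs) / len(pairs)
    similarity = sum(edit_similarity(p, r) for p, r in pairs) / len(pairs)
    return {"exact_match": exact, "edit_similarity": similarity}


if __name__ == "__main__":
    # Toy data for illustration only; real inputs would come from your model.
    scores = evaluate(
        predictions=["return a + b", "x = foo(y)"],
        references=["return a + b", "x = bar(y)"],
    )
    print(scores)
```

Keeping the metric code independent of PyTorch makes it easy to unit-test and to reuse across the different tasks. Summarization-style tasks typically need other metrics.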

Required reading for week 1:

This paper, by Pradel and Sen [1]
Discussion presentation template

Date	Week	Lecture	Reading material	Lecturer
6/9	1	1	Course Introduction, How to read a paper in a group, DeepBugs	GG
9/9	1	2	Representing source code as text, Naturalness of software	GG
13/9	2	3	Large language models and alternative representations, Code2Seq	GG
16/9	2	4	Graph Neural Networks: Introduction, Learning to represent programs with graphs	MA
20/9	3	5	Code Understanding and Generation: CodeT5	MI
23/9	3	6	Code Representation: UniXcoder	MI
27/9	4	7	Code Filling: InCoder	GG / MI
30/9	4	8	Code summarization: On the Evaluation of Neural Code Summarization	GG / MI
4/10	5	9	Type prediction: Type4Py	AM / GG
7/10	5	10	Feedback session	GG / MI
11/10	6	11	Type prediction: HiTyper	AM
14/10	6	12	Reverse engineering: Learning to Find Usages of Library Functions in Optimized Binaries	AS
18/10	7	13	Software Effort Estimation: Heterogeneous Graph Neural Networks for Software Effort Estimation	EK / GG
21/10	7	14		MI / GG
28/10	8	15	Presentation day	GG / MI

Lecturers

GG: Georgios Gousios
MI: Maliheh Izadi

Guest lecturers

MA: Miltos Allamanis (Google)
AS: Anand Sawant (Endor Labs)

Assistants

AM: Amir Mir
EK: Elvan Kula

Deadlines

29/10: Submission of the final version of the report (incorporating the feedback from your peers) + blog post
30/10: Final presentations

Assessment

The course grade will be calculated as:

80% - Paper
20% - Final presentation

The final papers will be peer-reviewed by 2 other teams.

Online resources

Here are some resources for extra study, if you are interested in the field:

Software Analytics Lab’s GitHub repository provides a curated list of papers, datasets, and tools related to the application of Machine Learning for Software Engineering
Miltos Allamanis’s web page provides a great set of links to relevant literature
Michael Pradel’s course at the University of Stuttgart
Panel at MSR 2020 on ML4SE

Bibliography

[1] M. Pradel and K. Sen, “DeepBugs: A learning approach to name-based bug detection,” Proc. ACM Program. Lang., vol. 2, no. OOPSLA, pp. 147:1–147:25, Oct. 2018.

Copyright

The course contents are copyrighted (c) 2018,2019,2020 - onwards by TU Delft and their respective authors and licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.