Software repositories archive valuable software engineering data, such as source code, execution traces, historical code changes, mailing lists, and bug reports. This data contains a wealth of information about a project’s status and history. By applying data science to software repositories, researchers can gain an empirically grounded understanding of software development practices, and practitioners can better manage, maintain, and evolve complex software projects.
In recent years, advances in Machine Learning and AI technologies, as demonstrated by the successful application of Deep Neural Networks (DNNs) in various domains, have not gone unnoticed in the field of Software Engineering (SE). Researchers have applied DNNs to problems such as automated program repair, code summarization, code completion, and code structure representation.
IN4334 is a seminar course that aims to give students a deep understanding of, and hands-on experience with, how deep neural networks and NLP techniques are used to represent knowledge and solve existing SE problems in novel ways.
This course will enable students to:
The course will be very demanding. We welcome all students who are willing to work hard and dive into cutting-edge ML research.
We expect students to have significant experience with ML already. We also expect them to have some experience with program analysis. This is not an introductory course. At the very least, you should know what an RNN is, what an embedding is and what an AST is. Courses that might have helped along the way include Deep Learning and Information Retrieval.
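As a quick self-check on the last of these concepts, the following is a minimal sketch of what an AST looks like in practice. It uses Python's standard `ast` module; the `source` snippet is a hypothetical example chosen only for illustration and is not part of the course material.

```python
import ast

# Hypothetical snippet, chosen only for illustration.
source = "def add(a, b):\n    return a + b\n"

# Parse the source text into an abstract syntax tree.
tree = ast.parse(source)

# Walk the tree (breadth-first) and record the type of every node.
node_types = [type(node).__name__ for node in ast.walk(tree)]

# Collect the identifiers that are read inside the function body.
names = sorted({node.id for node in ast.walk(tree) if isinstance(node, ast.Name)})

print(node_types)
print(names)

# Pretty-print the full tree structure for inspection.
print(ast.dump(tree))
```

The node types (`FunctionDef`, `Return`, `BinOp`, and so on) are the kind of structural information that AST-based code representations build on, as opposed to treating the source as a flat sequence of tokens.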
Due to resource constraints, the course is strictly limited to 40 students. If we receive more than 40 registrations, we will apply CV-based selection. Students with demonstrable experience in ML (e.g., GitHub repos) will be preferred.
5 ECTS: This means that each student needs to devote at least 140 hours of study to this course. Given that the course runs over 7 weeks, the workload is around 20 hours a week.
Lectures: The course consists of 14 2-hour lectures. Attendance is not required, but it is strongly encouraged. In each lecture, we will discuss 2-3 papers (presented either by the lecturer or by teams) in terms of techniques, insights, and impact.
Homework: Before each lecture, you must read and prepare questions about the papers that will be discussed during the lecture. You can find the list of papers to read at the beginning of each week’s lecture.
Lecturers: The course is supervised by Georgios Gousios and Maurício Aniche, who are responsible for the content and the assignments. Several people will provide help in topics of their expertise.
Course deliverables: To finish the course you will need to:
Groups: You will work in groups of 5 persons. You are free to choose your group partners; if this is not possible, the course lecturers will assign you to a group.
Labs: Unsupervised, optional. 4 hours per week, designed to give you a place and time to work together.
During the course, you will choose a software engineering problem and propose an ML-based solution for that problem.
See the list of suggested projects (and existing papers that try to tackle them). Your job will be to either:
Replicate an existing paper: Replication is a topic much touted but seldom practiced in the software engineering community. It is, however, a core aspect of science, especially empirical science. You can download readily available data sets published together with the paper, request the data from the original authors, or apply the same techniques to different data.
Propose a completely new approach to the problem (highly appreciated!). Did you find a way to improve the existing work? Did you see the problem from a perspective that current research hasn’t explored yet? Your task will be to collect data and test your hypothesis.
You will implement different ML/DL models. You are required to use Python and, more specifically, PyTorch. Check our curated list of tutorials that might help you get started with different NLP, DL, and ML topics.
|Date|Week|Lecture|Topic|Lecturers|
|---|---|---|---|---|
|3/9|1|1|Course Introduction, How to read a paper in a group, DeepBugs(V)|GG / MFA|
|8/9|2|2|Representing source code as ASTs/paths|MFA|
|10/9|2|3|Representing source code as text|GG|
|15/9|3|4|Representing source code as graphs|MA / GG|
|17/9|3|5|Type inference|AM / GG|
|22/9|4|6|Code summarization|MI / GG|
|24/9|4|7|Code completion|MI / GG|
|29/9|5|8|Feedback session|GG / MFA|
|1/10|5|9|Encoding changes|AM / GG|
|8/10|6|11|Localizing bugs|GG / MP|
|13/10|7|12|Log recommendation|JC / MFA|
|15/10|7|13|Effort estimation|EK / GG|
|30/10|9|14|Presentation day|MFA / GG|
The course grade will be calculated as:
The final papers will be peer-reviewed by 2 other teams. The peer reviews are compulsory and will receive a pass/fail grade.
Here are some resources for extra study, if you are interested in the field:
The course contents are copyrighted (c) 2018,2019,2020 - onwards by TU Delft and their respective authors and licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.