We propose the following projects for the 2020 edition. Remember that we very much welcome new ideas!
Project 1: Log recommendation (Jean / Maurício) Log statements are fundamental for good software monitoring. In this project, we will build models that predict which variables of a given method should be logged. It is possible to extract large datasets of logged methods from open source projects.
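To get a feeling for what such a dataset could look like, here is a minimal mining sketch, assuming Python projects that use the standard logging module (a real pipeline would also need to handle other logging idioms, f-strings, custom loggers, etc.):

```python
# Sketch: mine (method, logged-variables) pairs from Python source using the
# standard library ast module. Assumes code logs via the `logging` module.
import ast

def logged_variables(source: str):
    """Yield (function name, argument names, variables used in log calls)."""
    tree = ast.parse(source)
    for func in [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]:
        logged = set()
        for node in ast.walk(func):
            is_log_call = (
                isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr in {"debug", "info", "warning", "error", "critical"}
            )
            if is_log_call:
                for arg in node.args:
                    logged |= {n.id for n in ast.walk(arg) if isinstance(n, ast.Name)}
        yield func.name, [a.arg for a in func.args.args], sorted(logged)

example = '''
def fetch(url, retries):
    logging.info("fetching %s (%d retries left)", url, retries)
    return url
'''
print(list(logged_variables(example)))
# [('fetch', ['url', 'retries'], ['retries', 'url'])]
```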
Project 2: Type inference (Amir / Georgios) Inspired by recent Graph Neural Network-based type inference models, namely the Typilus [1] and LAMBDANET [2] approaches, we aim to build a GNN model that can potentially outperform the current state-of-the-art approaches at type prediction for Python. Specifically, we will create a new graph representation for Python source code that leverages the merits of both Typilus and LAMBDANET, and that also considers function calls within a source code file. Finally, we feed this graph representation to a GNN model to predict the types of symbols such as function arguments, local variables, and function return types.
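As a starting point, a toy graph extractor for Python could combine AST child edges with edges linking occurrences of the same identifier, in the spirit of Typilus/LAMBDANET (this is not their actual construction, just a sketch to build on):

```python
# Toy graph extraction for a Python snippet: AST child edges plus
# "occurrence of the same identifier" edges. A real model would add many
# more edge types (control/data flow, NextToken, call edges, ...).
import ast
from collections import defaultdict

def build_graph(source: str):
    tree = ast.parse(source)
    nodes, edges = [], []          # nodes: (id, label); edges: (src, dst, type)
    ids = {}                       # AST node -> integer node id
    for node in ast.walk(tree):
        ids[node] = len(nodes)
        nodes.append((ids[node], type(node).__name__))
    occurrences = defaultdict(list)
    for node in ast.walk(tree):
        for child in ast.iter_child_nodes(node):
            edges.append((ids[node], ids[child], "child"))
        if isinstance(node, ast.Name):
            occurrences[node.id].append(ids[node])
    for same in occurrences.values():      # link consecutive uses of the same name
        for a, b in zip(same, same[1:]):
            edges.append((a, b, "same_identifier"))
    return nodes, edges

nodes, edges = build_graph("def add(a, b):\n    c = a + b\n    return c")
print(len(nodes), "nodes,", len(edges), "edges")
```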
Project 3: Code completion (Maliheh / Georgios) Recent work has shown that multimodal approaches can improve the performance of models in various fields of ML and SE, such as code summarization. The goal of this project is to use both natural text (comments) and code snippets to better model source code for code completion. To learn more, check out the work by Allamanis et al. [3].
Project 4: Code summarization (Maliheh / Maurício) Developers tend to customize the code they write; in particular, they define all kinds of identifiers. This causes an explosion of unique tokens in a source code dataset, a problem known as the Out-Of-Vocabulary (OOV) challenge. In this project, we aim to mitigate the OOV problem in code summarization. To learn more, check out the work by Karampatsis et al. [4].
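Karampatsis et al. address OOV with an open vocabulary based on byte-pair encoding; a simpler baseline to compare against is splitting identifiers into subtokens, sketched below:

```python
# Sketch: shrink the vocabulary by splitting identifiers into subtokens
# (camelCase / snake_case). Karampatsis et al. go further and use BPE,
# but subtoken splitting is an easy baseline to compare against.
import re

def subtokens(identifier: str):
    parts = re.split(r"[_$]+", identifier)                        # snake_case
    camel = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z0-9]+|[A-Z]+")
    return [t.lower() for p in parts for t in camel.findall(p)]

print(subtokens("parseHTTPResponse"))   # ['parse', 'http', 'response']
print(subtokens("max_retry_count"))     # ['max', 'retry', 'count']
```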
Project 5: VarMisuse in a different programming language (Maurício) Hellendoorn et al. [5] have shown that “graph sandwiches” perform better on the VarMisuse problem in Python. The goal of this project is to replicate this paper, but using Java as the programming language. Do the results also hold for Java? The authors have released their source code here.
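VarMisuse training data is typically generated synthetically by corrupting correct code. A rough sketch of such a generator, using a naive regex tokenizer over a Java snippet (a real pipeline would use a proper Java parser to get scoping right), could look like this:

```python
# Sketch: create a synthetic VarMisuse example from a Java snippet by
# replacing one occurrence of a variable with another variable from the
# same snippet. The regex "tokenizer" and keyword list are simplifications.
import random
import re

JAVA_KEYWORDS = {"int", "return", "if", "else", "for", "while", "public", "static", "void"}

def make_varmisuse(snippet: str, seed: int = 0):
    rng = random.Random(seed)
    tokens = re.findall(r"[A-Za-z_]\w*|\S", snippet)
    variables = sorted({t for t in tokens if t[0].isalpha() and t not in JAVA_KEYWORDS
                        and tokens.count(t) > 1})
    if len(variables) < 2:
        return None
    target = rng.choice(variables)                     # variable to corrupt
    wrong = rng.choice([v for v in variables if v != target])
    positions = [i for i, t in enumerate(tokens) if t == target]
    slot = rng.choice(positions[1:])                   # keep the first use intact
    tokens[slot] = wrong
    return " ".join(tokens), slot, target              # corrupted code, bug slot, fix

print(make_varmisuse("int sum = a + b ; int diff = a - b ; return sum + diff ;"))
```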
Project 6: Code refactoring (Maurício) Predict the Extract Variable refactoring: the model should predict which parts of the code need this refactoring. It is possible to automatically extract large datasets of Extract Variable refactorings for training.
Project 7: Story point estimation (Elvan / Georgios) Choetkiertikul et al. [6] proposed a Long-Deep Recurrent Neural Network (LD-RNN) model that is end-to-end trainable from raw text (i.e., issue reports) to story point estimates. However, their approach assumes that the development team stays static over time, which is often not the case in practice. The goal of this project is to improve their approach by using the state-of-the-art language model BERT and by modeling team dynamics (adding features such as team structure, developers’ skills, and workload). Are we able to outperform models that are based on text features only?
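A minimal sketch of the BERT part, treating story point estimation as text regression with the HuggingFace transformers library (the issue texts and labels below are made up; team-dynamics features would be added in a custom head):

```python
# Sketch: BERT as a regression head over issue titles/descriptions, using
# the HuggingFace `transformers` library. Team-dynamics features would be
# concatenated to the [CLS] representation in a custom head (not shown).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1, problem_type="regression"
)

issues = ["Add OAuth login to the mobile client",          # made-up examples
          "Fix typo in the settings page"]
story_points = torch.tensor([[8.0], [1.0]])                # made-up labels

batch = tokenizer(issues, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=story_points)
outputs.loss.backward()          # one backward pass (optimizer step not shown)
print(outputs.loss.item(), outputs.logits.squeeze(-1))
```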
Project 8: Task delay prediction (Elvan / Georgios) Choetkiertikul et al. [7] proposed a novel approach to leverage task dependencies (i.e., networked data) for predicting delays in software tasks. The goal of this project is to investigate which other types of (implicit) task relationships can be inferred from task information, and how task weights can be applied to improve prediction performance.
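A small sketch of what "networked" task features could look like, using networkx on a made-up dependency graph:

```python
# Sketch: derive simple network features for a task from an (explicit or
# inferred) dependency graph. Tasks, edges, and delay labels are made up.
import networkx as nx

g = nx.DiGraph()
g.add_edges_from([("T1", "T3"), ("T2", "T3"), ("T3", "T4")])   # Ti -> Tj: Tj depends on Ti
delayed = {"T1": True, "T2": False, "T3": False, "T4": False}

def network_features(task):
    preds = list(g.predecessors(task))
    return {
        "num_dependencies": len(preds),
        "num_delayed_dependencies": sum(delayed[p] for p in preds),
        "upstream_tasks": len(nx.ancestors(g, task)),   # how much work sits upstream
    }

print(network_features("T3"))   # {'num_dependencies': 2, 'num_delayed_dependencies': 1, 'upstream_tasks': 2}
print(network_features("T4"))   # {'num_dependencies': 1, 'num_delayed_dependencies': 0, 'upstream_tasks': 3}
```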
Project 9: Representation benchmark (Georgios) Given 1-2 tasks, 1-2 benchmark datasets, and various representations (GGNNs, Transformers, LSTMs, language models), which representation gives the best results? Can we standardize the tasks and the datasets so that others can benchmark their solutions against our datasets, and can we create leaderboards?
Project 10: Bug prediction (Amir / Georgios) Recently, Karampatsis et al. [8] provided the ManySStuBs4J dataset: from open-source Java projects, they mined single-statement bugs and categorized them into 16 bug templates. In this project, our aim is to predict single-statement bugs using existing ML-based techniques from the literature, and possibly to suggest a fix. The dataset can be downloaded from here.
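A first step could be to inspect the distribution of bug templates in the dataset. In the sketch below, the local file name and the "bugType" field are assumptions based on the dataset description; check the actual JSON keys after downloading:

```python
# Sketch: count the bug templates in ManySStuBs4J. The file name and the
# "bugType" key are assumptions -- verify them against the downloaded data.
import json
from collections import Counter

with open("sstubs.json") as f:          # assumed local file name
    bugs = json.load(f)

counts = Counter(b.get("bugType", "UNKNOWN") for b in bugs)
for bug_type, n in counts.most_common(16):
    print(f"{bug_type:35s} {n}")
```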
Project 11: Mining and applying bug mutators (Georgios) Mine a set of buggy commits and automatically extract bug patterns (à la reverse Getafix [9]); apply them to code to train bug-finding approaches such as DeepBugs.
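As a very first approximation of pattern mining, one can collect the changed lines from bug-fixing commits; Getafix works on ASTs and generalizes many such edits into patterns, but a line-level diff (sketched below with difflib on a made-up example) already shows the kind of before/after pairs to mine:

```python
# Sketch: extract a line-level before/after edit from a buggy/fixed pair.
# Getafix generalizes many such concrete edits into patterns; this only
# captures a single edit, as a starting point for mining.
import difflib

buggy = "if (items.size() >= 0) {\n    process(items);\n}\n"
fixed = "if (items.size() > 0) {\n    process(items);\n}\n"

removed, added = [], []
for line in difflib.unified_diff(buggy.splitlines(), fixed.splitlines(), lineterm="", n=0):
    if line.startswith("-") and not line.startswith("---"):
        removed.append(line[1:].strip())
    elif line.startswith("+") and not line.startswith("+++"):
        added.append(line[1:].strip())

print("before:", removed)   # ['if (items.size() >= 0) {']
print("after: ", added)     # ['if (items.size() > 0) {']
```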
The following projects don’t come with much related work: if you want to be bold and do real research, here is your chance!
Code translation. Translate source code from one language to another, e.g., from Java to C# or, perhaps more interesting to industry, from Cobol to Java. See Chen et al. [11] as a reference. As a possible source of data, coding websites such as Codeforces and Rosetta Code contain the same problems implemented in multiple languages.
Code completion. IDEs have been suggesting code completions for years now. However, the use of DL brings new possibilities: suggesting more contextual code completions. Researchers have shown that this is indeed a tricky task [12]. This project is about replicating (or improving upon) this paper.
Type inference. Inferring the type of a variable, especially in dynamically typed languages, can be a challenge. Hellendoorn [13] has shown that DL techniques can indeed be very precise at this task. This project aims at replicating this paper.
API usage. Developers often need help in learning how to use an API. Can we provide developers with API usage examples, given some natural-language text? Gu et al. [14] and Liu et al. [15] showed that this is possible. Your project here is to replicate one of these papers.
Mutation testing. In mutation testing, we mutate the original program and check whether the existing test cases are able to detect the injected fault. Large companies, such as Google, have been adopting mutation testing, but not without challenges [16]. In particular, given the size of their programs, the number of possible mutants is enormous; prioritizing which mutants to generate is therefore currently an open problem. Tufano et al. [17] proposed the use of deep learning to learn which mutants are really relevant, based on bug fixes.
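For intuition, here is a tiny mutant generator for Python built on the standard ast module (a single mutation operator; requires Python 3.9+ for ast.unparse). The ML task would then be to rank or filter such mutants:

```python
# Sketch: a minimal mutant generator using the standard ast module.
# Real mutation tools implement many operators; the ML angle is deciding
# which of the generated mutants are worth running.
import ast

class FlipComparison(ast.NodeTransformer):
    """Mutate `<` into `<=` (one classic mutation operator)."""
    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [ast.LtE() if isinstance(op, ast.Lt) else op for op in node.ops]
        return node

original = "def is_minor(age):\n    return age < 18\n"
tree = FlipComparison().visit(ast.parse(original))
mutant = ast.unparse(ast.fix_missing_locations(tree))
print(mutant)   # def is_minor(age):
                #     return age <= 18
```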
Logging strategies. Identifying where to log is a hard task in large systems: on the one hand, you don’t want to log too much; on the other hand, if you don’t log an important part of the code, you might miss the information needed to debug a crash. Researchers have empirically studied how developers decide where to log [18], and have proposed supervised ML techniques to suggest improvements to log lines [19] [20]. In this project, you will study whether NLP-based approaches provide better results.
Anomaly detection in logs. Under construction.
Log reduction. Modern software systems generate large amounts of runtime information, which developers need to examine in order to identify the causes of failures when those happen. With this project, you will build a tool that, given a log and a label (e.g., pass/fail), learns a model that automatically identifies the important lines in an input log.
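One possible weakly-supervised baseline: let every line inherit the pass/fail label of its log, train a line-level classifier, and rank the lines of a failing log by their "fail" score. A sketch with scikit-learn on made-up logs:

```python
# Sketch: weakly-supervised scoring of log lines. Each line inherits the
# label of its log (0 = pass, 1 = fail); highly ranked lines of a failing
# log are candidates for being the important ones. Logs are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

logs = [
    ("INFO starting job\nINFO connected to db\nINFO job finished", 0),      # pass
    ("INFO starting job\nERROR connection timed out\nINFO retrying", 1),    # fail
    ("INFO starting job\nINFO connected to db\nERROR disk full", 1),        # fail
]
lines = [(line, label) for text, label in logs for line in text.splitlines()]

vec = TfidfVectorizer()
X = vec.fit_transform([l for l, _ in lines])
clf = LogisticRegression().fit(X, [y for _, y in lines])

failing_log = "INFO starting job\nERROR connection timed out\nINFO retrying".splitlines()
scores = clf.predict_proba(vec.transform(failing_log))[:, 1]
for line, score in sorted(zip(failing_log, scores), key=lambda p: -p[1]):
    print(f"{score:.2f}  {line}")
```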
Code refactoring. Maintaining (bad) source code is not an easy task. And, although industry has widely adopted linters, these have a well-known problem: a high number of false positives [21]. We conjecture that ML-based techniques will be able to provide more useful refactoring recommendations to developers. In this task, you will train ML models to recommend (or maybe even automatically apply) refactorings. See the RefactoringMiner tool, which might help you in collecting real-world refactorings.
Flaky tests. Flaky tests are tests that present non-deterministic behavior (i.e., tests that sometimes pass and sometimes fail). Mark Harman, Facebook’s senior scientist, mentioned that flaky tests are an important problem at Facebook. Google says that 1.5% of their 4.2M tests present flaky behavior at some point [22]. Researchers have been empirically investigating the problem: Luo et al. [23] noticed that async wait, concurrency, and test order dependency are the most common causes of flakiness, and Lam et al. [24] recently developed iDFlakies, a tool that aims at identifying tests that are flaky due to their execution order. We ask: can the use of ML help us identify flaky tests?
Tagging algorithms. When solving coding challenges, e.g., the ones from Codeforces, you have to choose a strategy: will you apply dynamic programming? Will you apply brute force? Does it involve probabilities? Labeling a piece of code with such tags might be really useful for education. Or, somewhat related: given the textual description of a problem, can we suggest solution strategies? The current and closest (non-ML) related work on this topic aims at inferring algorithmic complexity from Java bytecode [25].
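A simple baseline for the tagging task is multi-label classification over the problem statements; the sketch below uses scikit-learn with made-up statements and tags:

```python
# Sketch: multi-label tagging of problem statements with a linear baseline.
# Statements and tags are made up; a real dataset could be scraped from
# Codeforces, which exposes problem tags.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

statements = [
    "Count the number of ways to climb n stairs taking 1 or 2 steps.",
    "Given a weighted graph, find the shortest path between two cities.",
    "What is the probability that two dice sum to seven?",
]
tags = [["dp"], ["graphs", "shortest paths"], ["probabilities"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)
model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LogisticRegression()))
model.fit(statements, Y)

pred = model.predict(["Find the minimum cost path in a weighted graph."])
print(mlb.inverse_transform(pred))
```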
Programming styles. A recent paper [26] in the PL field caused quite a stir by (very vocally) refuting the findings of a Comm. ACM highlight paper [27]. Both papers try to quantify the effect of programming language choice on the error-proneness of software. However, their approach is too coarse-grained, as we can write any style of code in any programming language. What is needed is a more fine-grained approach that, e.g., links coding styles (functional, imperative, or declarative) to bugs. With this project, you will build an automated program style detector by feeding an ML solution with code written in functional and imperative styles and letting it learn to differentiate between the two.
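One way to start is with hand-crafted style features extracted from the AST (counts of lambdas, comprehensions, loops, and so on) that are then fed to a classifier; the sketch below is for Python and only illustrates the feature extraction:

```python
# Sketch: hand-crafted "style" features from a Python snippet, to feed a
# functional-vs-imperative classifier. A learned model could instead work
# directly on tokens or ASTs.
import ast

def style_features(source: str):
    tree = ast.parse(source)
    counts = {"lambda": 0, "comprehension": 0, "map_filter": 0, "loop": 0, "assignment": 0}
    for node in ast.walk(tree):
        if isinstance(node, ast.Lambda):
            counts["lambda"] += 1
        elif isinstance(node, (ast.ListComp, ast.SetComp, ast.DictComp, ast.GeneratorExp)):
            counts["comprehension"] += 1
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name) \
                and node.func.id in {"map", "filter", "reduce"}:
            counts["map_filter"] += 1
        elif isinstance(node, (ast.For, ast.While)):
            counts["loop"] += 1
        elif isinstance(node, (ast.Assign, ast.AugAssign)):
            counts["assignment"] += 1
    return counts

functional = "squares = list(map(lambda x: x * x, range(10)))"
imperative = "squares = []\nfor x in range(10):\n    squares.append(x * x)"
print(style_features(functional))
print(style_features(imperative))
```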
The course contents are copyrighted (c) 2018,2019,2020 - onwards by TU Delft and their respective authors and licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.