We propose the following projects for the 2020 edition. Remember that we very much welcome new ideas!
Project 1: Log recommendation (Jean / Maurício) Log statements are fundamental for good software monitoring. In this project, we will build models that predict which variables of a given method should be logged. It is possible to extract large datasets of logged methods from open source projects.
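To get a feeling for what such a dataset could look like, here is a minimal mining sketch, assuming Python projects that use the standard logging module (a real pipeline would also need to handle other logging idioms, f-strings, custom loggers, etc.):

```python
# Sketch: mine (method, logged-variables) pairs from Python source using the
# standard library ast module. Assumes code logs via the `logging` module.
import ast

def logged_variables(source: str):
    """Yield (function name, argument names, variables used in log calls)."""
    tree = ast.parse(source)
    for func in [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]:
        logged = set()
        for node in ast.walk(func):
            is_log_call = (
                isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr in {"debug", "info", "warning", "error", "critical"}
            )
            if is_log_call:
                for arg in node.args:
                    logged |= {n.id for n in ast.walk(arg) if isinstance(n, ast.Name)}
        yield func.name, [a.arg for a in func.args.args], sorted(logged)

example = '''
def fetch(url, retries):
    logging.info("fetching %s (%d retries left)", url, retries)
    return url
'''
print(list(logged_variables(example)))
# [('fetch', ['url', 'retries'], ['retries', 'url'])]
```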
Project 2: Type inference (Amir / Georgios) Inspired by recent Graph Neural Network-based type inference models, namely the Typilus [1] and LAMBDANET [2] approaches, we aim to build a GNN model that can potentially outperform the current state-of-the-art approaches at type prediction for Python. Specifically, we will create a new graph representation for Python source code that leverages the merits of both Typilus and LAMBDANET, and that also considers function calls within a source code file. Finally, we feed this graph representation to a GNN model to predict the types of symbols such as function arguments, local variables, and function return types.
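As a starting point, a toy graph extractor for Python could combine AST child edges with edges linking occurrences of the same identifier, in the spirit of Typilus/LAMBDANET (this is not their actual construction, just a sketch to build on):

```python
# Toy graph extraction for a Python snippet: AST child edges plus
# "occurrence of the same identifier" edges. A real model would add many
# more edge types (control/data flow, NextToken, call edges, ...).
import ast
from collections import defaultdict

def build_graph(source: str):
    tree = ast.parse(source)
    nodes, edges = [], []          # nodes: (id, label); edges: (src, dst, type)
    ids = {}                       # AST node -> integer node id
    for node in ast.walk(tree):
        ids[node] = len(nodes)
        nodes.append((ids[node], type(node).__name__))
    occurrences = defaultdict(list)
    for node in ast.walk(tree):
        for child in ast.iter_child_nodes(node):
            edges.append((ids[node], ids[child], "child"))
        if isinstance(node, ast.Name):
            occurrences[node.id].append(ids[node])
    for same in occurrences.values():      # link consecutive uses of the same name
        for a, b in zip(same, same[1:]):
            edges.append((a, b, "same_identifier"))
    return nodes, edges

nodes, edges = build_graph("def add(a, b):\n    c = a + b\n    return c")
print(len(nodes), "nodes,", len(edges), "edges")
```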
Project 3: Code completion (Maliheh / Georgios) Recent work has shown that multimodal approaches can improve the performance of models in various fields of ML and SE, such as code summarization. The goal of this project is to use both natural text (comments) and code snippets to better model source code for code completion. To learn more, check out the work by Allamanis et al. [3].
Project 4: Code summarization (Maliheh / Maurício) Developers tend to customize the code they write; in particular, they define all kinds of identifiers. This causes an explosion of unique tokens in a source code dataset, a problem known as the Out-Of-Vocabulary (OOV) challenge. In this project, we aim to mitigate the OOV problem in code summarization. To learn more, check out the work by Karampatsis et al. [4].
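Karampatsis et al. address OOV with an open vocabulary based on byte-pair encoding; a simpler baseline to compare against is splitting identifiers into subtokens, sketched below:

```python
# Sketch: shrink the vocabulary by splitting identifiers into subtokens
# (camelCase / snake_case). Karampatsis et al. go further and use BPE,
# but subtoken splitting is an easy baseline to compare against.
import re

def subtokens(identifier: str):
    parts = re.split(r"[_$]+", identifier)                        # snake_case
    camel = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z0-9]+|[A-Z]+")
    return [t.lower() for p in parts for t in camel.findall(p)]

print(subtokens("parseHTTPResponse"))   # ['parse', 'http', 'response']
print(subtokens("max_retry_count"))     # ['max', 'retry', 'count']
```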
Project 5: VarMisuse in a different programming language (Maurício) Hellendoorn et al. [5] have shown that “graph sandwiches” perform better on the VarMisuse problem in Python. The goal of this project is to replicate this paper, but using Java as the programming language. Do the results also hold for Java? The authors have released their source code here.
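VarMisuse training data is typically generated synthetically by corrupting correct code. A rough sketch of such a generator, using a naive regex tokenizer over a Java snippet (a real pipeline would use a proper Java parser to get scoping right), could look like this:

```python
# Sketch: create a synthetic VarMisuse example from a Java snippet by
# replacing one occurrence of a variable with another variable from the
# same snippet. The regex "tokenizer" and keyword list are simplifications.
import random
import re

JAVA_KEYWORDS = {"int", "return", "if", "else", "for", "while", "public", "static", "void"}

def make_varmisuse(snippet: str, seed: int = 0):
    rng = random.Random(seed)
    tokens = re.findall(r"[A-Za-z_]\w*|\S", snippet)
    variables = sorted({t for t in tokens if t[0].isalpha() and t not in JAVA_KEYWORDS
                        and tokens.count(t) > 1})
    if len(variables) < 2:
        return None
    target = rng.choice(variables)                     # variable to corrupt
    wrong = rng.choice([v for v in variables if v != target])
    positions = [i for i, t in enumerate(tokens) if t == target]
    slot = rng.choice(positions[1:])                   # keep the first use intact
    tokens[slot] = wrong
    return " ".join(tokens), slot, target              # corrupted code, bug slot, fix

print(make_varmisuse("int sum = a + b ; int diff = a - b ; return sum + diff ;"))
```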
Project 6: Code refactoring (Maurício) Predict the Extract Variable refactoring: the model should predict which parts of the code need this refactoring. It is possible to automatically extract large datasets of Extract Variable refactorings for training.
Project 7: Story point estimation (Elvan / Georgios) Choetkiertikul et al. [6] proposed a Long-Deep Recurrent Neural Network (LD-RNN) model that is end-to-end trainable from raw text (i.e., issue reports) to story point estimates. However, their approach assumes that the development team stays static over time, which is often not the case in practice. The goal of this project is to improve their approach by using the state-of-the-art language model BERT and by modeling team dynamics (adding features such as team structure, developers’ skills, and workload). Are we able to outperform models that are based on text features only?
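A minimal sketch of the BERT part, treating story point estimation as text regression with the HuggingFace transformers library (the issue texts and labels below are made up; team-dynamics features would be added in a custom head):

```python
# Sketch: BERT as a regression head over issue titles/descriptions, using
# the HuggingFace `transformers` library. Team-dynamics features would be
# concatenated to the [CLS] representation in a custom head (not shown).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1, problem_type="regression"
)

issues = ["Add OAuth login to the mobile client",          # made-up examples
          "Fix typo in the settings page"]
story_points = torch.tensor([[8.0], [1.0]])                # made-up labels

batch = tokenizer(issues, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=story_points)
outputs.loss.backward()          # one backward pass (optimizer step not shown)
print(outputs.loss.item(), outputs.logits.squeeze(-1))
```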
Project 8: Task delay prediction (Elvan / Georgios) Choetkiertikul et al. [7] proposed a novel approach to leverage task dependencies (i.e., networked data) for predicting delays in software tasks. The goal of this project is to investigate which other types of (implicit) task relationships can be inferred from task information, and how task weights can be applied to improve prediction performance.
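A small sketch of what "networked" task features could look like, using networkx on a made-up dependency graph:

```python
# Sketch: derive simple network features for a task from an (explicit or
# inferred) dependency graph. Tasks, edges, and delay labels are made up.
import networkx as nx

g = nx.DiGraph()
g.add_edges_from([("T1", "T3"), ("T2", "T3"), ("T3", "T4")])   # Ti -> Tj: Tj depends on Ti
delayed = {"T1": True, "T2": False, "T3": False, "T4": False}

def network_features(task):
    preds = list(g.predecessors(task))
    return {
        "num_dependencies": len(preds),
        "num_delayed_dependencies": sum(delayed[p] for p in preds),
        "upstream_tasks": len(nx.ancestors(g, task)),   # how much work sits upstream
    }

print(network_features("T3"))   # {'num_dependencies': 2, 'num_delayed_dependencies': 1, 'upstream_tasks': 2}
print(network_features("T4"))   # {'num_dependencies': 1, 'num_delayed_dependencies': 0, 'upstream_tasks': 3}
```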
Project 9: Representation benchmark (Georgios) Given 1-2 tasks, 1-2 benchmark datasets, and various representations (GGNNs, Transformers, LSTMs, language models), which representation gives the best results? Can we standardize the tasks and the datasets so that others can benchmark their solutions against our datasets, and can we create leaderboards?
Project 10: Bug prediction (Amir / Georgios) Recently, Karampatsis et al. [8] provided the ManySStuBs4J dataset: from open-source Java projects, they mined single-statement bugs and categorized them into 16 bug templates. In this project, our aim is to predict single-statement bugs using existing ML-based techniques from the literature, and possibly to suggest a fix. The dataset can be downloaded from here.
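A first step could be to inspect the distribution of bug templates in the dataset. In the sketch below, the local file name and the "bugType" field are assumptions based on the dataset description; check the actual JSON keys after downloading:

```python
# Sketch: count the bug templates in ManySStuBs4J. The file name and the
# "bugType" key are assumptions -- verify them against the downloaded data.
import json
from collections import Counter

with open("sstubs.json") as f:          # assumed local file name
    bugs = json.load(f)

counts = Counter(b.get("bugType", "UNKNOWN") for b in bugs)
for bug_type, n in counts.most_common(16):
    print(f"{bug_type:35s} {n}")
```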
Project 11: Mining and applying bug mutators (Georgios) Mine a set of buggy commits and automatically extract bug patterns (à la reverse Getafix [9]); apply them to code to train bug-finding approaches such as DeepBugs.
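As a very first approximation of pattern mining, one can collect the changed lines from bug-fixing commits; Getafix works on ASTs and generalizes many such edits into patterns, but a line-level diff (sketched below with difflib on a made-up example) already shows the kind of before/after pairs to mine:

```python
# Sketch: extract a line-level before/after edit from a buggy/fixed pair.
# Getafix generalizes many such concrete edits into patterns; this only
# captures a single edit, as a starting point for mining.
import difflib

buggy = "if (items.size() >= 0) {\n    process(items);\n}\n"
fixed = "if (items.size() > 0) {\n    process(items);\n}\n"

removed, added = [], []
for line in difflib.unified_diff(buggy.splitlines(), fixed.splitlines(), lineterm="", n=0):
    if line.startswith("-") and not line.startswith("---"):
        removed.append(line[1:].strip())
    elif line.startswith("+") and not line.startswith("+++"):
        added.append(line[1:].strip())

print("before:", removed)   # ['if (items.size() >= 0) {']
print("after: ", added)     # ['if (items.size() > 0) {']
```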
The following projects don’t come with much related work: if you want to be bold and do real research, here is your chance!
Code translation. Translate source code from one language to another, e.g., from Java to C# or, perhaps more interesting to industry, from Cobol to Java. See Chen et al. [11] as a reference. As a possible source of data, coding websites such as Codeforces and Rosetta Code contain the same problems implemented in multiple languages.
Code completion. IDEs have been suggesting code completions for years now. However, the use of DL brings new possibilities: suggesting more contextual code completions. Researchers have shown that this is indeed a tricky task [12]. This project is about replicating (or improving upon) this paper.
Type inference. Inferring the type of a variable, especially in dynamically typed languages, can be a challenge. Hellendoorn [13] has shown that DL techniques can indeed be very precise at this task. This project aims at replicating this paper.
API usage. Developers often need help in learning how to use an API. Can we provide developers with API usage examples, given some natural-language text? Gu et al. [14] and Liu et al. [15] showed that this is possible. Your project here is to replicate one of these papers.
Mutation testing. In mutation testing, we mutate the original program and check whether the existing test cases are able to detect the injected fault. Large companies, such as Google, have been adopting mutation testing, but not without challenges [16]. In particular, given the size of their programs, the number of possible mutants is enormous; prioritizing which mutants to generate is therefore currently an open problem. Tufano et al. [17] proposed the use of deep learning to learn which mutants are really relevant, based on bug fixes.
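For intuition, here is a tiny mutant generator for Python built on the standard ast module (a single mutation operator; requires Python 3.9+ for ast.unparse). The ML task would then be to rank or filter such mutants:

```python
# Sketch: a minimal mutant generator using the standard ast module.
# Real mutation tools implement many operators; the ML angle is deciding
# which of the generated mutants are worth running.
import ast

class FlipComparison(ast.NodeTransformer):
    """Mutate `<` into `<=` (one classic mutation operator)."""
    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [ast.LtE() if isinstance(op, ast.Lt) else op for op in node.ops]
        return node

original = "def is_minor(age):\n    return age < 18\n"
tree = FlipComparison().visit(ast.parse(original))
mutant = ast.unparse(ast.fix_missing_locations(tree))
print(mutant)   # def is_minor(age):
                #     return age <= 18
```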
Logging strategies. Identifying where to log is a hard task in large systems: on the one hand, you don’t want to log too much; on the other hand, if you don’t log an important part of the code, you might miss the information needed to debug a crash. Researchers have empirically studied how developers decide where to log [18], and have proposed supervised ML techniques to suggest improvements to log lines [19] [20]. In this project, you will study whether NLP-based approaches provide better results.
Anomaly detection in logs. Under construction.
Log reduction. Modern software systems generate large amounts of runtime information, which developers need to examine in order to identify the causes of failures when those happen. With this project, you will build a tool that, given a log and a label (e.g., pass/fail), learns a model that automatically identifies the important lines in an input log.
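One possible weakly-supervised baseline: let every line inherit the pass/fail label of its log, train a line-level classifier, and rank the lines of a failing log by their "fail" score. A sketch with scikit-learn on made-up logs:

```python
# Sketch: weakly-supervised scoring of log lines. Each line inherits the
# label of its log (0 = pass, 1 = fail); highly ranked lines of a failing
# log are candidates for being the important ones. Logs are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

logs = [
    ("INFO starting job\nINFO connected to db\nINFO job finished", 0),      # pass
    ("INFO starting job\nERROR connection timed out\nINFO retrying", 1),    # fail
    ("INFO starting job\nINFO connected to db\nERROR disk full", 1),        # fail
]
lines = [(line, label) for text, label in logs for line in text.splitlines()]

vec = TfidfVectorizer()
X = vec.fit_transform([l for l, _ in lines])
clf = LogisticRegression().fit(X, [y for _, y in lines])

failing_log = "INFO starting job\nERROR connection timed out\nINFO retrying".splitlines()
scores = clf.predict_proba(vec.transform(failing_log))[:, 1]
for line, score in sorted(zip(failing_log, scores), key=lambda p: -p[1]):
    print(f"{score:.2f}  {line}")
```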
Code refactoring. Maintaining (bad) source code is not an easy task. And, although industry has widely adopted linters, these have a well-known problem: a high number of false positives [21]. We conjecture that ML-based techniques will be able to provide more useful refactoring recommendations to developers. In this task, you will train ML models to recommend (or maybe even automatically apply) refactorings. See the RefactoringMiner tool, which might help you in collecting real-world refactorings.
Flaky tests. Flaky tests are tests that present non-deterministic behavior (i.e., tests that sometimes pass and sometimes fail). Mark Harman, Facebook’s senior scientist, mentioned that flaky tests are an important problem at Facebook. Google says that 1.5% of their 4.2M tests present flaky behavior at some point [22]. Researchers have been empirically investigating the problem: Luo et al. [23] noticed that async wait, concurrency, and test order dependency are the most common causes of flakiness, and Lam et al. [24] recently developed iDFlakies, a tool that aims at identifying tests that are flaky due to their execution order. We ask: can the use of ML help us identify flaky tests?
Tagging algorithms. When solving coding challenges, e.g., the ones from Codeforces, you have to choose a strategy: will you apply dynamic programming? Will you apply brute force? Does it involve probabilities? Labeling a piece of code with such tags might be really useful for education. Or, somewhat related: given the textual description of a problem, can we suggest solution strategies? The current and closest (non-ML) related work on this topic aims at inferring algorithmic complexity from Java bytecode [25].
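A simple baseline for the tagging task is multi-label classification over the problem statements; the sketch below uses scikit-learn with made-up statements and tags:

```python
# Sketch: multi-label tagging of problem statements with a linear baseline.
# Statements and tags are made up; a real dataset could be scraped from
# Codeforces, which exposes problem tags.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

statements = [
    "Count the number of ways to climb n stairs taking 1 or 2 steps.",
    "Given a weighted graph, find the shortest path between two cities.",
    "What is the probability that two dice sum to seven?",
]
tags = [["dp"], ["graphs", "shortest paths"], ["probabilities"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)
model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LogisticRegression()))
model.fit(statements, Y)

pred = model.predict(["Find the minimum cost path in a weighted graph."])
print(mlb.inverse_transform(pred))
```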
Programming styles. A recent paper [26] in the PL field caused quite a stir by (very vocally) refuting the findings of a Comm. ACM highlight paper [27]. Both papers try to quantify the effect of programming language choice on the error-proneness of software. However, their approach is too coarse-grained, as we can write any style of code in any programming language. What is needed is a more fine-grained approach that, e.g., links coding styles (functional, imperative, or declarative) to bugs. With this project, you will build an automated program style detector by feeding an ML solution with code written in functional and imperative styles and letting it learn to differentiate between the two.
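One way to start is with hand-crafted style features extracted from the AST (counts of lambdas, comprehensions, loops, and so on) that are then fed to a classifier; the sketch below is for Python and only illustrates the feature extraction:

```python
# Sketch: hand-crafted "style" features from a Python snippet, to feed a
# functional-vs-imperative classifier. A learned model could instead work
# directly on tokens or ASTs.
import ast

def style_features(source: str):
    tree = ast.parse(source)
    counts = {"lambda": 0, "comprehension": 0, "map_filter": 0, "loop": 0, "assignment": 0}
    for node in ast.walk(tree):
        if isinstance(node, ast.Lambda):
            counts["lambda"] += 1
        elif isinstance(node, (ast.ListComp, ast.SetComp, ast.DictComp, ast.GeneratorExp)):
            counts["comprehension"] += 1
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name) \
                and node.func.id in {"map", "filter", "reduce"}:
            counts["map_filter"] += 1
        elif isinstance(node, (ast.For, ast.While)):
            counts["loop"] += 1
        elif isinstance(node, (ast.Assign, ast.AugAssign)):
            counts["assignment"] += 1
    return counts

functional = "squares = list(map(lambda x: x * x, range(10)))"
imperative = "squares = []\nfor x in range(10):\n    squares.append(x * x)"
print(style_features(functional))
print(style_features(imperative))
```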
The course contents are copyrighted (c) 2018,2019,2020 - onwards by TU Delft and their respective authors and licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.