New projects (2020)

We propose the following projects for the 2020 edition. Remember that we very much welcome new ideas!

Wild ideas

These projects have no directly related work yet: if you want to be bold and do real research, here is your chance!

  • Model (real!) data-flow graphs with Heterogeneous Graph Transformers in order to solve the VarMisuse task: https://arxiv.org/pdf/2003.01332.pdf
  • Generative adversarial network (GAN)-based source code generation using textual descriptions as inputs. Alternatively, GAN-based source code summarization.
  • Making static analysis sounder by predicting missing links in call graphs.
  • Authorship information: can we learn styles at the method or block level?
  • Code style extraction: is this code functional or imperative?
  • NLP-based automated fix description: Given a stack trace, mine StackOverflow for a solution and present a fix summary.
  • Cross-lingual language modeling for source code: The goal of this project is to train one model for several programming languages while not losing per-language performance. Such a model is also beneficial for low-resource languages.
  • Abbreviation completion: Developers often type code in abbreviated form, either when naming identifiers or when writing syntax tokens. A model that can expand or complete multiple abbreviated tokens would not only save developers time while coding, but also make the code more comprehensible thanks to the expanded names. To learn more, check out the work of Han et al. [10]. A minimal baseline is sketched right after this list.
  • Most developers are proficient in one programming language and only relatively familiar with others, so they may struggle with the correct syntax and style of a less familiar language. Suppose a Java developer needs to write a Python script but keeps mixing up or forgetting the correct Python syntax; her code as-is will likely contain small errors. We therefore want a model that recommends the correct phrase in Python: given an arbitrary phrase written by a developer, it finds the closest syntactically correct phrase in the target language.
  • Personalized code recommenders: Developers have different coding styles and expertise. Current code completion systems neglect this fact, since they are mostly trained on datasets collected from a broad range of software repositories, which obscures the distinction between developers’ styles. In this project, our goal is to incorporate developer-related historical information to design personalized auto-completion models.
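
A minimal sketch for the abbreviation-completion idea above, assuming a purely heuristic baseline (no learning yet): abbreviated tokens are expanded by matching them as character subsequences against an identifier vocabulary mined from a corpus, and candidates are ranked by corpus frequency. The toy corpus and example abbreviations are made up for illustration; a learned model would replace the frequency ranking.

    from collections import Counter

    def is_subsequence(abbrev: str, word: str) -> bool:
        """Check whether all characters of `abbrev` appear in `word`, in order."""
        it = iter(word)
        return all(ch in it for ch in abbrev)

    def expand(abbrev: str, vocabulary: Counter, top_k: int = 3):
        """Rank full identifiers that could correspond to the abbreviation."""
        candidates = [w for w in vocabulary
                      if w.startswith(abbrev[:1]) and is_subsequence(abbrev, w)]
        return sorted(candidates, key=lambda w: -vocabulary[w])[:top_k]

    if __name__ == "__main__":
        # Toy vocabulary; in the project this would be mined from real repositories.
        vocab = Counter(["response", "request", "result", "request", "resource"])
        print(expand("res", vocab))   # e.g. ['request', 'response', 'result']
        print(expand("rqst", vocab))  # ['request']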

Suggested projects in 2019 (for inspiration)

  1. Code translation. Translate source code from one language to another, e.g., from Java to C# or, perhaps more interesting to industry, from COBOL to Java. See Chen et al. [11] as a reference. As a possible dataset to explore, coding websites such as Codeforces and Rosetta Code contain the same problems implemented in multiple languages.
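
     A possible first step, sketched below: assemble a parallel corpus by pairing implementations of the same task in the source and target language. The Rosetta Code-style directory layout (tasks/<task-name>/<solution files>) and the file extensions are assumptions for illustration only.

        from pathlib import Path

        def collect_parallel_pairs(root: Path, src_ext=".java", tgt_ext=".cs"):
            """Yield (source_code, target_code) pairs for tasks solved in both languages."""
            if not root.is_dir():
                return  # nothing downloaded yet
            for task_dir in sorted(p for p in root.iterdir() if p.is_dir()):
                src = list(task_dir.glob(f"*{src_ext}"))
                tgt = list(task_dir.glob(f"*{tgt_ext}"))
                if src and tgt:
                    yield src[0].read_text(), tgt[0].read_text()

        if __name__ == "__main__":
            pairs = list(collect_parallel_pairs(Path("tasks")))
            print(f"Collected {len(pairs)} Java/C# pairs for a translation model.")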

  2. Code completion. IDEs have been suggesting code completions for years now. However, the use of DL brings new possibilities: suggesting more contextual code completions. Researchers have shown that this is indeed a tricky task [12]. This project is about replicating (or improving upon) this paper.
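
     To see what the learning problem looks like, the sketch below turns a token stream into (context, next token) training examples, the usual framing for a language-model-based completer. The whitespace "tokenizer" and the context size are illustrative assumptions; a real pipeline would use a proper lexer for the target language.

        def make_completion_examples(tokens, context_size=3):
            """Turn a token stream into (context, next_token) training pairs."""
            return [(tokens[max(0, i - context_size):i], tokens[i])
                    for i in range(1, len(tokens))]

        if __name__ == "__main__":
            tokens = "reader = Files . newBufferedReader ( path )".split()
            for context, target in make_completion_examples(tokens):
                print(context, "->", target)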

  3. Type inference. Inferring the type of a variable, especially in dynamically typed languages, can be a challenge. Hellendoorn et al. [13] have shown that DL techniques can be very precise at this task. This project aims at replicating this paper.
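
     One way to bootstrap a replication is to mine supervision from code that already carries type annotations. The sketch below uses Python's ast module (ast.unparse needs Python 3.9+) to extract (parameter name, annotated type) pairs that a model could then learn to predict from names and context alone; the annotated example function is made up.

        import ast

        # Toy annotated source; in the project, mine many annotated repositories.
        SOURCE = "def send_request(url: str, timeout: int, retries: int) -> bool:\n    return True\n"

        def extract_type_examples(source: str):
            """Yield (parameter_name, annotated_type) pairs from annotated functions."""
            for node in ast.walk(ast.parse(source)):
                if isinstance(node, ast.FunctionDef):
                    for arg in node.args.args:
                        if arg.annotation is not None:
                            yield arg.arg, ast.unparse(arg.annotation)  # Python 3.9+

        if __name__ == "__main__":
            for name, type_hint in extract_type_examples(SOURCE):
                print(name, "->", type_hint)  # url -> str, timeout -> int, ...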

  4. API usage. Developers often need help in learning how to use an API. Can we suggest API usage to developers, given some natural language text? Gu et al. [14] and Liu et al. [15] showed that this is possible. Your project here is to replicate one of these papers.
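
     A simple retrieval baseline to compare a neural model against, sketched below: embed natural-language descriptions and the query with TF-IDF and return the API sequence of the closest description. The (description, API sequence) pairs are made up; Gu et al. [14] mine such pairs from API documentation at scale.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        # Toy (description, API sequence) pairs; in the project, mine them from Javadoc.
        DOCS = [
            ("read a text file line by line", "Files.newBufferedReader BufferedReader.readLine"),
            ("copy a file to another location", "Files.copy Paths.get"),
            ("parse a string into an integer", "Integer.parseInt"),
        ]

        def suggest_api(query: str, docs=DOCS, top_k: int = 1):
            """Return the API sequences whose descriptions are closest to the query."""
            matrix = TfidfVectorizer().fit_transform([d for d, _ in docs] + [query])
            similarities = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
            return [docs[i][1] for i in similarities.argsort()[::-1][:top_k]]

        if __name__ == "__main__":
            print(suggest_api("how do I read lines from a file"))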

  5. Mutation testing. In mutation testing, we mutate the original program and check whether the existing test cases are able to detect the change. Large companies, such as Google, have been adopting mutation testing, but not without challenges [16]. In particular, given the size of programs, the number of possible mutants is enormous; thus, prioritizing which mutants to generate is currently an open problem. Tufano et al. [17] proposed the use of deep learning to learn which mutants are really relevant, based on bug fixes.
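
     To make the mutant-explosion problem concrete, the sketch below generates one classic mutant class (arithmetic operator replacement, + becomes -) for Python code with the standard ast module; a learned model, in the spirit of [17], would then rank which of the many possible mutants are worth generating and running. The example function is made up, and ast.unparse needs Python 3.9+.

        import ast

        class AddToSub(ast.NodeTransformer):
            """Operator-replacement mutants: turn every `+` into `-` (all at once, for brevity)."""
            def visit_BinOp(self, node):
                self.generic_visit(node)
                if isinstance(node.op, ast.Add):
                    node.op = ast.Sub()
                return node

        def mutate(source: str) -> str:
            tree = AddToSub().visit(ast.parse(source))
            ast.fix_missing_locations(tree)
            return ast.unparse(tree)

        if __name__ == "__main__":
            original = "def total(price, tax):\n    return price + tax\n"
            print(mutate(original))  # the `+` becomes `-`; a good test suite should now fail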

  6. Logging strategies. Identifying where to log is a hard task in large systems: on the one hand, you don’t want to log too much; on the other hand, if you don’t log an important part of the code, you might miss the information needed to debug a crash. Researchers have empirically studied how developers decide where to log [18], and have proposed supervised ML techniques to suggest improvements to log lines [19], [20]. In this project, you will study whether NLP-based approaches provide better results.
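
     As a baseline to compare an NLP-based approach against, the sketch below frames "should this code block contain a log statement?" as binary classification over a few hand-crafted features (exception handling, loops, block size). The features, toy vectors, and labels are illustrative assumptions, not the features used in [18], [19], or [20].

        from sklearn.linear_model import LogisticRegression

        # Each block: [has_except_handler, has_loop, number_of_statements]; labels: 1 = logged.
        X = [[1, 0, 12], [1, 1, 20], [0, 0, 3], [0, 1, 5], [1, 0, 8], [0, 0, 2]]
        y = [1, 1, 0, 0, 1, 0]

        model = LogisticRegression().fit(X, y)
        print(model.predict([[1, 0, 10], [0, 0, 4]]))  # expect roughly [1, 0]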

  7. Anomaly detection in logs. Under construction.

  8. Log reduction. Modern software systems generate a lot of runtime information that developers need to examine in order to identify the causes of failures when they happen. In this project, you will build a tool that, given logs and their labels (e.g., pass/fail), learns a model that automatically identifies the important lines in an input log.
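
     A simple, interpretable baseline for this project, sketched below: score each log line by how much more often it appears in failing runs than in passing ones, and keep the top-scoring lines. The toy runs are made up; a real pipeline would first abstract concrete log lines into templates.

        from collections import Counter

        # Toy runs: (label, log lines); real logs would be far larger and noisier.
        RUNS = [
            ("pass", ["starting service", "request handled", "shutting down"]),
            ("pass", ["starting service", "request handled", "shutting down"]),
            ("fail", ["starting service", "connection timed out", "shutting down"]),
            ("fail", ["starting service", "connection timed out", "retrying connection", "shutting down"]),
        ]

        def line_scores(runs):
            """Difference between a line's frequency in failing and in passing runs."""
            fail, ok = Counter(), Counter()
            n_fail = sum(1 for label, _ in runs if label == "fail")
            n_pass = len(runs) - n_fail
            for label, lines in runs:
                (fail if label == "fail" else ok).update(set(lines))
            return {line: fail[line] / n_fail - ok[line] / n_pass
                    for line in set(fail) | set(ok)}

        if __name__ == "__main__":
            for line, score in sorted(line_scores(RUNS).items(), key=lambda kv: -kv[1]):
                print(f"{score:+.2f}  {line}")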

  9. Code refactoring. Maintaining (bad) source code is not an easy task. And, although industry has widely adopted linters, they have a well-known problem: the number of false positives [21]. We conjecture that ML-based techniques will be able to provide more useful refactoring recommendations to developers. In this task, you will train ML models to recommend (or maybe even automatically apply) refactorings. See the RefactoringMiner tool, which might help you in collecting real-world refactorings.
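
     One possible setup, sketched below: compute simple size and nesting metrics per function with Python's ast module as candidate features, and (in the full project) label functions with refactorings mined by RefactoringMiner. The chosen metrics are illustrative, not a validated feature set.

        import ast

        SOURCE = ("def process(items):\n"
                  "    total = 0\n"
                  "    for item in items:\n"
                  "        if item.valid:\n"
                  "            total += item.weight\n"
                  "    return total\n")

        NESTING = (ast.For, ast.While, ast.If, ast.With, ast.Try)

        def max_depth(node, depth=0):
            """Deepest nesting of loops/branches below `node`."""
            children = list(ast.iter_child_nodes(node))
            if not children:
                return depth
            return max(max_depth(c, depth + isinstance(c, NESTING)) for c in children)

        def method_metrics(source: str):
            """Yield (name, number_of_statements, max_nesting_depth) per function."""
            for func in ast.walk(ast.parse(source)):
                if isinstance(func, ast.FunctionDef):
                    statements = sum(isinstance(n, ast.stmt) for n in ast.walk(func)) - 1
                    yield func.name, statements, max_depth(func)

        if __name__ == "__main__":
            print(list(method_metrics(SOURCE)))  # candidate features for an "extract method?" model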

  10. Flaky tests. Flaky tests are tests that present non-deterministic behavior (i.e., tests that sometimes pass and sometimes fail). Mark Harman, a senior scientist at Facebook, mentioned that flaky tests are an important problem at Facebook. Google says that 1.5% of their 4.2M tests present flaky behavior at some point [22]. Researchers have been empirically investigating the problem: Luo et al. [23] observed that async waits, concurrency, and test-order dependency are the most common causes of flakiness, and Lam et al. [24] recently developed iDFlakies, a tool that aims at identifying tests that are flaky due to their execution order. We ask: can ML help us identify flaky tests?
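
     A possible starting point for a learned flakiness predictor, sketched below: extract simple lexical features loosely inspired by the causes Luo et al. [23] report (async waits, concurrency, randomness, network, shared state) and train a classifier on known-flaky vs. stable tests. The keyword list and toy Java snippets are illustrative assumptions; real labels would come from reruns (e.g., with iDFlakies).

        from sklearn.linear_model import LogisticRegression

        KEYWORDS = ["sleep", "thread", "random", "socket", "http", "static"]

        def featurize(test_source: str):
            """One binary feature per keyword: does the test body mention it?"""
            lowered = test_source.lower()
            return [int(kw in lowered) for kw in KEYWORDS]

        # Toy (test body, is_flaky) pairs; in the project, mine these from real projects.
        TESTS = [
            ("Thread.sleep(500); assertTrue(server.isUp());", 1),
            ("assertEquals(4, add(2, 2));", 0),
            ("HttpClient client = new HttpClient(); client.get(url);", 1),
            ("assertEquals(\"abc\", text.trim());", 0),
            ("int n = new Random().nextInt(); assertNotNull(cache.get(n));", 1),
            ("assertFalse(list.isEmpty());", 0),
        ]

        model = LogisticRegression().fit([featurize(s) for s, _ in TESTS],
                                         [label for _, label in TESTS])
        print(model.predict([featurize("Thread.sleep(100); assertEquals(1, queue.size());")]))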

  11. Tagging algorithms. When solving coding challenges, e.g., the ones from Codeforces, you have to choose a strategy: will you apply dynamic programming? Will you apply brute force? Does the problem involve probabilities? Labeling a piece of code with such tags might be really useful for education. Or, somewhat related: given the textual description of a problem, can we suggest solution strategies? The current and closest (non-ML) related work on this topic aims at inferring algorithmic complexity from Java bytecode [25].
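
     Framed as multi-label text classification, a simple baseline could look like the sketch below: TF-IDF features over the problem statement and one linear classifier per tag. The toy statements and tags are made up; real, already-tagged problems can be scraped from Codeforces.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.multiclass import OneVsRestClassifier
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import MultiLabelBinarizer

        STATEMENTS = [
            "find the minimum number of coins needed to reach the target sum",
            "count the number of shortest paths between two nodes in a graph",
            "compute the probability that two dice sum to seven",
            "find the longest common subsequence of two strings",
        ]
        TAGS = [["dp", "greedy"], ["graphs"], ["probabilities"], ["dp", "strings"]]

        binarizer = MultiLabelBinarizer()
        y = binarizer.fit_transform(TAGS)
        model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LogisticRegression()))
        model.fit(STATEMENTS, y)

        query = ["maximum sum over all subsequences of the array"]
        print(binarizer.inverse_transform(model.predict(query)))  # may be empty on such a tiny toy set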

  12. Programming styles. A recent paper [26] in the PL field caused quite a stir by (very vocally) refuting the findings of a Comm. ACM highlight paper [27]. Both papers try to quantify the effect of programming language choice on the error-proneness of software. However, their approach is too coarse-grained, as we can write any style of code in any programming language. What is needed is a more fine-grained approach that, for instance, links code styles (functional, imperative, or declarative) to bugs. In this project, you will build an automated program style detector by feeding an ML solution with code in functional and imperative styles and letting it learn to differentiate between the two.
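
     A possible first experiment, sketched below for Python rather than the languages studied in [26] and [27]: count style-indicative constructs per snippet (lambdas and comprehensions vs. loops and in-place updates) and use the counts as features for a functional-vs-imperative classifier. The construct lists and toy snippets are illustrative assumptions.

        import ast
        from sklearn.linear_model import LogisticRegression

        FUNCTIONAL = (ast.Lambda, ast.ListComp, ast.GeneratorExp, ast.DictComp)
        IMPERATIVE = (ast.For, ast.While, ast.AugAssign)

        def style_features(source: str):
            """Two features: counts of functional-style and imperative-style AST nodes."""
            nodes = list(ast.walk(ast.parse(source)))
            return [sum(isinstance(n, FUNCTIONAL) for n in nodes),
                    sum(isinstance(n, IMPERATIVE) for n in nodes)]

        # Toy labelled snippets: 1 = functional style, 0 = imperative style.
        SNIPPETS = [
            ("total = sum(map(lambda x: x * x, values))", 1),
            ("squares = [x * x for x in values]", 1),
            ("total = 0\nfor x in values:\n    total += x * x", 0),
            ("result = []\nfor v in values:\n    result.append(v)", 0),
        ]

        model = LogisticRegression().fit([style_features(s) for s, _ in SNIPPETS],
                                         [label for _, label in SNIPPETS])
        print(model.predict([style_features("doubled = [2 * v for v in values]")]))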

References

[1]
M. Allamanis, E. T. Barr, S. Ducousso, and Z. Gao, “Typilus: Neural type hints,” in Proceedings of the 41st ACM SIGPLAN conference on programming language design and implementation, 2020, pp. 91–105.
[2]
J. Wei, M. Goyal, G. Durrett, and I. Dillig, “LambdaNet: Probabilistic type inference using graph neural networks,” arXiv preprint arXiv:2005.02161, 2020.
[3]
M. Allamanis, D. Tarlow, A. Gordon, and Y. Wei, “Bimodal modelling of source code and natural language,” in International conference on machine learning, 2015, pp. 2123–2132.
[4]
R.-M. Karampatsis, H. Babii, R. Robbes, C. Sutton, and A. Janes, “Big code != Big vocabulary: Open-vocabulary models for source code,” arXiv preprint arXiv:2003.07914, 2020.
[5]
V. J. Hellendoorn, C. Sutton, R. Singh, P. Maniatis, and D. Bieber, “Global relational models of source code,” in International conference on learning representations, 2019.
[6]
M. Choetkiertikul, H. K. Dam, T. Tran, T. Pham, A. Ghose, and T. Menzies, “A deep learning model for estimating story points,” IEEE Transactions on Software Engineering, vol. 45, no. 7, pp. 637–656, 2018.
[7]
M. Choetkiertikul, H. K. Dam, T. Tran, and A. Ghose, “Predicting delays in software projects using networked classification (t),” in 2015 30th IEEE/ACM international conference on automated software engineering (ASE), 2015, pp. 353–364.
[8]
R.-M. Karampatsis and C. Sutton, “How often do single-statement bugs occur? The ManySStuBs4J dataset,” arXiv preprint arXiv:1905.13334, 2019.
[9]
J. Bader, A. Scott, M. Pradel, and S. Chandra, “Getafix: Learning to fix bugs automatically,” Proceedings of the ACM on Programming Languages, vol. 3, no. OOPSLA, pp. 1–27, 2019.
[10]
S. Han, D. R. Wallace, and R. C. Miller, “Code completion from abbreviated input,” in 2009 IEEE/ACM international conference on automated software engineering, 2009, pp. 332–343.
[11]
X. Chen, C. Liu, and D. Song, “Tree-to-tree neural networks for program translation,” in Advances in neural information processing systems, 2018, pp. 2547–2557.
[12]
V. J. Hellendoorn, S. Proksch, H. C. Gall, and A. Bacchelli, “When code completion fails: A case study on real-world completions,” in Proceedings of the 41st international conference on software engineering, 2019, pp. 960–970.
[13]
V. J. Hellendoorn, C. Bird, E. T. Barr, and M. Allamanis, “Deep learning type inference,” in Proceedings of the 2018 26th ACM joint meeting on european software engineering conference and symposium on the foundations of software engineering, 2018, pp. 152–162.
[14]
X. Gu, H. Zhang, D. Zhang, and S. Kim, “Deep API learning,” in Proceedings of the 2016 24th ACM SIGSOFT international symposium on foundations of software engineering, 2016, pp. 631–642.
[15]
J. Liu, S. Kim, V. Murali, S. Chaudhuri, and S. Chandra, “Neural query expansion for code search,” in Proceedings of the 3rd ACM SIGPLAN international workshop on machine learning and programming languages, 2019, pp. 29–37.
[16]
G. Petrović and M. Ivanković, “State of mutation testing at Google,” in Proceedings of the 40th international conference on software engineering: Software engineering in practice, 2018, pp. 163–171.
[17]
M. Tufano, C. Watson, G. Bavota, M. Di Penta, M. White, and D. Poshyvanyk, “Learning how to mutate source code from bug-fixes,” arXiv preprint arXiv:1812.10772, 2018.
[18]
Q. Fu et al., “Where do developers log? An empirical study on logging practices in industry,” in Companion proceedings of the 36th international conference on software engineering, 2014, pp. 24–33.
[19]
H. Li, W. Shang, Y. Zou, and A. E. Hassan, “Towards just-in-time suggestions for log changes,” Empirical Software Engineering, vol. 22, no. 4, pp. 1831–1865, 2017.
[20]
H. Li, W. Shang, and A. E. Hassan, “Which log level should developers choose for a new logging statement?” Empirical Software Engineering, vol. 22, no. 4, pp. 1684–1716, 2017.
[21]
B. Johnson, Y. Song, E. Murphy-Hill, and R. Bowdidge, “Why don’t software developers use static analysis tools to find bugs?” in Proceedings of the 2013 international conference on software engineering, 2013, pp. 672–681.
[22]
J. Listfield, “Where do our flaky tests come from?” https://testing.googleblog.com/2017/04/where-do-our-flaky-tests-come-from.html, 2017.
[23]
Q. Luo, F. Hariri, L. Eloussi, and D. Marinov, “An empirical analysis of flaky tests,” in Proceedings of the 22nd ACM SIGSOFT international symposium on foundations of software engineering, 2014, pp. 643–653.
[24]
W. Lam, R. Oei, A. Shi, D. Marinov, and T. Xie, “iDFlakies: A framework for detecting and partially classifying flaky tests,” in 2019 12th IEEE conference on software testing, validation and verification (ICST), 2019, pp. 312–322.
[25]
E. Albert, P. Arenas, S. Genaim, G. Puebla, and D. Zanardini, “Cost analysis of Java bytecode,” in European symposium on programming, 2007, pp. 157–172.
[26]
E. D. Berger, C. Hollenbeck, P. Maj, O. Vitek, and J. Vitek, “On the impact of programming languages on code quality,” arXiv preprint arXiv:1901.10220, 2019.
[27]
B. Ray, D. Posnett, V. Filkov, and P. Devanbu, “A large scale study of programming languages and code quality in GitHub,” in Proceedings of the 22nd ACM SIGSOFT international symposium on foundations of software engineering, 2014, pp. 155–165.