On the importance of tools in software engineering research

On the first weekend of June, I was at the Mining Software Repositories (MSR) 2012 conference. For those not familiar with MSR, it is a venue where software engineering meets information extraction and data mining. Researchers present the tools and methods that they applied to software repositories (source code repositories, but also bug databases, mailing lists and wikis) to understand how software is written and how its quality is affected by certain events in the project’s history. Due to its wide scope, MSR is always a bit unbalanced with respect to the quality of the papers presented. This year, however, there were some really great submissions.

One of the most interesting talks was Dongmei Zhang’s keynote address on the first day. Dongmei is a senior researcher at Microsoft Research Asia, where she leads the development analytics project. During her presentation, she told some great tales from the research vs practice battlefield. One of them concerned a code clone detection tool that successfully graduated from Microsoft Research to internal Microsoft teams and finally to a Visual Studio 2012 plug-in. Dongmei explained that the most important reason this tool was successful was not the research upon which it was based, but the fact that it was a TOOL. Imperfect in the beginning, its speed and accuracy were improved after suggestions from users started pouring in. What she learned from this experience was that producing reusable tools out of the research was more important than doing the research itself. ‘Make tools. “It works on my computer” is no longer enough’, as she put it.

I was curious as to whether the above applies to the papers presented on the very same day (and the next) at the very same conference where Dongmei gave her keynote. To find out, I went through each paper and looked for pointers to the tools or datasets used. I also Googled the paper titles, hoping that the authors had put together a page containing the paper’s data or tools, as is often the case.

The following table summarizes what I have found:

Paper	Data	Tools	Documentation	Comment
Towards Improving Bug Tracking Systems with Game Mechanisms	Partial	No	No
GHTorrent: Github's Data from a Firehose	Yes	Yes	Partial
MIC Check: A Correlation Tactic for ESE Data	No	No	No
An Empirical Study of Supplementary Bug Fixes	No	No	No
Incorporating Version Histories in Information Retrieval Based Bug Localization	Yes	No	Yes	Uses existing documented dataset
Think Locally, Act Globally: Improving Defect and Effort Prediction Models	No	No	No	Promise to upload data
Green Mining: A Methodology of Relating Software Change to Power Consumption	No	No	No	Best paper award
Analysis of Customer Satisfaction Survey Data	No	No	No	Not based on open data
Mining Usage Data and Development Artifacts	No	No	No
Why Do Software Packages Conflict?	No	No	No	Original data in Debian repository
Discovering Complete API Rules with Mutation Testing	Yes	Yes	Yes	Not open source
Inferring Semantically Related Words from Software Context	No	No	No
Do Faster Releases Improve Software Quality? An Empirical Case Study of Mozilla Firefox	No	No	No
Explaining Software Defects Using Topic Models	No	No	No
A Qualitative Study on Performance Bugs	No	No	No
Can We Predict Types of Code Changes? An Empirical Analysis	No	Yes (most)	No
An Empirical Investigation of Changes in Some Software Properties Over Time	Yes	No	Yes	Uses existing dataset
Who? Where? What? Examining Distributed Development in Two Large Open Source Projects	Yes (partially)	No	No	Paper says the data is in the PROMISE dataset, but it could not be retrieved at the date of the conference

As you can see, the results are not particularly encouraging. In one of the most prominent empirical software engineering conferences, only two out of 18 papers provide their tools in full, and a third provides most of them (I have not investigated how reusable those tools actually are).
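For readers who want to check the counts, here is a minimal sketch in Python that tallies the table. The rows are copied from the table; the helper names (normalize, tally) and the choice to count every qualified answer such as "Yes (most)" as "partial" are my own assumptions, not part of the original survey.

```python
from collections import Counter

# (paper, data, tools, documentation), copied from the table in this post.
ROWS = [
    ("Towards Improving Bug Tracking Systems with Game Mechanisms", "Partial", "No", "No"),
    ("GHTorrent: Github's Data from a Firehose", "Yes", "Yes", "Partial"),
    ("MIC Check: A Correlation Tactic for ESE Data", "No", "No", "No"),
    ("An Empirical Study of Supplementary Bug Fixes", "No", "No", "No"),
    ("Incorporating Version Histories in Information Retrieval Based Bug Localization", "Yes", "No", "Yes"),
    ("Think Locally, Act Globally: Improving Defect and Effort Prediction Models", "No", "No", "No"),
    ("Green Mining: A Methodology of Relating Software Change to Power Consumption", "No", "No", "No"),
    ("Analysis of Customer Satisfaction Survey Data", "No", "No", "No"),
    ("Mining Usage Data and Development Artifacts", "No", "No", "No"),
    ("Why Do Software Packages Conflict?", "No", "No", "No"),
    ("Discovering Complete API Rules with Mutation Testing", "Yes", "Yes", "Yes"),
    ("Inferring Semantically Related Words from Software Context", "No", "No", "No"),
    ("Do Faster Releases Improve Software Quality? An Empirical Case Study of Mozilla Firefox", "No", "No", "No"),
    ("Explaining Software Defects Using Topic Models", "No", "No", "No"),
    ("A Qualitative Study on Performance Bugs", "No", "No", "No"),
    ("Can We Predict Types of Code Changes? An Empirical Analysis", "No", "Yes (most)", "No"),
    ("An Empirical Investigation of Changes in Some Software Properties Over Time", "Yes", "No", "Yes"),
    ("Who? Where? What? Examining Distributed Development in Two Large Open Source Projects", "Yes (partially)", "No", "No"),
]

# Position of each survey column inside a row tuple.
COLUMNS = {"data": 1, "tools": 2, "documentation": 3}


def normalize(answer):
    """Map a table cell to 'yes', 'no' or 'partial'.

    Assumption: any qualified answer ("Partial", "Yes (most)",
    "Yes (partially)") counts as 'partial'.
    """
    text = answer.strip().lower()
    if text in ("yes", "no"):
        return text
    return "partial"


def tally(rows, column):
    """Count the normalized answers in one column of the table."""
    index = COLUMNS[column]
    return Counter(normalize(row[index]) for row in rows)


if __name__ == "__main__":
    for column in COLUMNS:
        counts = tally(ROWS, column)
        print(column, dict(counts), "out of", len(ROWS))
```

Under that normalization, a paper only counts as "yes" for tools when the table gives an unqualified "Yes".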

In my opinion, what applies in practice should also apply in research. As researchers, we are often hesitant to provide reusable tools. Many times, this is because going the extra mile to convert our ‘works on my computer’ scripts into tools is very time-consuming and has no direct scientific value (i.e. it does not lead to papers). Some of us might even be afraid of competing teams: a published tool might allow others to find flaws in our research, or let a more resourceful team leap ahead of us using our effort.

Publishing a tool along with a paper has several advantages to research as a whole:

It enables research to become repeatable, facilitating both horizontal (more hypotheses) and vertical (more data) scaling of research efforts.
It enables research to become reproducible, leading to more credible results.
It enables people to become creative with someone else’s effort. This is precisely what made open source software successful, and it applies to research tools too (see for example LLVM or JikesRVM).

I believe that publishing reusable tools (plus data and documentation) should be a prerequisite to publishing papers, especially so in empirical venues. I therefore hope that efforts such as the RESER workshop will raise awareness of the importance of tools in software engineering research.

Why do you think that people are not investing time to create tools?

Published

03 June 2012
