On the first weekend of June, I was at the Mining Software Repositories (MSR) 2012 conference. For those not familiar with MSR, it is a venue where software engineering meets information extraction and data mining. Researchers present the tools and methods that they applied to software repositories (source code repositories, but also bug databases, mailing lists and wikis) to understand how software is written and how its quality is affected by certain events in the project’s history. Due to its wide scope, MSR is always a bit unbalanced with respect to the quality of the papers presented. This year, however, there were some really great submissions.

One of the most interesting talks was Dongmei Zhang’s keynote address on the first day. Dongmei is a senior researcher at Microsoft Research Asia, where she leads the development analytics project. During her presentation, she told some great tales from the research vs practice battlefield. One of them concerned a code clone detection tool that successfully graduated from Microsoft Research to internal Microsoft teams and finally to a Visual Studio 2012 plug-in. Dongmei explained that the most important reason this tool was successful was not the research upon which it was based, but the fact that it was a TOOL. Imperfect in the beginning, its speed and accuracy were improved after suggestions from users started pouring in. What she learned from this experience was that producing reusable tools out of the research was more important than doing the research itself. ‘Make tools. “It works on my computer” is no longer enough’, as she put it.

I was curious as to whether the above applied to the papers presented that same day (and the next) at the very conference where Dongmei gave her keynote. To find out, I went through each paper and looked for pointers to the tools or datasets used. I also Googled the paper titles, hoping that the authors had put together a page containing the paper’s data or tools, as is often the case.

The following table summarizes what I found:

| Paper | Data | Tools | Documentation | Comment |
|---|---|---|---|---|
| Towards Improving Bug Tracking Systems with Game Mechanisms | Partial | No | No | |
| GHTorrent: Github's Data from a Firehose | Yes | Yes | Partial | |
| MIC Check: A Correlation Tactic for ESE Data | No | No | No | |
| An Empirical Study of Supplementary Bug Fixes | No | No | No | |
| Incorporating Version Histories in Information Retrieval Based Bug Localization | Yes | No | Yes | Uses existing documented dataset |
| Think Locally, Act Globally: Improving Defect and Effort Prediction Models | No | No | No | Promise to upload data |
| Green Mining: A Methodology of Relating Software Change to Power Consumption | No | No | No | Best paper award |
| Analysis of Customer Satisfaction Survey Data | No | No | No | Not based on open data |
| Mining Usage Data and Development Artifacts | No | No | No | |
| Why Do Software Packages Conflict? | No | No | No | Original data in Debian repository |
| Discovering Complete API Rules with Mutation Testing | Yes | Yes | Yes | Not open source |
| Inferring Semantically Related Words from Software Context | No | No | No | |
| Do Faster Releases Improve Software Quality? An Empirical Case Study of Mozilla Firefox | No | No | No | |
| Explaining Software Defects Using Topic Models | No | No | No | |
| A Qualitative Study on Performance Bugs | No | No | No | |
| Can We Predict Types of Code Changes? An Empirical Analysis | No | Yes (most) | No | |
| An Empirical Investigation of Changes in Some Software Properties Over Time | Yes | No | Yes | Uses existing dataset |
| Who? Where? What? Examining Distributed Development in Two Large Open Source Projects | Yes (partially) | No | No | Paper mentions that data is on the PROMISE dataset, could not be retrieved at the date of the conference. |

As you can see, the results are not particularly encouraging. In one of the most prominent empirical software engineering conferences, only two out of 18 papers make their tools fully available, and a third makes most of them available (I have not investigated how reusable those tools actually are).
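
In the spirit of the keynote, the count can be re-derived rather than taken on faith. The snippet below is a minimal sketch: the `TOOLS` list is my own transcription of the Tools column from the table above (in table order), and the `tally` helper is a hypothetical name introduced only for this illustration.

```python
from collections import Counter

# Tools column of the table above, one entry per paper, in table order.
# Transcribed by hand; the labels are kept exactly as they appear in the table.
TOOLS = [
    "No", "Yes", "No", "No", "No", "No", "No", "No", "No",
    "No", "Yes", "No", "No", "No", "No", "Yes (most)", "No", "No",
]


def tally(column):
    """Count how often each availability label appears in a column."""
    return Counter(column)


counts = tally(TOOLS)
total = len(TOOLS)

print(f"Papers surveyed: {total}")
for label, n in counts.most_common():
    print(f"  {label}: {n}")
```

The same helper can be pointed at the Data or Documentation column by transcribing it the same way.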

In my opinion, what applies in practice should also apply in research. As researchers, we are often hesitant to provide reusable tools. Many times, this is because going the extra mile to convert our ‘works on my computer’ scripts to tools is very time-consuming and lacks any direct scientific value (i.e., it does not lead to papers). Some of us might even be afraid of competing teams: if a tool is published, others might find flaws in our research, or a more resourceful team might leap ahead of us using our effort.

Yet publishing a tool along with a paper has several advantages for research as a whole:

  • It enables research to become repeatable, facilitating both horizontal (more hypotheses) and vertical (more data) scaling of research efforts.
  • It enables research to become reproducible, leading to more credible results.
  • It enables people to become creative with someone else’s effort. This is precisely what made open source software successful, and it applies to research tools too (see for example LLVM or JikesRVM).

I believe that publishing reusable tools (plus data and documentation) should be a prerequisite to publishing papers, especially in empirical venues. I hope that efforts such as the RESER workshop will raise awareness of the importance of tools in software engineering research.

Why do you think that people are not investing time to create tools?


03 June 2012