We discuss the paper “Empirical validation of object-oriented metrics on open source software for fault prediction” [2] by T. Gyimothy et al.
Why do this research?
Object-Oriented (OO) Metrics have been extensively and independently validated.
However, this validation was on small or non-public programs.
For this reason, and because of the peculiarities of OSS, a validation on open source software seems in order.
The standard reasons for doing fault prediction apply (see our previous paper on the topic, “Cross project defect prediction”, discussed by Georgios): defect prediction saves time and money if done right.
What does the paper do?
The paper calculates eight OO metrics on the Mozilla open source code base to compare them with the set of metrics previously calculated by Basili et al., plus two additional metrics (one of them LOC). It does so via the authors’ own Columbus reverse engineering framework. The number of bugs found and fixed in each class is extracted from Mozilla’s Bugzilla.
It then uses not only statistical methods, but also machine learning (decision trees and neural networks) to predict the fault-proneness of code (a minimal sketch of such a modelling step follows below).
It also does an evolutionary study of seven versions of Mozilla.
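As an illustration of how such a prediction model can be set up, here is a minimal sketch in Python with scikit-learn. This is not the authors’ actual pipeline; the CSV file and metric column names are hypothetical placeholders.

```python
# Minimal sketch (not the paper's actual pipeline): predict fault-proneness
# of classes from OO metrics with a decision tree. The CSV file and column
# names (CBO, WMC, ..., bugs) are hypothetical placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score, recall_score

# One row per class: metric values plus the number of bugs reported in Bugzilla.
data = pd.read_csv("mozilla_class_metrics.csv")  # hypothetical export
X = data[["CBO", "WMC", "RFC", "DIT", "NOC", "LCOM", "LOC"]]
y = (data["bugs"] > 0).astype(int)  # fault-prone = at least one fixed bug

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model = DecisionTreeClassifier(max_depth=4, random_state=42)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("precision:", precision_score(y_test, pred))
print("recall:   ", recall_score(y_test, pred))
```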
What does the paper find?
The CBO (Coupling between Object classes) metric was best in predicting fault-proneness. LOC also does “fairly well.”
Why are the results important?
The size of a class is proportionally related to how buggy it is: the larger the class, the more bugs. But what about the interaction between classes? The CBO metric was best overall! So, how about combining both?
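As a hedged sketch of what “combining both” could look like, the following snippet fits one multivariate logistic regression over CBO and LOC together; the data file and column names are assumptions, not the paper’s actual model.

```python
# Sketch only: combine CBO (coupling) and LOC (size) in one multivariate
# logistic regression for fault-proneness. The CSV file and column names
# are hypothetical, not the paper's actual data layout.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("mozilla_class_metrics.csv")  # hypothetical export
X = data[["CBO", "LOC"]]                # coupling and size together
y = (data["bugs"] > 0).astype(int)      # fault-prone = at least one bug

model = make_pipeline(
    StandardScaler(),       # CBO and LOC live on very different numeric scales
    LogisticRegression()
)
scores = cross_val_score(model, X, y, cv=10, scoring="precision")
print("mean 10-fold precision:", scores.mean())
```

Standardising the features before the regression is just a convenience here, since LOC values are typically orders of magnitude larger than CBO values.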
“The precision of our models is not yet satisfactory.”
Things change over time! For example, a bump in Mozilla version 1.2, which the paper interprets as a quality decrease.
In which ways are the negative findings of this paper important?
What is problematic with using bug reports?
What can we really do with these results?
Can you download the toolset (cf. conclusion)? What can we (as researchers) do about this?
What do we think about this paper in general (quality)?
How could we improve this paper ourselves? There are obvious issues in the paper (no results until page 5, extremely casual language, an oddly placed introduction of OSS in the first paragraph of the introduction). Yet it is cited so much. What does that tell us?
Did the future work (studying OpenOffice.org) ever happen?
by Joost Wooning
The moderator asks if there are any general remarks on the paper. Someone notes that the paper does not really explicitly state any research questions, which is indeed the case.
The paper starts by describing open source systems, which is somewhat strange since that is not its main point. It could be that at the time of writing OSS did not yet have its current position, so it might have been necessary to point this out.
Because of the lack of explicit research questions it is hard to assess the quality of the paper. Another issue is that the descriptions of the graphs in the paper are not that good: they miss a description of the metric shown in each graph. Furthermore, there is no real, clear threats-to-validity section.
It is also noted that the first remark about the results of the paper only appears in the results section. It would be better to mention them earlier: in the abstract, or otherwise at least in the introduction.
The moderator asks what could be problematic about using bug reports.
It is stated that the listed tool is no longer available online, and the question arises how such tools should be kept available.
The paper lists some negative findings, some of which contradict previous research. This might be due to the use of a different dataset.
by Joost Wooning
The moderator asks what everyone thought about the developer study. It was strange that the paper stated it had developers who were authoritative over the code; however, they had no experience with quite a few of the files in the study.
The listed required characteristics for bug-prediction software were pretty much agreed on; however, the bias towards newer bugs is not always a good thing: older bugs may still be unsolved and are therefore still important.
The display of the single flag that a file is bug-prone could be improved. With this message a developer still doesn’t know what to do. This could be fixed by focusing on smaller code parts, for example showing that a method or code section is bug-prone.
Another problem with the solution provided in the paper is the soundness of the algorithm: developers tend to ignore warnings if there are too many false positives.
The moderator asks how people think the results of the paper would change if the tool were deployed in an IDE instead of the code review tool. It might change in that a developer working on a file would be extra careful when changing it. However, this could also be a problem: a developer might choose not to change a bug-prone file and create some kind of workaround instead. Another problem could be that developers still won’t ‘fix’ the file, because the message is not actionable and thus requires extra work.
This tool might be of better use to quality assurance developers instead of front-line developers. However, for this to work the quality assurance developers should have the authority to decide to change a file.
Would a machine learning technique provide better results than this solution? One problem is that the insights from machine learning aren’t that good: you still don’t know how to fix the problem. Another problem is that it is currently difficult to train a system on another project’s data.
Other problems noted about the paper: would it be useful if a file showed a number indicating its bug-proneness? For this to be useful it would also need some kind of timestamp alongside the number, and while the number provides more insight, it is still not an actionable message.