Paper discussion: Gyimothy et al. [2]

We discuss the paper “Empirical validation of object-oriented metrics on open source software for fault prediction” [2] by T. Gyimothy et al.

Paper

  • Appeared in 2005
  • Published at TSE – top-tier journal in SE
  • Cited 844 times
  • Citations still going strong

Citations over time

People

  • It is nice that the paper gives us some information about them on the last page.
  • Interesting order of authors: two professors (the first author likely already a full professor back then) and one engineer. Have a look at the last page of the journal!
  • By and large the most cited work of all three authors

Motivation

Why do this research?

  • Object-Oriented (OO) Metrics have been extensively and independently validated.

  • However, this validation was on small or non-public programs.

  • For this reason, and because of the peculiarities of OSS, a validation on OSS seems in order.

  • The standard reasons for doing fault prediction apply (see our previous paper on the topic, “Cross project defect prediction,” discussed by Georgios): defect prediction saves time and money if done right.

Research method

What does the paper do?

The paper calculates eight OO metrics on the Mozilla open-source code base: the set of metrics previously validated by Basili et al. plus two additional metrics (one of them LOC). It compares its results with those of Basili et al., computes the metrics with the authors’ own Columbus reverse-engineering framework, and extracts the number of bugs found and fixed in each class from Mozilla’s Bugzilla.

It then uses not only statistical methods but also machine learning (decision trees and neural networks) to predict the fault-proneness of code.

It also does an evolutionary study of seven versions of Mozilla.
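
To make the prediction step concrete, here is a minimal sketch (not the paper’s actual pipeline or tooling) of training a decision tree on class-level OO metrics to flag classes as fault-prone. The file name mozilla_metrics.csv, the column names, and the split parameters are assumptions made for illustration.

```python
# Minimal sketch: predict class-level fault-proneness from OO metrics
# with a decision tree. All names below (file, columns) are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score, recall_score

df = pd.read_csv("mozilla_metrics.csv")      # hypothetical per-class metrics table
features = ["CBO", "WMC", "RFC", "DIT", "NOC", "LCOM", "LCOMN", "LOC"]
X = df[features]
y = (df["bugs"] > 0).astype(int)             # fault-prone = at least one fixed bug

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

clf = DecisionTreeClassifier(max_depth=5, random_state=0)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
print("precision:", precision_score(y_test, pred))
print("recall:   ", recall_score(y_test, pred))
```

A class is labelled fault-prone here when at least one fixed bug is associated with it, mirroring the binary fault-proneness prediction described above; the precision/recall printout is just a stand-in for whichever evaluation measures one prefers.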

Results

What does the paper find?

The CBO (Coupling between Object classes) metric was best in predicting fault-proneness. LOC also does “fairly well.”

Discussion / Implications

Why are the results important?

  • The size of a class is related to how buggy it is: the larger the class, the more bugs. But what about the interaction between classes? The CBO metric was best overall! So how about combining both? (See the sketch after this list.)

  • “The precision of our models is not yet satisfactory.”

  • Things change over time! For example, a bump in Mozilla version 1.2, which the paper interprets as a quality decrease.
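
Following up on the first bullet above: a minimal sketch of what “combining both” could look like, namely a model that uses LOC and CBO together instead of one metric at a time. It reuses the hypothetical mozilla_metrics.csv table from the earlier sketch; the logistic-regression model and the column names are assumptions for illustration, not the paper’s exact setup.

```python
# Sketch: does combining size (LOC) and coupling (CBO) predict fault-proneness
# better than either metric alone? Reuses the hypothetical metrics table.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("mozilla_metrics.csv")      # hypothetical input file
y = (df["bugs"] > 0).astype(int)

for features in (["LOC"], ["CBO"], ["LOC", "CBO"]):
    model = LogisticRegression(max_iter=1000)
    # mean 5-fold cross-validated precision for each feature set
    score = cross_val_score(model, df[features], y, cv=5, scoring="precision").mean()
    print(f"{'+'.join(features):>7}: precision = {score:.2f}")
```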

Questions

Technical questions

  • In which ways are the negative findings of this paper important?

  • What is problematic with using bug reports?

  • What can we really do with these results?

  • Can you download the toolset (cf. conclusion)? What can we (as researchers) do about this?

Meta questions

  • What do we think about this paper in general (quality)?

  • How could we improve this paper ourselves? There are obvious issues in the paper (no results until page 5, extremely casual language, a curious introduction of OSS in the first paragraph of the introduction). Yet it is cited so much. What does that tell us?

  • Did the future work (studying OpenOffice.org) ever happen?

Discussion summary

by Joost Wooning

The moderator asks if there are any general remarks on the paper. Someone notes that the paper does not really explicitly state any research questions, which is indeed the case.

The paper starts by describing open-source systems, which is somewhat strange, since that is not its main point. It could be that at the time of writing OSS did not yet have its current position, so it may have been necessary to point this out.

Because of the lack of explicit research questions, it is hard to assess the quality of the paper. Another issue is that the descriptions of the graphs in the paper are not that good: they lack a description of the metric shown in each graph. Furthermore, there is no really clear threats-to-validity section.

It is also noted that the first remark about the paper’s results only appears in the results section. It would be better to mention them earlier: in the abstract, or otherwise at least in the introduction.

The moderator asks what could be problematic about using bug reports.

  • Bug reports can be incomplete; bugs still need to be found.
  • In evolutionary studies, the bug timeline has to be kept in mind: a bug can be present in multiple versions.
  • The latest versions of a system should not be used, because they can still contain hidden bugs.

It is stated that the listed tool is no longer available online, and the question arises how such tools should be kept available online:

  • GitHub, although there are better options
  • archive.org, which is run by a foundation
  • Zenodo
  • figshare
  • pure.tudelft.nl

The paper lists some negative findings; some of these contradict previous research. This might have happened because of a different dataset.

Paper discussion: Lewis et al. [3]

Discussion summary

by Joost Wooning

The moderator asks what everyone thought about the developer study. It was strange that the paper stated that its developers were authoritative over the code; however, they had no experience with quite a few of the files included in the study.

The listed required characteristics for bug-prediction software were largely agreed on; however, the bias towards the new is not always a good thing: older bugs may still be unsolved and are therefore still important.

The display of the single flag that a file is bug-prone could be improved. With this message a developer still does not know what to do. This could be fixed by focusing on smaller code parts, for example by showing that a method or code section is bug-prone.

Another problem with the solution provided in the paper is the soundness of the algorithm: developers tend to ignore warnings if there are too many false positives.

The moderator asks how people think the results of the paper would change if the tool were deployed in an IDE instead of the code review tool. It might change in that a developer working on a flagged file is extra careful when changing it. However, this could also be a problem: a developer might choose not to change a bug-prone file and create some kind of workaround instead. Another problem could be that developers still won’t ‘fix’ the file, because the message is not actionable and acting on it thus requires extra work.

This tool might be of better use to quality assurance developers instead of front-line developers. However, for this to work, the quality assurance developers should be able to decide to change a file.

Would a machine-learning technique provide better results than this solution? One problem is that the insights from machine learning are not that good: you still do not know how to fix the problem. Another problem is that it is currently difficult to train a system on another project’s data.

Other problems noted from the paper are:

  • The number of participants is quite low (only 19)
  • All participants were volunteers, which could have introduced a bias
  • Files related to bug-prone files are also bug-prone (translations, constants)

Another question is whether it would be useful if a file showed a number indicating its bug-proneness. However, for this to be useful, it would also need some kind of timestamp attached to the number. And while the number provides more insight, it still is not an actionable message.

References

[1] M. Hilton, T. Tunnell, K. Huang, D. Marinov, and D. Dig, “Usage, costs, and benefits of continuous integration in open-source projects,” in Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, 2016, pp. 426–437.

[2] T. Gyimothy, R. Ferenc, and I. Siket, “Empirical validation of object-oriented metrics on open source software for fault prediction,” IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 897–910, Oct. 2005.

[3] C. Lewis, Z. Lin, C. Sadowski, X. Zhu, R. Ou, and E. J. Whitehead Jr, “Does bug prediction support human developers? Findings from a Google case study,” in Proceedings of the 2013 International Conference on Software Engineering, 2013, pp. 372–381.

[4] M. D’Ambros, M. Lanza, and R. Robbes, “An extensive comparison of bug prediction approaches,” in 2010 7th IEEE Working Conference on Mining Software Repositories (MSR 2010), 2010, pp. 31–41.

[5] T. Hall, S. Beecham, D. Bowes, D. Gray, and S. Counsell, “A systematic literature review on fault prediction performance in software engineering,” IEEE Transactions on Software Engineering, vol. 38, no. 6, pp. 1276–1304, Nov. 2012.

[6] C. Catal and B. Diri, “A systematic review of software fault prediction studies,” Expert Systems with Applications, vol. 36, no. 4, pp. 7346–7354, 2009.

[7] E. Arisholm, L. C. Briand, and E. B. Johannessen, “A systematic and comprehensive investigation of methods to build and evaluate fault prediction models,” Journal of Systems and Software, vol. 83, no. 1, pp. 2–17, 2010.