Paper Discussion: Pinto et al. [1]

We discuss the paper “Understanding myths and realities of test-suite evolution” [1] by Pinto et al.

Paper

  • Appeared in 2012
  • Published at FSE – top-tier conference in SE
  • Cited 69 times
  • Seems to be picking up a little more citation momentum over time

(Figure: citations over time)

People

  • Leandro Sales Pinto – works at Booking.com; his Google Scholar profile suggests he did a PhD, and LinkedIn confirms it. This is his most cited paper as first author.

(Figure: Leandro Sales Pinto)

  • Saurabh Sinha – there are many people with that name; who is our author? The Google Scholar profile linked from the paper tells us. Currently: Research Staff Member, Thomas J. Watson Research Center, Yorktown Heights, NY, USA.
  • Alessandro Orso – Professor and Associate School Chair at Georgia Tech.

Remark: Interesting collaboration (Italy, India, US)

Motivation

Why do this research?

  • Test suites evolve over time.
  • Performing this evolution manually (in particular, repairing broken tests) is extremely time consuming.
  • Automated test-repair techniques save time and money.
  • However, to develop these techniques, we need a thorough understanding of test-suite evolution.
  • “Unfortunately, to date there are no studies in the literature that investigate how test suites evolve”

Research method

What does the paper do?

The paper (the authors) introduces a technique for studying test-suite evolution and a concrete tool, called TestEvol, that implements the technique. TestEvol works for Java and JUnit.

Q: Why is “the authors …” bad and “the paper …” better?

What does TestEvol do? It combines static and dynamic analysis techniques to compute the differences between the test suites associated with two versions of a program and to categorize those changes (a rough sketch follows the list below).

In particular, given two versions of a program and its test suite, TestEvol

  1. computes the differences in behavior between the test suites
  2. classifies the actual source code repairs performed between the versions
  3. and computes the coverage attained by the tests on the two program versions.
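
To make the classification concrete, here is a minimal sketch in Java of the kind of decision procedure such a differencer could apply to each test method. This is our illustration, not TestEvol’s actual implementation: the names (Outcome, Change, TestSuiteDiff, classify) and the repair heuristic (the old test no longer passes against the new program while the modified test does) are assumptions based on the description above.

    import java.util.Map;

    enum Outcome { PASS, FAIL, COMPILE_ERROR }

    enum Change { TEST_ADDED, TEST_DELETED, TEST_UNCHANGED, TEST_MODIFIED, TEST_REPAIRED }

    class TestSuiteDiff {

        /**
         * Classifies one test method, identified by its fully qualified name.
         *
         * oldTests / newTests map test names to method bodies in the old / new suite;
         * oldTestOnNewProgram / newTestOnNewProgram hold the dynamically observed
         * outcome of the old / new version of the test when run against the NEW program.
         */
        static Change classify(String test,
                               Map<String, String> oldTests,
                               Map<String, String> newTests,
                               Map<String, Outcome> oldTestOnNewProgram,
                               Map<String, Outcome> newTestOnNewProgram) {
            boolean inOld = oldTests.containsKey(test);
            boolean inNew = newTests.containsKey(test);

            if (!inOld) return Change.TEST_ADDED;      // only in the new suite
            if (!inNew) return Change.TEST_DELETED;    // only in the old suite

            // Static part: did the test's body change between the two suite versions?
            if (oldTests.get(test).equals(newTests.get(test))) return Change.TEST_UNCHANGED;

            // Dynamic part (repair heuristic): the old test no longer works against the
            // new program, but the modified test does -- the change "repaired" the test.
            if (oldTestOnNewProgram.get(test) != Outcome.PASS
                    && newTestOnNewProgram.get(test) == Outcome.PASS) {
                return Change.TEST_REPAIRED;
            }
            return Change.TEST_MODIFIED;
        }
    }

Iterating something like this over every test in the union of the two suites, for every consecutive version pair, would yield a distribution of additions, deletions, modifications, and repairs of the kind reported in the results.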

TestEvol is applied to six popular OSS systems from SourceForge, including two Apache projects. Each consecutive version pair of every program is then fed into the differencer.

Results

What does the paper find?

Most changes involve additions (56%) and deletions (15%) of test cases rather than modifications (29%). Of the modifications, only 22% (about 6% of all test-suite changes in total, since 0.22 × 0.29 ≈ 0.06) repaired broken tests.

Even test modifications tend to involve complex, hard-to-automate changes to test cases.

Discussion / Implications

Why are the results important?

The findings scope the applicability of automated test repair.

Previous research was somewhat misguided: “existing test-repair techniques that focus exclusively on assertions may have limited practical applicability.”

Questions

  • Is the premise of the paper fulfilled? Do we have solid evidence to believe that “Repairing existing test cases manually, however, can be extremely time consuming, especially for large test suites” (Abstract) is true? If we do not, what does our intuition tell us? Is this actually the premise of the paper?

  • After saying “automated test repairs do not frequently occur in practice,” the paper (one paragraph later!) comes to the conclusion that “Test repairs occur often enough in practice to justify the development of automated repair techniques.” In fact, in the conclusions, the paper goes even further by saying “First, we found that test repair does occur in practice: on average, we observed 16 instances of test repairs per program version, and a total of 1,121 test repairs. Therefore, automating test repair can clearly be useful.” Strong wording! What is the scientific basis on which this statement stands? Speculative: why would the paper contain such an inference?

  • What was the impact of a call to “Stop test repair research?”

  • Future work (from the paper): extending the empirical study by considering additional programs. How, ideally, should it be extended?

Meta questions

Discussion summary

by Mark Haakman

Q: Are modifications really the least common of the test changes? It could be that some additions and deletions are really just renamed tests, not true additions and deletions.

A: The version history does not provide everything. An improvement would be to compare abstract syntax trees.

Q: What did you like about the paper?

A: The paper is complete, and the authors discuss the threats to validity of their own work. The paper has real scientific backing for why this problem should be looked at, which is a trend in Software Engineering. The paper was not hard to read, and the general methodology is made very clear.

Q: Is the premise of the paper true at all?

A: To check this, you could see whether the premise is backed by previous research, look for evidence for the premise in the paper itself, or use your own intuition based on your own experience in software engineering.

Q: They only looked at six open source projects. Is this enough?

A: These six have few dependencies and no GUI, which makes them easier to work with, but it also means they do not reflect all open-source projects. They are also only Java (and JUnit) projects. What is good about these projects is that they are of high quality and contain a lot of tests. Closed-source products could also be looked at, as well as smaller projects.

Q: Did the authors use these projects because their technique could not be used for other projects?

A: They stated that the projects had to be popular, maintained, and using JUnit.

Q: “First, we found that test repair does occur in practice: on average, we observed 16 instances of test repairs per program version, and a total of 1,121 test repairs. Therefore, automating test repair can clearly be useful.” This contradicts everything they said before.

A: They probably want to justify their work and may have a justification bias themselves. Their view might have changed, or they want to leave the door open for further research. Another study should be done on whether it is worth automating test repairs.

Q: What do we think about the abstract?

  • “Without such knowledge, we risk to develop techniques that may work well for only a small number of tests or, worse, that may not work at all in most realistic cases.” Here they do contradict themselves.
  • The authors are putting their opinions in the abstract.
  • The abstract is very long. The abstract should be limited to 250 words.

Paper Discussion: Beller et al. [2]

We discuss the paper “Developer Testing in the IDE: Patterns, Beliefs, and Behavior” [2] by Beller et al.

Q: What did you like about the paper?

A: The paper has a clear structure and a nice flow. It improves on previous WatchDog research. Also, this time the IDE itself is observed rather than mining repositories.

Test Driven Development (TDD)

Q: Why do you use TDD?

A: It has its place in some projects; in others it doesn’t. TDD has its benefits and disadvantages. When the requirements are not fully clear, TDD is costly because tests may need to be changed due to changing requirements. One bad thing about TDD: a passing test can make you overconfident about your production code.

Q: What is the common belief about TDD? Is it practiced a lot?

A: The group thinks it is not practiced a lot. Maybe the broader definition is followed, but not the strict definition given in the paper.

Q: What could we do to check if the TDD model in the paper is correct?

A: Other big researchers in the field agreed with this model.

Discussing the results

Q: Why do you think people overestimated their time in testing?

  • Testing can be tedious, so it seems to take more time.
  • Anchoring bias; also, people know they should test, so they fill in a large percentage.

Q: How could the anchoring by the slider be prevented?

A: Use a text field, but that is visually less attractive. The slider could be put in a random position each time. Or the handle could be hidden at first, so the user has to place it on the line themselves.

Q: How would you interpret “Only 25% of test cases are responsible for 75% of test execution failures in the IDE.” ?

A: The CI tells you a particular test is failing, and the software engineer runs that failing test in the IDE. A passing test will not be looked at, but a failing test will be modified, so the chance is higher that this test will fail again.

Q: Why do you think the numbers for students ended up different?

A: Students may be forced to reach 80% line coverage, or they may not care about testing at all because their projects are so small.

Discussing the methods

Q: What kind of projects are looked at?

A: This should be stated in the ‘participants’ section of any paper. In this case, participants often wanted to stay anonymous, so that commercial projects would not risk being judged based on the results of this paper.

Q: How would you replicate this paper differently?

A: The fact that TDD is not used could be researched in a different paper.

Q: How would you perform research on why TDD is not used? How do you take into account the huge number of factors that differ between software projects?

A: Observe different teams, some that use TDD and some that do not (an experiment). It would actually be very hard to study this.

Q: What is the relevance of this paper in the field of software analytics?

A: IDE developers could use this paper to think about TDD in the IDE again. (More answers were given.)

References

[1]
L. S. Pinto, S. Sinha, and A. Orso, “Understanding myths and realities of test-suite evolution,” in Proceedings of the ACM SIGSOFT 20th international symposium on the foundations of software engineering, 2012, p. 33.
[2]
M. Beller, G. Gousios, A. Panichella, S. Proksch, S. Amann, and A. Zaidman, “Developer testing in the IDE: Patterns, beliefs, and behavior,” IEEE Transactions on Software Engineering, p. 1.
[3]
A. Zaidman, B. Van Rompaey, A. van Deursen, and S. Demeyer, “Studying the co-evolution of production and test code in open source and industrial developer test processes through repository mining,” Empirical Software Engineering, vol. 16, no. 3, pp. 325–364, Jun. 2011.
[4]
T. B. Noor and H. Hemmati, “Test case analytics: Mining test case traces to improve risk-driven testing,” in 2015 IEEE 1st international workshop on software analytics (SWAN), 2015, pp. 13–16.
[5]
H. K. N. Leung and K. M. Lui, “Testing analytics on software variability,” in 2015 IEEE 1st international workshop on software analytics (SWAN), 2015, pp. 17–20.