We discuss the paper “Understanding myths and realities of test-suite evolution” [1] by Pinto et al.
Remark: Interesting collaboration (Italy, India, US)
Why do this research?
What does the paper do?
The paper (the authors) presents a technique for studying test-suite evolution and a concrete tool, called TestEvol, that implements the technique. TestEvol works for Java and JUnit.
Q: Why is “the authors …” bad and “the paper …” better?
What does TestEvol do? It combines static and dynamic analysis techniques to compute the differences between the test suites associated with two versions of a program and categorize such changes.
In particular, given two versions of a program and its test suite, TestEvol compares the two test suites and classifies each test-level change (addition, deletion, or modification).
TestEvol is applied to 6 popular OSS systems from SourceForge, including 2 Apache projects. Each pair of consecutive versions of every program was then fed into the differencer.
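For intuition, here is a minimal, hypothetical Java sketch of the categorization idea only. It is not the authors' TestEvol implementation (which, per the paper, combines static and dynamic analysis); it just compares test-method bodies, represented as plain strings, across two versions. All class, method, and test names below are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not the authors' TestEvol implementation): given the
// test methods of two versions of a program, represented here simply as a map
// from test name to test body, label each test as added, deleted, modified,
// or unchanged.
public class TestSuiteDiff {

    enum Change { ADDED, DELETED, MODIFIED, UNCHANGED }

    static Map<String, Change> diff(Map<String, String> oldTests,
                                    Map<String, String> newTests) {
        Map<String, Change> result = new HashMap<>();
        for (Map.Entry<String, String> old : oldTests.entrySet()) {
            if (!newTests.containsKey(old.getKey())) {
                result.put(old.getKey(), Change.DELETED);
            } else if (!old.getValue().equals(newTests.get(old.getKey()))) {
                result.put(old.getKey(), Change.MODIFIED);
            } else {
                result.put(old.getKey(), Change.UNCHANGED);
            }
        }
        for (String name : newTests.keySet()) {
            if (!oldTests.containsKey(name)) {
                result.put(name, Change.ADDED);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> v1 = Map.of(
            "testAdd", "assertEquals(4, add(2, 2));",
            "testSub", "assertEquals(0, sub(2, 2));");
        Map<String, String> v2 = Map.of(
            "testAdd", "assertEquals(4, add(2, 2));",
            "testSub", "assertEquals(1, sub(3, 2));",  // modified (possibly repaired)
            "testMul", "assertEquals(6, mul(2, 3));"); // added
        diff(v1, v2).forEach((test, change) -> System.out.println(test + " -> " + change));
    }
}
```

Deciding what kind of change a MODIFIED test represents (e.g., an assertion repair versus a more complex change) is roughly where the static and dynamic analyses of the real tool come in.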
What does the paper find?
Most changes involve deletions (15%) and additions (56%) of test cases, not modifications (29%). Of the modifications, only 22% (or 6% of all test evolutions in total!) repaired broken tests.
Even test modifications tend to involve complex, and hard-to-automate, changes to test cases.
Why are the results important?
It scopes the applicability of test repairs.
Previous research somewhat misguided: “existing test-repair techniques that focus exclusively on assertions may have limited practical applicability.”
Is the premise of the paper fulfilled? Do we have solid evidence to believe that “Repairing existing test cases manually, however, can be extremely time consuming, especially for large test suites” (Abstract) is true? If we do not, what does our intuition tell us? Is this actually the premise of the paper?
After saying “automated test repairs do not frequently occur in practice,” the paper (one paragraph later!) comes to the conclusion that “Test repairs occur often enough in practice to justify the development of automated repair techniques.” In fact, in the conclusions, the paper goes even further by saying “First, we found that test repair does occur in practice: on average, we observed 16 instances of test repairs per program version, and a total of 1,121 test repairs. Therefore, automating test repair can clearly be useful.” Strong language! What is the scientific basis on which this statement stands? Speculative: Why would the paper contain such an inference?
What was the impact of a call to “Stop test repair research”?
Future work: “extending our empirical study by considering additional programs.” How, ideally, should it be extended?
by Mark Haakman
Q: Are modifications really the least common of the test changes? It could be that additions and deletions are just modifications of the name and not really additions and deletions.
A: Version history does not provide everything. An improvement would be to look at abstract syntax trees; see the sketch below.
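As a hedged illustration of that point (not something the paper does), the sketch below treats a “deleted” test whose body reappears verbatim under a new name as a likely rename rather than a genuine deletion plus addition. A real implementation would compare abstract syntax trees instead of raw source strings, so that formatting changes do not hide renames. All names here are hypothetical.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: flag deleted/added test pairs with identical bodies as
// likely renames. Comparing abstract syntax trees instead of raw strings would
// make this robust against reformatting.
public class RenameDetector {

    static Set<String> findLikelyRenames(Map<String, String> deletedTests,
                                         Map<String, String> addedTests) {
        Set<String> renames = new HashSet<>();
        for (Map.Entry<String, String> deleted : deletedTests.entrySet()) {
            for (Map.Entry<String, String> added : addedTests.entrySet()) {
                if (deleted.getValue().equals(added.getValue())) {
                    renames.add(deleted.getKey() + " -> " + added.getKey());
                }
            }
        }
        return renames;
    }

    public static void main(String[] args) {
        Map<String, String> deleted = Map.of("testSum", "assertEquals(4, add(2, 2));");
        Map<String, String> added   = Map.of("testAddition", "assertEquals(4, add(2, 2));");
        // Prints [testSum -> testAddition]: a rename, not a real deletion + addition.
        System.out.println(findLikelyRenames(deleted, added));
    }
}
```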
Q: What did you like about the paper?
A: The paper is complete, and the authors discuss the threats to their own validity. The paper gives real scientific backing for why this problem should be looked at, which is a trend in Software Engineering. The paper was not hard to read, and the general methodology is made very clear.
Q: Is the premise of the paper true at all?
A: To check this, you could see whether the premise is backed by previous research, look for evidence for the premise in the paper itself, or use your own intuition based on your own experience in software engineering.
Q: They only looked at six open source projects. Is this enough?
A: These six are easy to work with because they have few dependencies and no GUI. This makes them easier to work with, but they do not reflect all open-source projects. They are also only Java (and JUnit) projects. What is good about these projects is that they are of high quality and contain a lot of tests. Closed-source products could also be looked at, as well as smaller projects.
Q: Did the authors use these projects because their technique could not be used for other projects?
A: They stated the projects had to be popular, maintained, and using JUnit.
Q: “First, we found that test repair does occur in practice: on average, we observed 16 instances of test repairs per program version, and a total of 1,121 test repairs. Therefore, automating test repair can clearly be useful.” This contradicts everything they said before.
A: They probably want to justify their work and may have a justification bias themselves. Their view might have changed, or they may want to leave the door open for further research. Another study should be done on whether it is worth automating test repairs.
Q: What do we think about the abstract?
We discuss the paper “Developer Testing in The IDE: Patterns, Beliefs, And Behavior” [2] by Beller et al.
Q: What did you like about the paper?
A: The paper had a clear structure and a nice flow. It improves on the previous research behind WatchDog. And this time the IDE itself is looked at, not mined repositories.
Q: Why do you use TDD?
A: It has its place in some projects; in others it does not. TDD has its benefits and disadvantages. When the requirements are not fully clear, TDD costs a lot because tests may need to be changed due to changes in requirements. One bad thing about TDD: a passing test can make you overconfident about your production code.
Q: What is the common belief about TDD? Is it practiced a lot?
A: The group thinks it is not practiced a lot. Maybe the broader definition is used, but not the strict definition given in the paper.
Q: What could we do to check if the TDD model in the paper is correct?
A: Other big researchers in the field agreed with this model.
Q: Why do you think people overestimated their time in testing?
Q: How could the anchoring by the slider be prevented?
A: Using a text field, but this is visually less attractive. The slider could be put in a random position each time. The pointer could also be hidden at first, so that the user has to place the pointer on the line themselves.
Q: How would you interpret “Only 25% of test cases are responsible for 75% of test execution failures in the IDE.” ?
A: The CI tells you that a particular test is failing, and the software engineer runs the faulty test in the IDE. A passing test will not be looked at, but a failing test will be modified, so the chance is higher that this test will fail again.
Q: Why do you think the numbers of students ended up different?
A: Students may be forced to have 80% line test coverage, or may not care about testing at all because their projects are so small.
Q: What kind of projects are looked at?
A: This should be stated in the ‘participants’ section of any paper. In this case, participants often wanted to stay anonymous, so that commercial projects would not risk being judged on the results of this paper.
Q: How would you replicate this paper differently?
A: The fact that TDD is not used could be researched in a different paper.
Q: How would you perform research into why TDD is not used? How do you take into account the huge number of factors in different software projects?
A: Observe different teams, some that use TDD and some that do not (an experiment). It would actually be very hard to study this.
Q: What is the relevance of this paper in the field of software analytics?
A: IDE developers could use this paper to think about TDD in the IDE again. (more answers are given)