Software analytics is a modern term for the use of empirical (mostly quantitative) research methods on software data.
In this lecture, we will:
Quantitative software engineering is a subset of empirical software engineering, a discipline that
D: Can you identify potential applications of quantitative software engineering?
Empiricism is a philosophical theory that states that true knowledge can only arise by systematically observing the world.
Types of empirical research:
Empirical research requires the collection of data to answer research questions (RQs).
Qualitative research methods collect non-numerical data
Quantitative methods use mathematical, statistical or numerical techniques to process numerical data:
A hypothesis proposes an explanation for a phenomenon
Defined in pairs
A good hypothesis is readily falsifiable.
Most statistical tests return a \(p\)-value: the probability of observing data at least as extreme as ours, assuming that \(H_0\) is true. It is not the probability that \(H_0\) itself is true.
To interpret a test, we set a threshold (usually, 0.05) for \(p\)
If \(p <\) threshold, then the null hypothesis is rejected in favor of the alternative one
We need to know beforehand what statistical tests do
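To make the role of \(p\) and the threshold concrete, here is a minimal sketch of a two-sided permutation test on the difference of two means. It is one test among many and is not prescribed by the lecture; the function name `permutation_test`, the two data sets and the team labels are made up for illustration.

```python
import random
from statistics import mean

def permutation_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference of means.

    H0: both samples come from the same distribution.
    Returns an estimated p-value: the fraction of random relabelings
    whose absolute mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_resamples + 1)

# Hypothetical data: code review turnaround (hours) for two made-up teams.
team_a = [4.1, 5.0, 3.8, 4.4, 4.9, 5.2, 4.0, 4.6]
team_b = [6.3, 5.9, 7.1, 6.8, 6.0, 7.4, 6.6, 6.9]

alpha = 0.05                       # threshold chosen before running the test
p = permutation_test(team_a, team_b)
reject_h0 = p < alpha              # True means H0 is rejected
print(f"p = {p:.4f}, reject H0: {reject_h0}")
```

The returned value only says how surprising the observed difference would be if \(H_0\) held; it does not measure how large or how important the difference is.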
A theory is a proposed explanation for an observed phenomenon. It (usually) specifies entities and prescribes their interactions. Using a theoretical model, we can explain and predict
Q: How can we build or dismantle a theory?
Theories are built by generalizing over consecutive research results.
A single contradicting data point is enough to reject a theory.
Extract samples of data from a running process. Data types:
McCabe’s complexity [1]: Attempt to quantify complexity at the function level by counting the number of branches.
Halstead software science [2]: Attempt to generate laws of software growth
Curtis et al. [3] found that: “All three metrics (Halstead volume, McCabe complexity, LoCs) correlated with both the accuracy of the modification and the time to completion.”
they just work!
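As a rough illustration of the branch-counting idea behind McCabe's metric, the sketch below walks a Python syntax tree and reports one plus the number of decision points per function. Which node types count as decision points is an assumption made here; real tools differ in the details, and the helper name `cyclomatic_complexity` is hypothetical.

```python
import ast

# Node types treated as decision points in this sketch (an assumption;
# real tools differ in what they count).
_DECISION_NODES = (ast.If, ast.For, ast.While, ast.IfExp, ast.ExceptHandler)

def cyclomatic_complexity(source: str) -> dict:
    """Map each function name in `source` to 1 + its number of decision points.

    Nested functions are also counted as part of their enclosing function.
    """
    result = {}
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            decisions = 0
            for child in ast.walk(node):
                if isinstance(child, _DECISION_NODES):
                    decisions += 1
                elif isinstance(child, ast.BoolOp):
                    # `a and b` short-circuits, adding one extra path per operator.
                    decisions += len(child.values) - 1
            result[node.name] = 1 + decisions
    return result

SAMPLE = '''
def classify(n):
    if n < 0:
        return "negative"
    for d in range(2, n):
        if n % d == 0 and d > 1:
            return "composite"
    return "other"

def identity(x):
    return x
'''

complexities = cyclomatic_complexity(SAMPLE)
print(complexities)
```

In the sample, `classify` has two `if` statements, one `for` loop and one `and`, so it scores 5, while the straight-line `identity` scores 1.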
Boehm [4] defined the COCOMO model, an effort to quantify and predict software cost:
\(a, b, c\) and \(d\) were collected through case studies.
Both COCOMO and function points are widely used today for cost estimation.
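The sketch below shows how the four COCOMO parameters \(a, b, c, d\) could be used. The text above does not reproduce the equations, so this assumes the commonly cited basic COCOMO form, effort \(E = a \cdot \mathrm{KLOC}^b\) in person-months and duration \(D = c \cdot E^d\) in months, together with the coefficient values usually quoted for the three basic project modes. The function name and the 32 KLOC input are illustrative only.

```python
# Commonly cited basic COCOMO coefficients (a, b, c, d) per project mode.
# Assumed here for illustration; check them against [4] before relying on them.
COEFFICIENTS = {
    "organic":       (2.4, 1.05, 2.5, 0.38),
    "semi-detached": (3.0, 1.12, 2.5, 0.35),
    "embedded":      (3.6, 1.20, 2.5, 0.32),
}

def basic_cocomo(kloc: float, mode: str = "organic") -> tuple[float, float]:
    """Return (effort in person-months, duration in months) for `kloc`
    thousand lines of code, using the basic COCOMO equations."""
    a, b, c, d = COEFFICIENTS[mode]
    effort = a * kloc ** b        # E = a * KLOC^b
    duration = c * effort ** d    # D = c * E^d
    return effort, duration

# Hypothetical 32 KLOC project developed in "organic" mode.
effort, duration = basic_cocomo(32, "organic")
print(f"effort: {effort:.1f} person-months, duration: {duration:.1f} months")
```

Because \(b > 1\), the estimated effort grows faster than linearly with size, which is the model's way of expressing diseconomies of scale.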
Manny Lehman [5] defined a set of laws that characterise how software evolves (and ultimately predict its demise)
Using metrics to define product and process quality
Basili [7]: The Goal-Question-Metric approach:
A goal is stated as follows:
what | example |
---|---|
Object of study | A tool or a practice |
Purpose | Characterize, improve, predict etc |
Focus | Perspective to study the problem from |
Stakeholder | Who is concerned with the result? |
Context | Confounding factors (e.g. company, environment) |
The GQM approach is another way of describing the scientific method.
Mockus et al.: “Two case studies of open source software development: Apache and Mozilla” [9]
Not the first to use OSS data, but:
von Krogh et al.: “Community, joining, and specialization in open source software innovation: a case study” [10]
Defined the, now obvious, vocabulary of OSS research:
Herbsleb and Mockus: “An empirical study of speed and communication in globally distributed software development” [11]
Zimmermann et al.: “Mining Version Histories to Guide Software Changes” [12]
Very important work because:
Nagappan et al.: “Mining Metrics to Predict Component Failures” [13]
Heitlager et al.: “A Practical Model for Measuring Maintainability” [14]
Noteworthy findings (at the file level):
Predicting component failures: Hassan [15] found a connection between process metrics and bugs
Distributed software development: Bird et al. [16] found that software quality is not affected by distance
No model to rule them all: Zimmermann et al. [17] established that software projects are different and therefore models need to be localised and specialised.
Naturalness: Hindle et al. [18] found that “code is very repetitive, and in fact even more so than natural languages”
In the early 2010s, the velocity of software production increased at a breakneck rate
GitHub revolutionized OSS by centralizing it. Anyone can contribute (and contribute they do!).
AppStores made discoverability and distribution to the end client trivial.
The cloud transformed hardware into software.
Software analytics was coined as a term to help teams improve their performance
Big Software: GHTorrent (Gousios [19]) made TBs of GitHub data available to researchers. Inspired TravisTorrent [20] and SOTorrent [21]
Big testing: Herzig et al. [22] developed “a cost model, which dynamically skips tests when the expected cost of running the test exceeds the expected cost of removing it.”
Big security: Gorla et al. [23] “after clustering Android apps by their description topics, (we) identified outliers in each cluster with respect to their API usage.”
Code summarization: Allamanis et al. [24] use CNNs to automatically give names to methods based on their contents
Code search: Gu et al. [25] search for code snippets using natural language queries
PR Duplicates: Nijessen [26] used deep learning to find duplicate PRs
An overview can be seen in this taxonomy.
In this course, we will focus on state of the art research in the areas of:
Ref | Who? | Definition |
---|---|---|
[27] | Hassan | [Software Intelligence] offers software practitioners (not just developers) up-to-date and pertinent information to support their daily decision-making processes. |
[28] | Buse | The idea of analytics is to leverage potentially large amounts of data into real and actionable insights. |
[29] | Zhang | Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services. |
[31] | Menzies | Software analytics is analytics on software data for managers and software engineers with the aim of empowering software development individuals and teams to gain and share insight from their data to make better decisions. |
D: So what is software analytics?
The broader goal of software analytics is to extract value from data traces residing in software repositories, in order to assist developers to write better software.
The course contents are copyrighted (c) 2018 - onwards by TU Delft and their respective authors and licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.