Empirical research methods

Two types: Qualitative and Quantitative

Quantitative research

Extract answers to research questions using mathematical, statistical or numerical techniques

Hypotheses

Propose an explanation for a phenomenon

Defined in pairs: a null hypothesis (H0, e.g. “no effect”) and an alternative hypothesis (H1)

A good hypothesis is readily falsifiable

Hypotheses | p-values
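The p-value is the probability of observing data at least as extreme as ours if the null hypothesis were true; \(p < 0.05\) is the conventional threshold for rejecting it. A minimal sketch, using R's built-in sleep dataset (an assumption for illustration, not part of the original slides):

# H0: the two drugs increase sleep by the same amount; H1: they do not
# A small p-value (< 0.05) would lead us to reject H0
t.test(extra ~ group, data = sleep)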

Measurement

Extract samples of data from a running process. Data types:

R in a nutshell

A brief introduction to R

  • Programming environment tuned to statistics
  • Inspired by Lisp, FP-based
  • Interactive workspace
  • 3 base data types: vector, list, data.frame

Vectors

Vectors contain items of a single type

a <- c(1:5)
a
## [1] 1 2 3 4 5
length(a) # Length of a vector
## [1] 5
a * 2     # Operations apply on all elements
## [1]  2  4  6  8 10

Vectors

b <- c(3:9) # The shorter vector (a) is recycled to match the longer one
a * b
## [1]  3  8 15 24 35  8 18
a < 4
## [1]  TRUE  TRUE  TRUE FALSE FALSE

Lists

Lists contain items of multiple types, indexed by name or position

a <- list(first=c(1,2), second=4)
a
## $first
## [1] 1 2
## 
## $second
## [1] 4
a$first
## [1] 1 2

Data frames

  • data.frame: A table of typed data
    • Rows are measurements
    • Columns are variables

mtcars[,c(1:4)]
##                      mpg cyl  disp  hp
## Mazda RX4           21.0   6 160.0 110
## Mazda RX4 Wag       21.0   6 160.0 110
## Datsun 710          22.8   4 108.0  93
## Hornet 4 Drive      21.4   6 258.0 110
## Hornet Sportabout   18.7   8 360.0 175
## Valiant             18.1   6 225.0 105
## Duster 360          14.3   8 360.0 245
## Merc 240D           24.4   4 146.7  62
## Merc 230            22.8   4 140.8  95
## Merc 280            19.2   6 167.6 123
## Merc 280C           17.8   6 167.6 123
## Merc 450SE          16.4   8 275.8 180
## Merc 450SL          17.3   8 275.8 180
## Merc 450SLC         15.2   8 275.8 180
## Cadillac Fleetwood  10.4   8 472.0 205
## Lincoln Continental 10.4   8 460.0 215
## Chrysler Imperial   14.7   8 440.0 230
## Fiat 128            32.4   4  78.7  66
## Honda Civic         30.4   4  75.7  52
## Toyota Corolla      33.9   4  71.1  65
## Toyota Corona       21.5   4 120.1  97
## Dodge Challenger    15.5   8 318.0 150
## AMC Javelin         15.2   8 304.0 150
## Camaro Z28          13.3   8 350.0 245
## Pontiac Firebird    19.2   8 400.0 175
## Fiat X1-9           27.3   4  79.0  66
## Porsche 914-2       26.0   4 120.3  91
## Lotus Europa        30.4   4  95.1 113
## Ford Pantera L      15.8   8 351.0 264
## Ferrari Dino        19.7   6 145.0 175
## Maserati Bora       15.0   8 301.0 335
## Volvo 142E          21.4   4 121.0 109

Data frames

Information about a data frame

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Data frames | indexing

Accessing data frame items by column returns a vector of values

mtcars$mpg # or mtcars[, 1]
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4

Data frames | indexing

Accessing data frame items by row returns a new data.frame

mtcars[c(1,3), ] # First and third row
##             mpg cyl disp  hp drat   wt  qsec vs am gear carb
## Mazda RX4  21.0   6  160 110 3.90 2.62 16.46  0  1    4    4
## Datsun 710 22.8   4  108  93 3.85 2.32 18.61  1  1    4    1

Data frames | searching

Searching in a data frame returns a new data frame

subset(mtcars, cyl == 6 & wt > 2.8)
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4

Loading and saving data

R can read tabular data from many data sources (incl. databases). Out of the box, it supports CSV.

data <- read.csv("input.csv", sep = ",")
str(data)

Writing data out to a CSV file:

write.csv(data, "data.csv")

Statistical data visualization

Histogram

Probability distribution of 1 variable

hist(mtcars$mpg)

Scatter plot

Actual values of 2 variables on a 2D plot

plot(mtcars$mpg, mtcars$qsec)

Box plot

Summaries of 1 variable grouped by another

boxplot(mpg ~ cyl, mtcars)

Bar plot

Frequency counts of 1 categorical variable

counts <- table(mtcars$gear)
barplot(counts, main="Car Distribution", xlab="Number of Gears", ylab="Number of cars")

Grouped bar plot

Frequency counts of 2 or more categorical variables

counts <- table(mtcars$vs, mtcars$gear)
barplot(counts, main="Car Distribution by Gears and VS", xlab="Number of Gears",
        col=c("darkblue","red"), legend = rownames(counts), beside=TRUE)

Line/Area chart

Density (area) plot of 1 variable, split by a grouping factor

# mtcars1 and defaults are defined elsewhere in the slide code (requires ggplot2)
ggplot(mtcars1) + aes(x = mpg, fill = gear) + geom_density() + defaults

Facets

Split visualization in groups based on factors

# mtcars1 and defaults as before (defined elsewhere in the slide code)
ggplot(mtcars1) + aes(x = hp, y = mpg, shape=am, color=am) + facet_grid(gear~cyl) +
   xlab("Horsepower") + ylab("Miles per Gallon") + geom_point(size = 4) + defaults

Distributions

Distributions | Normal

Identified by the characteristic ‘bell curve’ histogram

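A minimal sketch (synthetic data, not from the original slides) of what the bell-shaped histogram looks like:

# Draw 1000 samples from a standard normal distribution and plot them
set.seed(42)
hist(rnorm(1000))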

Distributions | Non-normal

Histograms are left or right skewed
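A corresponding sketch with synthetic data: exponentially distributed samples produce a right-skewed histogram.

# Right-skewed: most values are small, with a long tail to the right
set.seed(42)
hist(rexp(1000))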

Statistical techniques

Correlations

Correlation measures the statistical dependence between two variables.

  • Pearson correlation examines linear dependencies, i.e. whether in a scatter plot all dots lie on a straight line.
  • Rank correlations (e.g. Spearman's \(\rho\)) examine the extent to which, when one variable increases, the other increases as well (monotonic dependence).

Correlations | Pearson

Used for normally distributed data

cor.test(mtcars$mpg, mtcars$hp, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  mtcars$mpg and mtcars$hp
## t = -6.7424, df = 30, p-value = 1.788e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.8852686 -0.5860994
## sample estimates:
##        cor 
## -0.7761684
  • \(p \ll 0.05\), so the correlation is statistically significant
  • \(cor = -0.77\), so the (negative) correlation is strong

Correlations | Spearman

Used for non-normally distributed data (remember: it measures monotonic, not linear, dependence, so its value is not directly comparable to Pearson's)

cor.test(mtcars$mpg, mtcars$hp, method = "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  mtcars$mpg and mtcars$hp
## S = 10337, p-value = 5.086e-12
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.8946646
  • \(p \ll 0.05\), so the correlation is statistically significant
  • \(\rho = -0.89\), so the (negative) correlation is strong

Correlations | Things to consider

Correlation is not causation! Just because two variables are strongly correlated does not mean that one causes changes to the other. See Spurious Correlations for more.

Correlation does not entail collinearity: datasets with identical correlation coefficients can look very different.

[Figure: Anscombe's quartet]

Checking data for normality

  • Plot a histogram
hist(mtcars$mpg)

  • Shapiro-Wilk test: checks whether a distribution is NOT normal (the null hypothesis is normality; a low p-value rejects it)
shapiro.test(mtcars$mpg)
## 
##  Shapiro-Wilk normality test
## 
## data:  mtcars$mpg
## W = 0.94756, p-value = 0.1229

Checking whether two groups differ

Useful when we want to check whether parts of a dataset exhibit different behaviour with respect to one variable. Example questions:

  • Do 4 and 8 cylinder cars consume the same?
  • Do males and females eat the same amount of vegetables?
  • Do conservatives and liberals use their guns at the same rate?

Process

  1. draw a boxplot to see if you can spot a difference
  2. use a statistical test to check whether the difference is statistically significant
  • use t.test for normally distributed data (a sketch of this path follows the list)
  • use wilcox.test for non-normally distributed data
  3. use an effect size metric (cliffs.d or cohen.d) to examine how pronounced the difference is.
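For the normally distributed case, a hedged sketch with synthetic data (assumes the effsize package for Cohen's d; the slides below walk through the non-normal path):

# Hypothetical example: two roughly normal groups
library(effsize)                        # provides cohen.d()
group.a <- rnorm(50, mean = 10, sd = 2)
group.b <- rnorm(50, mean = 12, sd = 2)
t.test(group.a, group.b)                # is the difference statistically significant?
cohen.d(group.a, group.b)               # how pronounced is it?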

Checking whether two groups differ

boxplot(mpg ~ cyl, mtcars)

shapiro.test(mtcars$mpg)
## 
##  Shapiro-Wilk normality test
## 
## data:  mtcars$mpg
## W = 0.94756, p-value = 0.1229

\(p = 0.12 > 0.05\), so we cannot reject the hypothesis that mpg is normally distributed. To stay on the safe side (the per-group samples are small), we use the non-parametric Wilcoxon test below.

Checking whether two groups differ

four.cyl <- subset(mtcars, cyl == 4)
eight.cyl <- subset(mtcars, cyl == 8)
wilcox.test(four.cyl$mpg, eight.cyl$mpg)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  four.cyl$mpg and eight.cyl$mpg
## W = 154, p-value = 2.775e-05
## alternative hypothesis: true location shift is not equal to 0

\(p < 0.05\), so the difference is statistically significant

library(cliffsd)
cliffs.d(four.cyl$mpg, eight.cyl$mpg)
## [1] 1

Cliff’s \(\delta\) score of 1 (the range is \([-1, 1]\)), so the groups are totally different

Learning from data

Machine learning

Suppose we have the following

  • \(A = \{a_1..a_n\}\), where \(a_n\) is a tuple of continuous and categorical data of length \(len(a_n)\)
  • \(B = \{b_1..b_n\}\), where \(b_n\) is either a continuous or a categorical value
  • a function \(F: A \rightarrow B\).

Then machine learning can be (loosely!) defined as

  • Supervised learning: approximate function \(F\).
  • Unsupervised learning: we do not know \(B\) (and therefore cannot approximate \(F\)). Find patterns in \(A\).
  • Reinforcement learning: \(A\) and \(B\) get updated as we learn \(F\); re-learn \(F\) at every step.

Types of machine learning

  • Classification: \(B\) is a categorical variable. Binary (\(B = \{TRUE, FALSE\}\)) or multiclass (\(card(B) > 2\)).
  • Regression: \(B\) is a continuous variable
  • Clustering: Find similar groups in \(A\)
  • Dimensionality reduction: Map \(A = \{a_1..a_n\}\) to \(C = \{c_1..c_n\}\) where \(len(c_n) < len(a_n)\) (see the sketch below)
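As an illustrative sketch of dimensionality reduction (not from the original slides), principal component analysis on the iris measurements:

# Project the 4 numeric iris measurements onto 2 principal components
pca <- prcomp(iris[, c(1:4)], scale. = TRUE)
reduced <- pca$x[, 1:2]   # each flower is now described by 2 values instead of 4
head(reduced)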

Supervised learning process

  1. Extract 1 set of measurements (features) per case
  2. Construct a model:
    • independent variables (one or more)
    • dependent variable as the variable we want to predict
  3. Split the data in training and testing set (usually 90%-10% split)
    • stratified sampling: split so that groups (e.g. the classes of the dependent variable) keep their proportions in both sets (see the sketch after this list)
    • randomized sampling: split data randomly
  4. Run a learning algorithm to learn \(F\) on the training data
  5. Evaluate \(F\) on test data
  6. Calculate performance metrics, repeat steps 1 – 4 to improve
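A hedged sketch of the two sampling strategies; the stratified variant assumes the caret package, and data/dependent refer to the dataset and dependent variable used in the example that follows:

# Randomized sampling: pick 90% of the row indices at random
train.idx <- sample(nrow(data), size = 0.9 * nrow(data))

# Stratified sampling: preserve the class balance of the dependent variable
# (createDataPartition() comes from the caret package)
library(caret)
train.idx <- createDataPartition(data[, dependent], p = 0.9, list = FALSE)

train.data <- data[train.idx, ]
test.data  <- data[-train.idx, ]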

Supervised learning example

Train a binary logistic regression learner to predict whether a pull request will be merged. Data can be found at gousiosg/pullreqs

# The model to train against
model <- merged ~ team_size + num_commits + files_changed +
  perc_external_contribs + test_churn + num_comments +
  commits_on_files_touched +  test_lines_per_kloc + prev_pullreqs +
  requester_succ_rate + num_participants
dependent <- all.vars(model)[1]

# Random sampling of training data
sample.train <- sample(nrow(data), size = 0.8 * nrow(data))

# Split input dataset to training and testing
train.data <- data[sample.train,]
test.data  <- data[-sample.train,]

# Train a learner on the training data
blr.trained <- glm(model, data = train.data, family="binomial")

Supervised learning | Example

blr.trained
## 
## Call:  glm(formula = model, family = "binomial", data = train.data)
## 
## Coefficients:
##              (Intercept)                 team_size               num_commits  
##                2.1065310                 0.0024397                -0.0097357  
##            files_changed    perc_external_contribs                test_churn  
##               -0.0028880                -0.0079801                 0.2988392  
##             num_comments  commits_on_files_touched       test_lines_per_kloc  
##                0.0010373                 0.0029545                -0.0716948  
##            prev_pullreqs       requester_succ_rate          num_participants  
##                0.0004607                 0.9327851                -0.0710522  
## 
## Degrees of Freedom: 2577 Total (i.e. Null);  2566 Residual
## Null Deviance:       2138 
## Residual Deviance: 2056  AIC: 2080

Supervised learning | Example

Predict with trained model, extract scores

library(ROCR)  # provides prediction() and performance()
predictions <- predict(blr.trained, newdata = test.data)
pred.obj <- prediction(predictions, test.data[, dependent])
# classification.perf.metrics() is a helper defined in the accompanying course code
classification.perf.metrics(pred.obj)
##        auc       tnr       tpr    g.mean     w.acc      prec       rec
## 1 0.686488 0.7263158 0.5381818 0.6252119 0.5852153 0.9192547 0.5381818
##   f.measure       acc
## 1 0.6788991 0.5658915

Supervised learning | Algorithms

  • Logistic regression: Computes coefficients for a linear combination of the input variables, mapped to a class probability through the logistic function
  • Naive Bayes: Assume variables are independent. Compute prior probability of each variable predicting the output class. Compute conditional probability of output class based on input variables.
  • Random Forests: Generate many decision trees using random samples of training data. Compute output class probability based on majority votes from individual trees.
  • Ensemble methods: Combine several models (e.g. trained on different partitions or samples of the data) and aggregate their predictions. A sketch of swapping in a different learner follows.
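As a sketch of how another learner slots into the earlier pull-request example (assumes the randomForest package and the model/train.data objects defined above):

# Train a random forest on the same model formula and training data
library(randomForest)
# For classification, randomForest expects a factor dependent variable
train.data$merged <- as.factor(train.data$merged)
rf.trained <- randomForest(model, data = train.data, ntree = 500)
rf.trained  # prints the out-of-bag error estimate and a confusion matrix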

Supervised learning evaluation | Performance Metrics

The confusion matrix

real/predicted   True   False
True             TP     FN
False            FP     TN

Performance metrics

  • Precision: \(prec = tp / (tp + fp)\)
  • Recall: \(rec = tp / (tp + fn)\)
  • Accuracy: \(acc = (tp + tn)/(fp + fn + tp + tn)\)
  • F-measure: \(f = (2 * prec * rec) / (prec + rec)\) (computed in the sketch below)
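A minimal sketch (illustrative counts, not from the slides) computing these metrics from raw confusion-matrix entries:

# Hypothetical confusion-matrix counts
tp <- 40; fp <- 10; fn <- 5; tn <- 45

prec <- tp / (tp + fp)                   # precision
rec  <- tp / (tp + fn)                   # recall
acc  <- (tp + tn) / (tp + tn + fp + fn)  # accuracy
f    <- (2 * prec * rec) / (prec + rec)  # F-measure
c(precision = prec, recall = rec, accuracy = acc, f.measure = f)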

Supervised learning evaluation | The ROC curve

perf <- performance(pred.obj, measure = "tpr", x.measure = "fpr")
plot(perf, col=rainbow(10))

Supervised learning | A real-life example

see: gist:778f7ecad919a2adb1e4

Unsupervised learning | kmeans

# Create 4 clusters of flowers based on their 4 numeric characteristics
fit <- kmeans(iris[, c(1:4)], 4)

# View clusters on a plot. The two plot dimensions are the result of
# dimensionality reduction.
library(cluster)
clusplot(iris[, c(1:4)], fit$cluster, color = TRUE, shade = TRUE, lines = 0)

Unsupervised learning | Association rules

Given a set of items \(I=\{i_1...i_n\}\) and a set of transactions \(T=\{t_1...t_n\}\) where \(t_n = \langle a,... \rangle, a \in I\), return a set of rules like \(\langle i_1,i_3,... \rangle \Rightarrow \langle i_5 \rangle\)

Rules are evaluated by

  • Support: the % of transactions in \(T\) in which the rule applies
  • Confidence: the probability that the rule's right-hand side holds, given that its left-hand side appears in a new \(t_n\)

Used for:

  • “Customers who bought this also bought…” (Amazon)
  • “You may like this song…” (Spotify)
library(arules)
library(arulesViz)
library(datasets)
data(Groceries)

rules <- apriori(Groceries, parameter = list(supp = 0.001, conf = 0.8))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 6 done [0.02s].
## writing ... [410 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

Unsupervised learning | Association rules

inspect(rules[1:8])
##     lhs                         rhs                support confidence    coverage      lift count
## [1] {liquor,                                                                                     
##      red/blush wine}         => {bottled beer} 0.001931876  0.9047619 0.002135231 11.235269    19
## [2] {curd,                                                                                       
##      cereals}                => {whole milk}   0.001016777  0.9090909 0.001118454  3.557863    10
## [3] {yogurt,                                                                                     
##      cereals}                => {whole milk}   0.001728521  0.8095238 0.002135231  3.168192    17
## [4] {butter,                                                                                     
##      jam}                    => {whole milk}   0.001016777  0.8333333 0.001220132  3.261374    10
## [5] {soups,                                                                                      
##      bottled beer}           => {whole milk}   0.001118454  0.9166667 0.001220132  3.587512    11
## [6] {napkins,                                                                                    
##      house keeping products} => {whole milk}   0.001321810  0.8125000 0.001626843  3.179840    13
## [7] {whipped/sour cream,                                                                         
##      house keeping products} => {whole milk}   0.001220132  0.9230769 0.001321810  3.612599    12
## [8] {pastry,                                                                                     
##      sweet spreads}          => {whole milk}   0.001016777  0.9090909 0.001118454  3.557863    10

Unsupervised learning | Association rules

plot(rules[1:8],method="graph")

Bibliography