Empirical research methods

Two types: Qualitative and Quantitative

Quantitative research

Extract answers to research questions using mathematical, statistical or numerical techniques

Hypotheses

Propose an explanation for a phenomenon

Defined in pairs: a null hypothesis (H0, e.g. “no effect”) and an alternative hypothesis (H1)

A good hypothesis is readily falsifiable

Hypotheses | p-values
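The p-value is the probability of observing data at least as extreme as ours if the null hypothesis were true; \(p < 0.05\) is the conventional threshold for rejecting it. A minimal sketch, using R's built-in sleep dataset (an assumption for illustration, not part of the original slides):

# H0: the two drugs increase sleep by the same amount; H1: they do not
# A small p-value (< 0.05) would lead us to reject H0
t.test(extra ~ group, data = sleep)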

Measurement

Extract samples of data from a running process. Data types:

R in a nutshell

A brief introduction to R

  • Programming environment tuned to statistics
  • Inspired by Lisp, FP-based
  • Interactive workspace
  • 3 base data types: vector, list, data.frame

Vectors

Vectors contain items of a single type

a <- c(1:5)
a
## [1] 1 2 3 4 5
length(a) # Length of a vector
## [1] 5
a * 2     # Operations apply on all elements
## [1]  2  4  6  8 10

Vectors

b <- c(3:9) # The shorter vector (a) is recycled to match the longer one
a * b
## [1]  3  8 15 24 35  8 18
a < 4
## [1]  TRUE  TRUE  TRUE FALSE FALSE

Lists

Lists contain items of multiple types, indexed by name or position

a <- list(first=c(1,2), second=4)
a
## $first
## [1] 1 2
## 
## $second
## [1] 4
a$first
## [1] 1 2

Data frames

  • data.frame: A table of typed data
    • Rows are measurements
    • Columns are variables

mtcars[,c(1:4)]
##                      mpg cyl  disp  hp
## Mazda RX4           21.0   6 160.0 110
## Mazda RX4 Wag       21.0   6 160.0 110
## Datsun 710          22.8   4 108.0  93
## Hornet 4 Drive      21.4   6 258.0 110
## Hornet Sportabout   18.7   8 360.0 175
## Valiant             18.1   6 225.0 105
## Duster 360          14.3   8 360.0 245
## Merc 240D           24.4   4 146.7  62
## Merc 230            22.8   4 140.8  95
## Merc 280            19.2   6 167.6 123
## Merc 280C           17.8   6 167.6 123
## Merc 450SE          16.4   8 275.8 180
## Merc 450SL          17.3   8 275.8 180
## Merc 450SLC         15.2   8 275.8 180
## Cadillac Fleetwood  10.4   8 472.0 205
## Lincoln Continental 10.4   8 460.0 215
## Chrysler Imperial   14.7   8 440.0 230
## Fiat 128            32.4   4  78.7  66
## Honda Civic         30.4   4  75.7  52
## Toyota Corolla      33.9   4  71.1  65
## Toyota Corona       21.5   4 120.1  97
## Dodge Challenger    15.5   8 318.0 150
## AMC Javelin         15.2   8 304.0 150
## Camaro Z28          13.3   8 350.0 245
## Pontiac Firebird    19.2   8 400.0 175
## Fiat X1-9           27.3   4  79.0  66
## Porsche 914-2       26.0   4 120.3  91
## Lotus Europa        30.4   4  95.1 113
## Ford Pantera L      15.8   8 351.0 264
## Ferrari Dino        19.7   6 145.0 175
## Maserati Bora       15.0   8 301.0 335
## Volvo 142E          21.4   4 121.0 109

Data frames

Information about a data frame

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Data frames | indexing

Accessing data frame items by column returns a vector of values

mtcars$mpg # or mtcars[, 1]
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4

Data frames | indexing

Accessing data frame items by row returns a new data.frame

mtcars[c(1,3), ] # First and third row
##             mpg cyl disp  hp drat   wt  qsec vs am gear carb
## Mazda RX4  21.0   6  160 110 3.90 2.62 16.46  0  1    4    4
## Datsun 710 22.8   4  108  93 3.85 2.32 18.61  1  1    4    1

Data frames | searching

Searching in a data frame returns a new data frame

subset(mtcars, cyl == 6 & wt > 2.8)
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4

Loading and saving data

R can read tabular data from many data sources (incl. databases). Out of the box, it supports CSV.

data <- read.csv("input.csv", sep = ",")
str(data)

Writing data out to a CSV file:

write.csv(data, "data.csv")

Statistical data visualization

Histogram

Probability distribution of 1 variable

hist(mtcars$mpg)

Scatter plot

Actual values of 2 variables on a 2D plot

plot(mtcars$mpg, mtcars$qsec)

Box plot

Summaries of 1 variable grouped by another

boxplot(mpg ~ cyl, mtcars)

Bar plot

Frequency counts of 1 categorical variable

counts <- table(mtcars$gear)
barplot(counts, main="Car Distribution", xlab="Number of Gears", ylab="Number of cars")

Grouped bar plot

Frequency counts of 2 or more categorical variables

counts <- table(mtcars$vs, mtcars$gear)
barplot(counts, main="Car Distribution by Gears and VS", xlab="Number of Gears",
        col=c("darkblue","red"), legend = rownames(counts), beside=TRUE)

Line/Area chart

Density (area) plot of 1 variable, split by a grouping factor

# mtcars1 and defaults are defined elsewhere in the slide code (requires ggplot2)
ggplot(mtcars1) + aes(x = mpg, fill = gear) + geom_density() + defaults

Facets

Split visualization in groups based on factors

# mtcars1 and defaults as before (defined elsewhere in the slide code)
ggplot(mtcars1) + aes(x = hp, y = mpg, shape=am, color=am) + facet_grid(gear~cyl) +
   xlab("Horsepower") + ylab("Miles per Gallon") + geom_point(size = 4) + defaults

Distributions

Distributions | Normal

Identified by the characteristic ‘bell curve’ histogram

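A minimal sketch (synthetic data, not from the original slides) of what the bell-shaped histogram looks like:

# Draw 1000 samples from a standard normal distribution and plot them
set.seed(42)
hist(rnorm(1000))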

Distributions | Non-normal

Histograms are left or right skewed
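A corresponding sketch with synthetic data: exponentially distributed samples produce a right-skewed histogram.

# Right-skewed: most values are small, with a long tail to the right
set.seed(42)
hist(rexp(1000))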

Statistical techniques

Correlations

Correlation measures the statistical dependence between two variables.

  • Pearson correlation examines linear dependencies, i.e. whether in a scatter plot all dots lie on a straight line.
  • Rank correlations (e.g. Spearman's \(\rho\)) examine the extent to which, when one variable increases, the other increases as well (monotonic dependence).

Correlations | Pearson

Used for normally distributed data

cor.test(mtcars$mpg, mtcars$hp, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  mtcars$mpg and mtcars$hp
## t = -6.7424, df = 30, p-value = 1.788e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.8852686 -0.5860994
## sample estimates:
##        cor 
## -0.7761684
  • \(p \ll 0.05\), so the correlation is statistically significant
  • \(cor = -0.77\), so the (negative) correlation is strong

Correlations | Spearman

Used for non-normally distributed data (remember: it measures monotonic, not linear, dependence, so its value is not directly comparable to Pearson's)

cor.test(mtcars$mpg, mtcars$hp, method = "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  mtcars$mpg and mtcars$hp
## S = 10337, p-value = 5.086e-12
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.8946646
  • \(p \ll 0.05\), so the correlation is statistically significant
  • \(\rho = -0.89\), so the (negative) correlation is strong

Correlations | Things to consider

Correlation is not causation! Just because two variables are strongly correlated does not mean that one causes changes to the other. See Spurious Correlations for more.

Correlation does not entail collinearity: datasets with identical correlation coefficients can look very different.

[Figure: Anscombe's quartet]

Checking data for normality

  • Plot a histogram
hist(mtcars$mpg)

  • Shapiro-Wilk test: checks whether a distribution is NOT normal (the null hypothesis is normality; a low p-value rejects it)
shapiro.test(mtcars$mpg)
## 
##  Shapiro-Wilk normality test
## 
## data:  mtcars$mpg
## W = 0.94756, p-value = 0.1229

Checking whether two groups differ

Useful when we want to check whether parts of a dataset exhibit different behaviour with respect to one variable. Example questions:

  • Do 4 and 8 cylinder cars consume the same?
  • Do males and females eat the same amount of vegetables?
  • Do conservatives and liberals use their guns at the same rate?

Process

  1. draw a boxplot to see if you can spot a difference
  2. use a statistical test to check whether the difference is statistically significant
  • use t.test for normally distributed data (a sketch of this path follows the list)
  • use wilcox.test for non-normally distributed data
  3. use an effect size metric (cliffs.d or cohen.d) to examine how pronounced the difference is.
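For the normally distributed case, a hedged sketch with synthetic data (assumes the effsize package for Cohen's d; the slides below walk through the non-normal path):

# Hypothetical example: two roughly normal groups
library(effsize)                        # provides cohen.d()
group.a <- rnorm(50, mean = 10, sd = 2)
group.b <- rnorm(50, mean = 12, sd = 2)
t.test(group.a, group.b)                # is the difference statistically significant?
cohen.d(group.a, group.b)               # how pronounced is it?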

Checking whether two groups differ

boxplot(mpg ~ cyl, mtcars)

shapiro.test(mtcars$mpg)
## 
##  Shapiro-Wilk normality test
## 
## data:  mtcars$mpg
## W = 0.94756, p-value = 0.1229

\(p = 0.12 > 0.05\), so we cannot reject the hypothesis that mpg is normally distributed. To stay on the safe side (the per-group samples are small), we use the non-parametric Wilcoxon test below.

Checking whether two groups differ

four.cyl <- subset(mtcars, cyl == 4)
eight.cyl <- subset(mtcars, cyl == 8)
wilcox.test(four.cyl$mpg, eight.cyl$mpg)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  four.cyl$mpg and eight.cyl$mpg
## W = 154, p-value = 2.775e-05
## alternative hypothesis: true location shift is not equal to 0

\(p < 0.05\), so the difference is statistically significant

library(cliffsd)
cliffs.d(four.cyl$mpg, eight.cyl$mpg)
## [1] 1

Cliff’s \(\delta\) score of 1 (the range is \([-1, 1]\)), so the groups are totally different

Learning from data

Machine learning

Suppose we have the following

  • \(A = \{a_1..a_n\}\), where \(a_n\) is a tuple of continuous and categorical data of length \(len(a_n)\)
  • \(B = \{b_1..b_n\}\), where \(b_n\) is either a continuous or a categorical value
  • a function \(F: A \rightarrow B\).

Then machine learning can be (loosely!) defined as

  • Supervised learning: approximate function \(F\).
  • Unsupervised learning: we do not know \(B\) (and therefore cannot approximate \(F\)). Find patterns in \(A\).
  • Reinforcement learning: \(A\) and \(B\) get updated as we learn \(F\); re-learn \(F\) at every step.

Types of machine learning

  • Classification: \(B\) is a categorical variable. Binary (\(B = \{TRUE, FALSE\}\)) or multiclass (\(card(B) > 2\)).
  • Regression: \(B\) is a continuous variable
  • Clustering: Find similar groups in \(A\)
  • Dimensionality reduction: Map \(A = \{a_1..a_n\}\) to \(C = \{c_1..c_n\}\) where \(len(c_n) < len(a_n)\) (see the sketch below)
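As an illustrative sketch of dimensionality reduction (not from the original slides), principal component analysis on the iris measurements:

# Project the 4 numeric iris measurements onto 2 principal components
pca <- prcomp(iris[, c(1:4)], scale. = TRUE)
reduced <- pca$x[, 1:2]   # each flower is now described by 2 values instead of 4
head(reduced)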

Supervised learning process

  1. Extract 1 set of measurements (features) per case
  2. Construct a model:
    • independent variables (one or more)
    • dependent variable as the variable we want to predict
  3. Split the data in training and testing set (usually 90%-10% split)
    • stratified sampling: split so that groups (e.g. the classes of the dependent variable) keep their proportions in both sets (see the sketch after this list)
    • randomized sampling: split data randomly
  4. Run a learning algorithm to learn \(F\) on the training data
  5. Evaluate \(F\) on test data
  6. Calculate performance metrics, repeat steps 1 – 4 to improve
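A hedged sketch of the two sampling strategies; the stratified variant assumes the caret package, and data/dependent refer to the dataset and dependent variable used in the example that follows:

# Randomized sampling: pick 90% of the row indices at random
train.idx <- sample(nrow(data), size = 0.9 * nrow(data))

# Stratified sampling: preserve the class balance of the dependent variable
# (createDataPartition() comes from the caret package)
library(caret)
train.idx <- createDataPartition(data[, dependent], p = 0.9, list = FALSE)

train.data <- data[train.idx, ]
test.data  <- data[-train.idx, ]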

Supervised learning example

Train a binary logistic regression learner to predict whether a pull request will be merged. Data can be found at gousiosg/pullreqs

# The model to train against
model <- merged ~ team_size + num_commits + files_changed +
  perc_external_contribs + test_churn + num_comments +
  commits_on_files_touched +  test_lines_per_kloc + prev_pullreqs +
  requester_succ_rate + num_participants
dependent <- all.vars(model)[1]

# Random sampling of training data
sample.train <- sample(nrow(data), size = 0.8 * nrow(data))

# Split input dataset to training and testing
train.data <- data[sample.train,]
test.data  <- data[-sample.train,]

# Train a learner on the training data
blr.trained <- glm(model, data = train.data, family="binomial")

Supervised learning | Example

blr.trained
## 
## Call:  glm(formula = model, family = "binomial", data = train.data)
## 
## Coefficients:
##              (Intercept)                 team_size               num_commits  
##                2.1065310                 0.0024397                -0.0097357  
##            files_changed    perc_external_contribs                test_churn  
##               -0.0028880                -0.0079801                 0.2988392  
##             num_comments  commits_on_files_touched       test_lines_per_kloc  
##                0.0010373                 0.0029545                -0.0716948  
##            prev_pullreqs       requester_succ_rate          num_participants  
##                0.0004607                 0.9327851                -0.0710522  
## 
## Degrees of Freedom: 2577 Total (i.e. Null);  2566 Residual
## Null Deviance:       2138 
## Residual Deviance: 2056  AIC: 2080

Supervised learning | Example

Predict with trained model, extract scores

library(ROCR)  # provides prediction() and performance()
predictions <- predict(blr.trained, newdata = test.data)
pred.obj <- prediction(predictions, test.data[, dependent])
# classification.perf.metrics() is a helper defined in the accompanying course code
classification.perf.metrics(pred.obj)
##        auc       tnr       tpr    g.mean     w.acc      prec       rec
## 1 0.686488 0.7263158 0.5381818 0.6252119 0.5852153 0.9192547 0.5381818
##   f.measure       acc
## 1 0.6788991 0.5658915

Supervised learning | Algorithms

  • Logistic regression: Computes coefficients for a linear combination of the input variables, mapped to a class probability through the logistic function
  • Naive Bayes: Assume variables are independent. Compute prior probability of each variable predicting the output class. Compute conditional probability of output class based on input variables.
  • Random Forests: Generate many decision trees using random samples of training data. Compute output class probability based on majority votes from individual trees.
  • Ensemble methods: Combine several models (e.g. trained on different partitions or samples of the data) and aggregate their predictions. A sketch of swapping in a different learner follows.
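As a sketch of how another learner slots into the earlier pull-request example (assumes the randomForest package and the model/train.data objects defined above):

# Train a random forest on the same model formula and training data
library(randomForest)
# For classification, randomForest expects a factor dependent variable
train.data$merged <- as.factor(train.data$merged)
rf.trained <- randomForest(model, data = train.data, ntree = 500)
rf.trained  # prints the out-of-bag error estimate and a confusion matrix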

Supervised learning evaluation | Performance Metrics

The confusion matrix

real/predicted   True   False
True             TP     FN
False            FP     TN

Performance metrics

  • Precision: \(prec = tp / (tp + fp)\)
  • Recall: \(rec = tp / (tp + fn)\)
  • Accuracy: \(acc = (tp + tn)/(fp + fn + tp + tn)\)
  • F-measure: \(f = (2 * prec * rec) / (prec + rec)\) (computed in the sketch below)
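A minimal sketch (illustrative counts, not from the slides) computing these metrics from raw confusion-matrix entries:

# Hypothetical confusion-matrix counts
tp <- 40; fp <- 10; fn <- 5; tn <- 45

prec <- tp / (tp + fp)                   # precision
rec  <- tp / (tp + fn)                   # recall
acc  <- (tp + tn) / (tp + tn + fp + fn)  # accuracy
f    <- (2 * prec * rec) / (prec + rec)  # F-measure
c(precision = prec, recall = rec, accuracy = acc, f.measure = f)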

Supervised learning evaluation | The ROC curve

perf <- performance(pred.obj, measure = "tpr", x.measure = "fpr")
plot(perf, col=rainbow(10))

Supervised learning | A real-life example

see: gist:778f7ecad919a2adb1e4

Unsupervised learning | kmeans

# Create 4 clusters of flowers based on their 4 numeric characteristics
fit <- kmeans(iris[, c(1:4)], 4)

# View clusters on a plot. The two plot dimensions are the result of
# dimensionality reduction.
library(cluster)
clusplot(iris[, c(1:4)], fit$cluster, color = TRUE, shade = TRUE, lines = 0)

Unsupervised learning | Association rules

Given a set of items \(I=\{i_1...i_n\}\) and a set of transactions \(T=\{t_1...t_n\}\) where \(t_n = \langle a,... \rangle, a \in I\), return a set of rules like \(\langle i_1,i_3,... \rangle \Rightarrow \langle i_5 \rangle\)

Rules are evaluated by

  • Support: the % of transactions in \(T\) in which the rule applies
  • Confidence: the probability that the rule's right-hand side holds, given that its left-hand side appears in a new \(t_n\)

Used for:

  • “Customers who bought this also bought…” (Amazon)
  • “You may like this song…” (Spotify)
library(arules)
library(arulesViz)
library(datasets)
data(Groceries)

rules <- apriori(Groceries, parameter = list(supp = 0.001, conf = 0.8))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 6 done [0.02s].
## writing ... [410 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

Unsupervised learning | Association rules

inspect(rules[1:8])
##     lhs                         rhs                support confidence    coverage      lift count
## [1] {liquor,                                                                                     
##      red/blush wine}         => {bottled beer} 0.001931876  0.9047619 0.002135231 11.235269    19
## [2] {curd,                                                                                       
##      cereals}                => {whole milk}   0.001016777  0.9090909 0.001118454  3.557863    10
## [3] {yogurt,                                                                                     
##      cereals}                => {whole milk}   0.001728521  0.8095238 0.002135231  3.168192    17
## [4] {butter,                                                                                     
##      jam}                    => {whole milk}   0.001016777  0.8333333 0.001220132  3.261374    10
## [5] {soups,                                                                                      
##      bottled beer}           => {whole milk}   0.001118454  0.9166667 0.001220132  3.587512    11
## [6] {napkins,                                                                                    
##      house keeping products} => {whole milk}   0.001321810  0.8125000 0.001626843  3.179840    13
## [7] {whipped/sour cream,                                                                         
##      house keeping products} => {whole milk}   0.001220132  0.9230769 0.001321810  3.612599    12
## [8] {pastry,                                                                                     
##      sweet spreads}          => {whole milk}   0.001016777  0.9090909 0.001118454  3.557863    10

Unsupervised learning | Association rules

plot(rules[1:8],method="graph")

Bibliography