Both types can be qualitative or quantitative.
Extract answers to research questions using mathematical, statistical or numerical techniques
Propose an explanation for a phenomenon
Defined in pairs: a null hypothesis (\(H_0\)) and an alternative one (\(H_1\))
A good hypothesis is readily falsifiable
Most statistical tests return a \(p\)-value: the probability of observing data at least as extreme as ours, assuming that \(H_0\) is true.
To interpret a test, we set a threshold (usually 0.05) for \(p\)
If \(p <\) threshold, then the null hypothesis is rejected and the alternative one is accepted
We therefore need to know beforehand what each statistical test does
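A minimal sketch of this workflow, using the built-in mtcars data (the choice of test and grouping variable is just an example):
# Compare mpg between automatic and manual transmission cars
res <- t.test(mpg ~ am, data = mtcars)
res$p.value        # the p-value returned by the test
res$p.value < 0.05 # TRUE means we reject H0 at the 0.05 threshold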
Extract samples of data from a running process. Data types: vector, list, data.frame
Vectors contain items of a single type
a <- c(1:5)
a
## [1] 1 2 3 4 5
length(a) # Length of a vector
## [1] 5
a * 2 # Operations apply on all elements
## [1] 2 4 6 8 10
b <- c(3:9) # The shorter vector is recycled
a * b
## [1] 3 8 15 24 35 8 18
a < 4
## [1] TRUE TRUE TRUE FALSE FALSE
Lists contain items of multiple types, indexed by name or position
a <- list(first=c(1,2), second=4)
a
## $first
## [1] 1 2
##
## $second
## [1] 4
a$first
## [1] 1 2
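List items can be retrieved by name or by position:
a[["second"]] # by name; same as a$second
## [1] 4
a[[1]]        # by position; the first element, c(1, 2)
## [1] 1 2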
data.frame: A table of typed data
mtcars[,c(1:4)]
## mpg cyl disp hp
## Mazda RX4 21.0 6 160.0 110
## Mazda RX4 Wag 21.0 6 160.0 110
## Datsun 710 22.8 4 108.0 93
## Hornet 4 Drive 21.4 6 258.0 110
## Hornet Sportabout 18.7 8 360.0 175
## Valiant 18.1 6 225.0 105
## Duster 360 14.3 8 360.0 245
## Merc 240D 24.4 4 146.7 62
## Merc 230 22.8 4 140.8 95
## Merc 280 19.2 6 167.6 123
## Merc 280C 17.8 6 167.6 123
## Merc 450SE 16.4 8 275.8 180
## Merc 450SL 17.3 8 275.8 180
## Merc 450SLC 15.2 8 275.8 180
## Cadillac Fleetwood 10.4 8 472.0 205
## Lincoln Continental 10.4 8 460.0 215
## Chrysler Imperial 14.7 8 440.0 230
## Fiat 128 32.4 4 78.7 66
## Honda Civic 30.4 4 75.7 52
## Toyota Corolla 33.9 4 71.1 65
## Toyota Corona 21.5 4 120.1 97
## Dodge Challenger 15.5 8 318.0 150
## AMC Javelin 15.2 8 304.0 150
## Camaro Z28 13.3 8 350.0 245
## Pontiac Firebird 19.2 8 400.0 175
## Fiat X1-9 27.3 4 79.0 66
## Porsche 914-2 26.0 4 120.3 91
## Lotus Europa 30.4 4 95.1 113
## Ford Pantera L 15.8 8 351.0 264
## Ferrari Dino 19.7 6 145.0 175
## Maserati Bora 15.0 8 301.0 335
## Volvo 142E 21.4 4 121.0 109
Information about a data frame
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Data frame items by column. Returns a vector of values
mtcars$mpg # or mtcars[, 1]
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
Data frame items by row. Returns a new data.frame
mtcars[c(1,3), ] # First and third row
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.62 16.46 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
Searching in a data frame returns a new data frame
subset(mtcars, cyl == 6 & wt > 2.8)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
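The same selection can also be written with bracket indexing, which is equivalent to the subset call above:
mtcars[mtcars$cyl == 6 & mtcars$wt > 2.8, ]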
R can read tabular data from any data source (incl. databases). By default, it supports CSV.
data <- read.csv("input.csv", sep = ",")
str(data)
Writing data out to CSV files is just as simple
write.csv(data, "data.csv")
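Reading from a database works along the same lines; here is a sketch with the DBI package (the RSQLite backend, the input.db file, and the samples table are illustrative assumptions):
library(DBI)
# Connect to an SQLite file; any DBI backend works the same way
con <- dbConnect(RSQLite::SQLite(), "input.db")
data <- dbGetQuery(con, "SELECT * FROM samples") # returns a data.frame
dbDisconnect(con)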
Probability distribution of 1 variable
hist(mtcars$mpg)
Actual values of 2 variables on a 2D plot
plot(mtcars$mpg, mtcars$qsec)
Summaries of 1 variable grouped by another
boxplot(mpg ~ cyl, mtcars)
Frequencies of 1 group of data
counts <- table(mtcars$gear)
barplot(counts, main="Car Distribution", xlab="Number of Gears", ylab="Number of cars")
Frequencies of >1 groups of data
counts <- table(mtcars$vs, mtcars$gear)
barplot(counts, main="Car Distribution by Gears and VS", xlab="Number of Gears",
col=c("darkblue","red"), legend = rownames(counts), beside=TRUE)
Smoothed distribution (density) of 1 variable, split by group (mtcars1 and defaults are presumably defined in the slides' setup code)
ggplot(mtcars1) + aes(x = mpg, fill = gear) + geom_density() + defaults
Split the visualization into groups based on factors
ggplot(mtcars1) + aes(x = hp, y = mpg, shape=am, color=am) + facet_grid(gear~cyl) +
xlab("Horsepower") + ylab("Miles per Gallon") + geom_point(size = 4) + defaults
The normal distribution is identified by its characteristic ‘bell curve’ histogram
Skewed distributions have histograms that lean to the left or to the right
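A quick way to see these shapes with simulated data (the distributions are arbitrary examples):
hist(rnorm(1000)) # symmetric bell curve
hist(rexp(1000))  # right skewed: long tail to the right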
Correlation measures the statistical dependence between two variables.
Pearson's correlation is used for normally distributed data
cor.test(mtcars$mpg, mtcars$hp, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: mtcars$mpg and mtcars$hp
## t = -6.7424, df = 30, p-value = 1.788e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.8852686 -0.5860994
## sample estimates:
## cor
## -0.7761684
cor is -0.77, so the correlation is strong.
Spearman's correlation is used for non-normally distributed data (remember, it calculates a different thing than Pearson's)
cor.test(mtcars$mpg, mtcars$hp, method = "spearman")
##
## Spearman's rank correlation rho
##
## data: mtcars$mpg and mtcars$hp
## S = 10337, p-value = 5.086e-12
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.8946646
Correlation is not causation! Just because two variables are strongly correlated does not mean that one causes changes in the other. See Spurious Correlations for more.
Correlation does not entail collinearity
Checking whether data is normally distributed:
hist(mtcars$mpg)
shapiro.test(mtcars$mpg)
##
## Shapiro-Wilk normality test
##
## data: mtcars$mpg
## W = 0.94756, p-value = 0.1229
Useful when we want to check whether parts of a dataset exhibit different behaviour with respect to one variable. Example question: do 4-cylinder cars have a different mpg than 8-cylinder cars?
Process:
Check whether the data is normally distributed (e.g., with shapiro.test).
Use t.test for normally distributed data, or wilcox.test for non-normally distributed data.
Compute an effect size (cliffs.d or cohen.d) to examine how pronounced the difference is.
boxplot(mpg ~ cyl, mtcars)
shapiro.test(mtcars$mpg)
##
## Shapiro-Wilk normality test
##
## data: mtcars$mpg
## W = 0.94756, p-value = 0.1229
\(p = 0.12 > 0.05\), so Shapiro-Wilk cannot reject the hypothesis that the data is normally distributed; we nevertheless use the non-parametric test below, which makes fewer assumptions about the small per-group samples.
four.cyl <- subset(mtcars, cyl == 4)
eight.cyl <- subset(mtcars, cyl == 8)
wilcox.test(four.cyl$mpg, eight.cyl$mpg)
##
## Wilcoxon rank sum test with continuity correction
##
## data: four.cyl$mpg and eight.cyl$mpg
## W = 154, p-value = 2.775e-05
## alternative hypothesis: true location shift is not equal to 0
\(p < 0.05\), so the difference is statistically significant
library(cliffsd)
cliffs.d(four.cyl$mpg, eight.cyl$mpg)
## [1] 1
Cliff’s \(\delta\) score is 1 (the range is \([-1, 1]\)), so the groups are totally distinct: all values of one group exceed all values of the other
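If the cliffsd package is unavailable, the effsize package (assuming it is installed) provides an equivalent computation:
library(effsize)
cliff.delta(four.cyl$mpg, eight.cyl$mpg) # reports the estimate and a qualitative magnitude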
Suppose we have the following
Then machine learning can be (loosely!) defined as
Suppose we have two sets
Types of machine learning: supervised (as in the classification example below) and unsupervised (as in the clustering and association rule examples)
Train a binary logistic regression learner to predict whether a pull request will be merged. Data can be found at gousiosg/pullreqs
# The model to train against
model <- merged ~ team_size + num_commits + files_changed +
perc_external_contribs + test_churn + num_comments +
commits_on_files_touched + test_lines_per_kloc + prev_pullreqs +
requester_succ_rate + num_participants
dependent <- all.vars(model)[1]
# Random sampling of training data
sample.train <- sample(nrow(data), size = 0.8 * nrow(data))
# Split input dataset to training and testing
train.data <- data[sample.train,]
test.data <- data[-sample.train,]
# Train a learner on the training data
blr.trained <- glm(model, data = train.data, family="binomial")
blr.trained
##
## Call: glm(formula = model, family = "binomial", data = train.data)
##
## Coefficients:
## (Intercept) team_size num_commits
## 2.1065310 0.0024397 -0.0097357
## files_changed perc_external_contribs test_churn
## -0.0028880 -0.0079801 0.2988392
## num_comments commits_on_files_touched test_lines_per_kloc
## 0.0010373 0.0029545 -0.0716948
## prev_pullreqs requester_succ_rate num_participants
## 0.0004607 0.9327851 -0.0710522
##
## Degrees of Freedom: 2577 Total (i.e. Null); 2566 Residual
## Null Deviance: 2138
## Residual Deviance: 2056 AIC: 2080
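The coefficients of a binomial glm are log-odds; exponentiating them yields odds ratios, which are easier to interpret:
exp(coef(blr.trained)) # e.g., a value of 2 doubles the odds of merging per unit increase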
Predict with trained model, extract scores
library(ROCR) # provides prediction() and performance()
predictions <- predict(blr.trained, newdata = test.data)
pred.obj <- prediction(predictions, test.data[,dependent])
# classification.perf.metrics is a helper presumably defined in the slides' setup code
classification.perf.metrics(pred.obj)
## auc tnr tpr g.mean w.acc prec rec
## 1 0.686488 0.7263158 0.5381818 0.6252119 0.5852153 0.9192547 0.5381818
## f.measure acc
## 1 0.6788991 0.5658915
The confusion matrix
real/predicted | True | False
---|---|---
True | TP | FN
False | FP | TN
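The metrics above can be derived from these four counts; a minimal sketch, assuming the dependent variable is logical or 0/1 and using an (assumed) 0.5 cutoff on the predicted probabilities:
probs <- predict(blr.trained, newdata = test.data, type = "response")
pred <- probs > 0.5
actual <- test.data[, dependent]
tp <- sum(pred & actual);  fp <- sum(pred & !actual)
fn <- sum(!pred & actual); tn <- sum(!pred & !actual)
c(prec = tp / (tp + fp), rec = tp / (tp + fn), acc = (tp + tn) / length(actual))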
Performance metrics
perf <- performance(pred.obj, measure = "tpr", x.measure = "fpr")
plot(perf, col=rainbow(10))
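The area under the ROC curve can also be extracted from ROCR directly:
auc <- performance(pred.obj, measure = "auc")
auc@y.values[[1]] # a single number between 0 and 1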
# Create 4 clusters of flowers based on their characteristics
fit <- kmeans(iris[,c(1:4)], 4)
# View clusters on a plot. Dimensions are the result of dimensionality reduction.
library(cluster)
clusplot(iris[,c(1:4)], fit$cluster, color=TRUE, shade=TRUE, lines=0)
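Since iris also records the true species (a column kmeans never saw), cross-tabulating clusters against species shows how well the clustering recovers them:
table(fit$cluster, iris$Species) # rows are clusters, columns are species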
Given a set of items \(I=\{i_1,\ldots,i_n\}\) and a set of transactions \(T=\{t_1,\ldots,t_m\}\), where each \(t_j = \langle a,\ldots \rangle, a \in I\), return a set of rules like \(\langle i_1,i_3,\ldots \rangle \Rightarrow \langle i_5 \rangle\)
Rules are evaluated by their support, confidence, and lift.
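The standard definitions, treating each transaction as an item set, for a rule \(X \Rightarrow Y\) with \(X, Y \subseteq I\): \(\mathrm{supp}(X) = \frac{|\{t \in T : X \subseteq t\}|}{|T|}\), \(\mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)}\), and \(\mathrm{lift}(X \Rightarrow Y) = \frac{\mathrm{conf}(X \Rightarrow Y)}{\mathrm{supp}(Y)}\).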
Used for, e.g., market basket analysis, as in the grocery data below.
library(arules)
library(arulesViz)
library(datasets)
data(Groceries)
rules <- apriori(Groceries, parameter = list(supp = 0.001, conf = 0.8))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 6 done [0.02s].
## writing ... [410 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(rules[1:8])
## lhs rhs support confidence coverage lift count
## [1] {liquor,
## red/blush wine} => {bottled beer} 0.001931876 0.9047619 0.002135231 11.235269 19
## [2] {curd,
## cereals} => {whole milk} 0.001016777 0.9090909 0.001118454 3.557863 10
## [3] {yogurt,
## cereals} => {whole milk} 0.001728521 0.8095238 0.002135231 3.168192 17
## [4] {butter,
## jam} => {whole milk} 0.001016777 0.8333333 0.001220132 3.261374 10
## [5] {soups,
## bottled beer} => {whole milk} 0.001118454 0.9166667 0.001220132 3.587512 11
## [6] {napkins,
## house keeping products} => {whole milk} 0.001321810 0.8125000 0.001626843 3.179840 13
## [7] {whipped/sour cream,
## house keeping products} => {whole milk} 0.001220132 0.9230769 0.001321810 3.612599 12
## [8] {pastry,
## sweet spreads} => {whole milk} 0.001016777 0.9090909 0.001118454 3.557863 10
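With 410 rules, ranking by a quality measure helps; arules can sort rules before inspection:
inspect(sort(rules, by = "lift")[1:3]) # the three rules with the highest lift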
plot(rules[1:8],method="graph")
This work is (c) 2017 - onwards by TU Delft and Georgios Gousios and licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.