Statistical Programming with R

Packages and functions that we use

library(dplyr)    # Data manipulation
library(magrittr) # Pipes
library(ggplot2)  # Plotting suite
library(MASS)     # Dataset
library(class)    # K-nearest Neighbour
library(mvtnorm)  # Multivariate Normal tools

Custom theme for plots

helpIAmColourblind <- scale_color_manual(values = c("orange", 
                                                    "blue", 
                                                    "dark green"))

What is statistical learning?

Statistics is everywhere

Several questions involving statistics:

  1. What is the relation between \(X\) and \(Y\)? (estimation)
  2. What is the uncertainty around this effect? (estimation/inference)
  3. What can I conclude about my population? (inference/hypothesis testing)
  4. How can I best predict new observations? (prediction)
  5. How can I show relevant patterns in my data? (dimension reduction / pattern recognition)

Examples

  • Radiologists use statistics to sharpen their images (e.g. MRI) and improve their diagnoses.
  • Doctors use statistics to target your treatment to your symptoms or body.
  • Physicists use statistics to find useful patterns in the huge data dumps produced by the Large Hadron Collider.
  • Insurance companies use statistics to model risks for potential clients.
  • Google uses statistics to serve you targeted ads.
  • Netflix uses statistics to create hit shows.
  • Spotify uses statistics to suggest music to you.

Supervised learning

In supervised learning we aim to quantify the relation between \(Y\) and \(X\).

\(Y\):

  • target
  • outcome
  • dependent
  • response

\(X\):

  • features
  • predictors
  • independent
  • input

Supervised learning

We want to find the predictive function

\[Y = f(X) + \epsilon \]

that minimizes \(\epsilon\) with respect to our goal.

  • Function \(f\) is an unknown, but fixed function of \(X = X_1, \dots, X_p\)
  • \(p\) is the number of predictors
  • \(Y\) is the quantitative response
  • \(\epsilon \sim N(0, \sigma_\epsilon^2)\) is a random error term
  • \(\epsilon\) does not depend on \(X\)

Our aim is to find the \(f(X)\) that best represents the systematic information that \(X\) yields about \(Y\).

Supervised learning

With supervised learning, every observation on our predictors

\[{\bf x_i}, \quad i=1, \dots, n\]

has a corresponding outcome measurement \(y_i\), such that

\[\hat{y_i}=f({\bf x_i})\quad \text{and} \quad y_i = f({\bf x_i})+\epsilon_i.\]

Examples:

  • linear regression
  • logistic regression
  • k-nearest neighbours classification
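
As a tiny illustration of the first example, we can simulate data from a known \(f\) and recover it with lm(). This is a minimal sketch with an arbitrary intercept 2 and slope 3, not part of the lecture data:

x <- rnorm(100)              # one predictor
y <- 2 + 3 * x + rnorm(100)  # y = f(x) + epsilon, with f(x) = 2 + 3x
coef(lm(y ~ x))              # the estimates should be close to 2 and 3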

Unsupervised learning

With unsupervised learning we have a vector of measurements \(\bf x_i\) for every unit \(i=1, \dots, n\), but the associated response \(y_i\) is missing.

  1. There is no outcome to predict

    • Hence you cannot fit e.g. a linear regression model
  2. There is no outcome to verify the model

    • We lack the truth to supervise our analysis

What can we do?

Find patterns in \(\bf x_1, \dots, x_n\)

We can use these patterns to, e.g., find out whether some cases are more similar than others, or which variables explain most of the variation.

Examples:

  • Principal Components Analysis
  • k-means clustering

K-means and K-nearest neighbours

Two nonparametric algorithms

K-nearest neighbours (KNN)

  • supervised learning
  • prediction
  • classification

K-means clustering (kmeans)

  • unsupervised learning
  • dimension reduction / pattern recognition
  • clustering

Example dataset

Let's create some data from a multivariate normal distribution

We start with fixing the random seed

set.seed(123)

and specifying the variance covariance matrix:

sigma <- matrix(c(1, .5, .5, 1), 2, 2)
rownames(sigma) <- colnames(sigma) <- c("x1", "x2")

Data relations

sigma
##     x1  x2
## x1 1.0 0.5
## x2 0.5 1.0

Because the variances are 1, the resulting data will have a correlation of \[\rho = \frac{\text{cov}(x_1, x_2)}{\sigma_{x_1}\sigma_{x_2}} = \frac{.5}{1\times1} = .5.\]
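
We can check this directly with the base R function cov2cor(), which converts a covariance matrix into the corresponding correlation matrix:

cov2cor(sigma)
##     x1  x2
## x1 1.0 0.5
## x2 0.5 1.0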

Let's draw the data

sim.data <- mvtnorm::rmvnorm(n     = 100, 
                             mean  = c(5, 5), 
                             sigma = sigma)
colnames(sim.data) <- c("x1", "x2")

Plot the data

sim.data %>% 
  as_tibble %>%
  ggplot(aes(x1, x2)) +
  geom_point()

Now add some clustering

sim.data <- 
  sim.data %>%
  as_tibble %>%
  mutate(class = sample(c("A", "B", "C"), size = 100, replace = TRUE))

We have added a new column that randomly assigns rows to class A, B, or C

sim.data %>% head
## # A tibble: 6 x 3
##      x1    x2 class
##   <dbl> <dbl> <chr>
## 1  4.40  4.63 C    
## 2  6.52  5.47 A    
## 3  5.57  6.69 C    
## 4  5.12  3.90 B    
## 5  4.22  4.39 B    
## 6  6.28  5.66 B

Plot the data again

sim.data %>%
  ggplot(aes(x1, x2,  colour = class)) +
  geom_point() + 
  helpIAmColourblind

Adjust the clusters to make them distinct

sim.data <- 
  sim.data %>%
  mutate(x2 = case_when(class == "A" ~ x2 + 1.5,
                        class == "B" ~ x2 - 1.5,
                        class == "C" ~ x2 + 1.5),
         x1 = case_when(class == "A" ~ x1 - 1.5,
                        class == "B" ~ x1 - 0,
                        class == "C" ~ x1 + 1.5))

The result: supervised

sim.data %>%
  ggplot(aes(x1, x2,  colour = class)) +
  geom_point() + 
  helpIAmColourblind

The result: unsupervised

sim.data %>%
  ggplot(aes(x1, x2)) +
  geom_point()

K-nearest neighbours

How does it work?

  1. For every test observation \(x_0\), the \(K\) points in the training data that are closest to \(x_0\) are identified.
  2. These closest points form the set \(\mathcal{N}_0\).
  3. We estimate the probability that \(x_0\) belongs to class \(j\) as the fraction of points in \(\mathcal{N}_0\) whose response equals \(j\): \[P(Y = j | X = x_0) = \frac{1}{K}\sum_{i\in\mathcal{N}_0}I(y_i=j)\]
  4. The observation \(x_0\) is classified into the class with the largest estimated probability (this is the Bayes rule applied to these estimates).
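
A minimal from-scratch sketch of this vote for a single test point, assuming a numeric training matrix train, a vector of class labels cl, and one test observation x0 given as a numeric vector (hypothetical names; the class::knn() function used later does this for every test row):

knn_vote <- function(x0, train, cl, K = 3) {
  dists <- sqrt(rowSums(sweep(as.matrix(train), 2, as.numeric(x0))^2)) # distances to x0
  N0    <- order(dists)[1:K]      # indices of the K nearest neighbours
  probs <- table(cl[N0]) / K      # estimated P(Y = j | X = x0) per class j
  names(which.max(probs))         # the class with the largest probability
}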

In short

An observation is assigned the class to which most of its \(K\) nearest neighbours belong.

Why KNN?

Because \(x_0\) is simply assigned to the class to which most of its neighbouring observations belong, KNN is

  • non-parametric

  • no assumptions about the distributions, or the shape of the decision boundary

  • expected to be far better than logistic regression when decision boundaries are non-linear

However, we do not get model parameters as we would with e.g. LDA or regression.

  • We thus cannot determine the relative importance of predictors
  • The "model" == the existing observations: instance-based learning

Fitting a K-NN model

First we need to determine a training set

set.seed(123)
sim.data %<>% 
  mutate(set = sample(c("Train", "Test"), size=100, prob = c(.25, .75), replace=TRUE))
sim.data
## # A tibble: 100 x 4
##       x1    x2 class set  
##    <dbl> <dbl> <chr> <chr>
##  1  5.90  6.13 C     Test 
##  2  5.02  6.97 A     Train
##  3  7.07  8.19 C     Test 
##  4  5.12  2.40 B     Train
##  5  4.22  2.89 B     Train
##  6  6.28  4.16 B     Test 
##  7  6.92  6.71 C     Test 
##  8  3.43  8.08 A     Train
##  9  4.97  1.73 B     Test 
## 10  7.06  6.22 C     Test 
## # ... with 90 more rows

Fitting a K-NN model

Then we split the data into a training set (to build the model) and a test set (to verify the model)

train.data <- subset(sim.data, set == "Train", select = c(x1, x2))
test.data <-  subset(sim.data, set == "Test",  select = c(x1, x2))
obs.class <-  subset(sim.data, set == "Train", select = class)

Now we can fit the K-NN model

fit.knn <- knn(train = train.data,
               test = test.data, 
               cl = as.matrix(obs.class),
               k = 3)

Predictions

class.test <- subset(sim.data, set == "Test", select = class) %>%
  as.matrix()
correct <- fit.knn == class.test
mean(correct)
## [1] 0.8767123
table(fit.knn, class.test)
##        class.test
## fit.knn  A  B  C
##       A 25  1  3
##       B  0 15  0
##       C  0  5 24

The (in)correct responses for K = 3

cbind(test.data, correct) %>%
  ggplot(aes(x1, x2,  colour = correct)) +
  geom_point() +
  scale_colour_manual(values = c("red", "black"))

Fewer neighbours

fit.knn <- knn(train = train.data,
               test = test.data, 
               cl = as.matrix(obs.class),
               k = 2)
correct <- fit.knn == class.test
mean(correct)
## [1] 0.8767123
table(fit.knn, class.test)
##        class.test
## fit.knn  A  B  C
##       A 24  1  3
##       B  0 17  1
##       C  1  3 23

More neighbours

fit.knn <- knn(train = train.data,
               test = test.data, 
               cl = as.matrix(obs.class),
               k = 4)
correct <- fit.knn == class.test
mean(correct)
## [1] 0.8493151
table(fit.knn, class.test)
##        class.test
## fit.knn  A  B  C
##       A 25  1  4
##       B  0 15  1
##       C  0  5 22

Even more neighbours

fit.knn <- knn(train = train.data,
               test = test.data, 
               cl = as.matrix(obs.class),
               k = 10)
correct <- fit.knn == class.test
mean(correct)
## [1] 0.9178082
table(fit.knn, class.test)
##        class.test
## fit.knn  A  B  C
##       A 25  2  2
##       B  0 18  1
##       C  0  1 24

The (in)correct responses for K = 10

cbind(test.data, correct) %>%
  ggplot(aes(x1, x2,  colour = correct)) +
  geom_point() +
  scale_colour_manual(values = c("red", "black"))

Predicting a new observation

Let's make a new observation:

newObs <- data.frame(x1 = 5.5, x2 = 4.5)

Predicting a new observation
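
This slide presumably showed where the new observation falls relative to the existing data; a minimal sketch to reproduce such a plot:

sim.data %>%
  ggplot(aes(x1, x2, colour = class)) +
  geom_point() +
  geom_point(data = newObs, aes(x1, x2), inherit.aes = FALSE,
             shape = 17, size = 4) + # the new observation as a black triangle
  helpIAmColourblind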

Predicting a new observation

Now we predict the class of this new observation, using the entire dataset to train our model

knn(train = sim.data[, 1:2], cl = sim.data$class, k = 10, test = newObs)
## [1] B
## Levels: A B C

K-means clustering

Remember: unsupervised

sim.data %>%
  ggplot(aes(x1, x2)) +
  geom_point()

Goal: finding clusters in our data

K-means clustering partitions our dataset into \(K\) distinct, non-overlapping clusters or subgroups.

What is a cluster?

A set of relatively similar observations.

What is "relatively similar"?

This is up to the programmer/researcher to decide. For example, we can require the within-cluster variance to be as small as possible and the between-cluster variance to be as large as possible (we inspect these quantities after fitting the model below).

Why perform clustering?

We expect clusters in our data, but weren't able to measure them

  • potential new subtypes of cancer tissue

We want to summarise features into a categorical variable to use in further decisions/analysis

  • subgrouping people by their spending types

The k-means algorithm

  1. Randomly assign each observation to one of the \(K\) classes
  2. Calculate the centroid (the column means) of each class
  3. Reassign each observation to the class of its closest centroid
  4. If any assignments changed, go to step 2; otherwise stop (a from-scratch sketch follows the reference below)

Source: James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112). New York: Springer.
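
A minimal from-scratch sketch of these four steps, assuming a numeric matrix X and a number of clusters K; it ignores edge cases such as empty clusters, which kmeans() handles for us:

my_kmeans <- function(X, K, max.iter = 100) {
  X        <- as.matrix(X)
  clusters <- sample(1:K, nrow(X), replace = TRUE)                  # 1. random initial classes
  for (i in 1:max.iter) {
    centroids <- rowsum(X, clusters) / as.vector(table(clusters))   # 2. centroid (colMeans) per class
    dists     <- as.matrix(dist(rbind(centroids, X)))[-(1:K), 1:K]  # distance of each point to each centroid
    new.clust <- apply(dists, 1, which.min)                         # 3. closest centroid class
    if (all(new.clust == clusters)) break                           # 4. stop once nothing changes
    clusters  <- new.clust
  }
  clusters
}

Up to label switching and local optima, this should give roughly the same partition as the kmeans() call on the next slide.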

The k-means algorithm

K is a tuning parameter (the centers argument)

(fitkm <- kmeans(sim.data[, 1:2], centers = 3))
## K-means clustering with 3 clusters of sizes 26, 35, 39
## 
## Cluster means:
##         x1       x2
## 1 4.996311 3.279770
## 2 6.513104 6.611511
## 3 3.451391 6.416842
## 
## Clustering vector:
##   [1] 2 2 2 1 1 1 2 3 1 2 3 2 3 1 3 3 2 3 1 3 1 3 1 3 2 1 3 1 3 2 2 1 3 2 2
##  [36] 1 3 1 3 3 3 1 2 3 2 2 2 3 2 2 1 1 3 3 3 2 3 2 3 3 2 1 1 3 1 2 3 3 3 2
##  [71] 3 1 1 2 1 1 1 3 3 2 3 2 3 2 2 3 3 2 1 2 2 3 1 2 3 2 3 3 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 36.78489 53.98274 52.36237
##  (between_SS / total_SS =  72.3 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
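
The within- and between-cluster variation mentioned earlier is stored in the fitted object, for instance:

fitkm$tot.withinss            # total within-cluster sum of squares
fitkm$betweenss / fitkm$totss # the between_SS / total_SS ratio printed above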

The result:

sim.data$clust <- as.factor(fitkm$cluster)

sim.data %>% ggplot +
  geom_point(aes(x = x1, y = x2, colour = clust)) +
  helpIAmColourblind

Comparison

# this is the data-generating class

sim.data %>% ggplot +
  geom_point(aes(x = x1, y = x2, colour = class)) +
  helpIAmColourblind

Centroids

sim.data %>% ggplot +
  geom_point(aes(x = x1, y = x2, colour = clust)) +
  geom_point(aes(x = x1, y = x2), data = as.data.frame(fitkm$centers),
             size = 5, col = "red", alpha = 0.8) +
  helpIAmColourblind

K = 5

K = 2
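
These two slides presumably repeat the plot above with a different number of centers; a minimal sketch for K = 5 (clust5 is a hypothetical column name; use centers = 2 for the K = 2 version). The helpIAmColourblind scale defines only three colours, so it is left out here:

sim.data$clust5 <- as.factor(kmeans(sim.data[, 1:2], centers = 5)$cluster)

sim.data %>% ggplot +
  geom_point(aes(x = x1, y = x2, colour = clust5))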

Conclusion

  • Supervised learning: outcome / target available
  • Unsupervised learning: no outcome / target
  • prediction & pattern recognition vs. estimation, inference, testing
  • knn: nonparametric classification
  • kmeans: clustering algorithm

  • With the power of R we can generate any data we want and know the 'truth'!