Statistical Programming with R

We use the following packages:

library(dplyr)    # Data manipulation
library(magrittr) # Pipes
library(ggplot2)  # Plotting suite
library(boot)     # Bootstrapping functions

Running example

  • a country called sumR
  • 3 million adult humans inhabit it
  • 3 cities: Rnhem, AlkmR, and LeeuwRden
  • two known quantities per person: height and shoe size

set.seed(54321)
 
height   <- rnorm(3e6, mean = 180, sd = 7.5)      # normally distributed heights
shoesize <- 35 + 0.02*height + runif(3e6, -1, 1)  # shoe size depends weakly on height
city     <- sample(c("Rnhem", "AlkmR", "LeeuwRden"), size = 3e6, 
                   prob = c(0.6, 0.3, 0.1), replace = TRUE)  # cities of unequal size

sumR <- data.frame(height, shoesize, city)
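
Because we simulated sumR ourselves, we can check the true population parameters directly; a minimal sketch:

# True population parameters, known here only because we generated the data
mean(sumR$height)  # close to 180
sd(sumR$height)    # close to 7.5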

Running example

par(mfrow = c(1,3))
hist(height,   breaks = 80, col = "lightgreen", border = "transparent")
hist(shoesize, breaks = 80, col = "lightblue",  border = "transparent")
table(city) %>% barplot(col = "pink")

(In a real country we would not know these things.)

Let's do statistics!

We randomly sample 100 persons and "measure" their height, shoe size, and city:

samp100 <- sample(3e6, size = 100)
sdat100 <- sumR[samp100, ]
head(sdat100)
##           height shoesize      city
## 2908895 177.8716 39.28683     Rnhem
## 236907  170.2528 38.97513     Rnhem
## 1230916 195.9043 38.27168 LeeuwRden
## 1937234 177.4607 38.39331     Rnhem
## 2782743 172.3269 38.89885     AlkmR
## 2551759 169.6376 39.02313     Rnhem

Let's use this sample to estimate quantities and infer things about our population!

Goal

We want to know the mean height of our population

The sample mean \(\bar{x}\) is an estimator for the population mean \(\mu\)

m <- mean(sdat100$height)
m
## [1] 179.7258

The sampling distribution

The sampling distribution: parametric

This statistic has a sampling distribution: taking a different sample generates a different mean value

Thus we have uncertainty about our estimated mean

The sampling distribution: parametric

  • The sampling distribution of the mean is a normal distribution with parameters:
    • mean \(\mu_\bar{x}\)
    • standard deviation \(\sigma_\bar{x}\)
  • We can estimate these quantities
  • Mean estimator: sample mean \[\hat{\mu}_\bar{x} = \bar{x}\]
  • Standard deviation estimator: the standard error \[\hat{\sigma}_\bar{x} = \frac{s_x}{\sqrt{n}}\]

se <- sd(sdat100$height)/sqrt(100)
se
## [1] 0.6673471

The sampling distribution: parametric

Parametric Confidence Interval

Using the standard error and the assumption of normality, we can create a confidence interval (CI):

(ci <- c(m - 1.96*se, m + 1.96*se))
## [1] 178.4178 181.0338
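
The value 1.96 is the 0.975 quantile of the standard normal, so the same interval can be computed with qnorm(); a minimal sketch (alpha and z are illustrative names) that makes other confidence levels easy to swap in:

alpha <- 0.05
z     <- qnorm(1 - alpha/2)  # 1.959964, the 97.5% normal quantile
c(m - z*se, m + z*se)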

The sampling distribution: parametric

Interim conclusion

  • Any sample statistic has uncertainty due to sampling
  • This uncertainty is captured in the sampling distribution
  • Asymptotically (as the sample size grows), the sampling distribution of the mean approaches a normal distribution (the central limit theorem)
  • The standard error is a measure of uncertainty
  • Using the standard error, we can create a 95% CI

The sampling distribution: algorithmic

Let's take a step back

  • The sampling distribution of a statistic is the distribution of its estimate upon infinitely repeated sampling.
  • Since we have the full population, we can generate the sampling distribution ourselves!
  • (but instead of infinite sampling, we sample 10 000 times)

means <- numeric(10000)

for (i in 1:10000) {
  means[i] <- mean(sumR$height[sample(3e6, size = 100)])
}
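
The standard deviation of these 10 000 simulated means approximates the true standard error \(\sigma/\sqrt{n} = 7.5/\sqrt{100} = 0.75\):

sd(means)  # close to 0.75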

The sampling distribution: algorithmic

Jackknife

In the 1950s and 1960s, computers enabled another measure of uncertainty:

The Jackknife

  • take each person out of your sample once
  • calculate the statistic on the remaining values
  • the result: 100 slightly different means

jknf <- numeric(100)

for (i in 1:100) jknf[i] <- mean(sdat100$height[-i])

Similar to leave-one-out cross-validation (LOOCV)!

Jackknife

The Jackknife can be used to estimate the standard error:

\[\hat{se}_{j} = \sqrt{\frac{n-1}{n}\sum_{i=1}^n\left(\hat{\theta}_{(i)}-\overline{\hat{\theta}_{(\cdot)}}\right)^2}\]

jkse <- sqrt(99/100*sum((jknf - mean(jknf))^2))
## Jackknife estimate: 0.667 
## True standard error: 0.750

With this, we can once again create a sampling distribution and a CI based on normal theory.
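
For example, a minimal sketch of a normal-theory 95% CI built from the jackknife standard error:

# Normal-theory CI around the sample mean, using the jackknife SE
c(m - 1.96*jkse, m + 1.96*jkse)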

Bootstrap

Introduced by Efron in 1979 as computers became more powerful.

  • As many times as you like (e.g. 1000 times)
  • Sample \(n\) values from your sample with replacement
  • Calculate the statistic of interest on this bootstrap sample

btsp <- numeric(1000)

for (i in 1:1000) btsp[i] <- mean(sdat100$height[sample(100, replace = TRUE)])

We sample from our sample: we pull ourselves "up" into the space of the population by our own bootstraps.
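
The standard deviation of the bootstrap replicates estimates the standard error, and their quantiles give a percentile CI; a minimal sketch:

sd(btsp)                         # bootstrap estimate of the standard error
quantile(btsp, c(0.025, 0.975))  # 95% percentile confidence interval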

Bootstrap

Bootstrap approximation

Why?

No more math!

  • The formula for the standard error does not need to be known
  • Relaxes the assumption of normality
  • For some quantities, unreasonably large samples are needed before normality holds (e.g. the product of two regression coefficients in mediation analysis)

Q: What is the sampling distribution of the log mean shoe size?

log(mean(sdat100$shoesize))
## [1] 3.653499

Log mean shoe size
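
A minimal sketch that approximates this sampling distribution by bootstrapping the log mean shoe size (logmeans and idx are illustrative names):

logmeans <- numeric(1000)

for (i in 1:1000) {
  idx         <- sample(100, replace = TRUE)           # resample our 100 persons
  logmeans[i] <- log(mean(sdat100$shoesize[idx]))      # statistic on the bootstrap sample
}

hist(logmeans, breaks = 40, col = "lightblue", border = "transparent")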

Proportion of humans in LeeuwRden
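
The same recipe works for a proportion; a sketch for the sample proportion of people living in LeeuwRden (props is an illustrative name):

props <- numeric(1000)

for (i in 1:1000) {
  idx      <- sample(100, replace = TRUE)
  props[i] <- mean(sdat100$city[idx] == "LeeuwRden")   # bootstrap proportion
}

hist(props, breaks = 40, col = "pink", border = "transparent")

With only 100 persons and a true proportion of 0.1, this distribution is discrete (steps of 1/100) and can be skewed: exactly the situation where the normality assumption is shaky.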

The boot package

mystat <- function(dat, index) {
  # boot() calls this function with the data and a vector of bootstrap indices;
  # the statistic here is the median height divided by the mean shoe size
  median(dat$height[index])/mean(dat$shoesize[index])
}

bootstat <- sdat100 %>% boot(mystat, R = 1000)

# SE
sd(bootstat$t)
## [1] 0.02899687
# 95% CI
bootstat$t %>% quantile(c(0.025, 0.975))
##     2.5%    97.5% 
## 4.596886 4.698511
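
The boot package can also compute intervals for us with boot.ci(); a sketch using the percentile type:

# Percentile bootstrap CI from the same boot object
boot.ci(bootstat, type = "perc")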

The boot package

# Distribution
data.frame(mystat = bootstat$t) %>% 
  ggplot(aes(x = mystat)) +
  geom_density(fill = "lightseagreen") +
  xlim(4.5, 4.8) +
  theme_minimal()

Conclusion

  • Sampling distribution indicates uncertainty
  • Bootstrapping approximates the sampling distribution
  • Treats the sample as the population to resample from
  • Works when the shape of the sampling distribution is unknown
  • The boot package can be used for bootstrapping