Practical

We use the following packages:

library(boot)
library(mice)
library(dplyr)
library(magrittr)
library(ggplot2)

I fix the random seed to 123

set.seed(123)

Use the following function to calculate the bootstrapped estimate for the correlation between age and weight on the boys data set. Use R=1000 bootstrap samples.

corfun <- function(data, i){
  data.sample <- data[i,]
  stat <- cor(data.sample$age, data.sample$wgt, use = "pairwise.complete.obs")
    return(stat)
}

To apply the corfun() function in the bootstrap function boot(), we run:

bootstr.cor <- 
  boys %>%
  boot(corfun, R = 1000)

Our object bootstr.cor contains all the bootstrap information about the correlation parameter

Explore the contents of the bootstr.cor object. For example, use function attributes() to see the listed dimensions that are within the bootstr.cor object and the class of the object.

attributes(bootstr.cor)

## $names
##  [1] "t0"        "t"         "R"         "data"      "seed"     
##  [6] "statistic" "sim"       "call"      "stype"     "strata"   
## [11] "weights"  
## 
## $class
## [1] "boot"
## 
## $boot_type
## [1] "boot"

we see that the bootstr.cor object has class boot.

head(bootstr.cor$data)

##      age  hgt   wgt   bmi   hc  gen  phb tv   reg
## 3  0.035 50.1 3.650 14.54 33.7 <NA> <NA> NA south
## 4  0.038 53.5 3.370 11.77 35.0 <NA> <NA> NA south
## 18 0.057 50.0 3.140 12.56 35.2 <NA> <NA> NA south
## 23 0.060 54.5 4.270 14.37 36.7 <NA> <NA> NA south
## 28 0.062 57.5 5.030 15.21 37.3 <NA> <NA> NA south
## 36 0.068 55.5 4.655 15.11 37.0 <NA> <NA> NA south

The most interesting listed dimensions are the following:

The $t0 contains the overall bootstrapped estimate for the correlations

bootstr.cor$t0

## [1] 0.9505762

The $t contains the individual estimate for every bootstrapped sample

head(bootstr.cor$t)

##           [,1]
## [1,] 0.9479980
## [2,] 0.9471073
## [3,] 0.9486481
## [4,] 0.9532484
## [5,] 0.9550052
## [6,] 0.9505718

The $data has the original data set that has been served to function boot():

head(bootstr.cor$data)

##      age  hgt   wgt   bmi   hc  gen  phb tv   reg
## 3  0.035 50.1 3.650 14.54 33.7 <NA> <NA> NA south
## 4  0.038 53.5 3.370 11.77 35.0 <NA> <NA> NA south
## 18 0.057 50.0 3.140 12.56 35.2 <NA> <NA> NA south
## 23 0.060 54.5 4.270 14.37 36.7 <NA> <NA> NA south
## 28 0.062 57.5 5.030 15.21 37.3 <NA> <NA> NA south
## 36 0.068 55.5 4.655 15.11 37.0 <NA> <NA> NA south

Plot the histogram of the individual bootstrapped estimates for every bootstrapped sample (i.e. $t).

plotdata <- data.frame(t = bootstr.cor$t)

plotdata %>%
  ggplot(aes(t)) + 
  geom_histogram(bins = 20) + 
  geom_density(col = "orange") + 
  geom_vline(xintercept = bootstr.cor$t0, col = "orange", linetype = "dotted")

Add a column to boys that indicates whether boys are overweight by taking the boundary bmi > 25.

Remember that body mass index takes the ratio weight-over-heigth-squared into account and the validity of the interpretation does change for children vs. adults.

We can use function mutate() to assign values to a new column (variable) ovw for boys that have a bmi <= 25 (smaller or equal) or bmi > 25 (larger). The if_else() function can be directly applied to call if_else(condition, true, false).

boys2 <- boys %>%
    mutate(ovw = if_else(bmi <= 25, "Not Overweight", "Overweight"))

There are now 20 Overweight boys:

boys2 %$%
  table(ovw)

## ovw
## Not Overweight     Overweight 
##            707             20

Bootstrap a $X^2$-test that evaluates the distribution of overweight ovw boys over the regions reg. We can copy the previous function and replace the line for our test-statistic:

X2fun <- function(data, i){
  data.sample <- data[i,]
  tab <- data.sample %$%
    table(ovw, reg) 
  X2 <- suppressWarnings(chisq.test(tab))
    return(X2$statistic)
}

I use suppressWarnings() to suppress the warnings the $X^2$ test is going to give us, because some of the expected cell-frequencies are below 5. I only return the chi-squared estimate by asking for X2$statistic. This is because function boot expects an estimate as return, not a whole list of estimates.

bootstr.X2 <- 
  boys2 %>%
  boot(X2fun, R = 1000)

bootstr.X2

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = ., statistic = X2fun, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original   bias    std. error
## t1* 3.650944 3.769159    4.986556

Plot the histogram of the individual bootstrapped estimates for every bootstrapped sample (i.e. $t).

plotdata <- data.frame(t = bootstr.X2$t)

plotdata %>%
  ggplot(aes(t)) + 
  geom_histogram(bins = 40) + 
  geom_vline(xintercept = bootstr.X2$t0, col = "orange", linetype = "dotted")

The sampling distribution of the estimate is quite skewed.

Do a bootstrap of the regression estimates of the following model:

lm(wgt ~ age + hgt + I(hgt^2))

The function that would give us the regression estimates for the parameters is e.g. :

lmfun <- function(data, i){
  data.sample <- data[i,]
  coef <- data.sample %$%
    lm(wgt ~ age + hgt + I(hgt^2)) %>%
    coef()
    return(coef)
}

and if we plug that function into boot() for R = 1000 bootstrap samples, we obtain

bootstr.lm <- 
  boys %>%
  boot(lmfun, R = 1000)

bootstr.lm

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = ., statistic = lmfun, R = 1000)
## 
## 
## Bootstrap Statistics :
##         original        bias     std. error
## t1* 23.853118370  1.386391e-01 1.8572544564
## t2*  0.710865240  8.265404e-04 0.2334275999
## t3* -0.498229914 -2.482243e-03 0.0366745633
## t4*  0.003722907  9.426072e-06 0.0002206243

Again, the average (bootstrapped) regression estimate is:

bootstr.lm$t0

##  (Intercept)          age          hgt     I(hgt^2) 
## 23.853118370  0.710865240 -0.498229914  0.003722907

** Create a histogram of the individual bootstrapped estimates **

est <- bootstr.lm$t
colnames(est) <- (c("Intercept", "wgt", "hgt", "hgt^2"))

reshape2::melt(est) %>%
  ggplot(aes(value)) + facet_wrap(~Var2, scales = 'free_x') +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Calculate the confidence intervals for We can use function boot.ci() to automatically obtain the confidence intervals from our estimates.

boot.ci(bootstr.lm)

## Warning in sqrt(tv[, 2L]): NaNs produced

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = bootstr.lm)
## 
## Intervals : 
## Level      Normal              Basic             Studentized     
## 95%   (20.07, 27.35 )   (20.05, 27.54 )   (20.09, 27.99 )  
## 
## Level     Percentile            BCa          
## 95%   (20.16, 27.66 )   (19.91, 27.28 )  
## Calculations and Intervals on Original Scale

We can also use the quantiles() for the $t dimensions. For example, to obtain the cut-offs for .025 and .975 from the empirical distribution for the intercept, use:

quantile(est[, "Intercept"], c(0.025,0.975))

##     2.5%    97.5% 
## 20.19144 27.62233

or, for all estimates when using apply():

apply(est, 2, function(x) quantile(x, c(0.025,0.975)))

##       Intercept       wgt        hgt       hgt^2
## 2.5%   20.19144 0.2511471 -0.5692692 0.003302579
## 97.5%  27.62233 1.1622391 -0.4285475 0.004166733

End of Practical

Practical_N

Gerko Vink

Statistical Programming in R