We use the following packages:
library(boot)
library(mice)
library(dplyr)
library(magrittr)
library(ggplot2)
I fix the random seed to 123
set.seed(123)
age
and weight
on the boys
data set. Use R=1000
bootstrap samples.corfun <- function(data, i){
data.sample <- data[i,]
stat <- cor(data.sample$age, data.sample$wgt, use = "pairwise.complete.obs")
return(stat)
}
To apply the corfun()
function in the bootstrap function boot()
, we run:
bootstr.cor <-
boys %>%
boot(corfun, R = 1000)
Our object bootstr.cor
contains all the bootstrap information about the correlation parameter
bootstr.cor
object. For example, use function attributes()
to see the listed dimensions that are within the bootstr.cor
object and the class
of the object.attributes(bootstr.cor)
## $names
## [1] "t0" "t" "R" "data" "seed"
## [6] "statistic" "sim" "call" "stype" "strata"
## [11] "weights"
##
## $class
## [1] "boot"
##
## $boot_type
## [1] "boot"
we see that the bootstr.cor
object has class boot
.
head(bootstr.cor$data)
## age hgt wgt bmi hc gen phb tv reg
## 3 0.035 50.1 3.650 14.54 33.7 <NA> <NA> NA south
## 4 0.038 53.5 3.370 11.77 35.0 <NA> <NA> NA south
## 18 0.057 50.0 3.140 12.56 35.2 <NA> <NA> NA south
## 23 0.060 54.5 4.270 14.37 36.7 <NA> <NA> NA south
## 28 0.062 57.5 5.030 15.21 37.3 <NA> <NA> NA south
## 36 0.068 55.5 4.655 15.11 37.0 <NA> <NA> NA south
The most interesting listed dimensions are the following:
$t0
contains the overall bootstrapped estimate for the correlationsbootstr.cor$t0
## [1] 0.9505762
$t
contains the individual estimate for every bootstrapped samplehead(bootstr.cor$t)
## [,1]
## [1,] 0.9479980
## [2,] 0.9471073
## [3,] 0.9486481
## [4,] 0.9532484
## [5,] 0.9550052
## [6,] 0.9505718
$data
has the original data set that has been served to function boot()
:head(bootstr.cor$data)
## age hgt wgt bmi hc gen phb tv reg
## 3 0.035 50.1 3.650 14.54 33.7 <NA> <NA> NA south
## 4 0.038 53.5 3.370 11.77 35.0 <NA> <NA> NA south
## 18 0.057 50.0 3.140 12.56 35.2 <NA> <NA> NA south
## 23 0.060 54.5 4.270 14.37 36.7 <NA> <NA> NA south
## 28 0.062 57.5 5.030 15.21 37.3 <NA> <NA> NA south
## 36 0.068 55.5 4.655 15.11 37.0 <NA> <NA> NA south
$t
).plotdata <- data.frame(t = bootstr.cor$t)
plotdata %>%
ggplot(aes(t)) +
geom_histogram(bins = 20) +
geom_density(col = "orange") +
geom_vline(xintercept = bootstr.cor$t0, col = "orange", linetype = "dotted")
bmi > 25
.Remember that body mass index takes the ratio weight-over-heigth-squared into account and the validity of the interpretation does change for children vs. adults.
We can use function mutate()
to assign values to a new column (variable) ovw
for boys that have a bmi <= 25
(smaller or equal) or bmi > 25
(larger). The if_else()
function can be directly applied to call if_else(condition, true, false)
.
boys2 <- boys %>%
mutate(ovw = if_else(bmi <= 25, "Not Overweight", "Overweight"))
There are now 20 Overweight
boys:
boys2 %$%
table(ovw)
## ovw
## Not Overweight Overweight
## 707 20
ovw
boys over the regions reg
. We can copy the previous function and replace the line for our test-statistic:X2fun <- function(data, i){
data.sample <- data[i,]
tab <- data.sample %$%
table(ovw, reg)
X2 <- suppressWarnings(chisq.test(tab))
return(X2$statistic)
}
I use suppressWarnings()
to suppress the warnings the \(X^2\) test is going to give us, because some of the expected cell-frequencies are below 5. I only return the chi-squared estimate by asking for X2$statistic
. This is because function boot
expects an estimate as return, not a whole list of estimates.
bootstr.X2 <-
boys2 %>%
boot(X2fun, R = 1000)
bootstr.X2
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = ., statistic = X2fun, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 3.650944 3.769159 4.986556
$t
).plotdata <- data.frame(t = bootstr.X2$t)
plotdata %>%
ggplot(aes(t)) +
geom_histogram(bins = 40) +
geom_vline(xintercept = bootstr.X2$t0, col = "orange", linetype = "dotted")
The sampling distribution of the estimate is quite skewed.
lm(wgt ~ age + hgt + I(hgt^2))
The function that would give us the regression estimates for the parameters is e.g. :
lmfun <- function(data, i){
data.sample <- data[i,]
coef <- data.sample %$%
lm(wgt ~ age + hgt + I(hgt^2)) %>%
coef()
return(coef)
}
and if we plug that function into boot()
for R = 1000
bootstrap samples, we obtain
bootstr.lm <-
boys %>%
boot(lmfun, R = 1000)
bootstr.lm
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = ., statistic = lmfun, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 23.853118370 1.386391e-01 1.8572544564
## t2* 0.710865240 8.265404e-04 0.2334275999
## t3* -0.498229914 -2.482243e-03 0.0366745633
## t4* 0.003722907 9.426072e-06 0.0002206243
Again, the average (bootstrapped) regression estimate is:
bootstr.lm$t0
## (Intercept) age hgt I(hgt^2)
## 23.853118370 0.710865240 -0.498229914 0.003722907
est <- bootstr.lm$t
colnames(est) <- (c("Intercept", "wgt", "hgt", "hgt^2"))
reshape2::melt(est) %>%
ggplot(aes(value)) + facet_wrap(~Var2, scales = 'free_x') +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
boot.ci()
to automatically obtain the confidence intervals from our estimates.boot.ci(bootstr.lm)
## Warning in sqrt(tv[, 2L]): NaNs produced
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = bootstr.lm)
##
## Intervals :
## Level Normal Basic Studentized
## 95% (20.07, 27.35 ) (20.05, 27.54 ) (20.09, 27.99 )
##
## Level Percentile BCa
## 95% (20.16, 27.66 ) (19.91, 27.28 )
## Calculations and Intervals on Original Scale
We can also use the quantiles()
for the $t
dimensions. For example, to obtain the cut-offs for .025
and .975
from the empirical distribution for the intercept, use:
quantile(est[, "Intercept"], c(0.025,0.975))
## 2.5% 97.5%
## 20.19144 27.62233
or, for all estimates when using apply()
:
apply(est, 2, function(x) quantile(x, c(0.025,0.975)))
## Intercept wgt hgt hgt^2
## 2.5% 20.19144 0.2511471 -0.5692692 0.003302579
## 97.5% 27.62233 1.1622391 -0.4285475 0.004166733
End of Practical