Statistical Programming in R

I use the following package in this lecture

library(MASS)

Statistical models

Statistical model

The mathematical formulation of the relationship between variables can be written as

\[ \mbox{observed}=\mbox{predicted}+\mbox{error} \]

or, in Greek notation, \[y=\mu+\varepsilon\]

where

  • \(\mu\) (mean) is the part of \(Y\) that is explained by the model
  • \(\varepsilon\) (residual) is the part of \(Y\) that is not explained by the model

A simple example

Regression model:

  • Model individual age from weight

\[ \text{age}_i=\alpha+\beta\cdot{\text{weight}}_i+\varepsilon_i \]

where

  • \(\alpha+\beta{x}_i\) is the mean of age, \(y\), conditional on weight, \(x\).
  • \(\varepsilon_i\) is random variation (simulated in the sketch below)
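
A minimal simulation sketch of this model, with hypothetical values \(\alpha=0\) and \(\beta=0.25\) (the lm() function used here is introduced on the next slide):

set.seed(123)                                  # for reproducibility
weight <- runif(100, min = 3, max = 90)        # hypothetical weights
age <- 0 + 0.25 * weight + rnorm(100, sd = 2)  # mean part plus random error
coef(lm(age ~ weight))                         # estimates close to 0 and 0.25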

The linear model

The function lm() is a base R function that allows you to fit a variety of linear models.

args(lm)
## function (formula, data, subset, weights, na.action, method = "qr", 
##     model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, 
##     contrasts = NULL, offset, ...) 
## NULL

If we want to know what these arguments do we can ask R:

?lm

This will open a help page on the lm() function.

Continuous predictors

To obtain a linear model with a main effect for wgt, we formulate \(age\sim wgt\)

library(mice)
fit <- with(boys, lm(age ~ wgt))
fit
## 
## Call:
## lm(formula = age ~ wgt)
## 
## Coefficients:
## (Intercept)          wgt  
##     -0.1924       0.2517
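
The fitted model can be used for prediction; for example, for two hypothetical weights (in kg):

predict(fit, newdata = data.frame(wgt = c(20, 40)))  # predicted ages at 20 and 40 kg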

Continuous predictors: more detail

summary(fit)
## 
## Call:
## lm(formula = age ~ wgt)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.0907  -1.1686  -0.5342   1.4286   5.5143 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.192377   0.136878  -1.405     0.16    
## wgt          0.251673   0.003018  83.395   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.142 on 742 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.9036, Adjusted R-squared:  0.9035 
## F-statistic:  6955 on 1 and 742 DF,  p-value: < 2.2e-16

Continuous predictors

To obtain a linear model with just main effects for wgt and hgt, we formulate \(age\sim wgt + hgt\)

require(mice)
fit <- with(boys, lm(age ~ wgt + hgt))
fit
## 
## Call:
## lm(formula = age ~ wgt + hgt)
## 
## Coefficients:
## (Intercept)          wgt          hgt  
##    -7.45952      0.07138      0.10660

Continuous predictors: more detail

summary(fit)
## 
## Call:
## lm(formula = age ~ wgt + hgt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1627 -0.8836 -0.1656  0.8765  4.4365 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.459525   0.246782  -30.23   <2e-16 ***
## wgt          0.071376   0.005949   12.00   <2e-16 ***
## hgt          0.106602   0.003336   31.96   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.391 on 724 degrees of freedom
##   (21 observations deleted due to missingness)
## Multiple R-squared:  0.9592, Adjusted R-squared:  0.9591 
## F-statistic:  8510 on 2 and 724 DF,  p-value: < 2.2e-16

Continuous predictors: interaction effects

To predict \(age\) from \(wgt\), \(hgt\) and their interaction, we formulate \(age\sim wgt * hgt\). In R's formula syntax, wgt * hgt expands to wgt + hgt + wgt:hgt, so the main effects are included automatically.

fit <- with(boys, lm(age ~ wgt * hgt))
fit
## 
## Call:
## lm(formula = age ~ wgt * hgt)
## 
## Coefficients:
## (Intercept)          wgt          hgt      wgt:hgt  
##  -7.2833480   -0.0196161    0.1120325    0.0004142
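
Because of this expansion, spelling the terms out gives the same fit; a quick equivalence check:

fit2 <- with(boys, lm(age ~ wgt + hgt + wgt:hgt))  # explicit main effects + interaction
coef(fit2)  # identical coefficients to the fit above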

and with more detail

summary(fit)
## 
## Call:
## lm(formula = age ~ wgt * hgt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2067 -0.8603 -0.1419  0.8452  4.7526 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.2833480  0.2510218 -29.015  < 2e-16 ***
## wgt         -0.0196161  0.0284921  -0.688  0.49137    
## hgt          0.1120325  0.0037076  30.217  < 2e-16 ***
## wgt:hgt      0.0004142  0.0001269   3.265  0.00115 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.382 on 723 degrees of freedom
##   (21 observations deleted due to missingness)
## Multiple R-squared:  0.9598, Adjusted R-squared:  0.9596 
## F-statistic:  5752 on 3 and 723 DF,  p-value: < 2.2e-16

Categorical predictors in the linear model

If a categorical variable is entered into the function lm(), R automatically converts it into a set of dummy variables, taking the first factor level as the reference category. If we want another reference category, we can change it with the function relevel().
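
For example, to make west the reference level instead (a sketch on a copy of the data, so the output below keeps the default reference; reg has the levels north, east, west, south and city):

boys2 <- boys                                  # work on a copy of the data
boys2$reg <- relevel(boys2$reg, ref = "west")  # "west" becomes the reference level

Below we keep the default reference (north):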

fit <- with(boys, lm(age ~ reg))

and again with more detail

summary(fit)
## 
## Call:
## lm(formula = age ~ reg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.805  -7.376   1.339   5.955  11.935 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.8984     0.7595  15.666  < 2e-16 ***
## regeast      -2.6568     0.9312  -2.853 0.004449 ** 
## regwest      -2.9008     0.8788  -3.301 0.001011 ** 
## regsouth     -3.3074     0.9064  -3.649 0.000282 ***
## regcity      -3.6266     1.1031  -3.288 0.001058 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.836 on 740 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.02077,    Adjusted R-squared:  0.01548 
## F-statistic: 3.925 on 4 and 740 DF,  p-value: 0.003678

Components of the linear model

names(fit) # the names of the list with output objects
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "na.action"     "contrasts"     "xlevels"       "call"         
## [13] "terms"         "model"
fit$coef  # show the estimated coefficients ($coef partially matches $coefficients)
## (Intercept)     regeast     regwest    regsouth     regcity 
##   11.898420   -2.656786   -2.900792   -3.307388   -3.626625
coef(fit) # alternative
## (Intercept)     regeast     regwest    regsouth     regcity 
##   11.898420   -2.656786   -2.900792   -3.307388   -3.626625
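
Other extractor functions work on the fitted object in the same way:

head(residuals(fit))  # the first few residuals
head(fitted(fit))     # the first few fitted values
confint(fit)          # 95% confidence intervals for the coefficients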

Distributions

Distributions and models for the random component

  • Normal distributions

    • Linear model
    • error distribution is normal
  • Discrete distributions (counts)

    • Bernoulli, Binomial or Poisson model
    • error distribution is not normal
  • Other continuous distributions

    • Uniform, exponential
    • error distribution is not normal
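
For each of these, R provides density, probability, quantile and random-generation functions that follow the d/p/q/r naming convention:

dbinom(3, size = 5, prob = 0.5)  # density: P(Y = 3) for Binomial(5, 0.5)
ppois(2, lambda = 2)             # distribution: P(Y <= 2) for Poisson(2)
qnorm(0.975)                     # quantile: 97.5th percentile of N(0, 1)
rexp(3, rate = 0.5)              # random: three draws from Exponential(0.5)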

Discrete distributions

Binomial distribution

Probability of \(y\) successes in \(n\) trials with probability of success \(\pi\)

\[ P(y|n,\pi)={n\choose y}\pi^y(1-\pi)^{(n-y)} \]
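
As a numerical check, for \(n=5\), \(\pi=0.5\) and \(y=2\): \({5\choose 2}\,0.5^2\,0.5^3=10\times 0.03125=0.3125\), which matches the built-in density function:

choose(5, 2) * 0.5^2 * (1 - 0.5)^(5 - 2)  # by hand: 0.3125
dbinom(2, size = 5, prob = 0.5)           # built-in: 0.3125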

For \(n=5\) trials:

Code that generates the plot on the previous slide

par(mfrow = c(1, 2), bg = NA)  # two panels side by side
y <- factor(0:5)               # possible numbers of successes
x <- 0:5
barplot(dbinom(x, 5, .1), ylim = c(0, .6), names.arg = y, 
        ylab = "P(Y=y)", xlab = "y")
text(5, .5, expression(paste(pi, " = 0.1")))
barplot(dbinom(x, 5, .5), ylim = c(0, .6), names.arg = y, 
        ylab = "P(Y=y)", xlab = "y")
text(5, .5, expression(paste(pi, " = 0.5")))

Poisson distribution

The probability of \(y\) events within a time period (e.g. the number of earthquakes in a year) \[ P(y|\lambda)=\frac{e^{-\lambda}\lambda^y}{y!}, \text{ where } \lambda \text{ is the rate parameter} \]
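
As a numerical check, for \(\lambda=2\) and \(y=3\): \(e^{-2}\,2^3/3!\approx 0.180\):

exp(-2) * 2^3 / factorial(3)  # by hand: about 0.180
dpois(3, lambda = 2)          # built-in: the same value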

Code that generates the plot on the previous slide

par(mfrow = c(1, 2), bg = NA)
y <- factor(0:5)
x <- 0:5
barplot(dpois(x, .5), ylim = c(0, .6), names.arg = y, 
        ylab = "P(Y=y)", xlab = "y")
text(5, .5, expression(paste(lambda, " = 0.5")))
barplot(dpois(x, 2), ylim = c(0, .6), names.arg = y, 
        ylab = "P(Y=y)", xlab = "y")
text(5, .5, expression(paste(lambda, " = 2")))

Continuous distributions

Normal distribution: pdf and cdf

The probability density function (pdf) and the cumulative distribution function (cdf):

par(mfrow = c(1, 2), bg = NA)
curve(dnorm(x, 0, 1), xlim = c(-3, 3), ylab = "f(x)", main = '')      # standard normal pdf
curve(pnorm(x, 0, 1), xlim = c(-3, 3), ylab = "F(X < x)", main = '')  # standard normal cdf
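
The distribution function pnorm() also gives probabilities directly; for example, the familiar central 95% of the standard normal:

pnorm(1.96) - pnorm(-1.96)  # about 0.95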

Exponential distribution

  • Waiting times (e.g. the time until the next earthquake; a small sketch follows) \[ f(t|\lambda)=\lambda{e}^{-\lambda{t}} \]
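
A minimal sketch with an assumed rate of \(\lambda=0.5\); the corresponding cdf is \(F(t)=1-e^{-\lambda t}\):

1 - exp(-0.5 * 1)    # P(T <= 1) by hand: about 0.393
pexp(1, rate = 0.5)  # built-in: the same value
curve(dexp(x, rate = 0.5), xlim = c(0, 10), ylab = "f(t)", xlab = "t")  # the density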

Some more modeling

Performing a t-test

The cats data from MASS record the heart weight (Hwt, in grams) of adult cats; we test whether the mean heart weight differs between female and male cats:

t.test(Hwt ~ Sex, data = cats)
## 
##  Welch Two Sample t-test
## 
## data:  Hwt by Sex
## t = -6.5179, df = 140.61, p-value = 1.186e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.763753 -1.477352
## sample estimates:
## mean in group F mean in group M 
##        9.202128       11.322680

Performing a chi-square test

We use the bacteria data from MASS, which record repeated tests for the presence of a bacterium in children:

head(bacteria, n = 8)
##   y ap hilo week  ID     trt
## 1 y  p   hi    0 X01 placebo
## 2 y  p   hi    2 X01 placebo
## 3 y  p   hi    4 X01 placebo
## 4 y  p   hi   11 X01 placebo
## 5 y  a   hi    0 X02   drug+
## 6 y  a   hi    2 X02   drug+
## 7 n  a   hi    6 X02   drug+
## 8 y  a   hi   11 X02   drug+

Column y:

  • n: Disease absent.
  • y: Disease present.

Column ap:

  • a: Active medicine.
  • p: Placebo.

Performing a chi-square test

table(bacteria$ap, bacteria$y)
##    
##      n  y
##   a 31 93
##   p 12 84
x2.table <- table(bacteria$ap, bacteria$y)
x2.table <- cbind(x2.table, rowSums(x2.table))
rbind(x2.table, colSums(x2.table))
##    n   y    
## a 31  93 124
## p 12  84  96
##   43 177 220
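
Base R also provides addmargins(), which yields the same table with row and column totals in one step:

addmargins(table(bacteria$ap, bacteria$y))  # adds a "Sum" row and column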

Chi-square tests

\[\chi^2 = \sum \frac{(O - E)^2}{E}\]

  • Observed:
##    n   y    
## a 31  93 124
## p 12  84  96
##   43 177 220
  • Expected:
##            bacteria$y
## bacteria$ap        n        y
##           a 24.23636 99.76364
##           p 18.76364 77.23636
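
Each expected count equals (row total × column total) / grand total; they can be computed by hand or extracted from the test object:

obs <- table(bacteria$ap, bacteria$y)
outer(rowSums(obs), colSums(obs)) / sum(obs)  # expected counts by hand
chisq.test(obs)$expected                      # the same, from the test object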

Performing a chi-square test

prop.table(table(bacteria$ap, bacteria$y), margin = 1)
##    
##         n     y
##   a 0.250 0.750
##   p 0.125 0.875
  • In the treatment condition, 1/4 are free of the disease.

  • In the placebo condition, 1/8 are free of the disease.

Should we assume this means the medicine works?

Performing a chi-square test

chisq.test(bacteria$y, bacteria$ap)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  bacteria$y and bacteria$ap
## X-squared = 4.6109, df = 1, p-value = 0.03177

or the other way around

chisq.test(bacteria$ap, bacteria$y)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  bacteria$ap and bacteria$y
## X-squared = 4.6109, df = 1, p-value = 0.03177

Or, equivalently:

chisq.test(table(bacteria$y, bacteria$ap))
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(bacteria$y, bacteria$ap)
## X-squared = 4.6109, df = 1, p-value = 0.03177
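
For 2×2 tables, chisq.test() applies the Yates continuity correction by default; the uncorrected statistic can be requested explicitly:

chisq.test(table(bacteria$y, bacteria$ap), correct = FALSE)  # no continuity correction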

Visualizing this

plot(table(bacteria$y, bacteria$ap))  # a mosaic plot of the contingency table

Visualizing this

barplot(table(bacteria$y, bacteria$ap), 
        beside = TRUE, 
        legend = c("No infection", "Infection"), 
        col = c("blue", "orange"),
        args.legend = list(x = 4.5, y = 80))

Fisher’s exact test

With small cell counts (roughly, if any expected count is below 5), we should use Fisher's exact test.

short.bac <- bacteria[bacteria$week <= 2, ]
short.tab <- table(short.bac$y, short.bac$ap)
short.tab 
##    
##      a  p
##   n  6  3
##   y 47 38
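
Inspecting the expected counts shows why the rule of thumb applies here (suppressing the warning that chisq.test() raises for this table):

suppressWarnings(chisq.test(short.tab))$expected  # one expected count falls below 5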

Fisher’s exact test

chisq.test(short.tab)
## Warning in chisq.test(short.tab): Chi-squared approximation may be
## incorrect
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  short.tab
## X-squared = 0.090475, df = 1, p-value = 0.7636

Fisher’s exact test

fisher.test(table(short.bac$y, short.bac$ap))
## 
##  Fisher's Exact Test for Count Data
## 
## data:  table(short.bac$y, short.bac$ap)
## p-value = 0.7268
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##   0.3182193 10.5996850
## sample estimates:
## odds ratio 
##   1.609049

What do we conclude?

Visualizing this

plot(table(short.bac$y, short.bac$ap))  # mosaic plot, as before

Visualizing this

barplot(table(short.bac$y, short.bac$ap), 
        beside = TRUE, 
        legend = c("No infection", "Infection"), 
        col = c("blue", "orange"),
        args.legend = list(x = 4.5, y = 40))