Statistical Programming in R

Generalized linear models

GLMs

We know \[y=\alpha+\beta x+\epsilon.\]

Now we consider \[\mathbb{E}[y] = \alpha + \beta x\]

They describe the same model: the notation differs, but the second formulation opens the door to a more general framework.

The upside is that we can now apply a function to the expectation \(\mathbb{E}[y]\) to allow for transformations: we choose the function such that \(f(\mathbb{E}[y])\) has a linear relation with \(x\).

We call this function \(f()\) the link function. It transforms the scale of the response/outcome to the scale of the linear predictor.

Examples:

  • identity: \(f(x) = x\)
  • inverse: \(f(x) = 1/x\)
  • log: \(f(x) = \log(x)\)
  • logit: \(f(x) = \log(x/(1-x))\)
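
As a quick illustration, base R bundles these link functions; a minimal sketch using stats::make.link, which returns a link function together with its inverse:

logit <- make.link("logit")   # list with linkfun, linkinv, and more
logit$linkfun(0.75)           # log(0.75 / 0.25) = 1.0986...
logit$linkinv(1.0986)         # back to (approximately) 0.75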

Link functions

GLMs continued

Remember that \[y=\alpha+\beta x+\epsilon,\]

and that \[\mathbb{E}[y] = \alpha + \beta x.\]

As a result, \[y = \mathbb{E}[y] + \epsilon,\]

and the residuals do not need to be normal (heck, \(y\) probably isn't, so why should \(\epsilon\) be?).

Logit

\[\text{logit}(p) = \log(\frac{p}{1-p}) = \log(p) - \log(1-p)\]

Logit continued

Logit models work on the \(\log(\text{odds})\) scale

\[\log(\text{odds}) = \log(\frac{p}{1-p}) = \log(p) - \log(1-p) = \text{logit}(p)\]

The logit of the probability is the log of the odds.

Logistic regression allows us to model the \(\log(\text{odds})\) as a linear function of predictors.
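
For example, a probability of \(p = 0.75\) corresponds to odds of \(3\) (success is three times as likely as failure), and hence a log-odds of about \(1.1\):

p <- 0.75
p / (1 - p)       # odds: 3
log(p / (1 - p))  # logit(p): 1.098612
qlogis(p)         # the same value, via base R's built-in logit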

\(\log(\text{odds})\) explained

# logit(p) for p = 0, 0.001, ..., 1
logodds <- rep(NA, 1001)
for (i in 0:1000) {
  p <- i / 1000                        # probability
  logodds[i + 1] <- log(p / (1 - p))   # log odds of p
}
head(logodds)
## [1]      -Inf -6.906755 -6.212606 -5.806138 -5.517453 -5.293305
tail(logodds)
## [1] 5.293305 5.517453 5.806138 6.212606 6.906755      Inf
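
The loop is written out for clarity; the same vector can be built in a single vectorized step, since base R's qlogis() is exactly the logit:

p <- (0:1000) / 1000
all.equal(logodds, qlogis(p))  # TRUE, including the two infinities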

\(\log(\text{odds})\) explained

plot(0:1000, logodds, xlab = "x", ylab = "log(odds)", type = "l")

logit\(^{-1}\) explained

invlogit <- exp(logodds) / (1 + exp(logodds))  # inverse logit: back to probabilities
head(invlogit)
## [1] 0.000 0.001 0.002 0.003 0.004 0.005
tail(invlogit)
## [1] 0.995 0.996 0.997 0.998 0.999   NaN
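
The NaN at \(p = 1\) arises because exp(Inf) / (1 + exp(Inf)) evaluates to Inf/Inf. Base R's plogis() is a numerically stable inverse logit that handles the endpoints:

plogis(Inf)            # 1, where exp(Inf) / (1 + exp(Inf)) gives NaN
tail(plogis(logodds))  # same values as above, but ending in 1 instead of NaN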

logit\(^{-1}\) explained

plot(0:1000, invlogit, xlab = "x", ylab = "invlogit", type = "l")

Logistic regression

With linear regression we had the Sum of Squares (SS). Its logistic counterpart is the Deviance (D).

  • The deviance measures the discrepancy between the observed values and the values expected under the model.

With logistic regression we aim to maximize the likelihood, which is equivalent to minimizing the deviance.

The likelihood is the (joint) probability of the observed values, given the current model parameters.

For normally distributed data the two coincide: \(\text{SS}=\text{D}\).

Logistic regression

require(DAAG)  # provides the anesthetic data
anesthetic
##    move conc    logconc nomove
## 1     0  1.0  0.0000000      1
## 2     1  1.2  0.1823216      0
## 3     0  1.4  0.3364722      1
## 4     1  1.4  0.3364722      0
## 5     1  1.2  0.1823216      0
## 6     0  2.5  0.9162907      1
## 7     0  1.6  0.4700036      1
## 8     1  0.8 -0.2231436      0
## 9     0  1.6  0.4700036      1
## 10    1  1.4  0.3364722      0
## 11    1  0.8 -0.2231436      0
## 12    0  1.6  0.4700036      1
## 13    0  2.5  0.9162907      1
## 14    0  1.4  0.3364722      1
## 15    0  1.6  0.4700036      1
## 16    0  1.4  0.3364722      1
## 17    0  1.4  0.3364722      1
## 18    1  0.8 -0.2231436      0
## 19    0  0.8 -0.2231436      1
## 20    0  1.2  0.1823216      1
## 21    1  0.8 -0.2231436      0
## 22    1  0.8 -0.2231436      0
## 23    1  1.0  0.0000000      0
## 24    1  0.8 -0.2231436      0
## 25    1  1.0  0.0000000      0
## 26    0  1.2  0.1823216      1
## 27    1  1.0  0.0000000      0
## 28    0  1.2  0.1823216      1
## 29    1  1.0  0.0000000      0
## 30    0  1.2  0.1823216      1

Logistic regression

fit <- glm(nomove ~ conc, family = binomial(link="logit"), data = anesthetic)
summary(fit)
## 
## Call:
## glm(formula = nomove ~ conc, family = binomial(link = "logit"), 
##     data = anesthetic)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.76666  -0.74407   0.03413   0.68666   2.06900  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   -6.469      2.418  -2.675  0.00748 **
## conc           5.567      2.044   2.724  0.00645 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 41.455  on 29  degrees of freedom
## Residual deviance: 27.754  on 28  degrees of freedom
## AIC: 31.754
## 
## Number of Fisher Scoring iterations: 5
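
This fit also makes the deviance-likelihood connection concrete: for ungrouped binary data the saturated model has log-likelihood 0, so the residual deviance equals \(-2\) times the maximized log-likelihood:

deviance(fit)                 # residual deviance: 27.754
-2 * as.numeric(logLik(fit))  # the same value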

Logistic regression

# One row per concentration: counts of move/nomove, plus total and proportion
anestot <- aggregate(anesthetic[, c("move", "nomove")],
                     by = list(conc = anesthetic$conc), FUN = sum)
anestot$conc <- as.numeric(as.character(anestot$conc))
anestot$total <- apply(anestot[, c("move", "nomove")], 1, sum)
anestot$prop <- anestot$nomove / anestot$total
anestot
##   conc move nomove total      prop
## 1  0.8    6      1     7 0.1428571
## 2  1.0    4      1     5 0.2000000
## 3  1.2    2      4     6 0.6666667
## 4  1.4    2      4     6 0.6666667
## 5  1.6    0      4     4 1.0000000
## 6  2.5    0      2     2 1.0000000

Logistic regression

fit <- glm(prop ~ conc, family = binomial(link="logit"), weights = total, data = anestot)
summary(fit)
## 
## Call:
## glm(formula = prop ~ conc, family = binomial(link = "logit"), 
##     data = anestot, weights = total)
## 
## Deviance Residuals: 
##        1         2         3         4         5         6  
##  0.20147  -0.45367   0.56890  -0.70000   0.81838   0.04826  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   -6.469      2.419  -2.675  0.00748 **
## conc           5.567      2.044   2.724  0.00645 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 15.4334  on 5  degrees of freedom
## Residual deviance:  1.7321  on 4  degrees of freedom
## AIC: 13.811
## 
## Number of Fisher Scoring iterations: 5
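
The same grouped model can also be specified with a two-column response of successes and failures, which avoids the separate weights argument; a minimal sketch (fit2 is our name):

fit2 <- glm(cbind(nomove, move) ~ conc, family = binomial(link = "logit"),
            data = anestot)
coef(fit2)  # identical estimates to the proportion + weights fit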

Logistic multiple regression

Always try to make the relation as linear as possible

  • after all, you are fitting a model that is linear on the link scale.

Do not forget that you can use transformations (such as \(\log\)) to make relations more linear; the frogs model below does this with log(distance) and log(NoOfPools).

Logistic multiple regression
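
The summary below is for a full model fitted to the frogs data from DAAG (frog presence/absence modelled from habitat variables). The call is reconstructed from the Call field of the output; the object name frogs.glm.full is ours:

frogs.glm.full <- glm(pres.abs ~ log(distance) + log(NoOfPools) + NoOfSites +
                        avrain + I(meanmax + meanmin) + I(meanmax - meanmin),
                      family = binomial, data = frogs)
summary(frogs.glm.full)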

## 
## Call:
## glm(formula = pres.abs ~ log(distance) + log(NoOfPools) + NoOfSites + 
##     avrain + I(meanmax + meanmin) + I(meanmax - meanmin), family = binomial, 
##     data = frogs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9763  -0.7189  -0.2786   0.7970   2.5745  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          18.2688999 16.1381912   1.132  0.25762    
## log(distance)        -0.7583198  0.2558117  -2.964  0.00303 ** 
## log(NoOfPools)        0.5708953  0.2153335   2.651  0.00802 ** 
## NoOfSites            -0.0036201  0.1061469  -0.034  0.97279    
## avrain                0.0007003  0.0411710   0.017  0.98643    
## I(meanmax + meanmin)  1.4958055  0.3153152   4.744  2.1e-06 ***
## I(meanmax - meanmin) -3.8582668  1.2783921  -3.018  0.00254 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 279.99  on 211  degrees of freedom
## Residual deviance: 197.65  on 205  degrees of freedom
## AIC: 211.65
## 
## Number of Fisher Scoring iterations: 5

Logistic multiple regression
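
Dropping the two clearly non-significant terms (NoOfSites and avrain) gives the reduced model below. It is stored as frogs.glm, the object used on the following slides; the call is again reconstructed from the Call field:

frogs.glm <- glm(pres.abs ~ log(distance) + log(NoOfPools) +
                   I(meanmax + meanmin) + I(meanmax - meanmin),
                 family = binomial, data = frogs)
summary(frogs.glm)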

## 
## Call:
## glm(formula = pres.abs ~ log(distance) + log(NoOfPools) + I(meanmax + 
##     meanmin) + I(meanmax - meanmin), family = binomial, data = frogs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9753  -0.7224  -0.2780   0.7974   2.5736  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           18.5268     5.2673   3.517 0.000436 ***
## log(distance)         -0.7547     0.2261  -3.338 0.000844 ***
## log(NoOfPools)         0.5707     0.2152   2.652 0.007999 ** 
## I(meanmax + meanmin)   1.4985     0.3088   4.853 1.22e-06 ***
## I(meanmax - meanmin)  -3.8806     0.9002  -4.311 1.63e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 279.99  on 211  degrees of freedom
## Residual deviance: 197.66  on 207  degrees of freedom
## AIC: 207.66
## 
## Number of Fisher Scoring iterations: 5

Fitted values

head(fitted(frogs.glm))
##         2         3         4         5         6         7 
## 0.9416691 0.9259228 0.9029415 0.8119619 0.9314070 0.7278038
head(predict(frogs.glm, type = "response"))
##         2         3         4         5         6         7 
## 0.9416691 0.9259228 0.9029415 0.8119619 0.9314070 0.7278038
head(predict(frogs.glm, type = "link"))  # Scale of linear predictor 
##         2         3         4         5         6         7 
## 2.7815212 2.5256832 2.2303441 1.4628085 2.6085055 0.9835086
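
The two scales are connected by the inverse link: applying plogis() (the inverse logit) to the link-scale predictions recovers the fitted probabilities:

head(plogis(predict(frogs.glm, type = "link")))  # matches fitted(frogs.glm)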

Fitted values with approximate SEs

pred <- predict(frogs.glm, type = "link", se.fit = TRUE)
head(data.frame(fit = pred$fit, se = pred$se.fit))
##         fit        se
## 2 2.7815212 0.6859612
## 3 2.5256832 0.4851882
## 4 2.2303441 0.4381720
## 5 1.4628085 0.4805175
## 6 2.6085055 0.5290599
## 7 0.9835086 0.3900549
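
A common use of these standard errors is an approximate 95% interval on the probability scale: form fit \(\pm 2\) SE on the link scale, then back-transform with the inverse logit:

lower <- plogis(pred$fit - 2 * pred$se.fit)  # back-transform the lower bound
upper <- plogis(pred$fit + 2 * pred$se.fit)  # and the upper bound
head(data.frame(lower, upper))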

Confidence intervals for the \(\beta\)

confint(frogs.glm)
## Waiting for profiling to be done...
##                           2.5 %     97.5 %
## (Intercept)           8.5807639 29.3392400
## log(distance)        -1.2157643 -0.3247928
## log(NoOfPools)        0.1628001  1.0106909
## I(meanmax + meanmin)  0.9219804  2.1385678
## I(meanmax - meanmin) -5.7299615 -2.1868793
coef(frogs.glm)
##          (Intercept)        log(distance)       log(NoOfPools) 
##           18.5268418           -0.7546587            0.5707174 
## I(meanmax + meanmin) I(meanmax - meanmin) 
##            1.4984905           -3.8805856
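
Because the model is linear on the log-odds scale, exponentiating the coefficients and their confidence limits converts them to multiplicative odds ratios:

exp(coef(frogs.glm))     # odds ratio per unit increase in each predictor
exp(confint(frogs.glm))  # 95% confidence intervals on the odds-ratio scale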

Cross-validating predictive power

frogs.glm <- glm(pres.abs ~ log(distance) + log(NoOfPools), 
                 family = binomial, data = frogs)
cv <- CVbinary(frogs.glm)
## 
## Fold:  6 1 3 2 7 5 4 8 10 9
## Internal estimate of accuracy = 0.759
## Cross-validation estimate of accuracy = 0.75

The cross-validation estimate of accuracy is the proportion of out-of-fold predictions that are correct.
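
Note that CVbinary() assigns observations to folds at random, so the cross-validation estimate varies slightly between runs; setting a seed first makes it reproducible:

set.seed(123)  # fix the random fold assignment
cv <- CVbinary(frogs.glm)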