Statistical Programming in R
We knew \[y=\alpha+\beta x+\epsilon\]
Now we consider \[\mathbb{E}[y] = \alpha + \beta x\]
They describe the same model, written in different notation and within a different framework.
The upside is that we can now apply a function to the expectation \(\mathbb{E}[y]\) to allow for transformations: we choose the function such that \(f(\mathbb{E}[y])\) has a linear relation with \(x\).
The function \(f()\) is called the link function. It transforms the scale of the response/outcome to the scale of the linear predictor.
Examples: \(f(x) = x, \\ f(x) = 1/x, \\ f(x) = \log(x), \\ f(x) = \log(x/(1-x)).\)
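These example links are easy to write down directly in R; a quick sketch (plain helper functions, not from any package):

```r
identity_link <- function(x) x                 # Gaussian default
inverse_link  <- function(x) 1 / x             # e.g. Gamma default
log_link      <- function(x) log(x)            # e.g. Poisson default
logit_link    <- function(x) log(x / (1 - x))  # binomial default

logit_link(0.5)  # 0: a probability of 0.5 maps to 0 on the linear-predictor scale
```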
Remember that \[y=\alpha+\beta x+\epsilon,\]
and that \[\mathbb{E}[y] = \alpha + \beta x.\]
As a result, \[y = \mathbb{E}[y] + \epsilon,\]
and the residuals do not need to be normal (heck, \(y\) probably isn’t, so why should \(\epsilon\) be?).
\[\text{logit}(p) = \log(\frac{p}{1-p}) = \log(p) - \log(1-p)\]
Logit models work on the \(\log(\text{odds})\) scale
\[\log(\text{odds}) = \log(\frac{p}{1-p}) = \log(p) - \log(1-p) = \text{logit}(p)\]
The logit of the probability is the log of the odds.
Logistic regression allows us to model the \(\log(\text{odds})\) as a function of other, linear predictors.
logodds <- rep(NA, 1001)
for (i in 0:1000) {
  p <- i / 1000
  logodds[i + 1] <- log(p / (1 - p))
}
head(logodds)
## [1] -Inf -6.906755 -6.212606 -5.806138 -5.517453 -5.293305
tail(logodds)
## [1] 5.293305 5.517453 5.806138 6.212606 6.906755 Inf
plot(0:1000, logodds, xlab = "x", ylab = "log(odds)", type = "l")
invlogit <- exp(logodds) / (1 + exp(logodds))
head(invlogit)
## [1] 0.000 0.001 0.002 0.003 0.004 0.005
tail(invlogit)
## [1] 0.995 0.996 0.997 0.998 0.999 NaN
plot(0:1000, invlogit, xlab = "x", ylab = "p", type = "l")
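The hand-rolled logit and inverse logit above can be cross-checked against R’s built-ins `qlogis()` (the logit) and `plogis()` (its inverse) from the stats package:

```r
p <- seq(0.1, 0.9, by = 0.1)
all.equal(qlogis(p), log(p / (1 - p)))  # TRUE: qlogis() is the logit
x <- qlogis(p)
all.equal(plogis(x), p)                 # TRUE: plogis() inverts it
```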
With linear regression we had the Sum of Squares (SS). Its logistic counterpart is the Deviance (D).
With logistic regression we aim to maximize the likelihood, which is equivalent to minimizing the deviance.
The likelihood is the (joint) probability of the observed values, given the current model parameters.
In normally distributed data: \(\text{SS}=\text{D}\).
require(DAAG)
anesthetic
##    move conc    logconc nomove
## 1     0  1.0  0.0000000      1
## 2     1  1.2  0.1823216      0
## 3     0  1.4  0.3364722      1
## 4     1  1.4  0.3364722      0
## 5     1  1.2  0.1823216      0
## 6     0  2.5  0.9162907      1
## 7     0  1.6  0.4700036      1
## 8     1  0.8 -0.2231436      0
## 9     0  1.6  0.4700036      1
## 10    1  1.4  0.3364722      0
## 11    1  0.8 -0.2231436      0
## 12    0  1.6  0.4700036      1
## 13    0  2.5  0.9162907      1
## 14    0  1.4  0.3364722      1
## 15    0  1.6  0.4700036      1
## 16    0  1.4  0.3364722      1
## 17    0  1.4  0.3364722      1
## 18    1  0.8 -0.2231436      0
## 19    0  0.8 -0.2231436      1
## 20    0  1.2  0.1823216      1
## 21    1  0.8 -0.2231436      0
## 22    1  0.8 -0.2231436      0
## 23    1  1.0  0.0000000      0
## 24    1  0.8 -0.2231436      0
## 25    1  1.0  0.0000000      0
## 26    0  1.2  0.1823216      1
## 27    1  1.0  0.0000000      0
## 28    0  1.2  0.1823216      1
## 29    1  1.0  0.0000000      0
## 30    0  1.2  0.1823216      1
fit <- glm(nomove ~ conc, family = binomial(link = "logit"), data = anesthetic)
summary(fit)
## 
## Call:
## glm(formula = nomove ~ conc, family = binomial(link = "logit"), 
##     data = anesthetic)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.76666  -0.74407   0.03413   0.68666   2.06900  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   -6.469      2.418  -2.675  0.00748 **
## conc           5.567      2.044   2.724  0.00645 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 41.455  on 29  degrees of freedom
## Residual deviance: 27.754  on 28  degrees of freedom
## AIC: 31.754
## 
## Number of Fisher Scoring iterations: 5
anestot <- aggregate(anesthetic[, c("move", "nomove")],
                     by = list(conc = anesthetic$conc), FUN = sum)
anestot$conc <- as.numeric(as.character(anestot$conc))
anestot$total <- apply(anestot[, c("move", "nomove")], 1, sum)
anestot$prop <- anestot$nomove / anestot$total
anestot
##   conc move nomove total      prop
## 1  0.8    6      1     7 0.1428571
## 2  1.0    4      1     5 0.2000000
## 3  1.2    2      4     6 0.6666667
## 4  1.4    2      4     6 0.6666667
## 5  1.6    0      4     4 1.0000000
## 6  2.5    0      2     2 1.0000000
fit <- glm(prop ~ conc, family = binomial(link = "logit"),
           weights = total, data = anestot)
summary(fit)
## 
## Call:
## glm(formula = prop ~ conc, family = binomial(link = "logit"), 
##     data = anestot, weights = total)
## 
## Deviance Residuals: 
##        1         2         3         4         5         6  
##  0.20147  -0.45367   0.56890  -0.70000   0.81838   0.04826  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   -6.469      2.419  -2.675  0.00748 **
## conc           5.567      2.044   2.724  0.00645 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 15.4334  on 5  degrees of freedom
## Residual deviance:  1.7321  on 4  degrees of freedom
## AIC: 13.811
## 
## Number of Fisher Scoring iterations: 5
Always try to make the relation as linear as possible.
Do not forget that you can use transformations to “make” things more linear.
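The fitting code behind the two model summaries that follow is not shown; assuming the `frogs` data from the DAAG package, calls matching the `Call:` lines in the output would look like this (the object names are ours):

```r
library(DAAG)

# Full model, including NoOfSites and avrain
frogs.glm0 <- glm(pres.abs ~ log(distance) + log(NoOfPools) + NoOfSites +
                    avrain + I(meanmax + meanmin) + I(meanmax - meanmin),
                  family = binomial, data = frogs)

# Reduced model, dropping the two clearly non-significant terms
frogs.glm <- glm(pres.abs ~ log(distance) + log(NoOfPools) +
                   I(meanmax + meanmin) + I(meanmax - meanmin),
                 family = binomial, data = frogs)

summary(frogs.glm0)
summary(frogs.glm)
```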
## 
## Call:
## glm(formula = pres.abs ~ log(distance) + log(NoOfPools) + NoOfSites + 
##     avrain + I(meanmax + meanmin) + I(meanmax - meanmin), family = binomial, 
##     data = frogs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9763  -0.7189  -0.2786   0.7970   2.5745  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          18.2688999 16.1381912   1.132  0.25762    
## log(distance)        -0.7583198  0.2558117  -2.964  0.00303 ** 
## log(NoOfPools)        0.5708953  0.2153335   2.651  0.00802 ** 
## NoOfSites            -0.0036201  0.1061469  -0.034  0.97279    
## avrain                0.0007003  0.0411710   0.017  0.98643    
## I(meanmax + meanmin)  1.4958055  0.3153152   4.744  2.1e-06 ***
## I(meanmax - meanmin) -3.8582668  1.2783921  -3.018  0.00254 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 279.99  on 211  degrees of freedom
## Residual deviance: 197.65  on 205  degrees of freedom
## AIC: 211.65
## 
## Number of Fisher Scoring iterations: 5
## 
## Call:
## glm(formula = pres.abs ~ log(distance) + log(NoOfPools) + I(meanmax + 
##     meanmin) + I(meanmax - meanmin), family = binomial, data = frogs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9753  -0.7224  -0.2780   0.7974   2.5736  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           18.5268     5.2673   3.517 0.000436 ***
## log(distance)         -0.7547     0.2261  -3.338 0.000844 ***
## log(NoOfPools)         0.5707     0.2152   2.652 0.007999 ** 
## I(meanmax + meanmin)   1.4985     0.3088   4.853 1.22e-06 ***
## I(meanmax - meanmin)  -3.8806     0.9002  -4.311 1.63e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 279.99  on 211  degrees of freedom
## Residual deviance: 197.66  on 207  degrees of freedom
## AIC: 207.66
## 
## Number of Fisher Scoring iterations: 5
head(fitted(frogs.glm))
##         2         3         4         5         6         7 
## 0.9416691 0.9259228 0.9029415 0.8119619 0.9314070 0.7278038
head(predict(frogs.glm, type = "response"))
##         2         3         4         5         6         7 
## 0.9416691 0.9259228 0.9029415 0.8119619 0.9314070 0.7278038
head(predict(frogs.glm, type = "link")) # Scale of linear predictor
##         2         3         4         5         6         7 
## 2.7815212 2.5256832 2.2303441 1.4628085 2.6085055 0.9835086
pred <- predict(frogs.glm, type = "link", se.fit = TRUE)
head(data.frame(fit = pred$fit, se = pred$se.fit))
##         fit        se
## 2 2.7815212 0.6859612
## 3 2.5256832 0.4851882
## 4 2.2303441 0.4381720
## 5 1.4628085 0.4805175
## 6 2.6085055 0.5290599
## 7 0.9835086 0.3900549
confint(frogs.glm)
## Waiting for profiling to be done...
##                           2.5 %     97.5 %
## (Intercept)           8.5807639 29.3392400
## log(distance)        -1.2157643 -0.3247928
## log(NoOfPools)        0.1628001  1.0106909
## I(meanmax + meanmin)  0.9219804  2.1385678
## I(meanmax - meanmin) -5.7299615 -2.1868793
coef(frogs.glm)
##          (Intercept)        log(distance)       log(NoOfPools) 
##           18.5268418           -0.7546587            0.5707174 
## I(meanmax + meanmin) I(meanmax - meanmin) 
##            1.4984905           -3.8805856
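Because these coefficients live on the \(\log(\text{odds})\) scale, exponentiating them gives odds ratios, which are often easier to interpret. For instance, with the earlier anesthetic model (assuming DAAG is available):

```r
library(DAAG)
fit <- glm(nomove ~ conc, family = binomial(link = "logit"), data = anesthetic)
exp(coef(fit))  # multiplicative change in the odds per unit increase in conc
```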
frogs.glm <- glm(pres.abs ~ log(distance) + log(NoOfPools),
                 family = binomial, data = frogs)
cv <- CVbinary(frogs.glm)
## 
## Fold:  6 1 3 2 7 5 4 8 10 9
## Internal estimate of accuracy = 0.759
## Cross-validation estimate of accuracy = 0.75
The cross-validation measure is the proportion of predictions over the folds that are correct.
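The in-sample (“internal”) estimate can be computed by hand: classify each observation by whether its fitted probability exceeds 0.5, and take the proportion of correct calls. A sketch, assuming the `frogs` data from DAAG:

```r
library(DAAG)
frogs.glm <- glm(pres.abs ~ log(distance) + log(NoOfPools),
                 family = binomial, data = frogs)
pred.class <- as.numeric(fitted(frogs.glm) > 0.5)  # 1 if p-hat > 0.5
mean(pred.class == frogs$pres.abs)  # non-cross-validated accuracy
```

The cross-validated figure is usually a little lower, because each fold is scored on observations the model did not see.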