We use the following packages:
library(mice)
library(dplyr)
library(magrittr)
library(DAAG)
The following table shows numbers of occasions when inhibition (i.e., no flow of current across a membrane) occurred within 120 s, for different concentrations of the protein peptide-C. The outcome yes
implies that inhibition has occurred.
conc 0.1 0.5 1 10 20 30 50 70 80 100 150
no 7 1 10 9 2 9 13 1 1 4 3
yes 0 0 3 4 0 6 7 0 0 1 7
R
data <- data.frame(conc = c(.1, .5, 1, 10, 20, 30, 50, 70, 80, 100, 150),
no = c(7, 1, 10, 9, 2, 9, 13, 1, 1, 4, 3),
yes = c(0, 0, 3, 4, 0, 6, 7, 0, 0, 1 ,7))
data
## conc no yes
## 1 0.1 7 0
## 2 0.5 1 0
## 3 1.0 10 3
## 4 10.0 9 4
## 5 20.0 2 0
## 6 30.0 9 6
## 7 50.0 13 7
## 8 70.0 1 0
## 9 80.0 1 0
## 10 100.0 4 1
## 11 150.0 3 7
no
and yes
over the rows (i.e. no
+ yes
)yes
over margin
)First, we create a function to calculate the logit (logodds):
logit <- function(p) log(p / (1 - p))
Then we add the new columns
data <-
data %>%
mutate(margin = no+yes,
prop = yes / margin,
logit = logit(prop))
data
## conc no yes margin prop logit
## 1 0.1 7 0 7 0.0000000 -Inf
## 2 0.5 1 0 1 0.0000000 -Inf
## 3 1.0 10 3 13 0.2307692 -1.2039728
## 4 10.0 9 4 13 0.3076923 -0.8109302
## 5 20.0 2 0 2 0.0000000 -Inf
## 6 30.0 9 6 15 0.4000000 -0.4054651
## 7 50.0 13 7 20 0.3500000 -0.6190392
## 8 70.0 1 0 1 0.0000000 -Inf
## 9 80.0 1 0 1 0.0000000 -Inf
## 10 100.0 4 1 5 0.2000000 -1.3862944
## 11 150.0 3 7 10 0.7000000 0.8472979
There are a lot of zero proportions, hence the \(-\infty\) in the logit. You can fix this (at least the interpretation of the logodds
) by adding a constant (usually 0.5) to all cells conform the empirical logodds
(see e.g. Cox and Snell 1989).
logitCandS <- function(yes, no) log((yes + .5) / (no + .5))
data <-
data %>%
mutate(logitCS = logitCandS(yes, no))
data
## conc no yes margin prop logit logitCS
## 1 0.1 7 0 7 0.0000000 -Inf -2.7080502
## 2 0.5 1 0 1 0.0000000 -Inf -1.0986123
## 3 1.0 10 3 13 0.2307692 -1.2039728 -1.0986123
## 4 10.0 9 4 13 0.3076923 -0.8109302 -0.7472144
## 5 20.0 2 0 2 0.0000000 -Inf -1.6094379
## 6 30.0 9 6 15 0.4000000 -0.4054651 -0.3794896
## 7 50.0 13 7 20 0.3500000 -0.6190392 -0.5877867
## 8 70.0 1 0 1 0.0000000 -Inf -1.0986123
## 9 80.0 1 0 1 0.0000000 -Inf -1.0986123
## 10 100.0 4 1 5 0.2000000 -1.3862944 -1.0986123
## 11 150.0 3 7 10 0.7000000 0.8472979 0.7621401
We can now see that the \(-\infty\) proportions are mostly gone
margin
as the weightsfit <-
data %$%
glm(prop ~ conc, family=binomial, weights=margin)
summary(fit)
##
## Call:
## glm(formula = prop ~ conc, family = binomial, weights = margin)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8159 -1.0552 -0.6878 0.3667 1.0315
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.32701 0.33837 -3.922 8.79e-05 ***
## conc 0.01215 0.00496 2.450 0.0143 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 16.683 on 10 degrees of freedom
## Residual deviance: 10.389 on 9 degrees of freedom
## AIC: 30.988
##
## Number of Fisher Scoring iterations: 4
fit
plot(fit, which = c(1, 5))
The data set is small, but case 11
stands out in the Residuals vs. Leverage
plot
conc
is somewhat skewed. Run the model again with a log-transformation for conc
. To apply this in the model directly:
fit.log <-
data %$%
glm(prop ~ I(log(conc)), family=binomial, weights=margin)
summary(fit.log)
##
## Call:
## glm(formula = prop ~ I(log(conc)), family = binomial, weights = margin)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2510 -1.0599 -0.5029 0.3152 1.3513
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.7659 0.5209 -3.390 0.000699 ***
## I(log(conc)) 0.3437 0.1440 2.387 0.016975 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 16.6834 on 10 degrees of freedom
## Residual deviance: 9.3947 on 9 degrees of freedom
## AIC: 29.994
##
## Number of Fisher Scoring iterations: 4
The logodds now depict the unit increase in log(conc)
, instead of conc
.
log(conc)
.plot(fit.log, which = c(1, 5))
Outliers are now less of an issue. This exercise demonstrates that data transformations may easily render our method more valid, but in exchange it makes our model interpretation more difficult. Parameters now have to be assessed in the log(conc)
parameter space.
brandsma
data from package mice
to fit a logistic regression model for sex
based on lpo
(Language Post Outcome).brandsma.subset <-
brandsma %>%
subset(!is.na(sex) & !is.na(lpo), select = c(sex, lpo))
fit <-
brandsma.subset %$%
glm(sex ~ lpo, family=binomial(link='logit'))
fit %>%
summary()
##
## Call:
## glm(formula = sex ~ lpo, family = binomial(link = "logit"))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.371 -1.152 -0.909 1.165 1.626
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.279636 0.156570 -8.173 3.01e-16 ***
## lpo 0.029727 0.003694 8.047 8.50e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5395.9 on 3893 degrees of freedom
## Residual deviance: 5329.5 on 3892 degrees of freedom
## AIC: 5333.5
##
## Number of Fisher Scoring iterations: 4
With every unit increase in lpo
, the logodds of gender increases by 0.0297266.
confint(fit)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -1.5879389 -0.97404439
## lpo 0.0225129 0.03699717
sex
variable and compare your predictions to the observed sex
. We can obtain predictions by using function predict
. The default predictions are on the scale of the linear predictors; the alternative “response” is on the scale of the response variable. Thus for a default binomial model the default predictions are of log-odds (probabilities on logit scale) and type = “response” gives the predicted probabilities.To obtain the predicted logodds:
pred.logodds <-
fit %>%
predict()
head(pred.logodds)
## 1 2 3 4 5 6
## 0.20669266 0.08778637 0.05805980 -0.29865908 0.08778637 -0.68510453
and the predicted probabilities
pred.prob <-
fit %>%
predict(type = "response")
head(pred.prob)
## 1 2 3 4 5 6
## 0.5514900 0.5219325 0.5145109 0.4258853 0.5219325 0.3351230
We can then use the decision boundary pred.prob > .5
to assign cases to sex == 1
and the others to sex == 0
.
pred <- numeric(length(pred.prob))
pred[pred.prob > .5] <- 1
pred <- unlist(pred)
To determine how many correct predictions we have, we can use
obs <-
brandsma %>%
subset(!is.na(sex) & !is.na(lpo), select = sex)
sum(pred == obs) #total correct
## [1] 2143
mean(pred == obs) #average correct
## [1] 0.5503338
So we succesfully predict about half. That is not so good, because, based on chance alone, we would expect to successfully predict about half:
set.seed(123)
table(obs) #Observed distribution
## obs
## 0 1
## 1995 1899
prop.obs <- #Observed proportions
obs %>%
table() %>%
prop.table()
prop.obs
## .
## 0 1
## 0.5123267 0.4876733
rand <- sample(c(0, 1), 3894, prob = c(.5123267, .4876733), replace = TRUE)
#random <- sample(0:1, nrow(obs), prob = prop.obs, replace = TRUE)
sum(rand == obs) #total correct
## [1] 1965
mean(rand == obs) #average correct
## [1] 0.5046225
An alternative way to test our model efficiency is with CVbinary()
:
CVbinary(glm(sex ~ lpo, family=binomial(link='logit'), data = brandsma.subset))
##
## Fold: 4 6 2 10 3 7 5 9 8 1
## Internal estimate of accuracy = 0.55
## Cross-validation estimate of accuracy = 0.549
The lesson here is that a significant paramater has no meaning if the substantive interpretation of the effect is ignored. There is almost no relation, whatsoever, there is just sufficient data to deem the influence of lpo
on sex
worthy of significance.
minor.head.injury
(from package DAAG
), obtain a logistic regression model relating clinically.important.brain.injury
to all the other variables.Let us fit the model, predict clinically.important.brain.injury
by all other variables in the data.
fit <- glm(clinically.important.brain.injury ~ ., family=binomial, data=head.injury)
summary(fit)
##
## Call:
## glm(formula = clinically.important.brain.injury ~ ., family = binomial,
## data = head.injury)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2774 -0.3511 -0.2095 -0.1489 3.0028
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.4972 0.1629 -27.611 < 2e-16 ***
## age.65 1.3734 0.1827 7.518 5.56e-14 ***
## amnesia.before 0.6893 0.1725 3.996 6.45e-05 ***
## basal.skull.fracture 1.9620 0.2064 9.504 < 2e-16 ***
## GCS.decrease -0.2688 0.3680 -0.730 0.465152
## GCS.13 1.0613 0.2820 3.764 0.000168 ***
## GCS.15.2hours 1.9408 0.1663 11.669 < 2e-16 ***
## high.risk 1.1115 0.1591 6.984 2.86e-12 ***
## loss.of.consciousness 0.9554 0.1959 4.877 1.08e-06 ***
## open.skull.fracture 0.6304 0.3151 2.001 0.045424 *
## vomiting 1.2334 0.1961 6.290 3.17e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1741.6 on 3120 degrees of freedom
## Residual deviance: 1201.3 on 3110 degrees of freedom
## AIC: 1223.3
##
## Number of Fisher Scoring iterations: 6
A risk of 2.5% corresponds to the cutoff for a CT scan. This translates to a logit of \(\log\left(\frac{.025}{1-.025}\right) = -3.663562\). In other words, any sum of variables that “lifts” the intercept above -3.66 would satisfy the cutoff.
End of Practical