Practical L

We use the following packages:

library(mice)
library(dplyr)
library(magrittr)
library(DAAG)

The following table shows numbers of occasions when inhibition (i.e., no flow of current across a membrane) occurred within 120 s, for different concentrations of the protein peptide-C. The outcome yes implies that inhibition has occurred.

conc 0.1 0.5  1 10 20 30 50 70 80 100 150 
no     7   1 10  9  2  9 13  1  1   4   3 
yes    0   0  3  4  0  6  7  0  0   1   7

Create this data in R

data <- data.frame(conc = c(.1, .5, 1, 10, 20, 30, 50, 70, 80, 100, 150),
                   no = c(7, 1, 10, 9, 2, 9, 13, 1, 1, 4, 3),
                   yes = c(0, 0, 3, 4, 0, 6, 7, 0, 0, 1 ,7)) 
data

##     conc no yes
## 1    0.1  7   0
## 2    0.5  1   0
## 3    1.0 10   3
## 4   10.0  9   4
## 5   20.0  2   0
## 6   30.0  9   6
## 7   50.0 13   7
## 8   70.0  1   0
## 9   80.0  1   0
## 10 100.0  4   1
## 11 150.0  3   7

Add three new variables (columns) to the data

the margin of no and yes over the rows (i.e. no + yes)
the proportion (yes over margin)
the logodds

First, we create a function to calculate the logit (logodds):

logit <- function(p) log(p / (1 - p))

Then we add the new columns

data <- 
  data %>% 
  mutate(margin = no+yes,
         prop = yes / margin,
         logit = logit(prop))

Inspect the expanded data. What do you see?

data

##     conc no yes margin      prop      logit
## 1    0.1  7   0      7 0.0000000       -Inf
## 2    0.5  1   0      1 0.0000000       -Inf
## 3    1.0 10   3     13 0.2307692 -1.2039728
## 4   10.0  9   4     13 0.3076923 -0.8109302
## 5   20.0  2   0      2 0.0000000       -Inf
## 6   30.0  9   6     15 0.4000000 -0.4054651
## 7   50.0 13   7     20 0.3500000 -0.6190392
## 8   70.0  1   0      1 0.0000000       -Inf
## 9   80.0  1   0      1 0.0000000       -Inf
## 10 100.0  4   1      5 0.2000000 -1.3862944
## 11 150.0  3   7     10 0.7000000  0.8472979

There are a lot of zero proportions, hence the \(-\infty\) in the logit. You can fix this (at least the interpretation of the logodds) by adding a constant (usually 0.5) to all cells conform the empirical logodds (see e.g. Cox and Snell 1989).

Add a new column where the log odds are calculated as: \[\log\text{odds} = \log\left(\frac{\text{yes} + 0.5}{\text{no} + 0.5}\right)\]

logitCandS <- function(yes, no) log((yes + .5) / (no + .5))
data <- 
  data %>% 
  mutate(logitCS = logitCandS(yes, no))
data

##     conc no yes margin      prop      logit    logitCS
## 1    0.1  7   0      7 0.0000000       -Inf -2.7080502
## 2    0.5  1   0      1 0.0000000       -Inf -1.0986123
## 3    1.0 10   3     13 0.2307692 -1.2039728 -1.0986123
## 4   10.0  9   4     13 0.3076923 -0.8109302 -0.7472144
## 5   20.0  2   0      2 0.0000000       -Inf -1.6094379
## 6   30.0  9   6     15 0.4000000 -0.4054651 -0.3794896
## 7   50.0 13   7     20 0.3500000 -0.6190392 -0.5877867
## 8   70.0  1   0      1 0.0000000       -Inf -1.0986123
## 9   80.0  1   0      1 0.0000000       -Inf -1.0986123
## 10 100.0  4   1      5 0.2000000 -1.3862944 -1.0986123
## 11 150.0  3   7     10 0.7000000  0.8472979  0.7621401

We can now see that the \(-\infty\) proportions are mostly gone

Fit the model with margin as the weights

fit <- 
  data %$%
  glm(prop ~ conc, family=binomial, weights=margin)

Look at the summary of the fitted object

summary(fit)

## 
## Call:
## glm(formula = prop ~ conc, family = binomial, weights = margin)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8159  -1.0552  -0.6878   0.3667   1.0315  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.32701    0.33837  -3.922 8.79e-05 ***
## conc         0.01215    0.00496   2.450   0.0143 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 16.683  on 10  degrees of freedom
## Residual deviance: 10.389  on  9  degrees of freedom
## AIC: 30.988
## 
## Number of Fisher Scoring iterations: 4

Inspect the plots number 1 and 5 for fit

plot(fit, which = c(1, 5))

The data set is small, but case 11 stands out in the Residuals vs. Leverage plot

conc is somewhat skewed. Run the model again with a log-transformation for conc.

To apply this in the model directly:

fit.log <- 
  data %$%
  glm(prop ~ I(log(conc)), family=binomial, weights=margin)

Look at the summary of the fitted object again

summary(fit.log)

## 
## Call:
## glm(formula = prop ~ I(log(conc)), family = binomial, weights = margin)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2510  -1.0599  -0.5029   0.3152   1.3513  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -1.7659     0.5209  -3.390 0.000699 ***
## I(log(conc))   0.3437     0.1440   2.387 0.016975 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 16.6834  on 10  degrees of freedom
## Residual deviance:  9.3947  on  9  degrees of freedom
## AIC: 29.994
## 
## Number of Fisher Scoring iterations: 4

The logodds now depict the unit increase in log(conc), instead of conc.

Inspects the plots number 1 and 5 of the fitted object based on log(conc).

plot(fit.log, which = c(1, 5))

Outliers are now less of an issue. This exercise demonstrates that data transformations may easily render our method more valid, but in exchange it makes our model interpretation more difficult. Parameters now have to be assessed in the log(conc) parameter space.

Use the brandsma data from package mice to fit a logistic regression model for sex based on lpo (Language Post Outcome).

brandsma.subset <- 
  brandsma %>%
  subset(!is.na(sex) & !is.na(lpo), select = c(sex, lpo))

fit <- 
  brandsma.subset %$%
  glm(sex ~ lpo, family=binomial(link='logit'))

fit %>%
  summary()

## 
## Call:
## glm(formula = sex ~ lpo, family = binomial(link = "logit"))
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.371  -1.152  -0.909   1.165   1.626  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.279636   0.156570  -8.173 3.01e-16 ***
## lpo          0.029727   0.003694   8.047 8.50e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5395.9  on 3893  degrees of freedom
## Residual deviance: 5329.5  on 3892  degrees of freedom
## AIC: 5333.5
## 
## Number of Fisher Scoring iterations: 4

With every unit increase in lpo, the logodds of gender increases by 0.0297266.

Obtain confidence intervals for the parameter estimates.

confint(fit)

## Waiting for profiling to be done...

##                  2.5 %      97.5 %
## (Intercept) -1.5879389 -0.97404439
## lpo          0.0225129  0.03699717

Use the model parameters to predict the sex variable and compare your predictions to the observed sex. We can obtain predictions by using function predict. The default predictions are on the scale of the linear predictors; the alternative “response” is on the scale of the response variable. Thus for a default binomial model the default predictions are of log-odds (probabilities on logit scale) and type = “response” gives the predicted probabilities.

To obtain the predicted logodds:

pred.logodds <- 
  fit %>%
  predict()
head(pred.logodds)

##           1           2           3           4           5           6 
##  0.20669266  0.08778637  0.05805980 -0.29865908  0.08778637 -0.68510453

and the predicted probabilities

pred.prob <- 
  fit %>%
  predict(type = "response")
head(pred.prob)

##         1         2         3         4         5         6 
## 0.5514900 0.5219325 0.5145109 0.4258853 0.5219325 0.3351230

We can then use the decision boundary pred.prob > .5 to assign cases to sex == 1 and the others to sex == 0.

pred <- numeric(length(pred.prob))
pred[pred.prob > .5] <- 1
pred <- unlist(pred)

To determine how many correct predictions we have, we can use

obs <- 
  brandsma %>%
  subset(!is.na(sex) & !is.na(lpo), select = sex)

sum(pred == obs) #total correct

## [1] 2143

mean(pred == obs) #average correct

## [1] 0.5503338

So we succesfully predict about half. That is not so good, because, based on chance alone, we would expect to successfully predict about half:

set.seed(123)

table(obs) #Observed distribution

## obs
##    0    1 
## 1995 1899

prop.obs <- #Observed proportions
  obs %>%
  table() %>%
  prop.table()

prop.obs

## .
##         0         1 
## 0.5123267 0.4876733

rand <- sample(c(0, 1), 3894, prob = c(.5123267, .4876733), replace = TRUE)
#random <- sample(0:1, nrow(obs), prob = prop.obs, replace = TRUE)

sum(rand == obs) #total correct

## [1] 1965

mean(rand == obs) #average correct

## [1] 0.5046225

An alternative way to test our model efficiency is with CVbinary():

CVbinary(glm(sex ~ lpo, family=binomial(link='logit'), data = brandsma.subset))

## 
## Fold:  4 6 2 10 3 7 5 9 8 1
## Internal estimate of accuracy = 0.55
## Cross-validation estimate of accuracy = 0.549

The lesson here is that a significant paramater has no meaning if the substantive interpretation of the effect is ignored. There is almost no relation, whatsoever, there is just sufficient data to deem the influence of lpo on sex worthy of significance.

In the data set minor.head.injury (from package DAAG), obtain a logistic regression model relating clinically.important.brain.injury to all the other variables.

Let us fit the model, predict clinically.important.brain.injury by all other variables in the data.

fit <- glm(clinically.important.brain.injury ~ ., family=binomial, data=head.injury) 
summary(fit)

## 
## Call:
## glm(formula = clinically.important.brain.injury ~ ., family = binomial, 
##     data = head.injury)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2774  -0.3511  -0.2095  -0.1489   3.0028  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -4.4972     0.1629 -27.611  < 2e-16 ***
## age.65                  1.3734     0.1827   7.518 5.56e-14 ***
## amnesia.before          0.6893     0.1725   3.996 6.45e-05 ***
## basal.skull.fracture    1.9620     0.2064   9.504  < 2e-16 ***
## GCS.decrease           -0.2688     0.3680  -0.730 0.465152    
## GCS.13                  1.0613     0.2820   3.764 0.000168 ***
## GCS.15.2hours           1.9408     0.1663  11.669  < 2e-16 ***
## high.risk               1.1115     0.1591   6.984 2.86e-12 ***
## loss.of.consciousness   0.9554     0.1959   4.877 1.08e-06 ***
## open.skull.fracture     0.6304     0.3151   2.001 0.045424 *  
## vomiting                1.2334     0.1961   6.290 3.17e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1741.6  on 3120  degrees of freedom
## Residual deviance: 1201.3  on 3110  degrees of freedom
## AIC: 1223.3
## 
## Number of Fisher Scoring iterations: 6

Patients whose risk is sufficiently high will be sent for CT (computed tomography). Using a risk threshold of 0.025 (2.5%), turn the result into a decision rule for use of CT.

A risk of 2.5% corresponds to the cutoff for a CT scan. This translates to a logit of \(\log\left(\frac{.025}{1-.025}\right) = -3.663562\). In other words, any sum of variables that “lifts” the intercept above -3.66 would satisfy the cutoff.

End of Practical

Practical L

Gerko Vink

Statistical Programming in R