Exercises


The following packages are required for this practical:

library(dplyr)
library(magrittr)
library(mice)
library(ggplot2)
library(DAAG)
library(MASS)

The data sets elastic1 and elastic2 from the package DAAG were obtained using the same apparatus, including the same rubber band, as the data frame elasticband.

  1. Using a different symbol and/or a different color, plot the data from the two data frames elastic1 and elastic2 on the same graph. Do the two sets of results appear consistent?
elastic <- rbind(elastic1, elastic2)
elastic$source <- c(rep("Elastic1", nrow(elastic1)), 
                    rep("Elastic2", nrow(elastic2)))

elastic %>%
  ggplot(aes(stretch, distance, colour = source)) +
  geom_point() + 
  geom_smooth(method = "lm")

The results seem very consistent: Data set elastic2 has more observations over a larger range, but both sets result in roughly the same regression line. Data set elastic1 seems to have an odd-one-out value.


  2. For each of the data sets elastic1 and elastic2, determine the regression of distance on stretch. In each case determine the fitted values and their standard errors, as well as the \(R^2\) statistic.

Compare the two sets of results. What is the key difference between the two sets of data?

First we run the two models:

fit1 <- 
  elastic1 %$%
  lm(distance ~ stretch)

fit2 <- 
  elastic2 %$%
  lm(distance ~ stretch)

and then we compare the fitted values and their standard errors:

fit1 %>% predict(se.fit = TRUE)
## $fit
##        1        2        3        4        5        6        7 
## 183.1429 235.7143 196.2857 209.4286 170.0000 156.8571 222.5714 
## 
## $se.fit
## [1]  6.586938 10.621119  5.891537  6.586938  8.331891 10.621119  8.331891
## 
## $df
## [1] 5
## 
## $residual.scale
## [1] 15.58754
fit2 %>% predict(se.fit = TRUE)
## $fit
##         1         2         3         4         5         6         7 
##  77.58333 196.58333 137.08333 166.83333 256.08333 226.33333 107.33333 
##         8         9 
## 226.33333 285.83333 
## 
## $se.fit
## [1] 6.740293 3.520003 4.358744 3.635444 5.060323 4.064550 5.453165 4.064550
## [9] 6.296773
## 
## $df
## [1] 7
## 
## $residual.scale
## [1] 10.44202

We see that fit1 (based on elastic1) has a larger residual standard deviation (i.e. $residual.scale).
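
The residual standard deviation can also be extracted directly from a model summary, where it is stored as sigma:

summary(fit1)$sigma
## [1] 15.58754
summary(fit2)$sigma
## [1] 10.44202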

To get the \(R^2\) we can run a summary on the fitted models:

fit1 %>% summary()
## 
## Call:
## lm(formula = distance ~ stretch)
## 
## Residuals:
##        1        2        3        4        5        6        7 
##  -0.1429 -18.7143  -7.2857  -1.4286   8.0000  -6.8571  26.4286 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -119.143     70.943  -1.679  0.15391   
## stretch        6.571      1.473   4.462  0.00663 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.59 on 5 degrees of freedom
## Multiple R-squared:  0.7992, Adjusted R-squared:  0.7591 
## F-statistic: 19.91 on 1 and 5 DF,  p-value: 0.006631
fit2 %>% summary()
## 
## Call:
## lm(formula = distance ~ stretch)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.0833  -7.0833  -0.5833   5.1667  20.1667 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -100.9167    15.6102  -6.465 0.000345 ***
## stretch        5.9500     0.3148  18.899 2.89e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.44 on 7 degrees of freedom
## Multiple R-squared:  0.9808, Adjusted R-squared:  0.978 
## F-statistic: 357.2 on 1 and 7 DF,  p-value: 2.888e-07

Or we can grab the \(R^2\) directly from the summary object, without a pipe:

summary(fit1)$r.squared
## [1] 0.7992446
summary(fit2)$r.squared
## [1] 0.9807775

The model based on elastic2 has smaller standard errors and a much larger \(R^2\). This is due to the larger range of values in elastic2, and the absence of an outlier.


  3. Study the residuals vs leverage plots for both models. Hint: use plot() on the fitted object.
fit1 %>% plot(which = 5) #the fifth plot is the residuals vs leverage plot

fit2 %>% plot(which = 5)

For elastic1, case 2 has the largest influence on the estimation. However, it is not the case with the largest residual:

fit1$residuals
##           1           2           3           4           5           6 
##  -0.1428571 -18.7142857  -7.2857143  -1.4285714   8.0000000  -6.8571429 
##           7 
##  26.4285714

As we can see, case 7 has the largest residual.
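
We can confirm this programmatically, and also inspect the ingredients of the residuals vs leverage plot (hatvalues() gives the leverage per case and cooks.distance() the overall influence):

which.max(abs(residuals(fit1))) # case 7 has the largest absolute residual
hatvalues(fit1)                 # leverage per case
cooks.distance(fit1)            # influence; case 2 pairs its residual with high leverage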


Because there is a single value that influences the estimation and is somewhat different from the other values, a robust form of regression may be advisable to obtain more stable estimates. When robust methods are used, we refrain from omitting a suspected outlier from our analysis. In general, with robust analysis, influential cases that do not conform to the other cases receive less weight in the estimation procedure than under non-robust analysis.


  4. Use the robust regression function rlm() from the MASS package to fit lines to the data in elastic1 and elastic2. Compare the results with those from lm():

First, we run the same models again with rlm()

fit1.rlm <- 
  elastic1 %$%
  rlm(distance ~ stretch)

fit2.rlm <- 
  elastic2 %$%
  rlm(distance ~ stretch)

and then we look at the coefficients and the residuals

data.frame(lm = coef(fit1), 
           rlm = coef(fit1.rlm))
##                      lm        rlm
## (Intercept) -119.142857 -95.747207
## stretch        6.571429   6.039709
data.frame(lm = coef(fit2), 
           rlm = coef(fit2.rlm))
##                    lm         rlm
## (Intercept) -100.9167 -103.055008
## stretch        5.9500    5.975157

We see that the coefficients for elastic1 are different for lm() and rlm(). The coefficients for elastic2 are very similar.

To study the standard errors of the coefficients:

data.frame(lm = summary(fit1)$coefficients[, "Std. Error"], 
           rlm = summary(fit1.rlm)$coefficients[, "Std. Error"])
##                    lm       rlm
## (Intercept) 70.943496 60.690050
## stretch      1.472884  1.260009
data.frame(lm = summary(fit2)$coefficients[, "Std. Error"], 
           rlm = summary(fit2.rlm)$coefficients[, "Std. Error"])
##                     lm        rlm
## (Intercept) 15.6101986 14.4955054
## stretch      0.3148387  0.2923567

The standard errors of the estimates for elastic1 are smaller with rlm() than with standard lm() estimation. The standard errors for elastic2 are very similar.

To study the residuals:

data.frame(lm = residuals(fit1), 
           rlm = residuals(fit1.rlm))
##            lm         rlm
## 1  -0.1428571   0.9205815
## 2 -18.7142857 -13.3970925
## 3  -7.2857143  -5.1588370
## 4  -1.4285714   1.7617445
## 5   8.0000000   8.0000000
## 6  -6.8571429  -7.9205815
## 7  26.4285714  30.6823260
data.frame(lm = residuals(fit2), 
           rlm = residuals(fit2.rlm))
##            lm        rlm
## 1  -6.5833333 -5.1997008
## 2  -0.5833333  0.2971601
## 3 -10.0833333 -8.9512703
## 4  20.1666667 21.1729449
## 5  -7.0833333 -6.4544094
## 6  -9.3333333 -8.5786247
## 7   6.6666667  7.9245144
## 8   1.6666667  2.4213753
## 9   5.1666667  5.6698058

The residual trend for both models is very similar. Remember that deviant values remain deviant under robust analyses; they are merely given less weight in the estimation.


To plot the residuals against the fitted values:

plot(fit1, which = 1, add.smooth = FALSE, col = "blue", main = "elastic1")
points(residuals(fit1.rlm) ~ fitted(fit1.rlm), col = "orange")

plot(fit2, which = 1, add.smooth = FALSE, col = "blue", main = "elastic2")
points(residuals(fit2.rlm) ~ fitted(fit2.rlm), col = "orange")

The case 2 residual in elastic1 is smaller in the robust regression. This is because the case had less weight in the rlm() estimation of the coefficients than in the ordinary lm() regression.
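
The weights from the final iteration of rlm()'s iteratively reweighted least squares procedure are stored in the fitted object as component w; cases with large standardized residuals receive weights below 1. A quick sketch to inspect them (output not shown):

data.frame(case = 1:7, weight = fit1.rlm$w)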


  5. Use the elastic2 variable stretch to obtain predictions from the model fitted on elastic1.
pred <- predict(fit1, newdata = data.frame(stretch = elastic2$stretch))
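
As an aside, predict() can also quantify the uncertainty of these predictions; a sketch, assuming we want 95% prediction intervals:

predict(fit1, newdata = data.frame(stretch = elastic2$stretch),
        interval = "prediction", level = 0.95)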

  6. Now make a scatterplot to investigate the similarity between the predicted values and the observed values for elastic2.
new.dat <- data.frame(stretch = rep(elastic2$stretch, 2), 
                      distance = c(elastic2$distance, pred))

new.dat$source <- c(rep("original", nrow(elastic2)), 
                    rep("predicted", nrow(elastic2)))

new.dat %>%
  ggplot(aes(stretch, distance, colour = source)) +
  geom_point() + 
  geom_smooth(method = "lm")

The predicted values are very similar to the observed values:

data.frame(distance = elastic2$distance, predicted = pred) %>%
  ggplot(aes(distance, predicted)) + 
  geom_point()

They do not strictly follow the straight line because there is some modeling error: we use elastic1’s model to predict elastic2’s distance [error source 1] and we compare those predictions to elastic2’s observed distance [error source 2]. Given these error sources, however, the predictions are very accurate and correlate highly with the observed values:

data.frame(distance = elastic2$distance, predicted = pred) %>%
  cor() 
##            distance predicted
## distance  1.0000000 0.9903421
## predicted 0.9903421 1.0000000

A recruiter for a large company suspects that the process his company uses to hire new applicants is biased. To test this, he records the application numbers of the candidates who were successfully hired in the last hiring round. He finds the following pattern:

numbers <- data.frame(hired = c(11, 19, 13, 4, 8, 4),
                      not_hired = c(89, 81, 87, 96, 92, 11))
numbers$probability <- round(with(numbers, hired / (hired + not_hired)), 2)
rownames(numbers) <- c(paste("Application number starts with", 0:5))
numbers
##                                  hired not_hired probability
## Application number starts with 0    11        89        0.11
## Application number starts with 1    19        81        0.19
## Application number starts with 2    13        87        0.13
## Application number starts with 3     4        96        0.04
## Application number starts with 4     8        92        0.08
## Application number starts with 5     4        11        0.27

  7. Investigate whether there is indeed a pattern: does the probability of being hired depend on the application number?
chisq <- chisq.test(numbers[, 1:2])
## Warning in chisq.test(numbers[, 1:2]): Chi-squared approximation may be
## incorrect
chisq$expected
##                                      hired not_hired
## Application number starts with 0 11.456311  88.54369
## Application number starts with 1 11.456311  88.54369
## Application number starts with 2 11.456311  88.54369
## Application number starts with 3 11.456311  88.54369
## Application number starts with 4 11.456311  88.54369
## Application number starts with 5  1.718447  13.28155
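
The common rule of thumb is that all expected cell frequencies should be at least 5; we can check this programmatically on the stored test object:

any(chisq$expected < 5)
## [1] TRUE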

Expected cell frequencies are very low for starting number 5, which explains the warning above. Let’s run Fisher’s exact test instead:

fisher.test(numbers[, 1:2])
## 
##  Fisher's Exact Test for Count Data
## 
## data:  numbers[, 1:2]
## p-value = 0.005174
## alternative hypothesis: two.sided

There seems to be a pattern: the probability of being hired differs markedly across starting numbers. Under the assumption of independence, the observed differences in hiring probability would be very unlikely.


  8. The researcher knows that application numbers are assigned to new applications based on the time and date the application was submitted. A colleague suggests that applicants who submit early in the process tend to be better prepared than applicants who submit later. Test this assumption by running a \(\chi^2\) test that compares the original data to the following pattern, where a drop of 2 percentage points over the starting numbers is expected.
decreasing <- data.frame(hired = c(16, 14, 12, 10, 8, 1),
                         not_hired = c(84, 86, 88, 91, 93, 14))
decreasing$probability <- round(with(decreasing, hired / (hired + not_hired)), 2)
decreasing
##   hired not_hired probability
## 1    16        84        0.16
## 2    14        86        0.14
## 3    12        88        0.12
## 4    10        91        0.10
## 5     8        93        0.08
## 6     1        14        0.07

To test this, we can compute the chi-squared statistic by hand, using the decreasing pattern as the expected frequencies:

chisq.decreasing <- sum((numbers[, 1:2] - decreasing[, 1:2])^2 / decreasing[, 1:2])
chisq.decreasing
## [1] 17.55956

The chi-squared value for this test is 17.5595631. The corresponding p-value is

1 - pchisq(chisq.decreasing, df = 5)
## [1] 0.003552193
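
Equivalently, and numerically more stable for very small p-values, we can ask pchisq() for the upper tail directly:

pchisq(chisq.decreasing, df = 5, lower.tail = FALSE)
## [1] 0.003552193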

Not successful: the decreasing-probability model does not fit the data the researcher has observed.


The board of the company would like to improve their process if the process is systematically biased. They tell the recruiter that their standard process in hiring people is as follows:

  1. The secretary sorts the applications by application number
  2. The board determines for every application if the applicant would be hired
  3. If half the vacancies are filled they take a coffee break
  4. After the coffee break they continue the same process to distribute the other applications over the remaining vacancies.

The recruiter suspects that the following psychological process is occurring: at the coffee break the board realized that they were running out of vacancies for the remaining half of the applications, became more conservative for a while, and returned to baseline towards the end.

If that were true, the following expected cell frequencies might be observed:

oops <- data.frame(hired = c(14, 14, 14, 2, 12, 3),
                   not_hired = c(86, 86, 86, 98, 88, 12))
oops$probability <- round(with(oops, hired / (hired + not_hired)), 2)
oops
##   hired not_hired probability
## 1    14        86        0.14
## 2    14        86        0.14
## 3    14        86        0.14
## 4     2        98        0.02
## 5    12        88        0.12
## 6     3        12        0.20

  9. Verify whether the oops pattern fits the observed pattern from the numbers data. Again, use a chi-squared test.

As before, we can compute the chi-squared statistic by hand, now using the oops pattern as the expected frequencies:

chisq.oops <- sum((numbers[, 1:2] - oops[, 1:2])^2 / oops[, 1:2])
chisq.oops
## [1] 6.879611

The chi-squared value for this test is 6.8796113. The corresponding p-value is

1 - pchisq(chisq.oops, df = 5)
## [1] 0.2297489

This model seems to fit the observations.


  10. Plot the probability against the starting numbers, using a different colour for each of the following patterns: observed, independence, decreasing and oops.
plotdata <- data.frame(count = c(numbers$hired,
                                 chisq$expected[, 1], 
                                 decreasing$hired,
                                 oops$hired),
                       total = rep(c(rep(100, 5), 15), 4),
                       start.nr = rep(0:5, 4),
                       model = rep(c("observed", 
                                     "independence",
                                     "decreasing", 
                                     "oops"),
                                   rep(6, 4)))
plotdata
##        count total start.nr        model
## 1  11.000000   100        0     observed
## 2  19.000000   100        1     observed
## 3  13.000000   100        2     observed
## 4   4.000000   100        3     observed
## 5   8.000000   100        4     observed
## 6   4.000000    15        5     observed
## 7  11.456311   100        0 independence
## 8  11.456311   100        1 independence
## 9  11.456311   100        2 independence
## 10 11.456311   100        3 independence
## 11 11.456311   100        4 independence
## 12  1.718447    15        5 independence
## 13 16.000000   100        0   decreasing
## 14 14.000000   100        1   decreasing
## 15 12.000000   100        2   decreasing
## 16 10.000000   100        3   decreasing
## 17  8.000000   100        4   decreasing
## 18  1.000000    15        5   decreasing
## 19 14.000000   100        0         oops
## 20 14.000000   100        1         oops
## 21 14.000000   100        2         oops
## 22  2.000000   100        3         oops
## 23 12.000000   100        4         oops
## 24  3.000000    15        5         oops
plotdata %>%
  mutate(proportion = count / total) %>%
  ggplot(aes(start.nr, proportion, colour = model)) + 
  geom_point() + 
  geom_smooth()


  11. Write a function that chooses automatically whether to run chisq.test() or fisher.test(). Create the function such that it takes two vectors as input, computes the expected cell frequencies, and runs Fisher’s exact test when any expected frequency is smaller than 5 and the chi-squared test otherwise.
contingencyTest <- function(x, y) {
  
  # Make a table out of the variables.
  tab <- table(x, y)
  
  # Expected cell frequencies under independence
  exp.freq <- (rowSums(tab) %*% t(colSums(tab))) / sum(tab)
  
  # Choose the correct test. 
  if (any(exp.freq < 5)) {
    results <- fisher.test(x, y)
    return(results)
  } else {
    results <- chisq.test(x, y)
    return(results)
  }
}

A more efficient function that tests whether the expected frequencies are smaller than a threshold \(t\) is:

contingencyTest2 <- function(x, y, t = 5){
  test <- suppressWarnings(chisq.test(x, y))
  if (any(test$expected < t)){
    test <- fisher.test(x, y)
  }
  return(test)
}

The suppressWarnings() function is used to suppress the warning that chisq.test() prints to the console when expected cell frequencies are too low. Since we decide to use Fisher’s exact test in those situations anyway, it is redundant to print the message.


  12. Test the function with the dataset bacteria (from MASS) by testing independence between the treatment indicator (ap) and compliance (hilo).
bacteria %$%
  contingencyTest(ap, hilo)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  x and y
## X-squared = 2.9352, df = 1, p-value = 0.08667
bacteria %$%
  contingencyTest2(ap, hilo)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  x and y
## X-squared = 2.9352, df = 1, p-value = 0.08667

  13. Does your function work differently if we only put in the first 25 rows of the bacteria dataset?
bacteria[1:25, ] %$%
  contingencyTest(ap, hilo)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  x and y
## p-value = 1
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##   0.1896616 10.6212859
## sample estimates:
## odds ratio 
##   1.408082
bacteria[1:25, ] %$%
  contingencyTest2(ap, hilo)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  x and y
## p-value = 1
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##   0.1896616 10.6212859
## sample estimates:
## odds ratio 
##   1.408082

Yes, it performs differently: for this subset some expected cell frequencies are smaller than 5, so both functions now run Fisher’s exact test. We can also verify the function on our numbers data from exercise 7, but then we first need to convert the data into vectors for hired and starting_numbers:

hired <- c(rep(rep("YES", 6), c(numbers$hired)),
           rep(rep("NO", 6), c(numbers$not_hired)))
starting_numbers <- c(rep(0:5, c(numbers$hired)), 
                      rep(0:5, c(numbers$not_hired)))

contingencyTest(hired, starting_numbers)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  x and y
## p-value = 0.005174
## alternative hypothesis: two.sided
contingencyTest2(hired, starting_numbers)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  x and y
## p-value = 0.005174
## alternative hypothesis: two.sided
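
Alternatively, since both chisq.test() and fisher.test() also accept a contingency table directly, a table-based variant of our function (contingencyTestTab is a hypothetical name) avoids the conversion to vectors altogether:

contingencyTestTab <- function(tab, t = 5) {
  test <- suppressWarnings(chisq.test(tab))
  if (any(test$expected < t)) {
    test <- fisher.test(tab)
  }
  return(test)
}

contingencyTestTab(as.matrix(numbers[, 1:2])) # Fisher's exact test; p-value = 0.005174, as above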

The mammalsleep dataset is part of mice. It contains the Allison and Cicchetti (1976) data for mammalian species. To learn more about this data, type

?mammalsleep

  14. Fit and inspect a model where brw is modeled from bw.
mammalsleep %$%
  lm(brw ~ bw) %>%
  anova()
## Analysis of Variance Table
## 
## Response: brw
##           Df   Sum Sq  Mean Sq F value    Pr(>F)    
## bw         1 46068314 46068314  411.19 < 2.2e-16 ***
## Residuals 60  6722239   112037                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It seems that we can model brain weight brw with body weight bw. If we inspect the linear model, we see that the \(R^2\) is quite high:

mammalsleep %$%
  lm(brw ~ bw) %>%
  summary()
## 
## Call:
## lm(formula = brw ~ bw)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -810.07  -88.52  -79.64  -13.02 2050.33 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 91.00440   43.55258    2.09   0.0409 *  
## bw           0.96650    0.04766   20.28   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 334.7 on 60 degrees of freedom
## Multiple R-squared:  0.8727, Adjusted R-squared:  0.8705 
## F-statistic: 411.2 on 1 and 60 DF,  p-value: < 2.2e-16

  15. Now fit and inspect a model where brw is predicted from both bw and species.
mammalsleep %$%
  lm(brw ~ bw + species) %>%
  anova()
## Warning in anova.lm(.): ANOVA F-tests on an essentially perfect fit are
## unreliable
## Analysis of Variance Table
## 
## Response: brw
##           Df   Sum Sq  Mean Sq F value Pr(>F)
## bw         1 46068314 46068314               
## species   60  6722239   112037               
## Residuals  0        0

There seems to be a perfect fit, and we obtain no p-values. If we inspect the model with summary(), we see why: every animal is observed only once. Adding species as a predictor therefore yields the most overfitted model we could obtain, and the residuals drop effectively to zero.
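
We can make the overfitting explicit by counting parameters: every species occurs exactly once, so the model spends one parameter per row of the data (on top of the intercept and bw, which also explains the singularity):

nrow(mammalsleep)
## [1] 62
nlevels(mammalsleep$species)
## [1] 62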

mammalsleep %$%
  lm(brw ~ bw + species) %>%
  summary()
## 
## Call:
## lm(formula = brw ~ bw + species)
## 
## Residuals:
## ALL 62 residuals are 0: no residual degrees of freedom!
## 
## Coefficients: (1 not defined because of singularities)
##                                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)                        13.5316         NA      NA       NA
## bw                                  0.8564         NA      NA       NA
## speciesAfrican giant pouched rat   -7.7880         NA      NA       NA
## speciesArctic Fox                  28.0695         NA      NA       NA
## speciesArctic ground squirrel      -8.6195         NA      NA       NA
## speciesAsian elephant            2408.2242         NA      NA       NA
## speciesBaboon                     156.9334         NA      NA       NA
## speciesBig brown bat              -13.2513         NA      NA       NA
## speciesBrazilian tapir             18.4448         NA      NA       NA
## speciesCat                          9.2423         NA      NA       NA
## speciesChimpanzee                 381.7987         NA      NA       NA
## speciesChinchilla                  -7.4956         NA      NA       NA
## speciesCow                         11.2436         NA      NA       NA
## speciesDesert hedgehog            -11.6026         NA      NA       NA
## speciesDonkey                     245.2365         NA      NA       NA
## speciesEastern American mole      -12.3958         NA      NA       NA
## speciesEchidna                      8.8992         NA      NA       NA
## speciesEuropean hedgehog          -10.7039         NA      NA       NA
## speciesGalago                      -8.7029         NA      NA       NA
## speciesGenet                        2.7609         NA      NA       NA
## speciesGiant armadillo             16.0846         NA      NA       NA
## speciesGiraffe                    213.4342         NA      NA       NA
## speciesGoat                        77.7805         NA      NA       NA
## speciesGolden hamster             -12.6344         NA      NA       NA
## speciesGorilla                    215.1941         NA      NA       NA
## speciesGray seal                  238.6746         NA      NA       NA
## speciesGray wolf                   74.8555         NA      NA       NA
## speciesGround squirrel             -9.6181         NA      NA       NA
## speciesGuinea pig                  -8.9222         NA      NA       NA
## speciesHorse                      195.2854         NA      NA       NA
## speciesJaguar                      57.8287         NA      NA       NA
## speciesKangaroo                    12.4945         NA      NA       NA
## speciesLesser short-tailed shrew  -13.3959         NA      NA       NA
## speciesLittle brown bat           -13.2902         NA      NA       NA
## speciesMan                       1253.3718         NA      NA       NA
## speciesMole rat                   -10.6361         NA      NA       NA
## speciesMountain beaver             -6.5877         NA      NA       NA
## speciesMouse                      -13.1513         NA      NA       NA
## speciesMusk shrew                 -13.2427         NA      NA       NA
## speciesN. American opossum         -8.6875         NA      NA       NA
## speciesNine-banded armadillo       -5.7290         NA      NA       NA
## speciesOkapi                      262.3691         NA      NA       NA
## speciesOwl monkey                   1.5573         NA      NA       NA
## speciesPatas monkey                92.9044         NA      NA       NA
## speciesPhanlanger                  -3.5190         NA      NA       NA
## speciesPig                          2.0401         NA      NA       NA
## speciesRabbit                      -3.5726         NA      NA       NA
## speciesRaccoon                     21.9962         NA      NA       NA
## speciesRat                        -11.8714         NA      NA       NA
## speciesRed fox                     33.2416         NA      NA       NA
## speciesRhesus monkey              159.6449         NA      NA       NA
## speciesRock hyrax (Hetero. b)      -1.8739         NA      NA       NA
## speciesRock hyrax (Procavia hab)    4.3854         NA      NA       NA
## speciesRoe deer                    71.9680         NA      NA       NA
## speciesSheep                      113.9384         NA      NA       NA
## speciesSlow loris                  -2.2305         NA      NA       NA
## speciesStar nosed mole            -12.5830         NA      NA       NA
## speciesTenrec                     -11.7023         NA      NA       NA
## speciesTree hyrax                  -2.9444         NA      NA       NA
## speciesTree shrew                 -11.1207         NA      NA       NA
## speciesVervet                      40.8801         NA      NA       NA
## speciesWater opossum              -12.6290         NA      NA       NA
## speciesYellow-bellied marmot            NA         NA      NA       NA
## 
## Residual standard error: NaN on 0 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:    NaN 
## F-statistic:   NaN on 61 and 0 DF,  p-value: NA

The analysis we ran is in fact equivalent to running a fixed-effects model on clusters of size 1. Since in that scenario there is no clustering, we should omit the fixed effect for species and simply model the random variation (here captured by the residual variance).


  16. Can you find a model that improves the \(R^2\) in modeling brw?

Since we have considered only linear models so far, I limit myself to linear models here. The basis of the linear model is the variance-covariance matrix and how it translates to relations in the data; these linear relations are most easily summarized in the correlation matrix:
mammalsleep %>%
  subset(select = -species) %>% #exclude factor species
  cor(use = "pairwise.complete.obs") #pairwise deletion
##              bw         brw        sws         ps         ts         mls
## bw   1.00000000  0.93416384 -0.3759462 -0.1093833 -0.3071859  0.30245056
## brw  0.93416384  1.00000000 -0.3692177 -0.1051388 -0.3581020  0.50925268
## sws -0.37594625 -0.36921766  1.0000000  0.5142539  0.9627147 -0.38443179
## ps  -0.10938331 -0.10513879  0.5142539  1.0000000  0.7270870 -0.29574535
## ts  -0.30718591 -0.35810203  0.9627147  0.7270870  1.0000000 -0.41020239
## mls  0.30245056  0.50925268 -0.3844318 -0.2957453 -0.4102024  1.00000000
## gt   0.65110218  0.74724248 -0.5947028 -0.4508987 -0.6313262  0.61484879
## pi   0.05949472  0.03385548 -0.3181846 -0.4474705 -0.3958350 -0.10254416
## sei  0.33827367  0.36780037 -0.5437566 -0.5372245 -0.6422845  0.36035221
## odi  0.13358123  0.14587888 -0.4838522 -0.5793365 -0.5877424  0.06177846
##             gt          pi        sei         odi
## bw   0.6511022  0.05949472  0.3382737  0.13358123
## brw  0.7472425  0.03385548  0.3678004  0.14587888
## sws -0.5947028 -0.31818462 -0.5437566 -0.48385220
## ps  -0.4508987 -0.44747050 -0.5372245 -0.57933653
## ts  -0.6313262 -0.39583497 -0.6422845 -0.58774241
## mls  0.6148488 -0.10254416  0.3603522  0.06177846
## gt   1.0000000  0.20050426  0.6382790  0.37861701
## pi   0.2005043  1.00000000  0.6182460  0.91604245
## sei  0.6382790  0.61824597  1.0000000  0.78720311
## odi  0.3786170  0.91604245  0.7872031  1.00000000

This matrix contains quite a few cells. To obtain only the correlations with brw we could select the respective column:

mammalsleep %>%
  subset(select = -species) %>% #exclude factor species
  cor(use = "pairwise.complete.obs") %>% #pairwise deletion
  subset(select = brw) #only column brw from the correlation matrix
##             brw
## bw   0.93416384
## brw  1.00000000
## sws -0.36921766
## ps  -0.10513879
## ts  -0.35810203
## mls  0.50925268
## gt   0.74724248
## pi   0.03385548
## sei  0.36780037
## odi  0.14587888

It seems that the following variables have a rather strong linear relation with brw: bw, mls, gt and, more moderately, sws, ts and sei.

However, from the larger correlation matrix we can also see that ts is highly collinear with sws; in fact, ts is calculated as the sum of sws and ps. Including both variables will not hurt our \(R^2\) per se, but it will certainly hurt the precision of our estimates, as including both will yield much larger standard errors. Statistically it may seem wise to select sws as the predictor: ts contains an additional source of error in the form of ps, so its linear association with brw is slightly weaker. However, sws is missing for 14 cases, whereas ts is missing for only 4.
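
As an aside, we can verify the documented ts = sws + ps relation on the cases where all three variables are observed (a quick sketch; this should return TRUE if the relation holds exactly):

cc <- complete.cases(mammalsleep[, c("sws", "ps", "ts")])
all.equal(mammalsleep$ts[cc], mammalsleep$sws[cc] + mammalsleep$ps[cc])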

mammalsleep %>%
  summary()
##                       species         bw                brw         
##  African elephant         : 1   Min.   :   0.005   Min.   :   0.14  
##  African giant pouched rat: 1   1st Qu.:   0.600   1st Qu.:   4.25  
##  Arctic Fox               : 1   Median :   3.342   Median :  17.25  
##  Arctic ground squirrel   : 1   Mean   : 198.790   Mean   : 283.13  
##  Asian elephant           : 1   3rd Qu.:  48.203   3rd Qu.: 166.00  
##  Baboon                   : 1   Max.   :6654.000   Max.   :5712.00  
##  (Other)                  :56                                       
##       sws               ps              ts             mls         
##  Min.   : 2.100   Min.   :0.000   Min.   : 2.60   Min.   :  2.000  
##  1st Qu.: 6.250   1st Qu.:0.900   1st Qu.: 8.05   1st Qu.:  6.625  
##  Median : 8.350   Median :1.800   Median :10.45   Median : 15.100  
##  Mean   : 8.673   Mean   :1.972   Mean   :10.53   Mean   : 19.878  
##  3rd Qu.:11.000   3rd Qu.:2.550   3rd Qu.:13.20   3rd Qu.: 27.750  
##  Max.   :17.900   Max.   :6.600   Max.   :19.90   Max.   :100.000  
##  NA's   :14       NA's   :12      NA's   :4       NA's   :4        
##        gt               pi             sei             odi       
##  Min.   : 12.00   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 35.75   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median : 79.00   Median :3.000   Median :2.000   Median :2.000  
##  Mean   :142.35   Mean   :2.871   Mean   :2.419   Mean   :2.613  
##  3rd Qu.:207.50   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :645.00   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##  NA's   :4

Therefore it is highly preferable to use ts in the model, despite its slightly weaker association.

We run the new model:

fit <- 
  mammalsleep %$%
  lm(brw ~ bw + ts + mls + gt + sei)
  
fit %>%
  summary()
## 
## Call:
## lm(formula = brw ~ bw + ts + mls + gt + sei)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -485.56 -100.64    0.64  104.72 1434.27 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -197.18186  198.55848  -0.993 0.325987    
## bw             0.81023    0.05633  14.383  < 2e-16 ***
## ts             4.90058   11.88823   0.412 0.682134    
## mls           10.58363    2.69053   3.934 0.000287 ***
## gt             1.23352    0.58171   2.121 0.039509 *  
## sei          -43.36629   34.80674  -1.246 0.219243    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 268.8 on 45 degrees of freedom
##   (11 observations deleted due to missingness)
## Multiple R-squared:  0.9373, Adjusted R-squared:  0.9304 
## F-statistic: 134.6 on 5 and 45 DF,  p-value: < 2.2e-16

and we have obtained a very high \(R^2\). If prediction were our goal, we would be doing great.


  17. Inspect the diagnostic plots for the model obtained under 16. What issues can you detect?
fit %>%
  plot(which = 1:6)

Some issues spring to mind. In particular, cases 1 and 5 stand out in the diagnostic plots:

mammalsleep$species[c(1, 5)]
## [1] African elephant Asian elephant  
## 62 Levels: African elephant African giant pouched rat ... Yellow-bellied marmot

If we sort the data in descending order of brw, together with its predictors

mammalsleep %>% 
  subset(select = c(species, brw, bw, ts, mls, gt, sei)) %>%
  arrange(desc(brw)) #sort the data in descending order based on brw
##                      species     brw       bw   ts   mls    gt sei
## 1           African elephant 5712.00 6654.000  3.3  38.6 645.0   5
## 2             Asian elephant 4603.00 2547.000  3.9  69.0 624.0   5
## 3                        Man 1320.00   62.000  8.0 100.0 267.0   1
## 4                    Giraffe  680.00  529.000   NA  28.0 400.0   5
## 5                      Horse  655.00  521.000  2.9  46.0 336.0   5
## 6                      Okapi  490.00  250.000   NA  23.6 440.0   5
## 7                 Chimpanzee  440.00   52.160  9.7  50.0 230.0   1
## 8                        Cow  423.00  465.000  3.9  30.0 281.0   5
## 9                     Donkey  419.00  187.100  3.1  40.0 365.0   5
## 10                   Gorilla  406.00  207.000 12.0  39.3 252.0   4
## 11                 Gray seal  325.00   85.000  6.2  41.0 310.0   3
## 12                       Pig  180.00  192.000  8.4  27.0 115.0   4
## 13                    Baboon  179.50   10.550  9.8  27.0 180.0   4
## 14             Rhesus monkey  179.00    6.800  9.6  29.0 164.0   3
## 15                     Sheep  175.00   55.500  3.8  20.0 151.0   5
## 16           Brazilian tapir  169.00  160.000  6.2  30.4 392.0   5
## 17                    Jaguar  157.00  100.000 10.8  22.4 100.0   1
## 18                 Gray wolf  119.50   36.330 13.0  16.2  63.0   1
## 19                      Goat  115.00   27.660  3.8  20.0 148.0   5
## 20              Patas monkey  115.00   10.000 10.9  20.2 170.0   4
## 21                  Roe deer   98.20   14.830  2.6  17.0 150.0   5
## 22           Giant armadillo   81.00   60.000 18.1   7.0    NA   1
## 23                    Vervet   58.00    4.190 10.3  24.0 210.0   3
## 24                  Kangaroo   56.00   35.000   NA  16.3  33.0   5
## 25                   Red fox   50.40    4.235  9.8   9.8  52.0   1
## 26                Arctic Fox   44.50    3.385 12.5  14.0  60.0   1
## 27                   Raccoon   39.20    4.288 12.5  13.7  63.0   2
## 28                       Cat   25.60    3.300 14.5  28.0  63.0   2
## 29                   Echidna   25.00    3.000  8.6  50.0  28.0   2
## 30 Rock hyrax (Procavia hab)   21.00    3.600  5.4   6.0 225.0   2
## 31                     Genet   17.50    1.410  6.1  34.0    NA   2
## 32     Yellow-bellied marmot   17.00    4.050   NA  13.0  38.0   1
## 33                Owl monkey   15.50    0.480 17.0  12.0 140.0   2
## 34                Slow loris   12.50    1.400 11.0  12.7  90.0   2
## 35    Rock hyrax (Hetero. b)   12.30    0.750  6.6   7.0 225.0   2
## 36                Tree hyrax   12.30    2.000  5.4   7.5 200.0   1
## 37                    Rabbit   12.10    2.500  8.4  18.0  31.0   5
## 38                Phanlanger   11.40    1.620 13.7  13.0  17.0   1
## 39     Nine-banded armadillo   10.80    3.500 17.4   6.5 120.0   1
## 40           Mountain beaver    8.10    1.350 11.2    NA  45.0   1
## 41 African giant pouched rat    6.60    1.000  8.3   4.5  42.0   1
## 42                Chinchilla    6.40    0.425 12.5   7.0 112.0   4
## 43       N. American opossum    6.30    1.700 19.4   5.0  12.0   1
## 44    Arctic ground squirrel    5.70    0.920 16.5    NA  25.0   2
## 45                Guinea pig    5.50    1.040  8.2   7.6  68.0   3
## 46                    Galago    5.00    0.200 10.7  10.4 120.0   2
## 47           Ground squirrel    4.00    0.101 13.8   9.0  28.0   1
## 48             Water opossum    3.90    3.500 19.4   3.0  14.0   1
## 49         European hedgehog    3.50    0.785 10.7   6.0  42.0   2
## 50                  Mole rat    3.00    0.122 10.6    NA  30.0   1
## 51                    Tenrec    2.60    0.900 13.3   4.5  60.0   1
## 52                Tree shrew    2.50    0.104 15.8   2.3  46.0   2
## 53           Desert hedgehog    2.40    0.550 10.3    NA    NA   1
## 54                       Rat    1.90    0.280 13.2   4.7  21.0   1
## 55     Eastern American mole    1.20    0.075  8.4   3.5  42.0   1
## 56            Golden hamster    1.00    0.120 14.4   3.9  16.0   1
## 57           Star nosed mole    1.00    0.060 10.3   3.5    NA   1
## 58                     Mouse    0.40    0.023 13.2   3.2  19.0   1
## 59                Musk shrew    0.33    0.048 12.8   2.0  30.0   1
## 60             Big brown bat    0.30    0.023 19.7  19.0  35.0   1
## 61          Little brown bat    0.25    0.010 19.9  24.0  50.0   1
## 62 Lesser short-tailed shrew    0.14    0.005  9.1   2.6  21.5   2

we see that Man has a large brw for a small bw, and that the African elephant is so extreme that it massively influences the model. For Man, the model would predict a much lighter brain given its body weight. We can also see this from the residuals:

fit %>%
  residuals()
##            1            2            3            5            6 
## -485.5560450  106.2290991   -1.1334890 1434.2683263  -14.2173431 
##            7            8            9           10           11 
##  -99.9739636 -382.2870299 -138.2708329 -222.1389154  103.2062362 
##           12           14           15           16           17 
## -222.9813136 -207.3502817  111.6721285 -299.3810807  119.0327425 
##           18           22           23           24           25 
##  -21.7754459   93.7473544  109.8701136 -176.6595017 -263.2907467 
##           26           27           28           29           30 
##   17.7387559   47.0471717  127.4386770 -268.6348480  -96.8256901 
##           32           33           34           37           38 
##  185.4170567 -172.4144182   83.3979884  118.9373998  119.9390604 
##           39           40           42           43           44 
##   82.6791039  -53.5732527  -83.9801061    0.6414618   24.9406553 
##           45           46           47           48           49 
##  -73.6937449  154.1785525   35.6757062  101.8866990   71.6287560 
##           50           51           52           53           54 
##  -55.4961698  -88.3634693  -65.5082879  122.5071227  127.4900821 
##           55           57           58           59           60 
##   -4.0547720   55.6039203 -101.3157666  127.8168865 -181.6354614 
##           61 
##   97.5209191

and from the influence statistics:

fit %>%
  influence()
## $hat
##          1          2          3          5          6          7 
## 0.90376652 0.09680085 0.03820138 0.27344636 0.05622069 0.11707227 
##          8          9         10         11         12         14 
## 0.19346612 0.06299083 0.13658848 0.09859497 0.07869705 0.12546971 
##         15         16         17         18         22         23 
## 0.09609692 0.24598832 0.03973070 0.03019447 0.12099853 0.04617505 
##         24         25         26         27         28         29 
## 0.10988139 0.09296178 0.03795117 0.04105129 0.05488669 0.09677947 
##         30         32         33         34         37         38 
## 0.04907583 0.06803407 0.13168851 0.54221947 0.04692941 0.04828122 
##         39         40         42         43         44         45 
## 0.10241613 0.11430997 0.11037221 0.06662589 0.04627464 0.07663133 
##         46         47         48         49         50         51 
## 0.20760998 0.02767828 0.04517846 0.06359325 0.02915966 0.13663756 
##         52         53         54         55         57         58 
## 0.15726711 0.13674991 0.11930401 0.02387457 0.04466402 0.19821965 
##         59         60         61 
## 0.06384060 0.04501389 0.10433937 
## 
## $coefficients
##      (Intercept)            bw           ts           mls            gt
## 1  -243.72029374 -7.891619e-01  8.300378873  1.961265e+00  1.213301e+00
## 2    25.21936417  1.413647e-03 -1.206573436 -7.335218e-02 -1.813002e-02
## 3    -0.08631166 -6.519435e-06  0.001850449 -8.986121e-05  6.447977e-05
## 5  -222.37630813 -3.864155e-03  9.482902916  1.328092e+00  1.088658e+00
## 6     1.23516961  1.979373e-04 -0.063616829 -4.345374e-03 -1.847447e-04
## 7    15.38832782 -5.432214e-04 -1.324156374 -9.241778e-02  8.203378e-04
## 8    60.48183044  2.901918e-02 -2.848182251  8.097885e-01 -3.426136e-01
## 9     8.08219132 -1.659263e-03 -0.708783091 -2.266736e-01  3.415038e-02
## 10  -14.83398360  5.099153e-03  0.549892324 -4.320579e-01 -5.267207e-02
## 11  -13.71999924 -9.170728e-04  0.867035841 -1.055549e-01  2.565440e-03
## 12   -1.05736367  2.688443e-03  0.579881544  5.396998e-02 -1.076290e-02
## 14    6.18986203  9.959258e-03  0.240514300  9.009225e-02 -8.913409e-02
## 15   26.18872034  1.394162e-03 -1.242033539 -8.959104e-02 -1.713371e-02
## 16  -50.31756815 -1.692566e-02  3.179504037 -1.672341e+00  3.352325e-01
## 17   10.32949936  1.392736e-03 -0.383105546 -4.747950e-02 -2.249698e-02
## 18   -0.98684647  2.228338e-04  0.015753756  1.986039e-02 -2.935611e-03
## 22    6.58455160  4.821783e-04 -0.638174756  1.542362e-02 -3.505366e-02
## 23    4.89650785  1.187494e-03  0.099911014 -4.859549e-02 -9.064587e-03
## 24   36.40960548  4.536888e-03 -2.056885018 -1.466931e-01 -4.167628e-02
## 25   -5.00633467  1.105221e-02  0.433175399 -2.350633e-02 -1.165109e-01
## 26    1.04073805  1.201451e-04 -0.010461765  5.052978e-03 -1.136219e-03
## 27    2.54661940  5.084744e-04  0.001183841 -2.497786e-03 -5.040769e-03
## 28   11.88649415  1.357163e-03 -0.633372764 -5.456064e-02 -3.247090e-02
## 29   -0.24983987  4.472249e-03  0.850871137 -2.243379e-01 -1.807907e-02
## 30  -10.57779705 -4.922710e-04  0.393768168 -5.105855e-02  5.179576e-03
## 32   28.53048895  3.570485e-03 -1.358210784 -9.209852e-02 -5.688596e-02
## 33   30.19122380 -5.338780e-04 -2.483945499 -2.252012e-01 -1.753357e-03
## 34   11.21681229 -1.485900e-04 -0.865126400  1.140465e+00 -5.130672e-02
## 37    9.71039304  1.351139e-03 -0.172191353 -6.465768e-02 -1.166554e-02
## 38   10.77073455  9.402313e-04 -0.228081945 -9.675266e-02 -5.032490e-03
## 39   -9.89672227  4.285150e-04  0.967966630 -2.807885e-02  4.079672e-03
## 40    5.96424958  1.452604e-03 -0.601268163  8.715142e-02 -2.837401e-02
## 42   14.81578562  2.318740e-03 -1.154080165  9.618280e-02 -3.784149e-02
## 43   -0.07314966 -1.065209e-05  0.004206938 -1.934365e-04  5.630308e-05
## 44    1.50944208  4.203137e-04 -0.014371516  1.266259e-02 -5.390016e-03
## 45    0.25180713 -1.320030e-03  0.064563528 -9.447721e-02  3.115094e-02
## 46   -4.27223545  5.971332e-03 -0.111275900  2.538248e-01 -1.365289e-01
## 47    0.53009765  2.395557e-04  0.037122001  4.450059e-03 -4.448297e-03
## 48    8.10621682  1.179136e-03 -0.141994812 -4.181767e-02 -1.091671e-02
## 49   12.33365624  7.470523e-04 -0.540374093 -1.717373e-02 -9.612321e-03
## 50    0.69553868  5.210781e-04 -0.047772953 -3.262922e-02  4.553270e-04
## 51  -12.34827888  3.811564e-03  0.555351409  2.515678e-01 -5.483223e-02
## 52  -12.25405271  2.786901e-03  0.603867004  1.979415e-01 -3.961636e-02
## 53   13.97782003  5.011877e-04 -1.167005941 -2.509449e-02 -4.491956e-02
## 54    8.86474762  7.004253e-04 -0.858771111  1.788077e-02 -4.676115e-02
## 55   -0.19784084  1.055703e-07  0.004440846  1.220367e-03  1.184715e-04
## 57    3.18619488 -5.497384e-05  0.004555809 -5.141468e-02  4.983695e-03
## 58  -28.56155358  3.739395e-03  1.395498566  2.787572e-01 -5.798662e-02
## 59   -9.63366731  4.197479e-05  0.961028516 -1.154272e-01  5.556994e-03
## 60    9.89309757  5.111625e-03 -0.700129056  9.372341e-02 -5.521970e-02
## 61  -11.66188823  3.311138e-04  1.152496607 -5.752401e-02  8.412294e-03
##            sei
## 1   9.81613002
## 2  -2.72951278
## 3   0.01557169
## 5  -7.18465245
## 6  -0.32753933
## 7  -0.83130974
## 8  -5.89595125
## 9  -1.44964422
## 10  7.80997880
## 11  3.65565323
## 12 -4.01918629
## 14 -1.92346523
## 15 -2.85032581
## 16 -0.20737504
## 17 -0.03834421
## 18  0.14299652
## 22  2.72227838
## 23 -0.68195638
## 24 -4.55163332
## 25  3.82628293
## 26 -0.21961298
## 27 -0.38536066
## 28  1.10977706
## 29 -3.38753239
## 30  2.00304757
## 32 -0.58636471
## 33 -1.41665399
## 34 -5.85319446
## 37 -1.15244417
## 38 -1.42227624
## 39  0.64273176
## 40  0.41679480
## 42 -0.73648205
## 43  0.01688128
## 44 -0.17609909
## 45 -1.92202225
## 46  9.05021867
## 47  0.11983892
## 48 -0.98553762
## 49 -1.50855921
## 50 -0.34524738
## 51  2.63589409
## 52  2.24214991
## 53  3.12404097
## 54  3.66670950
## 55  0.01182831
## 57 -0.71478964
## 58  5.47730002
## 59  1.58781903
## 60 -0.63930534
## 61  0.71772803
## 
## $sigma
##         1         2         3         5         6         7         8 
## 135.04872 271.35637 271.87903  97.82349 271.87013 271.40553 264.19703 
##         9        10        11        12        14        15        16 
## 271.02492 269.47973 271.38474 269.61397 269.81643 271.30182 266.86449 
##        17        18        22        23        24        25        26 
## 271.26167 271.85865 271.46086 271.34960 270.40967 268.66571 271.86541 
##        27        28        29        30        32        33        34 
## 271.78259 271.15991 268.51889 271.46669 270.33284 270.44438 271.24331 
##        37        38        39        40        42        43        44 
## 271.25800 271.24658 271.56058 271.74361 271.54753 271.87906 271.85182 
##        45        46        47        48        49        50        51 
## 271.63315 270.62231 271.82437 271.42429 271.64998 271.74646 271.50082 
##        52        53        54        55        57        58        59 
## 271.66616 271.15145 271.10660 271.87838 271.74378 271.34345 271.14870 
##        60        61 
## 270.43130 271.43491 
## 
## $wt.res
##            1            2            3            5            6 
## -485.5560450  106.2290991   -1.1334890 1434.2683263  -14.2173431 
##            7            8            9           10           11 
##  -99.9739636 -382.2870299 -138.2708329 -222.1389154  103.2062362 
##           12           14           15           16           17 
## -222.9813136 -207.3502817  111.6721285 -299.3810807  119.0327425 
##           18           22           23           24           25 
##  -21.7754459   93.7473544  109.8701136 -176.6595017 -263.2907467 
##           26           27           28           29           30 
##   17.7387559   47.0471717  127.4386770 -268.6348480  -96.8256901 
##           32           33           34           37           38 
##  185.4170567 -172.4144182   83.3979884  118.9373998  119.9390604 
##           39           40           42           43           44 
##   82.6791039  -53.5732527  -83.9801061    0.6414618   24.9406553 
##           45           46           47           48           49 
##  -73.6937449  154.1785525   35.6757062  101.8866990   71.6287560 
##           50           51           52           53           54 
##  -55.4961698  -88.3634693  -65.5082879  122.5071227  127.4900821 
##           55           57           58           59           60 
##   -4.0547720   55.6039203 -101.3157666  127.8168865 -181.6354614 
##           61 
##   97.5209191

From the influence statistics we see that case 1 (the African elephant) has by far the largest leverage (a hat value of 0.90) and by far the largest effect on the estimated coefficients.

The influence(fit)$coefficients element is equivalent to the dfbeta() return:

head(influence(fit)$coefficients)
##     (Intercept)            bw           ts           mls            gt
## 1 -243.72029374 -7.891619e-01  8.300378873  1.961265e+00  1.213301e+00
## 2   25.21936417  1.413647e-03 -1.206573436 -7.335218e-02 -1.813002e-02
## 3   -0.08631166 -6.519435e-06  0.001850449 -8.986121e-05  6.447977e-05
## 5 -222.37630813 -3.864155e-03  9.482902916  1.328092e+00  1.088658e+00
## 6    1.23516961  1.979373e-04 -0.063616829 -4.345374e-03 -1.847447e-04
## 7   15.38832782 -5.432214e-04 -1.324156374 -9.241778e-02  8.203378e-04
##           sei
## 1  9.81613002
## 2 -2.72951278
## 3  0.01557169
## 5 -7.18465245
## 6 -0.32753933
## 7 -0.83130974
fit %>%
  dfbeta() %>%
  head()
##     (Intercept)            bw           ts           mls            gt
## 1 -243.72029374 -7.891619e-01  8.300378873  1.961265e+00  1.213301e+00
## 2   25.21936417  1.413647e-03 -1.206573436 -7.335218e-02 -1.813002e-02
## 3   -0.08631166 -6.519435e-06  0.001850449 -8.986121e-05  6.447977e-05
## 5 -222.37630813 -3.864155e-03  9.482902916  1.328092e+00  1.088658e+00
## 6    1.23516961  1.979373e-04 -0.063616829 -4.345374e-03 -1.847447e-04
## 7   15.38832782 -5.432214e-04 -1.324156374 -9.241778e-02  8.203378e-04
##           sei
## 1  9.81613002
## 2 -2.72951278
## 3  0.01557169
## 5 -7.18465245
## 6 -0.32753933
## 7 -0.83130974
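
We can confirm this equivalence programmatically (all.equal() tolerates tiny floating-point differences; this should return TRUE):

all.equal(influence(fit)$coefficients, dfbeta(fit))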

End of Practical.