Exercises


The following packages are required for this practical:

library(dplyr)
library(magrittr)
library(mice)
library(ggplot2)
library(DAAG)
library(MASS)

The data sets elastic1 and elastic2 from the package DAAG were obtained using the same apparatus, including the same rubber band, as the data frame elasticband.

  1. Using a different symbol and/or a different color, plot the data from the two data frames elastic1 and elastic2 on the same graph. Do the two sets of results appear consistent?

  1. For each of the data sets elastic1 and elastic2, determine the regression of distance on stretch. In each case determine:

Compare the two sets of results. What is the key difference between the two sets of data?


  1. Study the residuals vs leverage plots for both models. Hint: use plot() on the fitted object.

  1. Use the robust regression function rlm() from the MASS package to fit lines to the data in elastic1 and elastic2. Compare the results with those obtained from lm().

  1. Use the elastic2 variable stretch to obtain predictions from the model fitted on elastic1.

  1. Now make a scatterplot of the predicted values against the observed values for elastic2 to investigate how similar they are.
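One possible way to approach the regression exercises above is sketched below. It assumes the DAAG and MASS packages listed at the top of this practical are installed. The object names (fit1, rfit1, coefs, pred) are illustrative choices and not part of the exercises.

```r
library(DAAG)  # provides the elastic1 and elastic2 data frames
library(MASS)  # provides rlm()

# Ordinary least squares: regression of distance on stretch
fit1 <- lm(distance ~ stretch, data = elastic1)
fit2 <- lm(distance ~ stretch, data = elastic2)

# Robust counterparts of the same two models
rfit1 <- rlm(distance ~ stretch, data = elastic1)
rfit2 <- rlm(distance ~ stretch, data = elastic2)

# Collect the coefficients in one matrix for a side-by-side comparison
coefs <- rbind(lm_elastic1  = coef(fit1),
               rlm_elastic1 = coef(rfit1),
               lm_elastic2  = coef(fit2),
               rlm_elastic2 = coef(rfit2))
print(coefs)

# Residuals vs leverage is the fifth diagnostic panel of plot() on an lm object
plot(fit1, which = 5)
plot(fit2, which = 5)

# Predict elastic2 distances from the model fitted on elastic1
pred <- predict(fit1, newdata = elastic2)

# Predicted against observed; points near the dashed identity line agree well
plot(elastic2$distance, pred,
     xlab = "Observed distance (elastic2)",
     ylab = "Predicted distance (model fitted on elastic1)")
abline(0, 1, lty = 2)
```

Printing the coefficients together makes it easier to see whether the robust fit moves away from the least squares fit, which is one way to spot an influential observation.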

A recruiter for a large company suspects that the process his company uses to hire new applicants is biased. To test this, he records, by the first digit of the application number, how many applicants were and were not hired in the last hiring round. He finds the following pattern:

numbers <- data.frame(hired = c(11, 19, 13, 4, 8, 4),
                      not_hired = c(89, 81, 87, 96, 92, 11))
numbers$probability <- round(with(numbers, hired / (hired + not_hired)), 2)
rownames(numbers) <- c(paste("Application number starts with", 0:5))
numbers
##                                  hired not_hired probability
## Application number starts with 0    11        89        0.11
## Application number starts with 1    19        81        0.19
## Application number starts with 2    13        87        0.13
## Application number starts with 3     4        96        0.04
## Application number starts with 4     8        92        0.08
## Application number starts with 5     4        11        0.27

  1. Investigate whether there is indeed a pattern: does the probability of being hired depend a posteriori on the job application number?

  1. The recruiter knows that application numbers are assigned to new applications based on the time and date the application was submitted. A colleague suggests that applicants who submit early in the process tend to be better prepared than applicants who submit later. Test this assumption by running a \(X^2\) test that compares the original data to the following pattern, in which the probability of being hired drops by about 2 percentage points with each successive starting number.
decreasing <- data.frame(hired = c(16, 14, 12, 10, 8, 1),
                         not_hired = c(84, 86, 88, 91, 93, 14))
decreasing$probability <- round(with(decreasing, hired / (hired + not_hired)), 2)
decreasing
##   hired not_hired probability
## 1    16        84        0.16
## 2    14        86        0.14
## 3    12        88        0.12
## 4    10        91        0.10
## 5     8        93        0.08
## 6     1        14        0.07
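The two chi-squared exercises above could be approached along the following lines. This is a sketch under assumptions: the first test treats the numbers data as a 6 x 2 contingency table, and the second reads "compare to the pattern" as a goodness-of-fit test of the observed hired counts against the proportions implied by decreasing. Other readings are possible. The names indep and gof are illustrative. Because some expected counts are small, R will likely warn that the chi-squared approximation may be incorrect.

```r
# Data copied from the text above so the block runs on its own
numbers <- data.frame(hired     = c(11, 19, 13, 4, 8, 4),
                      not_hired = c(89, 81, 87, 96, 92, 11))
decreasing <- data.frame(hired     = c(16, 14, 12, 10, 8, 1),
                         not_hired = c(84, 86, 88, 91, 93, 14))

# Test of independence between starting digit (rows) and outcome (columns)
indep <- chisq.test(as.matrix(numbers[, c("hired", "not_hired")]))
indep

# Goodness of fit (assumed reading): do the observed hired counts follow
# the distribution over starting digits implied by the decreasing pattern?
gof <- chisq.test(numbers$hired,
                  p = decreasing$hired / sum(decreasing$hired))
gof

# Expected counts under each test; values below 5 explain the warnings
indep$expected
gof$expected
```

If the small expected counts are a concern, chisq.test() also accepts simulate.p.value = TRUE, which replaces the asymptotic approximation with a Monte Carlo p-value.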

The board of the company would like to improve their process if the process is systematically biased. They tell the recruiter that their standard process in hiring people is as follows:

  1. The secretary sorts the applications by application number
  2. The board determines for every application if the applicant would be hired
  3. If half the vacancies are filled they take a coffee break
  4. After the coffee break they continue the same process to distribute the other applications over the remaining vacancies.

The recruiter suspects that the following psychological process is occurring: at the coffee break, the board realized that they were running out of vacancies to award to the remaining half of the applications, became more conservative for a while, and returned to baseline in the end.

If that were true, the following expected cell frequencies might be observed:

oops <- data.frame(hired = c(14, 14, 14, 2, 12, 3),
                   not_hired = c(86, 86, 86, 98, 88, 12))
oops$probability <- round(with(oops, hired / (hired + not_hired)), 2)
oops
##   hired not_hired probability
## 1    14        86        0.14
## 2    14        86        0.14
## 3    14        86        0.14
## 4     2        98        0.02
## 5    12        88        0.12
## 6     3        12        0.20

  1. Verify whether the oops pattern fits the observed pattern in the numbers data. Again, use a chi-squared test.

  1. Plot the probability against the starting numbers and use different colours for each of the following patterns:

  1. Write a function that chooses automatically whether to do the chisq.test() or the fisher.test(). Create the function such that it:

  1. Test the function with the dataset bacteria (from MASS) by testing independence between compliance (hilo) and the presence or absence of disease (y).

  1. Does your function work differently if we only put in the first 25 rows of the bacteria dataset?
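A minimal sketch of such a function follows. The selection rule is an assumption, since the requirements are not spelled out here: it uses fisher.test() whenever any expected cell frequency falls below 5, and chisq.test() otherwise. The function name auto_test and its argument min_expected are hypothetical.

```r
library(MASS)  # provides the bacteria data

# Hypothetical helper: picks the test from the expected cell frequencies.
# Assumed rule: any expected count below min_expected triggers Fisher's test.
auto_test <- function(x, y, min_expected = 5) {
  tab <- table(x, y)
  # Expected counts under independence: row total * column total / grand total
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  if (any(expected < min_expected)) {
    method <- "fisher"
    result <- fisher.test(tab)
  } else {
    method <- "chisq"
    result <- chisq.test(tab)
  }
  list(table = tab, expected = expected, method = method, test = result)
}

# Full data set: compliance (hilo) against presence of bacteria (y)
full <- auto_test(bacteria$hilo, bacteria$y)
full$method
full$test

# First 25 rows only: smaller counts may lead to a different choice of test
first25 <- with(bacteria[1:25, ], auto_test(hilo, y))
first25$method
first25$test
```

Returning the table and the expected counts alongside the test result makes it easy to check why a particular test was chosen.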

The mammalsleep dataset is part of mice. It contains the Allison and Cicchetti (1976) data for mammalian species. To learn more about this data, type

?mammalsleep

  1. Fit and inspect a model where brw is modeled from bw

  1. Now fit and inspect a model where brw is predicted from both bw and species

  1. Can you find a model that improves the \(R^2\) in modeling brw?

  1. Inspect the diagnostic plots for the model obtained in the previous exercise. What issues can you detect?
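As a starting point for the mammalsleep exercises, the sketch below fits the models mentioned above and one candidate alternative. It assumes the mice package is installed. The log-log model is only one illustrative option, not necessarily the intended answer, and the object names are arbitrary. Note that \(R^2\) values computed on different response scales (brw versus log(brw)) are not directly comparable, so treat the comparison with caution.

```r
library(mice)  # provides the mammalsleep data

# Brain weight modeled from body weight
fit_bw <- lm(brw ~ bw, data = mammalsleep)
summary(fit_bw)

# Adding species: check the residual degrees of freedom before trusting the fit
fit_species <- lm(brw ~ bw + species, data = mammalsleep)
df.residual(fit_species)

# Candidate alternative (an assumption): both variables on the log scale
fit_log <- lm(log(brw) ~ log(bw), data = mammalsleep)

r2_raw <- summary(fit_bw)$r.squared
r2_log <- summary(fit_log)$r.squared
c(raw = r2_raw, log_log = r2_log)

# Diagnostic plots: residuals vs fitted and normal Q-Q for the candidate model
plot(fit_log, which = 1:2)
```

Comparing the same diagnostic panels for fit_bw and fit_log shows how a transformation can change the residual pattern as well as the \(R^2\).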

End of Practical JK.