Practical G

If you need package plyr and package dplyr, always load package plyr first! If you load dplyr before plyr, plyr will produce a warning.

Exercises

In this exercise we will again be using random number generators. When using random numbers, it is wise to always fix the random seed (the starting point of the number generation). This ensures that in the future we can exactly reproduce the chain of executions that led to the randomly generated results.

Start by setting a random seed. If you follow my random seed and reproduce the code below in exactly the same order, you will get the same results. If you do not follow the exact ordering (i.e. if you skip or rerun a question, or have different code), your results may be different due to random sampling. This is not a bad thing! It is just a result of the method and should be that way.

#set random seed, make things reproducible
set.seed(123)

A group of experimenters have 4 experimental conditions they want to run. In each of these four conditions, there are seven manipulations that should be in a random order. Design the experiment such that for every condition, the seven manipulations are randomly ordered.

# Make an empty matrix. 
design <- matrix(NA, 7, 4)

# Name the columns and rows. 
colnames(design) <- paste("condition", 1:4, sep=" ")
rownames(design) <- paste("manipulation",  1:7, sep=" ")

# Put a random ordering in each column.
for(j in 1:ncol(design)) {
    design[, j] <- sample(1:7, replace=FALSE)
}

design

##                condition 1 condition 2 condition 3 condition 4
## manipulation 1           3           7           1           5
## manipulation 2           5           4           6           4
## manipulation 3           7           3           2           7
## manipulation 4           4           6           7           3
## manipulation 5           6           2           4           6
## manipulation 6           1           5           5           2
## manipulation 7           2           1           3           1

Generate a vector of 100 random standard normal numbers.

# Get random numbers.
y <- rnorm(100)

Compute the mean and standard deviation of the vector from question 2.

mean(y)

## [1] 0.005996774

sd(y)

## [1] 0.9135814

Generate a vector of 100 random standard normal numbers (like in Exercise 2) 25 times, and each time store means in object av. Compute the standard deviation of av.

# Repeat the sampling of numbers 25 times, each time getting the mean.
av <- numeric(25)
for(i in 1:25) {
    av[i] <- mean(rnorm(100))
}

# Standard error of sample mean.
sd(av)

## [1] 0.07753845

or alternatively with function rlply() from package plyr:

samples <- rlply(.n = 25, rnorm(100, mean = 0, sd = 1))

Function plyr::rlply() evaluates an expression \(n\) times and combines the results in a list. So the call rnorm(5000, mean = 0, sd = 1) is evaluated .n = 25 times.

We can then call the following sapply() statement to create object av:

av <- sapply(samples, mean)
sd(av) # Standard error of sample mean.

## [1] 0.09250157

Function sapply() evalutes the expression mean() over each listed element in samples. The result is a vector of length 100 with the means of the 100 samples from the samples object.

Create a function that automatically returns a vector like av.

# Create a function for sd of sample means
mean.av <- function(n = 100, reps = 25) {
    
  # Make an empty vector. 
  av <- numeric(reps)
  
  for(i in 1:reps) {
      
    # Make a random standard normal dataset of size n.
    y     <- rnorm(n)
    
    # Compute and save the mean. 
    av[i] <- mean(y)
  }
  
  # Return the vector of means. 
  av
}

sd(mean.av())

## [1] 0.1089382

or, with rlply():

# Create a function for sd of sample means
mean.av <- function(n = 100, reps = 25) {
  rlply(.n = reps, rnorm(n, mean = 0, sd = 1)) %>%
    sapply(mean)
}

sd(mean.av())

## [1] 0.08850599

Add the option to this function to print a density plot. Set it to TRUE by default.

# Create a function for sd of sample means
mean.av <- function(n = 100, reps = 25, plotDens = TRUE) {
    
  # Make an empty vector. 
  av <- numeric(reps)
  
  for(i in 1:reps) {
      
    # Make a random standard normal dataset of size n.
    y     <- rnorm(n)
    
    # Compute and save the mean. 
    av[i] <- mean(y)
  }
  
  if (plotDens) {
      plot(density(av), main = "Sampling distribution of the mean.")
  }
  
  # Return the vector of means. 
  av
}

The above function does two things:

It returns a vector of 25 means by default (the default argument for reps is reps=25)
It plots a densityplot of those 25 means, if wanted. By default it does this because the argument for plotting is plotDens=TRUE.

Let’s go through the function.

av <- numeric(reps). This line creates a vector with zeros that is long as the number of reps.
for(i in 1:reps) {. This line starts the for loop and dictates that we are goint to repeat the code within the loop 1 through 25 - so 25 times in total.
y <- rnorm(n). This samples n values from the standard normal distribution. Remember that the default argument for n is n=100. So, by default it draws 100 values. These values are stored in object y, such that we can use it later on in the function.
av[i] <- mean(y). Now we calculate the mean on the current (ith) sample and store that in the vector av as the ith element. Over all the for-loops we will replace each element in av with the mean of the respective simulated sample.
if (plotDens) { evaluates whether to execute the code within the { and } that define the if-statement. This code is designed to print a plot. The default for plotDens is plotDens=TRUE, so by default the if-statement will print the plot. if looks for TRUE or FALSE and only executes its code when it finds TRUE. Because plotDens is TRUE, we do not have to say if (plotDens = TRUE) and if(plotDens) suffices.
plot(density(av), main = "Sampling distribution of the mean.") plots a densityplot with main title ‘Sampling distribution of the mean.’.

or, with rlply() and a pipe:

# Create a function for sd of sample means
mean.av <- function(n = 100, reps = 25, plotDens = TRUE) {
  av <- 
    rlply(.n = reps, rnorm(n, mean = 0, sd = 1)) %>%
    sapply(mean) 
  if (plotDens) {
    density(av) %>%
      plot(main = "Sampling distribution of the mean.")
  }
  return(av)
}

sd(mean.av())

## [1] 0.09456325

In the next codeblock, we first set the graphical parameters to display a 2 by 2 matrix of plots. Then we run the function mean.av() 4 times; each time generating a plot. After running the function four times, we will have filled the 2 by 2 plot raster. We do not have to specify the arguments of this function because all arguments are set by default. We end by resetting the graphical parameters to its original state by stating par(mfrow = c(1, 1)), such that when we plot a new graph, it will be displayed as a single graph and not in a raster of 4 plots.

par(mfrow = c(2, 2))
mean.av()

##  [1]  0.068470639 -0.077818879 -0.091692048 -0.130617879  0.251717283
##  [6]  0.054858574 -0.036049450  0.104475079 -0.089264959  0.080493654
## [11] -0.024325038 -0.097794614  0.141381646  0.009766486  0.153739037
## [16] -0.105587448 -0.001736498  0.097002198  0.045997872 -0.012079486
## [21] -0.101956204 -0.061177495  0.063383078  0.065626215 -0.069429600

mean.av()

##  [1]  0.1953824109  0.1348506175 -0.0094928504  0.0835959877 -0.0173680343
##  [6] -0.1508615650 -0.0900577592  0.0001168825 -0.0216254082 -0.0420224319
## [11]  0.0207623279 -0.1178076124 -0.1084565496 -0.0379984450  0.0545270409
## [16] -0.0328482522 -0.1625490248 -0.0327697543 -0.0922515961 -0.0779403956
## [21]  0.0294707701  0.0130250264 -0.1876611681  0.0288191266  0.0300587766

mean.av()

##  [1]  0.018066207  0.071066840 -0.115361401  0.061331876 -0.025427969
##  [6] -0.012207763 -0.067216129  0.009949625 -0.093228489 -0.030386433
## [11] -0.092443682 -0.079677253  0.173949802 -0.065719830  0.140577349
## [16] -0.049671329 -0.122333707 -0.048353074  0.052410305 -0.117598169
## [21] -0.095278072  0.008703199 -0.071540490 -0.166771781  0.029540325

mean.av()

##  [1] -0.189430327 -0.066863184  0.226732126  0.154103786 -0.265004735
##  [6]  0.009260854  0.149866929 -0.169644176 -0.047387711  0.105561401
## [11] -0.064871635 -0.187243442  0.192679309  0.037218065  0.093439232
## [16]  0.009343280  0.051302357 -0.068104398 -0.176032151 -0.042986921
## [21] -0.012468267 -0.151543761  0.137205895 -0.031933394  0.096857442

and we return the graphing parameter to its previous state of a single plot (i.e. 1 row and 1 column).

par(mfrow = c(1, 1))

Generate a random sample of size 20 from a normal population with mean 100 and standard deviation 10.

rnorm(20, 100, 10)

##  [1] 118.08541 104.56931 119.38520  88.74277  87.08169 123.24315  95.05002
##  [8]  97.70985 102.79113 100.98217 102.30570 121.45238  94.62936  91.89949
## [15]  92.73861 111.96596 120.77000 104.17306  86.41646  90.04677

Use mfrow to set up the layout for a 3 by 4 array of plots. In the top 4 panels, show normal probability plots (‘QQ-plots’) for 4 separate “random” samples of size 10, all drawn from a normal distribution. In the middle 4 panels, display plots for samples of size 100. In the bottom 4 panels, display plots for samples of size 1000. Comment on how the appearance of the plots changes as the sample size changes.

par(mfrow=c(3, 4))

for (i in 1:3) {
  for (j in 1:4) {
    qqnorm(rnorm(10^i))
  }
}

Repeat exercise 8, but use runif instead of rnorm.

par(mfrow=c(3, 4))

for (i in 1:3) {
  for (j in 1:4) {
    qqnorm(runif(10^i))
  }
}

par(mfrow=c(1, 1))

Use the function rexp() to simulate 100 exponential random numbers with rate 0.2. Do the following on the simulated random numbers

Obtain a density plot for the observations.
Find the sample mean of the observations.
Compare with the population mean (the mean for an exponential population is 1/rate).

# Simulate numbers 
data <- rexp(100, .2) 

# Plot the data 
data %>%
  density(from = 0) %>%
  plot(main="Exponential with rate = 0.2")

# Comparison.
c("sample.mean" = mean(data), "pop.mean" = 1 / .2)

## sample.mean    pop.mean 
##    5.239311    5.000000

Fit the following linear models on the anscombe data:

y1 predicted by x1 - stored in object fit1
y2 predicted by x2 - stored in object fit2
y3 predicted by x3 - stored in object fit3
y4 predicted by x4 - stored in object fit4

fit1 <- anscombe %$%
  lm(y1 ~ x1)
fit2 <- anscombe %$%
  lm(y2 ~ x2)
fit3 <- anscombe %$%
  lm(y3 ~ x3)
fit4 <- anscombe %$%
  lm(y4 ~ x4)

`Create a data frame from the coefficients of the 4 fitted objects from Exercise 11

out <- data.frame(fit1 = coef(fit1),
                  fit2 = coef(fit2),
                  fit3 = coef(fit3),
                  fit4 = coef(fit4))
row.names(out) <- names(coef(fit1))
out

##                  fit1     fit2      fit3      fit4
## (Intercept) 3.0000909 3.000909 3.0024545 3.0017273
## x1          0.5000909 0.500000 0.4997273 0.4999091

These estimates are very similar.

Plot the four fitted models from Exercise 11 in a single plotting window. Make the points in the plots blue, gray, orange and purple, respectively.

plot(y1 ~ x1, col = "blue", data = anscombe)
points(y2 ~ x2, col = "gray", data = anscombe)
points(y3 ~ x3, col = "orange", data = anscombe)
points(y4 ~ x4, col = "purple", data = anscombe)

Now plot all four fitted models from Exercise 11 in a plotting window with 2 rows and 2 columns.

par(mfrow = c(2, 2))
plot(y1 ~ x1, data = anscombe)
plot(y2 ~ x2, data = anscombe)
plot(y3 ~ x3, data = anscombe)
plot(y4 ~ x4, data = anscombe)

End of practical.

Practical G

Gerko Vink

Statistical Programming in R

Exercises