We use the following packages in this Practical:
library(MASS)
library(plyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(magrittr)
library(ggplot2)
If you need package plyr
and package dplyr
, always load package plyr
first! If you load dplyr
before plyr
, plyr
will produce a warning.
In this exercise we will again be using random number generators. When using random numbers, it is wise to always fix the random seed (the starting point of the number generation). This ensures that in the future we can exactly reproduce the chain of executions that led to the randomly generated results.
Start by setting a random seed. If you follow my random seed and reproduce the code below in exactly the same order, you will get the same results. If you do not follow the exact ordering (i.e. if you skip or rerun a question, or have different code), your results may be different due to random sampling. This is not a bad thing! It is just a result of the method and should be that way.
#set random seed, make things reproducible
set.seed(123)
# Make an empty matrix.
design <- matrix(NA, 7, 4)
# Name the columns and rows.
colnames(design) <- paste("condition", 1:4, sep=" ")
rownames(design) <- paste("manipulation", 1:7, sep=" ")
# Put a random ordering in each column.
for(j in 1:ncol(design)) {
design[, j] <- sample(1:7, replace=FALSE)
}
design
## condition 1 condition 2 condition 3 condition 4
## manipulation 1 3 7 1 5
## manipulation 2 5 4 6 4
## manipulation 3 7 3 2 7
## manipulation 4 4 6 7 3
## manipulation 5 6 2 4 6
## manipulation 6 1 5 5 2
## manipulation 7 2 1 3 1
# Get random numbers.
y <- rnorm(100)
mean(y)
## [1] 0.005996774
sd(y)
## [1] 0.9135814
av
. Compute the standard deviation of av
.# Repeat the sampling of numbers 25 times, each time getting the mean.
av <- numeric(25)
for(i in 1:25) {
av[i] <- mean(rnorm(100))
}
# Standard error of sample mean.
sd(av)
## [1] 0.07753845
or alternatively with function rlply()
from package plyr
:
samples <- rlply(.n = 25, rnorm(100, mean = 0, sd = 1))
Function plyr::rlply()
evaluates an expression \(n\) times and combines the results in a list. So the call rnorm(5000, mean = 0, sd = 1)
is evaluated .n = 25
times.
We can then call the following sapply()
statement to create object av
:
av <- sapply(samples, mean)
sd(av) # Standard error of sample mean.
## [1] 0.09250157
Function sapply()
evalutes the expression mean()
over each listed element in samples
. The result is a vector of length 100
with the means of the 100 samples from the samples
object.
av
.# Create a function for sd of sample means
mean.av <- function(n = 100, reps = 25) {
# Make an empty vector.
av <- numeric(reps)
for(i in 1:reps) {
# Make a random standard normal dataset of size n.
y <- rnorm(n)
# Compute and save the mean.
av[i] <- mean(y)
}
# Return the vector of means.
av
}
sd(mean.av())
## [1] 0.1089382
or, with rlply()
:
# Create a function for sd of sample means
mean.av <- function(n = 100, reps = 25) {
rlply(.n = reps, rnorm(n, mean = 0, sd = 1)) %>%
sapply(mean)
}
sd(mean.av())
## [1] 0.08850599
# Create a function for sd of sample means
mean.av <- function(n = 100, reps = 25, plotDens = TRUE) {
# Make an empty vector.
av <- numeric(reps)
for(i in 1:reps) {
# Make a random standard normal dataset of size n.
y <- rnorm(n)
# Compute and save the mean.
av[i] <- mean(y)
}
if (plotDens) {
plot(density(av), main = "Sampling distribution of the mean.")
}
# Return the vector of means.
av
}
The above function does two things:
reps=25
)plotDens=TRUE
.Let’s go through the function.
av <- numeric(reps)
. This line creates a vector with zeros that is long as the number of reps
.for(i in 1:reps) {
. This line starts the for loop and dictates that we are goint to repeat the code within the loop 1 through 25 - so 25 times in total.y <- rnorm(n)
. This samples n
values from the standard normal distribution. Remember that the default argument for n
is n=100
. So, by default it draws 100
values. These values are stored in object y
, such that we can use it later on in the function.av[i] <- mean(y)
. Now we calculate the mean on the current (ith) sample and store that in the vector av
as the ith element. Over all the for-loops we will replace each element in av
with the mean of the respective simulated sample.if (plotDens) {
evaluates whether to execute the code within the {
and }
that define the if-statement. This code is designed to print a plot. The default for plotDens
is plotDens=TRUE
, so by default the if-statement will print the plot. if
looks for TRUE
or FALSE
and only executes its code when it finds TRUE
. Because plotDens
is TRUE
, we do not have to say if (plotDens = TRUE)
and if(plotDens)
suffices.plot(density(av), main = "Sampling distribution of the mean.")
plots a densityplot with main title ‘Sampling distribution of the mean.’.or, with rlply()
and a pipe:
# Create a function for sd of sample means
mean.av <- function(n = 100, reps = 25, plotDens = TRUE) {
av <-
rlply(.n = reps, rnorm(n, mean = 0, sd = 1)) %>%
sapply(mean)
if (plotDens) {
density(av) %>%
plot(main = "Sampling distribution of the mean.")
}
return(av)
}
sd(mean.av())
## [1] 0.09456325
In the next codeblock, we first set the graphical parameters to display a 2 by 2 matrix of plots. Then we run the function mean.av()
4 times; each time generating a plot. After running the function four times, we will have filled the 2 by 2 plot raster. We do not have to specify the arguments of this function because all arguments are set by default. We end by resetting the graphical parameters to its original state by stating par(mfrow = c(1, 1))
, such that when we plot a new graph, it will be displayed as a single graph and not in a raster of 4 plots.
par(mfrow = c(2, 2))
mean.av()
## [1] 0.068470639 -0.077818879 -0.091692048 -0.130617879 0.251717283
## [6] 0.054858574 -0.036049450 0.104475079 -0.089264959 0.080493654
## [11] -0.024325038 -0.097794614 0.141381646 0.009766486 0.153739037
## [16] -0.105587448 -0.001736498 0.097002198 0.045997872 -0.012079486
## [21] -0.101956204 -0.061177495 0.063383078 0.065626215 -0.069429600
mean.av()
## [1] 0.1953824109 0.1348506175 -0.0094928504 0.0835959877 -0.0173680343
## [6] -0.1508615650 -0.0900577592 0.0001168825 -0.0216254082 -0.0420224319
## [11] 0.0207623279 -0.1178076124 -0.1084565496 -0.0379984450 0.0545270409
## [16] -0.0328482522 -0.1625490248 -0.0327697543 -0.0922515961 -0.0779403956
## [21] 0.0294707701 0.0130250264 -0.1876611681 0.0288191266 0.0300587766
mean.av()
## [1] 0.018066207 0.071066840 -0.115361401 0.061331876 -0.025427969
## [6] -0.012207763 -0.067216129 0.009949625 -0.093228489 -0.030386433
## [11] -0.092443682 -0.079677253 0.173949802 -0.065719830 0.140577349
## [16] -0.049671329 -0.122333707 -0.048353074 0.052410305 -0.117598169
## [21] -0.095278072 0.008703199 -0.071540490 -0.166771781 0.029540325
mean.av()
## [1] -0.189430327 -0.066863184 0.226732126 0.154103786 -0.265004735
## [6] 0.009260854 0.149866929 -0.169644176 -0.047387711 0.105561401
## [11] -0.064871635 -0.187243442 0.192679309 0.037218065 0.093439232
## [16] 0.009343280 0.051302357 -0.068104398 -0.176032151 -0.042986921
## [21] -0.012468267 -0.151543761 0.137205895 -0.031933394 0.096857442
and we return the graphing parameter to its previous state of a single plot (i.e. 1 row and 1 column).
par(mfrow = c(1, 1))
rnorm(20, 100, 10)
## [1] 118.08541 104.56931 119.38520 88.74277 87.08169 123.24315 95.05002
## [8] 97.70985 102.79113 100.98217 102.30570 121.45238 94.62936 91.89949
## [15] 92.73861 111.96596 120.77000 104.17306 86.41646 90.04677
mfrow
to set up the layout for a 3 by 4 array of plots. In the top 4 panels, show normal probability plots (‘QQ-plots’) for 4 separate “random” samples of size 10, all drawn from a normal distribution. In the middle 4 panels, display plots for samples of size 100. In the bottom 4 panels, display plots for samples of size 1000. Comment on how the appearance of the plots changes as the sample size changes.par(mfrow=c(3, 4))
for (i in 1:3) {
for (j in 1:4) {
qqnorm(rnorm(10^i))
}
}
runif
instead of rnorm
.par(mfrow=c(3, 4))
for (i in 1:3) {
for (j in 1:4) {
qqnorm(runif(10^i))
}
}
par(mfrow=c(1, 1))
rexp()
to simulate 100 exponential random numbers with rate 0.2. Do the following on the simulated random numbers # Simulate numbers
data <- rexp(100, .2)
# Plot the data
data %>%
density(from = 0) %>%
plot(main="Exponential with rate = 0.2")
# Comparison.
c("sample.mean" = mean(data), "pop.mean" = 1 / .2)
## sample.mean pop.mean
## 5.239311 5.000000
y1
predicted by x1
- stored in object fit1
y2
predicted by x2
- stored in object fit2
y3
predicted by x3
- stored in object fit3
y4
predicted by x4
- stored in object fit4
fit1 <- anscombe %$%
lm(y1 ~ x1)
fit2 <- anscombe %$%
lm(y2 ~ x2)
fit3 <- anscombe %$%
lm(y3 ~ x3)
fit4 <- anscombe %$%
lm(y4 ~ x4)
out <- data.frame(fit1 = coef(fit1),
fit2 = coef(fit2),
fit3 = coef(fit3),
fit4 = coef(fit4))
row.names(out) <- names(coef(fit1))
out
## fit1 fit2 fit3 fit4
## (Intercept) 3.0000909 3.000909 3.0024545 3.0017273
## x1 0.5000909 0.500000 0.4997273 0.4999091
These estimates are very similar.
blue
, gray
, orange
and purple
, respectively. plot(y1 ~ x1, col = "blue", data = anscombe)
points(y2 ~ x2, col = "gray", data = anscombe)
points(y3 ~ x3, col = "orange", data = anscombe)
points(y4 ~ x4, col = "purple", data = anscombe)
par(mfrow = c(2, 2))
plot(y1 ~ x1, data = anscombe)
plot(y2 ~ x2, data = anscombe)
plot(y3 ~ x3, data = anscombe)
plot(y4 ~ x4, data = anscombe)
End of practical.