Practical D - Data generation

Drawing random values

Draw 10 values from a standard normal distribution and store the values in an object called draw1.

draw1 <- rnorm(n = 10)

The call rnorm(n = 10) results in a draw of 10 values from a standard normal distribution, a distribution with mean zero and variance equal to 1. Drawing from a standard normal distribution is the default in rnorm because the default function arguments for rnorm() specify:

args(rnorm)

## function (n, mean = 0, sd = 1) 
## NULL

We can see that by default, values a drawn with mean = 0 and sd = 1. Unless we specify other values for the mean and standard deviation, the defaults will be used. Defaults are a useful property in a developers toolset. We will discuss defaults in more detail later today.

Verify that the means and variances for object draw1 are indeed conform the default arguments of function rnorm()

mean(draw1)

## [1] 0.07462564

sd(draw1)

## [1] 0.9537841

The means and variances are close, but not equal. However, we’ve only drawn 10 values from a theoretical distribution. It is statistically not reasonable to expect that a single draw of 10 values will result in an unbiased estimate of the (infinitely large) population parameters.

Draw 1000 values from a standard normal distribution and store the values in an object called draw2.

draw2 <- rnorm(1000)

Note that I do not have to specify n = 1000: by default, the ordering of the arguments is used to evaluate a function call. So rnorm(40, 20, 5) would generate a draw of 40 values from a normal distribution with mean equal to 20 and a standard deviation of 5.

Determine which object has less bias; draw1 or draw2

means <- c(0, mean(draw1), mean(draw2))
sds <- c(1, sd(draw1), sd(draw2))
n <- c(Inf, 10, 1000)
result <- data.frame(n=n, mean=means, sd=sds)
row.names(result) <- c("population", "draw1", "draw2")
result

##               n       mean        sd
## population  Inf 0.00000000 1.0000000
## draw1        10 0.07462564 0.9537841
## draw2      1000 0.01459110 0.9957478

We have now created a data.frame with the results. Perhaps this dataframe becomes easier to read with some rounding:

round(result, 3)

##               n  mean    sd
## population  Inf 0.000 1.000
## draw1        10 0.075 0.954
## draw2      1000 0.015 0.996

It is clear that the larger sample size will, in general, yield less bias. However, the bias we obtain is directly dependent on the chosen random seed. For example, if we repeat the above process with set.seed(125), we obtain completely different results:

set.seed(125)
draw1b <- rnorm(10)
draw2b <- rnorm(1000)
means <- c(0, mean(draw1b), mean(draw2b))
sds <- c(1, sd(draw1b), sd(draw2b))
result <- data.frame(n=n, mean=means, sd=sds)
row.names(result) <- c("population", "draw1 (seed=125)", "draw2 (seed=125)")
round(result, 3)

##                     n   mean    sd
## population        Inf  0.000 1.000
## draw1 (seed=125)   10  0.053 1.042
## draw2 (seed=125) 1000 -0.058 1.009

Now draw2 has a larger absolute bias from the mean, despite its larger sample size. Note that I renamed the objects to draw1b and draw2b because otherwise I would overwrite draw1 and draw2. Overwriting these objects would be inefficient, because we will use these objects later on in this exercise. If, by accident we would have overwritten these objects, we can simply obtain them again by re-fixing the random seed and parsing the exact same function calls. Also note that I do not recreate object n as it has not changed. I can simply re-use the object n that is already in our Global Environment.

The lesson so far is: When you fix your random seed, do not forget that your results are dependent on this random seed. It may be the case that results you obtain are a mere fluke: extremely rare, but obtained because you use a random seed. It would be nice to automate any random number process with different seeds to obtain the sampling distribution of estimates. This we will do on Friday when we consider Monte Carlo simulation.

In the previous exercise we have changed the random seed to 125.

Re-fix the random seed to 123

set.seed(123)

Draw objects draw1 and draw2 again, but name them draw3 and draw4, respectively.

draw3 <- rnorm(10)
draw4 <- rnorm(1000)

Draw object draw5: 22 values from a normal distribution with mean = 8 and variance = 144.

draw5 <- rnorm(22, 8, sqrt(144)) #square root of 144

Function rnorm() takes the standard deviation \(\sigma\), not the variance \(\sigma^2\). Hence we take the square root of the variance to obtain the standard deviation cf.

\[\sigma=\sqrt{\sigma^2} = \sqrt{144} = 12\]

Verify if object draw1 is equal to draw3 and if object draw2 is equal to draw4.

all.equal(draw1, draw3)

## [1] TRUE

all.equal(draw2, draw4)

## [1] TRUE

The objects are indeed equal because they originated in an identical order since the same random seed was specified. In other words, draw1 and draw3 were generated in random step 1 since set.seed(123) and draw2 and draw4 were both generated in step 2 since set.seed(123). If we would have reversed the order in which these objects were generated, then the objects would not be equal. For example:

set.seed(123)
draw4 <- rnorm(1000)
draw3 <- rnorm(10)
all.equal(draw1, draw3)

## [1] "Mean relative difference: 1.634686"

all.equal(draw2, draw4)

## [1] "Mean relative difference: 1.409047"

Because the objects are not equal, mean differences are automatically printed by function all.equal().

Fun fact: the first 10 cases in draw1 and draw4 are equal:

cbind(draw1[1:10], draw4[1:10])

##              [,1]        [,2]
##  [1,] -0.56047565 -0.56047565
##  [2,] -0.23017749 -0.23017749
##  [3,]  1.55870831  1.55870831
##  [4,]  0.07050839  0.07050839
##  [5,]  0.12928774  0.12928774
##  [6,]  1.71506499  1.71506499
##  [7,]  0.46091621  0.46091621
##  [8,] -1.26506123 -1.26506123
##  [9,] -0.68685285 -0.68685285
## [10,] -0.44566197 -0.44566197

and if we recreate draw5, the results are also equal.

draw5b <- rnorm(22, 8, sqrt(144))
all.equal(draw5, draw5b)

## [1] TRUE

This is because these values and objects follow the same steps since set.seed(123). The first 10 cases in draw1 and draw4 both stem from the first step since set.seed(123) and generating draw5 and draw5b has been the third random step since set.seed(123) in both random chains.

Draw 1000 values from the following distributions and inspect them with summary()

a normal distribution with mean = 50 and sd = 20,
a t-distribution with df = 11 degrees of freedom,
a uniform distribution with minimum min = 5 and maximum max = 6,
a binomial distribution with size = 1 trials and prob = .8 probablity if success on each trial,
an F-distribution with df1 = 1 and df2 = 2 degrees of freedom,
a poisson distribution with lambda = 5 as the mean.

First,

draw.norm <- rnorm(1000, mean = 50, sd = 20)
draw.t <- rt(1000, df = 11)
draw.unif <- runif(1000, min = 5, max = 6)
draw.binom <- rbinom(1000, size = 1, prob = .8)
draw.F <- rf(1000, df1 = 1, df2 = 2)
draw.pois <- rpois(1000, lambda = 5)

It may ease interpretation to create a data.frame from these results and call summary on the data frame.

data <- data.frame(normal = draw.norm, 
                   t = draw.t, 
                   uniform = draw.unif, 
                   binomial = draw.binom,
                   Fdistr = draw.F,
                   poisson = draw.pois)
summary(data)

##      normal             t                uniform         binomial    
##  Min.   :-10.96   Min.   :-4.206696   Min.   :5.000   Min.   :0.000  
##  1st Qu.: 36.94   1st Qu.:-0.697476   1st Qu.:5.247   1st Qu.:1.000  
##  Median : 50.73   Median : 0.006237   Median :5.490   Median :1.000  
##  Mean   : 50.69   Mean   :-0.041201   Mean   :5.493   Mean   :0.797  
##  3rd Qu.: 64.83   3rd Qu.: 0.645770   3rd Qu.:5.726   3rd Qu.:1.000  
##  Max.   :117.81   Max.   : 3.539230   Max.   :6.000   Max.   :1.000  
##      Fdistr            poisson      
##  Min.   :  0.0000   Min.   : 0.000  
##  1st Qu.:  0.1274   1st Qu.: 3.000  
##  Median :  0.6643   Median : 5.000  
##  Mean   :  6.0563   Mean   : 4.973  
##  3rd Qu.:  2.6684   3rd Qu.: 6.000  
##  Max.   :937.0277   Max.   :14.000

Plot histograms for the generated values from exercise 10 to inspect the shape of the respective distributions. Use breaks = 25 for the number of breakpoints in the histogram

hist(draw.norm, breaks = 25)

hist(draw.t, breaks = 25)

hist(draw.unif, breaks = 25)

hist(draw.binom, breaks = 25)

hist(draw.F, breaks = 25)

hist(draw.pois, breaks = 25)

Alternatively, we can automate this procedure with a simple single line call:

invisible(apply(data, 2, function(x) hist(x, breaks = 25)))

I use the invisible() function to omit the printing of the histogram data and only printing the histogram itself.

We will learn all about apply() in the next meeting for today.

We can use random number generation to simulate data. For example, we can use sample() to simulate rolling a single die 100 times:

die.roll <- sample(1:6, size = 100, replace = TRUE)

We have to set replace = TRUE to sample with replacement because the sample size size = 100 is larger than the set 1:6 =1, 2, 3, 4, 5, 6.

Calculate the probability for obtaining each side of the die in die.roll.

die.roll.tbl <- table(die.roll)
prop.table(die.roll.tbl)

## die.roll
##    1    2    3    4    5    6 
## 0.20 0.23 0.14 0.15 0.14 0.14

We see that in this try, the values 1 and 2 have a much higher probability than the other values. There results are obtained by first creating a frequency table out of die.roll:

table(die.roll)

## die.roll
##  1  2  3  4  5  6 
## 20 23 14 15 14 14

and then calculating the probability of the frequencies, given the total number of rolls

prop.table(table(die.roll))

## die.roll
##    1    2    3    4    5    6 
## 0.20 0.23 0.14 0.15 0.14 0.14

Simulate two rolling dice and calculate the probability of obtaining two sixes over 1500 trials

possible <- rep(1:6, rep(6, 6)) + rep(1:6, 6)
possible

##  [1]  2  3  4  5  6  7  3  4  5  6  7  8  4  5  6  7  8  9  5  6  7  8  9
## [24] 10  6  7  8  9 10 11  7  8  9 10 11 12

dice.roll <- sample(possible, size = 1500, replace = TRUE)
dice.roll.tbl <-  table(dice.roll) 
prop.table(dice.roll.tbl)

## dice.roll
##          2          3          4          5          6          7 
## 0.02733333 0.05333333 0.09200000 0.11733333 0.13600000 0.16466667 
##          8          9         10         11         12 
## 0.14666667 0.11666667 0.06733333 0.05133333 0.02733333

The probability equals 0.0273333, which can simply be obtained by calculating the mean over the logical evaluation (i.e. number of TRUE’s given the total sample size 1500)

mean(dice.roll == 12)

## [1] 0.02733333

Replicate the following empirical example by drawing from a theoretical distribution: 14563 students made an exam, but only 3589 successfully passed:

One way to do this is with rbinom()

prob.pass <- 3589/14563
passfail <- rbinom(14563, size = 1, prob = prob.pass)
table(passfail)

## passfail
##     0     1 
## 10988  3575

However, we can also obtain an approximation with sample()

passfail2 <- sample(c("pass", "fail"), 
                    14563, 
                    prob = c(prob.pass, 1-prob.pass), 
                    replace = TRUE)
table(passfail2)

## passfail2
##  fail  pass 
## 11028  3535

So far, we have only considered univariate distributions. Let’s generate some multivariate data

Draw two variables (aka sets or features) of size 1000 from a multivariate normal distribution, both with mean 5 and variance 1 and make sure that they correlate \(\rho = .8\).

There are two ways to go about this.

Approach 1: generate two univariate normal distributions and make them correlate:

rho = .8
var1 <- rnorm(1000, 5, sqrt(1)) #first variable
var2 <- scale(var1) * rho + rnorm(1000, 0, sqrt(1 - rho^2)) #second variable step 1
var2 <- var2 * sd(var1) + mean(var1) #second variable step 2
data <- data.frame(var1 = var1, var2 = var2) #combined data frame
colMeans(data) #means

##     var1     var2 
## 4.991663 4.950090

var(data) #variance/covariance matrix

##           var1      var2
## var1 0.9983921 0.7873003
## var2 0.7873003 0.9816118

cor(var1, var2) #correlations

##           [,1]
## [1,] 0.7952798

We see that the means are indeed approximately 5, the variances approximate 1 and the correlation is approximately .8. In the case where the variance equals 1, the correlation is equal to the covariance. For example, if the variance were exactly 1, the correlation would be:

\[\rho_{X,Y} = \frac{\text{COV(X,Y)}}{\sigma_X\sigma_Y} = \frac{.7873003}{1\times1} = .7873003.\]

In our case, the correlation equals:

\[\rho_{X,Y} = \frac{.7873003}{\sqrt{.9983921}\times\sqrt{.9816118}} = .7952798 \approx .80.\]

Approach 2: generate multivariate data from the multivariate normal distribution:

#you need function rmvnorm from package mvtnorm
sigma <- matrix(c(1, .8, .8, 1), nrow = 2, ncol = 2) #variance/covariance matrix
data <- mvtnorm::rmvnorm(n = 1000, sigma = sigma, mean = c(5, 5))
colMeans(data)

## [1] 5.024864 4.996324

var(data)

##           [,1]      [,2]
## [1,] 0.9616942 0.7817377
## [2,] 0.7817377 0.9881583

Approach two yields approximately the same result as approach 1. Note that both approaches are equally valid: differences between them are only due to random variation. If we would simulate the processes repeatedly an infinite (\(\infty\)) number of times, the average parameters over these repeated simulations will converge to the population parameters.

End of Practical

Practical D - Data generation

Gerko Vink

Statistical Programming in R

Exercises

Drawing random values