Statistical Programming with R

Packages and functions that we use

library(mvtnorm)  # Multivariate Normal tools
  • table(): Build a contingency table out of counts
  • hist(): Draw a histogram
  • mvtnorm::rmvnorm(): Drawing values from a multivariate normal distribution
  • sample(): Probabilistic sampling with or without replacement
  • rchisq(): Probabilistic sampling from a \(X2\) distribution
  • rpois(): Probabilistic sampling from a poisson distribution
  • set.seed(123): Fixing the Random Number Generator seed

Creating our own data

We saw that we could create vectors

c(1, 2, 3, 4, 3, 2, 1)
## [1] 1 2 3 4 3 2 1
c(1:4, 3:1)
## [1] 1 2 3 4 3 2 1

We could create matrices

matrix(1:15, nrow = 3, ncol = 5, byrow = TRUE)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]    6    7    8    9   10
## [3,]   11   12   13   14   15

Creating our own data

Vectors from matrices

mat <- matrix(1:18, 3, 6)
mat
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    4    7   10   13   16
## [2,]    2    5    8   11   14   17
## [3,]    3    6    9   12   15   18
c(mat)
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18

But we can also draw a sample

Of the same size,

sample(mat)
##  [1] 12  1  7  4 18  5 14 15  9  6  8 13 10 11 17  3 16  2

smaller size,

sample(mat, size = 5)
## [1]  7  3 15 12  4

or larger size

sample(mat, size = 50, replace = TRUE)
##  [1] 17  9  1 13 13  5  5  4  1 16  5  7 12 14 12 10  5  8 10  4 17 16  9
## [24]  4  8 16 18 11  8  1  4  2  6 14 15 17  4  7  4  7 16  6 17  4 12  4
## [47]  3 11 18  6

We can do random sampling

hist(sample(mat, size = 50000, replace=TRUE), breaks = 0:18)

Or nonrandom sampling

probs <- c(rep(.01, 15), .1, .25, .50)
probs
##  [1] 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
## [15] 0.01 0.10 0.25 0.50
hist(sample(mat, size = 50000, prob = probs, replace=TRUE), breaks = 0:18)

We can replicate individual samples

set.seed(123)
sample(mat, size = 50, replace = TRUE)
##  [1]  6 15  8 16 17  1 10 17 10  9 18  9 13 11  2 17  5  1  6 18 17 13 12
## [24] 18 12 13 10 11  6  3 18 17 13 15  1  9 14  4  6  5  3  8  8  7  3  3
## [47]  5  9  5 16
set.seed(123)
sample(mat, size = 50, replace = TRUE)
##  [1]  6 15  8 16 17  1 10 17 10  9 18  9 13 11  2 17  5  1  6 18 17 13 12
## [24] 18 12 13 10 11  6  3 18 17 13 15  1  9 14  4  6  5  3  8  8  7  3  3
## [47]  5  9  5 16

We can replicate a chain of samples

set.seed(123)
sample(mat, size = 5, replace = TRUE)
## [1]  6 15  8 16 17
sample(mat, size = 7, replace = TRUE)
## [1]  1 10 17 10  9 18  9
set.seed(123)
sample(mat, size = 5, replace = TRUE)
## [1]  6 15  8 16 17
sample(mat, size = 7, replace = TRUE)
## [1]  1 10 17 10  9 18  9

The random seed

The random seed is a number used to initialize the pseudo-random number generator

If replication is needed, pseudorandom number generators must be used

  • Pseudorandom number generators generate a sequence of numbers
  • The properties of generated number sequences approximates the properties of random number sequences
  • Pseudorandom number generators are not truly random, because the process is determined by an initial value.

The initial value (the seed) itself does not need to be random.

  • The resulting process is random because the seed value is not used to generate randomness
  • It merely forms the starting point of the algorithm for which the results are random.

Why fix the random seed

When an R instance is started, there is initially no seed. In that case, R will create one from the current time and process ID.

  • Hence, different sessions will give different results when random numbers are involved.
  • When you store a workspace and reopen it, you will continue from the seed specified within the workspace.

If we fix the random seed we can exactly replicate the random process

If the method has not changed: the results of the process will be identical when using the same seed.

  • Replications allows for verification
  • But beware: the process depends on the seed
    • The results obtained could theoretically be extremely rare and would not have occured with every other potential seed
    • Run another seed before publishing your results

Random seeds

HTML5 Icon

Random processes

HTML5 Icon

Drawing data

We can draw data from a standard normal distribution

hist(rnorm(1000, mean = 5, sd = 1))

Drawing data

We can draw data from a specific normal distribution

hist(rnorm(1000, mean = 50, sd = 15))

Drawing data: many values

hist(rnorm(100000, mean = 50, sd = 15))

Drawing data: other distribution

\(X^2\)-distribution with \(df = 1\) degrees of freedom

hist(rchisq(500, df = 1))

Drawing data: other distribution

\(X^2\)-distribution with \(df = 100\) degrees of freedom

hist(rchisq(500, df = 100))

Drawing data: other distribution

Poisson distribution with mean \(\lambda = 1\)

hist(rpois(5000, lambda = 1))

Drawing data: other distribution

Poisson distribution with mean \(\lambda = 10\)

hist(rpois(5000, lambda = 100))

Drawing data: other distribution

Binomial distribution: 1000 draws of 1 trial with \(P(\text{success}=.75)\)

run <- rbinom(1000, size = 1, prob = .75) 
table(run)
## run
##   0   1 
## 248 752

Drawing data: other distribution

Binomial distribution: 10.000 draws of 60 trials with \(P(\text{success}=.75)\)

run <- rbinom(10000, size = 60, prob = .75)
table(run)
## run
##   32   33   34   35   36   37   38   39   40   41   42   43   44   45   46 
##    1    7   10   21   29   67  137  257  395  614  783  980 1098 1197 1137 
##   47   48   49   50   51   52   53   54   55   56 
## 1053  835  571  376  200  136   62   25    7    2

Drawing multivariate distributions

sigma <- matrix(c(30, 15, 15, 30), 2, 2)
sigma
##      [,1] [,2]
## [1,]   30   15
## [2,]   15   30
means <- c(97, 63)
data <- rmvnorm(n = 1000, mean = means, sigma = sigma)
colMeans(data)
## [1] 96.77040 62.85105
var(data)
##          [,1]     [,2]
## [1,] 31.44549 15.04467
## [2,] 15.04467 27.47605

Bivariate normal

plot(data)

Drawing multivariate distributions

sigma <- matrix(c(1, 0, 0, 1), 2, 2)
sigma
##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1
means <- c(0, 0)
data <- rmvnorm(n = 1000, mean = means, sigma = sigma)
colMeans(data)
## [1]  0.009235345 -0.002294830
var(data)
##             [,1]        [,2]
## [1,] 1.076408982 0.006907679
## [2,] 0.006907679 0.865091698

Bivariate normal

plot(data)

Histograms

par(mfrow = c(1, 2))
hist(data[, 1])
hist(data[, 2])

To conclude

Practical