Generating Data

Statistical Programming with R

Packages and functions that we use

library(mvtnorm)  # Multivariate Normal tools

table(): Build a contingency table out of counts
hist(): Draw a histogram
mvtnorm::rmvnorm(): Drawing values from a multivariate normal distribution
sample(): Probabilistic sampling with or without replacement
rchisq(): Probabilistic sampling from a \(X2\) distribution
rpois(): Probabilistic sampling from a poisson distribution
set.seed(123): Fixing the Random Number Generator seed

Creating our own data

We saw that we could create vectors

c(1, 2, 3, 4, 3, 2, 1)

## [1] 1 2 3 4 3 2 1

c(1:4, 3:1)

## [1] 1 2 3 4 3 2 1

We could create matrices

matrix(1:15, nrow = 3, ncol = 5, byrow = TRUE)

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]    6    7    8    9   10
## [3,]   11   12   13   14   15

Creating our own data

Vectors from matrices

mat <- matrix(1:18, 3, 6)
mat

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    4    7   10   13   16
## [2,]    2    5    8   11   14   17
## [3,]    3    6    9   12   15   18

c(mat)

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18

But we can also draw a sample

Of the same size,

sample(mat)

##  [1] 12  1  7  4 18  5 14 15  9  6  8 13 10 11 17  3 16  2

smaller size,

sample(mat, size = 5)

## [1]  7  3 15 12  4

or larger size

sample(mat, size = 50, replace = TRUE)

##  [1] 17  9  1 13 13  5  5  4  1 16  5  7 12 14 12 10  5  8 10  4 17 16  9
## [24]  4  8 16 18 11  8  1  4  2  6 14 15 17  4  7  4  7 16  6 17  4 12  4
## [47]  3 11 18  6

We can do random sampling

hist(sample(mat, size = 50000, replace=TRUE), breaks = 0:18)

Or nonrandom sampling

probs <- c(rep(.01, 15), .1, .25, .50)
probs

##  [1] 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
## [15] 0.01 0.10 0.25 0.50

hist(sample(mat, size = 50000, prob = probs, replace=TRUE), breaks = 0:18)

We can replicate individual samples

set.seed(123)
sample(mat, size = 50, replace = TRUE)

##  [1]  6 15  8 16 17  1 10 17 10  9 18  9 13 11  2 17  5  1  6 18 17 13 12
## [24] 18 12 13 10 11  6  3 18 17 13 15  1  9 14  4  6  5  3  8  8  7  3  3
## [47]  5  9  5 16

set.seed(123)
sample(mat, size = 50, replace = TRUE)

##  [1]  6 15  8 16 17  1 10 17 10  9 18  9 13 11  2 17  5  1  6 18 17 13 12
## [24] 18 12 13 10 11  6  3 18 17 13 15  1  9 14  4  6  5  3  8  8  7  3  3
## [47]  5  9  5 16

We can replicate a chain of samples

set.seed(123)
sample(mat, size = 5, replace = TRUE)

## [1]  6 15  8 16 17

sample(mat, size = 7, replace = TRUE)

## [1]  1 10 17 10  9 18  9

set.seed(123)
sample(mat, size = 5, replace = TRUE)

## [1]  6 15  8 16 17

sample(mat, size = 7, replace = TRUE)

## [1]  1 10 17 10  9 18  9

The random seed

The random seed is a number used to initialize the pseudo-random number generator

If replication is needed, pseudorandom number generators must be used

Pseudorandom number generators generate a sequence of numbers
The properties of generated number sequences approximates the properties of random number sequences
Pseudorandom number generators are not truly random, because the process is determined by an initial value.

The initial value (the seed) itself does not need to be random.

The resulting process is random because the seed value is not used to generate randomness
It merely forms the starting point of the algorithm for which the results are random.

Why fix the random seed

When an R instance is started, there is initially no seed. In that case, R will create one from the current time and process ID.

Hence, different sessions will give different results when random numbers are involved.
When you store a workspace and reopen it, you will continue from the seed specified within the workspace.

If we fix the random seed we can exactly replicate the random process

If the method has not changed: the results of the process will be identical when using the same seed.

Replications allows for verification
But beware: the process depends on the seed
- The results obtained could theoretically be extremely rare and would not have occured with every other potential seed
- Run another seed before publishing your results

Random seeds

Random processes

Drawing data

We can draw data from a standard normal distribution

hist(rnorm(1000, mean = 5, sd = 1))

Drawing data

We can draw data from a specific normal distribution

hist(rnorm(1000, mean = 50, sd = 15))

Drawing data: many values

hist(rnorm(100000, mean = 50, sd = 15))

Drawing data: other distribution

\(X^2\)-distribution with \(df = 1\) degrees of freedom

hist(rchisq(500, df = 1))

Drawing data: other distribution

\(X^2\)-distribution with \(df = 100\) degrees of freedom

hist(rchisq(500, df = 100))

Drawing data: other distribution

Poisson distribution with mean \(\lambda = 1\)

hist(rpois(5000, lambda = 1))

Drawing data: other distribution

Poisson distribution with mean \(\lambda = 10\)

hist(rpois(5000, lambda = 100))

Drawing data: other distribution

Binomial distribution: 1000 draws of 1 trial with \(P(\text{success}=.75)\)

run <- rbinom(1000, size = 1, prob = .75) 
table(run)

## run
##   0   1 
## 248 752

Drawing data: other distribution

Binomial distribution: 10.000 draws of 60 trials with \(P(\text{success}=.75)\)

run <- rbinom(10000, size = 60, prob = .75)
table(run)

## run
##   32   33   34   35   36   37   38   39   40   41   42   43   44   45   46 
##    1    7   10   21   29   67  137  257  395  614  783  980 1098 1197 1137 
##   47   48   49   50   51   52   53   54   55   56 
## 1053  835  571  376  200  136   62   25    7    2

Drawing multivariate distributions

sigma <- matrix(c(30, 15, 15, 30), 2, 2)
sigma

##      [,1] [,2]
## [1,]   30   15
## [2,]   15   30

means <- c(97, 63)
data <- rmvnorm(n = 1000, mean = means, sigma = sigma)
colMeans(data)

## [1] 96.77040 62.85105

var(data)

##          [,1]     [,2]
## [1,] 31.44549 15.04467
## [2,] 15.04467 27.47605

Bivariate normal

plot(data)

Drawing multivariate distributions

sigma <- matrix(c(1, 0, 0, 1), 2, 2)
sigma

##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1

means <- c(0, 0)
data <- rmvnorm(n = 1000, mean = means, sigma = sigma)
colMeans(data)

## [1]  0.009235345 -0.002294830

var(data)

##             [,1]        [,2]
## [1,] 1.076408982 0.006907679
## [2,] 0.006907679 0.865091698

Bivariate normal

plot(data)

Histograms

par(mfrow = c(1, 2))
hist(data[, 1])
hist(data[, 2])

To conclude

source

Packages and functions that we use

Creating our own data

Creating our own data

But we can also draw a sample

We can do random sampling

Or nonrandom sampling

We can replicate individual samples

We can replicate a chain of samples

The random seed

Why fix the random seed

Random seeds

Random processes

Drawing data

Drawing data

Drawing data: many values

Drawing data: other distribution

Drawing data: other distribution

Drawing data: other distribution

Drawing data: other distribution

Drawing data: other distribution

Drawing data: other distribution

Drawing multivariate distributions

Bivariate normal

Drawing multivariate distributions

Bivariate normal

Histograms

To conclude

Practical