Statistical Programming with R
```r
library(dplyr)    # Data manipulation
library(magrittr) # Pipes
library(stringi)  # For counting substrings
library(ggplot2)  # Plotting suite
```
With Monte Carlo methods we can estimate the value of an unknown quantity using statistical theory: we simulate what we would expect based on chance.
It builds upon the principles of inferential statistics and needs:

- repeated independent experiments
- a true probability \(p\) of a particular outcome in each experiment

It works by repeatedly sampling from a population while varying the input, but keeping the method consistent. If we replicated the same procedure an infinite number of times, the difference between our estimate and the true value would be zero.
The experiments must be independent: the outcome of one trial does not change the probability of the event in any other trial.
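The idea can be made concrete with a toy example (my own illustration, not from the original material): estimate the probability that two fair dice sum to 7 from repeated independent experiments and compare it with the analytic value \(6/36 = 1/6\).

```r
set.seed(2023)
n <- 100000                             # number of independent experiments
die1 <- sample(1:6, n, replace = TRUE)  # first die in each experiment
die2 <- sample(1:6, n, replace = TRUE)  # second die in each experiment
estimate <- mean(die1 + die2 == 7)      # Monte Carlo estimate of P(sum = 7)
estimate                                # should be close to 1/6, about 0.167
```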
The tendency to believe that previous throws alter the probability of a future event is referred to as the gambler's fallacy: for independent trials on a fair die, the probability of a six is one in six, even if the previous 100 throws have all been sixes.
If a fair roulette game landed on black 50 times in a row, the probability of landing on red is no larger for the next spin: it remains \(\frac{18}{37}\).
Following an extreme random event, the next random event is likely to be less extreme: in the long run, outcomes will move towards the mean.
RTTM (regression to the mean): when one run of throwing a die 100 times yields 100 sixes (a rare event), the next run will likely return fewer than 100 sixes.

LLN (law of large numbers): in the long term the 100-sixes event will average out and the observed proportion of sixes will move towards \(P = \frac{1}{6}\).

GF (gambler's fallacy): the die is "due" a run of 500 non-sixes to even out the 100 sixes.

```r
set.seed(123)
x <- sample(1:6, 10, prob = rep(1/6, 6), replace = TRUE)
prop.table(table(x))
```
```
## x
##   1   2   3   4   5   6 
## 0.3 0.1 0.1 0.2 0.2 0.1
```
```r
x <- sample(1:6, 100, prob = rep(1/6, 6), replace = TRUE)
prop.table(table(x))
```

```
## x
##    1    2    3    4    5    6 
## 0.15 0.17 0.16 0.21 0.14 0.17
```

```r
x <- sample(1:6, 10000, prob = rep(1/6, 6), replace = TRUE)
prop.table(table(x))
```

```
## x
##      1      2      3      4      5      6 
## 0.1616 0.1683 0.1663 0.1713 0.1663 0.1662
```

```r
x <- sample(1:6, 1000000, prob = rep(1/6, 6), replace = TRUE)
prop.table(table(x))
```

```
## x
##        1        2        3        4        5        6 
## 0.166610 0.167284 0.166763 0.166686 0.166458 0.166199
```
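The convergence shown above can also be visualised directly by tracking the running proportion of sixes as the number of throws grows (a sketch of my own, not part of the original material):

```r
set.seed(123)
throws  <- sample(1:6, 10000, replace = TRUE)
running <- cumsum(throws == 6) / seq_along(throws)  # proportion of sixes after each throw
plot(running, type = "l",
     xlab = "Number of throws", ylab = "Proportion of sixes")
abline(h = 1/6, lty = 2)  # the true probability
```

Early on the line fluctuates wildly; with more throws it settles ever closer to \(\frac{1}{6}\).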
```r
charx <- paste(x, collapse = "")
occurrences <- stri_count_fixed(charx, "123") / 1e6
trueprob <- (1/6)^3
cat(occurrences, "\n", trueprob, sep = "")
```

```
## 0.004635
## 0.00462963
```
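To see how close that estimate is, a back-of-the-envelope check (my own addition): with roughly \(10^6\) starting positions, the expected number of `123` occurrences is \(10^6 \times (1/6)^3 \approx 4630\), with a standard error of roughly 68, so the observed count of 4635 is well within Monte Carlo error. This ignores the two positions at the very end of the string and the slight dependence between overlapping windows.

```r
n <- 1e6
p <- (1/6)^3                    # probability that a given position starts "123"
expected <- n * p               # expected count of occurrences, about 4630
mc_se <- sqrt(n * p * (1 - p))  # approximate (binomial) standard error, about 68
round(c(expected = expected, se = mc_se))
```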
```r
# Initialise results object
result <- list()
# Repeat K times
for (i in 1:100) {
  # Generate a dataset of interest
  dataset <- sample(1:6, 100000, prob = rep(1/6, 6), replace = TRUE)
  # Optional: perform your method on this dataset
  # Save the values you are interested in
  result[[i]] <- dataset
}
# Aggregate your results
probs <- sapply(result, function(x) prop.table(table(x)))
# Display output
rowMeans(probs)
```

```
##         1         2         3         4         5         6 
## 0.1665194 0.1666168 0.1668015 0.1666164 0.1666844 0.1667615
```
```r
result <- list()
# Repeat 100,000 times, with 'only' 100 dice each
for (i in 1:100000) {
  result[[i]] <- sample(1:6, 100, prob = rep(1/6, 6), replace = TRUE)
}
probs <- sapply(result, function(x) prop.table(table(x)))
rowMeans(probs)
```

```
##         1         2         3         4         5         6 
## 0.1665916 0.1666016 0.1667691 0.1666381 0.1668459 0.1665537
```
```r
result <- list()
# 1000 trials with a loaded die
for (i in 1:1000) {
  result[[i]] <- sample(1:6, size = 1000,
                        prob = c(.01, .04, .1, .15, .2, .5), replace = TRUE)
}
probs <- sapply(result, function(x) prop.table(table(x)))
rowMeans(probs)
```

```
##        1        2        3        4        5        6 
## 0.010178 0.039981 0.100071 0.149081 0.199827 0.500862
```
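Because every replication yields its own estimate, the same loop also tells us how uncertain the aggregated estimates are. A sketch (my own addition; the loop is rebuilt so the chunk is self-contained, and `factor(..., levels = 1:6)` guarantees all six faces appear in every table even when a rare face is never drawn):

```r
set.seed(123)
result <- list()
for (i in 1:1000) {
  result[[i]] <- sample(1:6, size = 1000,
                        prob = c(.01, .04, .1, .15, .2, .5), replace = TRUE)
}
# 6 x 1000 matrix: one column of estimated probabilities per replication
probs <- sapply(result, function(x) prop.table(table(factor(x, levels = 1:6))))
# Monte Carlo standard error of each estimated probability:
# the spread of the 1000 per-replication estimates, scaled by sqrt(1000)
apply(probs, 1, sd) / sqrt(ncol(probs))
```

The standard errors are tiny, which is why the aggregated estimates above match the true probabilities to three decimals.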
Power: the probability of detecting an effect, given that a true effect in fact exists.
What happens to the power of an independent-samples t-test if we measure on a 5-point scale instead of a continuous outcome?
```r
cuts <- c(-Inf, -1.2, -0.4, 0.4, 1.2, Inf)
curve(dnorm, -3, 3)
abline(v = cuts)
```

```r
set.seed(123)
treatment <- rnorm(24, .7)
control   <- rnorm(24)
treatCut  <- as.numeric(cut(treatment, cuts))
contrCut  <- as.numeric(cut(control, cuts))
t.test(treatment, control)$p.value
```

```
## [1] 0.02382654
```

```r
t.test(treatCut, contrCut)$p.value
```

```
## [1] 0.09583241
```
How confident are we in this result?
```r
# Initialise results object
result <- matrix(FALSE, nrow = 100000, ncol = 2)
for (i in 1:100000) {
  # Generate a dataset of interest
  treatment <- rnorm(24, .7)
  control   <- rnorm(24)
  treatCut  <- as.numeric(cut(treatment, cuts))
  contrCut  <- as.numeric(cut(control, cuts))
  # Perform your method on this dataset
  treatp <- t.test(treatment, control)$p.value
  contrp <- t.test(treatCut, contrCut)$p.value
  # Save the values you are interested in
  result[i, 1] <- treatp < 0.05
  result[i, 2] <- contrp < 0.05
}
colMeans(result)
```

```
## [1] 0.66311 0.62295
```
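As a plausibility check (an addition, not in the original material): base R's `power.t.test()` gives the analytic power of a two-sample t-test with 24 observations per group and a true difference of 0.7 standard deviations, which should sit close to the simulated power of about 0.66 for the continuous outcome.

```r
# Analytic power for the continuous-outcome design simulated above
power.t.test(n = 24, delta = 0.7, sd = 1, sig.level = 0.05)
```

No analytic counterpart exists this easily for the 5-point-scale version, which is exactly why the Monte Carlo simulation is useful there.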