Practical H

Exercises

Again, just like last time it is wise to start with fixing the random seed.

set.seed(123)

Generate two random samples of 10 numbers from a normal distribution with the below specifications. Test the null hypothesis that the population mean is 0.

\(\mu = 0\) and \(\sigma = 2\)
\(\mu = 1.5\) and \(\sigma = 2\)

x <- rnorm(10, mean = 0, sd = 2)
t.test(x)

## 
##  One Sample t-test
## 
## data:  x
## t = 0.24742, df = 9, p-value = 0.8101
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -1.215341  1.513843
## sample estimates:
## mean of x 
## 0.1492513

x <- rnorm(10, 1.5, 2)
t.test(x)

## 
##  One Sample t-test
## 
## data:  x
## t = 2.9202, df = 9, p-value = 0.01703
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.432058 3.402430
## sample estimates:
## mean of x 
##  1.917244

Write a function that generates a random sample of n numbers from a normal distribution with a user defined mean (i.e. a mean that you can choose when running the function) and standard deviation 1, and returns the p.value for the test that the mean is 0.

p.value.t <- function (n, mu) {
  x <- rnorm(n, mu, 1)
  t.test(x)$p.value
}

p.value.t(n = 30, mu = 3)

## [1] 2.912617e-17

Use the function of Exercise 3 to generate 50 \(p\)-values with \(n=10,\mu=0\), and make a qqplot to compare distribution of the \(p\)-values with a uniform \([0,1]\) variable.

y <- numeric(50)
for (i in 1:50) {
  y[i] <- p.value.t(n = 10, mu = 0)
}

qqplot(x=qunif(ppoints(50)), y)

The p-values follow a uniform distribution.

In a study that examined the use of acupuncture to treat migraine headaches, consenting patients on a waiting list for treatment for migraine were randomly assigned in a 2:1:1 ratio to acupuncture treatment, a “sham” acupuncture treatment in which needles were inserted at non-acupuncture points, and waiting-list patients whose only treatment was self-administered (Linde et al., 2005). The “sham” acupuncture treatment was described to trial participants as an acupuncture treatment that did not follow the principles of Chinese medicine.

What is the conclusion when the outcome is classified according to numbers of patients who experienced a greater than 50% reduction in headaches over a four-week period, relative to a pre-randomization baseline?

Use the following data

data <- matrix(c(74, 71, 43, 38, 11, 65), nrow = 2, ncol = 3)
colnames(data) <- c("Acupuncture", "Sham", "Waiting list")
rownames(data) <- c("> 50% reduction", "< 50% reduction")
data

##                 Acupuncture Sham Waiting list
## > 50% reduction          74   43           11
## < 50% reduction          71   38           65

We start with calculating the \(X^2\)-test:

X2test <- 
  data %>%
  chisq.test()

X2test

## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 32.486, df = 2, p-value = 8.825e-08

which is extremely significant. We can then calculate the expected cell frequencies

X2test$expected

##                 Acupuncture     Sham Waiting list
## > 50% reduction    61.45695 34.33113     32.21192
## < 50% reduction    83.54305 46.66887     43.78808

and the raw residual

X2test$observed - X2test$expected

##                 Acupuncture      Sham Waiting list
## > 50% reduction    12.54305  8.668874    -21.21192
## < 50% reduction   -12.54305 -8.668874     21.21192

as well as the Pearson residual

X2test$residuals

##                 Acupuncture      Sham Waiting list
## > 50% reduction    1.599991  1.479513    -3.737418
## < 50% reduction   -1.372296 -1.268963     3.205546

to infer the difference in observed and expected cell frequencies. Patients on the waiting list experience > 50% reduction much less than we would expect under independence of treatment and outcome.

Patients who received the acupuncture and sham acupuncture treatments were asked to guess their treatment at the end of their trial. What would you conclude from this data?

data <- matrix(c(82, 17, 30, 30, 26, 16), nrow = 3, ncol = 2)
colnames(data) <- c("Acupuncture", "Sham")
rownames(data) <- c("Chinese", "Other", "Don't know")
data

##            Acupuncture Sham
## Chinese             82   30
## Other               17   26
## Don't know          30   16

We again start with calculating the \(X^2\)-test:

X2test <- 
  data %>%
  chisq.test()

X2test

## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 15.358, df = 2, p-value = 0.0004624

which is very significant. We can then calculate the expected cell frequencies

X2test$expected

##            Acupuncture     Sham
## Chinese       71.88060 40.11940
## Other         27.59701 15.40299
## Don't know    29.52239 16.47761

and the raw residual

X2test$observed - X2test$expected

##            Acupuncture        Sham
## Chinese     10.1194030 -10.1194030
## Other      -10.5970149  10.5970149
## Don't know   0.4776119  -0.4776119

as well as the Pearson residual

X2test$residuals

##            Acupuncture       Sham
## Chinese     1.19357319 -1.5976353
## Other      -2.01721641  2.7001078
## Don't know  0.08790214 -0.1176598

We find that people who are receiving true Acupuncture are more inclined to believe that they receive Chinese acupuncture than we would expect under independence, while people wo received Sham acupuncture are more inclined to believe that they receive Other type of acupuncture. Don't know is more or less similarly distributed over the observed and expected frequencies.

In the following simulation experiment we investigate least-squares estimation of the mean.

Start by drawing a 100 values from a normal distribution with \(\mu = 3\) and \(\sigma = 7\). Use seed value 32083.

set.seed(32083)

x <- rnorm(100, mean=3, sd=7)

Next, confirm that the sample mean of the values in x is near 4.166.

mean(x)

## [1] 4.165695

Calculate the sample mean’s sum of squared deviations from \(\mu\). The sum of squared deviations from mu is defined as: \[ \sum_{i=1}^{100} (x_i - \mu)^2,\] There is a slow way

mu = 3
summed <- x-mu
sum.sq <- sum(summed^2)

And a fast way

sum.sq2 <- apply(outer(x, mu, "-")^2, 2, sum)

Both solutions are identical

identical(sum.sq, sum.sq2)

## [1] TRUE

Now create a function that automates the calculation of the sum of squares for any given \(\mu\). Call the function lsfun because we are going to identify the least squares estimate in exercise 8.

lsfun <- function(meanest) apply(outer(x, meanest, "-")^2, 2, sum)

or, 100% equivalently, but easier to spot as a function:

lsfun <- function(meanest){
  apply(outer(x, meanest, "-")^2, 2, sum)
}

Plot the curve of your least square function such that you can identify the minimum of the curve (i.e. the location for \(x\) where the sum of the squared deviations are the lowest).

curve(lsfun, from = 4.16, to = 4.17)

Repeat the experiment from 10 with the following \(X \sim \mathcal{N}(\mu, \sigma^2)\) normal samples of length \(n=100\), but now use the sample mean \(\bar{x}\) in your function lsfun(). Let the function plot the curve and print the location where the minimum of the sum of the squares is located each time. Fix the seed to set.seed(123):

\(x \sim \mathcal{N}(3, 7)\)
\(x \sim \mathcal{N}(15, 12)\)
\(x \sim \mathcal{N}(0, 2)\)
\(\sqrt{x} \sim \mathcal{N}(0, 2)\), i.e. you need to square \(x\)

Hint: use the sample mean \(\bar{x}\) as the center of your graph and add/subtract e.g. .5 from this value to plot a range.

First, we fix the random seed

set.seed(123)

Code-wise it is efficient to write a function that does the repetitive experimentation. That way we have to write some lines of code only once.

plotfun <- function(x, meanest, plot = TRUE){
  xbar <- mean(x)
  lsfun <- function(meanest){ apply(outer(xbar, meanest, "-")^2, 2, sum)}
  # lsfun <- function(mu, xbar) {
  #   summed <- x-mu
  #   sum.sq <- sum(summed^2)
  #   return(sum.sq)
  # }
  if (plot) {
    curve(lsfun, from = xbar - .5, to = xbar + .5, 
          ylab=expression(paste(Sigma," ",e^2))) #nicer y-axis label
  }
  return(cat("The mean is:", xbar, "\n"))
}

For \(x \sim \mathcal{N}(3, 7)\):

rnorm(100, mean=3, sd=sqrt(7)) %>%
  plotfun(meanest = 3, plot = TRUE)

## The mean is: 3.239192

For \(x \sim \mathcal{N}(15, 12)\)

rnorm(100, mean=15, sd=sqrt(12)) %>%
  plotfun(meanest = 15, plot = TRUE)

## The mean is: 14.62745

For \(x \sim \mathcal{N}(0, 2)\):

rnorm(100, mean=0, sd=2) %>%
  plotfun(meanest = 0, plot = TRUE)

## The mean is: 0.2409302

For \(\sqrt{x} \sim \mathcal{N}(0, 2)\):

rnorm(100, mean=0, sd=2)^2 %>%
  plotfun(meanest = 0, plot = TRUE)

## The mean is: 4.278351

End of practical.

Practical H

Gerko Vink

Statistical Programming in R

Exercises