Statistical Programming in R
library(MASS)
The mathematical formulation of the relationship between variables can be written as
\[ \mbox{observed}=\mbox{predicted}+\mbox{error} \]
or, in Greek notation, as \[y=\mu+\varepsilon\]
where \(y\) is the observed outcome, \(\mu\) the predicted value and \(\varepsilon\) the error.
Regression model:
\[ \text{age}_i=\alpha+\beta\cdot{\text{weight}}_i+\varepsilon_i \]
where age is the outcome \(y\), modeled conditional on weight, the predictor \(x\). The function lm()
is a base function in R
and allows you to fit a variety of linear models.
args(lm)
## function (formula, data, subset, weights, na.action, method = "qr", 
##     model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, 
##     contrasts = NULL, offset, ...) 
## NULL
If we want to know what these arguments do we can ask R:
?lm
This will open a help page on the lm()
function.
To obtain a linear model with a main effect for wgt
, we formulate \(age\sim wgt\)
library(mice)
fit <- with(boys, lm(age ~ wgt))
fit
## 
## Call:
## lm(formula = age ~ wgt)
## 
## Coefficients:
## (Intercept)          wgt  
##     -0.1924       0.2517
summary(fit)
## 
## Call:
## lm(formula = age ~ wgt)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.0907  -1.1686  -0.5342   1.4286   5.5143 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.192377   0.136878  -1.405     0.16    
## wgt          0.251673   0.003018  83.395   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.142 on 742 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.9036, Adjusted R-squared:  0.9035 
## F-statistic:  6955 on 1 and 742 DF,  p-value: < 2.2e-16
To obtain a linear model with just main effects for wgt
and hgt
, we formulate \(age\sim wgt + hgt\)
require(mice)
fit <- with(boys, lm(age ~ wgt + hgt))
fit
## 
## Call:
## lm(formula = age ~ wgt + hgt)
## 
## Coefficients:
## (Intercept)          wgt          hgt  
##    -7.45952      0.07138      0.10660
summary(fit)
## 
## Call:
## lm(formula = age ~ wgt + hgt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1627 -0.8836 -0.1656  0.8765  4.4365 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.459525   0.246782  -30.23   <2e-16 ***
## wgt          0.071376   0.005949   12.00   <2e-16 ***
## hgt          0.106602   0.003336   31.96   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.391 on 724 degrees of freedom
##   (21 observations deleted due to missingness)
## Multiple R-squared:  0.9592, Adjusted R-squared:  0.9591 
## F-statistic:  8510 on 2 and 724 DF,  p-value: < 2.2e-16
To predict \(age\) from \(wgt\), \(hgt\) and the interaction \(wgt*hgt\) we formulate \(age\sim wgt * hgt\)
fit <- with(boys, lm(age ~ wgt * hgt))
fit
## 
## Call:
## lm(formula = age ~ wgt * hgt)
## 
## Coefficients:
## (Intercept)          wgt          hgt      wgt:hgt  
##  -7.2833480   -0.0196161    0.1120325    0.0004142
summary(fit)
## 
## Call:
## lm(formula = age ~ wgt * hgt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2067 -0.8603 -0.1419  0.8452  4.7526 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.2833480  0.2510218 -29.015  < 2e-16 ***
## wgt         -0.0196161  0.0284921  -0.688  0.49137    
## hgt          0.1120325  0.0037076  30.217  < 2e-16 ***
## wgt:hgt      0.0004142  0.0001269   3.265  0.00115 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.382 on 723 degrees of freedom
##   (21 observations deleted due to missingness)
## Multiple R-squared:  0.9598, Adjusted R-squared:  0.9596 
## F-statistic:  5752 on 3 and 723 DF,  p-value: < 2.2e-16
If a categorical variable is entered into the function lm()
, it is automatically converted to a set of dummy variables in R
. The first level is always taken as the reference category. If we want another reference category, we can use the function relevel()
to change this.
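For example, to make "city" the reference category for reg, we can call relevel() inside the formula. A minimal sketch, assuming the boys data from the mice package is available and that "city" is one of the levels of reg:

```r
library(mice)  # provides the boys data (assumption: package installed)

# refit the model with "city" as the reference level
# instead of the default first level of reg
fit2 <- with(boys, lm(age ~ relevel(reg, ref = "city")))
summary(fit2)
```

The dummy coefficients then express each region's mean age relative to the "city" group.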
fit <- with(boys, lm(age ~ reg))
summary(fit)
## 
## Call:
## lm(formula = age ~ reg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.805  -7.376   1.339   5.955  11.935 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.8984     0.7595  15.666  < 2e-16 ***
## regeast      -2.6568     0.9312  -2.853 0.004449 ** 
## regwest      -2.9008     0.8788  -3.301 0.001011 ** 
## regsouth     -3.3074     0.9064  -3.649 0.000282 ***
## regcity      -3.6266     1.1031  -3.288 0.001058 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.836 on 740 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.02077, Adjusted R-squared:  0.01548 
## F-statistic: 3.925 on 4 and 740 DF,  p-value: 0.003678
names(fit) # the names of the list with output objects
## [1] "coefficients" "residuals" "effects" "rank" ## [5] "fitted.values" "assign" "qr" "df.residual" ## [9] "na.action" "contrasts" "xlevels" "call" ## [13] "terms" "model"
fit$coef # show the estimated coefficients
## (Intercept)     regeast     regwest    regsouth     regcity 
##   11.898420   -2.656786   -2.900792   -3.307388   -3.626625
coef(fit) # alternative
## (Intercept)     regeast     regwest    regsouth     regcity 
##   11.898420   -2.656786   -2.900792   -3.307388   -3.626625
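Other extractor functions work on the fitted object in the same way. For instance, the base function confint() returns confidence intervals for the estimated coefficients; a minimal sketch:

```r
# fit a simple model on a built-in dataset for illustration
fit <- lm(dist ~ speed, data = cars)

# 95% confidence intervals for the coefficients (the default)
confint(fit)

# a different confidence level
confint(fit, level = 0.99)
```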
Distributions and models for the random component
Normal distributions
Discrete distributions (counts)
Other continuous distributions
Probability of \(y\) successes in \(n\) trials with probability of success \(\pi\)
\[ P(y|n,\pi)={n\choose y}\pi^y(1-\pi)^{(n-y)} \]
For \(n=5\) trials:
par(mfrow = c(1, 2), bg = NA)
y <- factor(0:5)
x <- 0:5
barplot(dbinom(x, 5, .1), ylim = c(0, .6), names.arg = y, ylab = "P(Y=y)", xlab = "y")
text(5, .5, expression(paste(pi, " = 0.1")))
barplot(dbinom(x, 5, .5), ylim = c(0, .6), names.arg = y, ylab = "P(Y=y)", xlab = "y")
text(5, .5, expression(paste(pi, " = 0.5")))
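The closed form above can be checked against R's dbinom(); a quick sketch:

```r
# P(y = 2 successes | n = 5 trials, success probability p = 0.5),
# written out from the binomial mass function
y <- 2; n <- 5; p <- 0.5
choose(n, y) * p^y * (1 - p)^(n - y)  # 0.3125

# dbinom() gives the same value
dbinom(y, size = n, prob = p)         # 0.3125
```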
The probability of \(y\) events within a time period (e.g. the number of earthquakes in a year): \[ P(y|\lambda)=\frac{e^{-\lambda}\lambda^y}{y!}, \text{ where } \lambda=\text{rate parameter} \]
par(mfrow = c(1, 2), bg = NA)
y <- factor(0:5)
x <- 0:5
barplot(dpois(x, .5), ylim = c(0, .6), names.arg = y, ylab = "P(Y=y)", xlab = "y")
text(5, .5, expression(paste(lambda, " = 0.5")))
barplot(dpois(x, 2), ylim = c(0, .6), names.arg = y, ylab = "P(Y=y)", xlab = "y")
text(5, .5, expression(paste(lambda, " = 2")))
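As with the binomial case, the Poisson mass function can be checked by hand against dpois(); a quick sketch:

```r
# P(y = 2 events | lambda = 2), written out from the Poisson mass function
y <- 2; lambda <- 2
exp(-lambda) * lambda^y / factorial(y)  # 0.2706706

# dpois() gives the same value
dpois(y, lambda = lambda)               # 0.2706706
```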
The probability density function (pdf) and cumulative distribution function (cdf) of the standard normal:
par(mfrow = c(1, 2), bg = NA)
curve(dnorm(x, 0, 1), xlim = c(-3, 3), ylab = "f(x)", main = "")
curve(pnorm(x, 0, 1), xlim = c(-3, 3), ylab = "F(X < x)", main = "")
t.test(Hwt ~ Sex, data = cats)
## 
##  Welch Two Sample t-test
## 
## data:  Hwt by Sex
## t = -6.5179, df = 140.61, p-value = 1.186e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.763753 -1.477352
## sample estimates:
## mean in group F mean in group M 
##        9.202128       11.322680
head(bacteria, n=8)
##   y ap hilo week  ID     trt
## 1 y  p   hi    0 X01 placebo
## 2 y  p   hi    2 X01 placebo
## 3 y  p   hi    4 X01 placebo
## 4 y  p   hi   11 X01 placebo
## 5 y  a   hi    0 X02   drug+
## 6 y  a   hi    2 X02   drug+
## 7 n  a   hi    6 X02   drug+
## 8 y  a   hi   11 X02   drug+
Column y:

- n: disease absent
- y: disease present

Column ap:

- a: active medicine
- p: placebo

table(bacteria$ap, bacteria$y)
##    
##      n  y
##   a 31 93
##   p 12 84
x2.table <- table(bacteria$ap, bacteria$y)
x2.table <- cbind(x2.table, rowSums(x2.table))
rbind(x2.table, colSums(x2.table))
##    n   y    
## a 31  93 124
## p 12  84  96
##   43 177 220
\[\chi^2 = \sum \frac{(O - E)^2}{E}\]
##    n   y    
## a 31  93 124
## p 12  84  96
##   43 177 220
##            bacteria$y
## bacteria$ap        n        y
##           a 24.23636 99.76364
##           p 18.76364 77.23636
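The expected counts and the statistic can be computed by hand from the formula above; a sketch, using the bacteria data from MASS (chisq.test() without continuity correction reproduces the same statistic):

```r
library(MASS)  # provides the bacteria data

O <- table(bacteria$ap, bacteria$y)          # observed counts
E <- outer(rowSums(O), colSums(O)) / sum(O)  # expected: row total * column total / n
E

sum((O - E)^2 / E)                           # chi-square statistic by hand
chisq.test(O, correct = FALSE)$statistic     # same value
```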
prop.table(table(bacteria$ap, bacteria$y), margin = 1)
##    
##         n     y
##   a 0.250 0.750
##   p 0.125 0.875
In the treatment condition, 1/4 is free of disease.
In the placebo condition, 1/8 is free of disease.
Should we assume this means the medicine works?
chisq.test(bacteria$y, bacteria$ap)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  bacteria$y and bacteria$ap
## X-squared = 4.6109, df = 1, p-value = 0.03177
or the other way around
chisq.test(bacteria$ap, bacteria$y)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  bacteria$ap and bacteria$y
## X-squared = 4.6109, df = 1, p-value = 0.03177
chisq.test(table(bacteria$y, bacteria$ap))
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(bacteria$y, bacteria$ap)
## X-squared = 4.6109, df = 1, p-value = 0.03177
plot(table(bacteria$y, bacteria$ap))
barplot(table(bacteria$y, bacteria$ap), beside = TRUE, legend = c("No infection", "Infection"), col = c("blue", "orange"), args.legend = list(x=4.5, y=80))
With small cell sizes (roughly, if any expected count is below 5), we should use Fisher's exact test.
short.bac <- bacteria[bacteria$week <= 2, ]
short.tab <- table(short.bac$y, short.bac$ap)
short.tab
##    
##      a  p
##   n  6  3
##   y 47 38
chisq.test(short.tab)
## Warning in chisq.test(short.tab): Chi-squared approximation may be
## incorrect
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  short.tab
## X-squared = 0.090475, df = 1, p-value = 0.7636
fisher.test(table(short.bac$y, short.bac$ap))
## 
##  Fisher's Exact Test for Count Data
## 
## data:  table(short.bac$y, short.bac$ap)
## p-value = 0.7268
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##   0.3182193 10.5996850
## sample estimates:
## odds ratio 
##   1.609049
What do we conclude?
plot(table(short.bac$y, short.bac$ap))
barplot(table(short.bac$y, short.bac$ap), beside = TRUE, legend = c("No infection", "Infection"), col = c("blue", "orange"), args.legend = list(x=4.5, y=40))