A for-loop that loops over all numbers between 0 and 10, but only prints numbers below 5.
for (i in 0:10) {
  if (i < 5) {
    print(i)
  }
}
## [1] 0
## [1] 1
## [1] 2
## [1] 3
## [1] 4
for (i in 0:10) {
  if (i >= 3 && i <= 5) {
    print(i)
  }
}
## [1] 3
## [1] 4
## [1] 5
num <- 0:10
num[num >= 3 & num <= 5]
## [1] 3 4 5
or, alternatively,
subset(num, num >= 3 & num <= 5)
## [1] 3 4 5
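The two vectorised solutions work because a comparison on a vector returns a logical vector of the same length, which can then be used as an index. A minimal sketch of that intermediate step, where the name keep is only illustrative:

```r
num <- 0:10

# The comparison is evaluated element-wise and yields one TRUE/FALSE per element.
keep <- num >= 3 & num <= 5
print(keep)

# Indexing with the logical vector keeps only the elements where keep is TRUE.
print(num[keep])

# which() returns the positions of the TRUE values instead of the values themselves.
print(which(keep))
```

Note that the element-wise & is needed here; && only works on single values and is meant for conditions inside if().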
Use byrow = TRUE to fill a matrix left-to-right instead of top-to-bottom.
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 1 2 3 4 5 6 7 8
## [2,] 2 4 6 8 10 12 14 16
## [3,] 3 6 9 12 15 18 21 24
## [4,] 4 8 12 16 20 24 28 32
## [5,] 5 10 15 20 25 30 35 40
# Create a matrix in which every row holds the numbers 1 to 8.
mat <- matrix(1:8, ncol = 8, nrow = 5, byrow = TRUE)
# Loop over the rows and multiply each row by its row number.
for (i in 1:5) {
  mat[i, ] <- mat[i, ] * i
}
mat
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 1 2 3 4 5 6 7 8
## [2,] 2 4 6 8 10 12 14 16
## [3,] 3 6 9 12 15 18 21 24
## [4,] 4 8 12 16 20 24 28 32
## [5,] 5 10 15 20 25 30 35 40
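To see what byrow changes, it may help to fill a small matrix both ways. This is only an illustrative sketch; the names by_col and by_row are made up for the example.

```r
# Fill a 2 x 3 matrix with the numbers 1 to 6 in both ways.
by_col <- matrix(1:6, nrow = 2)                # default: column by column
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)  # row by row

print(by_col)
print(by_row)

# Filling a 3 x 2 matrix by column and transposing it gives the by-row result.
print(identical(by_row, t(matrix(1:6, nrow = 3))))
```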
string.mat <- matrix(NA, ncol = 6, nrow = 6)
for (i in 1:6) {
  for (j in 1:6) {
    string.mat[i, j] <- paste(i, "+", j, "=", i + j, sep = "")
  }
}
string.mat
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] "1+1=2" "1+2=3" "1+3=4" "1+4=5" "1+5=6" "1+6=7"
## [2,] "2+1=3" "2+2=4" "2+3=5" "2+4=6" "2+5=7" "2+6=8"
## [3,] "3+1=4" "3+2=5" "3+3=6" "3+4=7" "3+5=8" "3+6=9"
## [4,] "4+1=5" "4+2=6" "4+3=7" "4+4=8" "4+5=9" "4+6=10"
## [5,] "5+1=6" "5+2=7" "5+3=8" "5+4=9" "5+5=10" "5+6=11"
## [6,] "6+1=7" "6+2=8" "6+3=9" "6+4=10" "6+5=11" "6+6=12"
Show "Sum > 8" in the matrix in the cells where the sum exceeds 8.
string.mat <- matrix(NA, ncol = 6, nrow = 6)
for (i in 1:6) {
  for (j in 1:6) {
    if (i + j <= 8) {
      string.mat[i, j] <- paste(i, "+", j, "=", i + j, sep = "")
    } else {
      string.mat[i, j] <- "Sum > 8"
    }
  }
}
string.mat
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] "1+1=2" "1+2=3" "1+3=4" "1+4=5" "1+5=6" "1+6=7"
## [2,] "2+1=3" "2+2=4" "2+3=5" "2+4=6" "2+5=7" "2+6=8"
## [3,] "3+1=4" "3+2=5" "3+3=6" "3+4=7" "3+5=8" "Sum > 8"
## [4,] "4+1=5" "4+2=6" "4+3=7" "4+4=8" "Sum > 8" "Sum > 8"
## [5,] "5+1=6" "5+2=7" "5+3=8" "Sum > 8" "Sum > 8" "Sum > 8"
## [6,] "6+1=7" "6+2=8" "Sum > 8" "Sum > 8" "Sum > 8" "Sum > 8"
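The same matrix can also be built without explicit loops. The following is a sketch of one possible alternative, not the intended solution of the exercise: outer() evaluates a function for every combination of i and j, and ifelse() picks the label per cell. The name sum.mat is only illustrative.

```r
# outer() calls the function once, with all 36 (i, j) combinations as vectors,
# so everything inside the function has to be vectorised.
sum.mat <- outer(1:6, 1:6, function(i, j) {
  ifelse(i + j <= 8,
         paste0(i, "+", j, "=", i + j),
         "Sum > 8")
})
print(sum.mat)
```

The nested loop is easier to read when you are learning about control flow; the vectorised version mainly pays off for larger matrices.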
The anscombe data set is a wonderful data set from 1973 by Francis J. Anscombe that demonstrates that pairs of variables can have the same statistical properties while having completely different graphical representations. We will be using this data set more this week. If you’d like to know more about anscombe, you can simply call ?anscombe to open its help page.
You can directly call anscombe from your console because the datasets package is a base package in R. This means that it is always included and, by default, loaded when you start an R instance. In general, when we would like to access functions or data sets from packages that are not automatically loaded, we do not have to load the package explicitly. We can also call package::thing-we-need to directly ‘grab’ the thing-we-need from the package namespace. For example,
test <- datasets::anscombe
identical(test, anscombe) #test if identical
## [1] TRUE
This is especially handy within functions, as we can call package::function-name to borrow functionality from installed packages without attaching the whole package. Note that :: still loads the namespace of the package in the background. What it avoids is adding the package to the search path, which keeps your session free of name clashes and makes explicit where each function comes from.
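A small sketch of the :: mechanism, using the tools package because it ships with R. It assumes a default session in which tools has not been attached with library(); the example string is arbitrary.

```r
# Call a function from the tools package without library(tools).
print(tools::toTitleCase("graphs in statistical analysis"))

# The call above has loaded the namespace of the package ...
print(isNamespaceLoaded("tools"))

# ... but the package is only on the search path if it was attached
# earlier with library() or require().
print("package:tools" %in% search())
```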
Display a summary (using summary()) of each column of the anscombe dataset from the datasets package.
# Using i as an index for the current column.
for (i in seq_len(ncol(anscombe))) {
  print(colnames(anscombe)[i])
  print(summary(anscombe[, i]))
}
## [1] "x1"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 6.5 9.0 9.0 11.5 14.0
## [1] "x2"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 6.5 9.0 9.0 11.5 14.0
## [1] "x3"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 6.5 9.0 9.0 11.5 14.0
## [1] "x4"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8 8 8 9 8 19
## [1] "y1"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.260 6.315 7.580 7.501 8.570 10.840
## [1] "y2"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.100 6.695 8.140 7.501 8.950 9.260
## [1] "y3"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.39 6.25 7.11 7.50 7.98 12.74
## [1] "y4"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.250 6.170 7.040 7.501 8.190 12.500
# Looping over the variables directly.
# Although the code is a bit clearer, we cannot access the names of the variables this way,
# so the output is less clear.
for (i in anscombe) {
  print(summary(i))
}
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 6.5 9.0 9.0 11.5 14.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 6.5 9.0 9.0 11.5 14.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 6.5 9.0 9.0 11.5 14.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8 8 8 9 8 19
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.260 6.315 7.580 7.501 8.570 10.840
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.100 6.695 8.140 7.501 8.950 9.260
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.39 6.25 7.11 7.50 7.98 12.74
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.250 6.170 7.040 7.501 8.190 12.500
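A possible middle ground, sketched here as an alternative rather than as part of the original solution, is to loop over the column names. The loop stays readable and the names remain available. The loop variable name is our own choice.

```r
# Loop over the column names instead of over the columns themselves.
for (name in names(anscombe)) {
  # Print the variable name, then the summary of that column.
  print(name)
  print(summary(anscombe[[name]]))
}
```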
Display a summary of each column of the anscombe dataset using apply().
apply(X = anscombe, MARGIN = 2, FUN = summary)
## x1 x2 x3 x4 y1 y2 y3 y4
## Min. 4.0 4.0 4.0 8 4.260000 3.100000 5.39 5.250000
## 1st Qu. 6.5 6.5 6.5 8 6.315000 6.695000 6.25 6.170000
## Median 9.0 9.0 9.0 8 7.580000 8.140000 7.11 7.040000
## Mean 9.0 9.0 9.0 9 7.500909 7.500909 7.50 7.500909
## 3rd Qu. 11.5 11.5 11.5 8 8.570000 8.950000 7.98 8.190000
## Max. 14.0 14.0 14.0 19 10.840000 9.260000 12.74 12.500000
Remember that in R, the first index in square brackets always refers to the row and the second index always refers to the column, such that anscombe[2, 3] gives us the value at the intersection of the second row and the third column. The same rationale translates to the margins we would like apply() to iterate over. The argument MARGIN = 2 specifies the columns, while MARGIN = 1 indicates that a function should be applied over the rows:
apply(anscombe, 1, summary)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## Min. 6.5800 5.7600 7.58000 7.11000 7.81000 7.0400 5.2500 3.10000
## 1st Qu. 7.8650 6.9050 7.92750 8.57750 8.24750 8.0750 6.0000 4.00000
## Median 8.5900 8.0000 10.74000 8.82500 8.86500 9.4000 6.0400 4.13000
## Mean 8.6525 7.4525 10.47125 8.56625 9.35875 10.4925 6.3375 7.03125
## 3rd Qu. 10.0000 8.0000 13.00000 9.00000 11.00000 14.0000 6.4075 7.16750
## Max. 10.0000 8.1400 13.00000 9.00000 11.00000 14.0000 8.0000 19.00000
## [,9] [,10] [,11]
## Min. 5.5600 4.82000 4.740
## 1st Qu. 8.1125 6.85500 5.000
## Median 9.9850 7.00000 5.340
## Mean 9.7100 6.92625 5.755
## 3rd Qu. 12.0000 7.42250 6.020
## Max. 12.0000 8.00000 8.000
We now see a returned matrix of 11 columns that gives us the summary() over each of the 11 rows in the anscombe data set.
dim(anscombe)
## [1] 11 8
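The effect of MARGIN is easiest to verify on a matrix that is small enough to check by hand. A minimal sketch, where the matrix m is made up for the example:

```r
# A small 2 x 3 matrix, filled column by column.
m <- matrix(1:6, nrow = 2)
print(m)

# MARGIN = 1: the function is applied to each row, giving 2 results.
print(apply(m, 1, sum))

# MARGIN = 2: the function is applied to each column, giving 3 results.
print(apply(m, 2, sum))
```

For plain sums and means, rowSums(), colSums(), rowMeans() and colMeans() do the same job.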
Display a summary of each column of the anscombe dataset using sapply().
sapply(anscombe, summary)
## x1 x2 x3 x4 y1 y2 y3 y4
## Min. 4.0 4.0 4.0 8 4.260000 3.100000 5.39 5.250000
## 1st Qu. 6.5 6.5 6.5 8 6.315000 6.695000 6.25 6.170000
## Median 9.0 9.0 9.0 8 7.580000 8.140000 7.11 7.040000
## Mean 9.0 9.0 9.0 9 7.500909 7.500909 7.50 7.500909
## 3rd Qu. 11.5 11.5 11.5 8 8.570000 8.950000 7.98 8.190000
## Max. 14.0 14.0 14.0 19 10.840000 9.260000 12.74 12.500000
We can see that sapply() returns a matrix here. We don’t have to specify any margins, as the anscombe data set is of class data.frame:
class(anscombe)
## [1] "data.frame"
Objects of class data.frame can be addressed as a list, where the columns are the list elements (see Lecture B). The function summary() is therefore automatically applied to each column.
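We can check the list nature of a data.frame directly. A short sketch using only base R functions:

```r
# A data.frame is stored as a list with one element per column.
print(is.list(anscombe))
print(length(anscombe))   # number of list elements = number of columns

# List-style access returns a single column as a vector.
print(anscombe[["x1"]])
print(anscombe$y1[1:3])
```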
Display a summary of each column of the anscombe dataset using lapply().
lapply(anscombe, summary)
## $x1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 6.5 9.0 9.0 11.5 14.0
##
## $x2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 6.5 9.0 9.0 11.5 14.0
##
## $x3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 6.5 9.0 9.0 11.5 14.0
##
## $x4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8 8 8 9 8 19
##
## $y1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.260 6.315 7.580 7.501 8.570 10.840
##
## $y2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.100 6.695 8.140 7.501 8.950 9.260
##
## $y3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.39 6.25 7.11 7.50 7.98 12.74
##
## $y4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.250 6.170 7.040 7.501 8.190 12.500
The function lapply() behaves just like sapply(), but returns a list rather than a matrix. In fact, sapply() is a more user-friendly version of lapply(). I, personally, prefer the sapply() solution (or the equivalent apply() over the columns) for the anscombe data set. However, if a data set has many dimensions, the return from lapply() may be much more flexible to work with.
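The difference in return type can be made visible with a function other than summary(). The sketch below uses range() and a small anonymous function; the object names are only illustrative.

```r
# Same function, two return types.
as_list   <- lapply(anscombe, range)
as_matrix <- sapply(anscombe, range)

print(class(as_list))     # a list with one element per column
print(dim(as_matrix))     # simplified to a matrix: 2 rows, 8 columns

# When the results differ in length, sapply() cannot simplify
# and falls back to returning a list as well.
above_ten <- sapply(anscombe, function(x) x[x > 10])
print(class(above_ten))
```

This fallback is one reason to reach for lapply() when the shape of the result is not known in advance: its return type is always a list.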
giveMeanAsString <- function(x) {
  paste("The mean is", mean(x))
}
Apply the function to each column of anscombe.
sapply(anscombe, giveMeanAsString)
## x1 x2
## "The mean is 9" "The mean is 9"
## x3 x4
## "The mean is 9" "The mean is 9"
## y1 y2
## "The mean is 7.50090909090909" "The mean is 7.50090909090909"
## y3 y4
## "The mean is 7.5" "The mean is 7.50090909090909"
Use round() to round off the means to a single decimal, and apply it again to see the results.
giveRoundedMeanAsString <- function(x) {
  paste("The mean is", round(mean(x), 1))
}
sapply(anscombe, giveRoundedMeanAsString)
## x1 x2 x3 x4
## "The mean is 9" "The mean is 9" "The mean is 9" "The mean is 9"
## y1 y2 y3 y4
## "The mean is 7.5" "The mean is 7.5" "The mean is 7.5" "The mean is 7.5"
The mammalsleep data set from the mice package contains data collected by Allison and Cicchetti (1976). It holds information for 62 mammal species on the interrelationship between sleep, ecological, and constitutional variables. The data set contains missing values on five variables, which poses challenges when analyses include these variables.
We will use this data set more frequently this week, but only once today. We can therefore call mice::mammalsleep to obtain only the mammalsleep data set without attaching the whole mice package.
A function that returns (1) the mean and standard deviation (sd()) of the vector, if the vector is numeric, or (2) the levels of the vector, if it is categorical, applied to each column of the mammalsleep dataset from the mice package.
columnInfo <- function(x) {
  if (is.numeric(x)) {
    return(paste("The mean is", round(mean(x), 2),
                 "and the sd is", round(sd(x), 2)))
  } else {
    return(paste(levels(x), collapse = ", "))
  }
}
sapply(mice::mammalsleep, columnInfo)
## species
## "African elephant, African giant pouched rat, Arctic Fox, Arctic ground squirrel, Asian elephant, Baboon, Big brown bat, Brazilian tapir, Cat, Chimpanzee, Chinchilla, Cow, Desert hedgehog, Donkey, Eastern American mole, Echidna, European hedgehog, Galago, Genet, Giant armadillo, Giraffe, Goat, Golden hamster, Gorilla, Gray seal, Gray wolf, Ground squirrel, Guinea pig, Horse, Jaguar, Kangaroo, Lesser short-tailed shrew, Little brown bat, Man, Mole rat, Mountain beaver, Mouse, Musk shrew, N. American opossum, Nine-banded armadillo, Okapi, Owl monkey, Patas monkey, Phanlanger, Pig, Rabbit, Raccoon, Rat, Red fox, Rhesus monkey, Rock hyrax (Hetero. b), Rock hyrax (Procavia hab), Roe deer, Sheep, Slow loris, Star nosed mole, Tenrec, Tree hyrax, Tree shrew, Vervet, Water opossum, Yellow-bellied marmot"
## bw
## "The mean is 198.79 and the sd is 899.16"
## brw
## "The mean is 283.13 and the sd is 930.28"
## sws
## "The mean is NA and the sd is NA"
## ps
## "The mean is NA and the sd is NA"
## ts
## "The mean is NA and the sd is NA"
## mls
## "The mean is NA and the sd is NA"
## gt
## "The mean is NA and the sd is NA"
## pi
## "The mean is 2.87 and the sd is 1.48"
## sei
## "The mean is 2.42 and the sd is 1.6"
## odi
## "The mean is 2.61 and the sd is 1.44"
# We need to use the option na.rm = TRUE for mean() and sd() to make sure the missing values are skipped.
columnInfo <- function(x) {
  if (is.numeric(x)) {
    return(paste("The mean is", round(mean(x, na.rm = TRUE), 2),
                 "and sd is", round(sd(x, na.rm = TRUE), 2)))
  } else {
    return(paste(levels(x), collapse = ", "))
  }
}
sapply(mice::mammalsleep, columnInfo)
## species
## "African elephant, African giant pouched rat, Arctic Fox, Arctic ground squirrel, Asian elephant, Baboon, Big brown bat, Brazilian tapir, Cat, Chimpanzee, Chinchilla, Cow, Desert hedgehog, Donkey, Eastern American mole, Echidna, European hedgehog, Galago, Genet, Giant armadillo, Giraffe, Goat, Golden hamster, Gorilla, Gray seal, Gray wolf, Ground squirrel, Guinea pig, Horse, Jaguar, Kangaroo, Lesser short-tailed shrew, Little brown bat, Man, Mole rat, Mountain beaver, Mouse, Musk shrew, N. American opossum, Nine-banded armadillo, Okapi, Owl monkey, Patas monkey, Phanlanger, Pig, Rabbit, Raccoon, Rat, Red fox, Rhesus monkey, Rock hyrax (Hetero. b), Rock hyrax (Procavia hab), Roe deer, Sheep, Slow loris, Star nosed mole, Tenrec, Tree hyrax, Tree shrew, Vervet, Water opossum, Yellow-bellied marmot"
## bw
## "The mean is 198.79 and sd is 899.16"
## brw
## "The mean is 283.13 and sd is 930.28"
## sws
## "The mean is 8.67 and sd is 3.67"
## ps
## "The mean is 1.97 and sd is 1.44"
## ts
## "The mean is 10.53 and sd is 4.61"
## mls
## "The mean is 19.88 and sd is 18.21"
## gt
## "The mean is 142.35 and sd is 146.81"
## pi
## "The mean is 2.87 and sd is 1.48"
## sei
## "The mean is 2.42 and sd is 1.6"
## odi
## "The mean is 2.61 and sd is 1.44"
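The effect of na.rm can be checked on a vector that is small enough to compute by hand. A minimal sketch with made-up numbers:

```r
# A small vector with one missing value (made-up numbers).
x <- c(2, 4, NA, 6)

# By default, a single NA makes the result NA.
print(mean(x))
print(sd(x))

# With na.rm = TRUE the missing value is dropped before computing.
print(mean(x, na.rm = TRUE))
print(sd(x, na.rm = TRUE))

# Counting the missing values shows how many observations are dropped.
print(sum(is.na(x)))
```

Keep in mind that with na.rm = TRUE each column is summarised over its own observed cases, so the statistics of different columns may be based on different sets of mammals.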
End of Practical
Allison, T., & Cicchetti, D. V. (1976). Sleep in Mammals: Ecological and Constitutional Correlates. Science, 194(4266), 732–734.
Anscombe, F. J. (1973). Graphs in statistical analysis. American Statistician, 27, 17–21.