Statistical Programming in R

Recap

So far

This morning we have learned the basics of programming in R:

  • How to assign elements to objects (<-)
  • How to run code
  • How to save R-scripts
  • How to manage projects in RStudio
  • How to create notebooks or markdown HTML files

Objects that contain more than one element

More than one element

  • We can assign more than one element to a vector (in this case a 1-dimensional concatenation of the numbers 1 through 5)
a <- c(1, 2, 3, 4, 5)
a
## [1] 1 2 3 4 5
b <- 1:5
b
## [1] 1 2 3 4 5

More than one element, with characters

Characters (or character strings) in R are enclosed in quotes, by convention double quotes.

a.new <- c(a, "A")
a.new
## [1] "1" "2" "3" "4" "5" "A"

Notice the difference with a from the previous slide

a
## [1] 1 2 3 4 5
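
A minimal sketch of what happened here, using class() to inspect how the elements are stored. The name a.back is hypothetical and only used for illustration:

```r
a <- c(1, 2, 3, 4, 5)
a.new <- c(a, "A")

class(a)      # "numeric"
class(a.new)  # "character": adding "A" coerced every element to a character

# Converting back to numbers: "A" has no numerical meaning and becomes NA.
# R gives a warning about this, which is suppressed here on purpose.
a.back <- suppressWarnings(as.numeric(a.new))
is.na(a.back) # only the last element is TRUE
```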

Quickly identifying elements in vectors

rep(a, 15)
##  [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
## [36] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
## [71] 1 2 3 4 5

The number in square brackets at the start of each output line is the index of the first element printed on that line.

Calling elements in vectors

If we want just the third element, we type

a[3]
## [1] 3
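
The same square brackets accept more than one index. A short sketch of a few common variants:

```r
a <- c(1, 2, 3, 4, 5)

a[c(1, 3)]  # first and third element: 1 3
a[2:4]      # second through fourth element: 2 3 4
a[-3]       # everything except the third element: 1 2 4 5
```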

Multiple vectors in one object

This we would refer to as a matrix

c <- matrix(a, nrow = 5, ncol = 2)
c
##      [,1] [,2]
## [1,]    1    1
## [2,]    2    2
## [3,]    3    3
## [4,]    4    4
## [5,]    5    5
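
Note that a has only 5 elements while the matrix has 10 cells: R recycles a to fill the second column. The sketch below makes this explicit. The names m and m.byrow are hypothetical; a different name than c is used here so that the c() function is not masked:

```r
a <- c(1, 2, 3, 4, 5)

# a is recycled: both columns contain 1 through 5
m <- matrix(a, nrow = 5, ncol = 2)
dim(m)                        # 5 2
identical(m[, 1], m[, 2])     # TRUE

# Matrices are filled column by column unless byrow = TRUE is specified
m.byrow <- matrix(1:10, nrow = 5, ncol = 2, byrow = TRUE)
m.byrow[1, ]                  # 1 2
```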

Calling elements in matrices #1

  • The first row is called by
c[1, ]
## [1] 1 1
  • The second column is called by
c[, 2]
## [1] 1 2 3 4 5

Calling elements in matrices #2

  • The intersection of the first row and second column is called by
c[1, 2]
## [1] 1

In short: square brackets [] are used to call elements, rows and columns (and much more, beyond the scope of this course).
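
As an illustration of the "much more", the sketch below selects several rows at once and keeps a single row in matrix form. The matrix m is a hypothetical stand-in for matrix c:

```r
m <- matrix(1:5, nrow = 5, ncol = 2)

m[c(1, 3), 2]         # rows 1 and 3 of the second column: 1 3
m[1, ]                # a single row is simplified to a vector
m[1, , drop = FALSE]  # drop = FALSE keeps it as a 1 x 2 matrix
dim(m[1, , drop = FALSE])  # 1 2
```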

Matrices with mixed numeric / character data

If we add a character column to matrix c, everything becomes a character:

cbind(c, letters[1:5])
##      [,1] [,2] [,3]
## [1,] "1"  "1"  "a" 
## [2,] "2"  "2"  "b" 
## [3,] "3"  "3"  "c" 
## [4,] "4"  "4"  "d" 
## [5,] "5"  "5"  "e"

Matrices with mixed numeric / character data

Alternatively,

cbind(c, c("a", "b", "c", "d", "e"))
##      [,1] [,2] [,3]
## [1,] "1"  "1"  "a" 
## [2,] "2"  "2"  "b" 
## [3,] "3"  "3"  "c" 
## [4,] "4"  "4"  "d" 
## [5,] "5"  "5"  "e"

Remember: a matrix or vector holds either numerical or character elements, never both. Once it contains characters, it can no longer be used directly for numerical calculations.
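
A small sketch of the consequence. The names m, m.char and result are hypothetical; try() is used only so that the error does not stop the script:

```r
m <- matrix(1:5, nrow = 5, ncol = 2)
m.char <- cbind(m, letters[1:5])

is.numeric(m)         # TRUE
is.character(m.char)  # TRUE: the whole matrix was coerced

# Arithmetic on the character matrix fails, even for the "numeric" column
result <- try(m.char[, 1] + 1, silent = TRUE)
inherits(result, "try-error")  # TRUE

# The column must be converted back to numeric first
as.numeric(m.char[, 1]) + 1    # 2 3 4 5 6
```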

Data frames

d <- data.frame("V1" = rnorm(5),
                "V2" = rnorm(5, mean = 5, sd = 2), 
                "V3" = letters[1:5])
d
##            V1       V2 V3
## 1 -0.56047565 8.430130  a
## 2 -0.23017749 5.921832  b
## 3  1.55870831 2.469878  c
## 4  0.07050839 3.626294  d
## 5  0.12928774 4.108676  e

We ‘filled’ a data frame with two randomly generated sets from the normal distribution - where \(V1\) is standard normal and \(V2\) is normal with mean 5 and standard deviation 2 - and a character set.

Data frames (continued)

Data frames can contain both numerical and character elements at the same time, although never in the same column.

You can name the columns and rows in data frames (just like in matrices)

row.names(d) <- c("row 1", "row 2", "row 3", "row 4", "row 5")
d
##                V1       V2 V3
## row 1 -0.56047565 8.430130  a
## row 2 -0.23017749 5.921832  b
## row 3  1.55870831 2.469878  c
## row 4  0.07050839 3.626294  d
## row 5  0.12928774 4.108676  e
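
To check which type each column holds, the data frame can be inspected as sketched below. The call to set.seed() is an assumption added only to make this sketch reproducible; the drawn values themselves do not matter here:

```r
set.seed(123)  # assumption: a seed is set only for reproducibility
d <- data.frame(V1 = rnorm(5),
                V2 = rnorm(5, mean = 5, sd = 2),
                V3 = letters[1:5])
row.names(d) <- paste("row", 1:5)

sapply(d, class)  # the class of each column
dim(d)            # 5 rows and 3 columns
names(d)          # column names: "V1" "V2" "V3"
row.names(d)      # row names: "row 1" up to "row 5"
```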

Calling row elements in data frames

There are two ways to obtain row 3 from data frame d:

d["row 3", ]
##             V1       V2 V3
## row 3 1.558708 2.469878  c

and

d[3, ]
##             V1       V2 V3
## row 3 1.558708 2.469878  c

The intersection between row 2 and column 3 can be obtained by

d[2, 3]
## [1] b
## Levels: a b c d e
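
The Levels line shows that V3 is stored as a factor here. Whether data.frame() converts characters to factors depends on the stringsAsFactors argument, whose default changed to FALSE in R 4.0.0, so recent versions of R print "b" instead. A sketch with both settings made explicit (d.chr and d.fct are hypothetical names):

```r
d.chr <- data.frame(V3 = letters[1:5], stringsAsFactors = FALSE)
d.fct <- data.frame(V3 = letters[1:5], stringsAsFactors = TRUE)

class(d.chr[2, 1])  # "character"
class(d.fct[2, 1])  # "factor"
levels(d.fct$V3)    # "a" "b" "c" "d" "e"
```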

Calling column elements in data frames

Both

d[, "V2"] # and
## [1] 8.430130 5.921832 2.469878 3.626294 4.108676
d[, 2]
## [1] 8.430130 5.921832 2.469878 3.626294 4.108676

yield the second column. But we can also use $ to call variable names in data frame objects

d$V2
## [1] 8.430130 5.921832 2.469878 3.626294 4.108676
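
A sketch comparing the different ways to call a column. The data frame below uses made-up values for illustration only:

```r
# Hypothetical data frame with fixed values
d <- data.frame(V1 = 1:5,
                V2 = c(2.5, 3.1, 4.7, 5.0, 6.2),
                V3 = letters[1:5])

identical(d$V2, d[, "V2"])  # TRUE
identical(d$V2, d[["V2"]])  # TRUE: double brackets also return the vector

# Single brackets without a comma return a data frame with one column
class(d["V2"])              # "data.frame"
```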

Beyond two dimensions

If you wish to use numerical objects that have more than two dimensions, an array would be a suitable object. The following code yields a 3-dimensional array (2 rows, 4 columns and 3 matrices):

e <- array(1:24, dim = c(2, 4, 3))
e
## , , 1
## 
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6    8
## 
## , , 2
## 
##      [,1] [,2] [,3] [,4]
## [1,]    9   11   13   15
## [2,]   10   12   14   16
## 
## , , 3
## 
##      [,1] [,2] [,3] [,4]
## [1,]   17   19   21   23
## [2,]   18   20   22   24

Indexing an array

The square bracket identification works similarly to the identification of matrices and data frames, but with the added dimension(s). For example,

e[1, 3, 2]
## [1] 13

yields the element in the first row of the third column in the second matrix. This is exactly the downside to an array: it is a series of matrices.

In other words, characters and numerical elements may not be mixed.
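
Leaving an index empty selects everything along that dimension, just as with matrices. A brief sketch:

```r
e <- array(1:24, dim = c(2, 4, 3))

dim(e)           # 2 4 3
e[1, 3, 2]       # 13
e[, , 2]         # the complete second matrix (2 x 4)
dim(e[, , 2])    # 2 4
dim(e[1, , ])    # 4 3: the first row of every matrix
```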

Potential problem with array

If we replace the third matrix in the array by a character version of that matrix, we obtain

e[, , 3] <- as.character(e[, , 3])
e
## , , 1
## 
##      [,1] [,2] [,3] [,4]
## [1,] "1"  "3"  "5"  "7" 
## [2,] "2"  "4"  "6"  "8" 
## 
## , , 2
## 
##      [,1] [,2] [,3] [,4]
## [1,] "9"  "11" "13" "15"
## [2,] "10" "12" "14" "16"
## 
## , , 3
## 
##      [,1] [,2] [,3] [,4]
## [1,] "17" "19" "21" "23"
## [2,] "18" "20" "22" "24"

Solution: a list

Lists are just what the name says they are: lists. You can have a list of everything mixed with everything. For example, a simple list can be created by

f <- list(a)
f
## [[1]]
## [1] 1 2 3 4 5

Elements or objects within lists can be called by using double square brackets [[]]. For example, the first (and only) element in list f is object a

f[[1]]
## [1] 1 2 3 4 5
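
The difference between single and double square brackets matters for lists, as this sketch shows:

```r
a <- c(1, 2, 3, 4, 5)
f <- list(a)

class(f[1])    # "list": single brackets return a (smaller) list
class(f[[1]])  # "numeric": double brackets return the object itself
f[[1]][3]      # the third element of the vector inside the list: 3
```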

Lists (continued)

We can simply add an object or element to an existing list

f[[2]] <- d
f
## [[1]]
## [1] 1 2 3 4 5
## 
## [[2]]
##                V1       V2 V3
## row 1 -0.56047565 8.430130  a
## row 2 -0.23017749 5.921832  b
## row 3  1.55870831 2.469878  c
## row 4  0.07050839 3.626294  d
## row 5  0.12928774 4.108676  e

to obtain a list with a vector and a data frame.

Lists (continued)

We can add names to the list as follows

names(f) <- c("vector", "data frame")
f
## $vector
## [1] 1 2 3 4 5
## 
## $`data frame`
##                V1       V2 V3
## row 1 -0.56047565 8.430130  a
## row 2 -0.23017749 5.921832  b
## row 3  1.55870831 2.469878  c
## row 4  0.07050839 3.626294  d
## row 5  0.12928774 4.108676  e

Calling elements in lists

Calling the vector (a) from the list can be done as follows

f[[1]]
## [1] 1 2 3 4 5
f[["vector"]]
## [1] 1 2 3 4 5
f$vector
## [1] 1 2 3 4 5

Lists in lists

Take the following example

g <- list(f, f)

To call the vector from the second list within the list g, use the following code

g[[2]][[1]]
## [1] 1 2 3 4 5
g[[2]]$vector
## [1] 1 2 3 4 5
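
Naming the inner lists can make such calls easier to read. A sketch, where the names first and second and the small data frame are hypothetical:

```r
f <- list(vector = c(1, 2, 3, 4, 5),
          `data frame` = data.frame(x = 1:2))
g <- list(first = f, second = f)

g$second$vector                # 1 2 3 4 5
g[["second"]][["vector"]]      # the same vector
g[["second"]][["data frame"]]  # names with a space need [[ ]] or backticks
```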

Logical operators

  • Logical operators are signs that evaluate a statement to TRUE or FALSE, such as ==, <, >, <=, >=, as well as | (OR) and & (AND). Typing ! before a logical expression negates it (for example, != means "not equal to"). There are more operators, but these are the most useful.

  • For example, if we would like elements out of matrix c that are larger than 3, we would type:

c[c > 3]
## [1] 4 5 4 5

Why does a logical statement on a matrix return a vector?

c > 3
##       [,1]  [,2]
## [1,] FALSE FALSE
## [2,] FALSE FALSE
## [3,] FALSE FALSE
## [4,]  TRUE  TRUE
## [5,]  TRUE  TRUE

The number of TRUE values may differ between columns, so the selected elements cannot in general be arranged as a matrix. A vector as a return is therefore more appropriate.
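
If the positions of the TRUE values are needed, which() can report them. A sketch, with m as a hypothetical stand-in for matrix c:

```r
m <- matrix(1:5, nrow = 5, ncol = 2)

sum(m > 3)                    # number of elements larger than 3: 4
which(m > 3)                  # positions, counted column by column: 4 5 9 10
which(m > 3, arr.ind = TRUE)  # the same positions as row and column numbers
```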

Logical operators (cont’d)

  • If we would like the elements that are smaller than 3 OR larger than 3, we could type
c[c < 3 | c > 3] #c smaller than 3 or larger than 3
## [1] 1 2 4 5 1 2 4 5

or

c[c != 3] #c not equal to 3
## [1] 1 2 4 5 1 2 4 5
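
The & (AND) operator and the negation ! work in the same way. A sketch, again with m as a hypothetical stand-in for matrix c:

```r
m <- matrix(1:5, nrow = 5, ncol = 2)

m[m > 1 & m < 5]  # larger than 1 AND smaller than 5: 2 3 4 2 3 4
m[!(m > 3)]       # NOT larger than 3: 1 2 3 1 2 3
```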

Logical operators (cont’d)

  • In fact, c != 3 returns a matrix
##       [,1]  [,2]
## [1,]  TRUE  TRUE
## [2,]  TRUE  TRUE
## [3,] FALSE FALSE
## [4,]  TRUE  TRUE
## [5,]  TRUE  TRUE
  • Remember c?:
##      [,1] [,2]
## [1,]    1    1
## [2,]    2    2
## [3,]    3    3
## [4,]    4    4
## [5,]    5    5

Things that cannot be done

  • Things that have no representation in real number space (at least not without tremendous effort)
    • For example, the following code returns “Not a Number”
0 / 0
## [1] NaN
  • Also impossible are calculations based on missing values (NA’s)
mean(c(1, 2, NA, 4, 5))
## [1] NA
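
Missing values and NaN can be detected with is.na() and is.nan(). A short sketch, where x is a hypothetical name for the vector used above:

```r
x <- c(1, 2, NA, 4, 5)

is.na(x)        # FALSE FALSE TRUE FALSE FALSE
sum(is.na(x))   # the number of missing values: 1

is.nan(0 / 0)   # TRUE
is.na(0 / 0)    # TRUE: NaN also counts as missing
is.nan(NA)      # FALSE: a missing value is not NaN
```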

Standard solutions for missing values

There are two easy ways to perform “listwise deletion”:

mean(c(1, 2, NA, 4, 5), na.rm = TRUE)
## [1] 3
mean(na.omit(c(1, 2, NA, 4, 5)))
## [1] 3
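
On a data frame, na.omit() removes every row that contains at least one missing value, which is where the term listwise deletion comes from. A sketch with a made-up data frame (the names df and df.complete are hypothetical):

```r
df <- data.frame(x = c(1, 2, NA, 4),
                 y = c("a", NA, "c", "d"))

complete.cases(df)          # TRUE FALSE FALSE TRUE
df.complete <- na.omit(df)  # keeps rows 1 and 4 only
nrow(df.complete)           # 2
mean(df.complete$x)         # 2.5
```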

Floating point example

(round(1740 / 600, 0) - 1740 / 600)
## [1] 0.1
(round(1740 / 600, 0) - 1740 / 600) <= 0.1
## [1] FALSE
(round(1740 / 600, 0) - 1740 / 600) <= 0.11
## [1] TRUE

Floating point example #2

(3 - 2.9)
## [1] 0.1
(3 - 2.9) <= 0.1
## [1] FALSE
(3 - 2.9) - .1
## [1] 8.326673e-17
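
The printed 0.1 is a rounded display: 2.9 and 0.1 cannot be stored exactly in binary, so 3 - 2.9 is slightly larger than 0.1. One common way around this, sketched below, is to compare with a tolerance. The tolerance of 1e-8 is an arbitrary choice for illustration:

```r
x <- 3 - 2.9

x == 0.1                   # FALSE: exact comparison fails
isTRUE(all.equal(x, 0.1))  # TRUE: all.equal() allows a small numerical difference
abs(x - 0.1) < 1e-8        # TRUE: the same idea written out by hand
x <= 0.1 + 1e-8            # TRUE: "smaller than or equal" with a tolerance
```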

Some programming tips:

  • keep your code tidy
  • use comments (text preceded by #) to clarify what you are doing
    • If you look at your code again one month from now, you will not know what you did unless you use comments
  • when working with functions, use the TAB key to quickly access the help for the function’s components
  • work with logically named R-scripts
    • indicate the sequential nature of your work
  • work with RStudio projects
  • if allowed, place your project folders in some cloud-based environment

Practical

How to approach the next practical

Aim to complete the exercises without looking at the answers.

  • Use the answers to evaluate your work
  • Use the help to identify the workings of functions

If this does not work out, switch to the answer-based practical.

In any case, ask for help when you feel help is needed.

  • Do not ‘struggle’ for too long: we only have limited time!