This is the second vignette in a series of six.

The aim of this vignette is to enhance your general understanding of multiple imputation. You will learn how to pool the results of analyses performed on multiply-imputed data, how to approach different types of data, and how to avoid the pitfalls researchers may fall into. The main objective is to increase your knowledge and understanding of applications of multiple imputation.

No previous experience with R is required. Again, we start by loading (with require()) the necessary packages and fixing the random seed so that our results are reproducible.
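The loading step described above might look like the following minimal sketch. It assumes the mice package is installed, and the seed value 123 is an arbitrary choice for illustration.

```r
# Load the imputation package (assumes mice is installed)
library(mice)

# Fix the random seed so that repeated runs give the same imputations;
# 123 is an arbitrary illustrative value
set.seed(123)
```

library() is used here in place of require() because it stops with an error when the package is missing, whereas require() only returns FALSE with a warning.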


1. Vary the number of imputations

The number of imputed data sets can be specified by the m = ... argument. For example, to create just three imputed data sets, specify:

imp <- mice(nhanes, m = 3, printFlag = FALSE)
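As a quick check, the number of imputations is stored in the returned object, and each completed data set can be extracted by its index. The sketch below assumes the mice package is installed; the seed is arbitrary and only there to make the run repeatable.

```r
library(mice)

# Create three imputed data sets from the nhanes example data
imp <- mice(nhanes, m = 3, seed = 123, printFlag = FALSE)

# Number of imputations stored in the returned mids object
imp$m

# Extract the second completed data set (observed plus imputed values)
completed <- complete(imp, 2)

# Check whether any missing values remain
anyNA(completed)
```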

2. Change the predictor matrix

The predictor matrix is a square matrix that specifies the variables that are used to impute each incomplete variable. Let us have a look at the predictor matrix that was used:

imp$predictorMatrix
##     age bmi hyp chl
## age   0   0   0   0
## bmi   1   0   1   1
## hyp   1   1   0   1
## chl   1   1   1   0

Each variable in the data has a row and a column in the predictor matrix. A value of 1 indicates that the column variable was used to impute the row variable. For example, the 1 at entry [bmi, age] indicates that variable age was used to impute the incomplete variable bmi. Note that the diagonal is zero because a variable is not allowed to impute itself. The row of age contains all zeros because there were no missing values in age.

mice gives you complete control over the predictor matrix, enabling you to choose your own predictor relations. This can be very useful, for example, when you have many variables or when you have clear ideas or prior knowledge about relations in the data at hand. You can use mice() to obtain the initial predictor matrix and change it afterwards, without running the algorithm. This can be done by typing:

ini <- mice(nhanes, maxit = 0, printFlag = FALSE)
pred <- ini$predictorMatrix
pred
##     age bmi hyp chl
## age   0   0   0   0
## bmi   1   0   1   1
## hyp   1   1   0   1
## chl   1   1   1   0

The object pred contains the predictor matrix from an initial run of mice with zero iterations, specified by maxit = 0. Altering the predictor matrix and returning it to the mice algorithm is very simple. For example, the following code removes the variable hyp from the set of predictors, but still leaves it to be predicted by the other variables.

pred[, "hyp"] <- 0
pred
##     age bmi hyp chl
## age   0   0   0   0
## bmi   1   0   0   1
## hyp   1   1   0   1
## chl   1   1   0   0
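To make the row and column convention concrete, the following sketch lists, for each row variable, the column variables flagged as its predictors after hyp has been removed. It assumes the mice package is installed; the helper name predictors is hypothetical and used only for illustration. The exact default matrix may differ between mice versions.

```r
library(mice)

# Obtain the default predictor matrix without running the algorithm
ini <- mice(nhanes, maxit = 0, printFlag = FALSE)
pred <- ini$predictorMatrix

# Remove hyp as a predictor for every other variable (set its column to 0)
pred[, "hyp"] <- 0

# For each row variable, list the column variables marked with a 1
# ('predictors' is a hypothetical name chosen for this illustration)
predictors <- lapply(rownames(pred), function(v) colnames(pred)[pred[v, ] == 1])
names(predictors) <- rownames(pred)
predictors
```

Because only the hyp column was zeroed, hyp no longer appears among the predictors of any variable, while the hyp row still lists the variables used to impute hyp itself.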

Use your new predictor matrix in mice() as follows:

imp <- mice(nhanes, predictorMatrix = pred, printFlag = FALSE)

There is a special function called quickpred() for a quick selection procedure of predictors, which can be handy for datasets containing many variables. See ?quickpred for more info. Selecting predictors according to data relations with a minimum correlation of \(\rho=.30\) can be done by

ini <- mice(nhanes, predictorMatrix = quickpred(nhanes, mincor = 0.3), printFlag = FALSE)
ini$predictorMatrix
##     age bmi hyp chl
## age   0   0   0   0
## bmi   1   0   0   1
## hyp   1   0   0   1
## chl   1   1   1   0
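quickpred() can also be called on its own to inspect the selected predictors before imputing. The sketch below assumes the mice package is installed; the choice of excluding hyp through the exclude argument is purely illustrative.

```r
library(mice)

# Select predictors with an absolute correlation of at least 0.3
qp <- quickpred(nhanes, mincor = 0.3)
qp

# Same selection rule, but never use hyp as a predictor
# (excluding hyp is an illustrative choice, not a recommendation)
qp_nohyp <- quickpred(nhanes, mincor = 0.3, exclude = "hyp")
qp_nohyp
```

Inspecting the matrix first makes it easier to see which relations the correlation threshold keeps before committing to a full imputation run.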

For large predictor matrices, it can be useful to export them to Microsoft Excel for easier configuration (e.g. see the xlsx package for easy exporting and importing of Excel files).
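A round trip through a spreadsheet-readable file could be sketched as follows. This example deliberately swaps in base R CSV functions rather than the xlsx package mentioned above, since CSV files open in Excel and need no extra dependency; the temporary file path is an assumption for illustration.

```r
library(mice)

# Default predictor matrix, obtained without running the algorithm
pred <- mice(nhanes, maxit = 0, printFlag = FALSE)$predictorMatrix

# Export to a CSV file that can be opened and edited in a spreadsheet
# (a temporary file is used here purely for illustration)
path <- tempfile(fileext = ".csv")
write.csv(pred, path)

# Read the (possibly edited) file back in as a matrix with row names
pred_edited <- as.matrix(read.csv(path, row.names = 1, check.names = FALSE))

# Pass the re-imported matrix to mice()
imp <- mice(nhanes, predictorMatrix = pred_edited, seed = 123, printFlag = FALSE)
```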

3. Inspect the convergence of the algorithm

The mice() function implements an iterative Markov Chain Monte Carlo type of algorithm. Let us have a look at the trace lines generated by the algorithm to study convergence:

imp <- mice(nhanes, printFlag = FALSE)
plot(imp)

The plot shows the mean (left) and standard deviation (right) of the imputed values only. In general, we would like the streams to intermingle and be free of any trends at the later iterations.
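The quantities behind the trace lines are stored in the imputation object, and the chains can be extended if trends are still visible. The following is a minimal sketch, assuming the mice package is installed; the seed and the choice of 15 additional iterations are arbitrary.

```r
library(mice)

# Run the default number of iterations
imp <- mice(nhanes, seed = 123, printFlag = FALSE)

# Means and variances of the imputed values per variable, iteration and
# imputation; these are the quantities summarised in the trace plot
dim(imp$chainMean)
dim(imp$chainVar)

# Continue the same chains for 15 additional iterations
imp_long <- mice.mids(imp, maxit = 15, printFlag = FALSE)
imp_long$iteration

# Trace lines over the extended run
plot(imp_long)
```

Continuing an existing run with mice.mids() avoids starting the chains from scratch when more iterations are needed to judge convergence.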

The algorithm uses random sampling, and therefore, the results will be (perhaps slightly) different if we repeat the imputations with different seeds. In order to get exactly the same result, use the seed argument:

imp <- mice(nhanes, seed = 123, printFlag = FALSE)

where 123 is some arbitrary number that you can choose yourself. Rerunning this command will always yield the same imputed values.
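The effect of the seed argument can be verified directly. The sketch below assumes the mice package is installed and compares the first completed data set from two runs with the same seed.

```r
library(mice)

# Two runs with the same seed
imp_a <- mice(nhanes, seed = 123, printFlag = FALSE)
imp_b <- mice(nhanes, seed = 123, printFlag = FALSE)

# Compare the first completed data set from each run
identical(complete(imp_a, 1), complete(imp_b, 1))
```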

4. Change the imputation method

For each column, the algorithm requires a specification of the imputation method. To see which method was used by default:

imp$method
##   age   bmi   hyp   chl 
##    "" "pmm" "pmm" "pmm"

The variable age is complete and therefore not imputed, denoted by the "" empty string. The other variables have method pmm, which stands for predictive mean matching, the default in mice for numerical and integer data. In reality, the data are better described as a mix of numerical and categorical data. Let us take a look at the nhanes2 data frame:

summary(nhanes2)
##     age          bmi          hyp          chl       
##  20-39:12   Min.   :20.40   no  :13   Min.   :113.0  
##  40-59: 7   1st Qu.:22.65   yes : 4   1st Qu.:185.0  
##  60-99: 6   Median :26.75   NA's: 8   Median :187.0  
##             Mean   :26.56             Mean   :191.4  
##             3rd Qu.:28.93             3rd Qu.:212.0  
##             Max.   :35.30             Max.   :284.0  
##             NA's   :9                 NA's   :10

and the structure of the data frame:

str(nhanes2)
## 'data.frame':    25 obs. of  4 variables:
##  $ age: Factor w/ 3 levels "20-39","40-59",..: 1 2 1 3 1 3 1 1 2 2 ...
##  $ bmi: num  NA 22.7 NA NA 20.4 NA 22.5 30.1 22 NA ...
##  $ hyp: Factor w/ 2 levels "no","yes": NA 1 1 NA 1 NA 1 1 1 NA ...
##  $ chl: num  NA 187 187 NA 113 184 118 187 238 NA ...

Variable age consists of 3 age categories, while variable hyp is binary. The mice() function takes these properties automatically into account. Impute the nhanes2 dataset:

imp <- mice(nhanes2, printFlag = FALSE)
imp$method
##      age      bmi      hyp      chl 
##       ""    "pmm" "logreg"    "pmm"

Notice that mice has set the imputation method for variable hyp to logreg, which implements multiple imputation by logistic regression.
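The default methods can also be inspected without running any imputations. The sketch below assumes a version of mice that provides make.method() (version 3.0 or later, as far as I am aware).

```r
library(mice)

# Default imputation method per column, chosen from each variable's type
# (assumes mice >= 3.0, which provides make.method())
default_meth <- make.method(nhanes2)
default_meth

# Column classes that drive the choice of method
sapply(nhanes2, class)
```

Comparing the two results side by side shows how the binary factor hyp is matched with logreg while the numeric columns bmi and chl receive pmm.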

An up-to-date overview of the methods in mice can be found by:

methods(mice)
##  [1] mice.impute.2l.norm      mice.impute.2l.pan      
##  [3] mice.impute.2lonly.mean  mice.impute.2lonly.norm 
##  [5] mice.impute.2lonly.pmm   mice.impute.cart        
##  [7] mice.impute.fastpmm      mice.impute.lda         
##  [9] mice.impute.logreg       mice.impute.logreg.boot 
## [11] mice.impute.mean         mice.impute.midastouch  
## [13] mice.impute.norm         mice.impute.norm.boot   
## [15] mice.impute.norm.nob     mice.impute.norm.predict
## [17] mice.impute.passive      mice.impute.pmm         
## [19] mice.impute.polr         mice.impute.polyreg     
## [21] mice.impute.quadratic    mice.impute.rf          
## [23] mice.impute.ri           mice.impute.sample      
## [25] mice.mids                mice.theme              
## see '?methods' for accessing help and source code

Let us change the imputation method for bmi to Bayesian normal linear regression imputation

ini <- mice(nhanes2, maxit = 0)
meth <- ini$method
meth
##      age      bmi      hyp      chl 
##       ""    "pmm" "logreg"    "pmm"
meth["bmi"] <- "norm"
meth
##      age      bmi      hyp      chl 
##       ""   "norm" "logreg"    "pmm"

and run the imputations again:

imp <- mice(nhanes2, method = meth, printFlag = FALSE)
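One way to see the consequence of switching methods is to compare the observed and imputed bmi values. Unlike pmm, which imputes observed values taken from donor cases, norm draws from a fitted normal model, so imputed values need not coincide with observed ones. The sketch assumes the mice package is installed; the seed is arbitrary.

```r
library(mice)

# Start from the default methods and switch bmi to Bayesian linear regression
ini <- mice(nhanes2, maxit = 0)
meth <- ini$method
meth["bmi"] <- "norm"

# Impute with the adapted method vector
imp <- mice(nhanes2, method = meth, seed = 123, printFlag = FALSE)

# Confirm which methods were used
imp$method

# Range of the observed bmi values versus the imputed bmi values
range(nhanes2$bmi, na.rm = TRUE)
range(unlist(imp$imp$bmi))
```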

We may now again plot trace lines to study convergence: