Data analysis, common to all empirical sciences, often requires complete data sets. Unfortunately, real world data collection will usually result in data values not being observed. We present a package for robust multiple imputation (the ImputeRobust package) that allows the use of generalized additive models for location, scale, and shape in the context of chained equations. The paper describes the basics of the imputation technique which builds on a semi-parametric regression model (GAMLSS) and the algorithms and functions provided with the corresponding package. Furthermore, some illustrative examples are provided.
A common approach to allow valid inferences in the presence of missing
data is “Multiple Imputation” (MI) introduced by (Rubin 1987).
Application of MI can be summarized in three steps. The first step is to
create
Basically, there are two ways of specifying imputation models: Joint modelling and fully conditional specification. The joint modelling approach requires specification of a multivariate distribution for the variables whose values are not observed and drawing imputations from their predictive posterior distribution using Markov Chain Monte Carlo (MCMC) techniques. Within this framework, the most common assumption is that the data is multivariate normally distributed (Schafer 1997). Other alternatives for categorical data are based on the latent normal model (Albert and Chib 1993) or the general location model (Little and Rubin 2002).
This methodology is attractive if the assumed multivariate distribution is an appropriate model for the data but may lack the flexibility needed to deal with complex data sets encountered in applications. In such cases, the joint modelling approach may be too restrictive because the typical specification of a multivariate distribution is not sufficiently flexible to allow different continuous and discrete distributions (He and Raghunathan 2009). For example, most of the existing model-based methods and software implementations assume that the data originate from a multivariate normal distribution (van Buuren 2007; e.g. Honaker et al. 2011; Templ et al. 2011). However, the assumption of normality is inappropriate as soon as there are outliers in the data, or in the case of skewed, heavy-tailed or multi-modal distributions, potentially leading to deficient results (He and Raghunathan 2012; van Buuren 2012).
With the fully conditional specification, also known as multivariate imputation by chained equations (van Buuren and Groothuis-Oudshoorn 2011), a univariate imputation model is specified for each variable with missing values conditional on other variables of the data set. Starting from initially bootstrapped imputations, subsequent imputations are drawn by iterating over conditional densities (van Buuren 2007; van Buuren and Groothuis-Oudshoorn 2011). This framework splits high-dimensional imputation models into multiple one-dimensional problems and is appealing as an alternative to joint modelling in cases where a proper multivariate distribution can not be found. The choice of imputation models in this setting can vary depending on the type of variable to be imputed, for example, parametric models like the Bayesian linear regression or logistic regression. Liu et al. (2013) studied the asymptotic properties of this iterative imputation procedure and provided sufficient conditions under which the imputation distribution converges to the posterior distribution of a joint model when the conditional models are compatible.
and de Jong et al. (2014) proposed a new imputation technique based on generalized additive models for location, scale, and shape, GAMLSS, (Rigby and Stasinopoulos 2005), which is a class of univariate regression models, where the assumption of an exponential family is relaxed and replaced by a general distribution family. This allows a more flexible modelling than standard parametric imputation models, not only based on the location (e.g. the mean), but also the scale (e.g. variance), and the shape (e.g., skewness and kurtosis) of the conditional distribution of the dependent variable to be imputed given all other variables. The R package ImputeRobust, described in the next section, provides the functions necessary to apply the GAMLSS imputation technique to missing data problems. It can be used as standalone tool or in combination with mice (van Buuren and Groothuis-Oudshoorn 2011). Salfran (2018) provides an extensive comparison study of the new package and other Multiple Imputation techniques.
The imputation method realized by the package ImputeRobust is based on a generalized additive model for location, scale and shape of the variable to be imputed. Specifically, we adopt the semi-parametric additive formulation of GAMLSS described in (Stasinopoulos and Rigby 2007).
Let
Not all four parameters may be needed, depending on the conditional
distribution
The proposed imputation method selects the conditional distribution
The chosen distribution
The implementation of the imputation method for our simulations uses the
gamlss package (see Rigby and Stasinopoulos 2005; Stasinopoulos and Rigby 2007) in R to fit
model ((1)) based on (penalized) maximum likelihood
estimation and adopting the default link functions.
Rigby and Stasinopoulos (2005) and Stasinopoulos and Rigby (2007) provide a
description of the fitting algorithms used by this package. As smoothing
functions
Let
Fit model
Resample
Refit the above model using
Repeat
The two main functions included in the ImputeRobust package are
the imputing and fitting functions mice.impute.gamlss()
and
ImpGamlssFit()
, respectively.
The function ImpGamlssFit()
is internal and its job is to read in the
data and model parameters and create a bootstrap predictive sampling
function, i.e., it will work through steps 1 to 3 of Algorithm
1. The fitting step depends on the gamlss package,
and the same control options described in
(Stasinopoulos and Rigby 2007) can be passed to ImpGamlssFit()
when
calling the mice()
function. By default, ImpGamlssFit()
uses
P-splines as the smoothers in model ((1)), which are
considered to be stable in the gamlss package, but sometimes the
fitting algorithm may diverge. The implemented imputation method will
catch this event and will try to correct it by automatically cutting
down the complexity of the model.
The necessary formula objects for the model are automatically created by
the function during execution time. The type of imputation model and its
parameters, like the degree of the P-splines, can be controlled with the
argument gam.mod
. Another way of controlling the complexity of the
fitting and imputation steps is the lin.terms
argument. This argument
can be used to specify variables by their name, that should enter model
((1)) linearly.
The default response distribution family used by the fitting and
imputation methods is the normal distribution, but it can be modified
with the argument family
. The selected family determines how many
parameters are to be modelled. To improve the stability of the
imputation method, distributional parameters can be restricted to be the
same for all units. The maximum number of parameters to be fitted is
controlled by n.ind.par
, that is, the value of n.ind.par = 2
, then
the mean and the variance are modelled individually with p-splines, but
the shape parameters are restricted to be the same for all units and are
modelled as contant values.
The function mice.impute.gamlss()
has the same structure as the
imputation methods included in the mice package, meaning that the
named method "gamlss"
can be directly passed as an argument to the
mice()
function. This function also passes all modified default
arguments to the fitting function.
Additional functions are included in the package to set the family
and
n.ind.par
arguments to non-default values. This allows users to mix
different gamlss imputation methods within one call to the function
mice()
. The functions are variantes of mice.impute.gamlss()
where
"gamlss"
is replaced with a method from Table 1 with
the syntax mice.impute.method()
. The name of the function is a
reference to the corresponding family from gamlss.family
(see Stasinopoulos and Rigby 2007)
Method | Model distribution |
---|---|
gamlssNO |
Normal |
gamlssBI |
Binomial |
gamlssGA |
Gamma |
gamlssJSU |
Johnson’s SU |
gamlssPO |
Poisson |
gamlssTF |
Student’s t |
gamlssZIBI |
Zero inflated Binomial |
gamlssZIP |
Zero inflated Poisson |
The imputation methods provided by ImputeRobust add a new
semiparametric method to the already included methods in mice.
This means that it can be used directly with the function mice()
.
As an illustration let us consider an example using the proposed method
to estimate the parameters in a linear regression model with multiple
imputation. To do this we created a data set with
> set.seed(19394)
> n <- 500
> mu <- rep(0, 4)
> Sigma <- diag(4)
> Sigma[1,2] <- 0.15; Sigma[1,3] <- 0.1; Sigma[1,4] <- -0.1
> Sigma[2,3] <- 0.25; Sigma[2,4] <- 0.05
> Sigma[lower.tri(Sigma)] = t(Sigma)[lower.tri(Sigma)]
> require("MASS")
> rawvars <- mvrnorm(n, mu = mu, Sigma = Sigma)
> pvars <- pnorm(rawvars)
> X.1 <- rawvars[,1]
> X.2 <- qchisq(pvars, 3)[,3]
> X.3 <- qpois(pvars, 2.5)[,2]
> X.4 <- qbinom(pvars, 1, .4)[,4]
> data <- cbind(X.1, X.2, X.3, X.4)
> beta <- c(1.8, 1.3, 1, -1)
> sigma <- 4.2
> y <- data %*% beta + rnorm(n, 0, sigma)
> data <- data.frame(y, data)
Thus, we obtain correlated covariates
In this first example, we create missing values in
> r.s <- cbind(y, X.1) %*% c(2,1)
> r.s <- scale(r.s)
> pos <- cut(r.s, quantile(r.s, c(0, .5, 1)), include.lowest=TRUE)
> p.r <- as.numeric(c(.9, .2))
> p.r <- as.vector(p.r[pos])
> R2 <- as.logical(rbinom(length(p.r),1,p.r))
> r.s <- cbind(y[!R2], X.1[!R2]) %*% c(2,1)
> r.s <- scale(r.s)
> pos <- cut(r.s, quantile(r.s, c(0, .4, 1)), include.lowest=TRUE)
> p.r <- as.numeric(c(.32, .27))
> p.r <- as.vector(p.r[pos])
> R3 <- as.logical(rbinom(length(p.r),1,p.r))
> R4 <- runif(nrow(data[!R2,][!R3,]), 0, 1) >= .25
> data$X.2[!R2] <- NA
> data$X.3[!R2][!R3] <- NA
> data$X.4[!R2][!R3][!R4] <- NA
More precisely, to generate missing values in md.pattern()
:
> library(ImputeRobust)
> md.pattern(data)
y X.1 X.4 X.3 X.2
276 1 1 1 1 1 0
66 1 1 1 1 0 1
127 1 1 1 0 0 2
31 1 1 0 0 0 3
0 0 31 158 224 413
The output is generated by the mice package, for details see
(van Buuren and Groothuis-Oudshoorn 2011). It can be seen that only 276 out of the 500
units are fully observed and that the missing pattern is monotone with
the smallest amount of missing values in
The imputation task can be performed with a simple call of the mice()
function:
> predictorMatrix <- matrix(c(rep(c(0,0,1,1,1),2), rep(0,5), c(0,0,1,0,0),
+ c(0,0,1,1,0)), nrow = 5)
> imps <- mice(data, method = "gamlss", predictorMatrix = predictorMatrix,
+ visitSequence = "monotone", maxit = 1, seed = 8913)
iter imp variable
1 1 X.4 X.3 X.2
1 2 X.4 X.3 X.2
1 3 X.4 X.3 X.2
1 4 X.4 X.3 X.2
1 5 X.4 X.3 X.2
In this example with a monotone missing pattern, the missing values are
imputed in accordance with Rubin (1987 171). The variables
are ranked according to the amount of missing values and imputations are
drawn starting with the most frequently observed incomplete variable. In
the next step, the most frequently observed variable of the remaining
incompletely observed variables is imputed. This procedure continues
until all variables with missing values are completed, using completely
observed but also completed variables to impute the next. In mice()
the arguments visitSequence
and predictorMatrix
control the column
order in which incomplete variables are imputed and which variables
serve as predictors in each imputation model, respectively. By setting
these arguments to non-default values the required order of imputing
values is achieved.
The result is an object of class Multiply Imputed Data Set (mids
) with
contents:
> print(imps)
Multiply imputed data set
Call:
mice(data = data, method = "gamlss", predictorMatrix = predictorMatrix,
visitSequence = "monotone", maxit = 1, seed = 8913)
Number of multiple imputations: 5
Missing cells per column:
y X.1 X.2 X.3 X.4
0 0 224 158 31
Imputation methods:
y X.1 X.2 X.3 X.4
"gamlss" "gamlss" "gamlss" "gamlss" "gamlss"
VisitSequence:
X.4 X.3 X.2
5 4 3
PredictorMatrix:
y X.1 X.2 X.3 X.4
y 0 0 0 0 0
X.1 0 0 0 0 0
X.2 1 1 0 1 1
X.3 1 1 0 0 1
X.4 1 1 0 0 0
Random generator seed value: 8913
The argument method = "gamlss"
in the mice
function call implies by
default, that imputations are drawn assuming a normal distribution for
the response variable in the imputation model given all covariables.
This has been shown to lead to acceptable results in situations in which
even non-plausible values are imputed (de Jong et al. 2014). A useful
way of inspecting the distribution of the original and an imputed data
is the stripplot()
function as shown in figure 1.
> library(lattice)
> stripplot(imps)
The model of interest, as per equation ((3)), is the linear
regression of c(1.8, 1.3, 1, -1)
. The imputed data sets can be analysed as follows:
> fit <- with(imps, lm(y ~ X.1 + X.2 + X.3 + X.4))
> print(pool(fit))
Call: pool(object = fit)
Pooled coefficients:
(Intercept) X.1 X.2 X.3 X.4
1.0729769 1.7565698 1.1600009 0.6262502 -0.4023920
Fraction of information about the coefficients missing due to nonresponse:
(Intercept) X.1 X.2 X.3 X.4
0.4477033 0.5111319 0.9276184 0.7934569 0.2925600
> round(summary(pool(fit)), 2)
est se t df Pr(>|t|) lo 95 hi 95 nmis fmi lambda
(Intercept) 1.07 0.57 1.87 22.89 0.07 -0.11 2.26 NA 0.45 0.40
X.1 1.76 0.26 6.79 17.73 0.00 1.21 2.30 0 0.51 0.46
X.2 1.16 0.29 4.00 4.47 0.01 0.39 1.93 224 0.93 0.90
X.3 0.63 0.27 2.33 6.89 0.05 -0.01 1.26 158 0.79 0.74
X.4 -0.40 0.46 -0.87 49.39 0.39 -1.33 0.52 31 0.29 0.26
The default behaviour of the "gamlss"
method of using a normal
distribution can be overridden by setting the argument family
to any
distribution contained in the gamlss.dist package (Stasinopoulos et al. 2017).
This will define globally the new response distribution for all
imputation methods that call the "gamlss"
method. If the user wants to
set different distribution families, to suit the particularities of the
variables to be imputed, several functions fully compatible with
mice()
are already included.
In the previous example, variables
> imps <- mice(data, method = c("", "", "gamlssGA", "gamlssPO", "gamlssBI"),
+ seed = 8913)
iter imp variable
1 1 X.2 X.3 X.4
1 2 X.2 X.3 X.4
1 3 X.2 X.3 X.4
1 4 X.2 X.3 X.4
1 5 X.2 X.3 X.4
2 1 X.2 X.3 X.4
2 2 X.2 X.3 X.4
...
Figure 2 shows the effect of the changes in the response
distribution for the imputation model. Choosing the Gamma distribution
for
> stripplot(imps)
The model of interest in this example is the same as in section 4.1. The results of running the analysis are slightly different due to choosing different imputation models.
> fit <- with(imps, lm(y ~ X.1 + X.2 + X.3 + X.4))
> round(summary(pool(fit)), 2)
est se t df Pr(>|t|) lo 95 hi 95 nmis fmi lambda
(Intercept) 0.78 0.48 1.61 40.46 0.11 -0.20 1.76 NA 0.33 0.30
X.1 1.78 0.25 7.27 18.00 0.00 1.27 2.30 0 0.51 0.46
X.2 1.19 0.13 8.89 10.43 0.00 0.90 1.49 224 0.66 0.60
X.3 0.70 0.21 3.37 8.90 0.01 0.23 1.18 158 0.71 0.65
X.4 -0.32 0.40 -0.80 127.39 0.43 -1.12 0.48 31 0.16 0.15
A real world example data set was retrieved from the Machine Learning
Database Repository at the University of California, Irvine
(Lichman 2013). The dataset contains 400 observations from which some
variables were selected. The data already contained 149 rows where some
values were missing. This example illustrates some of the specific extra
arguments that can be passed to the imputation method, mainly with the
objective of controlling the speed of the gamlss()
function.
> library(RWeka)
> data <- read.arff("data/chronic_kidney_disease_full.arff")
> data <- data[,c("age", "bp", "bgr", "bu", "sc", "sod", "pot", "hemo",
+ "class")]
> data$class <- ifelse(data$class == "ckd", 1, 0)
The missing data pattern is non-monotone. The fully conditional
specification approach behind mice will handle this case. Running
mice
with the default fitting arguments of gamlss
in complex data
sets like this would take a long time. There are several ways in which
the speed of the imputation can be adjusted. One alternative is by
changing the values controlling the GAMLSS fitting algorithm. The
arguments n.cyc
, bf.cyc
, cyc
control respectively the number of
cycles of the outer, backfitting, and inner algorithms of gamlss
(Rigby and Stasinopoulos 2005; Stasinopoulos and Rigby 2007). Other
arguments that can be changed are the convergence criterions of the
outer and inner algorithms with the c.crit
or cc
arguments,
respectively.
The choice of imputation model can also be modified by the user. The
relevant argument is gam.mod
. The default imputation model is a
P-spline with two degrees of freedom and second order differences, this
amounts to
gam.mod = list(type = "pb", par = list(degree = 2, order = 2)
. The
degrees and order can be changed. Also, the model can be set to be
linear using type = "linear"
.
The following code shows the result of imputing missing values of the chronic kidney disease data set. With the current parameters and without any other optimization this took 10 minutes computing time with a Core i7-3520M CPU.
> meth <- c("gamlssJSU", "gamlssJSU", "gamlssJSU", "gamlssJSU", "gamlssJSU",
+ "gamlssJSU", "gamlssJSU", "gamlssJSU", "gamlss")
> imps <- mice(data, method = meth, n.cyc = 3, bf.cyc = 2, cyc = 2,
+ maxit = 5, m = 5, seed = 8901,
+ gam.mod = list(type = "linear"))
iter imp variable
1 1 age bp bgr bu sc sod pot hemo
1 2 age bp bgr bu sc sod pot hemo
1 3 age bp bgr bu sc sod pot hemo
1 4 age bp bgr bu sc sod pot hemo
1 5 age bp bgr bu sc sod pot hemo
2 1 age bp bgr bu sc sod pot hemo
2 2 age bp bgr bu sc sod pot hemo
...
Figure 3 shows that even with an extreme simplification
of the fitting algorithm of gamlss
. The imputation method still
manages to produce acceptable values.
> stripplot(imps)
We used the imputed data set to fit a logistic regression in which we modelled the probability of having chronic kidney disease as a function of age and some other blood related information.
> fit <- with(imps, glm(class ~ age + bp + bgr + sc + sod + pot + hemo,
+ family=binomial(link='logit')))
> round(summary(pool(fit)), 2)
est se t df Pr(>|t|) lo 95 hi 95 nmis fmi lambda
(Intercept) 33.04 12.85 2.57 79.40 0.01 7.47 58.61 NA 0.21 0.19
age -0.03 0.02 -1.16 33.22 0.25 -0.07 0.02 9 0.36 0.32
bp 0.09 0.04 2.45 134.55 0.02 0.02 0.16 12 0.15 0.13
bgr 0.03 0.01 2.29 12.24 0.04 0.00 0.07 44 0.61 0.55
sc 4.27 1.59 2.69 14.12 0.02 0.87 7.67 17 0.57 0.51
sod -0.15 0.09 -1.62 45.86 0.11 -0.34 0.04 87 0.30 0.27
pot -0.68 0.38 -1.82 45.99 0.08 -1.44 0.07 88 0.30 0.27
hemo -1.72 0.37 -4.65 30.81 0.00 -2.48 -0.97 52 0.38 0.34
The imputation method based on gamlss is a fairly new imputation technique which is provided to properly handle situations in which fully parametric assumptions with respect to the conditional distribution of the variable to be imputed is questionable. The technique is based on the idea of attaining imputations avoiding possibly misspecified imputation models. Salfran (2018), de Jong et al. (2014) and de Jong (2012) showed that this technique outperforms standard imputation techniques in several scenarios. The ImputeRobust package is a step forward in the development of an imputation method that requires weak distributional assumptions, but still allows valid inference. It extends the set of applications of the existing GAMLSS package by allowing it to interact with the mice package.
By building on the popular mice package, the advantages of standard and robust imputation procedures are available in a user friendly way. Further research will focus on improving the efficiency of this semi-parametric imputation technique, requiring only weak assumptions for generating multiple imputations but still allowing valid inferences.
The authors gratefully acknowledge financial support from the German Science Foundation via grant SP 930/8-1.
Econometrics, MissingData, MixedModels
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Salfran & Spiess, "Generalized Additive Model Multiple Imputation by Chained Equations With Package ImputeRobust", The R Journal, 2018
BibTeX citation
@article{RJ-2018-014, author = {Salfran, Daniel and Spiess, Martin}, title = {Generalized Additive Model Multiple Imputation by Chained Equations With Package ImputeRobust}, journal = {The R Journal}, year = {2018}, note = {https://rjournal.github.io/}, volume = {10}, issue = {1}, issn = {2073-4859}, pages = {61-72} }