Generalized Additive Model Multiple Imputation by Chained Equations With Package

Data analysis, common to all empirical sciences, often requires complete data sets. Unfortunately, real world data collection will usually result in data values not being observed. We present a package for robust multiple imputation (the ImputeRobust package) that allows the use of generalized additive models for location, scale, and shape in the context of chained equations. The paper describes the basics of the imputation technique which builds on a semi-parametric regression model (GAMLSS) and the algorithms and functions provided with the corresponding package. Furthermore, some illustrative examples are provided.


Introduction
A common approach to allow valid inferences in the presence of missing data is "Multiple Imputation" (MI) introduced by Rubin (1987).Application of MI can be summarized in three steps.The first step is to create m > 1 sets of completed data sets by replacing each missing value with m values drawn from an appropriate posterior predictive distribution ("imputations").In the second step, the required statistical analysis technique is applied to each of the completed data sets as if it were a completely observed data set.The third step is the pooling step, where the results from the m analyses are combined to form the final results, and allows statistical inferences in the usual way.
Basically, there are two ways of specifying imputation models: Joint modelling and fully conditional specification.The joint modelling approach requires specification of a multivariate distribution for the variables whose values are not observed and drawing imputations from their predictive posterior distribution using Markov Chain Monte Carlo (MCMC) techniques.Within this framework, the most common assumption is that the data is multivariate normally distributed (Schafer, 1997).Other alternatives for categorical data are based on the latent normal model (Albert and Chib, 1993) or the general location model (Little and Rubin, 2002).This methodology is attractive if the assumed multivariate distribution is an appropriate model for the data but may lack the flexibility needed to deal with complex data sets encountered in applications.In such cases, the joint modelling approach may be too restrictive because the typical specification of a multivariate distribution is not sufficiently flexible to allow different continuous and discrete distributions (He and Raghunathan, 2009).For example, most of the existing model-based methods and software implementations assume that the data originate from a multivariate normal distribution (e.g.Honaker et al., 2011;Templ et al., 2011;van Buuren, 2007).However, the assumption of normality is inappropriate as soon as there are outliers in the data, or in the case of skewed, heavy-tailed or multimodal distributions, potentially leading to deficient results (van Buuren, 2012;He and Raghunathan, 2012).
With the fully conditional specification, also known as multivariate imputation by chained equations (van Buuren and Groothuis-Oudshoorn, 2011), a univariate imputation model is specified for each variable with missing values conditional on other variables of the data set.Starting from initially bootstrapped imputations, subsequent imputations are drawn by iterating over conditional densities (van Buuren and Groothuis-Oudshoorn, 2011;van Buuren, 2007).This framework splits high-dimensional imputation models into multiple one-dimensional problems and is appealing as an alternative to joint modelling in cases where a proper multivariate distribution can not be found.The choice of imputation models in this setting can vary depending on the type of variable to be imputed, for example, parametric models like the Bayesian linear regression or logistic regression.Liu et al. (2013) studied the asymptotic properties of this iterative imputation procedure and provided sufficient conditions under which the imputation distribution converges to the posterior distribution of a joint model when the conditional models are compatible.
De Jong (2012) and de Jong et al. (2014) proposed a new imputation technique based on generalized additive models for location, scale, and shape, GAMLSS, (Rigby and Stasinopoulos, 2005), which is a class of univariate regression models, where the assumption of an exponential family is relaxed and replaced by a general distribution family.This allows a more flexible modelling than standard parametric imputation models, not only based on the location (e.g. the mean), but also the scale (e.g.variance), and the shape (e.g., skewness and kurtosis) of the conditional distribution of the dependent variable to be imputed given all other variables.The R package ImputeRobust, described The R Journal Vol.

Robust imputation with gamlss and mice
The imputation method realized by the package ImputeRobust is based on a generalized additive model for location, scale and shape of the variable to be imputed.Specifically, we adopt the semiparametric additive formulation of GAMLSS described in Stasinopoulos and Rigby (2007).
Let Y = (Y ij ), i = 1, . . ., n and j = 1, . . ., p be a matrix with n independent observations on p variables and let y ij be the realization of variable Y j with probability function f (y ij |θ ij ) conditional on parameters θ ij = (θ k ij ), k = 1, . . ., 4. Assume that g k are known monotonic link functions relating the distribution parameters θ k ij to explanatory variables by the following equation: where θ k ij for k = 1, 2, 3, 4 are the location, scale and shape parameters of the distribution, X k is a fixed known design matrix, β T k a vector of linear predictors, and h lk (x lk ) are unknown smoothing functions of the explanatory variables.Not all four parameters may be needed, depending on the conditional distribution f (y ij |θ ij ) which will be denoted as D. The package gamlss (Rigby and Stasinopoulos, 2005) provides a wide range of possible continuous and discrete conditional distributions with varying number of parameters, although not all of these distributions have been adopted yet to create imputations (see Table 1 on section "Main functions").
The proposed imputation method selects the conditional distribution D for each of the variables to be imputed with the MICE algorithm (van Buuren and Groothuis-Oudshoorn, 2011).This distribution defaults to normal for continuous data, but other alternatives may be chosen.Users can take advantage of this option to restrict imputations to a certain range, e.g., by specifying a truncated normal distribution.Alternatives already included are Logit and Poisson models to handle binary and count data.
The chosen distribution D defines the type and number of parameters to be modelled, e.g., for the default normal distribution the mean and variance are estimated (individually for each point), but other distributions may require the estimation of the skewness and kurtosis in addition.Adopting models with more parameters increases their flexibility and thus may increase the chance that the imputation procedure is proper in the sense of Rubin (1987).On the other hand, larger sample sizes may be needed to identify the larger number of parameters.

Implementation
The implementation of the imputation method for our simulations uses the gamlss package (see Rigby and Stasinopoulos, 2005;Stasinopoulos and Rigby, 2007) in R to fit model (1) based on (penalized) maximum likelihood estimation and adopting the default link functions.Rigby and Stasinopoulos (2005) and Stasinopoulos and Rigby (2007) provide a description of the fitting algorithms used by this package.As smoothing functions h jk , we use P-splines with 20 knots, a piecewise polynomial of degree three, a second order penalty and automatic selection of the smoothing parameter using the Local Maximum Likelihood criterion (see Eilers and Marx, 1996).To prevent abnormal termination of the algorithm, for example if samples are too small, the degree of the polynomial, the order of the penalty, or the stopping time of the fitting algorithm are reduced as a fallback strategies.
Let Y j be one incompletely observed column of Y.The observed and missing parts of Y j are denoted by Y obs j and Y mis j , respectively.Let Y −j = (Y 1 , . . ., Y j−1 , Y j+1 , . . ., Y p ) denote the collection of variables in Y except Y j .The package gamlss does not support Bayesian inference, hence it is not possible to obtain multiple imputations by drawing from the posterior predictive distribution.However, draws from the predictive posterior distribution are approximated by the bootstrap predictive distribution (Harris, 1989): where η denotes the possible values of the imputation model parameters, η(Y obs j , Y obs −j ) is an estimator The R Journal Vol.10/1, July 2018 ISSN 2073-4859 of such parameters, and f ( η| η(Y obs j , Y obs −j )) is the sampling distribution of the imputation parameters evaluated at the estimated values.The sampling distribution is simulated with a parametric bootstrap acting as a replacement for the posterior distribution of the imputation parameters.Algorithm 1 describes the imputation process.For a clear presentation, we drop the index i. 4. Repeat m times steps 2 and 3 to generate m imputed data sets.

Main functions
The two main functions included in the ImputeRobust package are the imputing and fitting functions mice.impute.gamlss()and ImpGamlssFit(), respectively.
The function ImpGamlssFit() is internal and its job is to read in the data and model parameters and create a bootstrap predictive sampling function, i.e., it will work through steps 1 to 3 of Algorithm 1.The fitting step depends on the gamlss package, and the same control options described in Stasinopoulos and Rigby (2007) can be passed to ImpGamlssFit() when calling the mice() function.By default, ImpGamlssFit() uses P-splines as the smoothers in model (1), which are considered to be stable in the gamlss package, but sometimes the fitting algorithm may diverge.The implemented imputation method will catch this event and will try to correct it by automatically cutting down the complexity of the model.
The necessary formula objects for the model are automatically created by the function during execution time.The type of imputation model and its parameters, like the degree of the P-splines, can be controlled with the argument gam.mod.Another way of controlling the complexity of the fitting and imputation steps is the lin.terms argument.This argument can be used to specify variables by their name, that should enter model (1) linearly.
The default response distribution family used by the fitting and imputation methods is the normal distribution, but it can be modified with the argument family.The selected family determines how many parameters are to be modelled.To improve the stability of the imputation method, distributional parameters can be restricted to be the same for all units.The maximum number of parameters to be fitted is controlled by n.ind.par, that is, the value of k in equation (1).For example, if the Johnson' SU family (a four parameter continuous distribution) is selected and n.ind.par= 2, then the mean and the variance are modelled individually with p-splines, but the shape parameters are restricted to be the same for all units and are modelled as contant values.
The function mice.impute.gamlss()has the same structure as the imputation methods included in the mice package, meaning that the named method "gamlss" can be directly passed as an argument to the mice() function.This function also passes all modified default arguments to the fitting function.
Additional functions are included in the package to set the family and n.ind.pararguments to non-default values.This allows users to mix different gamlss imputation methods within one call to the function mice().The functions are variantes of mice.impute.gamlss()where "gamlss" is replaced with a method from Table 1 with the syntax mice.impute.method().The name of the function is a reference to the corresponding family from gamlss.family(see Stasinopoulos and Rigby, 2007) The R Journal Vol.

Usage
The imputation methods provided by ImputeRobust add a new semiparametric method to the already included methods in mice.This means that it can be used directly with the function mice().

Simple example
As an illustration let us consider an example using the proposed method to estimate the parameters in a linear regression model with multiple imputation.To do this we created a data set with n = 500 units composed of four independent variables and one dependent variable with the following code: > data <-cbind(X.1,X.2, X.3, X.4) > beta <-c(1.8,1.3, 1, -1) > sigma <-4.2 > y <-data %*% beta + rnorm(n, 0, sigma) > data <-data.frame(y,data) Thus, we obtain correlated covariates X.1, . . ., X.4 which are random draws from specific distributions, that is, values for X.1 are drawn from a standard normal random distribution, values for X.2 are drawn from a χ 2 distribution with three degrees of freedom, values for X.3 are drawn from a Poisson distribution with parameter λ = 2.5, and values for X.4 come from a Bernoulli distribution with parameter π = 0.4.The dependent variable, y, is created according to the linear regression model: The vector of linear predictors, β, and the error variance, σ 2 , are chosen so that the coefficient of determination, R 2 , is 0.5.
The imputation task can be performed with a simple call of the mice() function: > predictorMatrix <-matrix(c(rep(c(0,0,1,1,1),2), rep(0,5), c(0,0,1,0,0), + c(0,0,1,1,0)), nrow = 5) > imps <-mice(data, method = "gamlss", predictorMatrix = predictorMatrix, + visitSequence = "monotone", maxit = 1, seed = 8913) iter imp variable 1 In this example with a monotone missing pattern, the missing values are imputed in accordance with Rubin (1987, p. 171).The variables are ranked according to the amount of missing values and imputations are drawn starting with the most frequently observed incomplete variable.In the next step, the most frequently observed variable of the remaining incompletely observed variables is imputed.This procedure continues until all variables with missing values are completed, using completely observed but also completed variables to impute the next.In mice() the arguments visitSequence and predictorMatrix control the column order in which incomplete variables are imputed and which variables serve as predictors in each imputation model, respectively.By setting these arguments to non-default values the required order of imputing values is achieved.

Modifying the imputation model
The default behaviour of the "gamlss" method of using a normal distribution can be overridden by setting the argument family to any distribution contained in the gamlss.distpackage (Stasinopoulos et al., 2017).This will define globally the new response distribution for all imputation methods that call the "gamlss" method.If the user wants to set different distribution families, to suit the particularities of the variables to be imputed, several functions fully compatible with mice() are already included.
In the previous example, variables X.3 and X.4 were treated as continuous variables in the imputation step, and it was possible for X.2 to take on negative values, even though is a strictly positive variable.We will show now how the GAMLSS imputation model can be adjusted to accommodate different types of distributions simultaneously.For the new imputation task, we use a Gamma distribution for X.2, a Poisson distribution for the imputation of X.3, and a Binomial distribution for X.4.The incomplete data set is the same as before, where X.1 and y need not to be imputed.
> imps <-mice(data, method = c("", "", "gamlssGA", "gamlssPO", "gamlssBI"), Figure 2 shows the effect of the changes in the response distribution for the imputation model.Choosing the Gamma distribution for X.2, the Poisson distribution for X.3 and the Binomial distribution for X.4 imputes realistic values as compared to, e.g., choosing the normal distribution.Whether to favor the imputation of realistic values or not, is a different problem.Since the final goal is statistical validity of MI based estimation (Rubin, 1996), in some applications it may be better to allow for the imputation of "unrealistic" while using a flexible model, for example, impute continuous values when the variable to be imputed is a count variable (see de Jong, 2012;Salfran, 2018).

> stripplot(imps)
The model of interest in this example is the same as in section 2.4.1.The results of running the analysis are slightly different due to choosing different imputation models.

Chronic kidney disease data
A real world example data set was retrieved from the Machine Learning Database Repository at the University of California, Irvine (Lichman, 2013).The dataset contains 400 observations from which some variables were selected.The data already contained 149 rows where some values were missing.This example illustrates some of the specific extra arguments that can be passed to the imputation method, mainly with the objective of controlling the speed of the gamlss() function.
> library(RWeka) > data <-read.arff("data/chronic_kidney_disease_full.arff")> data <-data[,c("age", "bp", "bgr", "bu", "sc", "sod", "pot", "hemo", + "class")] > data$class <-ifelse(data$class == "ckd", 1, 0) The missing data pattern is non-monotone.The fully conditional specification approach behind mice will handle this case.Running mice with the default fitting arguments of gamlss in complex data sets like this would take a long time.There are several ways in which the speed of the imputation can be adjusted.One alternative is by changing the values controlling the GAMLSS fitting algorithm.The arguments n.cyc, bf.cyc, cyc control respectively the number of cycles of the outer, backfitting, and inner algorithms of gamlss (Rigby and Stasinopoulos, 2005;Stasinopoulos and Rigby, 2007).Other The R Journal Vol.10/1, July 2018 ISSN 2073-4859 arguments that can be changed are the convergence criterions of the outer and inner algorithms with the c.crit or cc arguments, respectively.
The choice of imputation model can also be modified by the user.The relevant argument is gam.mod.The default imputation model is a P-spline with two degrees of freedom and second order differences, this amounts to gam.mod = list(type = "pb",par = list(degree = 2,order = 2).The degrees and order can be changed.Also, the model can be set to be linear using type = "linear".
The following code shows the result of imputing missing values of the chronic kidney disease data set.With the current parameters and without any other optimization this took 10 minutes computing time with a Core i7-3520M CPU.

> stripplot(imps)
We used the imputed data set to fit a logistic regression in which we modelled the probability of having chronic kidney disease as a function of age and some other blood related information.

Conclusion
The imputation method based on gamlss is a fairly new imputation technique which is provided to properly handle situations in which fully parametric assumptions with respect to the conditional distribution of the variable to be imputed is questionable.The technique is based on the idea of attaining imputations avoiding possibly misspecified imputation models.Salfran (2018), de Jong et al. (2014) and de Jong (2012) showed that this technique outperforms standard imputation techniques in several scenarios.The ImputeRobust package is a step forward in the development of an imputation method that requires weak distributional assumptions, but still allows valid inference.It extends the set of applications of the existing GAMLSS package by allowing it to interact with the mice package.
By building on the popular mice package, the advantages of standard and robust imputation procedures are available in a user friendly way.Further research will focus on improving the efficiency of this semi-parametric imputation technique, requiring only weak assumptions for generating multiple imputations but still allowing valid inferences.

Figure 1 :
Figure 1: Stripplot of the five variables in the original and the five imputed data sets.Observed data values are blue and imputed data values are red.

Figure 2 :
Figure 2: Stripplot of the five variables in the original data and the five imputed data sets.Observed data values are blue and imputed data values are red.

TheFigure 3 :
Figure 3: Stripplot of the chronic kidney data set variables with missings and five imputed data sets.Observed data is in blue and imputed data in red.
Salfran (2018)d Groothuis-Oudshoorn, 2011)ection, provides the functions necessary to apply the GAMLSS imputation technique to missing data problems.It can be used as standalone tool or in combination with mice(van Buuren and Groothuis-Oudshoorn, 2011).Salfran (2018)provides an extensive comparison study of the new package and other Multiple Imputation techniques.