In many contexts, missing data and disclosure control are ubiquitous and challenging issues. In particular, at statistical agencies, the respondent-level data they collect from surveys and censuses can suffer from high rates of missingness. Furthermore, agencies are obliged to protect respondents’ privacy when publishing the collected data for public use. The NPBayesImputeCat R package, introduced in this paper, provides routines to i) create multiple imputations for missing data and ii) create synthetic data for statistical disclosure control, for multivariate categorical data, with or without structural zeros. We describe the Dirichlet process mixture of products of multinomial distributions (DPMPM) model used in the package and illustrate various uses of the package using data samples from the American Community Survey (ACS). We also compare results of the missing data imputation to the mice R package and those of the synthetic data generation to the synthpop R package.
Missing data problems arise in many statistical analyses. To impute missing values, multiple imputation, first proposed by Rubin (1987), has been widely adopted. This approach estimates predictive models based on the observed data, fills in missing values with draws from the predictive models, and produces multiple imputed and completed datasets. Data analysts then apply standard statistical analyses (e.g., regression analysis) on each imputed dataset and use appropriate combining rules to obtain valid point estimates and variance estimates (Rubin 1987).
As a brief review of the multiple imputation combining rules for missing data, let \(q\) be the completed data estimator of some estimand of interest \(Q\), and let \(u\) be the estimator of the variance of \(q\). For \(l=1, \dots, m\), let \(q^{(l)}\) and \(u^{(l)}\) be the values of \(q\) and \(u\) in the \(l\)th completed dataset. The multiple imputation estimate of \(Q\) is equal to \(\bar{q}_m = \sum_{l=1}^m q^{(l)}/m\), and the estimated variance associated with \(\bar{q}_m\) is equal to \(T_m = (1 + 1/m)b_m + \bar{u}_m\), where \(b_m = \sum_{l=1}^m (q^{(l)} - \bar{q}_m)^2/(m-1)\) and \(\bar{u}_m = \sum_{l=1}^m u^{(l)}/m\). Inferences for \(Q\) are based on \((\bar{q}_m - Q) \sim t_{v}(0, T_m)\), where \(t_{v}\) is a \(t\)-distribution with \(v = (m-1)(1 + \bar{u}_m/ [(1+1/m) b_m])^2\) degrees of freedom.
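As a minimal sketch of these rules in R (our own illustration, not a function from any package discussed below), suppose q and u are numeric vectors holding the \(m\) completed-data point estimates \(q^{(l)}\) and variance estimates \(u^{(l)}\):

pool_mi <- function(q, u) {
  m <- length(q)
  qbar <- mean(q)               # multiple imputation estimate of Q
  b <- var(q)                   # between-imputation variance b_m
  ubar <- mean(u)               # average within-imputation variance
  Tm <- (1 + 1 / m) * b + ubar  # total variance T_m
  df <- (m - 1) * (1 + ubar / ((1 + 1 / m) * b))^2
  ci <- qbar + c(-1, 1) * qt(0.975, df) * sqrt(Tm)
  list(estimate = qbar, variance = Tm, df = df, ci95 = ci)
}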
Multiple imputation by chained equations (MICE; Buuren and Groothuis-Oudshoorn 2011) remains the most popular method for generating multiple completed datasets after multiple imputation. Under MICE, one specifies univariate conditional models separately for each variable, usually using generalized linear models (GLMs) or classification and regression trees (CART; Breiman et al. 1984; Burgette and Reiter 2010), and then iteratively samples plausible predicted values from the sequence of conditional models. For implementing MICE in R, most analysts use the mice package. For an in-depth review of the MICE algorithm, see Buuren and Groothuis-Oudshoorn (2011). For more details and reviews, see Rubin (1996), Harel and Zhou (2007), and Reiter and Raghunathan (2007).
Statistical agencies regularly collect information from surveys and censuses and make such information publicly available for various purposes, including research and policymaking. In numerous countries around the world, statistical agencies are legally obliged to protect respondents’ privacy when making this information available to the public. Statistical disclosure control (SDC) is the collection of techniques applied to confidential data before public release for privacy protection. Popular SDC techniques for tabular data include cell suppression and adding noise, and popular SDC techniques for respondent-level data (also known as microdata) include swapping, adding noise, and aggregation. Hundepool et al. (2012) provide a comprehensive review of SDC techniques and applications.
The multiple imputation methodology has been generalized to SDC. One approach to facilitating microdata release is to provide synthetic data. First proposed by Little (1993) and Rubin (1993), the synthetic data approach estimates predictive models based on the original, confidential data, simulates synthetic values with draws from the predictive models, and produces multiple synthetic datasets. Data analysts then apply standard statistical analyses (e.g., regression analysis) on each synthetic dataset and use appropriate combining rules (different from those in multiple imputation) to obtain valid point estimates and variance estimates (Reiter and Raghunathan 2007; Drechsler 2011). Moreover, synthetic data comes in two flavors: fully synthetic data (Rubin 1993), where every variable is deemed sensitive and therefore synthesized, and partially synthetic data (Little 1993), where only a subset of variables is deemed sensitive and synthesized, while the remaining variables are un-synthesized. Statistical agencies can choose between these two approaches depending on their protection goals, and subsequent analyses also differ.
When dealing with fully synthetic data, \(\bar{q}_m\) estimates \(Q\) as in the multiple imputation setting, but the estimated variance associated with \(\bar{q}_m\) becomes \(T_f = (1 + 1/m)b_m - \bar{u}_m\), where \(b_m\) and \(\bar{u}_m\) are defined as in the previous section on multiple imputation. Inferences for \(Q\) are now based on \((\bar{q}_m - Q) \sim t_{v_f}(0, T_f)\), where the degrees of freedom is \(v_f = (m-1)(1-m\bar{u}_m/((m+1)b_m))^2\).
For partially synthetic data, \(\bar{q}_m\) still estimates \(Q\), but the estimated variance associated with \(\bar{q}_m\) is \(T_p = b_m/m + \bar{u}_m\), where \(b_m\) and \(\bar{u}_m\) are defined as in the multiple imputation setting. Inferences for \(Q\) are based on \((\bar{q}_m - Q) \sim t_{v_p}(0, T_p)\), where the degrees of freedom is \(v_p = (m-1)(1 + \bar{u}_m/ [ b_m/m])^2\).
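The two synthesis variants can be sketched in R analogously to pool_mi above (again our own illustration; note that \(T_f\) can be negative in small samples, a known issue often handled by truncating it at zero, which we do not show):

pool_syn <- function(q, u, type = c("full", "partial")) {
  type <- match.arg(type)
  m <- length(q); qbar <- mean(q); b <- var(q); ubar <- mean(u)
  if (type == "full") {           # fully synthetic data: T_f
    Tvar <- (1 + 1 / m) * b - ubar
    df <- (m - 1) * (1 - m * ubar / ((m + 1) * b))^2
  } else {                        # partially synthetic data: T_p
    Tvar <- b / m + ubar
    df <- (m - 1) * (1 + ubar / (b / m))^2
  }
  ci <- qbar + c(-1, 1) * qt(0.975, df) * sqrt(Tvar)
  list(estimate = qbar, variance = Tvar, df = df, ci95 = ci)
}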
For generating synthetic data in R, the synthpop package provides synthetic data drawn from conditional distributions fitted to the confidential data. The conditional distributions are estimated by models chosen by the user, with choices including parametric and CART models. For more details and reviews of synthetic data for statistical disclosure control, see Drechsler (2011).
An important feature of survey data is the existence of structural zeros, which are combinations of variables with probability zero. For example, among combinations of vital signs, there cannot be a deceased patient with a pulse. In household surveys, combinations of relationship and age cannot include a household where a son is older than his biological father. As another example, if a dataset contains information on a record's age and educational attainment in the form of categorical variables, no record can have the combination of being younger than 5 and holding a doctorate degree.
In survey data with many variables, cross-tabulations of variables can result in sparse tables containing both non-structural zeros (combinations that are possible but happen not to appear in the particular dataset) and structural zeros (combinations that are simply impossible). To deal with structural zeros, statistical models must be designed to assign zero probability to every impossible combination, a challenging task.
The NPBayesImputeCat package specializes in estimation, multiple imputation, and synthetic data generation for multivariate categorical data. Unlike mice and synthpop, both of which specify conditional models, NPBayesImputeCat implements the Dirichlet process mixture of products of multinomial distributions (DPMPM), which specifies a joint latent class model on multivariate categorical variables. It uses Dirichlet process (DP) priors to allow effective clustering of the observations. The NPBayesImputeCat package therefore adds to the tools for imputation and synthesis where a joint model might be more suitable than a series of conditional models for multivariate categorical data.
NPBayesImputeCat also allows imputation with structural zeros. It therefore helps fill an important gap in missing data imputation techniques, as currently available R packages do not facilitate imputation with structural zeros, and users might otherwise have to post-process the completed datasets, for example by rejection sampling, to delete generated but impossible cases.
For multiple imputation, the NPBayesImputeCat package allows data with and without structural zeros. For synthetic data, currently, the package only allows data without structural zeros.
The rest of the paper is organized as follows. We first introduce the joint latent class models for multivariate categorical data that the NPBayesImputeCat package applies, that is, the DPMPM model. In addition, we review applications of multiple imputation and synthetic data generation using the DPMPM in the literature. Next, we introduce the sample datasets from the American Community Survey (ACS) to be used in the demonstration, and provide illustrations for both multiple imputation and synthetic data generation using the NPBayesImputeCat package while comparing to other existing R packages. The paper concludes with a summary and discussion.
Proposed by Dunson and Xing (2009), the DPMPM is a Bayesian latent class model developed for multivariate categorical data. To allow for effective clustering of the observations based on all categorical variables, DP priors are specified for the mixture probabilities and multinomial probability vectors of the categorical data. The DPMPM has been shown to capture the complex dependencies in multivariate categorical data while being computationally efficient. In addition, it empowers the data to select the number of latent classes to be used in the model estimation. The model has also been extended to account for structural zeros in categorical data (Manrique-Vallier and Reiter 2014a).
The NPBayesImputeCat package includes two versions of the DPMPM: i) DPMPM without structural zeros, and ii) DPMPM with structural zeros. In this section, we introduce the details of both versions and review previous work on using the DPMPM for multiple imputation and synthetic data.
Our review of the DPMPM without structural zeros closely follows the review in Hu and Hoshino (2018). Consider a sample \(\mathbf{X}\) consisting of \(n\) records, where each record \(i\), with \(i = 1, \ldots, n\), has \(p\) unordered categorical variables. The basic assumption of the DPMPM is that every record \(\mathbf{X}_i = (X_{i1}, \cdots, X_{ip})\) belongs to one of \(K\) underlying unobserved/latent classes. Given the latent class assignment \(z_i\) of record \(i\), as in Equation (2), each variable \(X_{ij}\) independently follows a multinomial distribution, as in Equation (1), where \(d_j\) is the number of categories of variable \(j\), and \(j = 1, \ldots, p\). \[\begin{aligned} \label{lca-x} X_{ij} \mid z_i, \theta &\overset{ind}{\sim} \textrm{Multinomial}(\theta_{z_i1}^{(j)}, \dots, \theta_{z_i d_j}^{(j)}; 1) \,\,\,\, \forall i, j \end{aligned} \tag{1}\]
\[\begin{aligned} \label{lca-z} z_i \mid \pi &\sim \textrm{Multinomial}(\pi_1, \dots, \pi_K; 1) \,\,\,\, \forall i \end{aligned} \tag{2}\]
The marginal probability \(Pr(X_{i1} = x_{i1}, \cdots, X_{ip} = x_{ip} \mid \pi, \theta)\) can be expressed as averaging over the latent classes: \[\label{joint1} Pr(X_{i1} = x_{i1}, \cdots, X_{ip} = x_{ip} \mid \pi, \theta) = \sum_{k=1}^{K} \pi_k \prod_{j=1}^{p} \theta_{kx_{ij}}^{(j)}. \tag{3}\] As pointed out in Si and Reiter (2013), Hu et al. (2014), and Akande et al. (2017), such averaging over latent classes results in dependence among the variables. Equation (3) will also help illustrate the DPMPM with structural zeros in the next section.
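To make Equation (3) concrete, the following toy function (our own illustration, not part of the package) computes the joint probability of a record. Here x is a vector of category indices, pi is a length-\(K\) weight vector, and theta is a list whose \(j\)th element is a \(K \times d_j\) matrix of class-specific multinomial probabilities:

dpmpm_joint_prob <- function(x, pi, theta) {
  p <- length(x)
  # K x p matrix: column j holds theta_{k, x_j}^{(j)} for k = 1, ..., K
  probs <- sapply(seq_len(p), function(j) theta[[j]][, x[j]])
  sum(pi * apply(probs, 1, prod))  # average the products over latent classes
}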
The DPMPM clusters records with similar characteristics based on all \(p\) variables. Relationships among all the variables are induced by integrating out the latent class assignment \(z_i\). To empower the DPMPM to pick the effective number of occupied latent classes, the truncated stick-breaking representation (Sethuraman 1994) of the DP prior is used as in Equations (4) through (7), \[\begin{aligned} \label{prior-pi} \pi_k &= V_{k}\prod_{l<k}(1-V_{l}) \,\,\,\, \textrm{for } k = 1, \dots, K, \end{aligned} \tag{4}\]
\[\begin{aligned} \label{prior-V} V_{k}&\overset{iid}{\sim} \textrm{Beta} (1,\alpha) \,\,\,\, \textrm{for } k = 1, \dots, K-1, \,\,\,\, V_K=1, \end{aligned} \tag{5}\]
\[\begin{aligned} \label{prior-alpha} \alpha &\sim \textrm{Gamma}(a_{\alpha}, b_{\alpha}), \end{aligned} \tag{6}\]
\[\begin{aligned} \label{prior-theta} \boldsymbol{\theta}_{k}^{(j)}=(\theta_{k1}^{(j)},\dots,\theta_{kd_j}^{(j)})&\sim \textrm{Dirichlet} (a_{1}^{(j)}, \dots,a_{d_{j}}^{(j)}) \,\,\,\, \textrm{for } j = 1, \dots, p, \ \ \ k = 1, \dots, K. \end{aligned} \tag{7}\] and a blocked Gibbs sampler is implemented for the Markov chain Monte Carlo (MCMC) sampling procedure (Ishwaran and James 2001; Si and Reiter 2013; Hu et al. 2014; Akande et al. 2017; Hu and Savitsky 2018; Manrique-Vallier and Hu 2018; Drechsler and Hu 2021).
When used as an imputation engine, missing values are handled within the Gibbs sampler. As described in Akande et al. (2017), at MCMC iteration \(l\), one first samples a value of the latent class indicator \(z_i\) from Equation (2), given a draw of the parameters and the observed data. Given the sampled \(z_i\), one then samples missing values using independent draws from Equation (1). This process is repeated for every missing value in the dataset in iteration \(l\), obtaining one imputed dataset.
When used as a data synthesizer, the fully observed confidential dataset is used for model estimation through MCMC, and sensitive variable values are synthesized as an extra step at chosen MCMC iterations. For example, at MCMC iteration \(l\), one samples a value of the latent class indicator \(z_i\) from Equation (2). Given the sampled \(z_i\), one samples synthetic values of the sensitive variables using independent draws from Equation (1). This process is repeated for every record that has sensitive values to be synthesized, obtaining one synthetic dataset.
When structural zeros are present, we need to modify the likelihood to enforce zero probability for impossible combinations; that is, we need to truncate the support of the DPMPM. Following the general description in Manrique-Vallier and Reiter (2014a) and Manrique-Vallier and Hu (2018), let \(\mathcal{C}\) represent all combinations of individuals, including impossible combinations, and let \(\mathcal{MCZ} \subset \mathcal{C}\) be the set of impossible combinations to be excluded. We restrict \(\mathbf{X}\) to the set \(\mathcal{C} \setminus \mathcal{MCZ}\), with \(Pr(\mathbf{X} \in \mathcal{MCZ}) = 0\). The marginal probability in the DPMPM without structural zeros in Equation (3) then becomes \[\label{joint2} Pr(\mathbf{X}_i = \mathbf{x}_i \mid \pi, \theta, \mathcal{MCZ}) \propto I(\mathbf{X}_i \notin \mathcal{MCZ}) \sum_{k=1}^{K} \pi_k \prod_{j=1}^{p} \theta_{kx_{ij}}^{(j)}. \tag{8}\] Letting \(\mathcal{X}^*\) be the sample that contains only possible combinations, the joint likelihood is \[\label{DPMPMzeros_lik} p(\mathcal{X}^* \mid \pi, \theta, \mathcal{MCZ}) \propto \prod_{i=1}^{n} I(\mathbf{X}_i \notin \mathcal{MCZ}) \sum_{k=1}^{K} \pi_k \prod_{j=1}^{p} \theta_{kx_{ij}}^{(j)}. \tag{9}\]
To get the Gibbs sampler to work, we follow the general data augmentation technique proposed by Manrique-Vallier and Reiter (2014a) and assume the existence of an unobserved, augmented sample \(\mathcal{X}^0\) of unknown size \(Nmis\), generated from the DPMPM without structural zeros (i.e., the unrestricted DPMPM). \(\mathcal{X}^0\) only contains records that fall into \(\mathcal{MCZ}\).
The same set of DP priors in Equations (4) through (7) is used in the DPMPM with structural zeros. In the Gibbs sampler, we keep the generated \(\mathcal{X}^0\) and combine it with \(\mathcal{X}^*\) when estimating the model parameters. For computational expedience, the number of observations \(Nmis\) generated in \(\mathcal{X}^0\) is capped at a large fixed value \(Nmax\) at every iteration. When the model is used as either an imputation engine or a data synthesizer, missing values or synthetic values are generated from the truncated likelihood in Equation (9).
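Conceptually, one can also sample a record from the truncated model in Equation (8) by rejection: draw from the unrestricted DPMPM and retry whenever the draw lands in \(\mathcal{MCZ}\). The sketch below is our own illustration with a user-supplied predicate is_structural_zero; the package instead uses the data augmentation scheme described above, which is preferable when \(\mathcal{MCZ}\) carries non-negligible probability mass. pi and theta are parameterized as in the dpmpm_joint_prob sketch above.

r_truncated_dpmpm <- function(pi, theta, is_structural_zero) {
  repeat {
    k <- sample(length(pi), 1, prob = pi)  # latent class z_i, Equation (2)
    x <- sapply(theta, function(th) sample(ncol(th), 1, prob = th[k, ]))  # Equation (1)
    if (!is_structural_zero(x)) return(x)  # keep only possible records
  }
}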
The DPMPM has been adapted as a multiple imputation engine to deal with missing values in categorical data. Some imputation applications have focused on the DPMPM without structural zeros, while others have dealt with the DPMPM with structural zeros.
Among the work on multiple imputation using the DPMPM without structural zeros, Si and Reiter (2013) applied the DPMPM imputation model to impute missing background data (categorical) in the 2007 Trends in International Mathematics and Science Study (TIMSS). The 2007 TIMSS data contain 80 background variables on 90,505 students. Among the 80 categorical background variables, 68 have fewer than 10% missing values, 6 have between 10% and 30% missing values, and 1 has more than 75% missing values.
Akande et al. (2017) designed simulation studies using data from the American Community Survey (ACS) and compared the DPMPM imputation engine to two other widely used multiple imputation engines: i) chained equations using generalized linear models, and ii) chained equations using classification and regression trees (CART). From a population of 671,153 housing units and 35 categorical variables collected and cleaned from the 2012 ACS data, Akande et al. (2017) performed repeated sampling and empirically compared the three multiple imputation models.
Among the work on multiple imputation using the DPMPM with structural zeros, Manrique-Vallier and Reiter (2014b) followed the data augmentation approach of Manrique-Vallier and Reiter (2014a) and imputed missing data in repeated samples from the 5% public use microdata sample from the 2000 United States Census for the state of New York, a population of 953,076 individuals and 10 categorical variables, with the number of levels ranging from 2 to 11.
Finally, Murray (2018) provides an excellent review of practical and theoretical findings of multiple imputation research and highlights the DPMPM imputation engine as a recent development.
The DPMPM has also been used as a synthetic data generator for the public release of useful, privacy-protected micro-level categorical data. Some work has focused on the DPMPM without structural zeros, while other work has dealt with synthetic data problems using the DPMPM with structural zeros.
Among the work on synthetic data generation using the DPMPM without structural zeros, Hu et al. (2014) used the DPMPM to generate fully synthetic data for a subset of 10,000 individuals and 14 categorical variables from the 2012 ACS public use microdata sample for the state of North Carolina. Drechsler and Hu (2021) generated partially synthetic data for large-scale administrative data containing detailed geographic information in Germany. Hu and Savitsky (2018) also used the DPMPM to generate partially synthetic data for the Consumer Expenditure Surveys (CE) at the U.S. Bureau of Labor Statistics (BLS), disseminating detailed county-level geographic information.
Among the work on synthetic data generation using the DPMPM with structural zeros, Manrique-Vallier and Hu (2018) proposed a data augmentation approach and generated fully synthetic data of repeated samples from the 5% public use microdata from the 2000 United States Census for the state of California, a population of 1,690,642 records measured in 17 categorical variables, with the number of levels ranging from 2 to 11.
Before presenting detailed step-by-step illustrations of using the NPBayesImputeCat package for multiple imputation and data synthesis applications, we introduce two samples from the 2016 1-year American Community Survey (ACS), which will both be used in our illustrations.
ACS sample 1, ss16pusa_sample_zeros, contains structural zeros. It will be used to illustrate how to perform multiple imputation and data synthesis tasks when structural zeros are present. ACS sample 2, ss16pusa_sample_nozeros, is a subset of ACS sample 1 and contains no structural zeros. It will be used to illustrate how to perform multiple imputation and data synthesis tasks when structural zeros are not present.
Variable | Description | \(\#\) categories | Category details |
---|---|---|---|
AGEP | Age | 7 | 16; 17; [18, 24]; [25, 35]; [36, 50]; [51, 70]; (70, \(\infty\)) |
MAR | Marital status | 5 | Married; Widowed; Divorced; Separated; Never married. |
SCHL | Educational attainment | 9 | Up to K0; Some K12, no diploma; High school diploma or GED; Some college, no degree; Associate’s degree; Bachelor’s degree; Master’s degree; Professional degree; Doctorate degree. |
SEX | Sex | 2 | Male; Female |
WKL | When last worked | 3 | Within the last 12 months; 1-5 years ago; Over 5 years ago or never worked. |
ACS sample 1 is a random sample of \(n = 1,000\) observations on \(p = 5\) variables; see Table 1 for the data dictionary. The sample is saved as ss16pusa_sample_zeros, and it contains structural zeros: 8 combinations, all related to the AGEP and SCHL variables, listed in Table 2. These 8 cases are derived from the original 2016 1-year ACS data (as the population).
\(\#\) | Description |
---|---|
1 | AGEP = 16 & SCHL = Bachelor’s degree. |
2 | AGEP = 16 & SCHL = Doctorate degree. |
3 | AGEP = 16 & SCHL = Master’s degree. |
4 | AGEP = 16 & SCHL = Professional degree. |
5 | AGEP = 17 & SCHL = Bachelor’s degree. |
6 | AGEP = 17 & SCHL = Doctorate degree. |
7 | AGEP = 17 & SCHL = Master’s degree. |
8 | AGEP = 17 & SCHL = Professional degree. |
To obtain a sample without structural zeros, we take a subset of ACS sample 1, where \(n = 1,000\) and \(p = 3\), dropping the variables AGEP and SCHL to eliminate all structural zeros. This ACS sample 2 is saved as ss16pusa_sample_nozeros; see Table 3 for the data dictionary.
Variable | Description | \(\#\) categories | Category details |
---|---|---|---|
MAR | Marital status | 5 | Married; Widowed; Divorced; Separated; Never married. |
SEX | Sex | 2 | Male; Female |
WKL | When last worked | 3 | Within the last 12 months; 1-5 years ago; Over 5 years ago or never worked. |
To illustrate the applications of the NPBayesImputeCat package to missing data, we introduce 30% missingness for each variable in the ACS sample 1 and ACS sample 2 datasets under the missing completely at random (MCAR) mechanism. The corresponding datasets (data containing missing values) for ACS sample 1 and ACS sample 2 are saved as ss16pusa_sample_zeros_miss and ss16pusa_sample_nozeros_miss, respectively. The DPMPM imputation engine is designed to perform multiple imputation of categorical data that are missing at random (MAR), and thus also of data that are missing completely at random (MCAR).
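The *_miss datasets ship with the package; for readers who want to reproduce this setup on other data, a simple recipe along the following lines (our own sketch, not the authors' exact code) injects 30% MCAR missingness into each column:

make_mcar <- function(data, prop = 0.3) {
  for (j in seq_len(ncol(data))) {
    idx <- sample(nrow(data), size = round(prop * nrow(data)))
    data[idx, j] <- NA  # delete completely at random, column by column
  }
  data
}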
We begin with the imputation of the missing values in ACS sample 2 with 30% missingness, ss16pusa_sample_nozeros_miss, where no structural zeros are present. In the next section, we demonstrate how to impute missing values for ACS sample 1 with 30% missingness, ss16pusa_sample_zeros_miss, where structural zeros are present.
For each sample, we also compare the performance of the DPMPM engine to the most popular multiple imputation method, MICE. We implement the latter using the mice package in R. A brief review of mice is included at the beginning of the paper.
First, we load the sample data, the ACS sample 2 with 30% missingness, and make sure that all variables are unordered factors.
data("ss16pusa_sample_nozeros_miss")
<- ss16pusa_sample_nozeros_miss
X <- ncol(X)
p for (j in 1:p){
<- as.factor(X[,j])
X[,j] }
We use the DPMPM_nozeros_imp function to implement the DPMPM imputation engine without structural zeros. We first review the process for creating and initializing the DPMPM model using the CreateModel function, which enables analysts to tune the number of mixture components through initial runs before generating imputations with DPMPM_nozeros_imp. CreateModel is a wrapper function for creating an object of class "Lcm". Lcm is implemented as an Rcpp module that exposes the C++ implementation of our algorithm. Users can learn more about the Lcm class by typing ?Lcm, which brings up the R documentation for this class, including all methods and properties.
CreateModel takes 7 arguments as input:

- X, the original data with missing values.
- MCZ, the data frame containing the structural zeros definitions; use NULL when structural zeros are not present.
- K, the maximum number of mixture components (i.e., the maximum number of latent classes in the DPMPM imputation engine).
- Nmax, an upper truncation limit for the augmented sample size, that is, the maximum number of observations allowable in the augmented \(\mathcal{X}^0\); use 0 when structural zeros are not present.
- aalpha, the hyperparameter \(a_{\alpha}\) in the stick-breaking prior distribution in Equation (6).
- balpha, the hyperparameter \(b_{\alpha}\) in the stick-breaking prior distribution in Equation (6).
- seed, the seed value.
As a quick demonstration, we let K \(= 30\), aalpha \(=\) balpha \(= 0.25\), and seed \(= 456\). The code below creates and initializes the DPMPM imputation engine without structural zeros for the data stored in X.
<- CreateModel(X = X,
model MCZ = NULL,
K = 30,
Nmax = 0,
aalpha = 0.25,
balpha = 0.25,
seed = 456)
Next, we run the model object for a user-specified number of burn-in iterations, MCMC iterations, and thinning. For example, to run the MCMC sampler for 5 iterations after 2 burn-in iterations, thinning every iteration, and printing the output at each iteration, run the following code.
> model$Run(burnin = 2,
iter = 5,
thinning = 1,
silent = FALSE)
Initializing...
Run model without structural zeros.= 0 kstar = 30 alpha = 1 Nmis = 0
iter = 0 kstar = 30 alpha = 7.81552 Nmis = 0
iter = 1 kstar = 30 alpha = 6.6941 Nmis = 0
iter = 2 kstar = 30 alpha = 4.60622 Nmis = 0
iter = 3 kstar = 30 alpha = 5.67409 Nmis = 0 iter
Here, we show the first few lines of the output. The output prints the iteration index as iter, the number of occupied mixture components or latent classes as kstar, the posterior estimate of \(\alpha\) (the concentration parameter in the stick-breaking prior distribution in Equation (6)) as alpha, and the size of the augmented sample as Nmis. In our demonstration, Nmis is always 0, as the size of the augmented sample is 0 when there are no structural zeros.
It is important to keep track of the value of kstar, as the NPBayesImputeCat package uses the truncated stick-breaking representation of the DP prior (Sethuraman 1994). If the value of kstar is always K, the maximum number of mixture components, we should re-run the DPMPM model with a larger value of K to allow a large enough number of mixture components to cluster the observations. For additional details on setting K, see Hu et al. (2014) and Akande et al. (2017).
The above initial run suggests that the estimation uses almost all latent classes (kstar is close to, or equals, 30, the maximum number of latent classes K we set). It is therefore prudent to increase the value of K when executing the CreateModel command, for example:
> model <- CreateModel(X = X,
MCZ = NULL,
K = 80,
Nmax = 0,
aalpha = 0.25,
balpha = 0.25,
seed = 456)
> model$Run(burnin = 2,
iter = 5,
thinning = 1,
silent = FALSE)
Initializing...
Run model without structural zeros.
iter = 0 kstar = 80 alpha = 1 Nmis = 0
iter = 0 kstar = 78 alpha = 16.4979 Nmis = 0
iter = 1 kstar = 77 alpha = 17.281 Nmis = 0
iter = 2 kstar = 75 alpha = 24.4488 Nmis = 0
iter = 3 kstar = 79 alpha = 26.1196 Nmis = 0
Again, we only show the first few lines of the output. This time, setting K equal to 80 seems sufficiently large. Users should keep track of the value of kstar for the entire run and adjust K accordingly.
To diagnose convergence of parameters in the Gibbs sampler, one can use the EnableTracer option before running the sampler to track certain parameters. Examples of keeping posterior samples of alpha and kstar to assess convergence are included in the accompanying R file.
After setting K based on the initial runs, we now run the DPMPM imputation engine without structural zeros to create \(m\) imputed datasets. The function DPMPM_nozeros_imp takes 10 arguments as input:

- X, the original data with missing values.
- nrun, the total number of MCMC iterations.
- burn, the number of burn-in iterations.
- thin, the number of thinning iterations.
- K, the maximum number of mixture components (i.e., the maximum number of latent classes in the DPMPM imputation engine).
- aalpha, the hyperparameter \(a_{\alpha}\) in the stick-breaking prior distribution in Equation (6).
- balpha, the hyperparameter \(b_{\alpha}\) in the stick-breaking prior distribution in Equation (6).
- m, the number of imputations.
- seed, the seed value.
- silent, default to TRUE; set this parameter to FALSE if more iteration information is to be printed.
The output of DPMPM_nozeros_imp is a list containing:

- impdata, the list of \(m\) imputed datasets.
- origdata, the original data X.
- alpha, the saved posterior draws of \(\alpha\), which can be used to check MCMC convergence.
- kstar, the saved numbers of occupied mixture components, which can be used to check MCMC convergence and track whether the upper bound K is set large enough.
To run the DPMPM_nozeros_imp function to impute missing data for ACS sample 2 with 30% missingness, we run the code below. For this demonstration, we set nrun to 10000, burn to 5000, thin to 50, K to 80, both aalpha and balpha to 0.25, and m to 10. Finally, we set seed to 211.
<- 10
m <- DPMPM_nozeros_imp(X = X,
Imp_DPMPM nrun = 10000,
burn = 5000,
thin = 50,
K = 80,
aalpha = 0.25,
balpha = 0.25,
m = m,
seed = 211,
silent = TRUE)
The printed output from each iteration is omitted here. For a quick diagnostic check on whether the upper bound K is set large enough, we can use the kstar_MCMCdiag function, which takes the following input arguments:

- kstar, the vector output of kstar from running the DPMPM model.
- nrun, the number of MCMC iterations used in running the DPMPM model.
- burn, the number of burn-in iterations used in running the DPMPM model.
- thin, the number of thinning iterations used in running the DPMPM model.

Its output is a list of two MCMC diagnostics figures:

- Traceplot, the traceplot of kstar post burn-in and thinning.
- Autocorrplot, the autocorrelation plot of kstar post burn-in and thinning.
We first load the bayesplot package before using the kstar_MCMCdiag function.

library(bayesplot)
kstar_MCMCdiag(kstar = Imp_DPMPM$kstar,
               nrun = 10000,
               burn = 5000,
               thin = 50)
Figure 1 shows the traceplot of the kstar values after burn-in and thinning. It indicates no convergence issues for the MCMC chain. Moreover, it suggests choosing a smaller K value if we want to achieve an even faster computation time. Figure 2 presents the corresponding autocorrelation plot, which also indicates no convergence issues.
To access the imputed datasets one at a time, we do the following.
<- Imp_DPMPM$impdata[[1]] #for the first imputed dataset impdata1
Analysts then can compute sample estimates for estimands of interest in each imputed dataset and combine them using the combining rules.
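For instance, the per-dataset inputs to the combining rules for one estimand, the proportion of records with WKL equal to "Within the last 12 months" (using the level labels from Table 1), could be computed by hand as below; this is only a sketch, since the package's compute_probs and pool_estimated_probs functions, demonstrated shortly, automate these steps. The resulting q and u vectors are exactly the inputs to the pool_mi sketch in the introduction.

q <- sapply(Imp_DPMPM$impdata,
            function(d) mean(d$WKL == "Within the last 12 months"))
u <- sapply(Imp_DPMPM$impdata,
            function(d) {
              phat <- mean(d$WKL == "Within the last 12 months")
              phat * (1 - phat) / nrow(d)  # variance of a sample proportion
            })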
Before demonstrating how to do so, we first use the mice package to also generate imputations for the same dataset, to facilitate comparisons between results based on the DPMPM model and MICE. The following code runs the MICE algorithm on the same data using the default options in mice for all the arguments except m, which is set to 10 to be consistent with the implementation of the DPMPM engine. The code also reshapes the output of the MICE algorithm so that we can use some of the utility functions in NPBayesImputeCat on the imputed datasets. For more details on the implementation of mice, see Buuren and Groothuis-Oudshoorn (2011).
library(mice)
m <- 10
Imp_MICE <- mice(data = X,
                 m = m,
                 defaultMethod = c("norm", "logreg", "polyreg", "polr"),
                 print = F,
                 seed = 342)

# Reshape the list of imputed datasets
Imp_MICE_reshape <- NULL
Imp_MICE_reshape$impdata <- lapply(1:m, function(x) x <- X)
col_names <- names(Imp_MICE$imp)
for (l in 1:m){
  for (j in col_names){
    na_index_j <- which(is.na((Imp_MICE_reshape$impdata[[l]])[,j]) == TRUE)
    Imp_MICE_reshape$impdata[[l]][na_index_j, j] <- Imp_MICE$imp[[j]][[l]]
  }
}
With the Imp_DPMPM, Imp_MICE, and Imp_MICE_reshape objects, we now demonstrate how to assess the quality of the imputations for the two methods and how to use the combining rules for valid inferences from multiply imputed datasets.
A very common way to assess the quality of the imputations is to compare the estimated distributions in the observed and imputed datasets. We can compare the marginal distributions of any of the variables in the observed and imputed datasets using the marginal_compare_all_imp function. The function takes 3 arguments as input:

- obsdata, the observed data.
- impdata, the list of m imputed datasets.
- vars, the variable of interest.

The output is a list containing:

- Plot, a barplot showing the marginal probability (as a percentage) of each level of the variable in the observed and imputed datasets.
- Comparison, the table of the marginal probabilities (as a percentage) used to make the barplot.
As an example, we can compare the marginal probability of each level of the variable WKL in the observed and imputed datasets for both MICE and the DPMPM engine by using the following code. We load the tidyverse library for making these plots.
library(tidyverse)
marginal_compare_all_imp(obsdata = X,
                         impdata = Imp_DPMPM$impdata,
                         vars = "WKL")
marginal_compare_all_imp(obsdata = X,
                         impdata = Imp_MICE_reshape$impdata,
                         vars = "WKL")
The code creates the plots in Figures 3 and 4.
For the most part in Figures 3 and 4, both DPMPM and MICE produce point estimates from the imputed datasets that are very close to the observed data, which is to be expected under MCAR. There are no major noticeable differences between the two methods. The code can be applied in a similar manner to the other variables, MAR and SEX, as shown in the accompanying R file.
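A simple numeric complement to these plots (our own check, not a package function) is the average absolute difference, in percentage points, between the observed and imputed marginal probabilities:

obs_tab <- prop.table(table(X$WKL))  # table() drops NAs by default
imp_tab <- rowMeans(sapply(Imp_DPMPM$impdata,
                           function(d) prop.table(table(d$WKL))))
round(100 * mean(abs(obs_tab - imp_tab)), 2)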
We now demonstrate how to use the combining rules to obtain single point estimates and corresponding 95% confidence intervals for estimands of interest from all the imputed datasets. First, we compute the point estimates and corresponding standard errors for marginal and joint probabilities from each imputed dataset using the compute_probs function. The function takes 2 arguments as input:

- InputData, the list of m imputed datasets.
- varlist, a list of variable names (or combinations of names) of interest.

The output is a list of the marginal and/or joint probabilities in each imputed dataset. Next, we use the pool_estimated_probs function to pool the estimates from all the imputed datasets using the combining rules. The function takes 2 arguments as input:

- ComputeProbsResults, the output from the compute_probs function.
- method, the combining rules to use, where the options are "imputation", "synthesis_full", and "synthesis_partial".
The output is a list of tables containing the results after applying the combining rules. For example, suppose we are interested in estimating probabilities corresponding to (i) the marginal distribution of MAR, (ii) the marginal distribution of SEX, and (iii) the joint distribution of MAR and WKL. We can use the following code.
<- list(c("MAR"),c("SEX"),c("MAR","WKL")) #probabilities to evaluate
varlist <- compute_probs(InputData = Imp_DPMPM$impdata,
prob_ex1_DPMPM varlist = varlist)
<- pool_estimated_probs(ComputeProbsResults = prob_ex1_DPMPM,
pooledprob_ex1_DPMPM method = "imputation")
<- compute_probs(InputData = Imp_MICE_reshape$impdata,
prob_ex1_MICE varlist = varlist)
<- pool_estimated_probs(ComputeProbsResults = prob_ex1_MICE,
pooledprob_ex1_MICE method = "imputation")
When dealing with missing data imputation, the method must be set to "imputation". The first element of pooledprob_ex1_DPMPM, for MAR, is shown below; the remaining output is omitted for brevity.
                      MAR Estimate   Std.Error       Df Statistic   CI_Lower   CI_Upper
1                Divorced   0.1083 0.011865562 9.000000  9.127254 0.08145823 0.13514177
2                 Married   0.5125 0.020191254 9.000001 25.382277 0.46682421 0.55817579
3 Never married or age<15   0.2953 0.018326210 9.000000 16.113534 0.25384323 0.33675677
4               Separated   0.0204 0.005678460 9.000000  3.592523 0.00755443 0.03324557
5                 Widowed   0.0635 0.008683588 9.000000  7.312645 0.04385636 0.08314364
Each row represents one level of the corresponding variable(s). From left to right, the columns give the variable names and levels, the overall point estimates averaged across all imputed datasets, and the corresponding standard errors, degrees of freedom, test statistics, and confidence intervals. Similarly, for pooledprob_ex1_MICE, we have
                      MAR Estimate   Std.Error Df Statistic    CI_Lower   CI_Upper
1                Divorced   0.1055 0.011501682  9  9.172571 0.079481387 0.13151861
2                 Married   0.5204 0.017156032  9 30.333355 0.481590360 0.55920964
3 Never married or age<15   0.2912 0.015438579  9 18.861839 0.256275507 0.32612449
4               Separated   0.0175 0.005031389  9  3.478165 0.006118207 0.02888179
5                 Widowed   0.0654 0.009065137  9  7.214452 0.044893235 0.08590676
As the output shows, the results from both MICE and DPMPM are once again similar when looking at marginal probabilities of MAR, and both are indeed close to the results from the original sample without any missing data (which are excluded for brevity).
The NPBayesImputeCat package also includes similar functions, fit_GLMs and pool_fitted_GLMs, for fitting generalized linear models (GLMs) to each imputed dataset and pooling the results across all the datasets. The fit_GLMs function takes 2 arguments as input:

- InputData, the list of m imputed datasets.
- exp, the GLM expression for the model of interest (for multinomial models fitted with multinom, the nnet package must be loaded first).

The output is a list containing the estimated parameters from the GLM fitted to each imputed dataset. The pool_fitted_GLMs function pools the GLM estimates from all the imputed datasets using the combining rules. The function takes 2 arguments as input:

- GLMResults, the output from the fit_GLMs function.
- method, the combining rules to use, where the options are "imputation", "synthesis_full", and "synthesis_partial".
For example, to fit a multinomial logistic model of MAR on SEX, we can use the following code.
library(nnet)
model_ex1_DPMPM <- fit_GLMs(InputData = Imp_DPMPM$impdata,
                            exp = multinom(formula = MAR ~ SEX))
pool_fitted_GLMs(GLMResults = model_ex1_DPMPM,
                 method = "imputation")
The second line yields the following output, with the numbers rounded to 4 decimal places.
                   Levels   Parameter Estimate Std.Error      Df Statistic CI_Lower CI_Upper
1                 Married (Intercept)   1.5431    0.1783  9.0033    8.6536   1.1398   1.9465
2                 Married     SEXMale   0.0282    0.2597  9.0150    0.1085  -0.5591   0.6155
3 Never married or age<15 (Intercept)   0.8719    0.1849  9.0034    4.7145   0.4536   1.2902
4 Never married or age<15     SEXMale   0.2626    0.2632  9.0139    0.9976  -0.3327   0.8578
5               Separated (Intercept)  -1.6225    0.4380  9.1346   -3.7044  -2.6111  -0.6339
6               Separated     SEXMale  -0.2124    0.7787 10.3909   -0.2728  -1.9386   1.5138
7                 Widowed (Intercept)  -0.0666    0.2174  9.0059   -0.3062  -0.5584   0.4252
8                 Widowed     SEXMale  -1.5906    0.4781  9.1614   -3.3265  -2.6693  -0.5118
The fit_GLMs and pool_fitted_GLMs functions perform a similar role to the with and pool functions in the mice package. To fit the same model under MICE, we use the following code.
<- with(data = Imp_MICE,
model_ex1_MICE exp = multinom(formula = MAR~SEX))
summary(pool(model_ex1_MICE))
The second line yields the following output, with the numbers rounded to 4 decimal places.
                  y.level        term estimate std.error statistic      df p.value
1                 Married (Intercept)   1.6828    0.1835    9.1720 78.1913  0.0000
2                 Married     SEXMale  -0.1735    0.2745   -0.6321 53.7652  0.5300
3 Never married or age<15 (Intercept)   0.8643    0.1962    4.4051 95.0509  0.0000
4 Never married or age<15     SEXMale   0.2828    0.2980    0.9491 48.8560  0.3473
5               Separated (Intercept)  -1.5754    0.6032   -2.6117 19.3114  0.0170
6               Separated     SEXMale  -0.7507    1.0821   -0.6937 17.6531  0.4969
7                 Widowed (Intercept)   0.1187    0.2481    0.4783 49.8076  0.6345
8                 Widowed     SEXMale  -2.2404    0.6741   -3.3235 32.0590  0.0022
The results are mostly similar, although we note that for most of the estimands, the standard errors are larger for MICE than for the DPMPM engine. However, we also note that this illustration is based on data containing only \(n=1000\) observations with \(30\%\) missing values, so differences in point estimates are not particularly surprising. Additional examples of fitting GLM models to the imputed datasets are shown in the accompanying R file.
We now illustrate how to impute missing values for ACS sample 1 with 30% missingness, using NPBayesImputeCat, where structural zeros are present. Recall that the data is stored in the file ss16pusa_sample_zeros_miss. The general procedure is very similar to the one in the previous section, where structural zeros are not present. However, we need to specify additional inputs to account for the structural zeros when generating the imputed datasets. Once the imputed datasets have been created, the utility functions used to compute sample estimates and pool them using the combining rules are exactly the same as before. First, we create MCZ, the data frame containing the structural zeros definition.
Previously, when there were no structural zeros, MCZ was set to NULL. Here, when structural zero cases are present in the application, one should construct the MCZ data frame following two general rules:

- Variables in MCZ must be factors with the same levels as the original data.
- Placeholder components are represented with NAs.
The script below is a sample script to store the structural zeros definition shown in Table 2.
<- c(16, 16, 16, 16, 17, 17, 17, 17)
AGEP <- c("Bachelor's degree", "Doctorate degree", "Master's degree",
SCHL "Professional degree", "Bachelor's degree", "Doctorate degree",
"Master's degree", "Professional degree")
<- rep(NA, 8)
MAR <- rep(NA, 8)
SEX <- rep(NA, 8)
WKL <- as.data.frame(cbind(AGEP, MAR, SCHL, SEX, WKL)) MCZ
First, we create a vector AGEP consisting of 4 replicates of the value 16 and 4 replicates of the value 17, and a vector SCHL consisting of the degree types that induce structural zero cases with AGEP. Second, we create the vectors MAR, SEX, and WKL, each of length 8 with every element NA. These are placeholder components; since the structural zero cases do not involve these three variables, all elements are NAs. Third, we create a data frame using as.data.frame and cbind. It is necessary to input the variables in the same order as in the original data (the order of variables in Table 1). We save the data frame MCZ for later use.
Now, we load the sample data and make sure that all variables are unordered factors.
data("ss16pusa_sample_zeros_miss")
<- ss16pusa_sample_zeros_miss
X <- ncol(X)
p for (j in 1:p){
<- as.factor(X[,j])
X[,j] <- factor(MCZ[,j], levels = levels(X[,j]))
MCZ[,j] }
Initializing the DPMPM engine follows the same approach as before. The only difference is that we now supply the two arguments specific to structural zeros, MCZ and Nmax. To initialize, we run the following code.
<- CreateModel(X = X,
model MCZ = MCZ,
K = 30,
Nmax = 20000,
aalpha = 0.25,
balpha = 0.25,
seed = 521)
As before, we select K based on initial runs, and we now also do the same for Nmax. If the value of either always hits its set value, we should re-run the model with larger values to allow a large enough number of mixture components and augmented observations to cluster the observations appropriately. As before, we can also save and track posterior samples of the parameters in the sampler using the EnableTracer option. Sample scripts are included in the accompanying R file.
After setting K and Nmax, we can now run the DPMPM imputation engine with structural zeros to create \(m\) imputed datasets. The function DPMPM_zeros_imp takes 12 arguments as input:

- X, the original data with missing values.
- MCZ, the data frame containing the structural zeros definition.
- Nmax, an upper truncation limit for the augmented sample size.
- nrun, the total number of MCMC iterations.
- burn, the number of burn-in iterations.
- thin, the number of thinning iterations.
- K, the maximum number of mixture components (i.e., the maximum number of latent classes in the DPMPM imputation engine).
- aalpha, the hyperparameter \(a_{\alpha}\) in the stick-breaking prior distribution in Equation (6).
- balpha, the hyperparameter \(b_{\alpha}\) in the stick-breaking prior distribution in Equation (6).
- m, the number of imputations.
- seed, the seed value.
- silent, default to TRUE; set this parameter to FALSE if more iteration information is to be printed.
The output of DPMPM_zeros_imp is similar to the output of DPMPM_nozeros_imp, except that it now also includes Nmis, the saved posterior draws of the augmented sample size, which can be used to check MCMC convergence.
To run the DPMPM_zeros_imp function to impute missing data for ACS sample 1 with 30% missingness, we run the code below. For this demonstration, we set Nmax to 200000, nrun to 10000, burn to 5000, thin to 50, K to 80, both aalpha and balpha to 0.25, and m to 10. Finally, we set seed to 653.
<- 10
m <- DPMPM_zeros_imp(X = X,
Imp_DPMPM MCZ = MCZ,
Nmax = 200000,
nrun = 10000,
burn = 5000,
thin = 50,
K = 80,
aalpha = 0.25,
balpha = 0.25,
m = m,
seed = 653,
silent = TRUE)
As before, it is straightforward to run MCMC diagnostics on the tracked elements of Imp_DPMPM, and Imp_DPMPM contains the list of imputed datasets. Analysts can then compute sample estimates for estimands of interest in each imputed dataset, assess their quality, and combine them using the combining rules, with the same functions as before: marginal_compare_all_imp, compute_probs, pool_estimated_probs, fit_GLMs, and pool_fitted_GLMs. Examples are included in the accompanying R file.
Currently, there are no direct options in the mice package to incorporate structural zeros. Therefore, we do not explore comparisons with the MICE engine for data containing structural zeros.
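One ad hoc workaround (a sketch of the rejection-style post-processing mentioned in the introduction, not a mice feature) is to flag completed records that fall into \(\mathcal{MCZ}\) and re-impute or delete them. A helper that flags such rows against the MCZ data frame defined above might look like:

in_MCZ <- function(data, MCZ) {
  bad <- rep(FALSE, nrow(data))
  for (r in seq_len(nrow(MCZ))) {
    hit <- rep(TRUE, nrow(data))
    for (v in names(MCZ)[!is.na(MCZ[r, ])]) {  # non-NA columns define the case
      hit <- hit & (data[[v]] == MCZ[r, v])
    }
    bad <- bad | hit
  }
  bad
}

For example, any(in_MCZ(impdata, MCZ)) would check whether a completed dataset impdata contains any impossible AGEP and SCHL combinations.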
Without loss of generality, suppose we want to generate partially synthetic datasets for ACS sample 2 (ss16pusa_sample_nozeros), where no structural zeros are present. NPBayesImputeCat includes functionality to generate fully synthetic data as well, but we exclude its illustration for brevity. The NPBayesImputeCat package currently does not accommodate data synthesis with structural zeros.
We also implement a popular synthetic data generation method, CART (Reiter 2005), on ACS sample 2 using the synthpop package in R, and compare the results to those of the DPMPM.
First, we load the sample data, the ACS sample 2, and make sure that all variables are set as factors.
data(ss16pusa_sample_nozeros)
X <- ss16pusa_sample_nozeros
p <- ncol(X)
for (j in 1:p){
  X[,j] <- as.factor(X[,j])
}
Initializing the DPMPM synthesizer follows the exact same approach as before for the missing data imputation applications. We run the following code.
<- CreateModel(X = X,
model MCZ = NULL,
K = 80,
Nmax = 0,
aalpha = 0.25,
balpha = 0.25,
seed = 973)
After setting K based on the initial runs, we now run the DPMPM synthesizer without structural zeros to create m synthetic datasets. The function DPMPM_nozeros_syn takes 12 arguments as input:

- X, the original data.
- dj, the vector recording the number of categories of the variables.
- nrun, the total number of MCMC iterations.
- burn, the number of burn-in iterations.
- thin, the number of thinning iterations.
- K, the maximum number of mixture components.
- aalpha, the hyperparameter \(a_{\alpha}\) in the stick-breaking prior distribution in Equation (6).
- balpha, the hyperparameter \(b_{\alpha}\) in the stick-breaking prior distribution in Equation (6).
- m, the number of synthetic datasets.
- vars, the names of the variables to be synthesized.
- seed, the seed value.
- silent, default to TRUE; set this parameter to FALSE if more iteration information is to be printed.
The output of DPMPM_nozeros_syn is a list containing:

- syndata, the list of \(m\) synthetic datasets.
- origdata, the original data X.
- alpha, the saved posterior draws of \(\alpha\), which can be used to check MCMC convergence.
- kstar, the saved numbers of occupied mixture components, which can be used to check MCMC convergence and track whether the upper bound K is set large enough.
To run the DPMPM_nozeros_syn function to generate synthetic data for ACS sample 2, we run the code below. For this demonstration, we create partially synthetic data where marital status (MAR) and when last worked (WKL) are synthesized, and we set nrun to 10000, burn to 5000, thin to 50, K to 80, both aalpha and balpha to 0.25, and seed to 837. Recall that dj stores the vector of the numbers of levels of the variables, which are 5 for MAR, 2 for SEX, and 3 for WKL.
<- c(5, 2, 3)
dj <- 5
m <- DPMPM_nozeros_syn(X = X,
Syn_DPMPM dj = dj,
nrun = 10000,
burn = 5000,
thin = 50,
K = 80,
aalpha = 0.25,
balpha = 0.25,
m = 5,
vars = c("MAR", "WKL"),
seed = 837,
silent = TRUE)
MCMC diagnostics can be run based on the tracked elements of Syn_DPMPM. To access the synthetic datasets one at a time, we do the following.

syndata3 <- Syn_DPMPM$syndata[[3]] # for the third synthetic dataset
Analysts then can compute sample estimates for estimands of interest in each synthetic dataset and combine them using the combining rules. For comparison, we use the synthpop package to generate synthetic data for the same dataset with the CART synthesizer. The default synthesizer in synthpop is CART, and the visit.sequence input argument allows users to indicate which variables are to be synthesized. To match what we have done with NPBayesImputeCat, we set MAR and WKL in visit.sequence to generate partially synthetic data.
library(synthpop)
m <- 5
Syn_CART <- syn(data = X,
                m = 5,
                seed = 123,
                visit.sequence = c("MAR", "WKL"))
Next, we demonstrate how to assess the utility of the synthetic datasets from the two methods and how to use the combining rules.
Similar to what we introduced for assessing the quality of the imputations, for synthesis we compare the marginal distributions of any of the variables in the observed and synthetic datasets, using the marginal_compare_all_syn function. The function takes 3 arguments as input:

- obsdata, the observed data.
- syndata, the list of m synthetic datasets.
- vars, the variable of interest.

The output is a list containing:

- Plot, a barplot showing the marginal probability (as a percentage) of each level of the variable in the observed and synthetic datasets.
- Comparison, the table of the marginal probabilities (as a percentage) used to make the barplot.
The following code compares the marginal probability of each level of MAR.
marginal_compare_all_syn(obsdata = X,
                         syndata = Syn_DPMPM$syndata,
                         vars = "MAR")
marginal_compare_all_syn(obsdata = X,
                         syndata = Syn_CART$syn,
                         vars = "MAR")
The code creates the plots in Figures 5 and 6.
For the most part in Figures 5 and 6, both DPMPM and CART result in point estimates from the synthetic datasets that are very close to the observed data. There are no major noticeable differences between the two methods. The code can be applied in a similar manner to other variables, and examples are included in the accompanying R file.
We now demonstrate how to use the combining rules to obtain single point estimates and corresponding 95% confidence intervals for estimands of interest from all the synthetic datasets. Similar to what we have done with the multiple imputation combining rules previously, we use the same set of functions: compute_probs and pool_estimated_probs.
For example, suppose we are interested in estimating probabilities corresponding to (i) the marginal distribution of MAR, (ii) the marginal distribution of SEX, (iii) the marginal distribution of WKL, and (iv) the joint distribution of MAR and WKL. We can use the following code:
<- list(c("MAR"), c("SEX"), c("WKL"), c("MAR","WKL")) #probabilities to evaluate
varlist <- compute_probs(InputData = Syn_DPMPM$syndata,
prob_ex1_DPMPM varlist = varlist)
<- pool_estimated_probs(ComputeProbsResults = prob_ex1_DPMPM,
pooledprob_ex1_DPMPM method = "synthesis_partial")
<- compute_probs(InputData = Syn_CART$syn,
prob_ex1_CART varlist = varlist)
<- pool_estimated_probs(ComputeProbsResults = prob_ex1_CART,
pooledprob_ex1_CART method = "synthesis_partial")
As noted before, when dealing with partially synthetic data, the method must be set to "synthesis_partial". We only include the first element of pooledprob_ex1_DPMPM, for MAR, for brevity.
                      MAR Estimate   Std.Error        Df Statistic    CI_Lower   CI_Upper
1                Divorced   0.1072 0.010406613 293.63208 10.301142 0.086718995 0.12768100
2                 Married   0.5112 0.016738961 338.91020 30.539531 0.478274660 0.54412534
3 Never married or age<15   0.2944 0.016234667  88.41528 18.134034 0.262139123 0.32666088
4               Separated   0.0158 0.004718411  43.64386  3.348585 0.006288476 0.02531152
5                 Widowed   0.0714 0.008826970 178.61153  8.088846 0.053981434 0.08881857
Each row represents one level of the corresponding variable(s). From left to right, the columns give the variable names and levels, the overall point estimates averaged across all synthetic datasets, and the corresponding standard errors, degrees of freedom, test statistics, and confidence intervals. Similarly, for pooledprob_ex1_CART, we have
                      MAR Estimate   Std.Error       Df Statistic    CI_Lower   CI_Upper
1                Divorced   0.1104 0.011044293 104.5372  9.996113 0.088500078 0.13229992
2                 Married   0.5244 0.017530887 111.6932 29.912919 0.489663753 0.55913625
3 Never married or age<15   0.2842 0.015588823 149.5745 18.231011 0.253397249 0.31500275
4               Separated   0.0138 0.004054405 134.0083  3.403705 0.005781098 0.02181890
5                 Widowed   0.0672 0.008348413 392.0400  8.049434 0.050786740 0.08361326
As the output shows, the results from both CART and DPMPM are once again similar when looking at marginal probabilities of MAR, and both are indeed close to the results from the original sample (excluded for brevity).
Lastly, we demonstrate the use of fit_GLMs and pool_fitted_GLMs for fitting generalized linear models (GLMs) to each synthetic dataset and pooling the results across all the datasets. For example, to fit a logistic regression of SEX given WKL and MAR, we can use the following code.
<- fit_GLMs(InputData = Syn_DPMPM$syndata,
model_ex1_DPMPM exp = glm(formula = SEX~WKL+MAR,
family = binomial))
pool_fitted_GLMs(GLMResults = model_ex1_DPMPM,
method = "synthesis_partial")
The second line yields the following output, with the numbers rounded to 4 decimal places.
                                    Estimate Std.Error       Df Statistic CI_Lower CI_Upper
(Intercept)                          -0.5954    0.3548  44.1182   -1.6781  -1.3105   0.1196
WKLOver 5 years ago or never worked   0.0946    0.2823 179.2380    0.3351  -0.4625   0.6517
WKLWithin the last 12 months          0.3629    0.3042  33.3456    1.1929  -0.2558   0.9816
MARMarried                            0.3517    0.2581  48.6246    1.3625  -0.1671   0.8706
MARNever married or age<15            0.4293    0.2560 119.8199    1.6771  -0.0775   0.9360
MARSeparated                          0.0710    0.6589 545.7313    0.1078  -1.2232   1.3652
MARWidowed                           -1.0156    0.3864 543.1589   -2.6285  -1.7746  -0.2566
We use the following code to fit the same model under CART.
<- fit_GLMs(InputData = Syn_CART$syn,
model_ex1_CART exp = glm(formula = as.factor(SEX)~WKL+MAR,
family = binomial))
pool_fitted_GLMs(GLMResults = model_ex1_CART,
method = "synthesis_partial")
The second line yields the following output, with the numbers rounded to 4 decimal places.
                                    Estimate Std.Error        Df Statistic CI_Lower CI_Upper
(Intercept)                          -0.0292    0.3155  103.8094   -0.0927  -0.6549   0.5964
WKLOver 5 years ago or never worked  -0.0610    0.2941   56.6928   -0.2074  -0.6499   0.5279
WKLWithin the last 12 months         -0.1309    0.2602  125.4832   -0.5032  -0.6460   0.3841
MARMarried                            0.0316    0.2151 3096.9978    0.1471  -0.3902   0.4534
MARNever married or age<15            0.0230    0.2401  319.0640    0.0956  -0.4495   0.4954
MARSeparated                          0.3378    0.6530  187.2752    0.5172  -0.9505   1.6260
MARWidowed                            0.1148    0.3365  477.2065    0.3412  -0.5464   0.7760
The output of fitting this GLM to the original sample is used as the benchmark for our utility evaluation.
Pr(>|z|)
Estimate Std. Error z value -0.5937 0.2897 -2.050 0.040395 *
(Intercept) 5 years ago or never worked 0.2859 0.2539 1.126 0.260163
WKLOver 12 months 0.5054 0.2358 2.144 0.032065 *
WKLWithin the last 0.1380 0.2129 0.649 0.516647
MARMarried <15 0.3994 0.2273 1.757 0.078965 .
MARNever married or age-0.3538 0.5463 -0.648 0.517183
MARSeparated -1.3673 0.3867 -3.536 0.000406 *** MARWidowed
The utility results differ between DPMPM and CART, and we note that for most of the estimands, the DPMPM produces more accurate estimates than CART when compared to the results from the original sample. This indicates that, for this particular sample, a joint model fitted by the DPMPM does a better job of preserving relationships among the variables than a series of conditional models fitted by CART. Additional examples of fitting GLM models to the synthetic datasets are shown in the accompanying R file.
In this paper, we have presented the DPMPM models for multivariate categorical data and illustrations of using the NPBayesImputeCat package for multiple imputation and synthetic data applications. Users can take the output and extract imputed and synthetic datasets, then conduct statistical analyses of their choice and use the appropriate combining rules to obtain valid estimates. Interested readers can refer to the package documentation for additional features.
While the NPBayesImputeCat package has been developed primarily for multiple imputation and synthetic data purposes, users can also use it for DPMPM model estimation. For example, following the illustrations for synthetic data, a data analyst is able to obtain parameter draws of several key parameters from the MCMC chain: i) the DP concentration parameter \(\alpha\), ii) the mixture probability vectors \(\{\pi_k\}\), and iii) the Multinomial probability vectors \(\{\theta_k^{(j)}\}\). The analyst can then further conduct analyses of the clustering of the observations in the MCMC chain and other questions of interest.
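For instance, quick posterior summaries of the saved draws can be obtained as below, using the alpha and kstar components documented earlier; obtaining draws of \(\{\pi_k\}\) and \(\{\theta_k^{(j)}\}\), and of the class assignments \(z_i\), additionally requires tracking them with the tracer, which we do not show here.

summary(Imp_DPMPM$alpha)  # posterior draws of the DP concentration parameter
table(Imp_DPMPM$kstar)    # how many latent classes are actually occupied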
Users can report bugs at our GitHub repo: https://github.com/monika76five/NPBayesImputeCat.