Multiple Imputation and Synthetic Data Generation with NPBayesImputeCat

In many contexts, missing data and disclosure control are ubiquitous and challenging issues. In particular, at statistical agencies, the respondent-level data they collect from surveys and censuses can suffer from high rates of missingness. Furthermore, agencies are obliged to protect respondents’ privacy when publishing the collected data for public use. The NPBayesImputeCat R package, introduced in this paper, provides routines to i) create multiple imputations for missing data and ii) create synthetic data for statistical disclosure control, for multivariate categorical data, with or without structural zeros. We describe the Dirichlet process mixture of products of the multinomial distributions model used in the package and illustrate various uses of the package using data samples from the American Community Survey (ACS). We also compare results of the missing data imputation to the mice R package and those of the synthetic data generation to the synthpop R package. Introduction and background Multiple imputation for missing data Missing data problems arise in many statistical analyses. To impute missing values, multiple imputation, first proposed by Rubin (1987), has been widely adopted. This approach estimates predictive models based on the observed data, fills in missing values with draws from the predictive models, and produces multiple imputed and completed datasets. Data analysts then apply standard statistical analyses (e.g., regression analysis) on each imputed dataset and use appropriate combining rules to obtain valid point estimates and variance estimates (Rubin, 1987). As a brief review of the multiple imputation combining rules for missing data, let q be the completed data estimator of some estimand of interest Q, and let u be the estimator of the variance of q. For l = 1, . . . , m, let q(l) and u(l) be the values of q and u in the lth completed dataset. The multiple imputation estimate of Q is equal to q̄m = ∑l=1 q (l)/m, and the estimated variance associated with q̄m is equal to Tm = (1 + 1/m)bm + ūm , where bm = ∑l=1(q (l) − q̄m)/(m − 1) and ūm = ∑l=1 u (l)/m. Inferences for Q are based on (q̄m − Q) ∼ tv(0, Tm), where tv is a t-distribution with v = (m − 1)(1 + ūm/[(1 + 1/m)bm]) degrees of freedom. Multiple imputation by chained equations (MICE, Buuren and Groothuis-Oudshoorn (2011)) remains the most popular method for generating multiple completed datasets after multiple imputation. Under MICE, one specifies univariate conditional models separately for each variable, usually using generalized linear models (GLMs) or classification and regression trees (CART Breiman et al. (1984); Burgette and Reiter (2010)), and then iteratively samples plausible predicted values from the sequence of conditional models . For implementing MICE in R, most analysts use the mice package. For an in-depth review of the MICE algorithm, see Buuren and Groothuis-Oudshoorn (2011). For more details and reviews, see Rubin (1996), Harel and Zhou (2007), Reiter and Raghunathan (2007). Synthetic data for statistical disclosure control Statistical agencies regularly collect information from surveys and censuses and make such information publicly available for various purposes, including research and policymaking. In numerous countries around the world, statistical agencies are legally obliged to protect respondents’ privacy when making this information available to the public. Statistical disclosure control (SDC) is the collection of techniques applied to confidential data before public release for privacy protection. Popular SDC techniques for tabular data include cell suppression and adding noise, and popular SDC techniques for respondent-level data (also known as microdata) include swapping, adding noise, and aggregation. Hundepool et al. (2012) provide a comprehensive review of SDC techniques and applications. The multiple imputation methodology has been generalized to SDC. One approach to facilitating microdata release is to provide synthetic data. First proposed by Little (1993) and Rubin (1993), the synthetic data approach estimates predictive models based on the original, confidential data, simulates synthetic values with draws from the predictive models, and produces multiple synthetic datasets. Data analysts then apply standard statistical analyses (e.g., regression analysis) on each synthetic dataset and use appropriate combining rules (different from those in multiple imputation) to obtain valid point estimates and variance estimates (Reiter and Raghunathan, 2007; Drechsler, The R Journal Vol. 13/2, December 2021 ISSN 2073-4859 CONTRIBUTED RESEARCH ARTICLES 91 2011). Moreover, synthetic data comes in two flavors: fully synthetic data (Rubin, 1993), where every variable is deemed sensitive and therefore synthesized, and partially synthetic data (Little, 1993), where only a subset of variables is deemed sensitive and synthesized, while the remaining variables are un-synthesized. Statistical agencies can choose between these two approaches depending on their protection goals, and subsequent analyses also differ. When dealing with fully synthetic data, q̄m estimates Q as in the multiple imputation setting, but the estimated variance associated with q̄m becomes Tf = (1 + 1/m)bm − ūm , where bm and ūm are defined as in previous section on multiple imputation. Inferences for Q are now based on (q̄m − Q) ∼ tv(0, Tf ), where the degrees of freedom is v f = (m − 1)(1 − mūm/((m + 1)bm)). For partially synthetic data, q̄m still estimates Q but the estimated variance associated with q̄m is Tp = bm/m + ūm , where bm and ūm are defined as in the multiple imputation setting. Inferences for Q are based on (q̄m − Q) ∼ tv(0, Tp), where the degrees of freedom is vp = (m − 1)(1 + ūm/[bm/m]). For synthetic data with R, synthpop provides synthetic data generated by drawing from conditional distributions fitted to the confidential data. The conditional distributions are estimated by models chosen by the user, whose choices include parametric or CART models. For more details and reviews of synthetic data for statistical disclosure control, see Drechsler (2011).


Multiple imputation for missing data
Missing data problems arise in many statistical analyses. To impute missing values, multiple imputation, first proposed by Rubin (1987), has been widely adopted. This approach estimates predictive models based on the observed data, fills in missing values with draws from the predictive models, and produces multiple imputed and completed datasets. Data analysts then apply standard statistical analyses (e.g., regression analysis) on each imputed dataset and use appropriate combining rules to obtain valid point estimates and variance estimates (Rubin, 1987).
As a brief review of the multiple imputation combining rules for missing data, let q be the completed data estimator of some estimand of interest Q, and let u be the estimator of the variance of q. For l = 1, . . . , m, let q (l) and u (l) be the values of q and u in the lth completed dataset. The multiple imputation estimate of Q is equal toq m = ∑ m l=1 q (l) /m, and the estimated variance associated withq m is equal to T m = (1 + 1/m)b m +ū m , where b m = ∑ m l=1 (q (l) −q m ) 2 /(m − 1) andū m = ∑ m l=1 u (l) /m. Inferences for Q are based on (q m − Q) ∼ t v (0, T m ), where t v is a t-distribution with v = (m − 1)(1 +ū m /[(1 + 1/m)b m ]) 2 degrees of freedom.
Multiple imputation by chained equations (MICE, Buuren and Groothuis-Oudshoorn (2011)) remains the most popular method for generating multiple completed datasets after multiple imputation. Under MICE, one specifies univariate conditional models separately for each variable, usually using generalized linear models (GLMs) or classification and regression trees (CART Breiman et al. (1984); Burgette and Reiter (2010)), and then iteratively samples plausible predicted values from the sequence of conditional models . For implementing MICE in R, most analysts use the mice package. For an in-depth review of the MICE algorithm, see Buuren and Groothuis-Oudshoorn (2011). For more details and reviews, see Rubin (1996), Harel and Zhou (2007), Reiter and Raghunathan (2007).

Synthetic data for statistical disclosure control
Statistical agencies regularly collect information from surveys and censuses and make such information publicly available for various purposes, including research and policymaking. In numerous countries around the world, statistical agencies are legally obliged to protect respondents' privacy when making this information available to the public. Statistical disclosure control (SDC) is the collection of techniques applied to confidential data before public release for privacy protection. Popular SDC techniques for tabular data include cell suppression and adding noise, and popular SDC techniques for respondent-level data (also known as microdata) include swapping, adding noise, and aggregation. Hundepool et al. (2012) provide a comprehensive review of SDC techniques and applications.
The multiple imputation methodology has been generalized to SDC. One approach to facilitating microdata release is to provide synthetic data. First proposed by Little (1993) and Rubin (1993), the synthetic data approach estimates predictive models based on the original, confidential data, simulates synthetic values with draws from the predictive models, and produces multiple synthetic datasets. Data analysts then apply standard statistical analyses (e.g., regression analysis) on each synthetic dataset and use appropriate combining rules (different from those in multiple imputation) to obtain valid point estimates and variance estimates (Reiter and Raghunathan, 2007;Drechsler, 2011). Moreover, synthetic data comes in two flavors: fully synthetic data (Rubin, 1993), where every variable is deemed sensitive and therefore synthesized, and partially synthetic data (Little, 1993), where only a subset of variables is deemed sensitive and synthesized, while the remaining variables are un-synthesized. Statistical agencies can choose between these two approaches depending on their protection goals, and subsequent analyses also differ.
When dealing with fully synthetic data,q m estimates Q as in the multiple imputation setting, but the estimated variance associated withq m becomes T f = (1 + 1/m)b m −ū m , where b m and u m are defined as in previous section on multiple imputation. Inferences for Q are now based on (q m − Q) ∼ t v (0, T f ), where the degrees of freedom is v f = (m − 1)(1 − mū m /((m + 1)b m )) 2 .
For partially synthetic data,q m still estimates Q but the estimated variance associated withq m is T p = b m /m +ū m , where b m andū m are defined as in the multiple imputation setting. Inferences for Q are based on (q m − Q) ∼ t v (0, T p ), where the degrees of freedom is v p = (m − 1)(1 +ū m /[b m /m]) 2 .
For synthetic data with R, synthpop provides synthetic data generated by drawing from conditional distributions fitted to the confidential data. The conditional distributions are estimated by models chosen by the user, whose choices include parametric or CART models. For more details and reviews of synthetic data for statistical disclosure control, see Drechsler (2011).

Structural zeros
An important feature of survey data is the existence of structural zeros, which are combinations of variables with probability zero. For example, in the combinations of variables of vital signs, there should not exist a deceased patient with a pulse. For the household surveys, in the combinations of variables of relationship and age, there should not exist a household where a son is older than his biological father. As another example, if a dataset contains information of a record's age and educational attainment in the form of categorical variables, there can be no record having the combination of being younger than 5 and having a doctorate degree.
In survey data with many variables, cross-tabulations of variables could result in sparse tables, containing non-structural zeros (combinations that are possible but happen not to exist in the particular dataset) and structural zeros (combinations that are simply impossible). To deal with structural zeros, many advanced statistical models are designed to assign zero probability for every impossible combination, which is a challenging task.

What NPBayesImputeCat does
The NPBayesImputeCat package specializes in estimating and performing multiple imputation and synthetic data generation for multivariate categorical data. Unlike mice and synthpop, both of which specify conditional models, the NPBayesImputeCat implements the Dirichlet process mixture of products of multinomial distributions (DPMPM), which specifies a joint latent class model on multivariate categorical variables. It uses Dirichlet process (DP) priors to allow effective clustering of the observations. Therefore, the NPBayesImputeCat package adds to the tools of imputation and synthesis, where a joint model might be more suitable than a series of conditional models for multivariate categorical data.
NPBayesImputeCat also allows imputation with structural zeros. It, therefore, helps fill an important gap in missing data imputation techniques, as currently available R packages do not facilitate imputation with structural zeros, and users might have to post-process, such as rejection sampling to delete generated but impossible cases.
For multiple imputation, the NPBayesImputeCat package allows data with and without structural zeros. For synthetic data, currently, the package only allows data without structural zeros.

The structure of this paper
The rest of the paper is organized as follows. We first introduce the joint latent class models for multivariate categorical data that the NPBayesImputeCat package applies, that is, the DPMPM model. In addition, we review applications of multiple imputation and synthetic data generation using the DPMPM in the literature. Next, we introduce the sample datasets from the American Community Survey (ACS) to be used in the demonstration, and provide illustrations for both multiple imputation and synthetic data generation using the NPBayesImputeCat package while comparing to other existing R packages. The paper concludes with a summary and discussion.

The DPMPM model
Proposed by Dunson and Xing (2009), the DPMPM is a Bayesian latent class model developed for multivariate categorical data. To allow for effective clustering of the observations based on all categorical variables, DP priors are specified for the mixture probabilities and multinomial probability vectors of the categorical data. The DPMPM has been shown to capture the complex dependencies in multivariate categorical data while being computationally efficient. In addition, it empowers the data to select the number of latent classes to be used in the model estimation. The model has also been extended to account for structural zeros in categorical data (Manrique-Vallier and Reiter, 2014a).
The NPBayesImputeCat package includes two versions of the DPMPM: i) DPMPM without structural zeros, and ii) DPMPM with structural zeros. In this section, we introduce the details of both versions and review previous work on using the DPMPM for multiple imputation and synthetic data.

DPMPM without structural zeros
Our review of the DPMPM without structural zeros closely follows the review in Hu and Hoshino (2018). Consider a sample X consisting of n records, where each ith record, with i = 1, . . . , n, has p unordered categorical variables. The basic assumption of the DPMPM is that every record X i = (X i1 , · · · , X ip ) belongs to one of K underlying unobserved/latent classes. Given the latent class assignment z i of record i, as in Equation (2), each variable X ij independently follows a multinomial distribution, as in Equation (1), where d j is the number of categories of variable j, and j = 1, . . . , p.
The marginal probability Pr(X i1 = x i1 , · · · , X ip = x ip | π, θ) can be expressed as averaging over the latent classes: ( As pointed out in Si and Reiter (2013), Hu et al. (2014), Akande et al. (2017), such averaging over latent classes results in dependence among the variables. Equation (3) will also help illustrate the DPMPM with structural zeros in the next section.
The DPMPM clusters records with similar characteristics based on all p variables. Relationships among all the variables are induced by integrating out the latent class assignment z i . To empower the DPMPM to pick the effective number of occupied latent classes, the truncated stick-breaking representation (Sethuraman, 1994) of the DP prior is used as in Equation (4) through Equation (7), and a blocked Gibbs sampler is implemented for the Markov chain Monte Carlo (MCMC) sampling procedure (Ishwaran and James, 2001;Si and Reiter, 2013;Hu et al., 2014;Akande et al., 2017;Manrique-Vallier and Hu, 2018;Drechsler and Hu, 2021;Hu and Savitsky, 2018).
When used as an imputation engine, missing values are handled within the Gibbs sampler. As described in Akande et al. (2017), at one MCMC iteration l, one samples a value of the latent class indicator z i using Equation (2), given a draw of the parameters and observed data. In this iteration l, given the sampled z i , one samples missing values using independent draws from Equation (1). This process is repeated for every missing value in the dataset in iteration l, obtaining one imputed dataset.
When used as a data synthesizer, the fully observed confidential dataset is used for model estimation through MCMC, and sensitive variable values are synthesized as an extra step at chosen MCMC iteration. For example, at MCMC iteration l, one samples a value of the latent class indicator z i using Equation (2). Given the sampled z i , one samples synthetic values of sensitive variables using independent draws from Equation (1). This process is repeated for every record that has sensitive values to be synthesized, obtaining one synthetic dataset.

DPMPM with structural zeros
When structural zeros are present, we need to modify the likelihood to enforce zero probability for impossible combinations. That is, we need to truncate the support of the DPMPM. Following the general description in Manrique-Vallier and Reiter (2014a) and Manrique-Vallier and Hu (2018), let C represent all combinations of individuals, including impossible combinations; let MCZ ̸ ⊆ C be the set of impossible combinations to be excluded. We restrict X to the set C \ MCZ, with Pr(X ∈ MCZ ) = 0. The marginal probability in the DPMPM without structural zeros in Equation (3) then becomes Let X * be the sample that only contains possible combinations, we have the joint likelihood as To get the Gibbs sampler to work, we follow the general data augmentation technique proposed by Manrique-Vallier and Reiter (2014a) and assume the existence of an observed sample X 0 of unknown size Nmis, generated from the DPMPM without structural zeros (i.e., the unrestricted DPMPM). X 0 only contains records that fall into MCZ.
The same set of DP priors in Equation (4) through Equation (7) is used in the DPMPM with structural zeros. In the Gibbs sampler, we keep the generated X 0 and combine it with X * when estimating the model parameters. For computational expedience, we set the upper bound of the number of observations, Nmis, that can be generated in X 0 , to be fixed at a large Nmax at every iteration. When used as either an imputation engine or a data synthesizer, missing values or synthetic data are generated from the truncated likelihood Equation (9).

Applications of DPMPM for multiple imputation
The DPMPM has been adapted as a multiple imputation engine to deal with missing values in categorical data. Some imputation applications have focused on the DPMPM without structural zeros, while others have dealt with the DPMPM with structural zeros.
Among the work on multiple imputation using the DPMPM without structural zeros, Si and Reiter (2013) applied the DPMPM imputation model to impute missing background data (categorical) in the 2007 Trends in International Mathematics and Science Study (TIMSS). The 2007 TIMSS data contains 80 background variables on 90,505 students. Among the 80 categorical background variables, 68 have less than 10% missing values, 6 variables have between 10% and 30% missing values, and 1 variable has more than 75% missing values. Akande et al. (2017) designed simulation studies using data from the American Community Survey (ACS) and compared the DPMPM imputation engine to two other widely used multiple imputation engines: i) chained equations using generalized linear models, and ii) chained equations using classification and regression trees (CART). From a population of 671,153 housing units and 35 categorical variables collected and cleaned from the 2012 ACS data, Akande et al. (2017) performed repeated sampling and empirically compared the three multiple imputation models.
Among the work on multiple imputation using the DPMPM with structural zeros, Manrique-Vallier and Reiter (2014b) followed the data augmentation approach Manrique-Vallier and Reiter (2014a), and imputed missing data of repeated samples from the 5% public use microdata sample from the 2000 United States Census for the state of New York, a population of 953,076 individuals and 10 categorical variables, with the number of levels ranging from 2 to 11. Finally, Murray (2018) provides an excellent review of practical and theoretical findings of multiple imputation research and highlights the DPMPM imputation engine as a recent development.

Applications of DPMPM for synthetic data
The DPMPM has also been used as a synthetic data generator to the public release of useful and private micro-level categorical data. Some work focused on the DPMPM without structural zeros, while others dealt with synthetic data problems using the DPMPM with structural zeros.
Among the work on synthetic data generation using the DPMPM without structural zeros, Hu et al. (2014) used the DPMPM to generate fully synthetic data for a subset of 10,000 individuals and 14 categorical variables from the 2012 ACS public use microdata sample for the state of North Carolina. Drechsler and Hu (2021) generated partially synthetic data for large-scale administrative data containing detailed geographic information in Germany. Hu and Savitsky (2018) also used the DPMPM to generate partially synthetic data for the Consumer Expenditure Surveys (CE) at the U.S. Bureau of Labor Statistics (BLS), disseminating detailed county-level geographic information.
Among the work on synthetic data generation using the DPMPM with structural zeros, Manrique-Vallier and  proposed a data augmentation approach and generated fully synthetic data of repeated samples from the 5% public use microdata from the 2000 United States Census for the state of California, a population of 1,690,642 records measured in 17 categorical variables, with the number of levels ranging from 2 to 11.

Two ACS samples for illustrations
Before presenting detailed step-by-step illustrations to use the NPBayesImputeCat package for multiple imputation and data synthesis applications, we introduce two samples from the 2016 1-year American Community Surveys (ACS), which will both be used for our illustrations.
ACS sample 1, 'ss16pusa_sample_zeros', contains structural zeros. It will be used to illustrate how to perform multiple imputation and data synthesis tasks when structural zeros are present. ACS sample 2, 'ss16pusa_sample_nozeros', is a subset of ACS sample 1 and contains no structural zeros. It will be used to illustrate how to perform multiple imputation and data synthesis tasks when structural zeros are not present. Within the last 12 months; 1-5 years ago; Over 5 years ago or never worked. ACS sample 1 is a random sample of n = 1, 000 observations on p = 5 variables. See Table 1 for the data dictionary. The sample is saved as 'ss16pusa_sample_zeros', and it contains structural zeros: 8 combinations, all related to AGEP and SCHL variables, listed in Table 2. These 8 cases are derived from the original 2016 1-year ACS data (as the population).

ACS sample 2, without structural zeros
To obtain a sample without structural zeros, we take a subset of ACS sample 1, where n = 1, 000 and p = 3, dropping variables AGEP and SCHL to eliminate any structural zeros. This ACS sample 2 is saved as 'ss16pusa_sample_nozeros'. See Table 3 for the data dictionary.

Missing data applications
To illustrate the applications of the NPBayesImputeCat package to missing data, we introduce 30% missingness for each variable in the ACS sample 1 and ACS sample 2 datasets, under the missing completely at random (MCAR) mechanism. The corresponding datasets (data containing missing values) to ACS sample 1 and ACS sample 2 are saved as 'ss16pusa_sample_zeros_miss' and 'ss16pusa_sample_nozeros_miss', respectively. The DPMPM imputation engine is designed to perform multiple imputations of categorical data that are missing at random (MAR)-and thus also data that are missing completely at random (MCAR).

Multiple imputation for data without structural zeros
We begin with the imputation of the missing values in the ACS sample 2 with 30% missingness, 'ss16pusa_sample_nozeros_miss', where no structural zeros are present. In the next section, we demonstrate how to impute missing values for ACS sample 1 with 30% missingness, 'ss16pusa_sample_zeros_miss', where structural zeros are present.
For each sample, we also compare the performance of the DPMPM engine to the most popular multiple imputation method, MICE. We implement the latter using the mice package in R. A brief review of mice is included at the beginning of the paper.

Initialize the DPMPM imputation engine
We use the DPMPM_nozeros_imp function to implement the DPMPM imputation engine without structural zeros. We first review the process for creating and initializing the DPMPM model using the CreateModel function to enable analysts to tune the number of mixture components through initial runs, before generating imputations using DPMPM_nozeros_imp. CreateModel is a wrapper function for creating an object of type "Lcm". Lcm was implemented as an Rcpp module to expose the C++ implementation for our algorithm. Users can learn more about the Lcm class by typing ?`Lcm', which will bring up the R documentation for this class, including all methods and properties.
CreateModel takes 7 arguments as input: 1. X, the original data with missing values.
2. MCZ, the data frame containing the structural zeros definitions -use NULL when structural zeros are not present.
3. K, the maximum number of mixture components (i.e., the maximum number of latent classes in the DPMPM imputation engine).
4. Nmax, an upper truncation limit for the augmented sample size, that is, the maximum number of observations allowable in the augmented X 0 -use 0 when structural zeros are not present.
As a quick demonstration, we let K = 30, aalpha = balpha = 0.25, and seed = 456. The code below creates and initializes the DPMPM imputation engine without structural zeros for the data stored in X. model <-CreateModel(X = X, MCZ = NULL, K = 30, Nmax = 0, aalpha = 0.25, balpha = 0.25, seed = 456) Next, we run the model object for a set of user-specified numbers of burn-ins, MCMC iterations, and thinning. For example, to run the MCMC sampler for 5 iterations post 2 burn-ins, thin every 1 iteration, and print the output at each iteration, run the following code.
It is important to keep track of the value of kstar as the NPBayesImputeCat package uses the truncated stick-breaking representation of the DP prior (Sethuraman, 1994). If the value of kstar is always K, the maximum number of mixture components, we should re-run the DPMPM model by specifying a larger value of K to allow a large enough number of mixture components to cluster the observations. For additional details on setting K, see Hu et al. (2014) and Akande et al. (2017).
The above initial run seems to suggest that the estimation uses almost all latent classes (kstar is close or is 30, which is what the maximum number of latent classes K set to). It is therefore prudent to increase the value of K when executing the CreateModel command, for example: Again, we only show the first few lines of the output. This time, setting K equal to 80 seems sufficiently large. Users should keep track of the value of kstar for the entire run and adjust K accordingly.
To diagnose convergence of parameters in the Gibbs sampler, one can use the EnableTracer option before running the sampler to track certain parameters. Examples of keeping posterior samples of alpha and kstar to access convergence are included in the accompanying R file.

Generate the imputed datasets
After setting K based on the initial runs, we now run the DPMPM imputation engine without structural zeros to create m imputed datasets. The function DPMPM_nozeros_imp takes 10 arguments as input: 1. X, the original data with missing values.
3. burn, the number of burn-in.
4. thin, the number of thinning.
5. K, the maximum number of mixture components (i.e., the maximum number of latent classes in the DPMPM imputation engine) 6. aalpha, the hyper parameter a α in stick-breaking prior distribution in Equation (6).
8. m, the number of imputations.
10. silent, default to TRUE. Set this parameter to FALSE if more iteration info are to be printed.
The output of DPMPM_nozeros_imp is a list containing: 1. impdata, the list of m imputed datasets.
3. alpha, the saved posterior draws of α, which can be used to check MCMC convergence.
4. kstar, the saved numbers of occupied mixture components, which can be used to check MCMC convergence and track whether the upper bound K is set large enough.
To run the DPMPM_nozeros_imp function to impute missing data for ACS sample 2 with 30% missingness, we run the code below. For this demonstration, we set nrun to 10000, burn to 5000, thin to 50, K to 80, both aalpha and balpha 0.25, and m to 10. Finally, we set the seed to 211. m <-10 Imp_DPMPM <-DPMPM_nozeros_imp(X = X, nrun = 10000, burn = 5000, thin = 50, K = 80, aalpha = 0.25, balpha = 0.25, m = m, seed = 211, silent = TRUE) The printed output from each iteration are omitted here. For a quick diagnostic check on whether the upper bound K is set large enough, we can use the kstar_MCMCdiag function which takes the following input arguments: 1. kstar, the vector output of kstar from running the DPMPM model. library(bayesplot) kstar_MCMCdiag(kstar = Imp_DPMPM$kstar, nrun = 10000, burn = 5000, thin = 50) Figure 1 shows the traceplot of kstar value after burn-in and thinning. It indicates no convergence issues of the MCMC chain. Moreover, it suggests choosing a smaller K value if we want to achieve an even faster computation time. Figure 2 presents its autocorrelation function plot, which also indicates no convergence issues.
To access the imputed datasets one at a time, we do the following.

impdata1 <-Imp_DPMPM$impdata[[1]] #for the first imputed dataset
Analysts then can compute sample estimates for estimands of interest in each imputed dataset and combine them using the combining rules.
Before demonstrating how to do so, we first use the mice package to also generate imputations for the same dataset. We do so to facilitate comparisons between results based on the DPMPM model and MICE. The following code runs the MICE algorithm on the same data using the default options in MICE for all the arguments, except m, which is set to 10 to be consistent with the implementations of the DPMPM engine. The code also reshapes the output of the MICE algorithm so that we are able to use some of the utility functions in NPBayesImputeCat on the imputed datasets. For more details on the implementations of mice, see Buuren and Groothuis-Oudshoorn (2011). library(mice) m <-10 Imp_MICE <-mice(data = X, m = m, defaultMethod = c("norm", "logreg", "polyreg", "polr"), print = F, seed = 342) #Reshape the list of imputed datasets Imp_MICE_reshape <-NULL Imp_MICE_reshape$impdata <-lapply(1:m,function(x) x <-X) col_names <-names(Imp_MICE$imp) for (l in 1:m){ for(j in col_names){ na_index_j <-which(is.

Assess quality of the imputations
A very common way to assess the quality of the imputations is to compare the estimated distributions in the observed and imputed datasets. We can compare the marginal distributions of any of the variables in the observed and imputed datasets using the marginal_compare_all_imp function. The function takes 3 arguments as input: 1. obsdata, the observed data.
2. impdata, the list of m imputed datasets.

vars, the variable of interest.
The output is a list containing: 1. Plot, a barplot showing the marginal probability (as a percentage) of each level of the variable in the observed and imputed datasets.
2. Comparison, the table of the marginal probabilities (as a percentage) used to make the barplot.
As an example, we can compare the marginal probability of each level of the variable WKL in the observed and imputed datasets for both MICE and the DPMPM engine by using the following code. We load the tidyverse library for making these plots.
library(tidyverse) marginal_compare_all_imp(obsdata = X, impdata = Imp_DPMPM$impdata, vars = "WKL") marginal_compare_all_imp(obsdata = X, impdata = Imp_MICE_reshape$impdata, vars = "WKL") The code creates the plots in Figures 3 and 4. For the most part in Figures 3 and 4, both DPMPM and MICE result in point estimates from the imputed datasets that are very close to the observed data, which are to be expected under MCAR. There are no major noticeable differences between the two methods. The code can be applied in a similar manner to other variables, MAR and SEX, shown in the accompanying R file.

Using the multiple imputation combining rules
We now demonstrate how to use the combining rules to obtain single point estimates and corresponding 95% confidence intervals for estimands of interest from all the imputed datasets. First, we compute There is some variability across the imputed datasets. Overall, they resemble the observed well.
the point estimates and corresponding standard errors for marginal and joint probabilities from each imputed dataset using the compute_probs function. The function takes 2 arguments as input: 1. InputData, the list of m imputed datasets.

varlist, a list of variable names (or combination of names) of interest.
The output is a list of the marginal and/or joint probabilities in each imputed dataset. Next, we use the pool_estimated_probs function to pool the estimates from all the imputed datasets using the combining rules. The function takes 2 arguments as input: 1. ComputeProbsResults, the output from the compute_probs function.
The output is a list of tables containing the results after applying the combining rules. For example, suppose we are interested in estimating probabilities corresponding to (i) the marginal distribution of MAR, (ii) the marginal distribution of SEX, and (iii) the joint distribution of MAR and WKL, we can use the following code.

MAR Estimate Std.Error
1. InputData, the list of m imputed datasets.
2. exp, the GLM expression for the model of interest (for nnet which must be loaded first).
The output is a list containing the estimated parameters from the GLM model fitted to each imputed dataset. The pool_fitted_GLMs pools the GLM estimates from all the imputed datasets using the combining rules. The function takes 2 arguments as input: 1. GLMResults, the output from the fit_GLMs function.
For example, to fit a multinomial logistic model of MAR on SEX, we can use the following code.
library(nnet) model_ex1_DPMPM <-fit_GLMs(InputData = Imp_DPMPM$impdata, exp = multinom(formula = MAR~SEX)) pool_fitted_GLMs(GLMResults = model_ex1_DPMPM, method = "imputation") The second line yields the following output, with the numbers rounded up to 4 decimal places. The fit_GLMs and pool_fitted_GLMs functions perform a similar role to the with and pool functions in the mice package. To fit the same model under MICE, we use the following code. model_ex1_MICE <-with(data = Imp_MICE, exp = multinom(formula = MAR~SEX)) summary(pool (model_ex1_MICE) The results are mostly similar, although we note that for most of the estimands, the standard errors are larger for MICE than the DPMPM engine. However, we also note that this illustration is based on data containing only n = 1000 observations and 30% missing data, so that differences in point estimates are not particularly surprising. Additional examples of fitting GLM models to the imputed datasets are shown in the accompanying R file.

Multiple imputation for data with structural zeros
We now illustrate how to impute missing values for ACS sample 1 with 30% missingness, using NPBayesImputeCat, where there are structural zeros are present. Recall that the data is stored in the file, 'ss16pusa_sample_zeros_miss'. The general procedure is very similar to the one in the previous section, where structural zeros are not present. However, we need to specify additional inputs to account for the structural zeros when generating the imputed datasets. Once the imputed datasets have been created, the utility functions used to compute sample estimates and pool them using the combining rules are exactly the same as before. First, we begin by creating MCZ, the data frame containing the structural zeros definition.

Create a file to store structural zeros cases
Previously, when there are no structural zeros, MCZ is set to NULL . Here, when there are structural zeros cases in the application, one should write the MCZ data frame following two general rules: 1. Variables in MCZ must be factors with the same levels as the original data.
2. Placeholder components are represented with NAs.
The script below is a sample script to store the structural zeros definition shown in Table 2. AGEP <-c (16,16,16,16,17,17,17,17) SCHL <-c("Bachelor's degree", "Doctorate degree", "Master's degree", "Professional degree", "Bachelor's degree", "Doctorate degree", "Master's degree", "Professional degree") MAR <-rep(NA, 8) SEX <-rep(NA, 8) WKL <-rep(NA, 8) MCZ <-as.data.frame(cbind(AGEP, MAR, SCHL, SEX, WKL)) First, we create a vector of AGEP consisting of 4 replicates of value 16 and 4 replicates of value 17 and a vector of SCHL consisting of the degree types which induce structural zeros cases with AGEP. Second, we create vectors of MAR, SEX, and WKL. Each is a vector of length 8, with each element being NA. These are placeholder components, and since the structural zeros cases do not involve these three variables, all elements are NAs. Third, we need to create a data frame using as.data.frame and cbind. It is necessary to input the variables in the same order as in the original data (the order of variables in Table 1). We save the data frame MCZ for later use.

Generate the imputed datasets
Initializing the DPMPM engine follows the exact same approach as before. The only difference is that we can now supply the two arguments specific to structural zeros, that is, MCZ and Nmax. That is, to initialize, we can run the following code. As before, we select K based on initial runs. We now also do the same for Nmax. If the value of either always hits the set values, we should re-run the model by specifying larger values to allow for a large enough number of mixture components and augmented observations to cluster the observations appropriately. As before, we can also save and track posterior samples of the parameters in the sampler using the EnableTracer option. Sample scripts are included in the accompanying R file.
After setting K and Nmax, we can now run the DPMPM imputation engine with structural zeros to create m imputed datasets. The function DPMPM_zeros_imp takes 12 arguments as input: 1. X, the original data with missing values.
2. MCZ, data frame containing the structural zeros definition.
3. Nmax, an upper truncation limit for the augmented sample size.
4. nrun, the number of MCMC iterations. 5. burn, the number of burn-in.
6. thin, the number of thinning. 7. K, the maximum number of mixture components (i.e., the maximum number of latent classes in the DPMPM imputation engine) 8. aalpha, the hyper parameter a α in stick-breaking prior distribution in Equation (6).
10. m, the number of imputations.
12. silent, default to TRUE. Set this parameter to FALSE if more iteration info are to be printed.
The output of DPMPM_zeros_imp is similar to the output of DPMPM_nozeros_imp, except that now it also includes Nmis, the saved posterior draws of the augmented sample size, which can be used to check MCMC convergence.
To run the DPMPM_zeros_imp function to impute missing data for ACS sample 1 with 30% missingness, we run the code below. For this demonstration, we set Nmax to 200000, nrun to 10000, burn to 5000, thin to 50, K to 80, both aalpha and balpha 0.25, and m to 10. Finally, we set the seed to 653. The R Journal Vol. 13/2, December 2021 ISSN 2073-4859 As before, it is straightforward to run MCMC diagnostics on the tracked elements of Imp_DPMPM. Also, Imp_DPMPM contains the list of the imputed datasets. Analysts then can compute sample estimates for estimands of interest in each imputed dataset and assess their quality or also combine them using the combining rules by using all the same functions as before. That is, using marginal_compare_all_imp, compute_probs, pool_estimated_probs, fit_GLMs, and pool_fitted_GLMs. Examples are included in the accompanying R file.
Currently, there are no direct options in the mice package to incorporate structural zeros. Therefore, we do not explore comparisons with the MICE engine for data containing structural zeros.

Synthetic data applications
Without loss of generality, suppose we want to generate partially synthetic datasets for the ACS sample 2 ('ss16pusa_sample_nozeros'), where no structural zeros are present. NPBayesImputeCat includes functionality to generate fully synthetic data as well, but we exclude its illustration for brevity. The NPBayesImputeCat package currently does not accommodate data synthesis with structural zeros.
We also implement a popular synthetic data generation method, CART (using the synthpop package in R), to ACS sample 2 and compare the results (Reiter, 2005) to DPMPM.

Load the sample data
First, we load the sample data, the ACS sample 2, and make sure that all variables are set as factors.

Generate the synthetic datasets
Initializing the DPMPM synthesizer follows the exact same approach as before for the missing data imputation applications. We run the following code. After setting K based on the initial runs, we now run the DPMPM synthesizer without structural zeros to create m synthetic datasets. The function DPMPM_nozeros_syn takes 12 arguments as input: 1. X, the original data with missing values.
2. dj, the vector recording the number of categories of the variables.
4. burn, the number of burn-ins.
6. K, the maximum number of mixture components. 7. aalpha, the hyper parameter a α in stick-breaking prior distribution in Equation (6).
8. balpha, the hyper parameter b α in stick-breaking prior distribution in Equation (6). 9. m, the number of synthetic datasets.
10. vars, the names of the variables to be synthesized. 11. seed, the seed value.
12. silent, default to TRUE. Set this parameter to FALSE if more iteration info are to be printed.
The R Journal Vol. 13/2, December 2021 ISSN 2073-4859 The output of DPMPM_nozeros_syn is a list containing: 1. syndata, the list of m synthetic datasets.
3. alpha, the saved draws of α, which can be used to check MCMC convergence.
4. kstar, the saved numbers of occupied mixture components, which can be used to check MCMC convergence and track whether the upper bound K is set large enough.
To run the DPMPM_nozeros_syn function to generate synthetic data for ACS sample 2, we run the code below. For this demonstration, we create partially synthetic data where marital status (MAR) and when last worked (WKL) are synthesized, and we set nrun to 10000, burn to 5000, thin to 50, K to 80, both aalpha and balpha to 0.25, and seed to 837. Recall that dj stores the vector of levels of the variables, which are 5 for MAR, 2 for SEX, and 3 for WKL.

syndata3 <-Syn_DPMPM$syndata[[3]] #for the third synthetic dataset
Analysts then can compute sample estimates for estimands of interest in each synthetic dataset and combine them using the combining rules. For comparison, we use the synthpop package to generate synthetic data for the same dataset with CART synthesizer. The default synthesizer for synthpop is CART, and the visit.sequence input argument allows the users to indicate which variables to be synthesized. To match with what we have done with the NPBayesImputeCat, we set MAR and WKL for visit.sequence to generate partially synthetic data. library(synthpop) m <-5 Syn_CART <-syn(data = X, m = 5, seed = 123, visit.sequence = c("MAR", "WKL")) Next, we demonstrate how to access the utility of the synthetic datasets from the two methods and also use the combining rules.

Assess utility of the synthetic datasets
Similar to what we have introduced for assessing the quality of the imputations, for synthesis, we compare the marginal distributions of any of the variables in the observed and imputed datasets, using the marginal_compare_all_syn function. The function takes 3 arguments as input: 1. obsdata, the observed data.
2. syndata, the list of m synthetic datasets.
3. vars, the variable of interest.
The output is a list containing: 1. Plot, a barplot showing the marginal probability (as a percentage) of each level of the variable in the observed and synthetic datasets.
2. Comparison, the table of the marginal probabilities (as a percentage) used to make the barplot.
The following code compares the marginal probability of each level of MAR.
For the most part in Figures 5 and 6, both DPMPM and CART result in point estimates from the synthetic datasets that are very close to the observed data. There are no major noticeable differences between the two methods. The code can be applied in a similar manner to other variables, and examples are included in the accompanying R file.

Using the synthetic data combining rules
We now demonstrate how to use the combining rules to obtain single point estimates and corresponding 95% confidence intervals for estimands of interest from all the synthetic datasets. Similar to what we have done with the multiple imputation combining rules previously, we use the same set of functions: compute_probs and pool_estimated_prob.

MAR
Lastly, we demonstrate the use of fit_GLMs and pool_fitted_GLMs for fitting generalized linear models (GLMs) to each synthetic datasets and pooling the results across all the datasets. For example, to fit a logistic model of SEX given MAR and WKL, we can use the following code. model_ex1_DPMPM <-fit_GLMs(InputData = Syn_DPMPM$syndata, exp = glm(formula = SEX~WKL+MAR, family = binomial)) pool_fitted_GLMs(GLMResults = model_ex1_DPMPM, method = "synthesis_partial") The second line yields the following output, with the numbers rounded up to 4 decimal places. We use the following code to fit the same model under CART.
model_ex1_CART <-fit_GLMs(InputData = Syn_CART$syn, exp = glm(formula = as.factor(SEX)~WKL+MAR, family = binomial)) pool_fitted_GLMs(GLMResults = model_ex1_CART, method = "synthesis_partial") The second line yields the following output, with the numbers rounded up to 4 decimal places. The utility results are different between DPMPM and CART, and we note that for most of the estimands, the DPMPM produces more accurate estimation than the CART when compared to the results from the original sample. These indicate that for this particular sample, a joint model fitted by the DPMPM does a better job preserving relationships among variables compared to a series of the conditional models fitted by the CART. Additional examples of fitting GLM models to the synthetic datasets are shown in the accompanying R file.

Concluding remarks
In this paper, we have presented the DPMPM models for multivariate categorical data and illustrations of using the NPBayesImputeCat package for multiple imputation and synthetic data applications. Users can take the output and extract imputed and synthetic datasets, then conduct statistical analyses of their choice and use the appropriate combining rules to obtain valid estimates. Interested readers can refer to the package documentation for additional features.
While the NPBayesImputeCat package has been developed primarily for multiple imputation and synthetic data purposes, users can also use it for DPMPM model estimation. For example, following the illustrations for synthetic data, a data analyst is able to obtain parameter draws of several key parameters from the MCMC chain: i) the DP concentration parameter α, ii) the mixture probability vectors {π k }, and iii) the Multinomial probability vectors {θ  [p90]