CRTgeeDR: an R Package for Doubly Robust Generalized Estimating Equations Estimations in Cluster Randomized Trials with Missing Data.

Semi-parametric approaches based on generalized estimating equation (GEE) are widely used to analyze correlated outcomes in longitudinal settings. In this paper, we present a package CRTgeeDR developed for cluster randomized trials with missing data (CRTs). For use of inverse probability weighting to adjust for missing data in cluster randomized trials (CRTs), we show that other software lead to biased estimation for non-independence working correlation structure. CRTgeeDR solve this problem. We also extend the ability of existing packages to allow augmented Doubly Robust GEE estimation (DR). Simulations studies demonstrate the consistency of estimators implemented in CRTgeeDR compared to packages such as geepack and the gains associated with the use of the DR for analyzing a binary outcome using a logistic regression. Finally, we illustrate the method on data from a sanitation CRT in developing countries.


Introduction
We describe the R package CRTgeeDR, for estimating coefficients of regression in a marginal mean model.The method is designed to analyze data collected in cluster randomized trials (CRTs) where 1) observations within a cluster may be correlated, 2) observations in separate clusters are independent, 3) a monotone transformation of expectation of the outcome is linearly related to the explanatory variables, and 4) treatment is randomized at a cluster level.The estimation approach generalizes the Generalized Estimating Equation (GEE) (?) for fitting marginal generalized linear models to clustered data with possibly informative missingness of the outcome.It combines existing methods for accommodating missing data that use inverse probability weighting (IPW) (?) and for increasing precision of estimation by appropriate use of baseline covariates (AUG) (?).We have developed a method for estimating the intervention effect in cluster randomized trials that combines the IPW and the AUG and is doubly robust (DR), meaning that the resulting estimator is consistent if either the model predicting the outcome or the model predicting the missing data is correctly specified -that is, they reflect the true data generation processes (?).Below we illustrate the use of the software on a real dataset and clarify its benefits.
The package CRTgeeDR not only implements the DR estimator but also the standard GEE, the IPW and the AUG.Regarding IPW, our package differs from most of those currently available in that it avoids the bias that can result from conventional implementation applied to CRTs.?pointed out that implementation of GEE for complete longitudinal data in the current version of SAS (GENMOD procedure) requires use of an independence correlation structure if the observation of the outcome at one time point depends on covariates obtained at another time point; this problem had been corrected in the new GEE procedure in SAS/STAT 13.2 (?). ?made a similar comment regarding the analysis of incomplete longitudinal data in which time-varying covariates and previous outcome values are needed to model the missingness process.This article clarifies this issue for CRTs and proposes an implementation in R that allows for unbiased IPW (and thus DR) estimation with non-independence working correlation structure.GEE-based approaches for estimating the coefficients in marginal models, in particular the marginal effect of an intervention, have been implemented in only a limited number of R packages and other software for general use.Of note, most of the availablesoftware was initially developed to deal with correlated longitudinal data rather than data from CRTs.There are three R packages on CRAN, which will solve GEEs and produce standard errors: whereas gee (?) and geepack (???) are computationally demanding, the package geeM allows a fast estimation through the use of sparse matrix representation (?).When interest lies in adjusting for missing outcomes using the IPW, all the packages mentioned above require specification of weights.These weights can be computed using packages such as ipw (??) or directly plugged from a user-defined function.These approaches require the missing data process to be known or correctly specified.Some packages, such as drgee (?), implement doubly robust approaches for uncorrelated data arising from observational studies.These packages provide estimates that are doubly robust in the sense that the consistency of the parameter estimator from the marginal models is guaranteed if the model linking the outcome to covariates and treatment or the The R Journal Vol.XX/YY, AAAA 20ZZ ISSN 2073-4859 model linking the treatment assignment to covariates correctly reflects the true data generation process.These methods have been extended to deal with missing data with IPW approaches CausalGAM (??), but these packages are intended for analysis of observational studies, not CRTs.Finally, targeted maximum likelihood estimation (tMLE) method allows estimation of the marginal additive effect of a treatment (?).It is implemented in the packages tmle (?) and tmlenet (?) for longitudinal and correlated data.Except for ?, there has been little published discussion about the differences between GEE-based and tMLE estimation, and we do not delve into a comparison of the two methods.The focus of this article is only on software implementation of the doubly robust GEE for CRTs.
The paper is organized as follows.Section 2.2 introduces the theory of the doubly robust estimator and Section 2.3 describes the features of the CRTgeeDR and the estimating function denoted GeedrEstimation.Section 2.4 compares the performance of CRTgeeDR to geepack for the IPW in CRTs and illustrates that the DR is consistent and more efficient than the IPW.Section 2.5 illustrates the analysis of a dataset on sanitation in developing countries (?) and illustrates the benefit of using the DR approach compared to standard GEE.Section 2.6 presents a discussion.

IPW in CRTs and doubly robust estimation Notation
Consider a CRTs comprised of n clusters or communities, each with n i individuals.The cluster sample sizes are assumed fixed and non-informative.Let Y i = [Y ij ] j=1,...,n i denote the outcome vector for cluster i, some elements of which may be unobserved.
..,n i ;r=1,...,P denote the P baseline covariates for subject j in cluster i, which is fully observed.Let A i be the treatment assigned to cluster i; the indicator for treated condition is A i = 1, and A i = 0 for control condition.We assume that the probability of treatment assignment is known and fixed to p A = P(A i = 1).The conditional mean of Y ij is denoted and we let µ i = [µ ij ] j=1,...,n i denote the full vector of means in the i th cluster.We assume that the mean structure of Y ij depends on the covariate vector for subject j in cluster i (?), and consider a model for the mean as follow: where g(.) is a monotone differentiable link function and β = (β A , β X ) is a (P + 1) × 1 is a vector of regression coefficients of interest.In this article, we focus on estimation of the marginal effect of an intervention β A for a binary outcome using the logit link.We assume the variance is , where h(.) is the variance function and φ is the dispersion parameter.Thus for our specific example, v ij = φµ ij (1 − µ ij ) When data are missing at random (MAR), the observation indicator R ij is a function of covariates, treatment condition and observed outcomes.For CRTs, we assume a restricted version of MAR (rMAR), which requires that R ij cannot be a function of observed outcomes.Although all the theory would hold for classical MAR assumption, it is most of the time difficult to specify the function linking the observation indicator and the observed outcomes of other individuals in the same cluster because there is no ordering.Thus, the probability of being observed π ij for individual j in cluster i, called the propensity score (PS), is: The parameters η W are nuisance parameters and must be estimated.

IPW in CRTs
In presence of rMAR outcome, as in ?, we estimate β by using inverse probability weighted generalized estimating equation (IPW).Therefore, we must include a weight matrix W i to the usual GEE, that is: This matrix W i (X ij , A i , η W ), denoted simply as W i in the following, adjusts the contribution of each individual in a given cluster by up weighting the contribution of individuals who are less likely to be observed according to their characteristics.Thus, if the propensity score is correctly specified, i.e., correspond to the true missingness process, the IPW equation provides consistent estimates: where D i = ∂µ i /∂β is a derivative matrix and V i is the working covariance matrix for the response Y i .In particular, The R Journal Vol.XX/YY, AAAA 20ZZ ISSN 2073-4859 correlation structure with non-diagonal terms α.For example, for an independence correlation structure α is zero; for exchangeable structure, all the elements of α are identical.Parameters α could also depend on the treatment assignment C(α(A i )) but we do not consider this possibility in our implementation.In the package CRTgeeDR, we estimate the α and φ parameters using moment estimators from the Pearson residuals and the Pearson Chi-Square statistic as in geeM (?) also described in ?.In the absence of missing data, W i = I is set to identity, and the standard GEE is performed by CRTgeeDR.
In existing packages such as geepack, the Equation 1 is implemented as 0 to ensure the fast invertibility of V i .It is easy to verify that when an independence correlation structure is used C(α) = I, the two implementations are identical.Therefore, one can always use geepack with an independence working correlation structure.In contrast, if a non-independence working correlation structure is used, the consistency of IPW estimators do not hold.See Web-Supplementary Material for demonstration.Regarding other packages such as geeM, although the implementation was the same as in geepack up to version 0.8.0, it is now implemented as in Equation 1 in version 0.10.0.In SAS GEE procedure, one can use the option "type=obslevel" (in the missing statement) in order to use the same implementation as in Equation 1.In general, it is necessary to check the formula used for implementation of the estimating equation in any desired software to avoid confusion.

Augmentation and doubly robust estimation
Recent advances in methods for analysis of data from CRTs have used augmented GEE to improve efficiency of inferences by incorporating baseline covariates (?); we denote this estimator the AUG.They have also been extended to accommodate missing data using an approach based on the IPW which is doubly robust GEE (DR).The DR properties are described in ?and the estimating equation is given by : Each element of the vector ..,n i is an arbitrary function linking Y ij with X ij for each treatment arm, which we refer to as the outcome model (OM) The η B are nuisance parameters.The estimator in Equation 2 , that is, the OM is correctly specified.If the OM is not correctly specified, i.e., does not correspond to the true data generation process, the estimation remains consistent provided that the PS model is correctly specified, but one may have a loss in efficiency.Without missing data, W i = I is set to identity, and the AUG is performed by CRTgeeDR.
Without missing data or with data missing completely at random, the use of augmentation may allow a gain in efficiency by incorporating information on baseline covariates.The PS should not be used because it will be misspecified and therefore may lead to an increase of the variance of the estimates.In presence of rMAR data, IPW alone can be used but DR should be preferred in order to increase the chances to have an unbiased estimator.Finally, as mentioned above, for data missing non at random, none of the methods implemented in CRTgeeDR is adequate.

The R package CRTgeeDR
The main function for estimation in the package CRTgeeDR The marginal model, to be estimated on the R dataframe data, is given in formula.The link function, g, depends on the nature of the outcome, which is specified in the argument family.The name of the outcome nameY, the clustering variable id, the binary treatment nameTRT (with the convention 1 is treated and 0 is control), and the missing indicator nameMISS must be specified if they differ from default values.The algorithm iterates between the estimation the working correlation structure and regression parameters with a stopping rule based on stabilisation of estimates (tolerance can be set by the user; default is tol= 10 −5 or max.iter=20).Depending on the specification or not of the PS and the OM, geeDREstimation allows the implementation of standard GEE, the IPW, the AUG and the DR approaches.The algorithm is defined as follow: 1. Determine the PS: Either the π ij are known from prior analysis or by design and the weights can be specified directly in the weights argument.
Alternatively one can compute the PR by fitting a logistic regression of R ij on (X ij , A i ).In this case, the PS regression formula can be directly entered in model.weights.A glm with logit link function is internally processed with or without variables selection, depending on the value of the stepwise.weightsargument.If all of the above are set to NULL or default, no IPW adjustment will be made -GEE or AUG will be used.Finally, if despite our concern about the implementation of weights, one wants to use the same implementation as in packages geepack or proc GENMOD in SAS, then one can set typeweights="GENMOD".
When the B i are know from prior analysis, they can be directly entered in aug=c(ctrl=B ij (X ij , Alternatively, we can regress Y ij on X ij within each treatment group.In this case, the OM regression formulas can be directly entered in model.augmentation.trt and model.augmentation.ctrl. A glm is then internally processed with or without variable selection depending on the value of the argument stepwise.augmentation.If all of the above are set to NULL or default, no augmentation adjustment will be made-GEE or IPW will be used.The probability of treatment assignment which is known in CRTs must be specified in the argument pi.a.Of note for steps 1 and 2, when using the stepwise option to compute the OM or the PS, one runs the risk of overfitting (?).Avoiding this is possible by sparsely including only relevant variables in the selection and also by running a bootstrap diagnostic using outputs (ps.model, om.model.trtand om.model.ctrl).The underlying assumption is that the true OM or PS are selected at the end of the stepwise selection and then held fixed in the estimating equation in further steps.
3. Determine the working correlation structure.Available structures are independence, exchangeable, M-dependent (using Mv), unstructured, or user-defined (using corr.mat).Using the scale.fixargument, the dispersion parameter φ can be either estimated or held fixed to a specified value.
4. Obtain initial values.They are either specified by the user (init.beta,init.alpha, and init.phi) or internally defined by fitting a glm under independence to obtain initial values for β(0) and by setting φ (0) = 1 and α (0) = 0.
5. Enter/Continue the iterative procedure : (a) Use the fit from β(n) to compute Pearson residuals.Use Pearson residuals based formulas to compute the scale parameter (φ (n+1) , except if scale.fix=TRUE) and the parameters in the working correlation matrix (α (n+1) ).
(b) Construct the augmented equation given in Equation 2 and solve it numerically using Newton-Raphson algorithm for β(n+1) .n) +prec.machine> tol and n + 1 ≤ max.iter go back to 5 else go to 6, where prec.machine∼ 10 −16 .6. Compute the requested variances of β(n+1) .If, sandwich and sandwich.nuisanceare set to TRUE, classical and nuisance-adjusted (for the estimation of parameters η W in the PS and η B in the OM) sandwich estimators of the variance are provided, see ? for their definition.The nuisance-adjusted version is computed using numerical derivatives of score equations for PS, OM and estimating equations jointly, which are obtained by using the jacobian function of The R Journal Vol.XX/YY, AAAA 20ZZ ISSN 2073-4859 the package numDeriv (?); this is recommended if the AUG, the IPW or the DR estimator are considered.Finally, a small-sample-adjusted sandwich estimator of the variance can also be computed using Fay's adjustment (?) setting the argument fay.adjustment to TRUE.Its implementation is derived from the function gee.var.fg in the package geesmv (?).

Adequacy of the PS and the OM to data
Consistency and efficiency of the DR estimator depend on the correct specification of the PS and the OM, see ? for theoretical demonstrations.User may want to check the adequacy of the selected OM model to the data by using the function getOMPlot, which provide plots to check the glm model assumption.The "Residuals vs. Fitted" and the "Scale-location" graphics allow verification of the homogeneity of the variance and the adequacy of the link function.The "Normal Q-Q" checks for the normal distribution of the residuals.The "Residuals vs Leverage" plot allows detection of points that have high leverage on the regression coefficients and that should be investigated as outliers.In the same spirit, the "Cook's distance" and the "Cook's distance vs leverage" provide measures of the effect of deleting a given observation.Of note these graphs are only interpretable for continuous outcome.
In addition, for the PS model the function getPSPlot provides a histogram of the weights.If weights are too large then the IPW and DR approaches are likely to be unstable.In this case, the user should compute weights externally using, for example, stabilized weights with the associated package ipw (?) or other approaches such as described in ?. Finally, the user can access the glm objects created during the PS and OM initial steps as objects named ps.model,om.model.trtand om.model.ctrlfrom the main function geeDREstimation.

Simulations
The properties of DR to accommodate complex correlation structure, rMAR outcomes, and presence of imbalance in baseline covariates have already been demonstrated in ?.In this article, we focus on the superiority of implementation of weights in the package CRTgeeDR compared to package geepack.
We focus on a simple example to illustrate that, even in very simple cases, estimators implemented in broadly used R package geepack for IPW can be inconsistent when using an exchangeable working correlation structure.This is the case when is used in the estimating equation.We simulate data from a CRT with 100 communities of 90, 100, or 110 individuals with probability 1/3 for each.The treatment A is randomly assigned with probability p A = 1/2.One covariate is of interest: X ij ∼ N (2, 1).We simulate correlated outcome with exchangeable structure, and correlation between individuals is set to 0.05.This is done by using a cluster-level bridge distribution b i ∼ B(0.05).Data generation process is as follow: We simulated R=10,000 replicates.The observed average proportion of missing observations is around 25% and the observed average intraclass correlation is 0.08.Missingness is associated strongly with individual covariates and, therefore, the weights differ between individuals in the same cluster.The true value of the odd-ratio for the marginal effect of treatment is computed for each dataset k without missing data by obtaining the counterfactual values with and without treatment under this model: The true OR is given by 1 R ∑ R k=1 OR k =2.56 with associated parameter for marginal intervention effect in the marginal regression β A = 0.941.For each dataset, we first ran the analysis on the dataset without missing data for the standard GEE and the AUG using CRTgeeDR.Then, we ran the analysis on the dataset with missing data for the IPW using geepack and for the standard GEE, the IPW, the AUG and the DR using CRTgeeDR.Two types of DR are presented here: DR1 is the estimator using the correct models for the OM and the PS, and DR2 omits treatment-covariate interaction terms in the PS.The models for the PS and OM for analysis are described in the Table 1.Table 1 shows the bias, empirical standard error, sandwich standard error and coverages for each analysis using independence (-I) and exchangeable (-E) working correlation structure.The code to replicate this study is available in Web-Supplementary Material.
OM used for AUG, DR1 and DR2 (fitted for each group a): Table 1: Comparison of the standard GEE, the IPW, the AUG and the DR analysis with the package CRTgeeDR, geepack and geeM using independence and exchangeable working correlation structure.
True value for the parameter β A is 0.91 (OR=2.56).The bias, the empirical and the estimated standard errors (SE) and the coverages for parameter β A are computed over 10000 replicates.The true data generation process for outcome and missingness is provided in Equation 3. The PS and OM models for analysis are correctly specified and given in the footnote of the table.
The results for standard GEE are unbiased in the absence of missing data (<0.003 for GEE-I and GEE-E with all packages) and biased in presence of rMAR outcomes reflecting the fact that the missingness is informative.Using the IPW-I corrects for this bias (0.008 for geepack).All packages give a similar estimated standard error leading to acceptable coverage close to their nominal value of 95%.When using an exchangeable correlation structure, the coverage (93.7%) remains close to the nominal value for IPW-E using CRTgeeDR, but it drops to 19.4% using geepack.This is mainly driven by an increase in the bias from 0.003 for CRTgeeDR to 0.582 for geepack for IPW-E.Using the DR1 version of CRTgeeDR provides consistent estimates (bias ≤0.004 for DR1-I and DR1-E).DR-1 yields coverage that is close to or greater than 95% and gains, on average, in efficiency.For example, the empirical standard error is 0.108 for IPW-I and 0.107 for DR-I.DR2, which omits the term X ij A i in the PS, yields consistent and efficient estimates even when the treatment-covariate interactions are not explicitly specified in the PS.As demonstrated in ?, DR1 and DR2 have similar properties.

Illustration on the sanitation data
In this section, we present a step-by-step analysis of data from a CRT to investigate the efficacy of alternatives policies on the investment in hygienic latrines in developing countries.A total of 380 communities in rural Bangladesh were assigned to different marketing interventions -community motivation, subsidies, supply side-market, a combination of the three and a control group.Results of this study were published in (?).All the code and data associated with this study are available on dataverse, see url in ?.
We consider only the comparison of a supply side-market versus control.The published analysis used a mixed effect model and showed that the supply side-market alone did not increase the hygienic latrine ownership (+0.3 percentage points, p-value=0.90).We reanalyze   3 presents the PS and OM for analysis, the estimates, the nuisance-adjusted sandwich estimates of the variance, the confidence intervals for the odd-ratios, the p-values, and the computation times for each of these analysis.For DR the computation time is 20 seconds, most of which is required for the computation of the nuisance-adjusted sandwich estimator of the variance (the estimation is < 3 seconds otherwise).Whereas GEE and IPW lead to non-significant effect of supply side-market, the DR estimates are significantly different from 0 at the 0.05 level (p=0.025).Using the DR, we conclude that there is 55% [8% -121%] greater chance of owning hygienic latrine after one year if there is a supply side-market.This effect is significant (p<0.05) even using a nuisance-adjusted SE, which is generally larger than the standard sandwich SE due to incorporation of additional variability from estimation of the nuisance parameters in the PS and the OM (η W and η B ). Information about the PS and the OM can be obtained by using the following command lines: Table 3: Effects of the supply side-market vs. control on the probability of hygienic latrine ownership in the sanitation data analysis (?) using the standard GEE, the IPW adjustment (IPW and DR), and the augmentation for imbalance (AUG and DR) assuming outcomes are rMAR.
Description of models for OM, PS and histogram of weights are given in Web-Supplementary Material Table 1 and Figure 1.As noted in Table 3, the estimates for IPW are close to those for GEE, reflecting the fact that only 3.5% of data are missing.We also note that all of the non-null weights are close to 1 (1.035 [1.02; 1.04]) showing that no covariate of the PS explains the missingness pattern.Thus, the increased significance of the intervention in the DR analysis compared to GEE is mainly driven by the augmentation.In both groups, households with higher education and economic status (as evidenced by stoves, water pipes, phones, and other factors) are more likely to have a hygienic latrine.For cluster-level covariates the patterns differs by intervention group: a high number of doctors is positively associated with the hygienic latrine ownership only in the intervention group indicating a potential synergy between the number of doctors and the presence of side-supply markets.

Conclusion
We demonstrated that the IPW can be biased in CRTs if the weights are not implemented as described in ?and a non-independence working correlation structure is chosen.In particular, we discuss problems that arise in the package geepack implemented in R.These concerns apply not only for outcome data in CRTs but also to longitudinal outcome data, when the probability that an observations is missing at a given time depends on time-varying covariates measured at other times.We recommend to always check the implementation in the software that has been chosen for analysis.The CRTgeeDR package protects against this bias and allows for adjustment in imbalance in baseline covariates in CRTs.The package can accommodate for a wide range of outcome types, link functions, and working correlation structures.The CRTgeeDR package is easy to use and does not require extensive programming.It therefore makes the augmented GEE (AUG) and the Doubly robust (DR) methodology for CRTs more accessible to applied researchers.Of note, although the CRTgeeDR package had been designed for CRTs, it can also be used for analysis of correlated longitudinal data from a randomized trial.The use of version 2.0 of the CRTgeeDR package to analyze observational clustered data (in which treatment attribution may be informative) is not straightforward, but updates with these capability are under development.