Semi-parametric approaches based on generalized estimating equations (GEE) are widely used to analyze correlated outcomes in longitudinal settings. In this paper, we present a package CRTgeeDR developed for cluster randomized trials with missing data (CRTs). For use of inverse probability weighting to adjust for missing data in cluster randomized trials, we show that other software lead to biased estimation for non-independence working correlation structure. CRTgeeDR solves this problem. We also extend the ability of existing packages to allow augmented Doubly Robust GEE estimation (DR). Simulation studies demonstrate the consistency of estimators implemented in CRTgeeDR compared to packages such as geepack and the gains associated with the use of the DR for analyzing a binary outcome using a logistic regression. Finally, we illustrate the method on data from a sanitation CRT in developing countries.
We describe the R package CRTgeeDR, for estimating coefficients of regression in a marginal mean model. The method is designed to analyze data collected in cluster randomized trials (CRTs) where 1) observations within a cluster may be correlated, 2) observations in separate clusters are independent, 3) a monotone transformation of expectation of the outcome is linearly related to the explanatory variables, and 4) treatment is randomized at a cluster level. The estimation approach generalizes the Generalized Estimating Equation (GEE) (Zeger and Liang 1986) for fitting marginal generalized linear models to clustered data with possibly informative missingness of the outcome. It combines existing methods for accommodating missing data that use inverse probability weighting (IPW) (Robins et al. 1995) and for increasing precision of estimation by appropriate use of baseline covariates (AUG) (Stephens et al. 2012). We have developed a method for estimating the intervention effect in cluster randomized trials that combines the IPW and the AUG and is doubly robust (DR), meaning that the resulting estimator is consistent if either the model predicting the outcome or the model predicting the missing data is correctly specified—that is, they reflect the true data generation processes (Prague et al. 2016). Below we illustrate the use of the software on a real dataset and clarify its benefits.
The package CRTgeeDR not only implements the DR estimator but also the
standard GEE, the IPW and the AUG. Regarding IPW, our package differs
from most of those currently available in that it avoids the bias that
can result from conventional implementation applied to CRTs. Lin et al. (2015)
pointed out that implementation of GEE for complete longitudinal data in
the current version of SAS (GENMOD procedure
) requires use of an
independence correlation structure if the observation of the outcome at
one time point depends on covariates obtained at another time point;
this problem had been corrected in the new GEE procedure
in SAS/STAT
13.2 (SAS Institute Inc. 2015). Tchetgen Tchetgen et al. (2012) made a similar comment
regarding the analysis of incomplete longitudinal data in which
time-varying covariates and previous outcome values are needed to model
the missingness process. This article clarifies this issue for CRTs and
proposes an implementation in R that allows for unbiased IPW (and thus
DR) estimation with non-independence working correlation structure.
GEE-based approaches for estimating the coefficients in marginal models, in particular the marginal effect of an intervention, have been implemented in only a limited number of R packages and other software for general use. Of note, most of the available software was initially developed to deal with correlated longitudinal data rather than data from CRTs. There are three R packages on CRAN, which will solve GEEs and produce standard errors: whereas gee (Carey et al. 2012) and geepack (Jun 2002; Halekoh et al. 2006; Højsgaard and Halekoh 2016) are computationally demanding, the package geeM allows a fast estimation through the use of sparse matrix representation (McDaniel et al. 2013). When interest lies in adjusting for missing outcomes using the IPW, all the packages mentioned above require specification of weights. These weights can be computed using packages such as ipw (Wal and Geskus 2011; Geskus and Wal 2015) or directly assigned from a user-defined function. These approaches require the missing data process to be known or correctly specified. Some packages, such as drgee (Zetterqvist and Sjölander 2015), implement doubly robust approaches for uncorrelated data arising from observational studies. These packages provide estimates that are doubly robust in the sense that the consistency of the parameter estimator from the marginal models is guaranteed if the model linking the outcome to covariates and treatment or the model linking the treatment assignment to covariates correctly reflects the true data generation process. These methods have been extended to deal with missing data with IPW approaches in CausalGAM (Glynn and Quinn 2010b; Glynn and Quinn 2010a), but these packages are intended for analysis of observational studies, not CRTs. Finally, the targeted maximum likelihood estimation (tMLE) method allows estimation of the marginal additive effect of a treatment (Laan 2014a). It is implemented in the packages tmle (Gruber 2014) and tmlenet (Sofrygin and Laan 2015) for longitudinal and correlated data. Except for (Porter et al. 2011), there has been little published discussion about the differences between GEE-based and tMLE estimation, and we do not delve into a comparison of the two methods. The focus of this article is only on software implementation of the doubly robust GEE for CRTs.
The paper is organized as follows. Section 2
introduces the theory of the doubly robust estimator and Section
3 describes the features of the CRTgeeDR and the
estimating function denoted GeedrEstimation
. Section 4
compares the performance of CRTgeeDR to geepack for the IPW in CRTs
and illustrates that the DR is consistent and more efficient than the
IPW. Section 5 illustrates the analysis of a dataset on
sanitation in developing countries (Guiteras et al. 2015a) and
illustrates the benefit of using the DR approach compared to standard
GEE. Section 6 presents a discussion.
Consider a CRT comprised of
In presence of rMAR outcome, as in Robins et al. (1995), we estimate
In existing packages such as geepack, the Equation (1) is
implemented as
GEE
procedure,
one can use the option "type=obslevel"
(in the missing
statement) in
order to use the same implementation as in Equation (1). In
general, it is necessary to check the formula used for implementation of
the estimating equation in any desired software to avoid confusion.
Recent advances in methods for analysis of data from CRTs have used augmented GEE to improve efficiency of inferences by incorporating baseline covariates (Stephens et al. 2012); we denote this estimator the AUG. They have also been extended to accommodate missing data using an approach based on the IPW which is doubly robust GEE (DR). The DR properties are described in Prague et al. (2016) and the estimating equation is given by :
Each element of the vector
Without missing data or with data missing completely at random, the use of augmentation may allow a gain in efficiency by incorporating information on baseline covariates. The PS should not be used because it will be misspecified and therefore may lead to an increase of the variance of the estimates. In presence of rMAR data, IPW alone can be used but DR should be preferred in order to increase the chances to have an unbiased estimator. Finally, as mentioned above, for data missing not at random, none of the methods implemented in CRTgeeDR are adequate.
CRTgeeDR
The call function for performing estimation is geeDREstimation
:
R> geeDREstimation(formula, id, data = parent.frame(), family = gaussian,
+ corstr = "independence", Mv = 1, corr.mat = NULL, init.beta = NULL,
+ init.alpha = NULL, init.phi = 1, scale.fix = FALSE, maxit = 20,
+ tol=1e-05, print.log = FALSE, nameTRT = "TRT", nameMISS = "MISSING",
+ nameY = "OUTCOME", sandwich = TRUE, sandwich.nuisance = FALSE,
+ fay.adjustment = FALSE, fay.bound = 0.75, aug = NULL, pi.a = 1/2,
+ model.augmentation.trt = NULL, model.augmentation.ctrl = NULL,
+ stepwise.augmentation = FALSE, weights = NULL, typeweights = "VW",
+ model.weights = NULL, stepwise.weights = FALSE)
The marginal model, to be estimated on the R dataframe data
, is given
in formula
. The link function, family
. The name of the
outcome nameY
, the clustering variable id
, the binary treatment
nameTRT
(with the convention 1 is treated and 0 is control), and the
missing indicator nameMISS
must be specified if they differ from
default values. The algorithm iterates between the estimation the
working correlation structure and regression parameters with a stopping
rule based on stabilisation of estimates (tolerance can be set by the
user; default is tol
max.iter=20
). Depending on the
specification or not of the PS and the OM, geeDREstimation
allows the
implementation of standard GEE, the IPW, the AUG and the DR approaches.
The algorithm is defined as follow:
Determine the PS:
weights
argument. Alternatively one can compute the PS by
fitting a logistic regression of model.weights
. A glm
with logit link
function is internally processed with or without variable selection,
depending on the value of the stepwise.weights
argument. If all of
the above are set to NULL or default, no IPW adjustment will be
made—GEE or AUG will be used. Finally, if despite our concern
about the implementation of weights, one wants to use the same
implementation as in packages geepack or proc GENMOD
in SAS,
then one can set typeweights="GENMOD"
.
Determine group-specific OM:
aug=c(ctrl=
,trt=
)
.
Alternatively, we can regress model.augmentation.trt
and
model.augmentation.ctrl
. A glm
is then internally processed with
or without variable selection depending on the value of the argument
stepwise.augmentation
. If all of the above are set to NULL or
default, no augmentation adjustment will be made—GEE or IPW will
be used. The probability of treatment assignment, which is known in
CRTs, must be specified in the argument pi.a
. Of note for steps 1
and 2, when using the stepwise option to compute the OM or the PS,
one runs the risk of overfitting (Laan 2014b). Avoiding this is
possible by sparsely including only relevant variables in the
selection and also by running a bootstrap diagnostic using outputs
(ps.model
, om.model.trt
and om.model.ctrl
). The underlying
assumption is that the true OM or PS are selected at the end of the
stepwise selection and then held fixed in the estimating equation in
further steps.
Determine the working correlation structure. Available structures
are independence
, exchangeable
, M-dependent
(using Mv
),
unstructured
, or user-defined
(using corr.mat
). Using the
scale.fix
argument, the dispersion parameter
Obtain initial values. They are either specified by the user
(init.beta
, init.alpha
, and init.phi
) or internally defined by
fitting a glm under independence to obtain initial values for
Enter/continue the iterative procedure :
Use the fit from scale.fix
=TRUE) and the parameters in the working correlation
matrix (
Construct the augmented equation given in Equation
(2) and solve it numerically using Newton-Raphson
algorithm for
If
maxtol
and max.iter
go back to 5 else go to 6, where
Compute the requested variances of
sandwich
and
sandwich.nuisance
are set to TRUE, classical and nuisance-adjusted
(for the estimation of parameters jacobian
function of the package
numDeriv
(Gilbert and Varadhan 2015); this is recommended if the AUG, the IPW or the DR
estimator are considered. Finally, a small-sample-adjusted sandwich
estimator of the variance can also be computed using Fay’s
adjustment (Fay and Graubard 2001) setting the argument fay.adjustment
to
TRUE. Its implementation is derived from the function gee.var.fg
in the package geesmv
(Wang 2015).
Consistency and efficiency of the DR estimator depend on the correct
specification of the PS and the OM, see Prague et al. (2016) for theoretical
demonstrations. The user may want to check the adequacy of the selected
OM model to the data by using the function getOMPlot
, which provides
plots to check the glm
model assumption. The “Residuals vs. Fitted"
and the "Scale-location" graphics allow verification of the
homogeneity of the variance and the adequacy of the link function. The”Normal Q-Q" checks for the normal distribution of the residuals. The
“Residuals vs Leverage" plot allows detection of points that have high
leverage on the regression coefficients and that should be investigated
as outliers. In the same spirit, the”Cook’s distance" and the “Cook’s
distance vs leverage" provide measures of the effect of deleting a
given observation. Of note, these graphs are only interpretable for a
continuous outcome. In addition, for the PS model the function
getPSPlot
provides a histogram of the weights. If weights are too
large then the IPW and DR approaches are likely to be unstable. In this
case, the user should compute weights externally using, for example,
stabilized weights with the associated package ipw (Wal and Geskus 2011) or
other approaches such as described in Wang and Paik (2011). Finally, the
user can access the glm
objects created during the PS and OM initial
steps as objects named ps.model
, om.model.trt
, and om.model.ctrl
from the main function geeDREstimation
.
The properties of DR to accommodate complex correlation structure, rMAR
outcomes, and the presence of imbalance in baseline covariates have
already been demonstrated in Prague et al. (2016). In this article, we focus
on the superiority of implementation of weights in the package
CRTgeeDR compared to package geepack. We focus on a simple example
to illustrate that, even in very simple cases, estimators implemented in
broadly used R package geepack for IPW can be inconsistent when using
an exchangeable working correlation structure. This is the case when
We simulated R=10,000 replicates. The observed average proportion of
missing observations is around 25% and the observed average intraclass
correlation is 0.08. Missingness is associated strongly with individual
covariates and, therefore, the weights differ between individuals in the
same cluster. The true value of the odds-ratio for the marginal effect
of treatment is computed for each dataset
Independence (-I) | Exchangeable (-E) | |||||||
Method | Bias | Emp. SE | SE | Cov. | Bias | Emp. SE | SE | Cov. |
No missing data: | ||||||||
GEE CRTgeeDR | 0.002 | 0.102 | 0.099 | 94.3 | 0.002 | 0.108 | 0.099 | 93.2 |
GEE geepack | 0.003 | 0.102 | 0.101 | 94.6 | 0.003 | 0.102 | 0.101 | 94.6 |
AUG CRTgeeDR | 0.002 | 0.101 | 0.099 | 94.3 | 0.002 | 0.109 | 0.114 | 95.8 |
With missing data: | ||||||||
GEE CRTgeeDR | -0.257 | 0.103 | 0.177 | 82.0 | -0.256 | 0.104 | 0.081 | 18.1 |
AUG CRTgeeDR | 0.249 | 0.092 | 0.109 | 35.7 | 0.307 | 0.115 | 0.139 | 37.1 |
With missing data and adjustment for it: | ||||||||
IPW CRTgeeDR | 0.003 | 0.108 | 0.106 | 95.0 | 0.003 | 0.118 | 0.110 | 93.7 |
IPW geepack | 0.008 | 0.107 | 0.104 | 94.8 | 0.582 | 0.577 | 0.357 | 19.4 |
DR1 CRTgeeDR | 0.003 | 0.107 | 0.104 | 94.5 | 0.004 | 0.120 | 0.125 | 96.1 |
DR2 CRTgeeDR | 0.003 | 0.105 | 0.102 | 94.4 | 0.004 | 0.118 | 0.123 | 96.0 |
Marginal mean model: |
PS used for IPW and DR (true): |
PS used for DR2 (omitting interactions in PS): |
OM used for AUG, DR1 and DR2 (fitted for each group |
The results for standard GEE are unbiased in the absence of missing data
(<0.003 for GEE-I and GEE-E with all packages) and biased in presence
of rMAR outcomes reflecting the fact that the missingness is
informative. Using the IPW-I corrects for this bias (0.008 for
geepack). All packages give a similar estimated standard error leading
to acceptable coverage close to their nominal value of 95%. When using
an exchangeable correlation structure, the coverage (93.7%) remains
close to the nominal value for IPW-E using CRTgeeDR, but it drops to
19.4% using geepack. This is mainly driven by an increase in the bias
from 0.003 for CRTgeeDR to 0.582 for geepack for IPW-E. Using the
DR1 version of CRTgeeDR provides consistent estimates (bias
In this section, we present a step-by-step analysis of data from a CRT to investigate the efficacy of alternatives policies on the investment in hygienic latrines in developing countries. A total of 380 communities in rural Bangladesh were assigned to different marketing interventions—community motivation, subsidies, supply side-market, a combination of the three, and a control group. Results of this study were published in (Guiteras et al. 2015a). All the code and data associated with this study are available on dataverse, see url in Guiteras et al. (2015b).
Side-Market supply | Control | All | ||||
---|---|---|---|---|---|---|
Cluster structure | ||||||
36 (n = 1651) | 66 (n = 3186) | 100 (n = 4837) | ||||
49 (15) | 48 (16) | 48 (16) | ||||
Outcome |
Mean | Missing % | Mean | Missing % | Mean | Missing % |
Hygienic Latrine Ownership | 34.8% | 4.2% | 30.3% | 3.1% | 31.8% | 3.5% |
Individual-level |
Mean | Missing % | Mean | Missing % | Mean | Missing % |
Report diarrhea | 4.3% | 0% | 4.8% | 0% | 4.6% | 0% |
Male | 91.1% | 90.0% | 90.1% | |||
Education | 49.2% | 0% | 45.8% | 0% | 46.9% | 0% |
Muslim | 83.2% | 86.3% | 85.2% | |||
Bengali | 85.6% | 88.5% | 87.6% | |||
Agricultor | 75.0% | 70.2% | 71.9% | |||
Stoves | 58.2% | 62.9% | 61.3% | |||
Water Pipes | 89.9% | 91.3% | 90.8% | |||
Phone | 64.1% | 57.2% | 59.5% | |||
Age | 39 (13) | 39 (14) | 39 (14) | |||
Cluster-level |
Mean | Missing % | Mean | Missing % | Mean | Missing % |
Village size | 230 (120) | 0% | 270 (190) | 0% | 256 (170) | 0% |
Nb doctors | 7 (7) | 0% | 9 (18) | 0% | 8 (15) | 0% |
% Landless | 41.6 (12) | 0% | 34.4 (15) | 0% | 36.9 (15) | 0% |
% Almost Landless | 19.3 (11) | 0% | 24.0 (8) | 0% | 22.4 (9) | 0% |
% Access electricity | 59.9 (26) | 0% | 59.1 (20) | 0% | 59.4 (22) | 0% |
We consider only the comparison of a supply side-market versus control. The published analysis used a mixed effect model and showed that the supply side-market alone did not increase the hygienic latrine ownership (+0.3 percentage points, p-value=0.90). We reanalyze the dataset using the GEE approaches in order to get the marginal effect of intervention. Description of the outcome and variables for adjustment are available in Table 2. Because covariates were missing in less than 0.01% of the observations, we assume that covariates are missing completely at random and exclude individuals with missing covariates. The final dataset contains 4774 individuals and 380 clusters. We assume the outcomes are rMAR. As there is some evidence of imbalance in baseline covariates across arms, i.e., the descriptive distributions of covariates in Table 2 are different between treated and control groups, we use the DR approach. We assume that the correlation between any pair of individuals in the same cluster is the same and hence use an exchangeable working correlation structure. In this example, the PS and OM are fitted using a logistic regression with a linear combination of all the individual-level and cluster-level covariates described in Table 2. Variables for these models are selected using a forward stepwise regression before solving the estimating equation. Adequacy of the model has been verified. The code for analysis is available in the Web-Supplementary Material. To illustrate the use of the package CRTgeeDR, we provide instructions for the DR estimator:
R> DR <- geeDREstimation(OUTCOME ~ TRT, id = CLUSTER, data = Sanitation,
+ family = binomial("logit"), corstr = "exchangeable", typeweights = "VW",
+ model.weights = MISSING ~ TRT + DIARRHEA + ... + ELEC_ACCESS,
+ model.augmentation.trt = OUTCOME ~ DIARRHEA + ... + ELEC_ACCESS,
+ model.augmentation.ctrl = OUTCOME ~ DIARRHEA + ... + ELEC_ACCESS,
+ stepwise.weights = TRUE, stepwise.augmentation = TRUE)
R> summary(DR)
The output displays statistics for estimated coefficients
Estimates Model SE Robust SE wald p
(Intercept) -0.8106 0.09396 0.1088 -7.452 0.000000
TRT 0.4365 0.12890 0.1425 3.062 0.002198
Est. Correlation: 0.07306
Correlation Structure: exchangeable
Est. Scale Parameter: 0.9955
Number of GEE iterations: 2
Number of Clusters: 100 Maximum Cluster Size: 87
Number of observations with nonzero weight: 4612
Table 3 presents the PS and OM for analysis, the
estimates, the nuisance-adjusted sandwich estimates of the variance, the
confidence intervals for the odd-ratios, the p-values, and the
computation times for each of these analysis. For DR the computation
time is 20 seconds, most of which is required for the computation of the
nuisance-adjusted sandwich estimator of the variance (the estimation is
R> summary(DR$ps.model)
R> summary(DR$om.model.trt)
R> summary(DR$om.model.ctrl)
R> getPSPlot(DR)
Sandwich | Nuis-adj. | p-value | time | ||||||
SE | SE | OR | Unadj. | Nuis-adj. | (sec.) | ||||
GEE | 0.19 | 0.171 | - | 1.21 | 0.87 | 1.69 | 0.262 | - | 1 |
IPW | 0.19 | 0.182 | 0.219 | 1.21 | 0.79 | 1.86 | 0.290 | 0.386 | 32 |
AUG | 0.45 | 0.141 | 0.176 | 1.57 | 1.12 | 2.22 | 0.001 | 0.010 | 11 |
DR | 0.44 | 0.143 | 0.183 | 1.55 | 1.08 | 2.21 | 0.002 | 0.016 | 20 |
Marginal mean model: |
PS: logit |
OM: logit |
Description of models for OM, PS and histogram of weights are given in the Web-Supplementary Material Table 1 and Figure 1. As noted in Table 3, the estimates for IPW are close to those for GEE, reflecting the fact that only 3.5% of data are missing. We also note that all of the non-null weights are close to 1 (1.035 [1.02; 1.04]) showing that no covariate of the PS explains the missingness pattern. Thus, the increased significance of the intervention in the DR analysis compared to GEE is mainly driven by the augmentation. In both groups, households with higher education and economic status (as evidenced by stoves, water pipes, phones, and other factors) are more likely to have a hygienic latrine. For cluster-level covariates the patterns differ by intervention group: a high number of doctors is positively associated with the hygienic latrine ownership only in the intervention group indicating a potential synergy between the number of doctors and the presence of side-supply markets.
We demonstrated that the IPW can be biased in CRTs if the weights are not implemented as described in Robins et al. (1995) and a non-independence working correlation structure is chosen. In particular, we discuss problems that arise in the package geepack implemented in R. These concerns apply not only for outcome data in CRTs but also to longitudinal outcome data, when the probability that an observations is missing at a given time depends on time-varying covariates measured at other times. We recommend to always check the implementation in the software that has been chosen for analysis. The CRTgeeDR package protects against this bias and allows for adjustment in imbalance in baseline covariates in CRTs. The package can accommodate a wide range of outcome types, link functions, and working correlation structures. The CRTgeeDR package is easy to use and does not require extensive programming. It therefore makes the augmented GEE (AUG) and the Doubly robust (DR) methodology for CRTs more accessible to applied researchers. Of note, although the CRTgeeDR package had been designed for CRTs, it can also be used for analysis of correlated longitudinal data from a randomized trial. The use of version 2.0 of the CRTgeeDR package to analyze observational clustered data (in which treatment attribution may be informative) is not straightforward, but updates with these capabilities are under development.
We thank R. Guiteras for sharing the Sanitation study on the dataverse website. This work was founded by NIH grants R37 AI 51164 and R01 MH100974. Portions of this research were conducted on the Cluster at Harvard Medical (NIH grant NCRR 1S10RR028832-01).
CRTgeeDR, gee, geepack, geeM, ipw, drgee, CausalGAM, tmle, tmlenet, numDeriv, geesmv
CausalInference, Econometrics, MissingData, MixedModels, NumericalMathematics, Robust
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Prague, et al., "CRTgeeDR: an R Package for Doubly Robust Generalized Estimating Equations Estimations in Cluster Randomized Trials with Missing Data", The R Journal, 2017
BibTeX citation
@article{RJ-2017-041, author = {Prague, Melanie and Wang, Rui and Gruttola, Victor De}, title = {CRTgeeDR: an R Package for Doubly Robust Generalized Estimating Equations Estimations in Cluster Randomized Trials with Missing Data}, journal = {The R Journal}, year = {2017}, note = {https://rjournal.github.io/}, volume = {9}, issue = {2}, issn = {2073-4859}, pages = {105-115} }