Though longitudinal data often contain missing responses and error-prone covariates, relatively little work has been available to simultaneously correct for the effects of response missingness and covariate measurement error in the analysis of longitudinal data. Yi (2008) proposed a simulation-based marginal method to adjust for the bias induced by measurement error in covariates as well as by missingness in the response. The method focuses on modeling the marginal mean and variance structures, and a missing at random mechanism is assumed. Furthermore, the distribution of covariates is left unspecified. These features make the method applicable to a broad range of settings. In this paper, we develop an R package, called swgee, which implements the method proposed by Yi (2008). Moreover, our package includes additional implementation steps which extend the setting considered by Yi (2008). To describe the use of the package and its main features, we report simulation studies and analyses of a data set arising from the Framingham Heart Study.
Longitudinal studies are commonly conducted in the health sciences, biochemistry, and epidemiology; these studies typically collect repeated measurements on the same subject over time. Missing observations and covariate measurement error frequently arise in longitudinal studies, and they present considerable challenges for statistical inference about such data (Carroll et al. 2006; Yi 2008). It has been well documented that ignoring missing responses and covariate measurement error may lead to severely biased results and hence to invalid inferences (Fuller 1987; Carroll et al. 2006).
For longitudinal data with missing responses, an extensive range of methods is available, such as maximum likelihood, multiple imputation, and the weighted generalized estimating equations (GEE) method (Little and Rubin 2002). For handling measurement error in covariates, many methods have been developed for various settings. Comprehensive discussions can be found in Fuller (1987), Gustafson (2003), Carroll et al. (2006), Buonaccorsi (2010) and Yi (2017). However, there has been relatively little work on simultaneously addressing the effects of response missingness and covariate measurement error in longitudinal data analysis, although some work, such as Wang et al. (2008), Liu and Wu (2007) and Yi et al. (2012), is available. In particular, Yi (2008) proposed an estimation method based on the marginal model for the response process, which does not require the full specification of the distribution of the response variable but models only the mean and variance structures. Furthermore, a functional method is applied to relax the need to model the covariate process. These features make the method of Yi (2008) flexible for many applications.
Relevant to our R package, several R packages and statistical software tools are available for performing GEE and weighted GEE analyses of longitudinal data with missing observations. In particular, packages gee (Carey 2015) and yags (Carey 2011) perform GEE analyses under the strong assumption of missing completely at random (MCAR) (Kenward 1998). Package wgeesel (Xu et al. 2018) can perform model selection based on weighted GEE/GEE. Package geepack (Hojsgaard et al. 2016) implements weighted GEE analyses under the missing at random (MAR) assumption, in which an optional vector of weights can be used in the fitting process, but the weight vector has to be calculated externally. In addition, the statistical software SAS/STAT version 13.2 (SAS Institute Inc. 2014) includes an experimental version of the procedure PROC GEE (Lin and Rodriguez 2015), which fits weighted GEE models.
Our swgee package has several features that distinguish it from existing packages. First, swgee is designed to analyze longitudinal data with both missing responses and error-prone covariates. To the best of our knowledge, this is the first R package that can simultaneously account for response missingness and covariate measurement error. Secondly, the simulation-based marginal method can be applied to a broad range of problems because the associated model assumptions are minimal. swgee can be directly applied to handle continuous and binary responses as well as count data with dropouts under the MAR and MCAR mechanisms. Thirdly, observations are weighted inversely proportional to their probability of being observed, with the weights calculated internally. Lastly, the swgee package employs the simulation extrapolation (SIMEX) algorithm to account for the effect of measurement error in covariates.
The remainder of the paper is organized as follows. Section 2 introduces the notation and model setup. Section 3 describes the method proposed by Yi (2008), and Section 4 describes its implementation in R. The developed R package is illustrated with simulation studies and analyses of a data set arising from the Framingham Heart Study in Section 5. A general discussion is included in Section 6.
For
For
To model the variance of
For
For
The inverse probability weighted generalized estimating equations method
is often employed to accommodate the missing data effects (e.g., Robins et al. 1995; Preisser et al. 2002; Qu et al. 2011) when primary interest lies in the estimation of the
marginal mean parameters
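As a rough illustration of the inverse probability weighting idea, the following base R sketch simulates monotone dropout, fits a logistic model for the probability of remaining in the study, and forms each weight as the inverse of the cumulative product of the fitted probabilities. The data-generating values, the variable names (`id`, `visit`, `y_prev`, `obs`, `w`) and the use of `glm` are assumptions made for illustration only; this is not the code used inside swgee.

```r
set.seed(1)
n <- 200; m <- 4                      # subjects and visits (illustrative values)
dat <- expand.grid(visit = 1:m, id = 1:n)
dat <- dat[order(dat$id, dat$visit), ]
dat$y <- rbinom(n * m, 1, 0.4)        # toy binary response

# Previous response; set to 0 at the first visit, where everyone is observed
dat$y_prev <- ave(dat$y, dat$id, FUN = function(v) c(0, head(v, -1)))

# Monotone dropout: staying at visit j depends on the previous response (MAR)
p_stay <- ifelse(dat$visit == 1, 1, plogis(2 - 1 * dat$y_prev))
stay   <- rbinom(n * m, 1, p_stay)
dat$obs <- ave(stay, dat$id, FUN = cumprod)   # once dropped out, never returns

# At-risk rows: visits after the first whose previous visit was observed
obs_prev <- ave(dat$obs, dat$id, FUN = function(v) c(1, head(v, -1)))
risk <- dat$visit > 1 & obs_prev == 1
fit  <- glm(obs ~ y_prev, family = binomial, data = dat[risk, ])

# Fitted conditional probabilities (1 at the first visit), then cumulative products
lam <- rep(1, nrow(dat))
lam[dat$visit > 1] <- predict(fit, newdata = dat[dat$visit > 1, ],
                              type = "response")
pi_hat <- ave(lam, dat$id, FUN = cumprod)
dat$w  <- ifelse(dat$obs == 1, 1 / pi_hat, 0)  # inverse probability weights
```

Observations that are less likely to be observed receive larger weights, so that the observed responses stand in for similar subjects who dropped out.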
In the absence of measurement error, that is, covariates
When measurement error is present in covariates
Now, we describe the SIMEX method developed by Yi (2008). Let
The SIMEX approach is very appealing because of its simplicity of
implementation and no requirement of modeling the true covariates
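To make the simulation and extrapolation steps concrete, here is a minimal sketch of the generic SIMEX idea in a deliberately simpler setting than the one treated in this paper: an ordinary linear regression with one error-prone covariate and a known measurement error standard deviation. The model, the parameter values, and the use of `lm` in place of weighted GEE are all assumptions chosen for illustration.

```r
set.seed(2)
n <- 2000
x <- rnorm(n)                         # true covariate (unobserved in practice)
y <- 1 + 0.8 * x + rnorm(n, sd = 0.5)
sigma_e <- 0.5                        # measurement error SD, assumed known
w <- x + rnorm(n, sd = sigma_e)       # observed surrogate

lambda <- seq(0, 2, 0.5)              # grid of added-error multipliers
B <- 50                               # simulated data sets per lambda

# Simulation step: inflate the error variance to (1 + lambda) * sigma_e^2
slope_lambda <- sapply(lambda, function(l) {
  mean(replicate(B, {
    w_l <- w + sqrt(l) * rnorm(n, sd = sigma_e)
    coef(lm(y ~ w_l))[2]
  }))
})

# Extrapolation step: fit a quadratic in lambda and evaluate it at lambda = -1
quad <- lm(slope_lambda ~ lambda + I(lambda^2))
simex_slope <- unname(predict(quad, newdata = data.frame(lambda = -1)))
naive_slope <- slope_lambda[1]        # lambda = 0 reproduces the naive fit
c(naive = naive_slope, simex = simex_slope)
```

In this toy setting the naive slope is attenuated toward zero, and the extrapolated value moves back toward the true slope of 0.8; the quadratic extrapolant typically removes most, but not all, of the bias.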
Finally, we extend the method of Yi (2008) to accommodate the case where
the covariance matrix
A simple way to generate a contrast
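The construction itself is not shown above, so the sketch below uses one standard choice from the empirical SIMEX literature, which may or may not be the one intended here: draw standard normal variates, center them, and scale them to unit length. The helper name `make_contrast`, the replicate values, and the value of `lambda` are hypothetical.

```r
set.seed(3)
# Hypothetical helper: a random contrast c with sum(c) = 0 and sum(c^2) = 1
make_contrast <- function(m) {
  z <- rnorm(m)
  d <- z - mean(z)
  d / sqrt(sum(d^2))
}

m <- 3
w_rep <- c(1.9, 2.3, 2.1)             # made-up replicate surrogates for one covariate
lambda <- 0.5
cc <- make_contrast(m)

# Pseudo-surrogate: replicate mean plus a contrast-based pseudo-error.
# Because sum(cc) = 0, the true covariate cancels out of the added term.
w_star <- mean(w_rep) + sqrt(lambda / m) * sum(cc * w_rep)
```

With independent replicate errors of common variance, the added term has variance equal to lambda times the error variance of the replicate mean, which is why the unknown error covariance never has to be plugged in explicitly.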
We implement the SIMEX procedure described in Section
3 in R and develop the package, called swgee. Our
package swgee takes advantage of existing R packages geepack
(Hojsgaard et al. 2016) and mvtnorm
(Genz and Bretz 2009; Genz et al. 2018). Specifically, the function swgee
produces the estimates
for elements of the parameter vector
Our R function swgee
requires the input data set to be sorted by
subject swgee
can
internally generate the missing data indicators swgee
output.
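The data layout described above can be prepared along the following lines. This is only a sketch with a made-up data frame: the column names `id`, `visit` and `y` are hypothetical, and the indicator is named `O` merely to mirror the left-hand side of the missingmodel formula used in the example later on. Since swgee is described as generating the indicators internally, the indicator is built here only to show what it looks like.

```r
# Toy long-format data; the column names id, visit and y are hypothetical
toy <- data.frame(
  id    = c(2, 1, 2, 1, 3, 3),
  visit = c(1, 2, 2, 1, 1, 2),
  y     = c(0, NA, 1, 1, 0, NA)
)

# Sort by subject, then by visit within subject
toy <- toy[order(toy$id, toy$visit), ]
rownames(toy) <- NULL

# Observation indicator: 1 if the response is observed, 0 if it is missing
toy$O <- as.integer(!is.na(toy$y))

# Check that the missingness pattern is monotone (dropout) within each subject
monotone <- all(tapply(toy$O, toy$id, function(o) all(diff(o) <= 0)))
toy
```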
The form of calling function swgee
is given by
swgee(formula, data, id, family, corstr, missingmodel, SIMEXvariable,
SIMEX.err, repeated = FALSE, repind = NULL, B, lambda)
where the arguments are described as follows:
formula: This argument specifies the model to be fitted, with the variables coming with data. See the documentation of geeglm and its formula for details.
data: This is the same as the data argument in the R function geeglm, which specifies the data frame showing how variables occur in the formula, along with the id variable.
id: This is the vector which identifies the labels of subjects, i.e., the id for subject
family: This argument describes the error distribution together with the link function for model (1). See the documentation of geeglm and its argument family for details.
corstr: This is a character string specifying the correlation structure. See the documentation of geeglm and its argument corstr for details.
missingmodel: This argument specifies the formula to be fitted for the missing data model (4). See the documentation of geeglm and its formula for details.
SIMEXvariable: This is the vector of characters containing the names of the covariates which are subject to measurement error.
SIMEX.err: This argument specifies the covariance matrix of measurement errors in the measurement error model (5).
repeated: This is the indicator whether measurement error model is given by (5) or by (8). The default value FALSE corresponds to model (5).
repind: This is the index of the repeated surrogate measurements
B: This argument sets the number of simulated samples for the simulation step. The default is set to be 50.
lambda: This is the vector
To illustrate the usage of the developed R package swgee, we apply the
package to a subset of GWA13 (Genetic Analysis Workshops) data arising
from the Framingham Heart Study. The data set consists of measurements
of 100 patients from a series of exams with 5 assessments for each
individual. Measurements such as height, weight, age, systolic blood
pressure (SBP) and cholesterol level (CHOL) are collected at each
assessment, and
where
We now apply the developed R package swgee, which can be downloaded from CRAN and then loaded in R:
R> library("swgee")
Next, we load the data, which must be properly organized with the variable names specified. In this example, the data set BMI is loaded and stored as bmidata by issuing
R> data("BMI")
R> bmidata <- BMI
We are concerned with how measurement error in SBP and CHOL impacts
estimation of parameter
The naive GEE approach, which ignores both missingness and measurement error in covariates, gives the following output:
R> library("gee")
R> output1 <- gee(bbmi~sbp+chol+age, id=id, data=bmidata,
+ family=binomial(link="logit"), corstr="independence")
R> summary(output1)
GEE: GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
gee S-function, version 4.13 modified 98/01/27 (1998)
Model:
Link: Logit
Variance to Mean Relation: Binomial
Correlation Structure: Independent
Call:
gee(formula = bbmi ~ sbp + chol + age, id = id, data = bmidata,
family = binomial(link = "logit"), corstr = "independence")
Summary of Residuals:
        Min          1Q      Median          3Q         Max
-0.26533967 -0.11385369 -0.08572483 -0.06279540  0.95475735
Coefficients:
               Estimate Naive S.E.    Naive z Robust S.E.   Robust z
(Intercept) -5.43746374 1.42090827 -3.8267521  1.64320527 -3.3090593
sbp          0.59071183 0.30643396  1.9276970  0.24338420  2.4270755
chol         0.11109496 0.13654324  0.8136247  0.23086218  0.4812177
age          0.01297337 0.01339946  0.9682008  0.01814546  0.7149652
Estimated Scale Parameter: 1.017131
Number of Iterations: 1
Working Correlation
[,1] [,2] [,3] [,4] [,5]
[1,] 1 0 0 0 0
[2,] 0 1 0 0 0
[3,] 0 0 1 0 0
[4,] 0 0 0 1 0
[5,] 0 0 0 0 1
To adjust for possible effects of missingness as well as measurement
error in variables SBP and CHOL, we call the developed function swgee
for the analysis:
R> set.seed(1000)
R> sigma <- diag(rep(0.25, 2))
R> output2 <- swgee(bbmi~sbp+chol+age, data=bmidata, id=id,
+ family=binomial(link="logit"), corstr="independence",
+ missingmodel=O~bbmi+sbp+chol+age, SIMEXvariable=c("sbp","chol"),
+ SIMEX.err=sigma, repeated=FALSE, B=100, lambda=seq(0, 2, 0.5))
R> summary(output2)
Call: beta
             Estimate    StdErr t.value   p.value
(Intercept) -8.004577  2.060967 -3.8839 0.0001028 ***
sbp          1.196363  0.356868  3.3524 0.0008011 ***
chol         0.099984  0.264180  0.3785 0.7050810
age          0.012718  0.017201  0.7394 0.4596520
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Call: alpha
        Estimate    StdErr t.value  p.value
alpha1  9.019084  3.086533  2.9221 0.003477 **
alpha2 -0.786135  0.656843 -1.1968 0.231370
alpha3 -0.568740  0.732885 -0.7760 0.437732
alpha4 -0.128941  0.247757 -0.5204 0.602761
alpha5 -0.064257  0.025982 -2.4731 0.013395 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
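As a quick check on how to read this output, the following sketch recomputes the Wald statistics and two-sided p-values for the response model from the printed estimates and standard errors, and converts the covariate effects to odds ratios with approximate 95% intervals. The normal reference distribution for the p-values is an assumption on our part, although it appears to reproduce the printed values; the object names are illustrative.

```r
# Estimates and standard errors copied from the beta table printed above
est <- c(`(Intercept)` = -8.004577, sbp = 1.196363,
         chol = 0.099984, age = 0.012718)
se  <- c(`(Intercept)` =  2.060967, sbp = 0.356868,
         chol = 0.264180, age = 0.017201)

z <- est / se                          # Wald statistics
p <- 2 * pnorm(-abs(z))                # two-sided p-values, normal reference assumed
odds_ratio <- exp(est[-1])             # odds ratios per unit increase in each covariate
ci <- exp(cbind(lower = est - qnorm(0.975) * se,
                upper = est + qnorm(0.975) * se))[-1, ]
round(cbind(z = z, p = p), 4)
```

Comparing the two fits, the adjusted estimate for sbp (1.196) is roughly twice the naive GEE estimate (0.591), while the estimates for chol and age change little.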
The function swgee stores the individual estimated coefficients from the
simulation step, which enables us to display the extrapolation curve
using the R function plot.swgee. The plot.swgee function plots the
extrapolation of the estimate of each covariate effect with the
quadratic extrapolant. Figure 1 displays the graph for the variable SBP
in the example, obtained with quadratic extrapolation from the following
command:
R> plot(output2,"sbp")
In this section, we conduct simulation studies to investigate the impact
of ignoring covariate measurement error and response missingness on
estimation, where the implementation is carried out using the usual GEE
method. Furthermore, we assess the performance of the swgee method which
accommodates the effects induced by error-prone covariates and missing
responses. We set
In Table 1, we report on the results of the biases of the estimates
(Bias), the empirical standard error (SE), and the coverage rate (CR in
percent) for
In summary, ignoring measurement error may lead to substantially biased results, so properly addressing covariate measurement error in estimation procedures is necessary. The proposed swgee method performs reasonably well under various configurations. As expected, its performance may become less satisfactory when measurement error becomes substantial. Even then, the swgee method improves considerably on the naive gee analysis: for example, in the (0.75, 0.75) configuration of Table 1, the coverage rate in the first coefficient block rises from 58.8% under gee to 90.0% under swgee.
 |  | Method | Bias | SE | CR | Bias | SE | CR | Bias | SE | CR
0.25 | 0.25 | gee | -0.0310 | 0.1228 | 92.6 | -0.0158 | 0.1246 | 92.6 | 0.0063 | 0.2121 | 94.6
0.25 | 0.25 | swgee | -0.0062 | 0.1420 | 95.0 | 0.0104 | 0.1425 | 95.2 | 0.0036 | 0.2354 | 95.6
0.25 | 0.50 | gee | -0.0019 | 0.1212 | 95.4 | -0.0997 | 0.1156 | 83.4 | 0.0082 | 0.2110 | 94.2
0.25 | 0.50 | swgee | -0.0003 | 0.1415 | 95.0 | -0.0087 | 0.1543 | 93.0 | 0.0035 | 0.2361 | 95.6
0.25 | 0.75 | gee | 0.0328 | 0.1189 | 95.4 | -0.1841 | 0.1022 | 51.0 | 0.0101 | 0.2100 | 94.0
0.25 | 0.75 | swgee | 0.0205 | 0.1407 | 95.8 | -0.0660 | 0.1562 | 86.4 | 0.0046 | 0.2359 | 95.6
0.50 | 0.25 | gee | -0.1156 | 0.1114 | 78.2 | 0.0139 | 0.1236 | 94.2 | 0.0078 | 0.2113 | 94.6
0.50 | 0.25 | swgee | -0.0282 | 0.1520 | 93.2 | 0.0177 | 0.1431 | 95.4 | 0.0031 | 0.2362 | 95.2
0.50 | 0.50 | gee | -0.0948 | 0.1114 | 81.8 | -0.0780 | 0.1161 | 85.6 | 0.0102 | 0.2099 | 94.2
0.50 | 0.50 | swgee | -0.0228 | 0.1510 | 93.8 | -0.0022 | 0.1542 | 93.6 | 0.0030 | 0.2370 | 95.4
0.50 | 0.75 | gee | -0.0629 | 0.1103 | 87.8 | -0.1727 | 0.1036 | 55.6 | 0.0125 | 0.2088 | 94.2
0.50 | 0.75 | swgee | -0.0052 | 0.1499 | 94.8 | -0.0608 | 0.1570 | 87.2 | 0.0042 | 0.2369 | 95.2
0.75 | 0.25 | gee | -0.1991 | 0.0966 | 45.6 | 0.0484 | 0.1216 | 94.2 | 0.0092 | 0.2107 | 94.6
0.75 | 0.25 | swgee | -0.0870 | 0.1508 | 86.4 | 0.0395 | 0.1430 | 93.6 | 0.0034 | 0.2366 | 95.2
0.75 | 0.50 | gee | -0.1889 | 0.0976 | 50.0 | -0.0458 | 0.1154 | 89.8 | 0.0121 | 0.2091 | 94.0
0.75 | 0.50 | swgee | -0.0831 | 0.1509 | 87.8 | 0.0165 | 0.1539 | 94.0 | 0.0034 | 0.2375 | 95.4
0.75 | 0.75 | gee | -0.1636 | 0.0974 | 58.8 | -0.1468 | 0.1039 | 66.4 | 0.0147 | 0.2077 | 94.2
0.75 | 0.75 | swgee | -0.0678 | 0.1505 | 90.0 | -0.0442 | 0.1574 | 88.8 | 0.0046 | 0.2374 | 95.2
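For readers who wish to reproduce summary measures of the kind reported in Table 1, the sketch below shows how bias, empirical standard error, and coverage rate can be computed from replicated estimates. It uses a deliberately trivial estimator (a sample mean with a known true value) rather than the gee or swgee fits, and it assumes that the coverage rate refers to 95% Wald intervals; the number of replicates and the sample size are likewise illustrative.

```r
set.seed(4)
truth <- 0.5                           # true parameter value in this toy setting
n_rep <- 500; n <- 50                  # replicates and sample size (illustrative)

# Each replicate returns an estimate and its estimated standard error
sims <- t(replicate(n_rep, {
  x <- rnorm(n, mean = truth)
  c(est = mean(x), se = sd(x) / sqrt(n))
}))

bias   <- mean(sims[, "est"]) - truth  # average estimate minus the truth
emp_se <- sd(sims[, "est"])            # empirical standard error
lower  <- sims[, "est"] - qnorm(0.975) * sims[, "se"]
upper  <- sims[, "est"] + qnorm(0.975) * sims[, "se"]
cr     <- 100 * mean(lower <= truth & truth <= upper)  # coverage rate in percent
c(Bias = bias, SE = emp_se, CR = cr)
```

A coverage rate well below the nominal level, as seen for the gee rows with larger measurement error, indicates that the intervals are centered on a biased estimate, are too narrow, or both.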
Missing observations and covariate measurement error commonly arise in longitudinal data. However, there has been relatively little work on simultaneously accounting for the effects of response missingness and covariate measurement error on the estimation of response model parameters for longitudinal data. Yi (2008) described a simulation-based marginal method to adjust for the biases induced by both missingness and covariate measurement error. The method does not require the full specification of the distribution of the response vector but only requires modeling its mean and covariance structure. In addition, the distribution of covariates is left unspecified, which is desirable for many practical problems. These features make the method flexible.
Here we not only develop the R package swgee to implement the method by Yi (2008), but also include an extended setting in the package. Our aim is to provide analysts an accessible tool for the analysis of longitudinal data with missing responses and error-prone covariates. Our illustrations show that the developed package has the advantages of simplicity and versatility.
Juan Xiong was supported by the Natural Science Foundation of SZU (grant no. 2017094). Grace Y. Yi was supported by the Natural Sciences and Engineering Research Council of Canada. The authors thank Boston University and the National Heart, Lung, and Blood Institute (NHLBI) for providing the data set from the Framingham Heart Study (No. N01-HC-25195) used in the illustration. The Framingham Heart Study is conducted and supported by the NHLBI in collaboration with Boston University. This manuscript was not prepared in collaboration with investigators of the Framingham Heart Study and does not necessarily reflect the opinions or views of the Framingham Heart Study, Boston University, or NHLBI.
Conflict of Interest: None declared.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Xiong & Yi, "swgee: An R Package for Analyzing Longitudinal Data with Response Missingness and Covariate Measurement Error", The R Journal, 2019
BibTeX citation
@article{RJ-2019-031, author = {Xiong, Juan and Yi, Grace Y.}, title = {swgee: An R Package for Analyzing Longitudinal Data with Response Missingness and Covariate Measurement Error}, journal = {The R Journal}, year = {2019}, note = {https://rjournal.github.io/}, volume = {11}, issue = {1}, issn = {2073-4859}, pages = {416-426} }