Though longitudinal data often contain missing responses and error-prone covariates, relatively little work has been available to simultaneously correct for the effects of response missingness and covariate measurement error on analysis of longitudinal data. Yi (2008) proposed a simulation based marginal method to adjust for the bias induced by measurement error in covariates as well as by missingness in response. The proposed method focuses on modeling the marginal mean and variance structures, and the missing at random mechanism is assumed. Furthermore, the distribution of covariates are left unspecified. These features make the proposed method applicable to a broad settings. In this paper, we develop an R package, called swgee, which implements the method proposed by Yi (2008). Moreover, our package includes additional implementation steps which extend the setting considered by Yi (2008). To describe the use of the package and its main features, we report simulation studies and analyses of a data set arising from the Framingham Heart Study.
Longitudinal studies are commonly conducted in the health sciences, biochemical, and epidemiology fields; these studies typically collect repeated measurements on the same subject over time. Missing observations and covariate measurement error frequently arise in longitudinal studies and they present considerable challenges in statistical inference about such data (Carroll et al. 2006; Yi 2008). It has been well documented that ignoring missing responses and covariate measurement error may lead to severely biased results, thus leading to invalid inferences (Fuller 1987; Carroll et al. 2006).
Regarding longitudinal data with missing responses, there has been extensive methods such as maximum likelihood, multiple imputation, and weighted generalized estimating equations (GEE) method (Little and Rubin 2002). In terms of methods of handling measurement error in covariate, many methods have been developed for various settings. Comprehensive discussions can be found in Fuller (1987), Gustafson (2003), Carroll et al. (2006), Buonaccorsi (2010) and Yi (2017). However, there has been relatively little work on simultaneously addressing the effects of response missingness and covariate measurement error in longitudinal data analysis, although some work such as Wang et al. (2008), Liu and Wu (2007) and Yi et al. (2012), are available. In particular, Yi (2008) proposed an estimation method based on the marginal model for the response process, which does not require the full specification of the distribution of the response variable but models only the mean and variance structures. Furthermore, a functional method is applied to relax the need of modeling the covariate process. These features make the method of Yi (2008) flexible for many applications.
Relevant to our R package, a set of R packages and statistical software have been available for performing the GEE and weighted GEE analyses for longitudinal data with missing observations. In particular, package gee (Carey 2015) and yags (Carey 2011) perform the GEE analyses under the strong assumption of missing completely at random (MCAR) (Kenward 1998). Package wgeesel (Xu et al. 2018) can perform the multiple model selection based on weighted GEE/GEE. Package geepack (Hojsgaard et al. 2016) implements the weighted GEE analyses under the missing at random (MAR) assumption, in which an optional vector of weights can be used in the fitting process but the weight vector has to be externally calculated. In addition, the statistical software SAS/STAT version 13.2 (SAS Institute Inc. 2014) includes an experimental version of the function PROC GEE (Lin and Rodriguez 2015), which fits weighted GEE models.
Our swgee package has several features distinguishing from existing packages. First, swgee is designed to analyze longitudinal data with both missing responses and error-prone covariates. To the best of our knowledge, this is the first R package that can simultaneously account for response missingness and covariate measurement error. Secondly, this simulation based marginal method can be applied to a broad range of problems because the associated model assumptions are minimal. swgee can be directly applied to handle continuous and binary responses as well as count data with dropouts under the MAR and MCAR mechanisms. Thirdly, observations are weighted inversely proportional to their probability of being observed, with weights calculated internally. Lastly, the swgee package employs the simulation extrapolation (SIMEX) algorithm to account for the effect of measurement error in covariates.
The remainder is organized as follows. Section 2 introduces the notation and model setup. In Section 3, we describe the method proposed by (Yi 2008) and its implementation in R in Section 4. The developed R package is illustrated with simulation studies and analyses of a data set arising from the Framingham Heart Study in Section 5. General discussion is included in Section 6.
For \({i}=1, \dots, n\) and \({j}=1, \dots, m\), let \(Y_{ij}\) be the response variable for subject \({i}\) at time point \({j}\), let \(\mathbf{X}_{ij}\) be the vector of covariates subject to error, and \(\mathbf{Z}_{ij}\) be the vector of covariates which are error-free. Write \(\mathbf{Y}_{i} =(Y_{i1}, Y_{i2}, \dots, Y_{im})'\), \(\mathbf{X}_{i} =(\mathbf{X}'_{i1}, \mathbf{X}'_{i2}, \dots, \mathbf{X}'_{im})'\), and \(\mathbf{Z}_{i} =(\mathbf{Z}'_{i1}, \mathbf{Z}'_{i2}, \dots, \mathbf{Z}'_{im})'\).
For \({i}=1, \dots, n\) and \({j}=1, \dots, m\), let \(\mu_{ij}=E(Y_{ij}|\mathbf{X}_{i}, \mathbf{Z}_{i})\) and \({v}_{ij}=var(Y_{ij}|\mathbf{X}_{i}, \mathbf{Z}_{i})\) be the conditional expectation and variance of \(Y_{ij}\), given the covariates \(\mathbf{X}_{i}\) and \(\mathbf{Z}_{i}\), respectively. We model the influence of the covariates on the marginal response mean by means of a regression model: \[\label{responseR} g(\mu_{ij})=\mathbf{X}'_{ij}\boldsymbol{\beta}_x + \mathbf{Z}'_{ij}\boldsymbol{\beta}_z, \tag{1}\] where \(\boldsymbol{\beta}=(\boldsymbol{\beta}'_x, \boldsymbol{\beta}'_z)'\) is the vector of regression parameters and \(g(\cdot)\) is a specified monotone function. The intercept term, if any, of the model may be included as the first element of \(\boldsymbol{\beta}_z\) by including the unit vector as the first column of \(\mathbf{Z}_{i}\).
To model the variance of \(Y_{ij}\), we consider \[\label{varianceY} v_{ij}=h(\mu_{ij}; \phi), \tag{2}\] where \(h(\cdot;\cdot)\) is a given function and \(\phi\) is the dispersion parameter that is known or to be estimated. We treat \(\phi\) as known with emphasis setting on estimation of the \(\boldsymbol{\beta}\) parameter. Here we assume that \(E(Y_{ij}^k|\mathbf{X}_{i}, \mathbf{Z}_{i}) = E(Y_{ij}^k|\mathbf{X}_{ij}, \mathbf{Z}_{ij})\) for \(k=1\) and \(2\), that is, the dependence of the mean \(\mu_{ij}\) and the variance \(v_{ij}\) on the subject-level covariates \(\mathbf{X}_{i}\) and \(\mathbf{Z}_{i}\) is completely reflected by the dependence on the time-specific covariates \(\mathbf{X}_{ij}\) and \(\mathbf{Z}_{ij}\). This assumption has been widely used in marginal analysis of longitudinal analysis (e. g. , Diggle and Kenward 1994; Lai and Small 2007). The necessity of these assumptions was discussed by Yi (2017 5.1.1).
For \({i}=1, \dots, n\) and \({j}=1, \dots, m\), let \(O_{ij}\) be 1 if \(Y_{ij}\) is observed and 0 otherwise, and let \(\mathbf{O}_{i}=(O_{i1}, O_{i2}, \dots, O_{im})'\) be the vector of missing data indicators. Dropouts or monotone missing data patterns are considered here. That is, \(O_{ij}=0\) implies \(O_{ij'}=0\) for all \(j' > j\). We assume that \(O_{i1}=1\) for every subject \({i}\). To reflect the dynamic nature of the observation process over time, we assume an MAR mechanism for the missing process. That is, given the covariates, the missingness probability depends on the observed responses but not unobserved response components (Little and Rubin 2002). Let \(\lambda_{ij} = P(O_{ij}=1|O_{i,j-1}=1, \mathbf{X}_{i}, \mathbf{Z}_{i}, \mathbf{Y}_{i})\) and \(\pi_{ij} = P(O_{ij}=1|\mathbf{X}_{i}, \mathbf{Z}_{i}, \mathbf{Y}_{i})\), then \[\label{missingv} \pi_{ij} = \prod_{t=2}^{j}\lambda_{it}. \tag{3}\] Logistic regression models are used to model the dropout process: \[\label{logisticR} logit(\lambda_{ij})=\mathbf{u}'_{ij}\boldsymbol{\alpha}, \tag{4}\] for \(j=2, \dots, m\), where \(\mathbf{u}_{ij}\) is the vector consisting of the information of the covariates \(\mathbf{X}_{i}\), \(\mathbf{Z}_{i}\) and the observed responses, and \(\boldsymbol{\alpha}\) is the vector of regression parameters. Write \(\boldsymbol{\theta} = (\boldsymbol{\alpha}', \boldsymbol{\beta}')'\) and let \(q=dim(\boldsymbol{\theta})\).
For \({i}=1, \dots, n\) and \({j}=1, \dots, m\), let \(\mathbf{W}_{ij}\) be the observed measurements of the covariates \(\mathbf{X}_{ij}\). Covariates \(\mathbf{X}_{ij}\) and their observed measurements \(\mathbf{W}_{ij}\) are assumed to follow a classical additive measurement error model: \[\label{MEmodel} \mathbf{W}_{ij} = \mathbf{X}_{ij} + \mathbf{e}_{ij}, \tag{5}\] where the \(\mathbf{e}_{ij}\) are independent of \(\mathbf{X}_{i}\) , \(\mathbf{Z}_{i}\) and \(\mathbf{Y}_{i}\). And \(\mathbf{e}_{ij}\) follows \(N(\mathbf{0}, \boldsymbol{\Sigma}_e)\) with the covariance matrix \(\boldsymbol{\Sigma}_e\). This model has been widely used in the context of handling measurement error problems. (Yi 2008) assumed that \(\boldsymbol{\Sigma}_e\) is known or can be estimated from replication experiments (e. g. , Carroll et al. 2006; Yi 2017).
The inverse probability weighted generalized estimating equations method is often employed to accommodate the missing data effects (e. g. , Robins et al. 1995; Preisser et al. 2002; Qu et al. 2011) when primary interest lies in the estimation of the marginal mean parameters \(\boldsymbol{\beta}\) in the model (1). For \({i}=1, \dots, n\), let \(M_i\) be the random dropout time for subject \({i}\) and \(m_i\) be a realization. Define \(L_{i}(\boldsymbol{\alpha}) = (1-\lambda_{im_i})\prod_{t=2}^{m_i-1}\lambda_{it}\), where \(\lambda_{it}\) is determined by model (4). Let \(\mathbf{S}_i(\boldsymbol{\alpha}) = \partial{log L_i(\boldsymbol{\alpha})}/\partial{\boldsymbol{\alpha}}\) be the vector of score functions contributed from subject \({i}\). Let \(\mathbf{D}_i = \partial{\boldsymbol{\mu}'_i}/\partial{\boldsymbol{\beta}}\) be the matrix of the derivatives of the mean vector \(\boldsymbol{\mu}_i=(\mu_{i1}, \dots, \mu_{im})'\) with respect to \(\boldsymbol{\beta}\) and let \(\boldsymbol{\Delta}_i = diag(I(O_{ij}=1)/\pi_{ij}, j=1, 2, \dots, m)\) be the weighted matrix accommodating missingness, where \(I(\cdot)\) is the indicator function. Let \(\mathbf{V}_{i} = \mathbf{A}_{i}^{1/2}\mathbf{C}_{i}\mathbf{A}_{i}^{1/2}\) be the conditional covariance matrix of \(\mathbf{Y}_{i}\), given \(\mathbf{X}_{i}\) and \(\mathbf{Z}_{i}\), where \(\mathbf{A}_{i} = diag(v_{ij}, j=1, 2, \dots, m)\) and \(\mathbf{C}_{i} = [\rho_{i;jk}]\) is the correlation matrix with diagonal elements equal 1 and \(\rho_{i;jk}\) being the conditional correlation coefficient of response components \(Y_{ij}\) and \(Y_{ik}\) for \(j \neq k\), given \(\mathbf{X}_{i}\) and \(\mathbf{Z}_{i}\). Define \[\mathbf{U}_i(\boldsymbol{\theta}) = \mathbf{D}_{i}\mathbf{V}_{i}^{-1}\boldsymbol{\Delta}_i(\mathbf{Y}_i - \boldsymbol{\mu}_i)\] and \[\label{wee} \mathbf{H}_{i}(\boldsymbol{\theta}) = (\mathbf{U}'_i(\boldsymbol{\theta}), \mathbf{S}'_i(\boldsymbol{\alpha}))'. \tag{6}\]
In the absence of measurement error, that is, covariates \(\mathbf{X}_{ij}\) are precisely observed, we have \(E[\mathbf{H}_{i}(\boldsymbol{\theta})] = \mathbf{0}\). Hence, \(\mathbf{H}(\boldsymbol{\theta}) = \sum_{i=1}^n\mathbf{H}_{i}(\boldsymbol{\theta})\) are unbiased estimation functions for \(\boldsymbol{\theta}\) (e. g. , Yi 2017 1). Under regularity conditions, the consistent estimator \(\widehat{\boldsymbol{\theta}}\) of \(\boldsymbol{\theta}\) can be obtained by solving \[\label{IPWGEE} \mathbf{H}(\boldsymbol{\theta}) = \mathbf{0}, \tag{7}\] where the weight matrix \(\boldsymbol{\Delta}_i\) is used to adjust for the contributions of subject \(i\) with his/her missingness probabilities incorporated. Specifically, the probability \(\pi_{ij}\) is determined by (3) in conjunction with (4). Correlation matrix \(\mathbf{C}_{i}\) can be replaced by the moment estimate, or alternatively, a working independence matrix \(\mathbf{A}_{i}\) may be used to replace \(\mathbf{V}_{i}\) (Liang and Zeger 1986). A detail discussion can be found in Yi (2017 4).
When measurement error is present in covariates \(\mathbf{X}_{ij}\), \(\mathbf{H}(\boldsymbol{\theta})\) is no longer unbiased if naively replacing \(\mathbf{X}_{ij}\) with its observed measurement \(\mathbf{W}_{ij}\). Yi (2008) developed a simulation-extrapolation (SIMEX) method to adjust for the bias induced by using \(\mathbf{W}_{ij}\), as well as the missingness effects in the response variables. This method originates from the SIMEX method by Cook and Stefanski (1994) who considered cross-sectional data with measurement error alone. The basic idea of the SIMEX method is to first add additional variability to the observed measurement \(\mathbf{W}_{ij}\), then establish the trend how different degrees of measurement error may induce bias in estimation of the model parameters, and finally extrapolate this trend to the case of no measurement error.
Now, we describe the SIMEX method developed by Yi (2008). Let \(B\) be a given positive integer and \(\boldsymbol{\Lambda} = \{\lambda_1, \lambda_2, \dots, \lambda_M\}\) be a sequence of nonnegative numbers taken from \([0, \lambda_M]\) with \(\lambda_1 = 0\).
The SIMEX approach is very appealing because of its simplicity of implementation and no requirement of modeling the true covariates \(\mathbf{X}_i\). However, to use this method, several aspects need to be considered. As suggested by Carroll et al. (2006), the specification of \(\Lambda\) is not unique; a typical choice of grid \(\Lambda\) is the equal cut points of interval \([0, 2]\) with \(M=5\) or 9. Choosing \(B=100\) or 200 is often sufficient for many applications. The quadratic regression function is commonly used for Step 3 to yield reasonable results. (e. g. , He et al. 2012).
Finally, we extend the method by (Yi 2008) to accommodating the case where the covariance matrix \(\boldsymbol{\Sigma}_e\) for model (5) is unknown but repeated surrogate measurements of \(\mathbf{X}_{ij}\) are available. Let \(\mathbf{W}_{ijk}\) denote the repeated surrogate measurements of \(\mathbf{X}_{ij}\) for \(i=1, \dots, n; j=1,\dots,m\); and \(k=1,\dots, K\). The surrogate measurements \(\mathbf{W}_{ijk}\) and the true covariate \(\mathbf{X}_{ij}\) are linked by the model \[\label{MEmodelSurr} \mathbf{W}_{ijk} = \mathbf{X}_{ij} + \mathbf{e}_{ijk}, \tag{8}\] where the \(\mathbf{e}_{ijk}\) are independent of \(\mathbf{X}_{i}\) , \(\mathbf{Z}_{i}\) and \(\mathbf{Y}_{i}\), and \(\mathbf{e}_{ijk}\) follows \(N(\mathbf{0}, \boldsymbol{\Sigma}_e)\) with the covariance matrix \(\boldsymbol{\Sigma}_e\). We now adapt the arguments of (Devanarayan and Stefanski 2002) to modify the simulation step of the preceding SIMEX method. For a given \(b\) and \(\lambda \in \boldsymbol{\Lambda}\), set \[\label{measurementSurro} \mathbf{W}_{ij}(b, \lambda) = \overline{\mathbf{W}}_{ij} + \sqrt{\lambda/K}\sum_{k=1}^K{c}_{ijk}(b)\mathbf{W}_{ijk}, \tag{9}\] where \(\overline{\mathbf{W}}_{ij}={K}^{-1}\sum_{k=1}^K\mathbf{W}_{ijk}\) and \(\mathbf{c}_{ij}(b)=(c_{ij1}(b), \dots, c_{ijk}(b))'\) is a normalized contrast satisfying \(\sum_{k=1}^K{c}_{ijk}=0\) and \(\sum_{k=1}^K{c}_{ijk}^2=1\).
A simple way to generate a contrast \(\mathbf{c}_{ij}(b)\) can be done by independently generating \(K\) variates, \(d_{ijk}(b)\), from \(N(0, 1)\) for \(k=1,\dots, K\) and a given \(b\). Let \(\overline{d}_{ij}(b) = {K}^{-1}\sum_{k=1}^K{d}_{ijk}(b)\). Then \({c}_{ijk}(b)\) is set as \[{c}_{ijk}(b, \lambda) = \frac{d_{ijk}(b) - \overline{d}_{ij}(b)}{\sqrt{\sum_{k=1}^K\{d_{ijk}(b) - \overline{d}_{ij}(b)\}^2}}.\] Once \(\mathbf{W}_{ij}(b, \lambda)\) of (9) is available, we repeat Steps 2 and 3 to obtain the SIMEX estimator and the associated standard error.
We implement the SIMEX procedure described in Section
3 in R and develop the package, called swgee. Our
package swgee takes the advantage of existing R packages geepack
(Hojsgaard et al. 2016) and mvtnorm
(Genz and Bretz 2009; Genz et al. 2018). Specifically, the function swgee
produces the estimates
for elements of the parameter vector \(\boldsymbol{\beta}\), which are of
primary interest, the associated standard errors, and P-values.
Our R function swgee
requires the input data set to be sorted by
subject \(i\) and visit time \(j\) for \({i}=1, \dots, n\) and
\({j}=1, \dots, m\). If a subject is missing at a certain time, the
corresponding measurements should be recorded as NAs. As long as the
user provides the missing data model (4), the function swgee
can
internally generate the missing data indicators \(O_{ij}\) for
\({i}=1, \dots, n\) and \({j}=1, \dots, m\), and then apply the
user specified model (4) to fit the data. The missingness probabilities
\(\pi_{ij}\) are calculated by (3) and then used to construct the weight
matrix \(\boldsymbol{\Delta}_i\) for the estimating equation (6). The
estimate of the missing data model (4) parameter \(\boldsymbol{\alpha}\)
can also be retrieved from the function swgee
output.
The form of calling function swgee
is given by
swgee(formula, data, id, family, corstr, missingmodel, SIMEXvariable,
repeated = FALSE, repind = NULL, B, lambda) SIMEX.err,
where the arguments are described as follows:
formula
: This argument specifies the model to be fitted, with the
variables coming with data. See the documentation of geeglm
and
its formula
for details.data
: This is the same as the data argument in the R function
geeglm
, which specifies the data frame showing how variables occur
in the formula, along with the id variable.id
: This is the vector which identifies the labels of subjects.
i.e., the id for subject \(i\) is \(i\), using the notation of Section
(1), where \(i=1, 2,\dots, n\). Data are arranged so
that observations for the same subject are listed in consecutive
rows in order of time, and consequently, the id for a subject would
repeat the same number of times as the observation times.family
: This argument describes the error distribution together
with the link function for model (1). See the documentation of
geeglm
and its argument family
for details.corstr
: This is a character string specifying the correlation
structure. See the documentation of geeglm
and its argument
corstr
for details.missingmodel
: This argument specifies the formula to be fitted for
the missing data model (4). See the documentation of geeglm
and
its formula
for details.SIMEXvariable
: This is the vector of characters containing the
names of the covariates which are subject to measurement error.SIMEX.err
: This argument specifies the covariance matrix of
measurement errors in the measurement error model (5).repeated
: This is the indicator whether measurement error model is
given by (5) or by (8). The default value FALSE corresponding to
model (5).repind
: This is the index of the repeated surrogate measurements
\(\mathbf{W}_{ijk}\) for each covariate \(\mathbf{X}_{ij}\). It has an R
list form. If repeated = TRUE, repind must be specified.B
: This argument sets the number of simulated samples for the
simulation step. The default is set to be 50.lambda
: This is the vector
\(\{\lambda_1, \lambda_2, \dots, \lambda_M\}\) we describe in Step 1
of Section 3.2. Its values need to be specified by the
user.To illustrate the usage of the developed R package swgee, we apply the package to a subset of GWA13 (Genetic Analysis Workshops) data arising from the Framingham Heart Study. The data set consists of measurements of 100 patients from a series of exams with 5 assessments for each individual. Measurements such as height, weight, age, systolic blood pressure (SBP) and cholesterol level (CHOL) are collected at each assessment, and \(14\%\) patients dropped out of the study. The original data were analyzed by (Yi 2008). It is of interest to study how an individual’s obesity may change with age (\(Z_{ij}\)) and how it is associated with SBP (\(X_{ij1}\)) and CHOL (\(X_{ij2}\)), where \(i=1, \dots, 100\), and \(j=1, \dots, 5\). The response \(Y_i\) is the indicator of obesity status of subject \(i\) as in (Yi 2008); SBP is rescaled as \(log(SBP-50)\) as in (Carroll et al. 2006); and CHOL is standardized. The response and the covariates are postulated by the logistic regression model: \[logit \mu_{ij} = \beta_0 + \beta_{x1}X_{ij1} + \beta_{x2}X_{ij2} + \beta_{z}Z_{ij},\]
where \(\beta_0\), \(\beta_{x1}\), \(\beta_{x2}\) and \(\beta_{z}\) are regression coefficients of interest. We assume that errors in both risk factors \(X_{ij1}\) and \(X_{ij2}\) can be represented by model (5). The missing data process is characterized by the logistic regression model: \[logit \lambda_{ij} = \alpha_1 + \alpha_{2}Y_{i,j-1} + \alpha_{3}X_{i,j-1,1} +\alpha_{4}X_{i,j-1,2} + \alpha_{5c}z_{i,j-1},\] for \(j=2, \dots, 5\).
We now apply the developed R package swgee, which can be downloaded from CRAN and then loaded in R:
> library("swgee") R
Next, load the data that are properly organized with the variable names specified. In the example here, the data set, named as bmidata, is included by issuing
> data("BMI")
R> bmidata <- BMI R
We are concerned how measurement error in SBP and CHOL impacts estimation of parameter \(\boldsymbol{\beta} = (\beta_0, \beta_{x1}, \beta_{x2}, \beta_z)'\). For illustrative purposes, we use setting with \(B=100\), \(\lambda_M=2\) and \(M=5\). In this example, we assume that parameters in \(\boldsymbol{\Sigma}_e=\left(\begin{array}{cc} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{array} \right)\) with \(\sigma_{12}=\sigma_{21}\) are known. This is a typical case when conducting sensitivity analysis. Here we set \(\sigma_{1} = \sigma_2=0.5\) and \(\sigma_{12}=\sigma_{21}=0\) as an example.
The naive GEE approach without considering missingness and measurement error effects in covariates gives the output:
> output1 <- gee(bbmi~sbp+chol+age, id=id, data=bmidata,
R+ family=binomial(link="logit"), corstr="independence")
> summary(output1)
R
: GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
GEE-function, version 4.13 modified 98/01/27 (1998)
gee S
:
Model: Logit
Link: Binomial
Variance to Mean Relation: Independent
Correlation Structure
:
Callgee(formula = bbmi ~ sbp + chol + age, id = id, data = bmidata,
family = binomial(link = "logit"), corstr = "independence")
:
Summary of Residuals1Q Median 3Q Max
Min -0.26533967 -0.11385369 -0.08572483 -0.06279540 0.95475735
:
Coefficients
Estimate Naive S.E. Naive z Robust S.E. Robust z-5.43746374 1.42090827 -3.8267521 1.64320527 -3.3090593
(Intercept) 0.59071183 0.30643396 1.9276970 0.24338420 2.4270755
sbp 0.11109496 0.13654324 0.8136247 0.23086218 0.4812177
chol 0.01297337 0.01339946 0.9682008 0.01814546 0.7149652
age
: 1.017131
Estimated Scale Parameter: 1
Number of Iterations
Working Correlation1] [,2] [,3] [,4] [,5]
[,1,] 1 0 0 0 0
[2,] 0 1 0 0 0
[3,] 0 0 1 0 0
[4,] 0 0 0 1 0
[5,] 0 0 0 0 1 [
To adjust for possible effects of missingness as well as measurement
error in variables SBP and CHOL, we call the developed function swgee
for the analysis:
> set.seed(1000)
R> sigma <- diag(rep(0.25, 2))
R> output2 <- swgee(bbmi~sbp+chol+age, data=bmidata, id=id,
R+ family=binomial(link="logit"), corstr="independence",
+ missingmodel=O~bbmi+sbp+chol+age, SIMEXvariable=c("sbp","chol"),
+ SIMEX.err=sigma, repeated=FALSE, B=100, lambda=seq(0, 2, 0.5))
> summary(output2)
: beta
Call
Estimate StdErr t.value p.value-8.004577 2.060967 -3.8839 0.0001028 ***
(Intercept) 1.196363 0.356868 3.3524 0.0008011 ***
sbp 0.099984 0.264180 0.3785 0.7050810
chol 0.012718 0.017201 0.7394 0.4596520
age ---
: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Signif. codes
: alpha
Call
Estimate StdErr t.value p.value9.019084 3.086533 2.9221 0.003477 **
alpha1 -0.786135 0.656843 -1.1968 0.231370
alpha2 -0.568740 0.732885 -0.7760 0.437732
alpha3 -0.128941 0.247757 -0.5204 0.602761
alpha4 -0.064257 0.025982 -2.4731 0.013395 *
alpha5 ---
: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 Signif. codes
The function swgee
can store individual estimated coefficients in the
simulation step, and this enables us to show the extrapolation curve
through the developed R function plot.swgee
. The plot.swgee
function
plots the extrapolation of the estimate of each covariate effect with
the quadratic extrapolants. Figure 1 displays the graph for the variable
SBP in the example for which the quadratic extrapolation function is
applied from the following command:
> plot(output2,"sbp") R
In this section, we conduct simulation studies to investigate the impact of ignoring covariate measurement error and response missingness on estimation, where the implementation is carried out using the usual GEE method. Furthermore, we assess the performance of the swgee method which accommodates the effects induces from error-prone covariates and missing responses. We set \(n=200\) and \(m=3\), and generate 500 simulations for each parameter configuration. Consider the logistic regression model \[logit(\mu_{ij})=\beta_0 + \beta_{x1}x_{ij1} + \beta_{x2}x_{ij2} + \beta_{z}z_{ij},\] where \(\beta_0 = 0\), \(\beta_{x1} = log(1.5)\), \(\beta_{x2} = log(1.5)\), \(\beta_{z} = log(0.75)\) and \(z_{ij}\) is generated independently from \(Bin(1, 0.5)\) to represent a balanced design. The true covariate \(\mathbf{X}_{ij} = (x_{ij1}, x_{ij2})'\) is generated from the normal distribution \(N(\boldsymbol{\mu}_x, \boldsymbol{\Sigma}_x)\), where \(\boldsymbol{\mu}_{x} = (0.5, 0.5)'\) and \(\boldsymbol{\Sigma}_x = \left(\begin{array}{cc} \sigma_{x1}^2 & \rho_{x}\sigma_{x1}\sigma_{x2} \\ \rho_{x}\sigma_{x1}\sigma_{x2} & \sigma_{x2}^2 \end{array} \right)\) with \(\sigma_{x1} = \sigma_{x2} = 1\). The surrogate value \(\mathbf{W}_{ij} = (W_{ij1}, W_{ij2})'\) is generated from \(N(\mathbf{X}_{ij}, \boldsymbol{\Sigma}_e)\) with \(\boldsymbol{\Sigma}_e = \left(\begin{array}{cc} \sigma_{1}^2 & \rho\sigma_{1}\sigma_{2} \\ \rho\sigma_{1}\sigma_{2} & \sigma_{2}^2 \end{array} \right)\). \(\rho\) and \(\rho_{x}\) are set to 0.50 to represent moderate correlations. To feature minor, moderate and severe degrees of measurement error, we consider \(\sigma_1, \sigma_2 =\) 0.25, 0.50 or 0.75. The missing data indicator is generated from model (4), where \(\alpha_{0} = \alpha_{1} = 0.5\), \(\alpha_{2} = \alpha_{3} = 0.1\), and \(\alpha_{z} = 0.2\). In implementing the swgee method, we choose \(B = 100\), \(\lambda_M=2\), \(M = 5\), and a quadratic regression for each extrapolation step.
In Table 1, we report on the results of the biases of the estimates (Bias), the empirical standard error (SE), and the coverage rate (CR in percent) for \(95\%\) confidence intervals. When measurement error is minor, (i.e. \(\sigma_1 = \sigma_2 = 0.25\)), both gee and swgee provide reasonable results with fairly small finite sample biases and coverage rates close to the nominal level \(95\%\). When there is moderate or substantial measurement error in covariates \(\mathbf{X}_{ij}\), the performance of the gee method deteriorates remarkably in estimation of error-prone covariate effects, leading to considerably biased estimates for \(\beta_{x1}\) and \(\beta_{x2}\). The corresponding coverage rates for \(95\%\) confidence intervals can be quite low. In contrast, the swgee method remarkably improve the performance, providing a lot smaller biases and much higher coverage rates. The estimates for \(\beta_z\) are not subject to much impact of measurement error, which is partially attributed by that the precisely observed covariates \(z_{ij}\) are generated independently of error-prone covairates \(\mathbf{X}_{ij}\) under the current simulation study.
In summary, ignoring measurement error may lead to substantially biased results. Properly addressing covariate measurement error in estimation procedures is necessary. The proposed swgee method performs reasonably well under various configurations. As expected, its performance may become less satisfactory when measurement error becomes substantial. However, the swgee method does significantly improve the performance of the gee analysis.
\(\sigma_1\) | \(\sigma_2\) | Method | \(\beta_{x1}\) | \(\beta_{x2}\) | \(\beta_{z}\) | |||||||||
Bias | SE | CR | Bias | SE | CR | Bias | SE | CR | ||||||
0.25 | 0.25 | gee | -0.0310 | 0.1228 | 92.6 | -0.0158 | 0.1246 | 92.6 | 0.0063 | 0.2121 | 94.6 | |||
0.25 | 0.25 | swgee | -0.0062 | 0.1420 | 95.0 | 0.0104 | 0.1425 | 95.2 | 0.0036 | 0.2354 | 95.6 | |||
0.25 | 0.50 | gee | -0.0019 | 0.1212 | 95.4 | -0.0997 | 0.1156 | 83.4 | 0.0082 | 0.2110 | 94.2 | |||
0.25 | 0.50 | swgee | -0.0003 | 0.1415 | 95.0 | -0.0087 | 0.1543 | 93.0 | 0.0035 | 0.2361 | 95.6 | |||
0.25 | 0.75 | gee | 0.0328 | 0.1189 | 95.4 | -0.1841 | 0.1022 | 51.0 | 0.0101 | 0.2100 | 94.0 | |||
0.25 | 0.75 | swgee | 0.0205 | 0.1407 | 95.8 | -0.0660 | 0.1562 | 86.4 | 0.0046 | 0.2359 | 95.6 | |||
0.50 | 0.25 | gee | -0.1156 | 0.1114 | 78.2 | 0.0139 | 0.1236 | 94.2 | 0.0078 | 0.2113 | 94.6 | |||
0.50 | 0.25 | swgee | -0.0282 | 0.1520 | 93.2 | 0.0177 | 0.1431 | 95.4 | 0.0031 | 0.2362 | 95.2 | |||
0.50 | 0.50 | gee | -0.0948 | 0.1114 | 81.8 | -0.0780 | 0.1161 | 85.6 | 0.0102 | 0.2099 | 94.2 | |||
0.50 | 0.50 | swgee | -0.0228 | 0.1510 | 93.8 | -0.0022 | 0.1542 | 93.6 | 0.0030 | 0.2370 | 95.4 | |||
0.50 | 0.75 | gee | -0.0629 | 0.1103 | 87.8 | -0.1727 | 0.1036 | 55.6 | 0.0125 | 0.2088 | 94.2 | |||
0.50 | 0.75 | swgee | -0.0052 | 0.1499 | 94.8 | -0.0608 | 0.1570 | 87.2 | 0.0042 | 0.2369 | 95.2 | |||
0.75 | 0.25 | gee | -0.1991 | 0.0966 | 45.6 | 0.0484 | 0.1216 | 94.2 | 0.0092 | 0.2107 | 94.6 | |||
0.75 | 0.25 | swgee | -0.0870 | 0.1508 | 86.4 | 0.0395 | 0.1430 | 93.6 | 0.0034 | 0.2366 | 95.2 | |||
0.75 | 0.50 | gee | -0.1889 | 0.0976 | 50.0 | -0.0458 | 0.1154 | 89.8 | 0.0121 | 0.2091 | 94.0 | |||
0.75 | 0.50 | swgee | -0.0831 | 0.1509 | 87.8 | 0.0165 | 0.1539 | 94.0 | 0.0034 | 0.2375 | 95.4 | |||
0.75 | 0.75 | gee | -0.1636 | 0.0974 | 58.8 | -0.1468 | 0.1039 | 66.4 | 0.0147 | 0.2077 | 94.2 | |||
0.75 | 0.75 | swgee | -0.0678 | 0.1505 | 90.0 | -0.0442 | 0.1574 | 88.8 | 0.0046 | 0.2374 | 95.2 |
Missing observations and covariate measurement error commonly arise in longitudinal data. However, there has been relatively little work on simultaneously accounting for the effects of response missingness and covariate measurement error on estimation of response model parameters for longitudinal data. Yi (2008) described a simulation based marginal method to adjust for the biases induced by both missingness and covariate measurement error. The proposed method does not require the full specification of the distribution of the response vector but only requires modeling its mean and covariance structure. In addition, the distribution of covariates is left unspecified, which is desirable for many practical problems. These features make the proposed method flexible.
Here we not only develop the R package swgee to implement the method by Yi (2008), but also include an extended setting in the package. Our aim is to provide analysts an accessible tool for the analysis of longitudinal data with missing responses and error-prone covariates. Our illustrations show that the developed package has the advantages of simplicity and versatility.
Juan Xiong was supported by the Natural Science Foundation of SZU (grant no.2017094). Grace Y. Yi was supported by the Natural Sciences and Engineering Research Council of Canada. The authors thanks Boston University and the National Heart, Lung, and Blood Institute (NHLBI) for providing the data set from the Framingham Heart Study (No. N01-HC-25195) in the illustration. The Framingham Heart Study is conducted and supported by the NHLBI in collaboration with Boston University. This manuscript was not prepared in collaboration with investigators of the Framingham Heart Study and does not necessarily reflect the opinions or views of the Framingham Heart Study, Boston University, or NHLBI.
Conflict of Interest: None declared.
gee, yags, wgeesel, geepack, mvtnorm
Distributions, Econometrics, Finance, MixedModels
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Xiong & Yi, "swgee: An R Package for Analyzing Longitudinal Data with Response Missingness and Covariate Measurement Error", The R Journal, 2019
BibTeX citation
@article{RJ-2019-031, author = {Xiong, Juan and Yi, Grace Y.}, title = {swgee: An R Package for Analyzing Longitudinal Data with Response Missingness and Covariate Measurement Error}, journal = {The R Journal}, year = {2019}, note = {https://rjournal.github.io/}, volume = {11}, issue = {1}, issn = {2073-4859}, pages = {416-426} }