mgee2: An R package for marginal analysis of longitudinal ordinal data with misclassified responses and covariates

Yuliang Xu; Shuo Shuo Liu; Grace Y. Yi

1 Introduction

Analysis of longitudinal ordinal data is a common research topic in health science and survey sampling. Typically, Liang and Zeger (1986) introduced the generalized estimating equations (GEE) method that gave consistent estimation with mild assumptions of the joint distribution of the repeated measurements. This method has been used widely in analyzing longitudinal binary and categorical data. The validity of the GEE method hinges on the critical condition that data are precisely observed, which is commonly infeasible and violated in practice (Yi 2017). Extensive discussions about covariate error (Carroll et al. 2006) and response with binary misclassification (Neuhaus 1999; Chen et al. 2011; Yi 2017) have been conducted in the literature. For example, Neuhaus (1999) investigated the bias due to errors in the response. Yi (2008) proposed a simulation–extrapolation (SIMEX) method to handle both dropout and covariate measurement error problems in longitudinal studies. Furthermore, in Yi (2017 Ch5), the impact of covariate measurement error on longitudinal data analysis was investigated, and methods of addressing covariate measurement error effects were described.

To accommodate effects induced from error-prone correlated ordinal responses and ordinal covariates simultaneously, (Chen et al. 2014) proposed GEE-based methods for the estimation of both mean and association parameters. The proposed methods are based on formulating unbiased second-order estimating functions and solving the resulting equations using the Newton-Raphson algorithm. The asymptotic distributions for the proposed estimators are established. While the methods of (Chen et al. 2014) correct for error effects due to misclassified variables, the methods cannot be used by the analysts without programming the implementation procedures. To expedite the use of the methods for problems in applications, in this paper, we develop an R package, called mgee2, to implement the methods of (Chen et al. 2014).

Our work offers an R package complement to available R packages for analyzing longitudinal data with misclassified observations. It is relevant to but differs from available R packages about measurement error. For example, the package SAMBA, developed by (Beesley and Mukherjee 2020), provides resources for fitting logistic regression with misclassified binary outcomes. The R package misclassGLM implements inferential procedures for generalized linear models with misclassified covariates proposed by (Dlugosz et al. 2017); (Zhang and Yi 2019) developed the package augSIMEX to implement the method proposed by (Yi et al. 2015) for fitting generalized linear models with mixed continuous and discrete covariates subject to mismeasurement.

When the degree of measurement error is very severe, the observed surrogate measurements are virtually useless, and hence the corresponding variables may be alternatively treated as subject to missingness. Regarding the analysis of longitudinal data with missing observations, packages kml and kml3d, developed by (Genolini et al. 2015), describe the implementation procedures of \(\textit{k}\)-means for longitudinal clustered data with missing observations. Carey (2015) developed the package gee to solve generalized estimation equations with longitudinal data missing completely at random. (Xu et al. 2018) developed the package wgeesel for using weighted generalized estimating equations approaches to analyze longitudinal clustered data with data missing at random. (Xiong and Yi 2019) developed the package swgee for analyzing longitudinal data with missingness in the response and measurement error in covariates. Our package mgee2 differs from those packages in its ability to simultaneously handle the features of misclassification in correlated ordinal responses and ordinal covariates, which to our best knowledge, is the first software package to address this problem.

The article is organized as follows. Section 2 introduces the notations and estimation procedures proposed by (Chen et al. 2014). Section 3 describes the usage of the package mgee2. Section 4 illustrates the package by simulation studies and a real dataset. We finally conclude the article in Section 6.

2 Model setup

We first review the notation and formulations of (Chen et al. 2014). For \(i=1,\dots, n\) and \(j=1, \dots, m_i\), let \(\texttt{Y}_{i j}\) denote an error-prone ordinal response variable for subject \(i\) at visit \(j\). Suppose that the response variable \(\texttt{Y}_{i j}\) has \((K+1)\) categories, denoted \(0, 1, \dots, K\), and that an error-prone ordinal covariate \(\mathbf X_{ij}\) has \((H + 1)\) categories, denoted \(0, 1, \dots, H\). Let \(\mathbf X_{ij}=\left(X_{ij1},\dots,X_{ijH}\right)^T\) be the misclassification-prone vector of binary variables such that \(X_{ijq}=I\)(the covariate \(\mathbf X_{ij}\) in category \(q\)) for \(q=0, 1, \dots, H\), and let \(\mathbf Z_{ij}\) be the vector of covariates that are free of measurement error, where \(I(\cdot)\) is the indicator function. Furthermore, we define \(\mathbf X_i=\left(\mathbf X_{i1}^T,\dots,\mathbf X_{im_i}^T\right)^T\) and \(\mathbf Z_i=\left(\mathbf Z_{i1}^T,\dots,\mathbf Z_{im_i}^T\right)^T\).

Response process

Let \[\label{eq:lambda_def} \lambda_{ijk}=P\left(\texttt{Y}_{i j} \geq k | \mathbf X_i, \mathbf Z_i\right) \tag{1}\] be the univariate cumulative probability with \(k=1,\dots,K\), and adopt the assumption \(P\left(\texttt{Y}_{i j} \geq k | \mathbf X_i, \mathbf Z_i\right)=P\left(\texttt{Y}_{i j} \geq k | \mathbf X_{ij}, \mathbf Z_{ij}\right)\) (Pepe and Anderson 1994). Consider the proportional odds models \[\operatorname{logit}\, \lambda_{i j k}=\mathbf\beta_{0 k}+\mathbf X_{i j}^T \mathbf\beta_x+\mathbf Z_{i j}^T \mathbf\beta_z,\] where \(\mathbf\beta_{0 k}\), \(\mathbf\beta_x\), and \(\mathbf\beta_z\) are regression parameters, \(k=1,\dots,K\), \(j=1,\dots,m_i\), and \(i=1,\dots,n\). Similar to Williamson et al. (1995), we measure the association between a pair of responses for the same subject at two different visits by the global odds ratio \[\label{eq:psi_def} \psi_{i, j k, j^{\prime} k^{\prime}}=\frac{P\left(\texttt{Y}_{i j} \geq k, \texttt{Y}_{i j^{\prime}} \geq k^{\prime} | \mathbf X_i, \mathbf Z_i\right) \times P\left(\texttt{Y}_{i j}<k, \texttt{Y}_{i j^{\prime}}<k^{\prime} | \mathbf X_i, \mathbf Z_i\right)}{P\left(\texttt{Y}_{i j} \geq k, \texttt{Y}_{i j^{\prime}}<k^{\prime} | \mathbf X_i, \mathbf Z_i\right) \times P\left(\texttt{Y}_{i j}<k, \texttt{Y}_{i j} \geq k^{\prime} | \mathbf X_i, \mathbf Z_i\right)}, \tag{2}\] where \(k, k^{\prime}=1, \dots, K\), and \(j \neq j^{\prime}\). To characterize the dependence of the global odds ratios on covariates, the log-linear models can be expressed as \[\log \, \psi_{i, j k, j^{\prime} k^{\prime}}=\phi+\phi_k+\phi_{k^{\prime}}+\phi_{k k^{\prime}}+\mathbf{u}_{i j j^{\prime}}^T \mathbf \alpha_1,\] where \(\phi\) is the global intercept, \(\phi_k\) and \(\phi_{k^{\prime}}\) correspond to the effect of category \(k\) and of category \(k^{\prime}\), respectively, \(\phi_{k k^{\prime}}\) is the interaction effect between categories \(k\) and \(k^{\prime}\) with \(\phi_{k k^{\prime}}=\phi_{k^{\prime}k}\), and \(\mathbf\alpha_1\) is a vector of parameters corresponding to pair-specific covariates, denoted \(\mathbf{u}_{i j j^{\prime}}\). The constraint \(\phi_1=\phi_{1k}=\phi_{k1}=0\) is set for the model identification for \(k=1,\dots,K\) (Williamson et al. 1995).

Let \(\mathbf\beta=\left(\mathbf\beta_{0k}^T, \mathbf\beta_{x}^T, \mathbf\beta_{z}^T\right)^T\), \(\mathbf\alpha=\left(\phi,\phi_k,\phi_{k k^{\prime}},\mathbf\alpha_1^T\right)^T\), and \(\mathbf\theta=\left(\mathbf\beta^T,\mathbf\alpha^T\right)^T\). For \(k=1,\dots,K\), let \(\textbf{\texttt{Y}}_{ij}=\left(\texttt{Y}_{ij1},\dots,\texttt{Y}_{ijk}\right)^T\) with \(\texttt{Y}_{ijk}=I\left(\texttt{Y}_{i j}=k\right)\). Define \(\textbf{\texttt{Y}}_i=\left(\textbf{\texttt{Y}}_{i1}^T,\dots,\textbf{\texttt{Y}}_{i m_i}^T\right)^T\). For \(j< j^{\prime}\), let \(C_{i,jk,j^{\prime}k^{\prime}}=\texttt{Y}_{ijk}\texttt{Y}_{ij^{\prime}k^{\prime}}\), \(\mathbf C_{ijj^{\prime}}=\left(C_{i,j1,j^{\prime}1},C_{i,j1,j^{\prime}2},\dots,C_{i,jK,j^{\prime}K^{\prime}}\right)^T\), and \(\mathbf C_i=\left(\mathbf C_{ijj^{\prime}}^T,j<j^{\prime}\right)^T\). The univariate and bivariate marginals, \(\mathbf\mu_i=E\left(\textbf{\texttt{Y}}_i|\mathbf X_i,\mathbf Z_i\right)\) and \(\mathbf \xi_i=E\left(\mathbf C_i|\mathbf X_i,\mathbf Z_i\right)\), can be expressed in terms of the global odds ratios and univariate and bivariate cumulative probabilities; the detailed expressions are given by (Chen et al. 2014).

As a result, the estimating functions for the mean and association parameters \(\mathbf\beta\) and \(\mathbf\alpha\) are given by \[\label{u1} \mathbf U_{1 i}\left(\mathbf \theta; \textbf{\texttt{Y}}_i, \mathbf X_i, \mathbf Z_i\right)=\mathbf D_{1 i} \mathbf V_{1 i}^{-1}\left(\textbf{\texttt{Y}}_i-\mathbf \mu_i\right) \tag{3}\] and \[\label{u2} \mathbf U_{2 i}\left(\mathbf \theta; \textbf{\texttt{Y}}_i, \mathbf X_i, \mathbf Z_i\right)=\mathbf D_{2 i} \mathbf V_{2 i}^{-1}\left(\mathbf C_i-\mathbf \xi_i\right), \tag{4}\] respectively, where \(\mathbf D_{1 i}=\partial\mathbf\mu_i^T/\partial\mathbf\beta\), \(\mathbf D_{2 i}=\partial\mathbf\xi_i^T/\partial\mathbf\alpha\), and \(\mathbf V_{1i}\) and \(\mathbf V_{2i}\) are the conditional covariance matrices for \(\textbf{\texttt{Y}}_i\) and \(\mathbf C_i\), given \(\mathbf X_i\) and \(\mathbf Z_i\).

Case 1: Estimation with known misclassification probabilities

If the true measurements of the responses and covariates are available, ((3)) and ((4)) can be used for estimation of \(\mathbf\beta\) and \(\mathbf\alpha\). However, in applications, the \(\textbf{\texttt{Y}}_{i j}\) and the \(\mathbf X_{ij}\) may be subject to misclassification. Let \(\mathbf S_{ij}\) and \(\mathbf W_{ij}\) be surrogate measurements of \(\textbf{\texttt{Y}}_{ij}\) and \(\mathbf X_{ij}\), respectively. Let \(\tau_{ijkl}=P\left(\mathbf S_{ij}=l|\textbf{\texttt{Y}}_{i j}=k, \mathbf Z_i\right)\) be the conditional probability concerning the response for subject \(i\) at visit \(j\) where \(k,l=0,\dots,K\). Let \(\pi_{ijqr}=P\left(\mathbf W_{ij}=r|\mathbf X_{ij}=q, \mathbf Z_i\right)\) be the conditional probability concerning the covariate for subject \(i\) at visit \(j\) where \(q,r=0,\dots,H\). Consider the generalized logistic models by setting category 0 as the reference: \[\log\left(\tau_{i j k l} / \tau_{i j k 0}\right)=\mathbf L_{i j}^T \mathbf\gamma_{k l} \quad \text{for} \quad l=1, \dots, K; k=0, \dots, K\] and \[\log\left(\pi_{i j q r} / \pi_{i j q 0}\right)=\mathbf L_{i j}^{x T} \mathbf\varphi_{q r}\quad\text{for} \quad r=1, \dots, H; q=0, \dots, H,\] where \(\mathbf L_{ij}\) and \(\mathbf L_{ij}^x\) are vectors of variables related to response and covariate misclassification processes, respectively, and \(\mathbf\gamma_{k l}\) and \(\mathbf\varphi_{q r}\) are vectors of regression parameters.

Let \(\mathbf\gamma_{k}=\left(\mathbf\gamma_{k 1}^T, \dots, \mathbf\gamma_{k K}^T\right)^T\) and \(\mathbf\gamma=\left(\mathbf\gamma_{0}^T, \dots, \mathbf\gamma_{K}^T \right)^T\). Let \(\mathbf\varphi_{q}=\left(\mathbf\varphi_{q 1}^T, \dots, \mathbf\varphi_{q H}^T \right)^T\) and \(\mathbf\varphi=\left(\mathbf\varphi_{0}^T, \dots, \mathbf\varphi_{H}^T\right)^T\). Let \(\mathbf\eta=\left(\mathbf\gamma^T, \mathbf\varphi^T\right)^T\). Define the \(K\times K\) matrix \(\mathbf R_{i j}=\left(\mathbf \tau_{i j 1}-\mathbf\tau_{i j 0}, \dots, \mathbf\tau_{i j K}-\mathbf\tau_{i j 0}\right)\) and the \(H\times H\) matrix \(\mathbf G_{i j}=\left(\mathbf \pi_{i j 1}-\mathbf\pi_{i j 0}, \dots, \mathbf\pi_{i j K}-\mathbf\pi_{i j 0}\right)\), where \(\mathbf\tau_{i j k}=\left(\tau_{i j k 1}, \dots, \tau_{i j k K}\right)^T\) and \(\mathbf\pi_{i j k}=\left(\pi_{i j k 1}, \dots, \pi_{i j k K}\right)^T\). Then the unbiased surrogates for \(\textbf{\texttt{Y}}_{ij}\) and \(\mathbf X_{ij}\) are constructed, respectively, by \[\textbf{\texttt{Y}}_{i j}^{*}=\mathbf R_{i j}^{-1}\left(\mathbf S_{i j}-\mathbf\tau_{i j 0}\right)\] and \[\mathbf X_{i j}^{*}=\mathbf G_{i j}^{-1}\left(\mathbf W_{i j}-\mathbf\pi_{i j 0}\right),\] where we write \(\textbf{\texttt{Y}}_{i j}^{*}=\left(Y_{i j 1}^*, \dots, Y_{i j K}^{*}\right)^T\), \(\mathbf X_{i j}^{*}=\left(X_{i j 1}^*, \dots, X_{i j K}^{*}\right)^T\), and let \(\textbf{\texttt{Y}}_{i}^{*}=\left(\textbf{\texttt{Y}}_{i 1}^T, \dots, \textbf{\texttt{Y}}_{i m_i}^{* T}\right)^T\). Let \(\mathbf e_q\) be an \(H\)-dimensional vector whose \(r\)th element is an indicator \(I(r=q)\) for \(q=1,\dots,H\) and let \(\mathbf e_0=0\).

If \(\mathbf\eta\) is known, then \[\mathbf U_{1 i}^*(\mathbf\theta)=\sum_{q_{m_{i}}=0}^{H} \dots \sum_{q_{1}=0}^{H}\left[\mathbf U_{1 i}\left\{\theta ; \textbf{\texttt{Y}}_i^*,\left(\mathbf e_{q_{1}}^T, \dots, \mathbf e_{q_{i_{i}}}^T\right)^T, \mathbf Z_i\right\} \prod_{j=1}^{m_{i}} X_{i j q_{j}}^{*}\right]\] and \[\mathbf U_{2 i}^*(\mathbf\theta)=\sum_{q_{m_{i}}=0}^{H} \dots \sum_{q_{1}=0}^{H}\left[\mathbf U_{2 i}\left\{\theta ; \textbf{\texttt{Y}}_i^*,\left(\mathbf e_{q_{1}}^T, \dots, \mathbf e_{q_{i_{i}}}^T\right)^T, \mathbf Z_i\right\} \prod_{j=1}^{m_{i}} X_{i j q_{j}}^{*}\right]\] are unbiased estimating functions of \(\mathbf\theta\), as shown in Appendix 2 of Chen et al. (2014). Estimation of \(\mathbf\theta\) can then be obtained by solving \[\label{u12} \sum_{i=1}^{n}\left\{\begin{array}{c}{\mathbf U_{1 \mathrm{i}}^*(\mathbf\theta)} \\ {\mathbf U_{2 i}^{*}(\mathbf\theta)}\end{array}\right\}=\mathbf 0 \tag{5}\] for \(\mathbf\theta\).

Case 2: Estimation with validation data

Case 1 highlights the estimation of \(\mathbf\theta\) when the parameter \(\mathbf\eta\) for the misclassification models is known or specified as a given value. In applications, \(\mathbf\eta\) is unknown and may be estimated from a validation subsample. In this case, we modify the estimation procedure based on ((5)) and describe a two-stage estimation procedure. First, let \(\delta_{ij}=I\)(subject \(i\) at visit \(j\) is included in the validation subsample). Using validation data (i.e., \(\delta_{ij}=1\)), we may estimate \(\tau_{ij}\) and \(\pi_{ij}\).

Define \(\mathbf D_{\mathbf\gamma i j}=\partial \mathbf\tau_{i j}^T/ \partial \mathbf\gamma\) and \(\mathbf D_{\mathbf\varphi i j}=\partial \mathbf\pi_{i j}^T/ \partial \mathbf\varphi\), then estimating functions for \(\mathbf\gamma\) and \(\mathbf\varphi\) are given by \(\mathbf Q_{\gamma i}(\mathbf\gamma)=\sum_{j=1}^{m_{i}} \mathbf D_{\gamma i j} \mathbf V_{\gamma i j}^{-1}\left\{\mathbf S_{i j}-\mathbf\tau_{i j}\right\} \delta_{i j}\) and \(\mathbf Q_{\varphi i}(\mathbf\varphi)=\sum_{j=1}^{m_{i}} \mathbf D_{\varphi i j} \mathbf V_{\varphi i j}^{-1}\left\{\mathbf W_{i j}-\mathbf\pi_{i j}\right\} \delta_{i j}\), where \(\mathbf V_{\mathbf\gamma i j}\) and \(\mathbf V_{\mathbf\varphi i j}\) are, respectively, the conditional covariance matrix for \(\mathbf S_{ij}\) and \(\mathbf W_{ij}\), given \(\textbf{\texttt{Y}}_{ij}\) and the true covariates.

Let \(\tilde{\texttt{Y}}_{i j k}=\texttt{Y}_{ijk}\) if \(\delta_{ij}=1\) and \(\tilde{\texttt{Y}}_{i j k}=\texttt{Y}_{ijk}^*\) otherwise, then we write \(\tilde{\textbf{\texttt{Y}}}_{i j}=\left(\tilde{\texttt{Y}}_{i j 1}, \dots, \tilde{\texttt{Y}}_{i j K}\right)^T\). Let \(\tilde{X}_{i j q}=X_{ijq}\) if \(\delta_{ij}=1\) and \(\tilde{X}_{i j q}=X_{ijq}^*\) otherwise. Then the augmented estimating functions of \(\mathbf\theta\) are given by \[\label{au1} \tilde{\mathbf U}_{1 i}(\mathbf\theta, \mathbf\eta)=\sum_{q_{m_{i}}=0}^{H} \dots \sum_{q_{1}=0}^{H}\left[\mathbf U_{1 i}\left\{\theta ; \tilde{\textbf{\texttt{Y}}}_{i},\left(\mathbf e_{q_{1}}^T, \dots, \mathbf{e}_{q_{m}}^T\right)^T, \mathbf Z_{i}\right\} \prod_{j=1}^{m_{i}} \tilde{X}_{i j q_{j}}\right] \tag{6}\] and \[\label{au2} \tilde{\mathbf U}_{2 i}(\mathbf\theta, \mathbf\eta)=\sum_{q_{m_{i}}=0}^{H} \dots \sum_{q_{1}=0}^{H}\left[\mathbf U_{2 i}\left\{\theta ; \tilde{\textbf{\texttt{Y}}}_{i},\left(\mathbf e_{q_{1}}^T, \dots, \mathbf{e}_{q_{m}}^T\right)^T, \mathbf Z_{i}\right\} \prod_{j=1}^{m_{i}} \tilde{X}_{i j q_{j}}\right]. \tag{7}\] Consequently, estimation of \(\mathbf\eta\) and \(\mathbf\theta\) can be carried out by the two-stage procedure.

\(\bf{Stage}\) \(\bf 1.\) Solve \(\sum_{i=1}^{n}\left\{\begin{array}{c}{\mathbf Q_{\mathbf\gamma i}(\mathbf\gamma)} \\ {\mathbf Q_{\varphi i}(\mathbf\varphi)}\end{array}\right\}=\mathbf 0\) for \(\mathbf\gamma\) and \(\mathbf\varphi\) and write \(\hat{\mathbf\eta}=\left(\hat{\mathbf\gamma}^T, \hat{\mathbf\varphi}^T\right)^T\), where \(\hat{\mathbf\gamma}\) and \(\hat{\mathbf\varphi}\) are the estimators for \(\mathbf\gamma\) and \(\mathbf\varphi\), respectively.

\(\bf{Stage}\) \(\bf 2.\) Substitute \(\mathbf\eta\) with \(\hat{\mathbf\eta}\) in ((6)) and ((7)) and solve \(\sum_{i=1}^{n}\left\{\begin{array}{l}{\tilde{\mathbf U}_{1 i}(\mathbf\theta, \hat{\mathbf\eta})} \\ {\tilde{\mathbf U}_{2 i}(\mathbf\theta, \hat{\mathbf\eta})}\end{array}\right\}=\mathbf 0\) for \(\mathbf\theta\). Let \(\hat{\mathbf\theta}=\left(\hat{\mathbf\beta}^T, \hat{\mathbf\alpha}^T\right)^T\) denote the resulting estimator \(\mathbf\theta\).

(Chen et al. 2014) established the asymptotic distribution of \(\hat{\mathbf\theta}\). Let \(\tilde{\mathbf U}_i\left(\mathbf \theta, \mathbf \eta\right)=\left\{\tilde{\mathbf U}_{1 i}^T(\mathbf \theta, \mathbf \eta), \tilde{\mathbf U}_{2 i}^T(\mathbf \theta, \mathbf\eta)\right\}^T\), \(\mathbf Q_i(\mathbf\eta)=\left\{\mathbf Q_{\gamma i}^T(\mathbf\gamma), \mathbf Q_{\varphi i}^T(\mathbf\varphi)\right\}^T\), \(\mathbf\Omega_i(\mathbf\theta, \mathbf\eta)=\tilde{\mathbf U}_i(\mathbf\theta, \mathbf\eta)-E\left\{\partial \tilde{U}_i(\mathbf\theta, \mathbf\eta) / \partial \mathbf\eta^T\right\} \cdot\left[E\left\{\partial \mathbf Q_i(\mathbf\eta) / \partial \mathbf\eta^T\right\}\right]^{-1}\), and \(\tilde{\mathbf\Gamma}(\mathbf\theta, \mathbf\eta)=E\left\{\partial \tilde{\mathbf U}_i(\mathbf\theta, \mathbf\eta) / \partial \mathbf\theta^T \right\}\). Then, under regularity conditions, \(n^{1 / 2}(\hat{\mathbf \theta}-\mathbf\theta)\) is asymptotically normally distributed with mean \(\mathbf 0\) and covariance matrix \(\tilde{\mathbf\Gamma}^{-1} \tilde{\mathbf\Sigma}\left(\tilde{\mathbf\Gamma}^{-1}\right)^T\), where \(\tilde{\mathbf\Sigma}=E\left\{\mathbf\Omega_i(\mathbf\theta, \mathbf\eta) \mathbf\Omega_i^T(\mathbf\theta, \mathbf\eta)\right\}\).

3 Package details

We develop an R package, called mgee2, to implement the misclassification adjustment method described in the preceding section. This package requires support from the external packages MASS (Venables and Ripley 2002), Matrix (Bates and Maechler 2019), and ggplot2 (Wickham 2016). Our mgee2 package mainly contains two functions, mgee2k and mgee2v, respectively, implementing cases 1 and 2 described in the previous section. Specifically, mgee2k implements the method where the misclassification parameters are given, and mgee2v implements the misclassification method for the case where validation data are available to estimate misclassification probabilities. We now describe the details of these two functions.

`mgee2k`

mgee2k implements the misclassification adjustment method outlined in Case 1 of the previous section, where the misclassification parameters are known. In this case, validation data are not required, and only the observed data of the outcome and covariates are needed for the implementation.

The function mgee2k requires the data set to be grouped by the individual id, \(i=1,...,n\), and each individual has \(m_i\) rows of data each corresponding to the visit time \(j=1,...,m_i\). The column name of the individual id is indicated by the argument id. The misclassification matrices for the response and covariate variables are recorded by the arguments gamMat and varphiMat, respectively, which need to be specified by the user.

To call mgee2k, we issue the following command,

    mgee2k(formula, id, data, corstr="exchangeable", misvariable,
                  gamMat, varphiMat, maxit=50, tol=1e-3)

where the meaning of each argument is described as follows:

formula: a formula object which specifies the relationship between the response and covariates for the observed data.
id: a character object which records individual id in the data.
data: a dataframe or matrix object for the observed data set.
corstr: a character object. The default value is "exchangeable", corresponding to the structure where the association between two paired responses is considered to be a constant. The other option is "log-linear" which indicates the log-linear association between two paired responses.
misvariable: a character object which names the error-prone covariate W.
maxit: an integer which specifies the maximum number of iterations. The default is 50.
tol: a numeric object which indicates the tolerance threshold. The default is 1e-3.
gamMat: a matrix object which records the misclassification parameter \(\gamma\) for response \(\texttt{Y}\).
varphiMat: a matrix object which records the misclassification parameter \(\phi\) for covariate X.

The function mgee2k returns a list of components:

beta: the coefficients in the same order as that specified in the formula for the response and covariates.
alpha: the coefficients for paired responses global odds ratios. The number of \(\alpha\) coefficients corresponds to the paired responses odds ratio structure selected in corstr. When corstr="exchangeable", only one baseline \(\alpha\) is fitted.
variance: the variance-covariance matrix of the estimators of all parameters.
convergence: a logical variable; TRUE if the model converges.
iteration: the number of iterations for the estimates of the model parameters to converge.
call: an unevaluated function call which consists of the named function applied to the given arguments.

`mgee2v`

The function mgee2v does not require the misclassification parameters to be known, but requires the availability of validation data.

Similar to mgee2k, the function mgee2v needs the data set to be structured by individual id, \(i=1,...,n\), and visit time, \(j=1,...,m_i\). The data set should contain the observed response and covariates, \(S\) and \(W\). To indicate whether or not a subject is in the validation set, an indicator variable delta should be added in the data set, and we use a column named valid.sample.ind for this purpose. The column name of the error-prone covariate \(W\) should also be specified in misvariable. To call mgee2v, we issue the command,

    mgee2v(formula, id, data, corstr="exchangeable", misvariable, valid.sample.ind,
                   y.mcformula, x.mcformula, maxit=50, tol=1e-3)

where the arguments are described as follows:

formula: a formula object which specifies the relationship between the response and covariates for the observed data.
id: a character object which records individual id in the data.
data: a dataframe or matrix object for the observed data set.
corstr: a character object. The default value is "exchangeable", corresponding to the structure where the association between two paired responses is considered to be a constant. The other option is "log-linear" which indicates the log-linear association between two paired responses.
misvariable: a character object which names the error-prone covariate W.
valid.sample.ind: a string object which names the indicator variable delta. When a data point belongs to the validation set, delta = 1; otherwise 0.
y.mcformula: a string object which indicates the misclassification formula between true response \(\texttt{Y}\) and the surrogate response S.
x.mcformula: a string object which indicates the misclassification formula between true error-prone covariate X and the surrogate W.
maxit: an integer which specifies the maximum number of iterations. The default is 50.
tol: a numeric object which indicates the tolerance threshold. The default is 1e-3.

The function mgee2v returns a list of components:

beta: the coefficients in the same order as that specified in the formula for the response and covariates.
alpha: the coefficients for paired responses global odds ratios. The number of \(\alpha\) coefficients corresponds to the paired responses odds ratio structure selected in corstr. When corstr="exchangeable", only one baseline \(\alpha\) is fitted.
variance: the variance-covariance matrix of the estimators of all parameters.
convergence: a logical variable; TRUE if the model converges.
iteration: the number of iterations for the estimates of the model parameters to converge.
call: an unevaluated function call which consists of the named function applied to the given arguments.

`ordGEE2`

In addition to developing the package mgee2 to implement the methods of (Chen et al. 2014), which accommodate misclassification effects in inferential procedures, we also implement the naive method of ignoring the feature of misclassification and call the resulting function ordGEE2. This function can be used together with the preceding described mgee2k or mgee2v to evaluate the impact of not addressing misclassification effects:

    ordGEE2(formula, id, data, corstr = "exchangeable", maxit = 50, tol = 0.001)

4 Data analysis

In this section, we conduct numerical studies to demonstrate the usage of our developed R package as well as to show supplementary functions such as summary and plot functions in this package. We first demonstrate all of the external functions in mgee2 through an example with a simulated data set, known as obs1, provided in our package.

An example

The simulated data set, called "obs1", includes 8 columns and 3000 rows, with each patient having 3 entries of visits. The format of this data set is as follows.

    > head(obs1)
    ID    Y    X treatment visit S W delta
    1  1    2    2         1     1 2 2     1
    2  1    0    0         1     2 0 0     1
    3  1 <NA> <NA>         1     3 1 2     0
    4  2 <NA> <NA>         1     1 1 0     0
    5  2 <NA> <NA>         1     2 0 1     0
    6  2 <NA> <NA>         1     3 0 0     0
    > summary(obs1)
    ID            Y           X        treatment visit   
    Min.   :   1.0   0   : 352   0   : 444   0:1500    1:1000  
    1st Qu.: 250.8   1   : 283   1   : 269   1:1500    2:1000  
    Median : 500.5   2   : 256   2   : 178             3:1000  
    Mean   : 500.5   NA's:2109   NA's:2109                     
    3rd Qu.: 750.2                                             
    Max.   :1000.0                                             
    S        W            delta      
    0:1181   0:1460   Min.   :0.000  
    1: 955   1: 944   1st Qu.:0.000  
    2: 864   2: 596   Median :0.000  
    Mean   :0.297  
    3rd Qu.:1.000  
    Max.   :1.000

Here, \(\texttt{Y}\) and \(X\) represent the true outcome and covariate variables, both being ordinal variables, each taking 3 possible values, denoted 0, 1, and 2, whereas \(S\) and \(W\) are the observed surrogates for \(\texttt{Y}\) and \(X\), respectively, with a \(5\%\) misclassification rate. delta is 1 when the subject is in the validation set and 0 otherwise. About \(30\%\) of subjects are randomly chosen to be included in the validation set. We include the subscripts \(i\) and \(j\) to \(\texttt{Y}\) and \(X\) to indicate the measurements for the corresponding variables for subject \(i\) at time point \(j\), in considering the proportional odds model indicated by ((1)), \[\label{equ:model} \text{logit} \: \lambda_{ijk}=\beta_{0k}+\beta_{X1}X_{ij1}+\beta_{X2}X_{ij2}+\beta_{Z1}Z_{ij1}+\beta_{Z2}Z_{ij2}+\beta_{Z3}Z_{ij3}\text{ for }k=1,2, \tag{8}\] where \(\lambda_{ijk}\) is defined as for ((1)); the treatment variable, denoted \(Z_{ij1}\), is an error-free binary variable; we simulated 3 visits for each patient, denoted by dummy variables \(Z_{ij2}\) and \(Z_{ij3}\), with the first visit as a reference level; \(X_{ij1}\) and \(X_{ij2}\) represent the dummy variables to reflect the three levels of the error-prone covariate, \(X_{ij}\); and \((\beta_{0k},\beta_{x1},\beta_{x2},\beta_{z1},\beta_{z2},\beta_{z3})^T\) is the vector of regression coefficients.

In the case corstr = "exchangeable", the association, defined as in ((2)), between paired responses is assumed to be \[\text{log} \: \psi_{i,jk,j'k'}=\phi;\] while in the case corstr = "log-linear", the association is assumed to be \[\text{log} \: \psi_{i,jk,j'k'}=\phi+\phi_2 I(k=2)+\phi_2 I(k'=2)+\phi_{22}I(k=2,k'=2),\] where \(\phi,\phi_2\), and \(\phi_{22}\) are parameters.

We now apply mgee2k and mgee2v, in contrast to ordGEE2, to fit the data to the models, respectively. The results are displayed as follows. In the summary tables for the R output, we use "Y>=1" and "Y>=2" to denote the coefficients \(\beta_{01}\) and \(\beta_{02}\), respectively, and let "Delta" correspond to the parameter \(\phi\) in the dependence structure.

mgee2k

To use function mgee2k, we need to specify the misclassification matrices beforehand. Here, we set the misclassification matrices the same as used in the simulation process.

    > data(obs1)
    > obs1$visit <- as.factor(obs1$visit)
    > obs1$treatment <- as.factor(obs1$treatment)
    > obs1$S <- as.factor(obs1$S)
    > obs1$W <- as.factor(obs1$W)
    > ## set misclassification parameters to be known.
    > varphiMat <- gamMat <- log( cbind(0.04/0.95, 0.01/0.95,
    +                                   0.95/0.03, 0.02/0.03,
    +                                   0.04/0.01, 0.95/0.01) )
    > mgee2k.fit = mgee2k(formula = S~W+treatment+visit, id = "ID", data = obs1,
    +                   corstr = "exchangeable", misvariable = "W", gamMat = gamMat, 
    +                   varphiMat = varphiMat)
    > summary(mgee2k.fit)
    Call:
    mgee2k(formula = S ~ W + treatment + visit, id = "ID", data = obs1, 
    corstr = "exchangeable", misvariable = "W", gamMat = gamMat, 
    varphiMat = varphiMat)
    
    Summary table of the estimation
                Estimate  Std.Err Z value   Pr(>z)    
    Y>=1        0.70889  0.08591   8.251 2.22e-16 ***
    Y>=2       -0.67521  0.08625  -7.828 4.88e-15 ***
    W1          0.58667  0.08719   6.729 1.71e-11 ***
    W2          0.94948  0.09745   9.743  < 2e-16 ***
    treatment1 -0.70554  0.09114  -7.742 9.77e-15 ***
    visit2     -0.24147  0.07735  -3.122   0.0018 ** 
    visit3     -0.62480  0.07571  -8.253 2.22e-16 ***
    Delta       1.22606  0.12231  10.024  < 2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

graphic without alt text — Figure 1: The display of the results in the summary table of applying *mgee2k* method to the example. The vertical axis presents the estimates for the coefficients corresponding to \(\texttt{Y}\)>=1, \(\texttt{Y}\)>=2, …, and Delta in the order from the bottom to the top. The horizontal axis shows exp(point estimates) (shown in red dots) for those coefficients indicated by the vertical axis, together with their 95% confidence intervals (shown in blue line segments). The confidence intervals are calculated as \((\exp(C_L), \exp(C_U))\), where \((C_L, C_U)\) is a 95% confidence interval of a coefficient.

mgee2v

To use mgee2v, a column of indicator variable should be specified in valid.sample.ind.

        > data(obs1)
        >   obs1$visit <- as.factor(obs1$visit)
        >   obs1$treatment <- as.factor(obs1$treatment)
        >   obs1$S <- as.factor(obs1$S)
        >   obs1$W <- as.factor(obs1$W)
        >   mgee2v.fit = mgee2v(formula = S~W+treatment+visit, id = "ID", data = obs1,
        +                       y.mcformula = "S~1", x.mcformula = "W~1", 
        +                       misvariable = "W", valid.sample.ind = "delta",
        +                       corstr = "exchangeable")
        > summary(mgee2v.fit)
        Call:
        mgee2v(formula = S ~ W + treatment + visit, id = "ID", 
        data = obs1, corstr = "exchangeable", misvariable = "W", 
        valid.sample.ind = "delta", y.mcformula = "S~1", x.mcformula = "W~1")

        Summary table of the estimation
                    Estimate  Std.Err Z value   Pr(>z)    
        Y>=1        0.64876  0.08851   7.330 2.30e-13 ***
        Y>=2       -0.68226  0.08703  -7.839 4.44e-15 ***
        W1          0.56507  0.08140   6.942 3.88e-12 ***
        W2          0.98411  0.09305  10.577  < 2e-16 ***
        treatment1 -0.68153  0.09052  -7.529 5.11e-14 ***
        visit2     -0.24694  0.07483  -3.300 0.000966 ***
        visit3     -0.60027  0.07335  -8.184 2.22e-16 ***
        Delta       1.22862  0.12160  10.103  < 2e-16 ***
        ---
        Signif. codes:  
        0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

ordGEE2

    > naigee.fit = ordGEE2(formula = S~W+treatment+visit, id = "ID",
    +                      data = obs1, corstr = "exchangeable")
    > summary(naigee.fit)
    Call:
    ordGEE2(formula = S ~ W + treatment + visit, id = "ID", 
    data = obs1, corstr = "exchangeable")
    
    Summary table of the estimation
                Estimate  Std.Err Z value   Pr(>z)    
    Y>=1        0.73276  0.07990   9.171  < 2e-16 ***
    Y>=2       -0.69330  0.08004  -8.662  < 2e-16 ***
    W1          0.51237  0.07354   6.967 3.23e-12 ***
    W2          0.84890  0.08582   9.892  < 2e-16 ***
    treatment1 -0.65954  0.08511  -7.749 9.33e-15 ***
    visit2     -0.22766  0.07241  -3.144  0.00167 ** 
    visit3     -0.58407  0.07052  -8.282 2.22e-16 ***
    Delta       1.06616  0.09846  10.828  < 2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 '' 1

plot_model

We use the function plot_model to compare the results obtained from the three functions:

    > plot_model(naigee.fit)
    > plot_model(mgee2.fit)
    > plot_model(mgee2v.fit)

It is helpful to compare the odds ratios when there are multiple covariates. We use the function plot_model to visualize the odds ratios. The estimated odds ratios for this simulated data set across the three methods are displayed in Figure 1, 2, and 3. The red dot gives the odds ratio of each covariate. The horizontal blue line measures the length of each confidence interval. The vertical axes of the graphs indicate the descending order of the covariates. In other words, the red points from the lowest to the highest in the graph represent the first covariate, the second covariate, and so on. It is seen that the three methods yield similar odds ratios.

Simulation studies

To further compare the three methods, a simulation study is conducted. We run 500 simulations where each data set includes 1000 subjects, with three visits for each subjects. obs1 is one example of the simulated data. The true values of the coefficients are reported in 1:

Table 1: True coefficients.
\(\beta_{01}\)	\(\beta_{02}\)	\(\beta_{X1}\)	\(\beta_{X2}\)	\(\beta_{Z1}\)	\(\beta_{Z2}\)	\(\beta_{Z3}\)
\(\log 2\)	\(\log(1/2)\)	\(\log 2\)	\(\log 3\)	\(\log(1/2)\)	\(\log(3/4)\)	\(\log(1/2)\)

2 reports the simulation results for the care with a 5% misclassification rate set for both the response and covariate variables, where Bias% records a bias in percentage, EV represents an empirical variance, AMV stands for an average of model-based variance, and CR records a coverage rate of 95% confidence intervals. Simulation results show that the mgee2 and mgee2v perform better than the naive method ordGEE2, and they produce reasonable results.

Table 2: Simulation results with a 5% misclassification rate.
	ordGEE2				mgee2				mgee2v
	Bias%	EV	AMV	CR	Bias%	EV	AMV	CR	Bias%	EV	AMV	CR
\(\beta_{01}\)	3.119	0.007	0.007	0.942	-0.915	0.008	0.008	0.944	-2.223	0.008	0.008	0.951
\(\beta_{02}\)	3.226	0.007	0.007	0.946	1.562	0.008	0.008	0.940	3.556	0.009	0.008	0.947
\(\beta_{x1}\)	-12.112	0.006	0.006	0.784	1.238	0.008	0.008	0.942	-6.933	0.024	0.014	0.924
\(\beta_{x2}\)	-9.810	0.007	0.008	0.754	1.433	0.009	0.010	0.964	3.393	0.016	0.011	0.941
\(\beta_{z1}\)	-7.032	0.008	0.007	0.922	-0.456	0.009	0.008	0.954	-0.138	0.009	0.008	0.949
\(\beta_{z2}\)	-6.311	0.006	0.005	0.932	0.071	0.006	0.006	0.938	-0.364	0.006	0.006	0.932
\(\beta_{z3}\)	-6.630	0.005	0.005	0.908	0.056	0.006	0.006	0.964	0.143	0.006	0.006	0.962
\(\phi\)	-13.130	0.009	0.010	0.690	0.217	0.014	0.015	0.954	1.257	0.016	0.017	0.956

In addition to the preceding simulation with a misclassification rate of 5%, we conducted another simulation with the same parameters except that the misclassification rate is changed to be 20%, and corstr = "log-linear". The results are reported in Table 3, which shows more noticeable differences in implementing the three functions, ‘ordGEE2’, ‘mgee2’, and ‘mgee2v’.

Table 3: Simulation results with a 20% misclassification rate.
	ordGEE2				mgee2				mgee2v
	Bias%	EV	AMV	CR	Bias%	EV	AMV	CR	Bias%	EV	AMV	CR
\(\beta_{01}\)	9.589	0.007	0.007	0.866	0.748	0.015	0.015	0.952	0.872	0.015	0.016	0.966
\(\beta_{02}\)	11.131	0.006	0.007	0.842	-0.210	0.014	0.015	0.958	-0.523	0.015	0.016	0.958
\(\beta_{x1}\)	-48.891	0.005	0.006	0.000	-1.233	0.027	0.029	0.964	-0.874	0.023	0.023	0.940
\(\beta_{x2}\)	-43.506	0.008	0.008	0.000	-0.400	0.026	0.027	0.958	-0.399	0.023	0.023	0.946
\(\beta_{z1}\)	-26.284	0.006	0.006	0.346	0.364	0.011	0.012	0.960	0.084	0.011	0.011	0.940
\(\beta_{z2}\)	-24.870	0.006	0.006	0.832	1.703	0.012	0.011	0.938	1.396	0.010	0.010	0.948
\(\beta_{z3}\)	-26.893	0.006	0.006	0.314	0.228	0.011	0.012	0.954	0.144	0.010	0.011	0.968
\(\phi_0\)	-53.210	0.009	0.009	0.000	1.873	0.078	0.068	0.942	1.034	0.052	0.066	0.976
\(\phi_2\)	-73.139	0.006	0.006	0.042	-0.353	0.052	0.047	0.948	-1.241	0.037	0.049	0.970
\(\phi_{22}\)	-59.955	0.011	0.011	0.000	0.256	0.075	0.075	0.944	0.188	0.053	0.079	0.978

5 A case study

To illustrate the usage of the developed R package, we analyze a dataset arising from the Framingham Heart Study, obtained from the NIH website (https://biolincc.nhlbi.nih.gov/teaching/). Similar to (Chen et al. 2014), we consider those 915 male patients who completed both exams #2 and #3, and age between 31 and 65 at the entry of the study. The response variable, HBP, is a categorical variable indicating the status of the systolic blood pressure (SBP), where HBP=0 if SBP is below 140 mmHg, HBP=1 if SBP is between 140 mmHg and 159 mmHg, and HBP=2 if SBP is larger than 160 mmHg.

We are interested in understanding the relationship between HBP and covariates, including the serum cholesterol level (CHOL), age, and the current smoking status (CURSMOKE), as well as the examination status, denoted as "Exam3". CHOL is classified as three categories, with 0, 1, and 2 representing normal (less than 200 mg/dL), borderline high (200-239mg/dL), and hypercholesterolemia (greater than 240 mg/dL), respectively. Exam3 is a dummy variable, with 1 indicating observations for exam 3 and 0 for exam 2.

First, we visualize how SBP may change with age by stratifying the study subjects into different categories according to the exam time, smoking status, or CHOL. To see the trend, we display simple linear regression lines that fit scattered points of SBP against AGE for patients in each category, as shown in Figure 4. Except for the patients with CHOL value 0 and CURSMOKE value 0 at exam 2, there is generally an upward trend of SBP versus age for each category, though the degree varies. While each patient takes 2 exams, the time interval between two exams is different from patient to patient. To reflect this feature of different gap times for the study subjects, in Figure 5 we further display spaghetti plots (Hedeker and Gibbons 2006) for patients in different categories, where the two endpoints of each black line segment mark SBP and age for the corresponding study subject at exams 2 and 3 in each category, respectively. The blue curve represents the loess smooth curve in each panel to show the trend of SBP against AGE. The loess smooth function is a tool to create smooth lines for scattered plots using polynomial approximations. The code for producing Figures 4 and 5 is included in the help file of data set heart in our R package.

Next, we use the proportional odds model to examine how SBP may be quantitatively associated with the covariates. For the \(i\)th patient at the \(j\)th visit, \(X_{ij,CHOL=1}\) and \(X_{ij,CHOL=2}\) are binary indicator variables recording whether the patient’s cholesterol level is 1 and 2, respectively; \(Z_{ij,smoker}\) is a binary variable whether or not the patient is a smoker; \(Z_{ij,exam3}\) is a binary variable showing whether or not the patient is taking exam #3; and \(Z_{i,age}\) records the age of the \(i\)th patient at the entry of the study.

As defined in ((1)), consider the model \[\begin{aligned} \text{logit}\: \lambda_{ijk}=&\beta_{0k}+\beta_{X,CHOL=1}X_{ij,CHOL=1}+\beta_{X,CHOL=2}X_{ij,CHOL=2}\\ &+\beta_{Z,age}Z_{i,age}+\beta_{Z,smoker}Z_{ij,smoker}+\beta_{Z,exam3}Z_{ij,exam3} \label{eq:casestudy} \end{aligned} \tag{9}\] for \(k=1,2\), where \(\beta_{0k}, \beta_{X,CHOL=1}, \beta_{X,CHOL=2}, \beta_{Z,age}, \beta_{Z,smoker},\) and \(\beta_{Z,exam3}\) are the parameters.

Table 4: A case study of a data subset arising from the Framingham Heart Study, with an assumed misclassification rate at 5%.
	ordGEE2			mgee2k
	Est.	SD	p-vlue	Est.	SD	p-vlue
\(\beta_{01}\)	-4.291	0.635	<0.001	-4.943	0.737	<0.001
\(\beta_{02}\)	-5.623	0.638	<0.001	-6.195	0.740	<0.001
\(\beta_{x,CHOL=1}\)	0.068	0.133	0.608	0.117	0.180	0.515
\(\beta_{x,CHOL=2}\)	0.352	0.140	0.012	0.474	0.186	0.011
\(\beta_{z,age}\)	0.063	0.012	<0.001	0.071	0.014	<0.001
\(\beta_{z,smoker}\)	-0.044	0.105	0.673	-0.042	0.118	0.722
\(\beta_{z,exam3}\)	0.145	0.097	0.133	0.182	0.110	0.097
\(\phi\)	2.301	0.207	<0.001	2.301	0.318	<0.001

The data set used in our example is included in our package called "heart". To demonstrate the usage of the developed package, we perceive that the response HBP level and the covariate cholesterol level are prone to misclassification. Since this example does not have a validation data set, we only analyze the data using the naive method, "ordGEE2", and the corrected method with a specified known misclassification rate, "mgee2k", where the misclassification rates for both the outcome and the covariate are assumed to be 5%, and the exchangeable dependence structure is considered. The analysis results are shown in Table 4. Overall, the naive method and the corrected method indicate the same significant health factors, yet the magnitude of the coefficient estimates and their standard errors are different. Higher cholesterol levels and older ages appear to be positively correlated with high blood pressure.

6 Summary

Analysis of longitudinal ordinal data is important for research in health science, epidemiological studies, and social science. Marginal analysis using generalized estimating equations has been extensively employed in applications. However, such a strategy is challenged by the presence of mismeasurement of variables. To address this challenge, (Chen et al. 2014) developed estimation methods for analyzing correlated ordinal responses and ordinal covariates, which are subject to misclassification.

To allow analysts to apply the useful methods of (Chen et al. 2014) without doing individual codes, we develop an R package mgee2 to implement the methods for general use. Our package provides three methods for estimation, including the two methods of corrections for misclassification effects, as opposed to the naive method, which disregards the feature of mismeasurement in variables. The package can be used for modeling longitudinal ordinal data with misclassified response and covariates. It provides consistent estimation results by directly inputting the data under required assumptions.

7 Acknowledgements

The authors thank the review team for the helpful comments on the initial version. Yi’s research was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC). Yi is Canada Research Chair in Data Science (Tier 1). Her research was undertaken, in part, thanks to funding from the Canada Research Chairs program.

This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.