PGEE : An R Package for Analysis of Longitudinal Data with High-Dimensional Covariates

We introduce an R package PGEE that implements the penalized generalized estimating equations (GEE) procedure proposed by Wang et al. (2012) to analyze longitudinal data with a large number of covariates. The PGEE package includes three main functions: CVfit, PGEE, and MGEE. The CVfit function computes the cross-validated tuning parameter for penalized generalized estimating equations. The function PGEE performs simultaneous estimation and variable selection for longitudinal data with high-dimensional covariates; whereas the function MGEE fits unpenalized GEE to the data for comparison. The R package PGEE is illustrated using a yeast cell-cycle gene expression data set.


Introduction
Longitudinal data arises from repeated measurements on the same subjects over time.A popular approach to analyzing longitudinal data is generalized estimating equations (GEE), which were proposed by Liang and Zeger (1986) and Zeger and Liang (1986).The GEE approach fits a marginal regression model to the longitudinal data.Instead of specifying the full joint likelihood, it only requires to specify the first two marginal moments.This is particularly attractive when the responses are discrete as specifying a joint distribution for multivariate discrete distribution is known to be challenging.Furthermore, although the GEE procedure relies on a working correlation model, it produces a consistent and asymptotically normal estimator even if the working correlation structure is misspecified.If the specified working correlation structure is close to the true correlation structure, further efficiency gain can be expected.Some commonly used working correlation structures include the exchangeable (Exch), first-order autoregressive (Ar(1)), stationary-1-dependent (MV_1) and so on.The generalized estimating equations are now implemented in two nice R packages: the gee package (Carey, 2015) and the geepack package (Halekoh et al., 2006).
With the advent of technology in data-collection, longitudinal data with a large number of covariates, in other words, high-dimensional longitudinal data, have now been commonly observed in fields such as health and genomic studies, economics and behavioral sciences.Including redundant covariates in model results in loss of accuracy in both estimation and inference.In the modern "large n, diverging p" framework, Wang (2011) studied the consistency and asymptotic normality of GEE regression estimators and verified the validity of the sandwich variance formula of GEE estimators and the large-sample Wald test under regularity conditions.Wang et al. (2012) further proposed penalized GEE for simultaneous variable selection and estimation for the cases where the number of covariates in the model is large.Similarly as GEE, the penalized GEE procedure only requires to specify the first two marginal moments and a working correlation matrix and assumes that missing data is valid only under missing completely at random, which means that missingness is independent of both observed and unobserved data.It leads to consistent variable selection performance even if the working correlation structure is misspecified, that is, with probably approaching one, the true model is selected if it is one of the candidate models.Recently, there has been growing interest in high-dimensional longitudinal data analysis, see for example Lian et al. (2014) and Wang et al. (2014).
In this paper, we present the R package PGEE (Inan et al., 2017) which implements the penalized generalized estimating equations procedure in Wang et al. (2012) to analyze the longitudinal data with high-dimensional covariates.The package PGEE is available on CRAN at https://cran.rproject.org/web/packages/PGEE.The rest of the paper is organized as follows.Section 2.2 provides a brief overview for both GEE and PGEE.Section 2.3 describes the main features of the functions in the PGEE package.Section 2.4 illustrates the use of PGEE via a yeast cell-cycle gene expression data set.Section 2.5 concludes the paper.

Data structure
Consider a longitudinal study where for the ith (i = 1, 2, . . ., N) subject at time t (t = 1, 2, . . ., n i ) we observe a response variable Y it and a p × 1 vector of covariates The R Journal Vol.9/1, June 2017 ISSN 2073-4859 . ., . . ., X in i T denote n i × 1 vector of responses and n i × p matrix of covariates for the ith subject, respectively.The observations obtained from the same subject are correlated whereas those obtained from different subjects are assumed to be independent.
Background on generalized estimating equations Liang and Zeger (1986) assume that the first two marginal moments of , where θ it = X T it β, and β = (β 1 , . . ., β p ) T is p × 1 vector of unknown regression coefficients of interest, i = 1, 2, . . ., N, t = 1, 2, . . ., n i .These moment assumptions would follow when the marginal response variable has a canonical exponential family distribution with scaling parameter one.
Let V i = Var(Y i |X it ) be the n i × n i covariance matrix of the ith subject, i = 1, . . ., N. In practice, Liang and Zeger (1986) suggest to estimate V i via a working correlation structure.Specifically, let where A i is an n i × n i diagonal matrix with the marginal variance of responses on the diagonals, and R n i (α) represents an n i × n i working correlation matrix indexed by a vector of parameters α.From now on, we will use R(α) rather than R n i (α) for simplicity.The popular choices for R(α) may be independence (Indep), exchangeable (Exch), first-order autoregressive (Ar(1)), stationarym-dependent (MV_m), and non-stationary-m-dependent (NMV_m) (m denotes the lag order), and unstructured (UN) working correlation structure.A good review of the commonly used working correlation matrices is given in Horton and Lipsitz (1999).Liang and Zeger (1986) proposed to estimate the regression parameters β by solving the following set of estimating equations where R−1 (α) denotes the estimated working correlation matrix.The estimating equations can be solved using a modified Fisher scoring algorithm.Within the iterative Fisher scoring algorithm, the parameter α in R(α) can be estimated by residual-based moment method, see Hardin and Hilbe (2003).Liang and Zeger (1986) showed that the resulted estimator is consistent even if R is misspecified.

Penalized generalized estimating equations for longitudinal data with high-dimensional covariates
With high-dimensional covariates, it is often reasonable to assume that many of these covariates are not relevant for modeling the marginal mean of the response variable, in other words, the regression coefficients vector β can be assumed to be sparse in the sense that most of its components are exactly zero.Wang et al. (2012) introduced penalized generalized estimating equations (PGEE) for simultaneous estimation and variable selection in this setting.More specifically, they propose to estimate β by β, which solves the following set of penalized estimating equations where q λ (|β|) = (q λ (|β 1 |), . . ., q λ (|β p |)) T , sign(β) = (sign(β 1 ), . . ., sign(β p )) T , and q λ (|β|) • sign(β) denotes the Hadamard product (element-wise product) of these two vectors.The penalty function q λ (|β j |), the jth component of q λ (|β|), takes a zero value for a large value of |β j | and takes a large value for a small value of |β j |.Consequently, the generalized estimating equation S(β j ), the jth component of S(β), is not penalized if |β j | is large in magnitude, whereas S(β j ) is penalized if |β j | is smaller than a cut-off value (greater than zero).Hence, the role of the penalty function q λ (|β j |) is to shrink the estimates of small coefficients toward zero.The coefficients whose estimates are shrunken to zero are excluded from the final model.The cut-off value is chosen as 10 −3 as in Cho and Qu (2013), Wang et al. (2012), andWang et al. (2007).The penalty has a tuning parameter λ that controls the model complexity.Wang et al. (2012) studied the SCAD penalty function (Fan and Li, 2001) which is defined on [0, +∞] as The R Journal Vol.9/1, June 2017 ISSN 2073-4859 where λ ≥ 0, a > 2 and b + = bI(b > 0) for a real number b.Following the recommendation of Fan and Li (2001), we use a = 3.7 and find it usually works well.The nonconvex SCAD penalty function alleviates the main drawback of L 1 penalty function in that it avoids over-penalizing large coefficients and hence leads to consistent variable selection under more relaxed conditions.Under regularity conditions, the estimated coefficients for redundant covariates are shrunken to exactly zero.Johnson et al. (2008), Wang et al. (2012) proposed to solve the penalized estimated equations in Equation ( 3) by combining the minorization-maximization (MM) algorithm with a Newton-Raphson (NR) algorithm.At the jth iteration,

Motivated by
where where is a small value (e.g., 10 −6 ).Wang et al. (2012) derived the asymptotic theory of PGEE in a high-dimensional framework where the number of covariates p increases as N increases, and p can reach the same order as N.An important feature of PGEE is that even if the working correlation structure is misspecified, the consistency of model selection holds, that is, with probability approaching one, it correctly identifies the zero coefficients to be zero and the nonzero coefficients to be nonzero.They also suggested a sandwich formula to estimate the asymptotic covariance matrix of the penalized GEE estimator as follows: where where The tuning parameter λ in Equation ( 4) can be selected using K-fold cross-validation, where K is a positive integer.The data is divided into non-overlapping K sub-samples of equal sizes.The kth sub-sample being left out as the testing data set, and the remaining data are used as the training data set, k = 1, . . ., K. We use the set N −k to denote the indice set of the subjects in the training data set and use |N −k | to denote the cardinality of N −k .We fit the PGEE under working independence assumption using the training data and then evaluate the prediction error using the test data by PE −k (λ), which is defined as We repeat the above computation for each of the K subsamples, and the overall cross-validated prediction error is given by Given a set of λ values over a grid, we choose the value of λ that yields the smallest CV(λ).

PGEE estimation algorithm
The algorithm for solving penalized generalized estimating equations is summarized as follows: 1. Determine a reasonable grid of values for λ.

Given a value of λ:
• Assign an initial value for β.
• Stop the iteration sequence if a convergence criterion is satisfied such as when the L 1 norm of the difference of estimated β between iterations is below a threshold (e.g., 10 −3 ) and denote the estimated value at convergence as β.
3. Repeat step 2 for each value of λ over the grid and find the value of λ that gives the smallest cross-validation prediction error.
4. Find the estimated parameter β corresponding to the selected λ, and compute the sandwich covariance matrix by Equation ( 7).

The PGEE package
The R package PGEE consists of three core functions: CVfit, PGEE, and MGEE, respectively.The main function PGEE fits PGEEs to the longitudinal data with high-dimensional covariates.Prior to model fitting with PGEE, the cross-validated tuning parameter should be computed via the function CVfit.
The package also includes the function MGEE which fits the unpenalized GEEs to the data.The PGEE and MGEE functions are written by the authors.The R package PGEE depends on the R packages MASS (Ripley, 2015) and mvtnorm (Genz et al., 2015) only.
In this section, we introduce the input arguments of the functions CVfit and PGEE, whereas the function MGEE shares the same arguments with the function PGEE except the arguments lambda, pindex, and eps.
The usage and input arguments of CVfit function are summarized as follows: CVfit(formula, id, data, family, scale.fix= TRUE, scale.value= 1, fold, lambda.vec,pindex, eps, maxiter, tol) The function CVfit applies the step 3 in the estimation algorithm in Section 2.2.4 via Equation (9).It uses the function PGEE inside such that both functions share common input arguments.The input argument formula is a formula expression in the form of response predictors as in lm and glm functions.The argument id is a vector for identifying subjects/clusters and the argument data is a data frame which stores the response and covariates in formula with id variable as in gee function in R package gee.Please note that the function PGEE requires the covariates to be numeric variables and does not work with factor covariates.The argument family is a list of functions and expressions for defining link and variance functions.While families supported in the R package PGEE are binomial, gaussian, gamma, and poisson, link functions supported are identity, log, logit, inverse, probit, and cloglog.The argument scale.fix is a logical variable.The default value is TRUE.On the other hand, if scale.fix= TRUE, scale.valueassigns a numeric value to which the scale parameter should be fixed at.Otherwise, the default value is 1.The arguments fold, pindex, and eps are the main buildings of the function CVfit for cross-validation.The argument fold is the number of folds used in cross-validation.The argument lambda.vec is a vector of tuning parameters used in cross-validation.The argument pindex is an index vector showing the parameters which are not subject to penalization.The default value is NULL.However, in case of a model with intercept, the intercept parameter should be never penalized.The argument eps is a numerical value for the used in Equation ( 6).The default value is 10 −6 .The argument maxiter is the number of iterations that is used in the estimation algorithm.The default value is 30.The argument tol is the tolerance level that is used in the estimation algorithm to evaluate algorithm convergence.The default value is 10 −3 .The function CVfit returns an object class of CVfit.Applying the function print to the object returned by function CVfit provides detailed information related to cross-validation and gives the value of λ that minimizes the cross-validated value of prediction error.
PGEEs (Wang et al., 2012) and, in turn, the function CVfit in R package PGEE accommodate the SCAD penalty.Fan and Li (2001) demonstrated that the SCAD penalty function is a popular nonconvex penalty function that satisfies three desirable properties of variable selection (e.g., sparsity, unbiasedness, and continuity) simultaneously.Recently, a number of R packages have been developed for estimation and variable selection problems in linear regression models, logistic regression models, quantile regression models, and Cox proportional hazards models for cross-sectional data with highdimensional covariates; see the R package ncvreg (Breheny and Huang, 2011) for linear and logistic regression models with SCAD and MCP penalization functions, the R package penalized (Goeman, 2010) for generalized linear regression models and Cox proportional hazards models with L 1 and L 2 penalty functions, the R package glmnet (Friedman et al., 2010) for generalized linear regression models and Cox proportional hazards models with LASSO and elastic-net penalty functions, and the R package rqPen (Sherwood and Maidman, 2016) for quantile regression penalized quantile regression The R Journal Vol.9/1, June 2017 ISSN 2073-4859 with LASSO, SCAD, and MCP functions.However, these packages do not apply to the clustered data setting which we consider in this paper.
The usage and input arguments of PGEE function are summarized as follows: PGEE(formula, id, data, na.action = NULL, family = gaussian(link = "identity"), corstr = "independence", Mv = NULL, beta_int = NULL, R = NULL, scale.fix= TRUE, scale.value= 1, lambda, pindex = NULL, eps = 10^-6, maxiter = 30, tol = 10^-3, silent = TRUE) The input arguments formula, data, id, and family are the same as those in the CVfit function.Here, the default value for family is gaussian.The argument na.action is a function to remove missing values from the data, where only na.omit is allowed.The argument corstr is a character string, which specifies the type of correlation structure within the repeated measurements of a subject.The correlation structures supported in the R package PGEE are "AR-1","exchangeable", "fixed", "independence", "stat_M_dep", "non_stat_M_dep", and "unstructured".The default corstr type is "independence".If either "stat_M_dep" or "non_stat_M_dep" is specified in corstr, then the argument Mv assigns a numeric value for Mv, which is one minus less than the largest number of repeated measurements of a subject has in the data.Otherwise, the default value is NULL.When the longitudinal data is unbalanced, the use of "non_stat_M_dep" and "unstructured" is not allowed in the argument corstr.If corstr = "fixed" is specified, then the argument R is a user specified square correlation matrix of dimension maximum of the number of repeated measurements of a subject has in the data.Otherwise, the default value is NULL.The argument beta_int is a user specified vector of initial values for regression parameters including the intercept.The default value is NULL which gets initial values through fitting a glm to the whole longitudinal data.The argument lambda is a numerical value returned by CVfit function.The input arguments scale.fix,scale.value,pindex, eps, maxiter, and tol are the same as those in the CVfit function.The argument silent is a logical variable; if false, the regression parameter estimates at each iteration are printed.The default value is TRUE.
The function PGEE returns an object class of PGEE.Applying the functions print and summary to an object returned by function PGEE provides detailed information related to the model fitting and summarizes the results as illustrated in the next section.
The function MGEE closely follows the syntax of the function gee in the R package gee except that the argument subset for data sub-setting and the argument contrasts for coding factor variables in terms of dummy variables are not used in the function MGEE.Furthermore, while any lag order can be assumed in the argument corstr = "AR-M" of the function gee, only first-order lag is allowed in the function MGEE (e.g., corstr = "AR-1").On the other hand, there is much discrepancy between the arguments of the function MGEE and the function geeglm in the R package geepack since the latter inherits its arguments mostly from the function glm.As the result, arguments in geepack such as weights, subset, etastart, mustart, offset, and waves are not available in our function MGEE.Its corstr menu consists of "AR-1","exchangeable", "fixed", "independence", and "unstructured" structures, which is less comprehensive compared to the corstr menu of the function MGEE.Lastly, in addition to the sandwich variance estimator, the function geeglm offers jackknife variance estimator for data sets with small number of clusters via the argument std.err.

Example
In this section, we demonstrate the use of the R package PGEE using a yeast cell-cycle gene expression data set collected in the CDC15 experiment of Spellman et al. (1998).The experiment measured genome-wide mRNA levels of 6178 yeast open reading frames (ORFs) (translates DNA sequences into its corresponding amino acid sequences which will appear in the final protein).This experiment covered two cell-cycle periods, where measurements were taken at 7-minute intervals over a 119minutes period yielding a total of 18 time points.
As discussed in Wang et al. (2012) and Wang et al. (2007), the cell-cycle process is a regulated life process where the cell grows, replicates its DNA and prepares itself for cell-division.This process is generally divided into M/G1-G1-S-G2-M stages, where M refers to "mitosis", G1 refers "GAP 1", S refers to DNA "synthesis", and G2 refers to "GAP 2", respectively.Spellman et al. (1998) identified a total of 800 genes that showed periodic variation during the cell-cycle process.However, to better understand the phenomenon underlying cell-cycle process, it is important to identify transcription factors (TFs) that regulate the gene expression levels of cell cycle-regulated genes.Specifically, TFs are proteins that control gene regulation by determining the rate of transcription of genetic information from DNA to mRNA.
We repeat the steps above for each M/G1, G1, S, G2, and M stages under three different working correlation matrices (independence, exchangeable, and Ar(1)) and identify the number of TFs selected at each stage where the results are presented in Table 1.The results in Table 1 suggest that PGEE is robust to working correlation matrix specification and the selected TFs do not change significantly across working correlation matrix types within a stage.Furthermore, the analysis also reveals that different TFs may play different roles on gene expression levels in each cell-cycle process and there may be a small overlap in the selected TFs at different stages (e.g., only FKH1, FKH2, and SMP1 are related to all stages of the yeast cell-cycle process).

Conclusion
In this paper, we present the R package PGEE which implements the PGEEs approach in Wang et al. (2012).The PGEE procedure performs simultaneous estimation and variable selection for longitudinal data analysis with high-dimensional covariates.We believe that this package is useful to practitioners in diverse fields where high-dimensional longitudinal data commonly arises such as genetics and large-scale health studies.

Table 1 :
summarizes the number of TFs which are identified as statistically significant for each stage under three different working correlation structures.Number of TFs selected at each stage in the yeast cell-cycle process with PGEE