We introduce an R package PGEE that implements the penalized generalized estimating equations (GEE) procedure proposed by Wang et al. (2012) to analyze longitudinal data with a large number of covariates. The PGEE package includes three main functions: CVfit
, PGEE
, and MGEE
. The CVfit
function computes the cross-validated tuning parameter for penalized generalized estimating equations. The function PGEE
performs simultaneous estimation and variable selection for longitudinal data with high-dimensional covariates; whereas the function MGEE
fits unpenalized GEE to the data for comparison. The R package PGEE is illustrated using a yeast cell-cycle gene expression data set.
Longitudinal data arises from repeated measurements on the same subjects over time. A popular approach to analyzing longitudinal data is generalized estimating equations (GEE), which were proposed by Liang and Zeger (1986) and Zeger and Liang (1986). The GEE approach fits a marginal regression model to the longitudinal data. Instead of specifying the full joint likelihood, it only requires to specify the first two marginal moments. This is particularly attractive when the responses are discrete as specifying a joint distribution for multivariate discrete distribution is known to be challenging. Furthermore, although the GEE procedure relies on a working correlation model, it produces a consistent and asymptotically normal estimator even if the working correlation structure is misspecified. If the specified working correlation structure is close to the true correlation structure, further efficiency gain can be expected. Some commonly used working correlation structures include the exchangeable (Exch), first-order autoregressive (Ar(1)), stationary-1-dependent (MV_1) and so on. The generalized estimating equations are now implemented in two nice R packages: the gee package (Carey 2015) and the geepack package (Halekoh et al. 2006).
With the advent of technology in data-collection, longitudinal data with a large number of covariates, in other words, high-dimensional longitudinal data, have now been commonly observed in fields such as health and genomic studies, economics and behavioral sciences. Including redundant covariates in model results in loss of accuracy in both estimation and inference. In the modern “large n, diverging p” framework, Wang (2011) studied the consistency and asymptotic normality of GEE regression estimators and verified the validity of the sandwich variance formula of GEE estimators and the large-sample Wald test under regularity conditions. Wang et al. (2012) further proposed penalized GEE for simultaneous variable selection and estimation for the cases where the number of covariates in the model is large. Similarly as GEE, the penalized GEE procedure only requires to specify the first two marginal moments and a working correlation matrix and assumes that missing data is valid only under missing completely at random, which means that missingness is independent of both observed and unobserved data. It leads to consistent variable selection performance even if the working correlation structure is misspecified, that is, with probably approaching one, the true model is selected if it is one of the candidate models. Recently, there has been growing interest in high-dimensional longitudinal data analysis, see for example Lian et al. (2014) and Wang et al. (2014).
In this paper, we present the R package PGEE (Inan et al. 2017) which implements the penalized generalized estimating equations procedure in Wang et al. (2012) to analyze the longitudinal data with high-dimensional covariates. The package PGEE is available on CRAN at https://cran.r-project.org/web/packages/PGEE. The rest of the paper is organized as follows. Section 2 provides a brief overview for both GEE and PGEE. Section 3 describes the main features of the functions in the PGEE package. Section 4 illustrates the use of PGEE via a yeast cell-cycle gene expression data set. Section 5 concludes the paper.
Consider a longitudinal study where for the
Liang and Zeger (1986) assume that the first two marginal moments of
Let
where
Let
With high-dimensional covariates, it is often reasonable to assume that
many of these covariates are not relevant for modeling the marginal mean
of the response variable, in other words, the regression coefficients
vector
where
where
Motivated by Johnson et al. (2008), Wang et al. (2012) proposed to solve
the penalized estimated equations in Equation ((3)) by
combining the minorization-maximization (MM) algorithm with a
Newton-Raphson (NR) algorithm. At the
where
where
where
The tuning parameter
Given a set of
The algorithm for solving penalized generalized estimating equations is summarized as follows:
Determine a reasonable grid of values for
Given a value of
Assign an initial value for
Compute
Update the current estimate of
Stop the iteration sequence if a convergence criterion is
satisfied such as when the
Compute the cross-validation value of
Repeat step
Find the estimated parameter
The R package PGEE consists of three core functions: CVfit
, PGEE
,
and MGEE
, respectively. The main function PGEE
fits PGEEs to the
longitudinal data with high-dimensional covariates. Prior to model
fitting with PGEE
, the cross-validated tuning parameter should be
computed via the function CVfit
. The package also includes the
function MGEE
which fits the unpenalized GEEs to the data. The PGEE
and MGEE
functions are written by the authors. The R package PGEE
depends on the R packages
MASS (Ripley 2015) and
mvtnorm (Genz et al. 2015) only.
In this section, we introduce the input arguments of the functions
CVfit
and PGEE
, whereas the function MGEE
shares the same
arguments with the function PGEE
except the arguments lambda
,
pindex
, and eps
.
The usage and input arguments of CVfit
function are summarized as
follows:
CVfit(formula, id, data, family, scale.fix = TRUE, scale.value = 1, fold, lambda.vec,
pindex, eps, maxiter, tol)
The function CVfit
applies the step PGEE
inside such that both functions share common input
arguments. The input argument formula
is a formula expression in the
form of response predictors
as in lm
and glm
functions. The
argument id
is a vector for identifying subjects/clusters and the
argument data
is a data frame which stores the response and covariates
in formula
with id
variable as in gee
function in R package gee.
Please note that the function PGEE
requires the covariates to be
numeric variables and does not work with factor covariates. The argument
family
is a list of functions and expressions for defining link
and
variance
functions. While families supported in the R package PGEE
are binomial
, gaussian
, gamma
, and poisson
, link functions
supported are identity
, log
, logit
, inverse
, probit
, and
cloglog
. The argument scale.fix
is a logical variable. The default
value is TRUE
. On the other hand, if scale.fix = TRUE
, scale.value
assigns a numeric value to which the scale parameter should be fixed at.
Otherwise, the default value is 1
. The arguments fold
, pindex
, and
eps
are the main buildings of the function CVfit
for
cross-validation. The argument fold
is the number of folds used in
cross-validation. The argument lambda.vec
is a vector of tuning
parameters used in cross-validation. The argument pindex
is an index
vector showing the parameters which are not subject to penalization. The
default value is NULL
. However, in case of a model with intercept, the
intercept parameter should be never penalized. The argument eps
is a
numerical value for the maxiter
is the number of
iterations that is used in the estimation algorithm. The default value
is tol
is the tolerance level that is used in the
estimation algorithm to evaluate algorithm convergence. The default
value is CVfit
returns an object class of
CVfit
. Applying the function print
to the object returned by
function CVfit
provides detailed information related to
cross-validation and gives the value of
PGEEs (Wang et al. 2012) and, in turn, the function CVfit
in R
package PGEE accommodate the SCAD penalty. Fan and Li (2001)
demonstrated that the SCAD penalty function is a popular nonconvex
penalty function that satisfies three desirable properties of variable
selection (e.g., sparsity, unbiasedness, and continuity) simultaneously.
Recently, a number of R packages have been developed for estimation and
variable selection problems in linear regression models, logistic
regression models, quantile regression models, and Cox proportional
hazards models for cross-sectional data with high-dimensional
covariates; see the R package
ncvreg (Breheny and Huang 2011) for
linear and logistic regression models with SCAD and MCP penalization
functions, the R package
penalized (Goeman 2010)
for generalized linear regression models and Cox proportional hazards
models with
The usage and input arguments of PGEE
function are summarized as
follows:
PGEE(formula, id, data, na.action = NULL, family = gaussian(link = "identity"),
corstr = "independence", Mv = NULL, beta_int = NULL, R = NULL, scale.fix = TRUE,
scale.value = 1, lambda, pindex = NULL, eps = 10^-6, maxiter = 30, tol = 10^-3,
silent = TRUE)
The input arguments formula
, data
, id
, and family
are the same
as those in the CVfit
function. Here, the default value for family
is gaussian
. The argument na.action
is a function to remove missing
values from the data, where only na.omit
is allowed. The argument
corstr
is a character string, which specifies the type of correlation
structure within the repeated measurements of a subject. The correlation
structures supported in the R package PGEE are
"AR-1"
,"exchangeable"
, "fixed"
, "independence"
, "stat_M_dep"
,
"non_stat_M_dep"
, and "unstructured"
. The default corstr
type is
"independence"
. If either "stat_M_dep"
or "non_stat_M_dep"
is
specified in corstr
, then the argument Mv
assigns a numeric value
for Mv
, which is one minus less than the largest number of repeated
measurements of a subject has in the data. Otherwise, the default value
is NULL
. When the longitudinal data is unbalanced, the use of
"non_stat_M_dep"
and "unstructured"
is not allowed in the argument
corstr
. If corstr = "fixed"
is specified, then the argument R
is a
user specified square correlation matrix of dimension maximum of the
number of repeated measurements of a subject has in the data. Otherwise,
the default value is NULL
. The argument beta_int
is a user specified
vector of initial values for regression parameters including the
intercept. The default value is NULL
which gets initial values through
fitting a glm
to the whole longitudinal data. The argument lambda
is
a numerical value returned by CVfit
function. The input arguments
scale.fix
, scale.value
, pindex
, eps
, maxiter
, and tol
are
the same as those in the CVfit
function. The argument silent
is a
logical variable; if false, the regression parameter estimates at each
iteration are printed. The default value is TRUE
.
The function PGEE
returns an object class of PGEE
. Applying the
functions print
and summary
to an object returned by function PGEE
provides detailed information related to the model fitting and
summarizes the results as illustrated in the next section.
The function MGEE
closely follows the syntax of the function gee
in
the R package gee except that the argument subset
for data
sub-setting and the argument contrasts
for coding factor variables in
terms of dummy variables are not used in the function MGEE
.
Furthermore, while any lag order can be assumed in the argument
corstr = "AR-M"
of the function gee
, only first-order lag is allowed
in the function MGEE
(e.g., corstr = "AR-1"
). On the other hand,
there is much discrepancy between the arguments of the function MGEE
and the function geeglm
in the R package geepack since the latter
inherits its arguments mostly from the function glm
. As the result,
arguments in geepack such as weights
, subset
, etastart
,
mustart
, offset
, and waves
are not available in our function
MGEE
. Its corstr
menu consists of "AR-1"
,"exchangeable"
,
"fixed"
, "independence"
, and "unstructured"
structures, which is
less comprehensive compared to the corstr
menu of the function MGEE
.
Lastly, in addition to the sandwich variance estimator, the function
geeglm
offers jackknife variance estimator for data sets with small
number of clusters via the argument std.err
.
In this section, we demonstrate the use of the R package PGEE using a
yeast cell-cycle gene expression data set collected in the CDC15
experiment of Spellman et al. (1998). The experiment measured
genome-wide mRNA levels of
As discussed in Wang et al. (2012) and Wang et al. (2007), the cell-cycle
process is a regulated life process where the cell grows, replicates its
DNA and prepares itself for cell-division. This process is generally
divided into M/G1-G1-S-G2-M stages, where M refers to “mitosis”, G1
refers “GAP 1”, S refers to DNA “synthesis”, and G2 refers to “GAP 2”,
respectively. Spellman et al. (1998) identified a total of
As Wang et al. (2012) and Wang et al. (2007), we used a subset of
where
Correlation | M/G1 | G1 | S | G2 | M |
---|---|---|---|---|---|
Independence | 16 | 12 | 14 | 8 | 9 |
Exchangeable | 15 | 13 | 12 | 8 | 7 |
Ar(1) | 15 | 13 | 13 | 8 | 8 |
For illustration, we used the yeast data from “G1” stage. Specifically,
like the R package gee, the package PGEE requires the response and
covariate columns to be ordered by id
column and within id
column
according to time
column. In our example, the yeast data was saved
under the name of yeastG1
object. The first column was id
column,
which identifies the genes. Then, while the continuous responses were
placed under the column y
, the time
variable and the standardized
values of yeastG1
data as follows:
> # load data
> data(yeastG1)
> data = yeastG1
> # get the column names
> colnames(data)[1:9]
[1] "id" "y" "time" "ABF1" "ACE2" "ADR1" "ARG80" "ARG81" "ARO80"
> # see some portion of yeast G1 data
> head(data,5)[1:9]
id y time ABF1 ACE2 ADR1 ARG80 ARG81 ARO80
1 1 0.88 3 -0.09702788 8.3839614 -0.1435702 -0.1482691 -0.09690053 -0.1073776
2 1 0.32 4 -0.09702788 8.3839614 -0.1435702 -0.1482691 -0.09690053 -0.1073776
3 1 1.09 12 -0.09702788 8.3839614 -0.1435702 -0.1482691 -0.09690053 -0.1073776
4 1 0.73 13 -0.09702788 8.3839614 -0.1435702 -0.1482691 -0.09690053 -0.1073776
5 2 0.66 3 -0.34618104 -0.1418099 -0.1397801 -0.1476834 -0.08486203 -0.1073536
Prior to model fitting with the function PGEE
, we needed to compute
the cross-validated value of tuning paremeter over a grid via the
function CVfit
. This process requires a trial-error period, where one
can start with a wide grid interval and then narrow it down. As
described in Section 3, we determined the main input
arguments of the function CVfit
as follows:
> library(PGEE)
> # define the input arguments
> formula <- "y ~.-id"
> family <- gaussian(link = "identity")
> lambda.vec <- seq(0.01,0.2,0.01)
> # find the optimum lambda
> cv <- CVfit(formula = formula, id = id, data = data, family = family, scale.fix = TRUE,
+ scale.value = 1, fold = 4, lambda.vec = lambda.vec, pindex = c(1,2), eps = 10^-6,
+ maxiter = 30, tol = 10^-6)
> # print the results
> print(cv)
Call:
CVfit(formula = formula, id = id, data = data, family = family,
scale.fix = TRUE, scale.value = 1, fold = 4, lambda.vec = lambda.vec,
pindex = c(1, 2), eps = 10^-6, maxiter = 30, tol = 10^-6)
4-fold CV results:
lambda Cv
[1,] 0.01 720.6857
[2,] 0.02 482.1046
[3,] 0.03 382.6932
[4,] 0.04 358.1970
[5,] 0.05 344.1034
[6,] 0.06 338.4713
[7,] 0.07 335.7137
[8,] 0.08 333.5432
[9,] 0.09 330.7405
[10,] 0.10 327.8395
[11,] 0.11 326.3243
[12,] 0.12 325.3068
[13,] 0.13 324.6835
[14,] 0.14 324.5388
[15,] 0.15 324.8667
[16,] 0.16 325.7849
[17,] 0.17 327.1245
[18,] 0.18 328.4365
[19,] 0.19 329.5468
[20,] 0.20 330.6265
Optimal tuning parameter:
Best lambda
0.14
> # see the returned values by CVfit
> names(cv)
[1] "fold" "lam.vect" "cv.vect" "lam.opt" "cv.min" "call"
> # get the optimum lambda
> cv$lam.opt
[1] 0.14
After selecting the tuning parameter CVfit
,
we apply the PGEE
function with the working correlation matrix type
corstr = "independence"
as follows:
> # fit the PGEE model
> myfit1 <- PGEE(formula = formula, id = id, data = data, na.action = NULL,
+ family = family, corstr = "independence", Mv = NULL,
+ beta_int = c(rep(0,dim(data)[2]-1)), R = NULL, scale.fix = TRUE,
+ scale.value = 1, lambda = cv$lam.opt, pindex = c(1,2), eps = 10^-6,
+ maxiter = 30, tol = 10^-6, silent = TRUE)
For comparison, we also fit the same model with
corstr = "exchangeable"
and corstr = "AR-1"
(see
Table 1). Here, we use beta_int
. Alternatively, the
initial estimates could be obtained via fitting a glm
while setting
beta_int = NULL
. We specify the vector of index for unpenalized
parameters as pindex = c(1,2)
since the first two regression
coefficients myfit1
object (in a similar way summary(myfit1)
object) can be found
out by the names
function:
> # get the values returned by myfit object
> names(myfit1)
[1] "title" "version" "model"
[4] "call" "terms" "nobs"
[7] "iterations" "coefficients" "nas"
[10] "linear.predictors" "fitted.values" "residuals"
[13] "family" "y" "id"
[16] "max.id" "working.correlation" "scale"
[19] "epsilon" "lambda.value" "robust.variance"
[22] "naive.variance" "xnames" "error"
The returned objects by PGEE
function are in similar format as those
returned by gee
function in R package. For example, while we could
obtain whole model fitting results by summary(myfit1)
object, we could
also obtain the summary of the model fitting results, i.e., estimate of
regression coefficients, their corresponding naive and robust standard
errors as well as z-values through coefficients
method for
summary(myfit1)
object as follows:
> # see a portion of the results returned by coef(summary(myfit1))
> head(coef(summary(myfit1)),7)
Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept) 9.835775e-02 0.0603431824 1.629972904 4.334557e-02 2.2691533
time 9.774627e-03 0.0065644728 1.489019375 3.274155e-03 2.9853891
ABF1 -4.032513e-03 0.0095653833 -0.421573608 2.054339e-03 -1.9629245
ACE2 1.824265e-06 0.0002669801 0.006832963 1.634333e-06 1.1162135
ADR1 1.911308e-07 0.0001733864 0.001102340 7.611678e-07 0.2511020
ARG80 2.017436e-07 0.0001741572 0.001158399 8.684975e-07 0.2322903
ARG81 2.374483e-05 0.0007900111 0.030056320 2.296590e-05 1.0339165
Any regression estimate less than
> # see the variables which have non-zero coefficients
> index1 <- which(abs(coef(summary(myfit1))[,"Estimate"]) > 10^-3)
> names(abs(coef(summary(myfit1))[index1,"Estimate"]))
[1] "(Intercept)" "time" "ABF1" "FKH1" "FKH2" "GAT3"
[7] "GCR2" "MBP1" "MSN4" "NDD1" "PHD1" "RGM1"
[13] "RLM1" "SMP1" "SRD1" "STB1" "SWI4" "SWI6"
> # see the PGEE summary statistics of these non-zero variables
> coef(summary(myfit1))[index1,]
Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept) 0.098357752 0.060343182 1.6299729 0.043345573 2.269153
time 0.009774627 0.006564473 1.4890194 0.003274155 2.985389
ABF1 -0.004032513 0.009565383 -0.4215736 0.002054339 -1.962925
FKH1 -0.009152898 0.013746477 -0.6658359 0.004173178 -2.193268
FKH2 -0.091503036 0.029629441 -3.0882471 0.017178993 -5.326449
GAT3 0.009780852 0.014721289 0.6644019 0.002192983 4.460067
GCR2 -0.005837966 0.011396288 -0.5122690 0.003227041 -1.809077
MBP1 0.102623543 0.028474614 3.6040363 0.017389748 5.901382
MSN4 0.011652400 0.015301265 0.7615318 0.004533629 2.570215
NDD1 -0.068098866 0.027962789 -2.4353389 0.017078278 -3.987455
PHD1 0.018224333 0.017586392 1.0362747 0.006676215 2.729740
RGM1 0.031474714 0.022152842 1.4207980 0.006025010 5.224010
RLM1 0.004245315 0.009823147 0.4321746 0.003155203 1.345497
SMP1 0.018181353 0.017691495 1.0276889 0.007614400 2.387759
SRD1 -0.009422532 0.013882871 -0.6787164 0.005117179 -1.841353
STB1 0.038198667 0.022075228 1.7303860 0.017485954 2.184534
SWI4 0.007370389 0.012622711 0.5838990 0.004184668 1.761284
SWI6 0.033957904 0.022673644 1.4976818 0.013225660 2.567577
For comparison, we fitted the unpenalized GEEs via MGEE
function under
the same settings defined above as follows:
> # fit the GEE model
> myfit2 <- MGEE(formula = formula, id = id, data = data, na.action = NULL,
+ family = family, corstr = "independence", Mv = NULL,
+ beta_int = c(rep(0,dim(data)[2]-1)), R = NULL, scale.fix = TRUE,
+ scale.value = 1, maxiter = 30, tol = 10^-6, silent = TRUE)
Finally, we obtain the TFs which were significantly associated with the gene expression levels at G1 stage via PGEE and GEE analyses, respectively.
> # see the significantly associated TFs in PGEE analysis
> names(which(abs(coef(summary(myfit1))[index1,"Robust z"]) > 1.96))
[1] "(Intercept)" "time" "ABF1" "FKH1" "FKH2" "GAT3"
[7] "MBP1" "MSN4" "NDD1" "PHD1" "RGM1" "SMP1"
[13] "STB1" "SWI6"
> # see the significantly associated TFs in GEE analysis
> names(which(abs(coef(summary(myfit2))[,"Robust z"]) > 1.96))
[1] "(Intercept)" "time" "ABF1" "ARG81" "ASH1" "CAD1"
[7] "GAT3" "GCN4" "GCR1" "GRF10.Pho2." "MBP1" "MET31"
[13] "MET4" "MTH1" "NDD1" "PDR1" "ROX1" "STB1"
[19] "STP1" "YAP5" "ZAP1"
When the results of PGEE and GEE analysis are compared, it is observed
that while TFs such that
We repeat the steps above for each M/G1, G1, S, G2, and M stages under
three different working correlation matrices (independence,
exchangeable, and Ar(1)) and identify the number of TFs selected at each
stage where the results are presented in Table 1. The
results in Table 1 suggest that PGEE is robust to working
correlation matrix specification and the selected TFs do not change
significantly across working correlation matrix types within a stage.
Furthermore, the analysis also reveals that different TFs may play
different roles on gene expression levels in each cell-cycle process and
there may be a small overlap in the selected TFs at different stages
(e.g., only
In this paper, we present the R package PGEE which implements the PGEEs approach in Wang et al. (2012). The PGEE procedure performs simultaneous estimation and variable selection for longitudinal data analysis with high-dimensional covariates. We believe that this package is useful to practitioners in diverse fields where high-dimensional longitudinal data commonly arises such as genetics and large-scale health studies.
Lan Wang’s research is supported by NSF grant NSF DMS-1512267. Gul Inan’s visit to University of Minnesota is supported through a 2219-Fellowship by the Scientific and Technological Research Council of Turkey (TUBITAK). Authors would also like to thank the editor and the reviewer for their constructive comments which improved the quality of the paper.
gee, geepack, PGEE, MASS, mvtnorm, ncvreg, penalized, glmnet, rqPen
Distributions, Econometrics, Environmetrics, Finance, MachineLearning, MixedModels, NumericalMathematics, Psychometrics, Robust, Survival, TeachingStatistics
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Inan & Wang, "PGEE: An R Package for Analysis of Longitudinal Data with High-Dimensional Covariates", The R Journal, 2017
BibTeX citation
@article{RJ-2017-030, author = {Inan, Gul and Wang, Lan}, title = {PGEE: An R Package for Analysis of Longitudinal Data with High-Dimensional Covariates}, journal = {The R Journal}, year = {2017}, note = {https://rjournal.github.io/}, volume = {9}, issue = {1}, issn = {2073-4859}, pages = {393-402} }