Fast Pure R Implementation of GEE: Application of the Matrix Package

Generalized estimating equation solvers in R only allow for a few pre-determined options for the link and variance functions. We provide a package, geeM, which is implemented entirely in R and allows for user-specified link and variance functions. The sparse matrix representations provided in the Matrix package enable a fast implementation. To gain speed, we make use of analytic inverses of the working correlation when possible and a trick to find quick numeric inverses when an analytic inverse is not available. Through three examples, we demonstrate the speed of geeM, which is only slightly slower than C-based implementations like geepack and gee on small data sets and faster on large data sets.


Introduction
An extension of generalized linear models and quasilikelihood (McCullagh and Nelder, 1986), generalized estimating equations (GEEs; Liang and Zeger, 1986) are a useful method for analyzing clustered (or longitudinal) data which may be correlated within cluster but are independent between clusters.
There are two packages on CRAN which will solve GEEs and produce standard errors in R: geepack (Højsgaard et al., 2006) and gee (Carey et al., 2012). gee provides the basic functionality necessary for fitting GEEs, while geepack also allows for regression models for both the scale and correlation parameters, as described in Yan and Fine (2004). Both of these packages rely heavily on a computational engine implemented in C and called by an R function wrapper. They use the family object as input and hard code each of the link and variance function options within the C code. While this configuration yields computational speed, it has two important drawbacks: it does not allow users to create and specify their own link and variance functions, and it is not easy for a programmer with only modest programming skills to alter the code for methodological developments related to or extending GEE.
To see why the user may need to define link and variance functions, consider the methodology presented in Schildcrout and Rathouz (2010). The authors look at binary response data in an outcome dependent sampling situation. Because the response is binary, the authors are able to use the offset option in standard GEE software to correct the bias from the sampling scheme. However, if the response takes some other form (e.g. counts or a continuous response), then we would need to define link and variance functions to correct this bias. These link and variance functions involve integrals that need to be computed whenever the functions are evaluated at different points.
Our new package, called geeM (the M is to emphasize the use of the Matrix package; Bates and Maechler, 2013), is coded completely in R, which grants the user the flexibility to define link and variance functions as part of the model specification (McDaniel and Henderson, 2013). It also allows for easy modification of the code by method developers. The main function in the package, geem, takes a list that includes the link function, variance function, inverse link function, and the derivative of the inverse link function, as an argument. These functions must all be vectorized.
Despite being coded entirely in R, we are still able to solve the estimating equations quickly for many problems. To accomplish this, the package relies heavily on the recently developed Matrix package for speed. We especially exploit the sparse matrix representation objects contained therein, which allow for efficient operations on matrices in which most elements are zero. Besides introducing our package, a key aim of this article is to demonstrate, in the context of a well-known statistical algorithm, the power, speed, and flexibility of the Matrix package.
We first give the details of the implementation, and then describe some limitations of the package. Finally, we show through examples and simulations the relative speed of the new R package. Along the way, we point out innovative uses of Matrix package functions.

GEE solutions
To set up the problem, consider $Y_i$, a multivariate $n_i$-vector of responses $Y_i = (Y_{i1}, \ldots, Y_{it}, \ldots, Y_{in_i})$ with mean vector $\mu_i = (\mu_{i1}, \ldots, \mu_{it}, \ldots, \mu_{in_i})$. Denote $E(Y_{it}) = \mu_{it}$. Let $X_i = (x_{i1}, \ldots, x_{it}, \ldots, x_{in_i})^\top$ be an $n_i \times p$ design matrix of covariate vectors corresponding to each observation for the $i$th cluster.
The R Journal Vol. 5/1, June, ISSN 2073-4859

In the typical specification, the means are related to the covariates by a monotone link function,
$$g(\mu_{it}) = \eta_{it} = x_{it}^\top \beta, \qquad (1)$$
and the variances are a function of the mean,
$$\mathrm{Var}(Y_{it}) = \phi\, a(\mu_{it}), \qquad (2)$$
where $\phi$ is a dispersion parameter. Let $A_i$ be a diagonal matrix with $k$th diagonal element equal to $a(\mu_{ik})$. Next we define
$$V_i = \phi A_i^{1/2} R_i(\alpha) A_i^{1/2},$$
where $R_i(\alpha)$ is the working correlation matrix for the $i$th cluster, parameterized by $\alpha$, which may be a vector.
Using these conventions, Liang and Zeger (1986) describe a modified Fisher scoring algorithm to estimate the regression coefficients $\beta$. The generalized estimating equations are given by
$$\sum_{i=1}^{K} D_i^\top V_i^{-1} S_i = 0,$$
where $D_i = \partial \mu_i / \partial \beta = A_i \Delta_i X_i$, $S_i = Y_i - \mu_i$, and $\Delta_i$ is a diagonal matrix with entries $\partial \theta_{it} / \partial \eta_{it}$. Instead of using a summation, we can put all of this in matrix form. Let $N = \sum_{i=1}^{K} n_i$, let $X$ be an $N \times p$ matrix with the $X_i$ matrices stacked on top of one another, let $\Delta$ and $A$ be $N \times N$ diagonal matrices with the appropriate entries from the $\Delta_i$ and $A_i$ on the diagonal, and let $S$ be an $N$-vector generated by stacking up the $S_i$'s. Finally, $R(\alpha)$ is an $N \times N$ block diagonal matrix with the individual $R_i(\alpha)$ matrices as blocks. With these definitions, our GEE representation is
$$X^\top \Delta A \left[\phi A^{1/2} R(\alpha) A^{1/2}\right]^{-1} S = 0. \qquad (3)$$
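The modified Fisher scoring update has the form $\beta \leftarrow \beta + (D^\top V^{-1} D)^{-1} D^\top V^{-1} S$. As a minimal sketch (not the geeM implementation), the loop below specializes this to a Gaussian family, identity link, and working independence, where the iteration reduces to ordinary least squares:

```r
# Sketch of the Fisher scoring update beta <- beta + (D'V^-1 D)^-1 D'V^-1 S,
# specialized to a Gaussian family, identity link, and working independence.
# Illustrative only; geeM handles general links, variances, and correlations.
set.seed(1)
K <- 50; n_i <- 4                     # 50 clusters of size 4
X <- cbind(1, rnorm(K * n_i))         # stacked design matrix
beta_true <- c(2, -1)
y <- drop(X %*% beta_true) + rnorm(K * n_i)

beta <- c(0, 0)
for (iter in 1:5) {
  mu   <- drop(X %*% beta)            # identity link: mu = X beta
  S    <- y - mu                      # residual vector S = Y - mu
  D    <- X                           # A = Delta = I for this family/link
  Vinv <- diag(K * n_i)               # working independence (phi drops out)
  beta <- beta + solve(t(D) %*% Vinv %*% D, t(D) %*% Vinv %*% S)
}
```

Under these simplifying assumptions, the loop converges in one step to the ordinary least squares coefficients.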

Implementation details
The geeM package is coded entirely in R. This allows for easier modification by methodologists considering variants and extensions of GEE, as well as increased flexibility for some specific analyses.

Uses of the Matrix package
Solving the estimating equations in matrix form requires creating large matrices, most of whose entries are zero. The Matrix package allows us to easily exploit this sparsity with its sparse matrix representations, which store only the non-zero elements of a matrix. Without the memory savings from these representations, even moderately sized problems would lead to huge memory demands.
Further, we benefit from an impressive gain in the speed of operating on these large sparse matrices. Multiplication is the largest computational burden, and the speed gains on very large, very sparse matrices are immense. This is accomplished by a separately defined %*% method for objects of the "sparseMatrix" class. When multiplication is of the form $B^\top C$ or $B^\top B$, we can take advantage of the crossprod function.
In the modified Fisher scoring algorithm, we have many opportunities to take advantage of the "sparseMatrix" class. Two diagonal matrices are used, $A$ and $\Delta$, which we create using the Diagonal function. We also have the working correlation matrix, $R(\alpha)$, which is a sparse symmetric block diagonal matrix because observations within clusters may be correlated but observations between clusters are independent.
Additionally, the Fisher scoring algorithm requires that we compute the inverse of the working correlation matrix, $R(\alpha)$. Because we never need $R(\alpha)$ itself, we only compute $R(\alpha)^{-1}$ and build it directly for any given $\alpha$. For that we use the sparseMatrix function, to which we supply three vectors of the same length: an integer vector i of row positions of non-zero entries, an integer vector j of column positions of non-zero entries, and a vector x of entries.
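A toy sketch of this construction (the entry values a and b are placeholders, not the actual inverse-correlation entries): a block diagonal matrix for two clusters of size 2 built directly from (i, j, x) triplets.

```r
library(Matrix)
# Build a block diagonal matrix from (i, j, x) triplets, as done for
# R(alpha)^{-1}. Two clusters of size 2; a and b are placeholder values
# standing in for the diagonal and off-diagonal inverse entries.
a <- 1.2; b <- -0.4
i <- c(1, 1, 2, 2,  3, 3, 4, 4)   # row positions of non-zero entries
j <- c(1, 2, 1, 2,  3, 4, 3, 4)   # column positions
x <- c(a, b, b, a,  a, b, b, a)   # the entries themselves
Rinv <- sparseMatrix(i = i, j = j, x = x)
```

Entries outside the blocks are never supplied and never stored; indexing them simply returns zero.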

Analytic inverses
For three common correlation structures, independence, exchangeable, and AR-1, analytic inverses are available. In these cases we use the analytic inverses to create $R(\alpha)^{-1}$.
In the AR-1 case, we are able to quickly construct the inverse correlation matrix as a linear combination of three basic matrices, with coefficients depending on the value of $\alpha$ (Sutradhar and Kumar, 2003). The inverse can be constructed as
$$R(\alpha)^{-1} = \frac{1}{1 - \alpha^2}\left(L + \alpha^2 M - \alpha N\right),$$
where $L$, $M$, and $N$ are all block diagonal, the blocks of which take the forms $L_i = I_{n_i}$, $M_i = \mathrm{diag}(0, 1, \ldots, 1, 0)$, and $N_i$, the matrix with ones on the first off-diagonals and zeros elsewhere.
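This linear-combination construction is easy to check numerically for a single block (a minimal sketch; cluster size and $\alpha$ are arbitrary):

```r
# Check the analytic AR-1 inverse for one cluster of size n against solve().
n <- 5; alpha <- 0.6
R <- alpha^abs(outer(1:n, 1:n, "-"))                 # AR-1 correlation matrix
L <- diag(n)                                         # identity block
M <- diag(c(0, rep(1, n - 2), 0))                    # interior diagonal block
N <- diag(0, n); N[abs(row(N) - col(N)) == 1] <- 1   # first off-diagonals
Rinv <- (L + alpha^2 * M - alpha * N) / (1 - alpha^2)
```

The resulting tridiagonal matrix matches solve(R) to machine precision, while costing only a few vectorized assignments to build.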
These basic matrices are constructed only once at the beginning of the code and saved for later use.
The exchangeable correlation matrix is a little more complicated because the entries in the inverse depend on the cluster size (Sutradhar and Kumar, 2003). The analytic inverse can be computed as
$$R_i(\alpha)^{-1} = \frac{1}{1 - \alpha}\left[I_{n_i} - \frac{\alpha}{1 + (n_i - 1)\alpha} J_{n_i}\right],$$
where $I_{n_i}$ is the $n_i \times n_i$ identity matrix and $J_{n_i}$ is an $n_i \times n_i$ matrix of ones. In the implementation, we compute a vector with every entry in the correlation matrix (a vector of length $\sum_{i=1}^{K} n_i^2$) and create the new matrix with the sparseMatrix function.
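Again, the analytic form can be verified against a direct numeric inverse for one cluster (a minimal sketch with arbitrary size and $\alpha$):

```r
# Check the analytic exchangeable inverse against solve() for one cluster.
n <- 6; alpha <- 0.3
J <- matrix(1, n, n)                          # matrix of ones
R <- (1 - alpha) * diag(n) + alpha * J        # exchangeable correlation
Rinv <- (diag(n) - alpha / (1 + (n - 1) * alpha) * J) / (1 - alpha)
```

Note how the scalar in front of J depends on the cluster size n, which is why blocks of different sizes cannot share a single set of entry values.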

Numeric inverses
For all other correlation structures, we use the solve command in R. To describe the method, let us suppose that we have k different cluster sizes, $n_1, \ldots, n_k$, in descending order.
We first construct the working correlation matrix for all clusters of size $n_1$, then take the upper left corners of this matrix to construct correlation matrices of the appropriate sizes for clusters of size $n_2, \ldots, n_k$. We then calculate the inverses of each of these k matrices. Finally, we construct the block diagonal working correlation matrix for the whole sample using the inverses of the appropriate size. This technique exploits the fact that the inverse of a block diagonal matrix is the block diagonal matrix of the inverses of the blocks, allowing us to reduce the number of invocations of the solve function to k, the number of unique cluster sizes.
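A sketch of this trick (cluster sizes and working correlation are arbitrary here): invert one matrix per unique cluster size, then assemble the whole-sample inverse with the bdiag function from the Matrix package.

```r
library(Matrix)
# Numeric-inverse trick: solve() is called once per unique cluster size,
# then the block diagonal inverse is assembled with bdiag().
sizes <- c(4, 4, 3, 2)                     # cluster sizes, descending
alpha <- 0.4
R_max <- alpha^abs(outer(1:4, 1:4, "-"))   # correlation for the largest cluster
# Take upper left corners for the smaller sizes and invert each once
inv_by_size <- lapply(unique(sizes),
                      function(n) solve(R_max[1:n, 1:n, drop = FALSE]))
names(inv_by_size) <- unique(sizes)
# Whole-sample block diagonal inverse: one block per cluster
Rinv_all <- bdiag(lapply(sizes, function(n) inv_by_size[[as.character(n)]]))
```

Here solve runs three times (for sizes 4, 3, and 2) rather than once per cluster, which is where the savings come from when clusters greatly outnumber the distinct sizes.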
Using this technique, we have support for M-dependence (M should be specified by the user), unstructured correlation, fixed correlation, and user-defined correlation structures. Under M-dependence, two observations $x_{it}$ and $x_{it'}$ have correlation $\alpha_{|t - t'|}$ if $|t - t'| \le M$ and 0 otherwise. Under unstructured correlation, the correlation between $x_{it}$ and $x_{it'}$ is assumed to be $\alpha_{tt'}$. The fixed correlation option requires that the user provide a correlation matrix giving the correlations between all observations in the largest cluster. For a user-defined correlation structure, the function accepts a square matrix of the size of the largest cluster whose entries indicate which elements of the correlation matrix are assumed to be the same.
Note that in all correlation structures, if any variables are missing, then those rows are dropped. So, if rows have missing data, or a given subject has fewer than the maximum number of observations, then the leading principal submatrices of the appropriate dimensions are used. This may not preserve the desired correlation structure. The desired structure can be recovered by including rows with weights of zero for those time points with missing data.

Estimation of correlation parameters
For all supported correlation structures, we use simple moment estimators. These are the same estimators used by SAS proc genmod (SAS Institute Inc., 2010), and are estimated solely from the Pearson residuals.

Simulated count data
For a slightly larger data set, we simulated count data from an overdispersed Poisson distribution. There are 10,000 clusters with 5 observations in each cluster. The formula we use is count ~ time.
Here we use the log-linear link and the identity variance function.
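One plausible way to generate data of this kind is a shared gamma frailty per cluster, which induces both overdispersion and within-cluster correlation; the exact generating mechanism used for the timings is not specified here, so the sketch below is an assumption (and uses 1,000 clusters rather than 10,000 for speed). The commented geem call shows the shape of the fit.

```r
# Simulate overdispersed, clustered Poisson counts via a gamma frailty
# (assumed mechanism, for illustration only).
set.seed(1)
K <- 1000; n_i <- 5
id   <- rep(1:K, each = n_i)
time <- rep(0:(n_i - 1), K)
u    <- rep(rgamma(K, shape = 2, rate = 2), each = n_i)  # cluster frailty, mean 1
count <- rpois(K * n_i, lambda = u * exp(0.5 + 0.25 * time))
dat  <- data.frame(id, time, count)
# Hypothetical fit (requires the geeM package):
# fit <- geem(count ~ time, id = id, data = dat,
#             family = poisson, corstr = "ar1")
```

The frailty u is shared within each cluster, so counts in the same cluster are positively correlated and the marginal variance exceeds the marginal mean.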
Table 2 shows the speed results under four different working correlation structures. The average speed (Avg.) is calculated based on fitting the same data 5 times for each method, and the relative speed (Rel.) is the average speed of each method divided by the average speed of the fastest method.

Birth outcomes data
Finally, we move to a large data set. These data contain information on birth outcomes from Abrevaya (2006). We use gestational age as the continuous response and mother's age as the only covariate. We use the Gaussian family for the link and variance functions.
The data contain 141,929 clusters of size 2 or 3 each, for a total of 296,218 observations. It is worth noting that for a data set this size, all programs other than R needed to be shut down. If too many other programs are open, geeM may take a much longer time or even run out of memory.
Table 3 shows the speed results under four different working correlation structures. The average speed (Avg.) is calculated based on fitting the data 5 times for each method, and the relative speed (Rel.) is the average speed of each method divided by the average speed of the fastest method.

Illustrative use
Before concluding, we provide examples of how to use the geem function. Here we fit the ohio data with different link and variance functions. In this case, we use a skewed logit link function, as described in Prentice (1976), which allows us to remove the symmetry constraint imposed by the logit and probit link functions. According to Prentice, this type of skewed logit link may provide a better fit to dose response data. The inverse link function takes the form
$$\mu = \left(\frac{e^{\eta}}{1 + e^{\eta}}\right)^{k},$$
where $k$ is known. When $k = 1$ we have the logistic model, when $k < 1$ the distribution is negatively skewed, and when $k > 1$ the distribution is positively skewed. In the example, we let $k = 2$ and choose $\mu(1 - \mu)$ as the variance function. When defining these four functions, it is essential that they all be vectorized, that is, they accept a vector argument and return a vector of the same length.
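A sketch of the four user-defined functions for the skewed logit with $k = 2$ follows. The commented geem call at the end is a hypothetical illustration of how they would be passed in (it assumes the geeM package and the ohio data are available, and that resp, age, smoke, and id are columns of ohio).

```r
k <- 2
LinkFun <- function(mu)  log(mu^(1/k) / (1 - mu^(1/k)))  # eta = g(mu)
InvLink <- function(eta) (exp(eta) / (1 + exp(eta)))^k   # mu = g^{-1}(eta)
InvLinkDeriv <- function(eta) {
  p <- exp(eta) / (1 + exp(eta))
  k * p^k * (1 - p)                                      # d mu / d eta
}
VarFun <- function(mu) mu * (1 - mu)
# All four functions are vectorized: vector in, vector of the same length out.
FunList <- list(LinkFun, VarFun, InvLink, InvLinkDeriv)
# Hypothetical fit (requires geeM and the ohio data):
# fit <- geem(resp ~ age + smoke, id = id, data = ohio,
#             family = FunList, corstr = "exchangeable")
```

The derivative follows from the chain rule: with $p = e^\eta/(1+e^\eta)$ and $\mu = p^k$, we get $d\mu/d\eta = k\,p^{k-1}\,p(1-p) = k\,p^k(1-p)$.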

Conclusion
This paper describes a pure R implementation of a generalized estimating equation solver, relying heavily on the Matrix package. Coding in R allows for the use of user-defined link and variance functions, giving a useful kind of flexibility.
The Matrix package is what enables a fast implementation. Sparse matrix representations give ways to efficiently build the large, sparse matrices needed to solve the GEEs without using too much memory. Operations on these sparse matrices are also much faster.
Much of the work in fitting GEEs comes from matrix inversion or solving a system of equations. For three correlation structures, we are able to analytically invert the largest matrix. For the remaining correlation structures, we rely on inverting a relatively small number of blocks to find the inverse.
Unfortunately, this technique of gaining speed in matrix inversion can cause problems. Because we invert matrices instead of solving systems, it is possible that the function may be less numerically stable.
Finally, we show that, in some cases, the geeM package is faster than gee or geepack. In other cases it is not much slower, with geeM tending to run faster on larger data sets.

Table 1 :
Speed comparisons for the ohio data set.

Table 2 :
Speed comparisons for the simulated count data set.

Table 3 :
Speed comparisons for the birth outcomes data.