Generalized linear models (GLMs) are the workhorse of many inferential problems. In the modern era of high-dimensional settings, such models have also proven to be effective exploratory tools. Most attention has been paid to the Gaussian, binomial and Poisson settings, which have efficient computational implementations and where the dispersion parameter is either largely irrelevant or absent. However, general GLMs have dispersion parameters that affect the value of the log-likelihood and, thereby, the information criteria used for model selection.
High-dimensional inference problems are studies where the number of predictors \(p\) exceeds the sample size \(n\). For such settings, Augugliaro et al. (2013) proposed a new approach based on the differential geometrical representation of the likelihood, in particular for a generalized linear model (GLM). The method does not require an explicit penalty function and is called differential geometric LARS (dgLARS) because it generalizes the geometrical ideas on which least angle regression (Efron et al. 2004) is based. Pazira et al. (2018) extended the dgLARS method to high-dimensional GLMs with exponential dispersion models and arbitrary link functions. In the same paper, the authors adapted the classical estimation of the dispersion parameter to the high-dimensional setting and also proposed a new, more efficient estimator. Wit et al. (2020) extended the dgLARS method to sparse inference in relative risk regression models.
From a computational point of view, the main problem of the dgLARS method is related to the standard predictor-corrector (PC) algorithm developed by Augugliaro et al. (2013) to compute the implicitly defined solution path. The PC algorithm becomes computationally intractable when working with thousands of variables because, in the prediction step, the number of arithmetic operations needed to compute the Euler predictor scales as the cube of the number of variables. This leads to a cubic increase in the run time needed to compute the solution curve.
In this paper we briefly explain an improved version of the PC algorithm, proposed in Pazira et al. (2018) and Pazira (2020), simply called the improved PC (IPC) algorithm. The IPC algorithm calculates the solution path at fewer, but more relevant, points, greatly reducing the computational burden. In addition, we use a much more efficient cyclic coordinate descent (CCD) algorithm (Augugliaro et al. 2012) to calculate a rough dgLARS solution curve for ultra high-dimensional data. In this paper, we focus on the behaviour of the IPC algorithm. The new version of the R package dglars implements both the CCD and IPC algorithms (Augugliaro et al. 2020); the user can still use the old PC algorithm by installing an archived version of the package. The package is available on the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org/package=dglars.
The remainder of this paper is organized as follows. First, we briefly review the differential geometry underlying the dgLARS method and the dispersion parameter estimation methods. Next, we describe the new functions implemented in the updated version of the dglars package and show how they can be used to estimate the dispersion parameter. Then, various simulation studies are performed to evaluate the performance and run times of the proposed estimation algorithms. Finally, we use the functions implemented in the dglars package to illustrate its use on an example data set.
In this section we briefly describe the dgLARS method and the methods used to estimate the dispersion parameter; the interested reader is referred to Augugliaro et al. (2014) and Pazira et al. (2018) for details. In general, the aim of the dgLARS method is to define a continuous model path that attains the highest likelihood with the fewest variables.
Let \(Y\) be a scalar response variable and let \(\boldsymbol{X} = (X_1, \ldots, X_p)^\top\) be a vector of predictors. When we work with \(n\) observations and \(p > n\), the maximum likelihood estimator is not defined and sparse estimation methods are needed. In order to study the geometrical structure of a GLM, we shall assume that the distribution of \(Y\) belongs to an exponential dispersion family with density
\[
p(y; \theta, \phi) = \exp\left\{\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right\},
\]
where \(\theta\) is the canonical parameter and \(\phi > 0\) is the dispersion parameter; the mean \(\mu = b'(\theta)\) is related to the linear predictor \(\eta = \boldsymbol{X}^\top \boldsymbol{\beta}\) by a link function \(g(\mu) = \eta\).
The dgLARS estimator is based on a differential geometric characterization of the Rao score test statistic, obtained using the inner product between the bases of the tangent space and the tangent residual vector. As shown in Pazira et al. (2018), the signed Rao score test statistic of the \(m\)th predictor can be written as
\[
r_m^u(\boldsymbol{\beta}, \phi) = \frac{\partial_m \ell(\boldsymbol{\beta}, \phi)}{\sqrt{\mathcal{I}_m(\boldsymbol{\beta}, \phi)}},
\]
where \(\partial_m \ell\) denotes the \(m\)th element of the score vector and \(\mathcal{I}_m\) the corresponding diagonal element of the Fisher information matrix.
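To make this definition concrete, the following minimal R sketch computes the signed Rao score statistics of all predictors for a simulated Poisson GLM, evaluated at the intercept-only fit. It is an illustration only, not the internal dglars implementation (which is written in Fortran 90):

set.seed(1)
n <- 20; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- rpois(n, lambda = 2)
mu <- rep(mean(y), n)               # fitted means of the intercept-only model
score <- drop(crossprod(X, y - mu)) # m-th score element: sum_i x_im (y_i - mu_i)
info  <- drop(crossprod(X^2, mu))   # m-th Fisher info diagonal: sum_i x_im^2 mu_i
score / sqrt(info)                  # signed Rao score test statistics

The predictor whose statistic is largest in absolute value is the first to enter the active set.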
Formally, the dgLARS is a method for constructing a path of solutions, indexed by a positive parameter \(\gamma\), such that the absolute value of the Rao score test statistic of each predictor in the active set is equal to \(\gamma\), while it is smaller than \(\gamma\) for all predictors outside the active set.
At each transition point we have a change in the active set. We shall say that \(\gamma\) is an inclusion transition point when the Rao score test statistic of a non-active predictor reaches \(\gamma\), so that this predictor is included in the active set; in the dgLASSO variant, \(\gamma\) is an exclusion transition point when the coefficient path of an active predictor crosses zero and that predictor is removed.
Step | Algorithm
---|---
1 | First compute …
2 | Set …
3 | Repeat
4 | Use (8) to compute …
5 | If method = "dgLASSO" then
6 | use (9) and then (10) to compute …
7 | set …
8 | Set …
9 | Use (7) to compute …
10 | Use …
11 | For all …
12 | If …
13 | use (11) to compute …
14 | first set …
15 | If …
16 | update …
17 | Until convergence
Computationally, the problem of how to estimate the dgLARS solution curve can be decomposed into two sub-problems. The first is to define an efficient computational method to find the transition points, i.e., the values of the tuning parameter corresponding to a change in the active set; at each transition point, say \(\gamma^{(k)}\), a predictor is included in or excluded from the active set. The second sub-problem is to compute the solution of the dgLARS estimating equations at any fixed value of the tuning parameter.
Augugliaro et al. (2013) proposed a predictor-corrector (PC) algorithm to solve the two sub-problems. Although this algorithm can compute the solution curve for moderately large problems, identifying the transition points is extremely inefficient and can lead to a significant increase in computational time. This problem is highlighted in Pazira et al. (2018) and Pazira (2020), where an improvement of the original PC algorithm is also proposed. In order to make this paper self-contained, we briefly review this algorithm.
Let \(\gamma^{(k)}\) denote the current value of the tuning parameter. Since the step size \(\Delta\gamma^{(k)} = \gamma^{(k)} - \gamma^{(k+1)}\) determines how far the Euler predictor moves along the solution curve, its choice governs both the accuracy of the predicted solution and the number of corrector iterations. From (4), we know that a non-active predictor joins the active set exactly when the absolute value of its Rao score test statistic reaches \(\gamma\); the IPC algorithm exploits this condition to compute, at each iteration, the step size that brings the curve directly to the next transition point, avoiding the many small steps taken by the original PC algorithm.
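The predictor-corrector scheme itself is easiest to see on a toy example. The following sketch is purely schematic: it traces a scalar implicitly defined curve rather than the dgLARS estimating equations, but it combines the same two ingredients used by the PC and IPC algorithms, an Euler predictor followed by a Newton-Raphson corrector.

## Trace the solutions of f(b, g) = 0, with f(b, g) = b^3 + b - g, as g grows.
f  <- function(b, g) b^3 + b - g
fb <- function(b) 3 * b^2 + 1          # df/db
b <- 0; g <- 0; dg <- 0.5              # start at the exact solution b(0) = 0
for (k in 1:4) {
  b <- b + dg / fb(b)                  # Euler predictor: db/dg = -f_g/f_b = 1/f_b
  g <- g + dg
  repeat {                             # corrector: Newton-Raphson in b at fixed g
    step <- f(b, g) / fb(b)
    b <- b - step
    if (abs(step) < 1e-10) break
  }
  cat(sprintf("g = %.1f  b = %.6f\n", g, b))
}

A large step size saves predictor steps but costs corrector iterations; the IPC algorithm resolves this trade-off by stepping directly to the transition points.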
Since the dispersion parameter \(\phi\) is unknown, it has to be estimated as well; for the high-dimensional setting, Pazira et al. (2018) proposed the generalized refitted cross-validation (GRCV) estimator. The idea of the GRCV method is to split the data \((\boldsymbol{y}, X)\) randomly into two halves, \((\boldsymbol{y}_{(1)}, X_{(1)})\) and \((\boldsymbol{y}_{(2)}, X_{(2)})\); this split constitutes the first stage of the procedure.
In the second stage, we perform model selection on each training set to determine two small subsets of selected variables, say \(\hat{\mathcal{A}}_1\) and \(\hat{\mathcal{A}}_2\).
In the third stage, the coefficients of the variables selected on one half are re-estimated by maximum likelihood on the other half, so that selection and estimation are based on independent data.
Finally, in the fourth stage, we estimate the dispersion parameter from each refitted model and combine the two resulting estimates.
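The following R sketch illustrates the four stages under simplifying assumptions: the function grcv_sketch() is hypothetical, a marginal correlation screen stands in for the dgLARS/BIC model selection actually performed by the package, and the two estimates are combined by a plain average.

## Schematic GRCV for a Gamma GLM (illustration only, not the package code)
grcv_sketch <- function(y, X, k = 2) {
  n <- length(y)
  halves <- split(sample(n), rep(1:2, length.out = n))  # stage 1: random split
  phi_hat <- numeric(2)
  for (s in 1:2) {
    train <- halves[[s]]; test <- halves[[3 - s]]
    ## stage 2: select k variables on one half (stand-in selection rule)
    sel <- order(-abs(cor(X[train, ], y[train])))[1:k]
    ## stage 3: refit the selected model on the other half
    fit <- glm(y[test] ~ X[test, sel, drop = FALSE], family = Gamma("log"))
    ## stage 4: generalized Pearson estimate from the refitted model
    phi_hat[s] <- sum(residuals(fit, type = "pearson")^2) / fit$df.residual
  }
  mean(phi_hat)                                         # combine the two estimates
}

## Small example with true dispersion phi = 1
set.seed(1)
n <- 60; p <- 30
X <- matrix(runif(n * p), n, p)
y <- rgamma(n, shape = 1, scale = exp(0.5 + 2 * X[, 1] + 2 * X[, 2]))
grcv_sketch(y, X)

Because selection and dispersion estimation use disjoint halves of the data, the estimate is protected against the downward bias caused by overfitting in the selection step.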
Figure 1 describes the four-stage procedure for calculating the GRCV estimate of the dispersion parameter. Since in the second stage of the GRCV procedure the dispersion parameter has to be estimated, an iterative procedure can be defined to reduce its dependence on the generalized Pearson estimator: the algorithm iterates the four stages, such that at the \(k\)th iteration the model selection stage uses the GRCV estimate obtained at iteration \(k-1\).
Pazira et al. (2018) showed that the GRCV estimator is more accurate and stable than the classical estimators of the dispersion parameter in the high-dimensional setting.
The dglars package (Augugliaro et al. 2020) is a collection of computational tools related to the inference procedure of the dgLARS method in the R programming environment.
The dglars() function

Different from the previous version, the new version of the dglars package implements the gaussian, binomial, poisson, Gamma and inverse.gaussian families with the most commonly used link functions.
The main function of this package, dglars(), is a wrapper function implemented to handle the formula interface usually used in R to specify a model:
dglars(formula, family = gaussian, g, unpenalized, b_wght, data, subset,
contrast = NULL, control = list())
This function is used to compute the dgLARS/dgLASSO solution curve. As
in the standard glm
function, the user can specify the family and link
functions using the argument family
; see the next section for an
example of a Gamma GLM. This can be a character string naming a family
function or the result of a call to a family function. In the new
version of the package, the model can be specified by combining family
and link functions as described in Table 2. By default
the gaussian
family with identity
link function is used. In the
future, the package will be updated with the negative.binomial
family
with the link functions log
, identity
, and sqrt
.
Family | Available link functions
---|---
gaussian | "identity", "log", "inverse"
binomial | "logit", "probit", "cauchit", "cloglog", "log"
poisson | "log", "identity", "sqrt"
Gamma | "inverse", "log", "identity"
inverse.gaussian | "1/mu^2", "inverse", "log", "identity"
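For example, the family argument can be given either as a character string or as the result of a call to a family function, as described above; this usage sketch assumes a response y and a design matrix X are available:

R> fit1 <- dglars(y ~ X, family = "poisson")               # string, default link
R> fit2 <- dglars(y ~ X, family = poisson(link = "sqrt"))  # family-function call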
The argument control
is a named list of control parameters with the
following elements:
control = list(algorithm = "pc", method = "dgLASSO", g0 = NULL, nNR = 200,
nv = NULL, eps = 1.0e-05, np = NULL, dg_max = 0, cf = 0.5,
NReps = 1.0e-06, ncrct = 50, nccd = 1.0e+05)
By using the control parameter algorithm
it is possible to select the
algorithm used to fit the dgLARS solution curve. Setting
algorithm = "pc"
selects the default IPC algorithm; the CCD algorithm
is used when algorithm = "ccd"
is selected. To reduce the
computational time needed to compute the dgLARS/dgLASSO solution curve,
the algorithms have been written in Fortran 90. The argument method
is
used to choose between the dgLASSO solution curve (method = "dgLASSO"
)
and the dgLARS solution curve (method = "dgLARS"
).
The g0 control parameter is used to define the smallest value of the tuning parameter. By default this parameter is set to 1.0e-06 when the sample size is larger than the number of predictors; in the high-dimensional setting a larger value is advisable to avoid convergence problems.
When the Gaussian, Gamma or inverse Gaussian family is used, dglars()
returns the vector of estimates for the dispersion parameter; by
default, the generalized Pearson statistic is used as estimator but the
user can use the function phihat()
to specify other estimators. For
the binomial and Poisson family, the dispersion parameter is assumed
known and equal to one.
The grcv() and phihat() functions

Since the Gaussian, Gamma and inverse Gaussian error distributions have an additional dispersion parameter, this package implements the functions grcv() and phihat() to estimate it. The function grcv() has the following form:
grcv(object, type = c("BIC", "AIC"), nit = 10, control = list(), trace = FALSE, ...)
where object
is a fitted dglars
object, type
is the measure of
goodness-of-fit used in Step 2 of the algorithm reported in
Figure 1. With the current version, the user can choose
between the Bayesian (default) and the Akaike information criteria. The
argument nit
is used to specify the number of iterations of the GRCV
procedure. The resulting estimate is obtained as the median of the nit
iterations. control
is a list of control parameters passed to the
function dglars
, whereas trace
is a logical variable specifying
whether or not information is printed as the GRCV algorithm proceeds.
Finally, the argument ...
is used to pass the arguments to the method
functions AIC.dglars
and BIC.dglars
.
As grcv()
is only used to estimate the dispersion parameter using the
GRCV estimator, the function phihat()
is specifically developed to
handle all the estimators of the dispersion parameter available in
the dglars
package. This function is defined as follows:
phihat(object, type = c("pearson", "deviance", "mle", "grcv"), g = NULL, ...)
where object
is a fitted dglars
object and type
is string
specifying the estimator of the dispersion parameter. The user can
select the Pearson estimator (default), the deviance estimator, the MLE
estimator or the GRCV estimator. The optional argument g
is a vector
specifying the values of the tuning parameter at which the estimates are computed; for an example see the next section.
see next section. Finally, the argument ...
is used to pass the
argument to the function grcv
. The function phihat()
returns a
vector with the estimates of the dispersion parameter; when
type = "grcv"
all elements of this vector are the same, because the
GRCV estimator does not depend on the tuning parameter.
The function phihat()
is called by the method functions
logLik.dglars()
, AIC.dglars()
and coef.dglars()
:
logLik(object, phi = c("pearson", "deviance", "mle", "grcv"), g = NULL, ...)
AIC(object, phi = c("pearson", "deviance", "mle", "grcv"), k = 2,
complexity = c("df", "gdf"), g = NULL, ...)
coef(object, type = c("pearson", "deviance", "mle", "grcv"), g = NULL, ...)
when the argument phi (or type in coef()) is set to any of the four estimation methods, i.e., "pearson", "deviance", "mle" or "grcv". In the dglars package, the summary() method:
summary(object, type = c("AIC", "BIC"), digits = max(3, getOption("digits") - 3), ...)
uses the generalized Pearson estimator to define the BIC or AIC values, but the user can use "..." to pass to the method AIC() the additional arguments needed to compute a more general measure of goodness-of-fit, e.g., "phi", "k" or "complexity".
To gain more insight into the new features of the dglars package, we simulated a data set from a Gamma regression model with the log link function, where the sample size is n = 50, the number of predictors is p = 100, and only the first s = 2 regression coefficients are different from zero.
First we install and load the dglars package in the R session with the following code:
R> install.packages("dglars")
R> library("dglars")
The corresponding R code is given by:
R> set.seed(11235)
R> n <- 50
R> p <- 100
R> s <- 2
R> X <- matrix(runif(n = n * p), n, p)
R> bs <- rep(2, s)
R> Xs <- X[, 1:s]
R> eta <- drop(0.5 + Xs %*% bs)
R> mu <- Gamma("log")$linkinv(eta)
R> shape <- 1
R> phi <- 1 / shape
R> y <- rgamma(n, shape = shape, scale = mu * phi)
R> fit <- dglars(y ~ X, family = Gamma("log"),
+ control = list(algorithm = "pc", method = "dgLARS",
+ g0 = 0.5))
We use the argument g0=0.5
in the function dglars
to avoid
convergence problems coming from the high-dimensionality of the data.
The fit
object is an S3 object of class ‘dglars
’, for which the method function
summary.dglars()
can be used to obtain more information about the
estimated sequence of models. The following R code shows the output
printed by the summary.dglars()
method with BIC criterion and the GRCV
estimate for the dispersion parameter.
R> set.seed(11235)
R> summary(fit, type = "BIC", phi = "grcv", control = list(g0 = 0.5))
Call: dglars(formula = y ~ X, family = Gamma("log"), control = list(algorithm = "pc",
method = "dgLARS", g0 = 0.5))
Sequence g %Dev df BIC Rank
2.5003 0.00000 2 381.6 22
+ X1
1.9828 0.08380 3 378.4 20
1.9827 0.08381 3 378.4 19
+ X2
1.5384 0.20214 4 372.3 8
1.5314 0.20372 4 372.1 7
+ X12
1.3876 0.26004 5 371.3 2
1.3861 0.26060 5 371.2 1 <-
+ X74
1.2834 0.29734 6 372.0 6
1.2833 0.29738 6 372.0 5
+ X31
1.1688 0.33733 7 372.5 9
+ X100
1.1065 0.36541 8 374.0 10
+ X24
0.9437 0.44169 9 371.5 4
0.9413 0.44271 9 371.4 3
+ X71
0.9208 0.45310 10 374.4 11
+ X9
0.8460 0.49003 11 375.2 13
0.8436 0.49117 11 375.1 12
+ X16
0.7450 0.53586 12 375.2 15
0.7447 0.53597 12 375.2 14
+ X64
0.7252 0.54783 13 378.1 18
0.7250 0.54793 13 378.1 17
+ X18
0.5902 0.62506 14 375.4 16
+ X6
0.5821 0.62907 15 379.0 21
+ X36
0.5659 0.63773 16 382.2 23
+ X37
0.5279 0.65923 17 384.3 25
0.5278 0.65929 17 384.3 24
+ X93
0.5000 0.67501 18 386.8 26
Details:
BIC values computed using k = 3.912 and complexity = 'df'
dispersion parameter estimated by 'grcv'
===============================================================
Summary of the Selected Model
Formula: y ~ X1 + X2 + X12
Family: 'Gamma'
Link: 'log'
Coefficients:
Estimate
Int. 1.7494
X1 0.9320
X2 0.5119
X12 0.1749
Dispersion parameter: 1.044 (estimated by 'grcv' method)
---
g: 1.386
Null deviance: 88.74
Residual deviance: 65.62
BIC: 371.21
Algorithm 'pc' ( method = 'dgLARS' )
From this output, we can see that the dgLARS method first finds the true predictors (X1 and X2) and only then includes false predictors. The ranking of the estimated models, obtained using the number of estimated non-zero coefficients as the measure of model complexity (complexity = "df"), is also shown; the corresponding best model is identified by an arrow on the right. The formula of the identified best
identified by an arrow on the right. The formula of the identified best
model, the corresponding estimated coefficients and the estimate of the
dispersion parameter are shown in the second section of the output.
These values are obtained at the optimal value of the tuning parameter, i.e., g = 1.386.
Since the deviance, the MLE and the generalized Pearson estimators of the dispersion parameter depend on the tuning parameter, the whole sequence of their estimates along the solution curve can be obtained with the phihat() function.
For example, with the following R code we can see the sequence of values
of the tuning parameter with the estimated values of the dispersion
parameter by means of the generalized Pearson, deviance, MLE and GRCV
methods. For the GRCV method we apply the BIC criterion and nit=10
iterations inside the algorithm.
R> set.seed(11235)
R> g <- fit$g
R> phi.grcv <- phihat(fit, type = "grcv", control = list(g0 = 0.5))
R> phi.pear <- phihat(fit, type = "pearson")
R> phi.dev <- phihat(fit, type = "deviance")
R> phi.mle <- phihat(fit, type = "mle")
R> path <- cbind(g, phi.pear, phi.dev, phi.mle, phi.grcv)
R> print(path, digits = 4)
g phi.pear phi.dev phi.mle phi.grcv
[1,] 2.5003 2.2017 1.8111 1.4327 1.044
[2,] 1.9828 1.9604 1.6939 1.3309 1.044
[3,] 1.9827 1.9603 1.6938 1.3309 1.044
[4,] 1.5384 1.6245 1.5065 1.1829 1.044
[5,] 1.5314 1.6197 1.5035 1.1809 1.044
[6,] 1.3876 1.4518 1.4275 1.1085 1.044
[7,] 1.3861 1.4499 1.4264 1.1078 1.044
[8,] 1.2834 1.3472 1.3857 1.0599 1.044
[9,] 1.2833 1.3470 1.3856 1.0598 1.044
[10,] 1.1688 1.2357 1.3365 1.0071 1.044
[11,] 1.1065 1.1848 1.3096 0.9696 1.044
[12,] 0.9437 1.0242 1.1797 0.8659 1.044
[13,] 0.9413 1.0218 1.1775 0.8645 1.044
[14,] 0.9208 1.0189 1.1837 0.8502 1.044
[15,] 0.8460 0.9425 1.1314 0.7988 1.044
[16,] 0.8436 0.9394 1.1289 0.7972 1.044
[17,] 0.7450 0.8419 1.0561 0.7340 1.044
[18,] 0.7447 0.8416 1.0559 0.7338 1.044
[19,] 0.7252 0.8336 1.0560 0.7169 1.044
[20,] 0.7250 0.8334 1.0557 0.7167 1.044
[21,] 0.5902 0.6622 0.8993 0.6045 1.044
[22,] 0.5821 0.6704 0.9144 0.5986 1.044
[23,] 0.5659 0.6676 0.9185 0.5858 1.044
[24,] 0.5279 0.6339 0.8894 0.5537 1.044
[25,] 0.5278 0.6338 0.8893 0.5536 1.044
[26,] 0.5000 0.6160 0.8740 0.5300 1.044
With the following R code, we can specify the values of the tuning parameter at which the dispersion parameter is estimated:
R> set.seed(11235)
R> new_g <- seq(range(fit$g)[2], range(fit$g)[1], by = -0.5)
R> phi.grcv <- phihat(fit, g = new_g, type = "grcv",
+ control = list(g0 = 0.5))
R> phi.pear <- phihat(fit, g = new_g, type = "pearson")
R> phi.dev <- phihat(fit, g = new_g, type = "deviance")
R> phi.mle <- phihat(fit, g = new_g, type = "mle")
R> path <- cbind(new_g, phi.pear, phi.dev, phi.mle, phi.grcv)
R> print(path, digits = 4)
new_g phi.pear phi.dev phi.mle phi.grcv
[1,] 2.5003 2.2017 1.8111 1.4327 1.044
[2,] 2.0003 1.9677 1.6985 1.3340 1.044
[3,] 1.5003 1.6072 1.5117 1.1647 1.044
[4,] 1.0003 1.0817 1.2328 0.9004 1.044
[5,] 0.5003 0.6163 0.8743 0.5302 1.044
Finally, we show the output of the function summary.dglars() with the
with the
generalized Pearson estimator for a comparison with the results yielded
by the GRCV method.
R> summary(fit, type = "BIC", phi = "pearson")
Call: dglars(formula = y ~ X, family = Gamma("log"), control = list(algorithm = "pc",
method = "dgLARS", g0 = 0.5))
Sequence g %Dev df BIC Rank
2.5003 0.00000 2 382.5 26
+ X1
1.9828 0.08380 3 380.1 25
1.9827 0.08381 3 380.1 24
+ X2
1.5384 0.20214 4 374.4 17
1.5314 0.20372 4 374.3 16
+ X12
1.3876 0.26004 5 373.1 8
1.3861 0.26060 5 373.1 7
+ X74
1.2834 0.29734 6 373.6 10
1.2833 0.29738 6 373.6 9
+ X31
1.1688 0.33733 7 373.7 11
+ X100
1.1065 0.36541 8 375.0 22
+ X24
0.9437 0.44169 9 371.3 3
0.9413 0.44271 9 371.2 2
+ X71
0.9208 0.45310 10 374.1 14
+ X9
0.8460 0.49003 11 373.9 13
0.8436 0.49117 11 373.8 12
+ X16
0.7450 0.53586 12 372.3 6
0.7447 0.53597 12 372.3 5
+ X64
0.7252 0.54783 13 374.9 21
0.7250 0.54793 13 374.9 20
+ X18
0.5902 0.62506 14 368.1 1 <-
+ X6
0.5821 0.62907 15 371.5 4
+ X36
0.5659 0.63773 16 374.2 15
+ X37
0.5279 0.65923 17 374.8 19
0.5278 0.65929 17 374.8 18
+ X93
0.5000 0.67501 18 376.3 23
Details:
BIC values computed using k = 3.912 and complexity = 'df'
dispersion parameter estimated by 'pearson'
===============================================================
Summary of the Selected Model
Formula: y ~ X1 + X2 + X9 + X12 + X16 + X18 + X24 + X31 + X64 + X71 +
X74 + X100
Family: 'Gamma'
Link: 'log'
Coefficients:
Estimate
Int. 0.6492
X1 1.6660
X2 1.2259
X9 -0.1183
X12 0.5763
X16 -0.0987
X18 -0.1471
X24 0.6490
X31 0.5249
X64 -0.2859
X71 -0.2110
X74 0.0810
X100 -0.6195
Dispersion parameter: 0.6622 (estimated by 'pearson' method)
---
g: 0.5902
Null deviance: 88.74
Residual deviance: 33.27
BIC: 368.05
Algorithm 'pc' ( method = 'dgLARS' )
These outputs show that different dispersion estimators can lead to different final models. By using the GRCV estimator, the dgLARS method selects a very small model containing the true predictors, that is y ~ X1 + X2 + X12, while using the generalized Pearson estimator the final model contains 12 predictors. We note, however, that
the final model selected by the dgLARS method is very sensitive to the
(slightly random) value of the GRCV estimator. Although the GRCV tends
to work better than the generalized Pearson estimator, no strong
conclusions should be attached to this particular example.
In this section we illustrate the difference in performance between the original PC and the new IPC algorithms; for an extensive simulation study see the next section. As mentioned before, the new version of the dglars package only implements the IPC and CCD algorithms to compute the dgLARS solution curve. Therefore, to run the original PC algorithm, we use version 1.0.5 of the package, available from the CRAN archive.
We consider the following R code to simulate a Poisson regression model with the canonical link function (link = "log"), sample size equal to n = 100 and p = 5 predictors:
R> set.seed(11235)
R> n <- 100
R> p <- 5
R> X <- matrix(abs(rnorm(n * p)), n, p)
R> b <- 1:2
R> eta <- drop(b[1] + (X[, 1] * b[2]))
R> mu <- poisson()$linkinv(eta)
R> y <- rpois(n, mu)
Only the first predictor is set to affect the response variable. We first fit the model with the new version of the package, which uses the IPC algorithm:
R> fit_ipc <- dglars(y ~ X, family = poisson,
+ control = list(algorithm = "pc"))
By running the following commands we remove the current version and then install version 1.0.5 from the CRAN archive; install.packages() can do this for us, such that if the package is already installed, this function replaces it with the specified package built from source:
R> detach(name = "package:dglars", unload = TRUE)
R> remove.packages(pkgs = "dglars")
R> Old_dglars <- "https://cran.r-project.org/src/contrib/Archive/dglars/
+ dglars_1.0.5.tar.gz"
R> install.packages(Old_dglars, repos = NULL, type = "source")
R> library("dglars")
R> fit_pc <- dglars(y ~ X, family = "poisson",
+ control = list(algorithm = "pc"))
By printing the ‘dglars’ object for our simulated data set, we can see that the number of points composing the dgLASSO solution curve achieved by the PC algorithm is 25:
R> fit_pc
Call: dglars(formula = y ~ X, family = "poisson", control = list(algorithm = "pc"))
Sequence g Dev %Dev df
68.2417 9403.51 0.0000 1
+X1
10.1351 623.36 0.9337 2
3.7587 186.10 0.9802 2
2.6310 143.85 0.9847 2
2.5719 141.99 0.9849 2
2.5718 141.99 0.9849 2
+X4
1.9682 124.04 0.9868 3
1.6730 116.91 0.9876 3
1.5270 113.79 0.9879 3
1.4544 112.34 0.9881 3
1.4182 111.64 0.9881 3
1.4001 111.30 0.9882 3
1.3820 110.96 0.9882 3
+X3
1.1309 104.95 0.9888 4
1.0056 102.37 0.9891 4
0.9430 101.20 0.9892 4
0.9117 100.63 0.9893 4
0.8804 100.09 0.9894 4
+X2
0.5796 93.69 0.9900 5
0.4302 91.44 0.9903 5
0.3557 90.57 0.9904 5
0.3186 90.19 0.9904 5
0.3000 90.02 0.9904 5
0.2814 89.85 0.9904 5
+X5
0.0001 88.01 0.9906 6
Algorithm pc ( method = dgLASSO ) with exit = 0
The number of points of the solution curve computed by the PC algorithm and the corresponding values of the tuning parameter can be obtained with the following code:
R> fit_pc$np
[1] 25
R> fit_pc$g
[1] 68.2417321 10.1350645 3.7587453 2.6310292 2.5719482 2.5717722
[7] 1.9681589 1.6729772 1.5269781 1.4543691 1.4181613 1.4000815
[13] 1.3820137 1.1308691 1.0055607 0.9429753 0.9117001 0.8804338
[19] 0.5796210 0.4302023 0.3557410 0.3185725 0.3000037 0.2814428
[25] 0.0001000
By printing fit_ipc, we can see that the IPC algorithm reduces the number of points needed to obtain the solution curve at the change points, leading to significant computational savings:
R> fit_ipc
Call: dglars(formula = y ~ X, family = poisson, control = list(algorithm = "pc"))
Sequence g Dev %Dev n. non zero
68.241732 9403.51 0.0000 1
+ X1
10.135064 623.36 0.9337 2
3.758745 186.10 0.9802 2
2.631029 143.85 0.9847 2
2.571948 141.99 0.9849 2
2.571772 141.99 0.9849 2
+ X4
1.382273 110.97 0.9882 3
1.382018 110.96 0.9882 3
+ X3
0.880438 100.09 0.9894 4
+ X2
0.281457 89.85 0.9904 5
0.281445 89.85 0.9904 5
+ X5
0.000001 88.01 0.9906 6
Algorithm 'pc' ( method = 'dgLASSO' ) with exit = 0
The IPC algorithm needs fewer than half the points required by the PC algorithm, roughly halving the computational effort:
R> fit_ipc$np
[1] 12
R> fit_ipc$g
[1] 6.824173e+01 1.013506e+01 3.758745e+00 2.631029e+00 2.571948e+00
[6] 2.571772e+00 1.382273e+00 1.382018e+00 8.804378e-01 2.814572e-01
[11] 2.814454e-01 9.999996e-07
From a computational point of view, the main consequence of the technique used in the IPC algorithm is a decrease in run time, obtained by adapting the step size so as to find the true transition points. The next section investigates the overall performance of the IPC algorithm by a simulation study.
In this section we present a simulation study to investigate the performance of the improved PC algorithm implemented in the dglars package. Although the PC and IPC algorithms compute the same sequence of active sets, they require a different number of arithmetic operations to get there. The main problem of the PC algorithm is the large number of arithmetic operations needed to compute the solution curve.
| | | IPC CPU (s) | IPC np | PC CPU (s) | PC np | IPC CPU (s) | IPC np | PC CPU (s) | PC np |
|---|---|---|---|---|---|---|---|---|---|
| 0.0 | 100 | 0.018 (0.003) | 52.011 (5.686) | 0.022 (0.005) | 79.956 (12.68) | 0.250 (0.031) | 103.74 (5.439) | 0.406 (0.077) | 190.74 (16.03) |
| | 1000 | 0.193 (0.023) | 67.622 (6.225) | 0.278 (0.056) | 99.267 (15.51) | 4.333 (0.458) | 165.94 (9.120) | 5.489 (0.935) | 233.06 (20.51) |
| | 3000 | 0.775 (0.080) | 72.511 (6.469) | 0.854 (0.165) | 111.00 (16.32) | 15.068 (1.351) | 183.47 (8.601) | 19.933 (2.422) | 256.84 (18.17) |
| | 5000 | 1.134 (0.132) | 68.933 (6.682) | 1.205 (0.232) | 97.100 (16.58) | 24.219 (2.934) | 182.83 (11.08) | 31.844 (5.140) | 249.78 (22.66) |
| | 7000 | 1.553 (0.190) | 74.378 (6.604) | 1.962 (0.444) | 109.52 (19.62) | 37.149 (3.198) | 190.58 (8.760) | 49.291 (6.439) | 262.67 (19.96) |
| 0.5 | 100 | 0.016 (0.003) | 49.167 (5.613) | 0.022 (0.004) | 80.144 (12.57) | 0.174 (0.022) | 91.178 (5.170) | 0.274 (0.049) | 162.03 (14.02) |
| | 1000 | 0.150 (0.021) | 59.311 (6.272) | 0.196 (0.048) | 81.611 (13.94) | 3.129 (0.467) | 143.50 (11.39) | 4.116 (0.870) | 207.72 (25.10) |
| | 3000 | 0.642 (0.067) | 65.111 (7.083) | 0.684 (0.141) | 89.100 (16.59) | 10.365 (1.255) | 154.49 (9.871) | 13.933 (2.611) | 216.19 (23.57) |
| | 5000 | 1.095 (0.126) | 69.122 (6.505) | 1.212 (0.235) | 98.822 (15.34) | 18.663 (2.355) | 163.66 (10.87) | 25.388 (4.095) | 225.16 (20.52) |
| | 7000 | 1.420 (0.180) | 70.844 (6.283) | 1.763 (0.321) | 102.00 (14.657) | 24.742 (2.827) | 159.40 (9.858) | 33.101 (5.181) | 217.87 (21.07) |
Our simulation study is based on a logistic regression model; the R code to replicate our study is reported in the attached file. Table 3 reports the average CPU times in seconds and the mean number of points of the solution curve (np), with standard deviations given in parentheses.
In this section we analyze an example dataset by using the functions
available in the dglars
package. We consider the benchmark diabetes
data (Efron et al. 2004) to study
the sparse structure of an inverse Gaussian regression model. This
dataset was also used in Ishwaran et al. (2010) and is available in the R package
lars:
R> install.packages(pkgs = "lars")
R> data("diabetes", package = "lars")
The response y is a quantitative measure of disease progression one year after baseline, while the design matrix x contains ten baseline variables, namely age, sex, bmi (body mass index), map (mean arterial blood pressure) and six blood serum measurements: ldl (low-density lipoprotein), hdl (high-density lipoprotein), ltg (lamotrigine), glu (glucose), tc (triglyceride) and tch (total cholesterol). In addition to these ten variables, the matrix x2 also contains interaction and squared terms, and consists of a total of 64 columns. So, the complete data consists of diabetes progression observations on n = 442 patients.
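A quick check of the two design matrices confirms these dimensions:

R> dim(diabetes$x)
[1] 442  10
R> dim(diabetes$x2)
[1] 442  64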
From previous analyses, it was clear that the disease progression is not appropriately modelled by a Gaussian distribution. After some goodness-of-fit considerations, we settle on an inverse Gaussian regression model, which requires us to estimate the dispersion parameter. First, we estimate the optimal value of the tuning parameter by ten-fold cross-validation with the cvdglars() function, i.e.,
R> library("dglars")
R> set.seed(11235)
R> cv_diabetes <- cvdglars(y ~ x, family = inverse.gaussian("log"),
+ data = diabetes)
R> cv_diabetes
Call: cvdglars(formula = y ~ x, family = inverse.gaussian("log"), data = diabetes)
Coefficients:
Estimate
Int. 4.9539
xsex -2.0273
xbmi 2.8447
xmap 2.1969
xtc -0.3811
xhdl -2.4124
xltg 3.8501
Dispersion parameter: 0.001141
Details:
number of non zero estimates: 8
cross-validation deviance: 0.06224
g: 0.01533
n. fold: 10
Algorithm 'pc' ( method = 'dgLASSO' )
This output shows that the dgLARS method, with the help of the CV criterion, selects an inverse Gaussian regression model with six covariates (sex, bmi, map, tc, hdl and ltg);
R> cv_diabetes$formula_cv
y ~ sex + bmi + map + tc + hdl + ltg
Moreover, the optimal value of the tuning parameter is g = 0.01533. Next, we compute the whole solution curve and summarize it using the BIC criterion and the GRCV estimate of the dispersion parameter:
R> diabetes_dglars <- dglars(y ~ x, family = inverse.gaussian("log"),
+ data = diabetes)
R> set.seed(11235)
R> summary(diabetes_dglars, type = "BIC", phi = "grcv")
Call: dglars(formula = y ~ x, family = inverse.gaussian("log"), data = diabetes)
Sequence g %Dev df BIC Rank
0.505974 0.00000 2 5223 18
+ xbmi
0.481473 0.02290 3 5207 17
0.481262 0.02309 3 5207 16
+ xltg
0.250152 0.26744 4 4986 15
0.233248 0.27846 4 4975 14
0.233174 0.27851 4 4975 13
+ xmap
0.222313 0.28613 5 4974 12
+ xhdl
0.100212 0.36560 6 4906 11
0.099904 0.36572 6 4906 10
+ xsex
0.030320 0.41322 7 4868 2
0.030263 0.41324 7 4868 1 <-
+ xtc
0.014883 0.41892 8 4869 3
+ xglu
0.005757 0.42063 9 4873 4
+ xtch
0.002389 0.42122 10 4879 6
0.002384 0.42122 10 4879 5
+ xldl
0.001704 0.42199 11 4884 8
0.001691 0.42200 11 4884 7
+ xage
0.000001 0.42272 12 4890 9
Details:
BIC values computed using k = 6.091 and complexity = 'df'
dispersion parameter estimated by 'grcv'
===============================================================
Summary of the Selected Model
Formula: y ~ xsex + xbmi + xmap + xhdl + xltg
Family: 'inverse.gaussian'
Link: 'log'
Coefficients:
Estimate
Int. 4.9495
xsex -1.6834
xbmi 2.7786
xmap 1.9536
xhdl -2.2917
xltg 3.5420
Dispersion parameter: 0.001112 (estimated by 'grcv' method)
---
g: 0.03026
Null deviance: 1.0361
Residual deviance: 0.6079
BIC: 4868.0435
Algorithm 'pc' ( method = 'dgLASSO' )
The fitted model now does not include tc, but does include the other five predictors (sex, bmi, map, hdl and ltg). In fact, the optimal value of the tuning parameter is now g = 0.03026. The GRCV estimate of the dispersion parameter can also be obtained directly from the fitted ‘dglars’ object, using only the design matrix x and the response variable y, with the following code:
R> set.seed(11235)
R> grcv(diabetes_dglars, type = "BIC")
[1] 0.001111604
Since the original PC algorithm is only available in version 1.0.5 of the package, here we report only the run time of the IPC algorithm:
R> system.time(diabetes_dglars_ipc <- dglars(y ~ x,
+ family = inverse.gaussian("log"), data = diabetes))
user system elapsed
0.016 0.000 0.016
and the number of points of the solution curve computed by the IPC algorithm is
R> diabetes_dglars_ipc$np
[1] 18
In this paper, we have described improvements to the R package dglars for estimating a larger class of generalized linear models, with arbitrary link functions and a general class of exponential dispersion models. We briefly reviewed the differential geometric theory underlying the dgLARS method and the available methods for estimating the dispersion parameter. We described the functions implemented in the new version of the dglars package that can be used to estimate the dispersion parameter, and used them to compare run times between the new IPC and the old PC algorithms. The simulations showed that the IPC algorithm is significantly faster than the original PC algorithm. The new version of dglars is available on CRAN.
This article is based upon work from COST Action European Cooperation for Statistics of Network Data Science (COSTNET, CA15109), supported by COST (European Cooperation in Science and Technology).