Various R packages have been developed for non-convex penalized estimation, but they can be applied only to the smoothly clipped absolute deviation (SCAD) or minimax concave penalty (MCP). We develop an R package, entitled ncpen, for non-convex penalized estimation so that data analysts can experience other non-convex penalties as well. The package ncpen implements a unified algorithm based on the convex-concave procedure and a modified local quadratic approximation algorithm, which can be applied to a broader range of non-convex penalties, including the SCAD and MCP as special cases. Many user-friendly functionalities, such as generalized information criteria, cross-validation, and ridge regularization, are also provided.
The penalized estimation has been one of the most important statistical
techniques for high dimensional data analysis, and many penalties have
been developed such as the least absolute shrinkage and selection
operator (LASSO) (Tibshirani 1996), smoothly clipped absolute
deviation (SCAD) (Fan and Li 2001), and minimax concave penalty (MCP)
(Zhang 2010). In the context of R, many authors released fast and
stable R packages for obtaining the whole solution path of the penalized
estimator for the generalized linear model (GLM). For example,
lars (Efron et al. 2004),
glmpath (Park and Hastie 2007)
and glmnet
(Friedman et al. 2007) implement the LASSO. Packages such as
plus (Zhang 2010),
sparsenet
(Mazumder et al. 2011),
cvplogit
(Jiang and Huang 2014) and
ncvreg
(Breheny and Huang 2011) implement the SCAD and MCP. Among them,
glmnet and ncvreg are very fast, stable, and well-organized,
presenting various user-friendly functionalities such as the
cross-validation and
Non-convex penalized estimation has been studied by many researchers (Fan and Li 2001; Huang et al. 2008; Kim et al. 2008; Zou and Li 2008; Friedman 2012; Kwon and Kim 2012; Zhang and Zhang 2012). However, there is still a lack of research on algorithms that exactly implement the non-convex penalized estimators, owing to the non-convexity of the objective function. One nice approach is to use the coordinate descent (CD) algorithm (Tseng 2001; Breheny and Huang 2011). The CD algorithm fits quite well for some quadratic non-convex penalties such as the SCAD and MCP (Breheny and Huang 2011; Mazumder et al. 2011; Jiang and Huang 2014), since each coordinate update in the CD algorithm becomes an easy convex optimization problem with a closed-form solution. This is the main reason the CD algorithm is preferred and implemented in many R packages such as sparsenet and ncvreg. However, coordinate updates in the CD algorithm require extra univariate optimizations for other non-convex penalties, such as the log and bridge penalties (Huang et al. 2008; Zou and Li 2008; Friedman 2012), which severely lowers the convergence speed. Another subtle point is that the CD algorithm requires standardization of the input variables and needs to enlarge the concave scale parameter in the penalty (Breheny and Huang 2011) to obtain local convergence, which may sacrifice an advantage of the non-convex penalized estimation (Kim and Kwon 2012) and give quite different variable selection performance (Lee 2015).
In this paper, we develop an R package
ncpen for the non-convex
penalized estimation based on the convex-concave procedure (CCCP) or
difference-convex (DC) algorithm (Kim et al. 2008; Shen et al. 2012)
and the modified local quadratic approximation algorithm (MLQA)
(Lee et al. 2016). The main contribution of the package ncpen is that
it encompasses most existing non-convex penalties, including the
truncated LASSO penalty (TLP), clipped LASSO (CLASSO), sparse ridge (SR),
modified bridge (MBR), and modified log (MLOG) penalties used in the
numerical studies below.
The rest of the paper is organized as follows. In the next section, we describe the algorithm implemented in ncpen with major steps and details. Afterwards, the main functions and various options in ncpen are presented with numerical illustrations. Finally, the paper concludes with remarks.
We consider the problem of minimizing
We consider a class of non-convex penalties that satisfy
The moderately clipped LASSO and sparse ridge are simple smooth
interpolations between the MCP (near the origin) and the LASSO and
ridge, respectively. The log and bridge penalties are modified to be
linear over
The CCCP-MLQA algorithm iteratively conducts two main steps: CCCP
(Yuille and Rangarajan 2003) and MLQA (Lee et al. 2016) steps. The
CCCP step decomposes the penalty
The objective function
Algorithm 1: minimizing
Algorithm 1 includes minimizing
Algorithm 2: minimizing
Usually, the computation time of the algorithm rapidly increases as the
number of non-zero parameters increases or
Assume that
Algorithm 3: minimizing
Remark 1. The number of variables that violate the KKT conditions could be large
for some high-dimensional cases. In such cases, it may be inefficient to
add only one variable
In practice, we want to approximate the whole solution path or surface
of the minimizer
In this section, we introduce the main functions, their various options, and the user-friendly functions implemented in the package ncpen. The next section illustrates, through numerical examples, how the various options in the main function make a difference in data analysis.
The R package ncpen contains the main functions: ncpen() for fitting
various non-convex penalized regression models, and cv.ncpen() and
gic.ncpen() for selecting the optimal model along the regularization
path based on cross-validation and a generalized information criterion
(Wang et al. 2007, 2009; Fan and Tang 2013; Wang et al. 2013).
In addition, several useful functions are implemented in the package:
sam.gen.ncpen() for generating synthetic data from various models
with the correlation structure in (11), plot() for graphical
representation, coef() for extracting coefficients from the fitted
object, and predict() for making predictions from a new design matrix.
The following shows the basic usage of the main functions:
## linear regression with scad penalty
n=100; p=10
sam = sam.gen.ncpen(n=n,p=p,q=5,cf.min=0.5,cf.max=1,corr=0.5,family="gaussian")
x.mat = sam$x.mat; y.vec = sam$y.vec
fit = ncpen(y.vec=y.vec,x.mat=x.mat,family="gaussian",penalty="scad")
coef(fit); plot(fit)
# prediction from a new dataset
test.n=20
newx = matrix(rnorm(test.n*p),nrow=test.n,ncol=p)
predict(fit,new.x.mat=newx)
# selection of the optimal model based on the cross-validation
cv.fit = cv.ncpen(y.vec=y.vec,x.mat=x.mat,family="gaussian",penalty="scad")
coef(cv.fit)
# selection of the optimal model using the generalized information criterion
fit = ncpen(y.vec=y.vec,x.mat=x.mat,family="gaussian",penalty="scad")
gic.ncpen(fit)
The main function ncpen() provides various options that produce
different penalized estimators. Other packages for non-convex
penalized regression offer similar options; below we describe those
available in ncpen().
The option alpha in the main functions forces the algorithm to solve a
penalized problem in which the penalty is a mixture of the non-convex
penalty and the ridge penalty, with alpha being the mixing parameter
between the two. By decreasing alpha from one, we can treat problems
with highly correlated variables, and the algorithm also becomes more
stable since the minimum eigenvalue of the Hessian matrix becomes
larger, up to a factor determined by the ridge component.
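As a concrete illustration, the following sketch fits the SCAD penalty mixed with a ridge component via the alpha option; the data-generation settings are illustrative only, and the exact interface should be checked against the package documentation.

```r
## assumes the ncpen package is installed
library(ncpen)
## synthetic data with highly correlated predictors (corr = 0.9)
sam = sam.gen.ncpen(n=100, p=20, q=5, cf.min=0.5, cf.max=1, corr=0.9, family="gaussian")
## alpha = 1 gives the pure SCAD penalty; alpha < 1 mixes in a ridge term,
## which can stabilize the algorithm under high correlation
fit = ncpen(y.vec=sam$y.vec, x.mat=sam$x.mat, family="gaussian",
            penalty="scad", alpha=0.7)
coef(fit)
```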
We can give different weights to each observation and to each penalty
term through the options obs.weight and pen.weight, which provide the
minimizer of
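For instance, to up-weight some observations and leave some coefficients unpenalized, one could try the following sketch; the weight semantics assumed in the comments follow the description above and should be verified against the package documentation.

```r
library(ncpen)
sam = sam.gen.ncpen(n=100, p=10, q=5, cf.min=0.5, cf.max=1, corr=0.5, family="gaussian")
o.w = rep(1, 100); o.w[1:10] = 2   # double the weight of the first 10 observations
p.w = rep(1, 10);  p.w[1:2]  = 0   # zero weight: leave the first two coefficients unpenalized (assumed semantics)
fit = ncpen(y.vec=sam$y.vec, x.mat=sam$x.mat, family="gaussian", penalty="scad",
            obs.weight=o.w, pen.weight=p.w)
```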
It is common practice to standardize variables prior to fitting
penalized models, but one may opt not to. Hence, we provide the option
x.standardize for flexible analysis. The option x.standardize=TRUE
means that the algorithm solves the original penalized problem in
(1) with the standardized (scaled) variables, and then the
resulting solution is transformed back to the original scale.
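A minimal sketch contrasting the two choices follows; both fits report coefficients on the original scale, but the selected models may differ.

```r
library(ncpen)
sam = sam.gen.ncpen(n=100, p=10, q=5, cf.min=0.5, cf.max=1, corr=0.5, family="gaussian")
## solve the problem on standardized variables, then rescale the solution back
fit.std = ncpen(y.vec=sam$y.vec, x.mat=sam$x.mat, family="gaussian",
                penalty="scad", x.standardize=TRUE)
## solve the problem on the raw variables directly
fit.raw = ncpen(y.vec=sam$y.vec, x.mat=sam$x.mat, family="gaussian",
                penalty="scad", x.standardize=FALSE)
coef(fit.std); coef(fit.raw)
```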
We introduced the warm start strategy to speed up the algorithm, but
the solution path in fact depends on the initial solution of the CCCP
algorithm because of the non-convexity. The option local=TRUE in
ncpen supplies the same initial value, specified by the option
local.initial, to each CCCP iteration along the whole solution path.
Setting local=TRUE makes the algorithm slower, but the performance of
the resulting estimator is often improved when a good initial value,
such as the maximum likelihood estimator or the LASSO, is provided.
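For example, a LASSO fit can serve as the global initial; the way the initial vector is extracted from the LASSO path below is an assumption for illustration.

```r
library(ncpen)
sam = sam.gen.ncpen(n=100, p=20, q=5, cf.min=0.5, cf.max=1, corr=0.5, family="gaussian")
## fit the LASSO first and take one of its solutions as the initial value
lasso.fit = ncpen(y.vec=sam$y.vec, x.mat=sam$x.mat, family="gaussian", penalty="lasso")
lasso.coef = coef(lasso.fit)
init = lasso.coef[, ncol(lasso.coef)]  # coefficients at the last lambda (assumed layout)
fit = ncpen(y.vec=sam$y.vec, x.mat=sam$x.mat, family="gaussian", penalty="scad",
            local=TRUE, local.initial=init)
```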
We consider the linear and logistic regression models to calculate the
total elapsed time for constructing the solution path over 100 values
of the tuning parameter, varying the sample size.
| Model | n | ncvreg | SCAD | MCP | TLP | CLASSO | SR | MBR | MLOG |
|---|---|---|---|---|---|---|---|---|---|
| Linear regression | 200 | 0.0226 | 0.1277 | 0.1971 | 0.0333 | 0.0696 | 0.0618 | 0.0620 | 0.0476 |
| | 400 | 0.0329 | 0.1082 | 0.2031 | 0.0662 | 0.1041 | 0.1025 | 0.1160 | 0.0919 |
| | 800 | 0.0347 | 0.1008 | 0.1867 | 0.0865 | 0.0993 | 0.1067 | 0.1425 | 0.1197 |
| | 1600 | 0.0665 | 0.2035 | 0.3170 | 0.1717 | 0.1847 | 0.1983 | 0.2669 | 0.2301 |
| | 3200 | 0.1394 | 0.4341 | 0.6173 | 0.3541 | 0.3962 | 0.4161 | 0.5505 | 0.4678 |
| | 6400 | 0.2991 | 0.9853 | 1.2045 | 0.7955 | 0.8788 | 0.9066 | 1.2281 | 1.0148 |
| Logistic regression | 200 | 0.0565 | 0.0454 | 0.0400 | 0.0391 | 0.0148 | 0.0160 | 0.0379 | 0.0411 |
| | 400 | 0.0787 | 0.1113 | 0.0971 | 0.0747 | 0.0556 | 0.0608 | 0.0969 | 0.0808 |
| | 800 | 0.0907 | 0.1570 | 0.1623 | 0.1198 | 0.0777 | 0.1015 | 0.1511 | 0.1298 |
| | 1600 | 0.1682 | 0.2965 | 0.3007 | 0.2294 | 0.1640 | 0.2088 | 0.3002 | 0.2451 |
| | 3200 | 0.3494 | 0.6480 | 0.6258 | 0.4655 | 0.3513 | 0.4423 | 0.6395 | 0.5305 |
| | 6400 | 0.7310 | 1.4144 | 1.3711 | 1.0268 | 0.8389 | 1.0273 | 1.4445 | 1.1827 |
| Model | n | ncvreg | SCAD | MCP | TLP | CLASSO | SR | MBR | MLOG |
|---|---|---|---|---|---|---|---|---|---|
| Linear regression | 200 | 0.0150 | 0.0733 | 0.2201 | 0.0433 | 0.0629 | 0.0981 | 0.0909 | 0.0721 |
| | 400 | 0.0210 | 0.0664 | 0.1588 | 0.0532 | 0.0617 | 0.0678 | 0.0941 | 0.0813 |
| | 800 | 0.0538 | 0.1650 | 0.2172 | 0.1107 | 0.1505 | 0.1457 | 0.1750 | 0.1383 |
| | 1600 | 0.0945 | 0.2703 | 0.2946 | 0.1793 | 0.2253 | 0.2221 | 0.2672 | 0.2045 |
| | 3200 | 0.1769 | 0.5071 | 0.5032 | 0.3379 | 0.3972 | 0.3986 | 0.4801 | 0.3684 |
| | 6400 | 0.3439 | 1.0781 | 1.0228 | 0.7366 | 0.8001 | 0.8207 | 1.0210 | 0.7830 |
| Logistic regression | 200 | 0.0590 | 0.1065 | 0.1029 | 0.0750 | 0.0465 | 0.0696 | 0.0978 | 0.0804 |
| | 400 | 0.0568 | 0.1054 | 0.1044 | 0.0753 | 0.0453 | 0.0593 | 0.0941 | 0.0809 |
| | 800 | 0.1076 | 0.1555 | 0.1349 | 0.1103 | 0.0873 | 0.0934 | 0.1423 | 0.1163 |
| | 1600 | 0.1327 | 0.1944 | 0.1591 | 0.1419 | 0.1122 | 0.1151 | 0.1842 | 0.1460 |
| | 3200 | 0.2073 | 0.3120 | 0.2529 | 0.2382 | 0.1885 | 0.1948 | 0.3055 | 0.2415 |
| | 6400 | 0.3843 | 0.5893 | 0.4792 | 0.4646 | 0.3539 | 0.3576 | 0.5978 | 0.4677 |
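The timing experiments above can be reproduced in outline with system.time(); the dimensions and the path-length option in the sketch below are assumptions for illustration.

```r
library(ncpen)
sam = sam.gen.ncpen(n=400, p=200, q=10, cf.min=0.5, cf.max=1, corr=0.5, family="gaussian")
## total elapsed time for one solution path over 100 tuning-parameter values
system.time(
  ncpen(y.vec=sam$y.vec, x.mat=sam$x.mat, family="gaussian",
        penalty="scad", n.lambda=100)
)
```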
We compare the solution paths based on the diabetes samples available
from lars package (Efron et al. 2004), where the sample size
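The comparison can be set up as in the following sketch, assuming the lars package is installed; the choice of penalties shown is illustrative.

```r
library(lars); library(ncpen)
data(diabetes)                   # diabetes samples from the lars package
x.mat = as.matrix(diabetes$x)
y.vec = diabetes$y
## solution paths for two non-convex penalties on the same data
fit.scad = ncpen(y.vec=y.vec, x.mat=x.mat, family="gaussian", penalty="scad")
fit.mcp  = ncpen(y.vec=y.vec, x.mat=x.mat, family="gaussian", penalty="mcp")
plot(fit.scad); plot(fit.mcp)
```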
There are cases where we need to introduce the ridge penalty for
various reasons, and ncpen provides a hybrid version of the penalties:
We introduced the warm start strategy to speed up the algorithm, but the solution path in fact depends on the initial solution because of the non-convexity. For comparison, we use the prostate tumor gene expression data; the results are displayed in Figure 5 and Table 3. In the figure, the left panels show the solution paths for the SCAD, MCP, and clipped LASSO obtained by the warm start, and the right panels show those obtained by using the LASSO as a global initial for the CCCP algorithm. Figure 5 shows that the two initialization strategies produce very different solution paths, which may result in different performances of the estimators. We compare the prediction accuracy and selectivity of the estimators under the two strategies. The results are obtained from 300 random partitions of the data set into training (70%) and test (30%) sets. For each training set, the optimal tuning parameter values are selected by 10-fold cross-validation; we then compute the prediction error (the percentage of misclassified samples) on each test set and the number of selected nonzero variables on each training set. Table 3 shows that all methods with the global initial perform better than those with the warm start strategy. In summary, non-convex penalized estimation depends on the initial solution, and a non-convex penalized estimator with a good initial value can improve its performance.
| Method | prediction error (warm start) | # variables (warm start) | prediction error (global initial) | # variables (global initial) |
|---|---|---|---|---|
| SCAD | 9.44 (.2757) | 1.19 (.0268) | 9.30 (.3091) | 7.02 (.3907) |
| MCP | 9.38 (.2774) | 1.14 (.0234) | 8.84 (.2657) | 7.41 (.3771) |
| TLP | 9.44 (.2647) | 1.18 (.0254) | 7.47 (.2677) | 19.01 (.1710) |
| CLASSO | 9.12 (.2749) | 4.65 (.1914) | 8.20 (.2785) | 8.14 (.1542) |
| SR | 9.38 (.2875) | 5.15 (.2290) | 7.94 (.2677) | 15.13 (.2283) |
| MBR | 9.78 (.2621) | 1.29 (.0352) | 8.21 (.3004) | 8.75 (.1712) |
| MLOG | 9.51 (.2627) | 1.16 (.0233) | 7.66 (.2746) | 15.21 (.1816) |
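One random 70/30 split of the evaluation above can be sketched as follows; synthetic data stands in for the prostate tumor data, and the shape of the coefficient vector returned by coef() (intercept first) is an assumption for illustration.

```r
library(ncpen)
## synthetic binary-response data in place of the prostate tumor data
sam = sam.gen.ncpen(n=200, p=50, q=5, cf.min=0.5, cf.max=1, corr=0.5, family="binomial")
idx = sample(200, 140)                        # 70% training, 30% test
cv.fit = cv.ncpen(y.vec=sam$y.vec[idx], x.mat=sam$x.mat[idx,],
                  family="binomial", penalty="scad", n.fold=10)
b.vec = coef(cv.fit)                          # coefficients at the CV-optimal lambda (assumed shape)
## misclassification rate on the test set
prob = 1/(1 + exp(-cbind(1, sam$x.mat[-idx,]) %*% b.vec))
mean((prob > 0.5) != sam$y.vec[-idx])
```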
We have developed the R package ncpen for estimating generalized linear models with various concave penalties. The unified algorithm implemented in ncpen is flexible and efficient. The package also provides various user-friendly functions and user-specific options for different penalized estimators. The package is available under the general public license (GPL) from the Comprehensive R Archive Network at https://CRAN.R-project.org/package=ncpen. The internal optimization algorithms of ncpen are implemented in C++, benefiting from the Rcpp package (Eddelbuettel et al. 2011).
This research is supported by the Julian Virtue Professorship 2016-18 Endowment at the Pepperdine Graziadio Business School at Pepperdine University, and the National Research Foundation of Korea (NRF) funded by the Korea government (No. 2020R1I1A3071646 and 2020R1F1A1A01071036).
For attribution, please cite this work as
Kim, et al., "A Unified Algorithm for the Non-Convex Penalized Estimation: The ncpen Package", The R Journal, 2021
BibTeX citation
@article{RJ-2021-003, author = {Kim, Dongshin and Lee, Sangin and Kwon, Sunghoon}, title = {A Unified Algorithm for the Non-Convex Penalized Estimation: The ncpen Package}, journal = {The R Journal}, year = {2021}, note = {https://rjournal.github.io/}, volume = {12}, issue = {2}, issn = {2073-4859}, pages = {120-133} }