Various R packages have been developed for non-convex penalized estimation, but they can only be applied to the smoothly clipped absolute deviation (SCAD) or minimax concave penalty (MCP). We develop an R package, entitled ncpen, for non-convex penalized estimation that allows data analysts to experience other non-convex penalties. The package ncpen implements a unified algorithm based on the convex-concave procedure and a modified local quadratic approximation algorithm, which can be applied to a broader range of non-convex penalties, including the SCAD and MCP as special cases. Many user-friendly functionalities, such as generalized information criteria, cross-validation, and ridge regularization, are also provided.
Penalized estimation has been one of the most important statistical techniques for high-dimensional data analysis, and many penalties have been developed, such as the least absolute shrinkage and selection operator (LASSO) (Tibshirani 1996), smoothly clipped absolute deviation (SCAD) (Fan and Li 2001), and minimax concave penalty (MCP) (Zhang 2010). In the context of R, many authors have released fast and stable packages for obtaining the whole solution path of the penalized estimator for the generalized linear model (GLM). For example, lars (Efron et al. 2004), glmpath (Park and Hastie 2007) and glmnet (Friedman et al. 2007) implement the LASSO. Packages such as plus (Zhang 2010), sparsenet (Mazumder et al. 2011), cvplogit (Jiang and Huang 2014) and ncvreg (Breheny and Huang 2011) implement the SCAD and MCP. Among them, glmnet and ncvreg are very fast, stable, and well organized, presenting various user-friendly functionalities such as cross-validation and \(\ell_2\)-stabilization (Zou and Hastie 2005; Huang et al. 2016b).
The non-convex penalized estimation has been studied by many researchers (Fan and Li 2001; Huang et al. 2008; Kim et al. 2008; Zou and Li 2008; Friedman 2012; Kwon and Kim 2012; Zhang and Zhang 2012). However, there is still a lack of research on algorithms that exactly implement the non-convex penalized estimators, owing to the non-convexity of the objective function. One nice approach is to use the coordinate descent (CD) algorithm (Tseng 2001; Breheny and Huang 2011). The CD algorithm fits quite well for some quadratic non-convex penalties such as the SCAD and MCP (Breheny and Huang 2011; Mazumder et al. 2011; Jiang and Huang 2014) since each coordinate update in the CD algorithm becomes an easy convex optimization problem with a closed-form solution. This is the main reason the CD algorithm is preferred and implemented in many R packages such as sparsenet and ncvreg. However, coordinate updates in the CD algorithm require extra univariate optimizations for other non-convex penalties such as the log and bridge penalties (Huang et al. 2008; Zou and Li 2008; Friedman 2012), which severely lowers the convergence speed. Another subtle point is that the CD algorithm requires standardization of the input variables and needs to enlarge the concave scale parameter in the penalty (Breheny and Huang 2011) to obtain local convergence, which may cause the loss of an advantage of non-convex penalized estimation (Kim and Kwon 2012) and give much different variable selection performance (Lee 2015).
In this paper, we develop an R package, ncpen, for non-convex penalized estimation based on the convex-concave procedure (CCCP) or difference-convex (DC) algorithm (Kim et al. 2008; Shen et al. 2012) and the modified local quadratic approximation (MLQA) algorithm (Lee et al. 2016). The main contribution of the package ncpen is that it encompasses most existing non-convex penalties, including the truncated \(\ell_1\) (Shen et al. 2013), log (Zou and Li 2008; Friedman 2012), bridge (Huang et al. 2008), moderately clipped LASSO (Kwon et al. 2015), and sparse ridge (Kwon et al. 2013; Huang et al. 2016a) penalties as well as the SCAD and MCP, and covers a broader range of regression models: multinomial and Cox models as well as the GLM. Further, ncpen provides two unique options: the investigation of initial-dependent solution paths and non-standardization of input variables, which allow the users more flexibility.
The rest of the paper is organized as follows. In the next section, we describe the algorithm implemented in ncpen with major steps and details. Afterwards, the main functions and various options in ncpen are presented with numerical illustrations. Finally, the paper concludes with remarks.
We consider the problem of minimizing \[\label{ncp:pro} Q_{\lambda}\left({\mathbf{\beta}}\right) = L\left({\mathbf{\beta}}\right) + \sum_{j=1}^pJ_{\lambda}\left(|\beta_j|\right), \tag{1}\] where \({\mathbf{\beta}}=(\beta_1,\ldots,\beta_p)^T\) is a \(p\)-dimensional parameter vector of interest, \(L\) is a convex loss function and \(J_\lambda\) is a non-convex penalty with tuning parameter \(\lambda>0\). We first introduce the CCCP-MLQA algorithm for minimizing \(Q_\lambda\) when \(\lambda\) is fixed, and then explain how to construct the whole solution path over a decreasing sequence of \(\lambda\)s by using the algorithm.
We consider a class of non-convex penalties that satisfy \(J_\lambda\left(|t|\right) = \int_0^{|t|} \nabla J_\lambda \left(s\right)ds, t\in\mathbb{R}\), for some non-decreasing function \(\nabla J_\lambda\), and for which \[\label{pen:decomp} D_\lambda\left(t\right) = J_\lambda\left(|t|\right)-\kappa_\lambda |t| \tag{2}\] is a concave function, where \(\kappa_\lambda = \lim_{t\to0+}\nabla J_\lambda\left(t\right)\). The class includes most existing non-convex penalties: the SCAD (Fan and Li 2001), \[\nabla J_{\lambda}\left(t\right) = \lambda I\left[0<t<\lambda\right] + \left\{\left(\tau\lambda-t\right)/\left(\tau-1\right)\right\}I\left[\lambda \leq t<\tau\lambda\right]\] for \(\tau>2\); the MCP (Zhang 2010), \[\nabla J_{\lambda}\left(t\right) = \left(\lambda-t/\tau\right)I\left[0< t<\tau\lambda\right]\] for \(\tau>1\); the truncated \(\ell_1\)-penalty (Shen et al. 2013), \[\nabla J_{\lambda}\left(t\right) = \lambda I\left[0<t<\tau\right]\] for \(\tau>0\); the moderately clipped LASSO (Kwon et al. 2015), \[\nabla J_{\lambda}\left(t\right) = \left(\lambda-t/\tau\right)I\left[0<t<\tau\left(\lambda-\gamma\right)\right] + \gamma I\left[t\geq \tau\left(\lambda-\gamma\right)\right]\] for \(\tau>1\) and \(0\leq\gamma\leq \lambda\); the sparse ridge (Kwon et al. 2013), \[\nabla J_{\lambda}\left(t\right) = \left(\lambda-t/\tau\right)I\left[0<t<\tau\lambda/\left(\tau\gamma+1\right)\right] + \gamma t I\left[t\geq\tau\lambda/\left(\tau\gamma+1\right)\right]\] for \(\tau>1\) and \(\gamma\geq0\); the modified log (Zou and Li 2008), \[\nabla J_{\lambda}\left(t\right) = \left(\lambda/\tau\right)I\left[0<t<\tau\right] + \left(\lambda/t\right)I\left[t\geq\tau\right]\] for \(\tau>0\); and the modified bridge (Huang et al. 2008), \[\nabla J_{\lambda}\left(t\right) = \left(\lambda/2\sqrt{\tau}\right)I\left[0<t<\tau\right] + (\lambda/2\sqrt{t})I\left[t\geq\tau\right]\] for \(\tau>0\).
The moderately clipped LASSO and sparse ridge are simple smooth interpolations between the MCP (near the origin) and the LASSO and ridge, respectively. The log and bridge penalties are modified to be linear over \(t\in(0,\tau]\) so that they have a finite right derivative at the origin. See the figure for a graphical comparison of the penalties introduced here.
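To make the notation concrete, the following is a minimal R sketch of two of the gradient functions \(\nabla J_\lambda\) listed above (the SCAD and MCP), written for \(t\geq0\); the default values of \(\tau\) are illustrative choices only. Both satisfy \(\kappa_\lambda=\lim_{t\to0+}\nabla J_\lambda(t)=\lambda\).
grad.scad = function(t, lambda, tau = 3.7)  # SCAD gradient, tau > 2
  lambda*(t < lambda) + pmax(tau*lambda - t, 0)/(tau - 1)*(t >= lambda)
grad.mcp = function(t, lambda, tau = 3)     # MCP gradient, tau > 1
  pmax(lambda - t/tau, 0)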
The CCCP-MLQA algorithm iteratively conducts two main steps: CCCP (Yuille and Rangarajan 2003) and MLQA (Lee et al. 2016) steps. The CCCP step decomposes the penalty \(J_\lambda\) as in (2) and then minimizes the tight convex upper bound obtained from a linear approximation of \(D_\lambda\). The MLQA step first minimizes a quadratic approximation of the loss \(L\) and then modifies the solution to keep descent property.
The objective function \(Q_\lambda\) in (1) can be rewritten by using the decomposition in (2) as \[\label{obj:decomp} Q_{\lambda}\left({\mathbf{\beta}}\right) = L\left({\mathbf{\beta}}\right) + \sum_{j=1}^p D_{\lambda}(\beta_j) + \kappa_\lambda \sum_{j=1}^p |\beta_j| \tag{3}\] so that \(Q_\lambda\left({\mathbf{\beta}}\right)\) becomes a sum of convex, \(L\left({\mathbf{\beta}}\right)+ \kappa_\lambda \sum_{j=1}^p |\beta_j|\), and concave, \(\sum_{j=1}^p D_{\lambda}(\beta_j)\), functions. Hence the tight convex upper bound of \(Q_\lambda\left({\mathbf{\beta}}\right)\) (Yuille and Rangarajan 2003) becomes \[\label{obj:cccp} U_{\lambda}\left({\mathbf{\beta}};\tilde{{\mathbf{\beta}}}\right) = L\left({\mathbf{\beta}}\right) + \sum_{j=1}^p \partial D_{\lambda}(\tilde\beta_j)\beta_j + \kappa_\lambda \sum_{j=1}^p |\beta_j|, \tag{4}\] where \(\tilde {\mathbf{\beta}}=\left(\tilde\beta_1,\dots,\tilde\beta_p\right)^T\) is a given point and \(\partial D_\lambda(\tilde\beta_j)\) is a subgradient of \(D_\lambda(\beta_j)\) at \(\beta_j=\tilde\beta_j\). Algorithm 1 summarizes the CCCP step for minimizing \(Q_\lambda\).
Algorithm 1: minimizing \(Q_{\lambda} (\beta)\)
\(~~~\) 1. Set \(\tilde{{\mathbf{\beta}}}\).
\(~~~\) 2. Update \(\tilde{{\mathbf{\beta}}}\) by \(\tilde{{\mathbf{\beta}}}= \arg\min_{\scriptsize{\mathbf{\beta}}}U_\lambda\left({\mathbf{\beta}};\tilde{{\mathbf{\beta}}}\right)\).
\(~~~\) 3. Repeat Step 2 until convergence.
Step 2 of Algorithm 1 requires minimizing \(U_{\lambda}\left({\mathbf{\beta}};\tilde{{\mathbf{\beta}}}\right)\) in (4) given a solution \(\tilde{\mathbf{\beta}}\). An easy way is to iteratively minimize a local quadratic approximation (LQA) of \(L\) around \(\tilde{\mathbf{\beta}}\): \[L\left({\mathbf{\beta}}\right)\approx\tilde L\left({\mathbf{\beta}};\tilde{{\mathbf{\beta}}}\right) = L\left(\tilde{{\mathbf{\beta}}}\right)+\nabla L\left(\tilde{{\mathbf{\beta}}}\right)^T\left({\mathbf{\beta}}-\tilde{{\mathbf{\beta}}}\right)+\left({\mathbf{\beta}}-\tilde{{\mathbf{\beta}}}\right)^T\nabla^2L\left(\tilde{{\mathbf{\beta}}}\right)\left({\mathbf{\beta}}-\tilde{{\mathbf{\beta}}}\right)/2,\] where \(\nabla L\left({\mathbf{\beta}}\right)=\partial L\left({\mathbf{\beta}}\right)/\partial {\mathbf{\beta}}\) and \(\nabla^2 L\left({\mathbf{\beta}}\right)=\partial^2 L\left({\mathbf{\beta}}\right)/\partial {\mathbf{\beta}}^2\). Then \(U_\lambda\left({\mathbf{\beta}};\tilde{{\mathbf{\beta}}}\right)\) can be minimized by iteratively minimizing \[\label{obj:lqa} \tilde U_{\lambda}\left({\mathbf{\beta}};\tilde{{\mathbf{\beta}}}\right) = \tilde L\left({\mathbf{\beta}};\tilde{{\mathbf{\beta}}}\right) + \sum_{j=1}^p \partial D_{\lambda}(|\tilde\beta_j|)\beta_j + \kappa_\lambda \sum_{j=1}^p |\beta_j|. \tag{5}\] It is easy to minimize \(\tilde U_{\lambda}\left({\mathbf{\beta}};\tilde{{\mathbf{\beta}}}\right)\) since the loss part is simply a quadratic function and the penalty term is the LASSO; for this step, we use the coordinate descent algorithm introduced by Friedman et al. (2007). Note that the LQA algorithm may not have the descent property, so we incorporate a modification step to ensure it. Let \(\tilde{\mathbf{\beta}}^a\) be the minimizer of \(\tilde U_{\lambda}\left({\mathbf{\beta}};\tilde{{\mathbf{\beta}}}\right)\). Then we replace the solution \(\tilde{\mathbf{\beta}}^a\) with \(\tilde{\mathbf{\beta}}^{\hat h}\) whenever it violates the descent property, i.e., \(U_{\lambda}(\tilde{{\mathbf{\beta}}}^a;\tilde{{\mathbf{\beta}}})>U_{\lambda}\left(\tilde{{\mathbf{\beta}}};\tilde{{\mathbf{\beta}}}\right)\): \[\label{mod:step} \tilde{\mathbf{\beta}}^{\hat h} = {\hat h}\tilde{\mathbf{\beta}}^a + \left(1-{\hat h}\right)\tilde{\mathbf{\beta}}, \tag{6}\] where \(\hat h = \arg\min_{h>0} U_\lambda\left(h\tilde{{\mathbf{\beta}}}^a+\left(1-h\right)\tilde{{\mathbf{\beta}}};\tilde{{\mathbf{\beta}}}\right)\). This modification step in (6) guarantees the descent property of the LQA algorithm (Lee et al. 2016).
Algorithm 2: minimizing \(U_\lambda\left({\mathbf{\beta}};\tilde{{\mathbf{\beta}}}\right)\)
\(~~~\) 1. Set \(\tilde{{\mathbf{\beta}}}\).
\(~~~\) 2. Find \(\tilde{{\mathbf{\beta}}}^a = \arg\min_{\scriptsize{\mathbf{\beta}}}\tilde U_\lambda\left({\mathbf{\beta}};\tilde{{\mathbf{\beta}}}\right)\).
\(~~~\) 3. Find \(\hat h = \arg\min_{h>0} U_\lambda\left(h\tilde{{\mathbf{\beta}}}^a+\left(1-h\right)\tilde{{\mathbf{\beta}}};\tilde{{\mathbf{\beta}}}\right)\).
\(~~~\) 4. Update \(\tilde{{\mathbf{\beta}}}\) by \(\hat h\tilde{{\mathbf{\beta}}}^a+\left(1-\hat h\right)\tilde{{\mathbf{\beta}}}\).
\(~~~\) 5. Repeat Steps 2–4 until convergence.
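As an illustration of the modification step (6), the following is a minimal R sketch (not the package internals); U.fun stands for a user-supplied function evaluating \(U_\lambda(\,\cdot\,;\tilde{{\mathbf{\beta}}})\), and restricting the line search to \(h\in(0,1]\) is an assumption made for this sketch.
## shrink the LQA minimizer 'b.lqa' back toward the current solution 'b.cur'
## whenever the descent property is violated
mlqa.modify = function(U.fun, b.cur, b.lqa) {
  if (U.fun(b.lqa) <= U.fun(b.cur)) return(b.lqa)     # descent already holds
  h.hat = optimize(function(h) U.fun(h*b.lqa + (1 - h)*b.cur),
                   interval = c(0, 1))$minimum        # line search for the step in (6)
  h.hat*b.lqa + (1 - h.hat)*b.cur
}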
Usually, the computation time of the algorithm rapidly increases as the number of non-zero parameters increases or \(\lambda\) decreases toward zero. To accelerate the algorithm, we incorporate the active-set-control procedure while constructing the solution path over a decreasing sequence of \(\lambda\).
Assume that \(\lambda\) is given and we have an initial solution \(\tilde{\mathbf{\beta}}\) which is expected to be very close to the minimizer of \(Q_\lambda\left({\mathbf{\beta}}\right)\). First we check the first-order KKT optimality conditions: \[\label{KKT} \partial Q_{\lambda}\left(\tilde{\mathbf{\beta}}\right)/\partial\beta_j=0,j\in{\cal A}~~and~~ |\partial Q_{\lambda}\left(\tilde{\mathbf{\beta}}\right)/\partial\beta_j|\leq \kappa_{\lambda},j\in{\cal N}, \tag{7}\] where \({\cal A}=\{j:\tilde\beta_j\neq0\}\) and \({\cal N}=\{j:\tilde\beta_j=0\}\). We stop the algorithm if the conditions are satisfied; otherwise, we update \({\cal N}\) and \(\tilde{\mathbf{\beta}}\) by \({\cal N}={\cal N}\setminus\{j_{\max}\}\) and \[\label{small:Q} \tilde{\mathbf{\beta}}= {\arg\min}_{\beta_j=0,j\in{\cal N}} Q_{\lambda}\left({\mathbf{\beta}}\right), \tag{8}\] respectively, where \(j_{\max}={\arg\max}_{j\in{\cal N}}|\partial Q_{\lambda}\left(\tilde{\mathbf{\beta}}\right)/\partial\beta_j|\). We repeat these iterations until the KKT conditions in (7) are satisfied by \(\tilde{\mathbf{\beta}}\). The key step is (8), which is easy and fast to solve by using Algorithms 1 and 2 since the objective function only includes the parameters in \({\cal A}\cup\{j_{\max}\}\).
Algorithm 3: minimizing \(Q_{\lambda} (\beta)\)
\(~~~\) 1. Set \(\tilde{{\mathbf{\beta}}}\).
\(~~~\) 2. Set \({\cal A}=\{j:\tilde\beta_j\neq0\}\) and \({\cal N}=\{j:\tilde\beta_j=0\}\).
\(~~~\) 3. Check whether \(\partial Q_{\lambda}\left(\tilde{\mathbf{\beta}}\right)/\partial\beta_j=0,j\in{\cal A}~and~ |\partial Q_{\lambda}\left(\tilde{\mathbf{\beta}}\right)/\partial\beta_j|\leq \kappa_{\lambda},j\in{\cal N}\).
\(~~~\) 4. Update \({\cal N}\) by \({\cal N}\setminus\{j_{\max}\}\), where \(j_{\max}={\arg\max}_{j\in{\cal N}}|\partial Q_{\lambda}\left(\tilde{\mathbf{\beta}}\right)/\partial\beta_j|\).
\(~~~\) 5. Update \(\tilde{{\mathbf{\beta}}}\) by \(\tilde{\mathbf{\beta}}= {\arg\min}_{\beta_j=0,j\in{\cal N}} Q_{\lambda}\left({\mathbf{\beta}}\right)\).
\(~~~\) 6. Repeat Steps 2–5 until the KKT conditions are satisfied.
Remark 1. The number of variables that violate the KKT conditions could be large in some high-dimensional cases. In such cases, it may be inefficient to add only the single variable \(j_{\max}\) into \({\cal A}\); it would be more efficient to add several variables at once. However, adding too many variables at each iteration is also inefficient. Based on our experience, the algorithm is efficient when about 10 variables are added at a time.
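For concreteness, the following is an illustrative R sketch (not the package internals) of the KKT screening in Algorithm 3 for the squared-error loss \(L({\mathbf{\beta}})=\|{\bf y}-{\bf X}{\mathbf{\beta}}\|^2/(2n)\); since the derivative of the concave part \(D_\lambda\) vanishes at the origin, the check over \({\cal N}\) reduces to the loss gradient in this sketch, and kappa plays the role of \(\kappa_\lambda\).
kkt.violators = function(y.vec, x.mat, b.vec, kappa, add.max = 10) {
  n = length(y.vec)
  grad = as.vector(crossprod(x.mat, x.mat %*% b.vec - y.vec))/n  # gradient of the loss
  N = which(b.vec == 0)                                          # inactive set
  viol = N[abs(grad[N]) > kappa]                                 # KKT violations on N
  viol = viol[order(abs(grad[viol]), decreasing = TRUE)]
  head(viol, add.max)  # add at most 'add.max' variables at a time, as in Remark 1
}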
In practice, we want to approximate the whole solution path or surface of the minimizer \(\hat{\mathbf{\beta}}^\lambda\) as a function of \(\lambda\). For this purpose, we first construct a decreasing sequence \(\lambda_{\max}=\lambda_0>\lambda_1>\cdots>\lambda_{n-1}>\lambda_n=\lambda_{\min}\) and then obtain the corresponding sequence of minimizers \(\hat{\mathbf{\beta}}^{\lambda_0},\ldots,\hat{\mathbf{\beta}}^{\lambda_n}\). Let \(\partial Q_{\lambda}\left(\mathbf{0}\right)\) be the subdifferential of \(Q_{\lambda}\) at \(\mathbf{0}\). Then we can see that \(\mathbf{0}\in\partial Q_{\lambda}\left(\mathbf{0}\right)=\{\nabla L\left(\mathbf{0}\right)+{\mathbf{\delta}}:\max_j|\delta_j|\leq \kappa_\lambda\}\) for any \(\kappa_\lambda\geq\kappa_{\lambda_{\max}}=\max_j|\partial L\left(\mathbf{0}\right)/\partial\beta_j|\), which implies that the \(p\)-dimensional zero vector is the exact minimizer of \(Q_\lambda\left({\mathbf{\beta}}\right)\) when \(\kappa_\lambda\geq\kappa_{\lambda_{\max}}\). Hence, we start from the largest value \(\lambda=\lambda_{\max}\) that satisfies \(\kappa_{\lambda_{\max}}=\max_j|\partial L\left(\mathbf{0}\right)/\partial\beta_j|\), and then we continue down to \(\lambda=\lambda_{\min}=\epsilon\lambda_{\max}\), where \(\epsilon\) is a predetermined ratio such as \(\epsilon=0.01\). Once we obtain the minimizer \(\hat{\mathbf{\beta}}^{\lambda_{k-1}}\), it is easy to find \(\hat{\mathbf{\beta}}^{\lambda_k}\) by using \(\hat{\mathbf{\beta}}^{\lambda_{k-1}}\) as an initial solution in Algorithm 3, since it is expected to be close to \(\hat{\mathbf{\beta}}^{\lambda_k}\) for a finely divided \(\lambda\) sequence. This scheme is called the warm start strategy, which makes the algorithm more stable and efficient (Friedman et al. 2010).
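A minimal sketch of this construction for the squared-error loss, where \(|\partial L(\mathbf{0})/\partial\beta_j|=|{\bf x}_j^T{\bf y}|/n\), is given below; it assumes \(\kappa_\lambda=\lambda\) (as for the SCAD and MCP) and uses geometric spacing, which is one common choice of a finely divided sequence.
lambda.seq = function(y.vec, x.mat, n.lambda = 100, eps = 0.01) {
  lam.max = max(abs(crossprod(x.mat, y.vec)))/length(y.vec)    # lambda_max
  exp(seq(log(lam.max), log(eps*lam.max), length.out = n.lambda))
}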
In this section, we introduce the main functions of the package ncpen, along with their various options and other user-friendly functions. The next section illustrates, through numerical examples, how the various options in the main function make a difference in data analysis.
The R package ncpen contains the main functions: ncpen() for fitting various non-convex penalized regression models, and cv.ncpen() and gic.ncpen() for selecting the optimal model from a sequence of the regularization path based on cross-validation and a generalized information criterion (Wang et al. 2007, 2009; Fan and Tang 2013; Wang et al. 2013). In addition, several useful functions are implemented in the package: sam.gen.ncpen() for generating synthetic data from various models with the correlation structure in (11), plot() for graphical representation, coef() for extracting coefficients from the fitted object, and predict() for making predictions from a new design matrix. The following shows the basic usage of the main functions:
## linear regression with scad penalty
library(ncpen)
n=100; p=10
sam = sam.gen.ncpen(n=n,p=p,q=5,cf.min=0.5,cf.max=1,corr=0.5,family="gaussian")
x.mat = sam$x.mat; y.vec = sam$y.vec
fit = ncpen(y.vec=y.vec,x.mat=x.mat,family="gaussian",penalty="scad")
coef(fit); plot(fit)
# prediction from a new dataset
test.n=20
newx = matrix(rnorm(test.n*p),nrow=test.n,ncol=p)
predict(fit,new.x.mat=newx)
# selection of the optimal model based on the cross-validation
cv.fit = cv.ncpen(y.vec=y.vec,x.mat=x.mat,family="gaussian",penalty="scad")
coef(cv.fit)
# selection of the optimal model using the generalized information criterion
fit = ncpen(y.vec=y.vec,x.mat=x.mat,family="gaussian",penalty="scad")
gic.ncpen(fit)
The main function ncpen() provides various options which produce different penalized estimators. The other packages for non-convex penalized regression also have similar options: \(\ell_2\)-regularization and penalty weights. However, the package ncpen provides two unique options for standardization and initial values. Below, we briefly describe the options in the main function ncpen().
The option alpha in the main functions forces the algorithm to solve the following penalized problem with the \(\ell_2\)-regularization or ridge effect (Zou and Hastie 2005): \[\label{obs:ridge} Q_{\lambda}\left({\mathbf{\beta}}\right) = \frac{1}{n}\sum_{i=1}^{n}\ell_i\left({\mathbf{\beta}}\right) + \alpha\sum_{j=1}^{p}J_{\lambda}(|\beta_j|) + \left(1-\alpha\right)\lambda\sum_{j=1}^{p}\beta_j^2, \tag{9}\] where \(\alpha\in[0,1]\) is the value from the option alpha, which is the mixing parameter between the penalty \(J_{\lambda}\) and the ridge penalty. The objective function in (9) includes the elastic net (Zou and Hastie 2005) when \(J_{\lambda}\left(t\right)=\lambda t\) and the Mnet (Huang et al. 2016a) when \(J_{\lambda}\left(\cdot\right)\) is the MC penalty. By controlling the option alpha, we can treat the problem with highly correlated variables, and it also makes the algorithm more stable since the minimum eigenvalue of the Hessian matrix is increased by up to the factor \(\left(1-\alpha\right)\lambda\) (Zou and Hastie 2005; Lee and Breheny 2015; Huang et al. 2016a).
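For example, a SCAD fit with a small ridge component can be obtained as follows (a sketch reusing y.vec and x.mat from the usage section above):
fit.net = ncpen(y.vec=y.vec, x.mat=x.mat, family="gaussian", penalty="scad", alpha=0.7)
plot(fit.net)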
We can give different weights for each observation and penalty by the options obs.weight and pen.weight, which provide the minimizer of \[\label{obs:weight} Q_{\lambda}\left({\mathbf{\beta}}\right) = \sum_{i=1}^{n} d_i \ell_i\left({\mathbf{\beta}}\right) + \sum_{j=1}^p w_j J_{\lambda}(|\beta_j|), \tag{10}\] where \(d_i\) is the weight for the \(i\)th observation and \(w_j\) is the penalty weight for the \(j\)th variable. For example, controlling observation weights is required for the linear regression model with heteroscedastic error variance. Further, we can compute adaptive versions of penalized estimators by giving different penalty weights as in the adaptive LASSO (Zou 2006).
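The following sketch illustrates adaptive-style penalty weights built from a crude initial estimate, reusing y.vec and x.mat from the usage section above; the marginal least-squares initial is used purely for illustration (in practice, the LASSO or maximum likelihood estimator would be a more natural choice).
b.init = apply(x.mat, 2, function(x) coef(lm(y.vec ~ x))[2])  # crude marginal estimates
w.vec = 1/pmax(abs(b.init), 1e-4)                             # heavier penalty for smaller estimates
fit.w = ncpen(y.vec=y.vec, x.mat=x.mat, family="gaussian", penalty="scad", pen.weight=w.vec)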
It is common practice to standardize variables prior to fitting penalized models, but one may opt not to. Hence, we provide the option x.standardize for flexible analysis. The option x.standardize=TRUE means that the algorithm solves the original penalized problem in (1) with the standardized (scaled) variables, and then the resulting solution \(\hat\beta_j\) is converted to the original scale by \(\hat\beta_j / s_j\), where \(s_j=\sum_{i=1}^{n}x_{ij}^2/n\). When the penalty \(J_{\lambda}\) is the LASSO penalty, this procedure is equivalent to solving the following penalized problem: \[Q^{s}_{\lambda}\left({\mathbf{\beta}}\right) = L\left({\mathbf{\beta}}\right) + \sum_{j=1}^p\lambda_j|\beta_j|,\] where \(\lambda_j=\lambda s_j\), which is another adaptive version of the LASSO that differs from the adaptive LASSO (Zou 2006).
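For example, the two versions of the same SCAD fit can be compared as follows (again a sketch reusing y.vec and x.mat from the usage section):
fit.std = ncpen(y.vec=y.vec, x.mat=x.mat, family="gaussian", penalty="scad", x.standardize=TRUE)
fit.raw = ncpen(y.vec=y.vec, x.mat=x.mat, family="gaussian", penalty="scad", x.standardize=FALSE)
par(mfrow=c(1,2)); plot(fit.std); plot(fit.raw)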
We introduced the warm start strategy to speed up the algorithm, but the solution path, in fact, depends on the initial solution of the CCCP algorithm because of the non-convexity. The option local=TRUE in ncpen supplies the same initial value, specified by the option local.initial, to each CCCP iteration for all \(\lambda\) values. Using the option local=TRUE makes the algorithm slower, but the performance of the resulting estimator is often improved when a good initial value, such as the maximum likelihood estimator or the LASSO, is provided.
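A sketch of this option is given below; the crude initial value is for illustration only (a LASSO or maximum likelihood initial is preferable), and local.initial is assumed to accept a length-\(p\) coefficient vector.
b.init = apply(x.mat, 2, function(x) coef(lm(y.vec ~ x))[2])  # crude initial, illustration only
fit.loc = ncpen(y.vec=y.vec, x.mat=x.mat, family="gaussian", penalty="scad",
                local=TRUE, local.initial=b.init)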
We consider the linear and logistic regression models to calculate the total elapsed time for constructing the solution path over 100 \(\lambda\) values: \[\label{exm:kwon} y={\bf x}^T{\mathbf{\beta}}+\varepsilon~~and~~ \mathbf{P}\left(y=1|{\bf x}\right)=\frac{\exp({\bf x}^T{\mathbf{\beta}})}{1+\exp({\bf x}^T{\mathbf{\beta}})}, \tag{11}\] where \({\bf x}\sim N_p\left(\mathbf{0},\Sigma\right)\) with \(\Sigma_{jk}=0.5^{|j-k|}\), \(\beta_j=1/j\) for \(j,k=1,\cdots,p\), and \(\varepsilon\sim N\left(0,1\right)\). The averaged elapsed times of ncpen over 100 random repetitions are summarized in Tables 1 and 2 for various \(n\) and \(p\), where the penalties are the SCAD, MCP, truncated \(\ell_1\) (TLP), moderately clipped LASSO (CLASSO), sparse ridge (SR), modified bridge (MBR), and modified log (MLOG). For comparison, we also run ncvreg for the SCAD. The results show that all methods in ncpen are feasible for high-dimensional data.
Table 1: Averaged elapsed times (in seconds) over 100 random repetitions for various \(n\); the ncvreg column uses the SCAD, and the remaining columns are penalties from ncpen.

| Model | \(n\) | ncvreg | SCAD | MCP | TLP | CLASSO | SR | MBR | MLOG |
|---|---|---|---|---|---|---|---|---|---|
| Linear regression | 200 | 0.0226 | 0.1277 | 0.1971 | 0.0333 | 0.0696 | 0.0618 | 0.0620 | 0.0476 |
| | 400 | 0.0329 | 0.1082 | 0.2031 | 0.0662 | 0.1041 | 0.1025 | 0.1160 | 0.0919 |
| | 800 | 0.0347 | 0.1008 | 0.1867 | 0.0865 | 0.0993 | 0.1067 | 0.1425 | 0.1197 |
| | 1600 | 0.0665 | 0.2035 | 0.3170 | 0.1717 | 0.1847 | 0.1983 | 0.2669 | 0.2301 |
| | 3200 | 0.1394 | 0.4341 | 0.6173 | 0.3541 | 0.3962 | 0.4161 | 0.5505 | 0.4678 |
| | 6400 | 0.2991 | 0.9853 | 1.2045 | 0.7955 | 0.8788 | 0.9066 | 1.2281 | 1.0148 |
| Logistic regression | 200 | 0.0565 | 0.0454 | 0.0400 | 0.0391 | 0.0148 | 0.0160 | 0.0379 | 0.0411 |
| | 400 | 0.0787 | 0.1113 | 0.0971 | 0.0747 | 0.0556 | 0.0608 | 0.0969 | 0.0808 |
| | 800 | 0.0907 | 0.1570 | 0.1623 | 0.1198 | 0.0777 | 0.1015 | 0.1511 | 0.1298 |
| | 1600 | 0.1682 | 0.2965 | 0.3007 | 0.2294 | 0.1640 | 0.2088 | 0.3002 | 0.2451 |
| | 3200 | 0.3494 | 0.6480 | 0.6258 | 0.4655 | 0.3513 | 0.4423 | 0.6395 | 0.5305 |
| | 6400 | 0.7310 | 1.4144 | 1.3711 | 1.0268 | 0.8389 | 1.0273 | 1.4445 | 1.1827 |
Table 2: Averaged elapsed times (in seconds) over 100 random repetitions for various \(p\); the ncvreg column uses the SCAD, and the remaining columns are penalties from ncpen.

| Model | \(p\) | ncvreg | SCAD | MCP | TLP | CLASSO | SR | MBR | MLOG |
|---|---|---|---|---|---|---|---|---|---|
| Linear regression | 200 | 0.0150 | 0.0733 | 0.2201 | 0.0433 | 0.0629 | 0.0981 | 0.0909 | 0.0721 |
| | 400 | 0.0210 | 0.0664 | 0.1588 | 0.0532 | 0.0617 | 0.0678 | 0.0941 | 0.0813 |
| | 800 | 0.0538 | 0.1650 | 0.2172 | 0.1107 | 0.1505 | 0.1457 | 0.1750 | 0.1383 |
| | 1600 | 0.0945 | 0.2703 | 0.2946 | 0.1793 | 0.2253 | 0.2221 | 0.2672 | 0.2045 |
| | 3200 | 0.1769 | 0.5071 | 0.5032 | 0.3379 | 0.3972 | 0.3986 | 0.4801 | 0.3684 |
| | 6400 | 0.3439 | 1.0781 | 1.0228 | 0.7366 | 0.8001 | 0.8207 | 1.0210 | 0.7830 |
| Logistic regression | 200 | 0.0590 | 0.1065 | 0.1029 | 0.0750 | 0.0465 | 0.0696 | 0.0978 | 0.0804 |
| | 400 | 0.0568 | 0.1054 | 0.1044 | 0.0753 | 0.0453 | 0.0593 | 0.0941 | 0.0809 |
| | 800 | 0.1076 | 0.1555 | 0.1349 | 0.1103 | 0.0873 | 0.0934 | 0.1423 | 0.1163 |
| | 1600 | 0.1327 | 0.1944 | 0.1591 | 0.1419 | 0.1122 | 0.1151 | 0.1842 | 0.1460 |
| | 3200 | 0.2073 | 0.3120 | 0.2529 | 0.2382 | 0.1885 | 0.1948 | 0.3055 | 0.2415 |
| | 6400 | 0.3843 | 0.5893 | 0.4792 | 0.4646 | 0.3539 | 0.3576 | 0.5978 | 0.4677 |
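The following sketch shows how such timings can be produced for one replication of the linear model; note that the coefficient pattern generated by sam.gen.ncpen() differs from the \(1/j\) pattern in (11), so this only illustrates the mechanics of the measurement.
n = 1600; p = 500
sam = sam.gen.ncpen(n=n, p=p, q=10, cf.min=0.5, cf.max=1, corr=0.5, family="gaussian")
system.time(ncpen(y.vec=sam$y.vec, x.mat=sam$x.mat, family="gaussian", penalty="scad"))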
We compare the solution paths based on the diabetes samples available from the lars package (Efron et al. 2004), where the sample size is \(n=442\) and the number of covariates is \(p=64\), including quadratic and interaction terms. Figure 2 shows four plots: the top two panels draw the solution paths from the LASSO and the SCAD with \(\tau=3.7\) given by ncvreg, and the bottom two panels draw those from the SCAD with \(\tau=3.7\) based on ncpen with and without standardization of covariates. The solution paths from ncvreg and from ncpen with standardization are almost the same since ncvreg standardizes the covariates by default, whereas the path from ncpen without standardization is somewhat different. Figure 3 shows the solution paths from six penalties with standardization (the default) in ncpen: the MCP, truncated \(\ell_1\), modified log, modified bridge, moderately clipped LASSO, and sparse ridge.
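A sketch for reproducing paths like those in the bottom panels of Figure 2 is given below, assuming the diabetes data from lars (the component x2 contains the quadratic and interaction terms) and assuming that tau is the name of the concavity parameter in ncpen().
library(lars); library(ncpen)
data(diabetes)
x.mat = as.matrix(diabetes$x2); y.vec = as.vector(diabetes$y)
fit.std = ncpen(y.vec=y.vec, x.mat=x.mat, family="gaussian", penalty="scad", tau=3.7, x.standardize=TRUE)
fit.raw = ncpen(y.vec=y.vec, x.mat=x.mat, family="gaussian", penalty="scad", tau=3.7, x.standardize=FALSE)
par(mfrow=c(1,2)); plot(fit.std); plot(fit.raw)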
There are cases where we need to introduce the ridge penalty, and ncpen provides a hybrid version of the penalties: \(\alpha J_{\lambda}(|t|)+(1-\alpha)|t|^2\), where \(\alpha\) is the mixing parameter between the penalty \(J_{\lambda}\) and the ridge effect. For example, the non-convex penalties often produce parameter estimates that diverge to infinity for the logistic regression because of perfect fitting. Figure 4 shows the effect of the ridge penalty, where the prostate tumor gene expression data in spls are used for illustration. The solution paths using the top 50 variables with the highest variances are drawn for \(\alpha\in\{1,0.7,0.3,0\}\) for the SCAD and modified bridge penalties. The solution paths without the ridge effect (\(\alpha=1\)) tend to diverge as \(\lambda\) decreases and become stabilized as the ridge effect increases (\(\alpha\downarrow\)) (Lee and Breheny 2015).
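A sketch of this comparison is given below, assuming the prostate data in spls (prostate$x: gene expressions, prostate$y: binary outcome) and assuming that family="binomial" denotes the logistic model in ncpen().
library(spls); library(ncpen)
data(prostate)
top50 = order(apply(prostate$x, 2, var), decreasing=TRUE)[1:50]   # top 50 variables by variance
x.mat = prostate$x[, top50]; y.vec = prostate$y
par(mfrow=c(2,2))
for (a in c(1, 0.7, 0.3, 0))
  plot(ncpen(y.vec=y.vec, x.mat=x.mat, family="binomial", penalty="scad", alpha=a))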
We introduced the warm start strategy to speed up the algorithm, but the solution path, in fact, depends on the initial solution because of the non-convexity. For comparison, we use the prostate tumor gene expression data, and the results are displayed in Figure 5 and Table 3. In the figure, the left panels show the solution paths for the SCAD, MCP, and clipped LASSO obtained by the warm start, and the right panels show those obtained by using the LASSO as a global initial for the CCCP algorithm. Figure 5 shows that the two initialization strategies provide very different solution paths, which may result in different performances of the estimators. We compare the prediction accuracy and selectivity of the estimators under the two strategies. The results are obtained from 300 random partitions of the data set into two parts, training (70%) and test (30%) datasets. For each training data set, the optimal tuning parameter values are selected by 10-fold cross-validation, and then we compute the prediction error (the percentage of misclassified samples) on each test dataset and the number of selected nonzero variables on each training dataset. Table 3 shows that all methods with the global initial perform better than those with the warm start strategy. In summary, non-convex penalized estimation depends on the initial solution, and a good initial value would improve the performance of the non-convex penalized estimator.
Table 3: Prediction errors (%) and numbers of selected variables for the warm start and global initial strategies.

| Method | prediction error (warm start) | # variables (warm start) | prediction error (global initial) | # variables (global initial) |
|---|---|---|---|---|
| SCAD | 9.44 (.2757) | 1.19 (.0268) | 9.30 (.3091) | 7.02 (.3907) |
| MCP | 9.38 (.2774) | 1.14 (.0234) | 8.84 (.2657) | 7.41 (.3771) |
| TLP | 9.44 (.2647) | 1.18 (.0254) | 7.47 (.2677) | 19.01 (.1710) |
| CLASSO | 9.12 (.2749) | 4.65 (.1914) | 8.20 (.2785) | 8.14 (.1542) |
| SR | 9.38 (.2875) | 5.15 (.2290) | 7.94 (.2677) | 15.13 (.2283) |
| MBR | 9.78 (.2621) | 1.29 (.0352) | 8.21 (.3004) | 8.75 (.1712) |
| MLOG | 9.51 (.2627) | 1.16 (.0233) | 7.66 (.2746) | 15.21 (.1816) |
We have developed the R package ncpen for estimating generalized linear models with various non-convex (concave) penalties. The unified algorithm implemented in ncpen is flexible and efficient. The package also provides various user-friendly functions and user-specific options for different penalized estimators. The package is currently available under the general public license (GPL) from the Comprehensive R Archive Network at https://CRAN.R-project.org/package=ncpen. The internal optimization algorithms of ncpen are implemented in C++, benefiting from the Rcpp package (Eddelbuettel et al. 2011).
This research is supported by the Julian Virtue Professorship 2016-18 Endowment at the Pepperdine Graziadio Business School at Pepperdine University, and the National Research Foundation of Korea (NRF) funded by the Korea government (No. 2020R1I1A3071646 and 2020R1F1A1A01071036).