Shared Concepts

Elastic Net Selection

This section applies to actions in the regression action set.

When the method parameter value is ELASTICNET, the elastic net method proposed by Zou and Hastie (2005) is performed. The elastic net method, which bridges the LASSO method and ridge regression, strikes a balance between having a parsimonious model and borrowing strength from correlated regressors, by solving the least squares regression problem with constraints on both the sum of the absolute coefficients and the sum of the squared coefficients.

More specifically, the elastic net coefficients $\boldsymbol{\beta} = (\beta_1, \beta_2, \ldots, \beta_m)$ are the solution to the constrained optimization problem

$$\min_{\boldsymbol{\beta}} \; \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 \quad \text{subject to} \quad \sum_{j=1}^{m} |\beta_j| \le t_1, \qquad \sum_{j=1}^{m} \beta_j^2 \le t_2$$

This can be written as the equivalent Lagrangian form

$$\min_{\boldsymbol{\beta}} \; \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda_1 \sum_{j=1}^{m} |\beta_j| + \lambda_2 \sum_{j=1}^{m} \beta_j^2$$

The elastic net can be treated as a convex combination of the LASSO and ridge penalties; pure LASSO and pure ridge regression are the two limiting cases. If $t_1$ is set to a very large value or, equivalently, if $\lambda_1$ is set to 0, then the elastic net method reduces to ridge regression. If $t_2$ is set to a very large value or, equivalently, if $\lambda_2$ is set to 0, then the elastic net method reduces to LASSO. If $t_1$ and $t_2$ are both large or, equivalently, if $\lambda_1$ and $\lambda_2$ are both set to 0, then the elastic net method reduces to ordinary least squares regression.
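
The following sketch makes the limiting cases concrete. It is written in Python with NumPy (not part of the regression action set), and the function name and data are purely illustrative.

```python
import numpy as np

def lagrangian_objective(beta, X, y, lambda1, lambda2):
    """Lagrangian elastic net objective: least squares term plus an
    L1 penalty (weight lambda1) and a squared L2 penalty (weight lambda2)."""
    resid = y - X @ beta
    return resid @ resid + lambda1 * np.sum(np.abs(beta)) + lambda2 * np.sum(beta**2)

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
y = rng.standard_normal(20)
beta = rng.standard_normal(5)

# Limiting cases of the penalty weights:
#   lambda1 = 0            -> ridge regression objective
#   lambda2 = 0            -> LASSO objective
#   lambda1 = lambda2 = 0  -> ordinary least squares objective
print(lagrangian_objective(beta, X, y, lambda1=0.5, lambda2=0.5))
```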

The elastic net method can overcome the limitations of LASSO in the following three scenarios:

  • If you have more parameters than observations ($m > n$), the LASSO method selects at most $n$ variables before it saturates, because of the nature of the convex optimization problem. This can be a defect for a variable selection method. By contrast, the elastic net method can select more than $n$ variables in this case because of the ridge regression regularization.

  • If there is a group of variables that have high pairwise correlations, then whereas LASSO tends to select only one variable from that group, the elastic net method can select more than one variable.

  • If you have more observations than parameters ($n > m$) and there are high correlations between predictors, then it has been empirically observed that the prediction performance of LASSO is dominated by ridge regression. In this case, the elastic net method can achieve better prediction performance by using ridge regression regularization.

The Lagrangian form of the elastic net optimization problem can be reformulated as

$$\min_{\boldsymbol{\beta}} \; \|\tilde{\mathbf{y}} - \tilde{\mathbf{X}}\boldsymbol{\beta}\|^2 + \lambda_1 \sum_{j=1}^{m} |\beta_j|$$

where the augmented design matrix $\tilde{\mathbf{X}}$ and response $\tilde{\mathbf{y}}$ are defined by

$$\tilde{\mathbf{X}}_{(n+m) \times m} = \begin{pmatrix} \mathbf{X} \\ \sqrt{\lambda_2}\,\mathbf{I} \end{pmatrix}, \qquad \tilde{\mathbf{y}}_{(n+m) \times 1} = \begin{pmatrix} \mathbf{y} \\ \mathbf{0} \end{pmatrix}$$

This implies that for a given $\lambda_2$, the coefficients of the elastic net fit follow the same piecewise linear path as LASSO and can be computed by using the least angle regression (LARS) algorithm. Moreover, Zou and Hastie (2005) suggest rescaling the coefficients by $1 + \lambda_2$ to counteract the double shrinkage in the elastic net fit; this rescaling is applied when you specify the enScale subparameter.
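
The following NumPy sketch verifies the augmentation numerically: it builds $\tilde{\mathbf{X}}$ and $\tilde{\mathbf{y}}$ exactly as defined above and checks that the augmented least squares term equals the original least squares term plus the ridge penalty. The variable names are illustrative, and this is not the action's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 30, 8
X = rng.standard_normal((n, m))
y = rng.standard_normal(n)
beta = rng.standard_normal(m)
lambda2 = 0.7

# Augmented design matrix and response from the definition above
X_aug = np.vstack([X, np.sqrt(lambda2) * np.eye(m)])   # (n + m) x m
y_aug = np.concatenate([y, np.zeros(m)])                # (n + m) x 1

# The augmented least squares term absorbs the ridge penalty:
# ||y_aug - X_aug beta||^2 = ||y - X beta||^2 + lambda2 * ||beta||^2,
# so only the L1 penalty remains and a LASSO solver (such as LARS) applies.
lhs = np.sum((y_aug - X_aug @ beta) ** 2)
rhs = np.sum((y - X @ beta) ** 2) + lambda2 * np.sum(beta ** 2)
assert np.isclose(lhs, rhs)
```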

If you have a good estimate of $\lambda_2$, you can specify the value in the L2 subparameter. If you do not specify a value for $\lambda_2$, then by default the glm action searches for a value between 0 and 1 that is optimal according to the current value of the choose criterion (by default, the choose subparameter value is ‘SBC’).

Computing the entire solution path can be prohibitively expensive when $m$ is large. Instead of using the LARS algorithm, you can use other optimization techniques to solve the LASSO and elastic net problems for a reduced set of regularization parameters. You can reformulate the general elastic net objective into the following optimization problem:

$$\min_{\boldsymbol{\beta}} \; \frac{1}{2n} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \alpha\lambda \sum_{j=1}^{m} |\beta_j| + \frac{(1-\alpha)\lambda}{2} \sum_{j=1}^{m} \beta_j^2$$

where $\lambda$ is the regularization parameter and $\alpha$ is the mixing parameter that controls the balance between the LASSO penalty and the ridge penalty. If $\alpha = 1$, the problem reduces to LASSO regression; if $\alpha = 0$, the problem reduces to ridge regression. Because the LASSO penalty is nonsmooth, the optimization problem is not readily solved by traditional techniques that use gradient information. To solve the problem, the action takes the following three general approaches:

  • Uses the orthant-wise limited-memory quasi-Newton (OWL-QN) method (Andrew and Gao 2007), which is based on L-BFGS and can efficiently optimize $L_1$-regularized objective functions. This solver efficiently handles wide data, where $m \gg n$. You can use it by specifying the solver subparameter value ‘LBFGS’.

  • Uses the alternating direction method of multipliers (ADMM), which decomposes the objective into smooth and nonsmooth parts and solves the resulting subproblems efficiently by using augmented Lagrangians (Boyd et al. 2011). This solver efficiently handles tall data, where $n \gg m$. You can use it by specifying the solver subparameter value ‘ADMM’. This method is available only for the glm action.

  • Turns the nonsmooth LASSO penalty into a smooth penalty by using the reformulation $\beta = \beta_+ - \beta_-$, and thus $|\beta| = \beta_+ + \beta_-$, where $\beta_+ = \max(\beta, 0)$ and $\beta_- = \max(-\beta, 0)$. This reformulation converts the optimization to a constrained nonlinear problem (see the sketch after this list). You can use two solvers by specifying the solver subparameter values ‘BFGS’ and ‘NLP’. For low-dimensional problems, both solvers can provide accurate solutions.
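
A minimal sketch of the smooth reformulation in the last item is shown below. It minimizes the general objective above (with regularization parameter $\lambda$ and mixing parameter $\alpha$) over the nonnegative parts $\beta_+$ and $\beta_-$ by using SciPy's generic bound-constrained L-BFGS-B solver as a stand-in; it is not the action's BFGS or NLP solver, and the function name is hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def fit_elastic_net_smooth(X, y, lam, alpha):
    """Minimize 1/(2n)*||y - X beta||^2 + alpha*lam*||beta||_1
    + (1 - alpha)*lam/2*||beta||_2^2 via the split beta = bp - bm,
    which makes the L1 term smooth at the cost of bound constraints."""
    n, m = X.shape

    def objective(z):
        bp, bm = z[:m], z[m:]
        beta = bp - bm
        resid = y - X @ beta
        return (resid @ resid / (2 * n)
                + alpha * lam * np.sum(bp + bm)          # |beta_j| = bp_j + bm_j
                + (1 - alpha) * lam / 2 * np.sum(beta ** 2))

    bounds = [(0.0, None)] * (2 * m)                     # bp >= 0, bm >= 0
    res = minimize(objective, np.zeros(2 * m), method="L-BFGS-B", bounds=bounds)
    return res.x[:m] - res.x[m:]
```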

If you use one of the solvers to optimize the elastic net objective without supplying a list of regularization parameters or specifying the number of regularization parameters, the glm action uses a single heuristic value for regularization:

$$\lambda = \begin{cases} 0.001\,\lambda_{\max} & \text{if } n > m \\ 0.1\,\lambda_{\max} & \text{otherwise} \end{cases}$$

where $\alpha\lambda_{\max}$ is the smallest $L_1$ regularization weight for which all coefficients are shrunk to zero ($\boldsymbol{\beta} = \mathbf{0}$). Note that for ridge regression, when $\alpha = 0$, $\lambda_{\max}$ does not exist because the ridge penalty is not sparsity-inducing. In this case, elastic net selection computes $\lambda_{\max}$ by assuming that $\alpha = 1$ if you do not supply a list of regularization parameters.
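
The following sketch computes the heuristic default. The closed form for $\lambda_{\max}$ is not given above; the expression used here, $\lambda_{\max} = \max_j |\mathbf{x}_j^{\mathsf{T}} \mathbf{y}| / (n\alpha)$, is the standard value at which the $L_1$ penalty in the $1/(2n)$ least squares objective forces $\boldsymbol{\beta} = \mathbf{0}$ for centered data without an intercept, and it is offered here only as an illustrative assumption.

```python
import numpy as np

def default_lambda(X, y, alpha):
    """Single heuristic regularization value, following the rule above."""
    n, m = X.shape
    a = alpha if alpha > 0 else 1.0          # ridge (alpha = 0): assume alpha = 1
    # Assumed closed form for lambda_max (centered data, no intercept)
    lambda_max = np.max(np.abs(X.T @ y)) / (n * a)
    return 0.001 * lambda_max if n > m else 0.1 * lambda_max
```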

If you supply a list of regularization parameters by using the lambda subparameter, the action sorts the values in descending order and then selects candidate models. If you supply only the number of regularization parameters $n_\lambda$ by using the nLambda subparameter, the action constructs a series of $\lambda$ values by using the following mechanism:

$$\lambda_k = \lambda_{\max} \cdot \rho^{\,k-1}, \qquad k = 1, \ldots, n_\lambda$$

You can specify $\rho$ by using the rho subparameter. Otherwise, the action uses

$$\rho = \begin{cases} 10^{-4/(n_\lambda - 1)} & \text{if } n > m \\ 10^{-2/(n_\lambda - 1)} & \text{otherwise} \end{cases}$$
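
The two rules combine into a short computation, sketched below in NumPy with illustrative names. The default decay $\rho$ is chosen so that the smallest value in the sequence is $10^{-4}\lambda_{\max}$ when $n > m$ and $10^{-2}\lambda_{\max}$ otherwise.

```python
import numpy as np

def lambda_sequence(lambda_max, n_lambda, n, m, rho=None):
    """Geometric sequence lambda_k = lambda_max * rho**(k - 1), k = 1..n_lambda."""
    if rho is None:
        target = 1e-4 if n > m else 1e-2     # ratio of smallest to largest lambda
        rho = target ** (1.0 / (n_lambda - 1))
    return lambda_max * rho ** np.arange(n_lambda)
```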

The idea of penalization by using both LASSO and ridge penalty also extends to generalized linear models, including logistic regression models. For those models, the general elastic net objective function is formulated as

$$\min_{\boldsymbol{\beta}} \; -\frac{1}{n} L(\boldsymbol{\mu}(\boldsymbol{\beta}); \mathbf{y}) + \alpha\lambda \sum_{j=1}^{m} |\beta_j| + \frac{(1-\alpha)\lambda}{2} \sum_{j=1}^{m} \beta_j^2$$

where $L(\boldsymbol{\mu}(\boldsymbol{\beta}); \mathbf{y})$ is the log-likelihood function. If you do not supply a list of regularization parameters or specify the number of regularization parameters, the logistic action uses the default nLambda parameter value of 20 in order to choose a good candidate in a reasonable amount of time.
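
For the logistic case, the penalized objective can be written directly, as in the following sketch. It uses the Bernoulli log-likelihood for responses coded 0/1 and is written in a plain (not numerically stabilized) form for clarity; the function name is illustrative.

```python
import numpy as np

def logistic_elastic_net_objective(beta, X, y, lam, alpha):
    """Penalized negative average log-likelihood for a logistic model
    with mu = 1 / (1 + exp(-X beta)) and responses y in {0, 1}."""
    n = len(y)
    eta = X @ beta
    loglik = np.sum(y * eta - np.log1p(np.exp(eta)))   # Bernoulli log-likelihood
    penalty = (alpha * lam * np.sum(np.abs(beta))
               + (1 - alpha) * lam / 2 * np.sum(beta ** 2))
    return -loglik / n + penalty
```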

Last updated: March 05, 2026