COUNTREG Procedure

Variable Selection

Variable Selection Methods

Variable Selection Using an Information Criterion

This type of variable selection uses either Akaike's information criterion (AIC) or the Schwarz Bayesian criterion (SBC), together with either a forward selection method or a backward elimination method.

Forward selection starts from a small subset of variables. In each step, the variable that gives the largest decrease in the value of the information criterion specified in the CRITER= option (AIC or SBC) is added. The process stops when the next candidate to be added does not reduce the value of the information criterion by more than the amount specified in the LSTOP= option in the MODEL statement.

Backward elimination starts from a larger subset of variables. In each step, one variable is dropped based on the information criterion chosen.
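The forward-selection loop can be sketched in a few lines. The following Python sketch is purely illustrative (it is not PROC COUNTREG's implementation), and `toy_criterion` is a made-up stand-in for fitting a model and computing AIC or SBC:

```python
# Hypothetical sketch of forward selection driven by an information
# criterion; `criterion` stands in for a model fit that returns the
# criterion value for a given set of regressors.

def forward_select(candidates, criterion, lstop=0.001):
    """Greedy forward selection: add the variable that most reduces
    the criterion; stop when the best reduction falls below lstop."""
    selected = []
    current = criterion(selected)
    while True:
        remaining = [v for v in candidates if v not in selected]
        if not remaining:
            break
        # Evaluate each candidate model that adds one more variable.
        scored = [(criterion(selected + [v]), v) for v in remaining]
        best_score, best_var = min(scored)
        if current - best_score <= lstop:   # improvement too small: stop
            break
        selected.append(best_var)
        current = best_score
    return selected

# Toy criterion: pretend only Ment and Fem reduce the criterion; every
# added variable also pays a small AIC-like complexity penalty.
def toy_criterion(vars_):
    useful = {"Ment": 40.0, "Fem": 25.0}
    fit = 100.0 - sum(useful.get(v, 0.0) for v in set(vars_))
    return fit + 2.0 * len(vars_)

print(forward_select(["Fem", "Kid5", "Mar", "Ment"], toy_criterion))
# ['Ment', 'Fem']
```

Backward elimination follows the same pattern in reverse, dropping one variable per step.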

You can force a variable to be retained in the variable selection process by adding a RETAIN list to the SELECT=INFO (or SELECTVAR=INFO) option in your model. For example, suppose you add a RETAIN list to the SELECT=INFO option in your model as follows:

MODEL Art = Mar Kid5 Phd  / dist=negbin(p=2) SELECT=INFO(lstop=0.001 RETAIN(Phd));

This causes the variable selection process to consider only models that contain Phd as a regressor. As a result, Phd is guaranteed to appear as one of the regressor variables in whatever model the variable selection process produces. The resulting model is the "best" (relative to your selection criterion) of all the possible models that contain Phd.

When a ZEROMODEL statement is used in conjunction with a MODEL statement, then all the variables that appear in the ZEROMODEL statement are retained by default unless the ZEROMODEL statement itself contains a SELECT=INFO option.

For example, suppose you have the following:

MODEL Art = Mar Kid5 Phd  / dist=negbin(p=2) SELECT=INFO(lstop=0.001 RETAIN(Phd));
ZEROMODEL Art ~ Fem Ment / link=normal;

Then Phd is retained in the MODEL statement and all the variables in the ZEROMODEL statement (Fem and Ment) are retained as well. You can add an empty SELECT=INFO clause to the ZEROMODEL statement to indicate that all the variables in that statement are eligible for elimination (that is, need not be retained) during variable selection. For example:

MODEL Art = Mar Kid5 Phd  / dist=negbin(p=2) SELECT=INFO(lstop=0.001 RETAIN(Phd));
ZEROMODEL Art ~ Fem Ment / link=normal SELECT=INFO();

In this example, only Phd from the MODEL statement is guaranteed to be retained. All the other variables in the MODEL statement and all the variables in the ZEROMODEL statement are eligible for elimination.

Similarly, if your ZEROMODEL statement contains a SELECT=INFO option but your MODEL statement does not, then all the variables in the MODEL statement are retained, whereas only those variables listed in the RETAIN() list of the SELECT=INFO option for your ZEROMODEL statement are retained. For example:

MODEL Art = Mar Kid5 Phd  / dist=negbin(p=2) ;
ZEROMODEL Art ~ Fem Ment / link=normal SELECT=INFO(RETAIN(Ment));

Here, all the variables in the MODEL statement (Mar Kid5 Phd) are retained, but only the Ment variable in the ZEROMODEL statement is retained.

Variable Selection and Class Variables

When a model that contains a classification variable is evaluated, the classification variable is effectively replaced by a set of parameters, each of which corresponds to some level of the classification variable. This is known as levelizing the classification variable. In the following discussion, the parameters that result from levelizing a classification variable are called level-qualified parameters.

By default, as variable selection proceeds, PROC COUNTREG treats each level-qualified parameter as an effect in its own right. This is described as splitting the original classification variable effect. Thus, at any particular step during the variable selection process, a candidate model can contain all, none, or only some of the level-qualified parameters that result from levelizing a classification variable.

Variable Selection with Split Effects

For example, suppose that Fem and Ment are continuous variables and that Kid5 is a classification variable that has four levels: 0, 1, 2, and 3. Suppose your model is the following:

CLASS Kid5;
MODEL Art = Fem Kid5 Ment / dist=poisson SELECT=INFO( lstop=0.001 );

Levelizing the Kid5 classification variable produces four level-qualified parameters: Kid5_0, Kid5_1, Kid5_2, and Kid5_3. Because the Intercept is an effect in the model (by default), PROC COUNTREG eliminates the last level-qualified parameter for each levelized class variable in the model. This prevents problems that would otherwise ensue because of collinearity. In this case, PROC COUNTREG eliminates Kid5_3 from the model from the outset. Thus, Kid5_3 will never be included in any candidate model. PROC COUNTREG evaluates the following candidates at Step 1:

{Intercept, Fem}
{Intercept, Kid5_0}
{Intercept, Kid5_1}
{Intercept, Kid5_2}
{Intercept, Ment}

Note how each candidate contains either none or only one of the level-qualified parameters that result from levelizing the Kid5 classification variable. Thus, the classification variable Kid5 has been split: its associated level-qualified parameters are treated as individual effects. Suppose that {Intercept, Ment} is selected from among the candidates. Then PROC COUNTREG evaluates the following candidates at Step 2:

{Intercept, Ment, Fem}
{Intercept, Ment, Kid5_0}
{Intercept, Ment, Kid5_1}
{Intercept, Ment, Kid5_2}

Suppose that {Intercept, Ment, Fem} is selected from among the candidates. Then PROC COUNTREG evaluates the following candidates at Step 3:

{Intercept, Ment, Fem, Kid5_0}
{Intercept, Ment, Fem, Kid5_1}
{Intercept, Ment, Fem, Kid5_2}

Suppose that {Intercept, Ment, Fem, Kid5_0} is selected from among the candidates. Depending on the data, it is entirely possible that none of the Step 4 candidates improves the information criterion that is associated with the model that was selected at Step 3. As a result, the final selected model is:

{Intercept, Ment, Fem, Kid5_0}

As this example shows, when classification effects are split, it is possible for the final selected model to contain some, but not all, of the level-qualified parameters that are associated with the Kid5 classification variable.
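The levelization step described above can be sketched as a one-hot encoding that drops the last level. This illustrative Python sketch is not PROC COUNTREG code; the parameter names such as `Kid5_0` mirror the example:

```python
# A minimal sketch of levelizing a classification variable: each level
# becomes a 0/1 indicator column, and the last level is dropped to
# avoid collinearity with the intercept.

def levelize(name, values, levels):
    """Return {param_name: indicator_column}, dropping the last level."""
    return {
        f"{name}_{lev}": [1 if v == lev else 0 for v in values]
        for lev in levels[:-1]            # last level is eliminated
    }

cols = levelize("Kid5", [0, 2, 1, 0, 3], [0, 1, 2, 3])
print(sorted(cols))           # ['Kid5_0', 'Kid5_1', 'Kid5_2']
print(cols["Kid5_0"])         # [1, 0, 0, 1, 0]
```

Note that `Kid5_3` never appears, just as in the selection steps above.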

Variable Selection without Split Effects

If you do not want the variable selection process in PROC COUNTREG to split classification effects as illustrated in the preceding section, specify the NOSPLITEFFECTS option (which can be abbreviated as NOSPLIT). When this option is specified, each candidate model that variable selection considers contains either all or none of the level-qualified parameters that result from levelizing a classification variable; no candidate ever contains only some of them.

Suppose your model is the following:

CLASS Kid5;
MODEL Art = Fem Kid5 Ment / dist=poisson SELECT=INFO( lstop=0.001 NOSPLIT );

Because the NOSPLIT option is specified, PROC COUNTREG evaluates the following candidates at Step 1:

{Intercept, Fem}
{Intercept, Kid5_0, Kid5_1, Kid5_2}
{Intercept, Ment}

Note how each candidate contains either all or none of the level-qualified parameters that result from levelizing the Kid5 classification variable. Thus, the classification variable Kid5 is not split: its associated level-qualified parameters are not treated as individual effects. Suppose that {Intercept, Ment} is selected from among the candidates. Then PROC COUNTREG evaluates the following candidates at Step 2:

{Intercept, Ment, Fem}
{Intercept, Ment, Kid5_0, Kid5_1, Kid5_2}

Suppose that {Intercept, Ment, Fem} is selected from among the candidates. Depending on the data, it is entirely possible that none of the Step 3 candidates improves the information criterion that is associated with the model that was selected at Step 2. As a result, the final selected model is:

{Intercept, Ment, Fem}

As this example shows, when the NOSPLIT option is specified, the final selected model contains either all or none of the level-qualified parameters that are associated with the Kid5 classification variable.
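The difference between the default (split) and NOSPLIT candidate sets can be sketched as follows. This is an illustrative Python sketch, not PROC COUNTREG code:

```python
# Sketch of how candidate effects differ with and without splitting:
# a levelized class variable is either a bundle of level-qualified
# parameters (NOSPLIT) or a set of individual effects (the default).

def candidate_effects(effects, split=True):
    """effects: list of either a string (continuous variable) or a
    list of level-qualified parameter names (a levelized class var)."""
    out = []
    for eff in effects:
        if isinstance(eff, list) and split:
            out.extend([p] for p in eff)   # each level is its own effect
        elif isinstance(eff, list):
            out.append(list(eff))          # keep the levels together
        else:
            out.append([eff])
    return out

effects = ["Fem", ["Kid5_0", "Kid5_1", "Kid5_2"], "Ment"]
print(candidate_effects(effects, split=True))
# [['Fem'], ['Kid5_0'], ['Kid5_1'], ['Kid5_2'], ['Ment']]
print(candidate_effects(effects, split=False))
# [['Fem'], ['Kid5_0', 'Kid5_1', 'Kid5_2'], ['Ment']]
```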

Classification Variables and the RETAIN Option

As described earlier in this section, if you want to constrain the variable selection process in such a way that it considers only candidates that include a certain variable, then you can use the RETAIN option. However, you cannot refer to a classification variable by name in the RETAIN list. Recall that by default, the variable selection process in PROC COUNTREG splits classification effects into individual effects that correspond to the levels of the classification variable. Thus, if you want to retain the original classification variable Kid5, you must list each of its level-qualified parameters by name. You can also retain some but not all of the level-qualified parameters. For example, to retain the level-qualified parameters Kid5_0 and Kid5_2 of the Kid5 classification variable, you would specify the RETAIN option as follows:

MODEL Art = Fem Kid5 Ment / dist=poisson
      SELECT=INFO( lstop=0.001 RETAIN(Kid5_0 Kid5_2) );

The RETAIN option can be used to retain effects only when the NOSPLITEFFECTS option is not specified; if the NOSPLITEFFECTS option is specified, the RETAIN option is ignored.

Classification Variables and the RETAINEFFECT Option

When the NOSPLITEFFECTS option is specified, you must use the RETAINEFFECT option if you want to constrain the variable selection process in such a way that it considers only candidates that include a certain variable. Any effect in your MODEL statement can be added to a RETAINEFFECT list. Thus, if you want to retain the original classification variable Kid5, you can refer to it by name in the RETAINEFFECT option as follows:

MODEL Art = Fem Kid5 Ment / dist=poisson
      SELECT=INFO( lstop=0.001  NOSPLIT  RETAINEFFECT(Kid5) );

Effects in other modeling statements can be retained in a similar fashion. In the following example, the RETAINEFFECT option in the ZEROMODEL statement causes the zero-inflated Kid5 classification variable to be retained:

MODEL Art = Fem Kid5 Ment / dist=ZIP SELECT=INFO( lstop=0.001  NOSPLIT );
ZEROMODEL Art ~ Mar Kid5 / SELECT=INFO( RETAINEFFECT(Kid5) );

Individual level-qualified parameters that are associated with a classification variable cannot be retained by using the RETAINEFFECT option. The RETAINEFFECT option can be used only when the NOSPLITEFFECTS option is specified; if the NOSPLITEFFECTS option is not specified, the RETAINEFFECT option is ignored.

Variable Selection Using Penalized Likelihood

Variable selection in the linear regression context can be achieved by adding some form of penalty on the regression coefficients. One particular such form is the $L_1$-norm penalty, which leads to the LASSO:

$$\min_{\beta} \left\| Y - X\beta \right\|^2 + \lambda \sum_{j=1}^{p} \left| \beta_j \right|$$
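For intuition about why the $L_1$ penalty can set coefficients exactly to zero, consider the special case of an orthonormal design, where the LASSO solution is obtained coordinate by coordinate via soft-thresholding of the least-squares estimate. This is a standard result and a minimal illustrative sketch, not part of PROC COUNTREG:

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator: the coordinate-wise LASSO solution
    for an orthonormal design (z is the least-squares coefficient,
    lam the penalty parameter)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

print(soft_threshold(3.0, 1.0))    # 2.0
print(soft_threshold(-0.4, 1.0))   # 0.0  (shrunk exactly to zero)
```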

This penalty method has become increasingly popular in linear regression because of computational developments in recent years. However, generalizing the penalty method for variable selection to more general statistical models is not trivial. Some work has been done for generalized linear models, in which the likelihood depends on the data through a linear combination of the parameters and the data:

$$\ell(\beta \mid x) = \ell\left(x^{T} \beta\right)$$

In the more general form, the likelihood as a function of the parameters can be denoted by $\ell(\theta) = \sum_i \ell_i(\theta)$, where $\theta$ is a vector that can include any parameters and $\ell_i(\cdot)$ is the likelihood for each observation. For example, in the Poisson model $\theta = (\beta_0, \beta_1, \ldots, \beta_p)$, and in the negative binomial model $\theta = (\beta_0, \beta_1, \ldots, \beta_p, \alpha)$. The following discussion introduces the penalty method by using the Poisson model as an example, but it applies similarly to the negative binomial model. The penalized likelihood function takes the form

$$Q(\beta) = \sum_i \ell_i(\beta) - n \sum_{j=1}^{p} p_{\lambda_j}\left(\left|\beta_j\right|\right)$$

The $L_1$-norm penalty function that is used in the calculation is specified as

$$p_{\lambda}\left(\left|\beta\right|\right) = \lambda \left|\beta\right|$$

The main challenge for this penalized likelihood method lies in the computation: the penalty function is nondifferentiable at zero, which poses a problem for the optimization. To get around this nondifferentiability, Fan and Li (2001) suggested a local quadratic approximation of the penalty function. However, the numerical performance of that approximation was later found to be unsatisfactory in several respects. Zou and Li (2008) proposed a local linear approximation (LLA) to solve the problem numerically. The algorithm replaces the penalty function with a linear approximation around a fixed point $\beta^{(0)}$:

$$p_{\lambda}\left(\left|\beta_j\right|\right) \approx p_{\lambda}\left(\left|\beta_j^{(0)}\right|\right) + p'_{\lambda}\left(\left|\beta_j^{(0)}\right|\right) \left( \left|\beta_j\right| - \left|\beta_j^{(0)}\right| \right)$$

Then the problem can be solved iteratively. Start from $\beta^{(0)} = \hat\beta_{M}$, the usual maximum likelihood estimate. For iteration $k$,

$$\beta^{(k+1)} = \arg\max_{\beta} \left\{ \sum_i \ell_i(\beta) - n \sum_{j=1}^{p} p'_{\lambda}\left(\left|\beta_j^{(k)}\right|\right) \left|\beta_j\right| \right\}$$

The algorithm stops when $\left\| \beta^{(k+1)} - \beta^{(k)} \right\|$ is small. To save computing time, you can also limit the number of iterations by specifying the LLASTEPS= option.
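The LLA iteration can be sketched in one dimension, where the inner weighted-$L_1$ maximization has a closed form. In this illustrative Python sketch (not PROC COUNTREG's implementation) the log-likelihood is assumed quadratic, $\sum_i \ell_i(b) = -\tfrac{n}{2}(b - z)^2$, so the update is a soft-threshold; a SCAD-style penalty derivative is used only to make the iteration nontrivial (for the pure $L_1$ penalty, $p'_{\lambda} = \lambda$ is constant and LLA converges in one step):

```python
# Illustrative 1-D sketch of the LLA iteration under a quadratic
# log-likelihood assumption; all names are made up for illustration.

def d_scad(t, lam=0.5, a=3.7):
    """Derivative of the SCAD penalty (Fan and Li 2001), illustrative:
    constant lam for small |beta|, tapering to zero for large |beta|."""
    return lam if t <= lam else max(a * lam - t, 0.0) / (a - 1.0)

def lla(z, d_penalty, llasteps=50, tol=1e-8):
    """Iterate beta^(k+1) = argmax_b { -(1/2)(b - z)^2 - w_k |b| },
    where w_k = p'_lambda(|beta^(k)|); llasteps mirrors LLASTEPS=."""
    beta = z                            # start from the MLE, beta^(0)
    for _ in range(llasteps):
        w = d_penalty(abs(beta))        # linearize penalty at beta^(k)
        new = max(z - w, 0.0) if z > 0 else min(z + w, 0.0)
        if abs(new - beta) < tol:       # stop when the change is small
            break
        beta = new
    return beta

print(round(lla(0.6, d_scad), 6))   # 0.1  (small effect: shrunk)
print(round(lla(2.0, d_scad), 6))   # 2.0  (large effect: unpenalized)
```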

The objective function is nondifferentiable. The optimization problem can be solved by constrained optimization methods after the variable transformation

$$\beta_j = \beta_j^{+} - \beta_j^{-}, \qquad \beta_j^{+} \geq 0, \quad \beta_j^{-} \geq 0$$

For each fixed tuning parameter $\lambda$, you can solve the preceding optimization problem to obtain an estimate of $\beta$. Because of the property of the $L_1$-norm penalty, some of the coefficients in $\beta$ can be exactly zero. The remaining question is how to choose the best tuning parameter $\lambda$. You can use either of the approaches that are described in the following subsections.

The GCV Approach

In the GCV approach, the generalized cross validation (GCV) criterion is computed for each value of $\lambda$ on a predetermined grid $\{\lambda_1, \ldots, \lambda_L\}$; the value of $\lambda$ that minimizes the GCV is the optimal tuning parameter. The maximum value $\lambda_L$ can be determined from Lemma 1 of Park and Hastie (2007) as follows. Suppose $\beta_0$ is free of penalty in the objective function. Let $\hat\beta_0$ be the MLE of $\beta_0$ obtained by forcing the rest of the parameters to be zero. Then the maximum value of $\lambda$ is

$$\begin{aligned}
\lambda_L &= \arg\max_{\lambda} \left\{ \lambda : \; \left| \frac{\partial \ell}{\partial \beta_j}\left(\hat\beta_0\right) \right| \leq n\, p'_{\lambda}\left(\left|\beta_j\right|\right), \; j = 1, \ldots, p \right\} \\
&= \max_{j = 1, \ldots, p} \left\{ \left| \frac{1}{n} \frac{\partial \ell}{\partial \beta_j}\left(\hat\beta_0\right) \right| \right\}
\end{aligned}$$

You can compute the GCV by using the LASSO framework. In the last step of the Newton-Raphson approximation, you have

$$\frac{1}{2} \min_{\beta} \left\| \left( \nabla^2 \ell\left(\beta^{(k)}\right) \right)^{1/2} \left( \beta - \beta^{(k)} \right) + \left( \nabla^2 \ell\left(\beta^{(k)}\right) \right)^{-1/2} \nabla \ell\left(\beta^{(k)}\right) \right\|^2 + n \sum_{j=1}^{p} p'_{\lambda}\left(\left|\beta_j^{(k)}\right|\right) \left|\beta_j\right|$$

The solution $\hat\beta$ satisfies

$$\hat\beta - \beta^{(k)} = -\left( \nabla^2 \ell\left(\beta^{(k)}\right) - 2 W^{-} \right)^{-1} \left( \nabla \ell\left(\beta^{(k)}\right) - 2 \mathbf{b} \right)$$

where

$$\begin{aligned}
W^{-} &= n \,\mathrm{diag}\left( W_1^{-}, \ldots, W_p^{-} \right) \\
W_j^{-} &= \begin{cases} \dfrac{p'_{\lambda}\left(\left|\beta_j^{(k)}\right|\right)}{\left|\beta_j\right|} & \text{if } \beta_j \neq 0 \\[6pt] 0 & \text{if } \beta_j = 0 \end{cases} \\
\mathbf{b} &= n \,\mathrm{diag}\left( p'_{\lambda}\left(\left|\beta_1^{(k)}\right|\right) \mathrm{sgn}(\beta_1), \ldots, p'_{\lambda}\left(\left|\beta_p^{(k)}\right|\right) \mathrm{sgn}(\beta_p) \right)
\end{aligned}$$

Note that the intercept term has no penalty on its absolute value, so the $W_j^{-}$ term that corresponds to the intercept is 0. More generally, you can leave any parameter in the likelihood function (such as $\alpha$ in the negative binomial model) free of penalty; such parameters are treated the same as the intercept.

The effective number of parameters is

$$\begin{aligned}
e(\lambda) &= \mathrm{trace} \left\{ \left( \nabla^2 \ell\left(\beta^{(k)}\right) \right)^{1/2} \left( \nabla^2 \ell\left(\beta^{(k)}\right) - 2 W^{-} \right)^{-1} \left( \nabla^2 \ell\left(\beta^{(k)}\right) \right)^{1/2} \right\} \\
&= \mathrm{trace} \left\{ \left( \nabla^2 \ell\left(\beta^{(k)}\right) - 2 W^{-} \right)^{-1} \nabla^2 \ell\left(\beta^{(k)}\right) \right\}
\end{aligned}$$

and the generalized cross validation error is

$$\mathrm{GCV}(\lambda) = \frac{\ell\left(\hat\beta\right)}{n \left[ 1 - e(\lambda)/n \right]^2}$$

The GCV1 Approach

Another form of GCV uses the number of nonzero coefficients as the degrees of freedom:

$$\begin{aligned}
e_1(\lambda) &= \sum_{j=0}^{p} \mathbf{1}_{\left[\beta_j \neq 0\right]} \\
\mathrm{GCV}_1(\lambda) &= \frac{\ell\left(\hat\beta\right)}{n \left[ 1 - e_1(\lambda)/n \right]^2}
\end{aligned}$$
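The GCV1 computation is simple enough to sketch directly. The following illustrative Python function (not PROC COUNTREG code) treats $\ell(\hat\beta)$ as a deviance-style quantity to be minimized, so a smaller GCV1 is better; all names are made up:

```python
# Sketch of the GCV1 criterion: the degrees of freedom e1 are the
# number of nonzero coefficients, and the criterion divides the fit
# quantity by n * (1 - e1/n)^2.

def gcv1(neg_loglik, beta_hat, n):
    e1 = sum(1 for b in beta_hat if b != 0)   # nonzero coefficients
    return neg_loglik / (n * (1 - e1 / n) ** 2)

# Two of four coefficients are nonzero, so e1 = 2 here.
print(gcv1(200.0, [1.2, 0.0, -0.3, 0.0], n=100))
```

Computing GCV1 over a grid of $\lambda$ values and keeping the minimizer would then select the tuning parameter.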

The standard errors follow the sandwich formula:

$$\begin{aligned}
\mathrm{cov}\left(\hat\beta\right) &= \left\{ \nabla^2 \ell\left(\beta^{(k)}\right) - 2 W^{-} \right\}^{-1} \widehat{\mathrm{cov}}\left( \nabla \ell\left(\beta^{(k)}\right) - 2 \mathbf{b} \right) \left\{ \nabla^2 \ell\left(\beta^{(k)}\right) - 2 W^{-} \right\}^{-1} \\
&= \left\{ \nabla^2 \ell\left(\beta^{(k)}\right) - 2 W^{-} \right\}^{-1} \widehat{\mathrm{cov}}\left( \nabla \ell\left(\beta^{(k)}\right) \right) \left\{ \nabla^2 \ell\left(\beta^{(k)}\right) - 2 W^{-} \right\}^{-1}
\end{aligned}$$

It is common practice to report only the standard errors of the nonzero parameters.

Variable Selection with a NOINT Model

If you specify the NOINT option in your MODEL statement, the model that variable selection produces always contains at least one effect from the original MODEL statement. If you request forward selection with a NOINT model and do not retain any main model effect, then the candidates for the single-effect model that is derived in the first step are only the effects that appear in the original MODEL statement. In all subsequent steps, all effects from the MODEL, ZEROMODEL, DISPMODEL, and SPATIALEFFECTS statements are candidates for inclusion in the model that is derived at that step. If instead you request backward elimination with a NOINT model, you do not retain a specific main model effect, and a model that contains only one effect from the original MODEL statement is derived at some step, then that effect remains in all the models that are evaluated in subsequent steps.

Last updated: June 19, 2025