Shared Concepts

Best-Subset Selection

This section applies to the glm action in the regression action set. When the method subparameter value is bestsubset, the action performs best-subset selection, which uses the branch-and-bound technique (Furnival and Wilson 1974) to efficiently search for the subsets of model effects that best predict the response variable. For a multiple regression model that contains $m$ predictors, there are $2^m - 1$ candidate subset models. Because the number of candidate submodels grows exponentially with $m$, exhaustively fitting every model in the model space quickly becomes impractical. If all possible candidates are arranged in a tree diagram in which the full model is at the root node and simple one-parameter models are at the leaf nodes, the problem becomes how to efficiently search the tree. Suppose that model $B$ consists of a subset of the model effects that are in model $A$, and that a criterion $g(\cdot)$, for which better models have smaller values, satisfies the inequality

$$g(A) \le g(B)$$

Then, when $g(A)$ at some node is already larger than the current best value of the criterion, you can avoid fitting any model such as $B$ farther down that branch. Thus the technique can greatly reduce the computational cost of searching for the best candidates.
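The pruning rule can be illustrated with a minimal Python sketch. It is not the action's implementation: it assumes the criterion is the residual sum of squares (which can only grow when effects are removed), it searches for a single best model of one fixed size, and the names rss and best_subset_bb are hypothetical helpers introduced only for this illustration.

```python
import random


def rss(X, y, cols):
    """Residual sum of squares for least squares of y on an intercept plus columns `cols`.

    Solves the normal equations by Gauss-Jordan elimination; assumes full column rank.
    """
    rows = [[1.0] + [row[j] for j in cols] for row in X]
    p = len(cols) + 1
    # Augmented normal equations [Z'Z | Z'y].
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
           + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(aug[r][c]))  # partial pivoting
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(p):
            if r != c:
                f = aug[r][c] / aug[c][c]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[c])]
    beta = [aug[i][p] / aug[i][i] for i in range(p)]
    return sum((yi - sum(b * z for b, z in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))


def best_subset_bb(X, y, k):
    """Return (columns, rss, models_fitted) for the size-k subset with the smallest RSS."""
    m = len(X[0])
    best = {"rss": float("inf"), "cols": None}
    fits = 0

    def visit(cols, first_drop):
        nonlocal fits
        fits += 1
        value = rss(X, y, cols)
        if value >= best["rss"]:
            return  # g(A) is no better than the incumbent, so no submodel B can be
        if len(cols) == k:
            best["rss"], best["cols"] = value, cols
            return
        # Children drop one effect; dropping in increasing position order visits each
        # subset once, and positions beyond k cannot lead to a subset of size k.
        for pos in range(first_drop, min(len(cols), k + 1)):
            visit(cols[:pos] + cols[pos + 1:], pos)

    visit(tuple(range(m)), 0)  # the full model is the root of the tree
    return best["cols"], best["rss"], fits


# Hypothetical demonstration data: 6 candidate predictors, 3 of which matter.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(40)]
y = [2 * r[0] - 3 * r[2] + r[4] + rng.gauss(0, 0.5) for r in X]
cols, value, fits = best_subset_bb(X, y, k=3)
print(cols, round(value, 3), "models fitted:", fits, "of", 2 ** 6 - 1, "candidates")
```

The count of fitted models that the sketch prints shows how much of the tree the bound skipped for this particular data set. The saving depends on the data and on the order in which branches are visited.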

Three model fit criteria are available for the search: R-square, adjusted R-square, and Mallows’ $C_p$ statistic. The computation time that the analysis requires depends heavily on the data and on the values of the best, minEffects, and maxEffects subparameters. In particular, for any of these three values of the select subparameter, when the value of the best subparameter is large, adding one more effect to the list from which regressors are selected might significantly increase the computation time.

RSQUARE

When the select subparameter value is rsquare, the action finds the subsets of effects that best predict a dependent variable by linear regression in the given data. You can specify the smallest and largest number of effects to appear in a subset and the number of subsets of each size to be selected. With the rsquare value, the action can efficiently perform all possible subset regressions and display the models in decreasing order of R-square within each subset size. Other statistics are available for comparing subsets of different sizes.
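The shape of this output can be sketched in Python as follows. This sketch enumerates every subset exhaustively instead of using branch and bound, and it is not the action's code. The function names r_square and top_subsets are hypothetical, and the arguments best, min_effects, and max_effects only mimic the roles that the text gives to the best, minEffects, and maxEffects subparameters.

```python
import random
from itertools import combinations


def r_square(X, y, cols):
    """R-square for least squares of y on an intercept plus columns `cols`.

    Solves the normal equations by Gauss-Jordan elimination; assumes full column rank.
    """
    rows = [[1.0] + [row[j] for j in cols] for row in X]
    p = len(cols) + 1
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
           + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(aug[r][c]))  # partial pivoting
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(p):
            if r != c:
                f = aug[r][c] / aug[c][c]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[c])]
    beta = [aug[i][p] / aug[i][i] for i in range(p)]
    sse = sum((yi - sum(b * z for b, z in zip(beta, r))) ** 2
              for r, yi in zip(rows, y))
    mean_y = sum(y) / len(y)
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - sse / sst


def top_subsets(X, y, best=2, min_effects=1, max_effects=None):
    """For each subset size, keep the `best` subsets in decreasing order of R-square."""
    m = len(X[0])
    max_effects = m if max_effects is None else max_effects
    report = {}
    for size in range(min_effects, max_effects + 1):
        scored = [(r_square(X, y, cols), cols)
                  for cols in combinations(range(m), size)]
        scored.sort(reverse=True)  # largest R-square first within this size
        report[size] = scored[:best]
    return report


# Hypothetical demonstration data with 4 candidate predictors.
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(30)]
y = [r[0] + 0.5 * r[3] + rng.gauss(0, 1) for r in X]
for size, models in top_subsets(X, y, best=2, max_effects=3).items():
    for r2, cols in models:
        print(size, cols, round(r2, 4))
```

Because R-square never decreases when an effect is added, the best R-square value rises with subset size. This is why R-square alone cannot be used to compare subsets of different sizes.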

The subset models that the action selects when you specify the rsquare value are optimal in terms of R-square for the given data, but they are not necessarily optimal for the population from which the data sample is drawn or for any other data for which you might want to make predictions. If a subset model is selected on the basis of a large R-square value or any other criterion commonly used for model selection, then all regression statistics that are computed for that model under the assumption that the model is given a priori, including the other statistics that the action computes, are biased.

Although specifying the rsquare value in the select subparameter is useful for building exploratory models, no single statistical method can be relied on to identify the "true" model. Effective model building requires substantive theory to suggest relevant predictors and plausible functional forms for the model.

Specifying the rsquare value in the select subparameter differs from the other selection specifications in that it always identifies the model that has the largest R-square value for each number of effects considered. The other selection methods are not guaranteed to find the model that has the largest R-square.

ADJRSQ

Specifying the adjrsq value in the select subparameter is similar to specifying the rsquare value, except that the adjusted R-square statistic is used as the criterion for selecting models, and the action finds the models that have the highest adjusted R-square value within the range of sizes.
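For reference, the usual textbook definition of adjusted R-square for a model that includes an intercept is shown below. The symbols $n$ (number of observations) and $p$ (number of parameters, including the intercept) are notation introduced here, and the exact form that the action uses (for example, for models without an intercept) is not stated in this section.

```latex
\[
  R^2_{\mathrm{adj}} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p}
\]
```

For fixed $n$ and $p$, this quantity increases with $R^2$, so subsets of the same size are ranked identically under either criterion. The two criteria can differ only when subsets of different sizes are compared.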

CP

Specifying the cp value in the select subparameter is similar to specifying the adjrsq value in the select subparameter, except that Mallows’ $C_p$ statistic is used as the criterion for the best-subset selection. Models are listed in ascending order of $C_p$.
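As a reminder, the standard definition of Mallows’ statistic for a subset model with $p$ parameters (including the intercept) is shown below. The notation is introduced here and is not taken from this section: $\mathrm{SSE}_p$ is the error sum of squares of the subset model, $n$ is the number of observations, and $\hat{\sigma}^2$ is the mean squared error of the full model.

```latex
\[
  C_p = \frac{\mathrm{SSE}_p}{\hat{\sigma}^2} - n + 2p
\]
```

For a fixed $p$, $C_p$ increases with $\mathrm{SSE}_p$, so within one subset size the ordering matches the R-square ordering. Across sizes, the $2p$ term penalizes each additional parameter.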

Last updated: March 05, 2026