QLIM Procedure

Selection Models

In sample selection models, one or several dependent variables are observed when another variable takes certain values. For example, the standard Heckman selection model can be defined as

$$z_i^* = \mathbf{w}_i'\boldsymbol{\gamma} + u_i$$

$$z_i = \begin{cases} 1 & \text{if } z_i^* > 0 \\ 0 & \text{if } z_i^* \le 0 \end{cases}$$

$$y_i = \mathbf{x}_i'\boldsymbol{\beta} + \epsilon_i \quad \text{if } z_i = 1$$

where $u_i$ and $\epsilon_i$ are jointly normal with 0 mean, standard deviations of 1 and $\sigma$, respectively, and correlation of $\rho$. Selection is based on the variable z, and y is observed when z has a value of 1. Least squares regression that uses the observed data of y produces inconsistent estimates of $\boldsymbol{\beta}$. The maximum likelihood method is used to estimate selection models. It is also possible to estimate these models by using Heckman’s method, which is more computationally efficient. However, it can be shown that the resulting estimates, although consistent, are not asymptotically efficient under a normality assumption. Moreover, this method often violates the constraint on the correlation coefficient, $|\rho| \le 1$.
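The inconsistency of least squares on the selected sample can be illustrated with a small simulation. The following is a minimal sketch, not part of PROC QLIM: the function name `selected_error_mean` and its arguments are hypothetical, and the selection index $\mathbf{w}_i'\boldsymbol{\gamma}$ is held at a single constant value for simplicity. It draws $(u_i, \epsilon_i)$ with correlation $\rho$ and averages $\epsilon_i$ over the draws that satisfy $z_i^* > 0$. When $\rho \ne 0$, that average is not zero, so a regression on the selected observations absorbs the difference.

```python
import math
import random

def selected_error_mean(n, rho, sigma, index, seed=0):
    """Average of epsilon over simulated draws that satisfy index + u > 0.

    index plays the role of w'gamma and is held constant here (an
    illustrative simplification, not something the model requires).
    """
    rng = random.Random(seed)
    total, count = 0.0, 0
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)   # selection-equation error, sd 1
        e = rng.gauss(0.0, 1.0)   # independent standard normal shock
        # epsilon has sd sigma and correlation rho with u
        eps = sigma * (rho * u + math.sqrt(1.0 - rho * rho) * e)
        if index + u > 0:         # observation is selected (z = 1)
            total += eps
            count += 1
    return total / count

# With rho = 0.7, sigma = 1.5, index = 0 the selected-sample mean of epsilon
# should be close to rho * sigma * phi(0) / Phi(0), which is about 0.84.
biased = selected_error_mean(20000, 0.7, 1.5, 0.0)
unbiased = selected_error_mean(20000, 0.0, 1.5, 0.0)
```

With $\rho = 0$ the same average is close to zero, which is the case in which ordinary least squares on the selected sample remains consistent.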

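The log-likelihood function of the Heckman selection model is written as

$$\ell = \sum_{i \in \{z_i = 0\}} \ln\left[1 - \Phi(\mathbf{w}_i'\boldsymbol{\gamma})\right] + \sum_{i \in \{z_i = 1\}} \left\{ \ln\phi\left(\frac{y_i - \mathbf{x}_i'\boldsymbol{\beta}}{\sigma}\right) - \ln\sigma + \ln\Phi\left(\frac{\mathbf{w}_i'\boldsymbol{\gamma} + \rho\,\dfrac{y_i - \mathbf{x}_i'\boldsymbol{\beta}}{\sigma}}{\sqrt{1 - \rho^2}}\right) \right\}$$

where $\phi$ and $\Phi$ denote the standard normal density and distribution functions.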
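Under the model above, an unselected observation contributes $\ln[1 - \Phi(\mathbf{w}_i'\boldsymbol{\gamma})]$, and a selected observation contributes the log density of $y_i$ plus the log probability of selection conditional on $y_i$. The following is a minimal sketch of evaluating that sum in plain Python. The function name `heckman_loglik` and the `(z, w, x, y)` tuple layout are assumptions made for illustration; this is not how PROC QLIM is implemented.

```python
import math

def norm_pdf(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def norm_cdf(t):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def heckman_loglik(gamma, beta, sigma, rho, data):
    """Log likelihood for data given as (z, w, x, y) tuples.

    For rows with z == 0 the values of x and y are ignored.
    Requires sigma > 0 and -1 < rho < 1.
    """
    total = 0.0
    for z, w, x, y in data:
        wg = dot(w, gamma)
        if z == 0:
            # not selected: probability that z* <= 0
            total += math.log(1.0 - norm_cdf(wg))
        else:
            r = (y - dot(x, beta)) / sigma           # standardized residual
            total += math.log(norm_pdf(r)) - math.log(sigma)
            # selection probability conditional on the observed y
            total += math.log(norm_cdf((wg + rho * r) / math.sqrt(1.0 - rho * rho)))
    return total
```

When `rho` is 0, the value separates into a probit log likelihood for z plus a normal regression log likelihood for the selected y, which is a convenient sanity check.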

Selection can be based on only one variable, but that variable can govern which of several dependent variables is observed. For example, selection is based on the variable z in the following switching regression model:

$$\begin{aligned} y_{1i} &= \mathbf{x}_{1i}'\boldsymbol{\beta}_1 + \epsilon_{1i} \quad \text{if } z_i = 0 \\ y_{2i} &= \mathbf{x}_{2i}'\boldsymbol{\beta}_2 + \epsilon_{2i} \quad \text{if } z_i = 1 \end{aligned}$$

If $z_i = 0$, then $y_{1i}$ is observed. If $z_i = 1$, then $y_{2i}$ is observed. Because $y_{1i}$ and $y_{2i}$ are never observed at the same time, the correlation between $y_{1i}$ and $y_{2i}$ cannot be estimated. Only the correlation between z and $y_{1i}$ and the correlation between z and $y_{2i}$ can be estimated. This estimation uses the maximum likelihood method.

A brief example of the SAS statements for this model can be found in Sample Selection Model.

The Heckman selection model can be extended to include censoring or truncation. For a brief example of the SAS statements for these models, see Sample Selection Model with Truncation and Censoring. The following example shows a variable that is censored from below at zero:

$$y_i^* = \mathbf{x}_i'\boldsymbol{\beta} + \epsilon_i \quad \text{if } z_i = 1$$

$$y_i = \begin{cases} y_i^* & \text{if } y_i^* > 0 \\ 0 & \text{if } y_i^* \le 0 \end{cases}$$

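In this case, the log-likelihood function of the Heckman selection model needs to be modified as follows to include the censored region:

$$\begin{aligned} \ell = {} & \sum_{i \in \{z_i = 0\}} \ln\left[1 - \Phi(\mathbf{w}_i'\boldsymbol{\gamma})\right] \\ & + \sum_{i \in \{z_i = 1,\ y_i > 0\}} \left\{ \ln\phi\left(\frac{y_i - \mathbf{x}_i'\boldsymbol{\beta}}{\sigma}\right) - \ln\sigma + \ln\Phi\left(\frac{\mathbf{w}_i'\boldsymbol{\gamma} + \rho\,\dfrac{y_i - \mathbf{x}_i'\boldsymbol{\beta}}{\sigma}}{\sqrt{1 - \rho^2}}\right) \right\} \\ & + \sum_{i \in \{z_i = 1,\ y_i = 0\}} \ln \int_{-\infty}^{-\mathbf{x}_i'\boldsymbol{\beta}/\sigma} \int_{-\mathbf{w}_i'\boldsymbol{\gamma}}^{\infty} \phi_2(u, v, \rho)\, du\, dv \end{aligned}$$

where $\phi_2(u, v, \rho)$ is the standard bivariate normal density with correlation $\rho$.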

In case $y_i$ is truncated from below at 0 instead of censored, the likelihood function can be written as

The basic selection model can also be extended to include the treatment effects models. You can find the details for treatment effects models in the section Endogenous Dummy Variable Models—Treatment Effects Regression.

Heckman’s Two-Step Selection Method

Sample selection bias arises from nonrandom selection of the sample from the population. A classic example is using a sample of market wages for working women to estimate the female labor supply function. This sample is nonrandom because it includes only the wages of women whose market wage exceeds their home wage at zero hours of work.

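A simple selection model can be written as the latent model

$$z_i^* = \mathbf{w}_i'\boldsymbol{\gamma} + u_i$$

$$z_i = \begin{cases} 1 & \text{if } z_i^* > 0 \\ 0 & \text{if } z_i^* \le 0 \end{cases}$$

$$y_i = \mathbf{x}_i'\boldsymbol{\beta} + \epsilon_i \quad \text{if } z_i = 1$$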

where $u_i$ and $\epsilon_i$ are jointly normal with 0 mean, standard deviations of 1 and $\sigma$, respectively, and correlation of $\rho$. The dependent variable $y_i$ (wage) is observed if the latent variable $z_i^*$ (the difference between market wage and reservation wage) is positive or, equivalently, if the indicator variable $z_i$ (labor force participation) is 1.

The model of interest that applies to the observations in the selected sample can be written as

$$E(y_i \mid \mathbf{x}_i, z_i = 1) = \mathbf{x}_i'\boldsymbol{\beta} + \rho\sigma\lambda(\mathbf{w}_i'\boldsymbol{\gamma})$$

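where $\lambda(\mathbf{w}_i'\boldsymbol{\gamma}) = \phi(\mathbf{w}_i'\boldsymbol{\gamma}) / \Phi(\mathbf{w}_i'\boldsymbol{\gamma})$. Hence, the following regression equation is valid for the observations for which $z_i = 1$: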

$$y_i = \mathbf{x}_i'\boldsymbol{\beta} + \rho\sigma\lambda(\mathbf{w}_i'\boldsymbol{\gamma}) + v_i$$

Therefore, estimates of $\boldsymbol{\beta}$ that are obtained from the OLS regression of y on $\mathbf{x}$ by using the selected sample (that is, the sample for which $z_i = 1$) suffer from omitted variable bias when selection bias is present. Although maximum likelihood estimation of $\boldsymbol{\beta}$ is consistent and efficient, Heckman’s two-step method is more frequently used. Heckman’s two-step method can be requested by specifying the HECKIT option of the QLIM statement.
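The omitted term in the selected-sample regression is the inverse Mills ratio, the standard normal density divided by the standard normal distribution function, evaluated at the selection index. The following is a minimal sketch of computing it and the implied conditional mean of y. The function names `inverse_mills` and `selected_mean` are hypothetical and are used only for illustration.

```python
import math

def inverse_mills(a):
    """lambda(a) = phi(a) / Phi(a) for the standard normal distribution.

    Note: for very negative a the denominator underflows; this simple
    version is intended for moderate values of a only.
    """
    pdf = math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))
    return pdf / cdf

def selected_mean(x, beta, w, gamma, rho, sigma):
    """E(y | x, z = 1) = x'beta + rho * sigma * lambda(w'gamma)."""
    xb = sum(xi * bi for xi, bi in zip(x, beta))
    wg = sum(wi * gi for wi, gi in zip(w, gamma))
    return xb + rho * sigma * inverse_mills(wg)
```

The ratio decreases as its argument increases, so observations that are very likely to be selected receive a correction near zero, and observations that are unlikely to be selected receive a large one.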

Heckman’s two-step method is as follows:

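1. Obtain $\hat{\boldsymbol{\gamma}}$, the estimate of the parameters of the probability that $z_i^* > 0$, by using regressors $\mathbf{w}_i$ and the binary dependent variable $z_i$ by probit analysis for the full sample. Compute $\hat{\lambda}_i = \phi(\mathbf{w}_i'\hat{\boldsymbol{\gamma}}) / \Phi(\mathbf{w}_i'\hat{\boldsymbol{\gamma}})$.
2. Obtain $\hat{\boldsymbol{\beta}}$ and $\hat{\beta}_\lambda$, the estimates of $\boldsymbol{\beta}$ and $\beta_\lambda = \rho\sigma$, by least squares regression of $y_i$ on $\mathbf{x}_i$ and $\hat{\lambda}_i$ by using observations on the selected subsample.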
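The two steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not PROC QLIM's implementation: the probit fit uses Fisher scoring (a choice made here for brevity), all function names are hypothetical, and the sketch returns point estimates only, without the corrected standard errors.

```python
import math

def pdf(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ols(X, y):
    """Least squares coefficients from the normal equations."""
    k = len(X[0])
    XtX = [[sum(r[a] * r[b] for r in X) for b in range(k)] for a in range(k)]
    Xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(k)]
    return solve(XtX, Xty)

def probit(W, z, iters=25):
    """Step 1: probit of z on W, fitted by Fisher scoring from gamma = 0."""
    k = len(W[0])
    g = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for w, zi in zip(W, z):
            a = dot(w, g)
            p = min(max(cdf(a), 1e-10), 1.0 - 1e-10)
            d = pdf(a)
            s = d * (zi - p) / (p * (1.0 - p))   # score weight
            h = d * d / (p * (1.0 - p))          # expected information weight
            for i in range(k):
                grad[i] += s * w[i]
                for j in range(k):
                    info[i][j] += h * w[i] * w[j]
        g = [gi + si for gi, si in zip(g, solve(info, grad))]
    return g

def heckman_two_step(W, z, X, y):
    """Return (gamma_hat, beta_hat, beta_lambda_hat)."""
    gamma = probit(W, z)
    Xs, ys = [], []
    for w, zi, x, yi in zip(W, z, X, y):
        if zi == 1:                               # selected subsample only
            a = dot(w, gamma)
            Xs.append(list(x) + [pdf(a) / cdf(a)])  # append lambda_hat
            ys.append(yi)
    coef = ols(Xs, ys)                            # step 2
    return gamma, coef[:-1], coef[-1]
```

Each row of `W` and `X` is expected to include a leading 1.0 if an intercept is wanted. Values of `y` for rows with `z == 0` are never used.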

The standard least squares estimators of the population variance $\sigma^2$ and the variances of the estimated coefficients are incorrect. To test hypotheses, the correct ones need to be calculated. An estimator of $\sigma^2$ is

$$\hat{\sigma}^2 = \frac{1}{N_1} \sum_{i=1}^{N_1} e_i^2 + \hat{\beta}_\lambda^2 \frac{1}{N_1} \sum_{i=1}^{N_1} \hat{\delta}_i$$

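where $N_1$ is the selected subsample size, $e_i$ is the residual for the ith observation obtained from step 2, and $\hat{\delta}_i = \hat{\lambda}_i(\hat{\lambda}_i + \mathbf{w}_i'\hat{\boldsymbol{\gamma}})$. Let $\mathbf{X}_*$ be an $N_1 \times (k+1)$ matrix with ith row $[\mathbf{x}_i' \;\; \hat{\lambda}_i]$, where $k$ is the number of regressors in $\mathbf{x}_i$, and define $\mathbf{W}$ similarly with ith row $\mathbf{w}_i'$. Then the estimator of the asymptotic covariance of $[\hat{\boldsymbol{\beta}}, \hat{\beta}_\lambda]$ is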

$$\text{Est.Asy.Var}[\hat{\boldsymbol{\beta}}, \hat{\beta}_\lambda] = \hat{\sigma}^2 [\mathbf{X}_*'\mathbf{X}_*]^{-1} [\mathbf{X}_*'(\mathbf{I} - \hat{\rho}^2 \hat{\boldsymbol{\Delta}})\mathbf{X}_* + \mathbf{Q}] [\mathbf{X}_*'\mathbf{X}_*]^{-1}$$

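where $\hat{\rho}^2 = \hat{\beta}_\lambda^2 / \hat{\sigma}^2$, $\hat{\boldsymbol{\Delta}} = \text{diag}(\hat{\delta}_i)$, and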

$$\mathbf{Q} = \hat{\rho}^2 (\mathbf{X}_*'\hat{\boldsymbol{\Delta}}\mathbf{W}) \, \text{Est.Asy.Var}(\hat{\boldsymbol{\gamma}}) \, (\mathbf{W}'\hat{\boldsymbol{\Delta}}\mathbf{X}_*)$$

where $\text{Est.Asy.Var}(\hat{\boldsymbol{\gamma}})$ is the estimator of the asymptotic covariance of the probit coefficients that are obtained in step 1. When you specify the HECKIT option, PROC QLIM uses a numerically estimated asymptotic variance.
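The scalar pieces of this correction are simple to compute once step 2 has been run. The following is a minimal sketch, assuming the step-2 residuals, the inverse Mills ratio values, and the fitted selection indexes are already available as lists. The function name `sigma2_and_rho2` is hypothetical, and the definition of `delta` as `lam * (lam + index)` is the usual one for this estimator and is stated here as an assumption.

```python
def sigma2_and_rho2(residuals, lambdas, indexes, beta_lambda):
    """Corrected variance estimate and implied squared correlation.

    residuals   : step-2 residuals e_i for the selected subsample
    lambdas     : inverse Mills ratio values lambda_hat_i
    indexes     : fitted selection indexes w_i' gamma_hat
    beta_lambda : step-2 coefficient on lambda_hat
    """
    n1 = len(residuals)
    deltas = [lam * (lam + a) for lam, a in zip(lambdas, indexes)]
    sigma2 = (sum(e * e for e in residuals) / n1
              + beta_lambda ** 2 * sum(deltas) / n1)
    rho2 = beta_lambda ** 2 / sigma2
    return sigma2, rho2
```

Nothing in this calculation forces `rho2` to be at most 1, so the implied correlation from the two-step method can fall outside the admissible range.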

When the HECKIT option is specified, PROC QLIM reports the corrected standard errors for $\hat{\boldsymbol{\beta}}$ automatically. However, if you need the conventional OLS standard errors, you can specify the HECKIT(UNCORRECTED) option.

In the selected regression model, when the coefficient of $\lambda$ is 0, you do not need Heckman’s two-step estimation method; a simple regression of y on $\mathbf{x}$ produces consistent estimates for $\boldsymbol{\beta}$, and the OLS standard errors are correct. Thus, a standard t test on $\beta_\lambda$ (which uses the estimate from step 2 and the uncorrected standard errors) is a valid test of the null hypothesis of no selection bias.

Although Heckman’s two-step method uses the OLS method in the second stage, you can request the ML method by specifying the HECKIT(SECONDSTAGE=ML) option. When the second-stage method is the ML method, the model for $y$ can be nonlinear.

Last updated: June 19, 2025