QLIM Procedure

Selection Models

In sample selection models, one or several dependent variables are observed when another variable takes certain values. For example, the standard Heckman selection model can be defined as

\[
z_i^* = \mathbf{w}_i'\boldsymbol{\gamma} + u_i
\]
\[
z_i = \begin{cases} 1 & \text{if } z_i^* > 0 \\ 0 & \text{if } z_i^* \le 0 \end{cases}
\]
\[
y_i = \mathbf{x}_i'\boldsymbol{\beta} + \epsilon_i \quad \text{if } z_i = 1
\]

where $u_i$ and $\epsilon_i$ are jointly normal with 0 mean, standard deviations of 1 and $\sigma$, respectively, and correlation of $\rho$. Selection is based on the variable $z$, and $y$ is observed when $z$ has a value of 1. Least squares regression that uses the observed data of $y$ produces inconsistent estimates of $\boldsymbol{\beta}$. The maximum likelihood method is used to estimate selection models. It is also possible to estimate these models by using Heckman’s method, which is more computationally efficient. But it can be shown that the resulting estimates, although consistent, are not asymptotically efficient under a normality assumption. Moreover, this method often violates the constraint on the correlation coefficient $|\rho| \le 1$.

The log-likelihood function of the Heckman selection model is written as

\[
\begin{aligned}
\ell = {} & \sum_{i \in \{z_i = 0\}} \ln\left[1 - \Phi(\mathbf{w}_i'\boldsymbol{\gamma})\right] \\
& + \sum_{i \in \{z_i = 1\}} \left\{ \ln\phi\left(\frac{y_i - \mathbf{x}_i'\boldsymbol{\beta}}{\sigma}\right) - \ln\sigma + \ln\Phi\left(\frac{\mathbf{w}_i'\boldsymbol{\gamma} + \rho\,\dfrac{y_i - \mathbf{x}_i'\boldsymbol{\beta}}{\sigma}}{\sqrt{1-\rho^2}}\right) \right\}
\end{aligned}
\]
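As an illustration, the log-likelihood above can be evaluated directly. The following Python sketch is not PROC QLIM code. The function names (`norm_pdf`, `norm_cdf`, `heckman_loglik`) and the convention of passing the precomputed linear indexes $\mathbf{w}_i'\boldsymbol{\gamma}$ and $\mathbf{x}_i'\boldsymbol{\beta}$ are assumptions made for this example.

```python
import math


def norm_pdf(t):
    """Standard normal density phi(t)."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def norm_cdf(t):
    """Standard normal distribution function Phi(t)."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def heckman_loglik(z, y, wg, xb, sigma, rho):
    """Log-likelihood of the Heckman selection model.

    z  : selection indicators (0 or 1)
    y  : outcomes (values where z == 0 are ignored)
    wg : linear indexes w_i'gamma of the selection equation
    xb : linear indexes x_i'beta of the outcome equation
    Requires sigma > 0 and |rho| < 1.
    """
    total = 0.0
    for zi, yi, wgi, xbi in zip(z, y, wg, xb):
        if zi == 0:
            # ln[1 - Phi(w'gamma)], written as ln Phi(-w'gamma)
            total += math.log(norm_cdf(-wgi))
        else:
            r = (yi - xbi) / sigma  # standardized outcome residual
            total += math.log(norm_pdf(r)) - math.log(sigma)
            total += math.log(
                norm_cdf((wgi + rho * r) / math.sqrt(1.0 - rho * rho))
            )
    return total
```

A numerical optimizer could maximize this function over $\boldsymbol{\gamma}$, $\boldsymbol{\beta}$, $\sigma$, and $\rho$. The sketch makes no attempt to guard against underflow for extreme index values.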

Selection can be based on only one variable, but that variable can determine which of several dependent variables is observed. For example, selection is based on the variable $z$ in the following switching regression model:

\[
z_i^* = \mathbf{w}_i'\boldsymbol{\gamma} + u_i
\]
\[
z_i = \begin{cases} 1 & \text{if } z_i^* > 0 \\ 0 & \text{if } z_i^* \le 0 \end{cases}
\]
\[
\begin{aligned}
y_{1i} &= \mathbf{x}_{1i}'\boldsymbol{\beta}_1 + \epsilon_{1i} \quad \text{if } z_i = 0 \\
y_{2i} &= \mathbf{x}_{2i}'\boldsymbol{\beta}_2 + \epsilon_{2i} \quad \text{if } z_i = 1
\end{aligned}
\]

If $z = 0$, then $y_1$ is observed. If $z = 1$, then $y_2$ is observed. Because $y_1$ and $y_2$ are never observed at the same time, the correlation between $y_1$ and $y_2$ cannot be estimated. Only the correlation between $z$ and $y_1$ and the correlation between $z$ and $y_2$ can be estimated. This estimation uses the maximum likelihood method.

A brief example of the SAS statements for this model can be found in Sample Selection Model.

The Heckman selection model can be extended to include censoring or truncation. For a brief example of the SAS statements for these models, see Sample Selection Model with Truncation and Censoring. The following example shows a variable $y_i$ that is censored from below at zero:

\[
z_i^* = \mathbf{w}_i'\boldsymbol{\gamma} + u_i
\]
\[
z_i = \begin{cases} 1 & \text{if } z_i^* > 0 \\ 0 & \text{if } z_i^* \le 0 \end{cases}
\]
\[
y_i^* = \mathbf{x}_i'\boldsymbol{\beta} + \epsilon_i \quad \text{if } z_i = 1
\]
\[
y_i = \begin{cases} y_i^* & \text{if } y_i^* > 0 \\ 0 & \text{if } y_i^* \le 0 \end{cases}
\]

In this case, the log-likelihood function of the Heckman selection model needs to be modified as follows to include the censored region:

\[
\begin{aligned}
\ell = {} & \sum_{\{i \mid z_i = 0\}} \ln\left[1 - \Phi(\mathbf{w}_i'\boldsymbol{\gamma})\right] \\
& + \sum_{\{i \mid z_i = 1,\; y_i = y_i^*\}} \left\{ \ln\left[\phi\left(\frac{y_i - \mathbf{x}_i'\boldsymbol{\beta}}{\sigma}\right)\right] - \ln\sigma + \ln\left[\Phi\left(\frac{\mathbf{w}_i'\boldsymbol{\gamma} + \rho\,\dfrac{y_i - \mathbf{x}_i'\boldsymbol{\beta}}{\sigma}}{\sqrt{1-\rho^2}}\right)\right] \right\} \\
& + \sum_{\{i \mid z_i = 1,\; y_i = 0\}} \ln \int_{-\infty}^{-\mathbf{x}_i'\boldsymbol{\beta}/\sigma} \int_{-\mathbf{w}_i'\boldsymbol{\gamma}}^{\infty} \phi_2(u, v, \rho)\, du\, dv
\end{aligned}
\]

If $y_i$ is truncated from below at 0 instead of censored, the log-likelihood function can be written as

\[
\begin{aligned}
\ell = {} & \sum_{\{i \mid z_i = 0\}} \ln\left[1 - \Phi(\mathbf{w}_i'\boldsymbol{\gamma})\right] \\
& + \sum_{\{i \mid z_i = 1\}} \left\{ \ln\left[\phi\left(\frac{y_i - \mathbf{x}_i'\boldsymbol{\beta}}{\sigma}\right)\right] - \ln\sigma + \ln\left[\Phi\left(\frac{\mathbf{w}_i'\boldsymbol{\gamma} + \rho\,\dfrac{y_i - \mathbf{x}_i'\boldsymbol{\beta}}{\sigma}}{\sqrt{1-\rho^2}}\right)\right] - \ln\left[\Phi\left(\mathbf{x}_i'\boldsymbol{\beta}/\sigma\right)\right] \right\}
\end{aligned}
\]
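Compared with the untruncated model, the only change for a selected observation is the extra term $-\ln[\Phi(\mathbf{x}_i'\boldsymbol{\beta}/\sigma)]$. The following Python sketch evaluates the contribution of one selected observation under truncation from below at 0. The function name `truncated_selected_term` and its argument layout are hypothetical and are not part of PROC QLIM.

```python
import math


def norm_pdf(t):
    """Standard normal density phi(t)."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def norm_cdf(t):
    """Standard normal distribution function Phi(t)."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def truncated_selected_term(y, xb, wg, sigma, rho):
    """Log-likelihood contribution of one selected observation (z = 1)
    when y is truncated from below at 0.

    xb is x'beta, wg is w'gamma; requires sigma > 0 and |rho| < 1.
    """
    r = (y - xb) / sigma  # standardized outcome residual
    untruncated = (
        math.log(norm_pdf(r))
        - math.log(sigma)
        + math.log(norm_cdf((wg + rho * r) / math.sqrt(1.0 - rho * rho)))
    )
    # Truncation adjustment: subtract ln Phi(x'beta / sigma)
    return untruncated - math.log(norm_cdf(xb / sigma))
```

Because $\Phi(\cdot) < 1$, the adjustment is always positive, so each selected observation contributes more to the log-likelihood than it would in the untruncated model.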

The basic selection model can also be extended to include the treatment effects models. You can find the details for treatment effects models in the section Endogenous Dummy Variable Models—Treatment Effects Regression.

Heckman’s Two-Step Selection Method

Sample selection bias arises from nonrandom selection of the sample from the population. A classic example is using a sample of market wages for working women to estimate the female labor supply function. This sample is nonrandom because it includes only the wages of women whose market wage exceeds their home wage at zero hours of work.

A simple selection model can be written as the latent model

\[
z_i^* = \mathbf{w}_i'\boldsymbol{\gamma} + u_i
\]
\[
z_i = \begin{cases} 1 & \text{if } z_i^* > 0 \\ 0 & \text{if } z_i^* \le 0 \end{cases}
\]
\[
y_i = \mathbf{x}_i'\boldsymbol{\beta} + \epsilon_i \quad \text{if } z_i = 1
\]

where $u_i$ and $\epsilon_i$ are jointly normal with 0 mean, standard deviations of 1 and $\sigma$, respectively, and correlation of $\rho$. The dependent variable $y_i$ (wage) is observed if the latent variable $z_i^*$ (the difference between market wage and reservation wage) is positive or, equivalently, if the indicator variable $z_i$ (labor force participation) is 1.

The model of interest that applies to the observations in the selected sample can be written as

\[
E(y_i \mid \mathbf{x}_i, z_i = 1) = \mathbf{x}_i'\boldsymbol{\beta} + \rho\sigma\lambda(\mathbf{w}_i'\boldsymbol{\gamma})
\]

where $\lambda(\mathbf{w}_i'\boldsymbol{\gamma}) = \phi(\mathbf{w}_i'\boldsymbol{\gamma}) / \Phi(\mathbf{w}_i'\boldsymbol{\gamma})$. Hence, the following regression equation is valid for the observations for which $z_i = 1$:

\[
y_i = \mathbf{x}_i'\boldsymbol{\beta} + \rho\sigma\lambda(\mathbf{w}_i'\boldsymbol{\gamma}) + v_i
\]

Therefore, estimates of $\boldsymbol{\beta}$ that are obtained from the OLS regression of $y$ on $\mathbf{x}$ by using the selected sample (that is, the sample for which $z_i = 1$) suffer from omitted variable bias if selection bias is present. Although maximum likelihood estimation of $\boldsymbol{\beta}$ is consistent and efficient, Heckman’s two-step method is more frequently used. Heckman’s two-step method can be requested by specifying the HECKIT option of the QLIM statement.
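The correction term in the selected-sample regression is the ratio of the standard normal density to the standard normal distribution function, commonly called the inverse Mills ratio. The following Python sketch evaluates it and the resulting conditional mean. The function names `inverse_mills` and `selected_mean` are illustrative assumptions, not PROC QLIM functions.

```python
import math


def inverse_mills(a):
    """lambda(a) = phi(a) / Phi(a) for a scalar index a = w'gamma."""
    pdf = math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))
    return pdf / cdf


def selected_mean(xb, wg, rho, sigma):
    """E(y | x, z = 1) = x'beta + rho * sigma * lambda(w'gamma)."""
    return xb + rho * sigma * inverse_mills(wg)
```

When $\rho = 0$ the correction vanishes and `selected_mean` returns $\mathbf{x}_i'\boldsymbol{\beta}$, which is the case in which OLS on the selected sample is not biased by selection.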

Heckman’s two-step method is as follows:

  1. Obtain $\hat{\boldsymbol{\gamma}}$, the estimate of the parameters of the probability that $z_i^* > 0$, by using regressors $\mathbf{w}_i$ and the binary dependent variable $z_i$ by probit analysis for the full sample. Compute $\hat{\lambda}_i = \lambda(\mathbf{w}_i'\hat{\boldsymbol{\gamma}})$.

  2. Obtain $\hat{\boldsymbol{\beta}}$ and $\hat{\beta}_\lambda$, the estimates of $\boldsymbol{\beta}$ and $\rho\sigma$, by least squares regression of $y_i$ on $\mathbf{x}_i$ and $\hat{\lambda}_i$ by using observations on the selected subsample.
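Step 2 can be sketched in a few lines of Python. This is an illustration only and does not reproduce PROC QLIM's implementation. It assumes that the step 1 probit fit has already been done elsewhere and that the fitted indexes $\mathbf{w}_i'\hat{\boldsymbol{\gamma}}$ are passed in as `wg_hat`. The names `solve` and `heckit_step2` are hypothetical.

```python
import math


def inverse_mills(a):
    """lambda(a) = phi(a) / Phi(a)."""
    pdf = math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))
    return pdf / cdf


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def heckit_step2(z, y, X, wg_hat):
    """Least squares of y on [x, lambda_hat] over the selected subsample.

    X holds one regressor row per observation (include 1.0 for an
    intercept). Returns (beta_hat, beta_lambda_hat, residuals, lambdas).
    """
    rows, ys, lams = [], [], []
    for zi, yi, xi, wgi in zip(z, y, X, wg_hat):
        if zi == 1:  # keep only selected observations
            lam = inverse_mills(wgi)
            rows.append(list(xi) + [lam])
            ys.append(yi)
            lams.append(lam)
    k = len(rows[0])
    # Normal equations (X*'X*) b = X*'y
    XtX = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    Xty = [sum(r[a] * yi for r, yi in zip(rows, ys)) for a in range(k)]
    coef = solve(XtX, Xty)
    resid = [yi - sum(c * v for c, v in zip(coef, r)) for r, yi in zip(rows, ys)]
    return coef[:-1], coef[-1], resid, lams
```

Solving the normal equations directly keeps the sketch dependency-free. A production implementation would more likely use a QR-based least squares routine for numerical stability.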

The standard least squares estimators of the population variance $\sigma^2$ and the variances of the estimated coefficients are incorrect. To test hypotheses, the correct ones need to be calculated. An estimator of $\sigma^2$ is

\[
\hat{\sigma}^2 = \frac{1}{N_1} \sum_{i=1}^{N_1} e_i^2 + \hat{\beta}_\lambda^2 \, \frac{1}{N_1} \sum_{i=1}^{N_1} \hat{\delta}_i
\]

where $N_1$ is the selected subsample size, $e_i$ is the residual for the $i$th observation obtained from step 2, and $\hat{\delta}_i = \hat{\lambda}_i^2 + \hat{\lambda}_i \mathbf{w}_i'\hat{\boldsymbol{\gamma}}$. Let $\mathbf{X}_*$ be an $N_1 \times (K+1)$ matrix with $i$th row $[\mathbf{x}_i' \;\; \lambda_i]$, and define $\mathbf{W}$ similarly with $i$th row $\mathbf{w}_i'$. Then the estimator of the asymptotic covariance of $[\hat{\boldsymbol{\beta}}, \hat{\beta}_\lambda]$ is

\[
\mathrm{Est.Asy.Var}[\hat{\boldsymbol{\beta}}, \hat{\beta}_\lambda] = \hat{\sigma}^2 \left[\mathbf{X}_*'\mathbf{X}_*\right]^{-1} \left[\mathbf{X}_*'(\mathbf{I} - \hat{\rho}^2\hat{\boldsymbol{\Delta}})\mathbf{X}_* + \mathbf{Q}\right] \left[\mathbf{X}_*'\mathbf{X}_*\right]^{-1}
\]

where $\hat{\rho}^2 = \hat{\beta}_\lambda^2 / \hat{\sigma}^2$, $\hat{\boldsymbol{\Delta}} = \mathrm{diag}(\hat{\delta}_i)$, and

\[
\mathbf{Q} = \hat{\rho}^2 \left(\mathbf{X}_*'\hat{\boldsymbol{\Delta}}\mathbf{W}\right) \mathrm{Est.Asy.Var}(\hat{\boldsymbol{\gamma}}) \left(\mathbf{W}'\hat{\boldsymbol{\Delta}}\mathbf{X}_*\right)
\]

where $\mathrm{Est.Asy.Var}(\hat{\boldsymbol{\gamma}})$ is the estimator of the asymptotic covariance of the probit coefficients that are obtained in step 1. When you specify the HECKIT option, PROC QLIM uses a numerically estimated asymptotic variance.
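The scalar quantities that enter the corrected covariance, namely $\hat{\delta}_i$, $\hat{\sigma}^2$, and $\hat{\rho}^2$, follow directly from the step 2 output. The Python sketch below computes them under the definitions given above. The function name `heckit_variance_terms` and its argument layout are assumptions made for illustration, and the matrix products that form the full covariance estimator are omitted.

```python
def heckit_variance_terms(resid, lams, wg_hat, beta_lambda):
    """Return (sigma2_hat, rho2_hat, deltas) for the selected subsample.

    resid       : step 2 residuals e_i
    lams        : lambda_hat_i values
    wg_hat      : fitted probit indexes w_i'gamma_hat
    beta_lambda : step 2 coefficient on lambda_hat
    """
    n1 = len(resid)
    # delta_hat_i = lambda_hat_i^2 + lambda_hat_i * w_i'gamma_hat
    deltas = [lam * lam + lam * wg for lam, wg in zip(lams, wg_hat)]
    sigma2 = sum(e * e for e in resid) / n1 + beta_lambda ** 2 * sum(deltas) / n1
    rho2 = beta_lambda ** 2 / sigma2
    return sigma2, rho2, deltas
```

Nothing in this calculation forces `rho2` to be at most 1. If the residuals are small relative to $\hat{\beta}_\lambda$, the ratio can exceed 1, which is one way the two-step method can violate the constraint $|\rho| \le 1$.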

When the HECKIT option is specified, PROC QLIM reports the corrected standard errors for $[\hat{\boldsymbol{\beta}}, \hat{\beta}_\lambda]$ automatically. However, if you need the conventional OLS standard errors, you can specify the HECKIT(UNCORRECTED) option.

In the selected regression model, when the coefficient of $\lambda(\mathbf{w}_i'\boldsymbol{\gamma})$ is 0, you do not need Heckman’s two-step estimation method; a simple regression of $y$ on $\mathbf{x}$ produces consistent estimates for $\boldsymbol{\beta}$, and the OLS standard errors are correct. Thus, a standard $t$ test on $\hat{\beta}_\lambda$ (which uses the estimate from step 2 and the uncorrected standard errors) is a valid test of the null hypothesis of no selection bias.

Although Heckman’s two-step method uses the OLS method in the second stage, you can request the ML method by specifying the HECKIT(SECONDSTAGE=ML) option. When the second-stage method is the ML method, the model for $y_i$ can be nonlinear.

Last updated: June 19, 2025