QLIM Procedure

Endogeneity and Instrumental Variables

The models that PROC QLIM estimates, such as qualitative response and limited dependent variable models, assume that the errors are independent of the explanatory variables. If this assumption fails to hold, the distributional form on which the likelihood is based is misspecified and the estimated coefficients are inconsistent.

To begin, consider a linear model

$$y_i = y_i^* = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_k x_{ki} + u_i$$

Assume that $E(u) = 0$, $\mathrm{Cov}(x_j, u) = 0$ for $j = 1, \ldots, k-1$, and $\mathrm{Cov}(x_k, u) = \rho \neq 0$. Therefore, $x_k$ is endogenous. Endogeneity can arise from many sources, such as measurement error in $x_k$ or the omission of a variable that is correlated with $x_k$. If you ignore the endogeneity, you can estimate this model in PROC QLIM as follows (assuming $k = 4$):

proc qlim data=a;
   model y = x1 x2 x3 x4;
run;

However, this approach produces inconsistent maximum likelihood estimates. To obtain consistent maximum likelihood estimates, you should consider the joint density of the dependent variable and the endogenous variables. To do this in PROC QLIM, you need at least one instrument, that is, an observable variable $z_1$ that is not in the structural equation and that satisfies two conditions: $z_1$ is exogenous (that is, $\mathrm{Cov}(z_1, u) = 0$), and $z_1$ is correlated with the endogenous regressor $x_k$. Then, you can model $x_k$ as

$$x_{ki} = \pi_0 + \pi_1 x_{1i} + \cdots + \pi_{k-1} x_{(k-1)i} + \theta z_{1i} + \epsilon_i$$

You can now write this reduced form equation along with the structural equation to obtain the consistent maximum likelihood estimates as follows:

proc qlim data=a;
   model y = x1 x2 x3 x4;
   model x4 = x1 x2 x3 z1;
run;
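
Because the structural model in this example is linear, two-stage least squares offers a simple cross-check on the FIML estimates. The following statements are a minimal sketch under stated assumptions: the DATA step simulates a hypothetical data set a (the seed, sample size, coefficient values, and the helper variables e, v, and u are all arbitrary choices made for illustration), and PROC SYSLIN from SAS/ETS then fits the same structural equation by 2SLS, using z1 as the excluded instrument.

```sas
/* Hypothetical simulation: every numeric value here is an illustrative assumption */
data a;
   call streaminit(12345);          /* fix the seed so the run is reproducible  */
   do i = 1 to 1000;
      x1 = rand('normal');
      x2 = rand('normal');
      x3 = rand('normal');
      z1 = rand('normal');          /* instrument: excluded from the y equation */
      e  = rand('normal');          /* reduced form error                       */
      v  = rand('normal');
      u  = 0.6*e + v;               /* structural error, correlated with e      */
      x4 = 1 + 0.5*x1 - 0.5*x2 + 0.3*x3 + z1 + e;   /* endogenous regressor     */
      y  = 1 + x1 + x2 + x3 + 2*x4 + u;             /* structural equation      */
      output;
   end;
   drop i e v u;
run;

/* Two-stage least squares: x4 is endogenous; x1-x3 and z1 are the instruments */
proc syslin data=a 2sls;
   endogenous x4;
   instruments x1 x2 x3 z1;
   model y = x1 x2 x3 x4;
run;
```

With data generated this way, the coefficient on x4 from the single-equation PROC QLIM fit would be expected to drift away from the value used in the simulation, whereas the 2SLS and two-equation QLIM estimates should both be close to it in large samples.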

Estimating the structural model together with the reduced form models for the endogenous explanatory variables gives you the full information maximum likelihood (FIML) estimates. Because of the linearity of the structural model, you can estimate it efficiently and more simply by using the two-stage least squares estimator. However, PROC QLIM handles nonlinear models such as qualitative response and limited dependent variable models, and in their estimation it maximizes the corresponding joint likelihood function (for more information and an application, see Wooldridge 2010, Section 15.7.3). In the case of endogeneity, when the reduced form models for the endogenous explanatory variables are written along with the structural model, PROC QLIM maximizes the likelihood function that is obtained from the joint density of the response variable and the endogenous explanatory variables. For example, consider the following censored regression model in which one of the explanatory variables is a continuous endogenous variable:

$$\begin{aligned}
y_{1i}^* &= \alpha y_{2i} + \mathbf{z}_{1i}'\boldsymbol{\beta} + u_i \\
y_{2i} &= \mathbf{z}_i'\boldsymbol{\pi} + \epsilon_i \\
y_{1i} &= \begin{cases} y_{1i}^* & \text{if } y_{1i}^* > 0 \\ 0 & \text{if } y_{1i}^* \le 0 \end{cases}
\end{aligned}$$

The exogenous explanatory variables are $\mathbf{z}_{1i}$, and the continuous endogenous explanatory variable is $y_{2i}$. The vector $\mathbf{z}_i$ contains $\mathbf{z}_{1i}$ together with the instruments.

The likelihood function to maximize is

$$L = \prod_{i \in \{y_{1i} > 0\}} f(y_{1i}, y_{2i}) \cdot \prod_{i \in \{y_{1i} = 0\}} \int_{-\infty}^{0} f(y_{1i}^*, y_{2i}) \, dy_{1i}^*$$

where $f(y_{1i}^*, y_{2i})$ is the joint density of $y_{1i}^*$ and $y_{2i}$. Note that $y_{1i}$ is substituted for $y_{1i}^*$ when $y_{1i} > 0$. If you assume $(u_i, \epsilon_i) \overset{iid}{\sim} N(\mathbf{0}, \boldsymbol{\Sigma})$ with $\boldsymbol{\Sigma} = \begin{pmatrix} \sigma_u^2 & \eta \\ \eta & \sigma_\epsilon^2 \end{pmatrix}$, then, by using $f(y_{1i}^*, y_{2i}) = f(y_{1i}^* \mid y_{2i}) \cdot f(y_{2i})$, you can write the likelihood function for each $i$ as a product of two parts. The first part is the probability density function of the normal distribution with mean $\mathbf{z}_i'\boldsymbol{\pi}$ and variance $\sigma_\epsilon^2$, and the second part follows a Tobit model that has latent mean $\alpha y_{2i} + \mathbf{z}_{1i}'\boldsymbol{\beta} + (\eta/\sigma_\epsilon^2)(y_{2i} - \mathbf{z}_i'\boldsymbol{\pi})$ and variance $\sigma_u^2 - (\eta^2/\sigma_\epsilon^2)$. Then, you can obtain the log-likelihood function by taking the log of this product and summing over $i$ (for more information, see Wooldridge 2002, Section 16.6.2). This is the log-likelihood function that PROC QLIM maximizes. The estimates $(\hat{\alpha}, \hat{\boldsymbol{\beta}}, \hat{\boldsymbol{\pi}}, \hat{\sigma}_u^2, \hat{\sigma}_\epsilon^2, \hat{\eta})$ that are obtained from this maximization are the FIML estimators. Assuming that the latent model includes two instrumental variables and two exogenous explanatory variables, you can estimate this model in PROC QLIM as follows:

proc qlim data=a;
   model y1 = y2 z11 z12 / censored(lb=0);
   model y2 = z11 z12 z21 z22;
run;
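
The factorization of the joint density into a normal density for the endogenous variable and a Tobit part can be written out explicitly. The following is a sketch of the per-observation log-likelihood contribution under the bivariate normality assumption stated earlier; the symbols $\mu_i$, $\tau^2$, and $\ell_i$ are shorthand introduced here for illustration and do not appear in the original text, and $\phi$ and $\Phi$ denote the standard normal density and distribution function.

```latex
% Conditional (Tobit) mean and variance of y_{1i}^* given y_{2i}
\[
\mu_i = \alpha y_{2i} + \mathbf{z}_{1i}'\boldsymbol{\beta}
        + \frac{\eta}{\sigma_\epsilon^2}\,\bigl(y_{2i} - \mathbf{z}_i'\boldsymbol{\pi}\bigr),
\qquad
\tau^2 = \sigma_u^2 - \frac{\eta^2}{\sigma_\epsilon^2}
\]

% Contribution of observation i: marginal density of y_{2i} plus the Tobit part
\[
\ell_i =
\log\!\left[\frac{1}{\sigma_\epsilon}\,
   \phi\!\left(\frac{y_{2i} - \mathbf{z}_i'\boldsymbol{\pi}}{\sigma_\epsilon}\right)\right]
+
\begin{cases}
   \log\!\left[\dfrac{1}{\tau}\,\phi\!\left(\dfrac{y_{1i} - \mu_i}{\tau}\right)\right]
      & \text{if } y_{1i} > 0 \\[2ex]
   \log \Phi\!\left(-\dfrac{\mu_i}{\tau}\right)
      & \text{if } y_{1i} = 0
\end{cases}
\]

% Sample log-likelihood
\[
\log L = \sum_{i} \ell_i
\]
```

The censored branch follows because the integral of a normal density with mean $\mu_i$ and variance $\tau^2$ over $(-\infty, 0]$ equals $\Phi(-\mu_i/\tau)$.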

For simple examples like the preceding ones, you can derive the likelihood function easily. However, as the number of endogenous explanatory variables increases, if these variables have a discontinuous nature, if simultaneity among equations exists, or if a combination of these occurs, then the derivation of the likelihood function becomes cumbersome, or, in some cases, the likelihood function does not even have a closed analytical form.

PROC QLIM can handle endogeneity regardless of the nature of the endogenous explanatory variables for a single structural model. In the case of one endogenous explanatory variable, PROC QLIM reports the FIML estimates that are calculated by using the analytical likelihood function that is obtained from the joint distribution of the dependent variable and the endogenous variable. When there is more than one endogenous explanatory variable, the analytical form of the likelihood function is usually not available; in this case PROC QLIM reports the simulated maximum likelihood estimates. For the simulated maximum likelihood estimation method, PROC QLIM uses the Geweke-Hajivassiliou-Keane (GHK) simulator (see, among others, Hajivassiliou, McFadden, and Ruud 1996) to simulate the joint distribution of the dependent variable and the endogenous variables. The simulation is facilitated by assuming that the error terms in the latent models for the dependent variable and the endogenous explanatory variables are distributed as multivariate normal.

When you estimate a model in PROC QLIM, you can take the endogeneity into account by writing the structural model along with the reduced form models for each endogenous variable. Examples are provided in the following sections.

Probit Model with a Continuous Endogenous Explanatory Variable

Consider a probit model that contains a single endogenous explanatory variable in addition to two instruments and two exogenous explanatory variables. The model is

$$\begin{aligned}
y_{1i}^* &= \alpha_1 y_{2i} + \beta_1 z_{1i} + \beta_2 z_{2i} + u_i \\
y_{2i}^* &= \pi_1 z_{1i} + \pi_2 z_{2i} + \pi_3 z_{3i} + \pi_4 z_{4i} + \epsilon_i \\
y_{1i} &= \begin{cases} 1 & \text{if } y_{1i}^* > 0 \\ 0 & \text{if } y_{1i}^* \le 0 \end{cases} \\
y_{2i} &= y_{2i}^*
\end{aligned}$$

where $\mathrm{Cov}(u, \epsilon) = \eta$. You can estimate this model by using the following statements:

proc qlim data=a;
   model y1 = y2 z1 z2 / discrete;
   model y2 = z1 z2 z3 z4;
run;

Probit Model with a Binary Endogenous Explanatory Variable

Consider a probit model that contains a single binary endogenous explanatory variable in addition to two instruments and two exogenous explanatory variables. The model is

$$\begin{aligned}
y_{1i}^* &= \alpha_1 y_{2i} + \beta_1 z_{1i} + \beta_2 z_{2i} + u_i \\
y_{2i}^* &= \pi_1 z_{1i} + \pi_2 z_{2i} + \pi_3 z_{3i} + \pi_4 z_{4i} + \epsilon_i \\
y_{1i} &= \begin{cases} 1 & \text{if } y_{1i}^* > 0 \\ 0 & \text{if } y_{1i}^* \le 0 \end{cases} \\
y_{2i} &= \begin{cases} 1 & \text{if } y_{2i}^* > 0 \\ 0 & \text{if } y_{2i}^* \le 0 \end{cases}
\end{aligned}$$

where $\mathrm{Cov}(u, \epsilon) = \eta$. You can estimate this model by using the following statements:

proc qlim data=a;
   model y1 = y2 z1 z2 / discrete;
   model y2 = z1 z2 z3 z4 / discrete;
run;

Probit Model with a Censored Endogenous Explanatory Variable

Consider a probit model that contains a single censored (below zero) endogenous explanatory variable in addition to two instruments and two exogenous explanatory variables. The model is

$$\begin{aligned}
y_{1i}^* &= \alpha_1 y_{2i} + \beta_1 z_{1i} + \beta_2 z_{2i} + u_i \\
y_{2i}^* &= \pi_1 z_{1i} + \pi_2 z_{2i} + \pi_3 z_{3i} + \pi_4 z_{4i} + \epsilon_i \\
y_{1i} &= \begin{cases} 1 & \text{if } y_{1i}^* > 0 \\ 0 & \text{if } y_{1i}^* \le 0 \end{cases} \\
y_{2i} &= \begin{cases} y_{2i}^* & \text{if } y_{2i}^* > 0 \\ 0 & \text{if } y_{2i}^* \le 0 \end{cases}
\end{aligned}$$

where $\mathrm{Cov}(u, \epsilon) = \eta$. You can estimate this model by using the following statements:

proc qlim data=a;
   model y1 = y2 z1 z2 / discrete;
   model y2 = z1 z2 z3 z4 / censored(lb=0);
run;

Censored Regression Model with a Binary Endogenous Explanatory Variable

Consider a Type 1 Tobit model that contains a single binary endogenous explanatory variable in addition to two instruments and two exogenous explanatory variables. The model is

$$\begin{aligned}
y_{1i}^* &= \alpha_1 y_{2i} + \beta_1 z_{1i} + \beta_2 z_{2i} + u_i \\
y_{2i}^* &= \pi_1 z_{1i} + \pi_2 z_{2i} + \pi_3 z_{3i} + \pi_4 z_{4i} + \epsilon_i \\
y_{1i} &= \begin{cases} y_{1i}^* & \text{if } y_{1i}^* > 0 \\ 0 & \text{if } y_{1i}^* \le 0 \end{cases} \\
y_{2i} &= \begin{cases} 1 & \text{if } y_{2i}^* > 0 \\ 0 & \text{if } y_{2i}^* \le 0 \end{cases}
\end{aligned}$$

where $\mathrm{Cov}(u, \epsilon) = \eta$. You can estimate this model by using the following statements:

proc qlim data=a;
   model y1 = y2 z1 z2 / censored(lb=0);
   model y2 = z1 z2 z3 z4 / discrete;
run;

Censored Regression Model with Binary and Continuous Endogenous Explanatory Variables

Consider a Type 1 Tobit model that contains binary and continuous endogenous explanatory variables in addition to two instruments and two exogenous explanatory variables. The model is

$$\begin{aligned}
y_{1i}^* &= \alpha_1 y_{21i} + \alpha_2 y_{22i} + \beta_1 z_{1i} + \beta_2 z_{2i} + u_i \\
y_{21i}^* &= \pi_{11} z_{1i} + \pi_{12} z_{2i} + \pi_{13} z_{3i} + \pi_{14} z_{4i} + \epsilon_{1i} \\
y_{22i}^* &= \pi_{21} z_{1i} + \pi_{22} z_{2i} + \pi_{23} z_{3i} + \pi_{24} z_{4i} + \epsilon_{2i} \\
y_{1i} &= \begin{cases} y_{1i}^* & \text{if } y_{1i}^* > 0 \\ 0 & \text{if } y_{1i}^* \le 0 \end{cases} \\
y_{21i} &= \begin{cases} 1 & \text{if } y_{21i}^* > 0 \\ 0 & \text{if } y_{21i}^* \le 0 \end{cases} \\
y_{22i} &= y_{22i}^*
\end{aligned}$$

where $\mathrm{Cov}(u, \epsilon_1, \epsilon_2) = \eta$; that is, the three errors are allowed to be correlated. You can estimate this model by using the following statements:

proc qlim data=a;
   model y1 = y21 y22 z1 z2 / censored(lb=0);
   model y21 = z1 z2 z3 z4  / discrete;
   model y22 = z1 z2 z3 z4;
run;

Probit Model with Binary, Censored, and Truncated Endogenous Explanatory Variables

Consider a probit model that contains binary, censored (below zero), and truncated (below zero) endogenous explanatory variables. The model is

$$\begin{aligned}
y_{1i}^* &= \alpha_1 y_{21i} + \alpha_2 y_{22i} + \alpha_3 y_{23i} + u_i \\
y_{21i}^* &= \pi_{11} z_{1i} + \pi_{12} z_{2i} + \pi_{13} z_{3i} + \pi_{14} z_{4i} + \epsilon_{1i} \\
y_{22i}^* &= \pi_{21} z_{1i} + \pi_{22} z_{2i} + \pi_{23} z_{3i} + \pi_{24} z_{4i} + \epsilon_{2i} \\
y_{23i}^* &= \pi_{31} z_{1i} + \pi_{32} z_{2i} + \pi_{33} z_{3i} + \pi_{34} z_{4i} + \epsilon_{3i} \\
y_{1i} &= \begin{cases} 1 & \text{if } y_{1i}^* > 0 \\ 0 & \text{if } y_{1i}^* \le 0 \end{cases} \\
y_{21i} &= \begin{cases} 1 & \text{if } y_{21i}^* > 0 \\ 0 & \text{if } y_{21i}^* \le 0 \end{cases} \\
y_{22i} &= \begin{cases} y_{22i}^* & \text{if } y_{22i}^* > 0 \\ 0 & \text{if } y_{22i}^* \le 0 \end{cases} \\
y_{23i} &= y_{23i}^* \quad \text{if } y_{23i}^* > 0
\end{aligned}$$

where $z_1, \ldots, z_4$ are the instrumental variables that are independent of the errors. You can estimate this model by using the following statements:

proc qlim data=a;
   model y1  = y21 y22 y23 / discrete;
   model y21 = z1 z2 z3 z4 / discrete;
   model y22 = z1 z2 z3 z4 / censored(lb=0);
   model y23 = z1 z2 z3 z4 / truncated(lb=0);
run;

Note that the dependent variable $y_1$ should not occur in the models for the endogenous explanatory variables, because this causes inconsistent coefficient estimates. In other words, you should write the models for the endogenous explanatory variables as reduced form models. PROC QLIM does not handle simultaneous equations models.

Endogenous Dummy Variable Models—Treatment Effects Regression

Often, the effect of participation in a treatment on a particular outcome is the main focus. For example, you might be interested in explaining the effect of attending college on individuals' earnings. A model of only the earnings that includes an indicator for college attendance as an explanatory variable ignores the possible endogeneity of the indicator variable. Most likely, the factors that motivate an individual to get a college degree also affect that individual's earnings. In this case, you can estimate the earnings equation consistently by modeling it along with the probit equation for college attendance. This can be formalized as

$$\begin{aligned}
y_i &= \alpha z_i + \mathbf{x}_i'\boldsymbol{\beta} + \epsilon_i \\
z_i^* &= \mathbf{w}_i'\boldsymbol{\gamma} + u_i \\
z_i &= \begin{cases} 1 & \text{if } z_i^* > 0 \\ 0 & \text{if } z_i^* \le 0 \end{cases}
\end{aligned}$$

where $u_i$ and $\epsilon_i$ are correlated. In the preceding formulation, earnings is represented by $y_i$ and the college degree indicator by $z_i$. The parameters of interest are $\alpha$ and $\boldsymbol{\beta}$. Note that modeling $y_i$ along with the probit model for $z_i$ is very similar to the Heckman selection model that is covered in the section Selection Models. This model is a variant of the selection model, known as a treatment effects model. The difference is that $z$ itself appears in the equation of interest.

You can estimate this model in the QLIM procedure as follows:

proc qlim data=a;
   model y = z x1 x2;
   model z = x1 x2 x3 x4 / discrete;
run;

In these statements, $\mathbf{x}'$ is specified as x1 x2 in the first MODEL statement and $\mathbf{w}'$ is specified as x1 x2 x3 x4 in the second MODEL statement. The estimation is done using the entire sample.

Test for Endogeneity

PROC QLIM provides two ways to test the null hypothesis that an endogenous explanatory variable (EEV) is in fact exogenous. In the case of a single EEV, the first testing method involves a likelihood ratio test of $H_0\colon \rho = 0$, where $\rho$ denotes the correlation between the error terms, which the TEST statement refers to as _rho. For example, consider the probit model with a binary endogenous explanatory variable that was considered earlier; $y_2$ is exogenous if the error term in the model for $y_1^*$ is uncorrelated with the error term in the model for $y_2^*$. Therefore, testing whether this correlation is 0 provides an endogeneity test for $y_2$. You can do this in PROC QLIM as follows:

proc qlim data=a;
   model y1 = y2 z1 z2 / discrete;
   model y2 = z1 z2 z3 z4 / discrete;
   test  _rho = 0 / LR;
run;

Failing to reject the null hypothesis favors the decision that $y_2$ is exogenous in the model for $y_1$.

When there are two or more EEVs, the test becomes the joint likelihood ratio test of whether corresponding correlations are 0 or not.
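
For reference, the likelihood ratio statistic behind this kind of test has the standard form shown below. The symbols $\hat{\ell}_U$, $\hat{\ell}_R$, and $q$ are notation introduced here for illustration: $\hat{\ell}_U$ is the maximized log likelihood of the unrestricted joint model, $\hat{\ell}_R$ is the maximized log likelihood with the tested correlations fixed at 0, and $q$ is the number of correlations that are restricted.

```latex
% Likelihood ratio statistic and its usual large-sample reference distribution under H_0
\[
LR = 2\,\bigl(\hat{\ell}_U - \hat{\ell}_R\bigr)
\;\overset{a}{\sim}\; \chi^2_q
\quad \text{under } H_0
\]
% With a single EEV, one correlation is restricted, so q = 1.
```

A large value of the statistic relative to the $\chi^2_q$ reference distribution is evidence against exogeneity.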

The second testing method is similar to the approach of Rivers and Vuong (1988). Considering the same model, you can write

$$u_i = \theta \epsilon_i + e_i$$

where $\theta = \eta / \sigma_\epsilon^2$ and $e$ is independent of the $z$s and $\epsilon$. You can now write

$$y_{1i}^* = \alpha_1 y_{2i} + \beta_1 z_{1i} + \beta_2 z_{2i} + \theta \epsilon_i + e_i$$

Testing $H_0\colon \theta = 0$ is the same as testing whether $u_i$ is correlated with $\epsilon_i$ or testing whether $y_{2i}$ is endogenous or not. Because the $\epsilon_i$ are unobserved, you can replace them with the OLS residuals from the model for $y_{2i}^*$ and apply a robust $t$ test. Note that even though $y_{2i}$ is binary (or censored), the test is still correct under $H_0$.

This approach can be summarized as a two-step procedure. In the first step, generated regressors—that is, the OLS residuals from the models for each of the EEVs—are obtained. In the second step, the structural model that includes the generated regressors as additional explanatory variables is estimated by the maximum likelihood method and the joint significance of these generated regressors is tested by the Wald test.
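
The two steps can also be sketched by hand, which might help clarify what the test does. The following statements are only an illustration under stated assumptions: the output data set name b and the residual variable name v2hat are hypothetical, and the standard errors from this manual second step are the usual maximum likelihood ones rather than robust ones, so the result is not guaranteed to match what PROC QLIM reports for its built-in test.

```sas
/* Step 1: OLS residuals from the reduced form model for the EEV y2.   */
/* The names b and v2hat are hypothetical choices for this sketch.     */
proc reg data=a noprint;
   model y2 = z1 z2 z3 z4;
   output out=b r=v2hat;
run;
quit;

/* Step 2: probit structural model with the generated regressor added. */
/* The t statistic reported for v2hat addresses H0: theta = 0.         */
proc qlim data=b;
   model y1 = y2 z1 z2 v2hat / discrete;
run;
```

An insignificant coefficient on v2hat is consistent with treating y2 as exogenous in the model for y1.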

In PROC QLIM, you can apply the second method for the same test that was considered previously as follows:

proc qlim data=a;
   model y1 = y2 z1 z2 / discrete endotest(y2);
   model y2 = z1 z2 z3 z4 / discrete;
run;

Overidentification Test

In PROC QLIM you can test the validity of instrumental variables (IVs) by specifying the OVERID option in the ENDOGENOUS or MODEL statement. The OVERID test is a maximum likelihood version of the overidentifying restrictions test in the IV framework. If you have more IVs than are necessary for identification (that is, overidentifying IVs), you can use them to test the validity of your IVs. When you use the OVERID option to specify the overidentifying IVs, it applies the likelihood ratio test of the joint significance of these IVs, included as additional explanatory variables in the structural model that it estimates by the MLE jointly with the reduced form models. In effect, you test whether the overidentifying IVs are correlated with the error term in the structural model. You specify the reduced form models through the overidentifying IVs. The structural model is the model that includes the OVERID option. For example, consider the probit model that contains a continuous endogenous explanatory variable. You can consider $z_3$ or $z_4$ in the model for $y_2$ as an overidentifying IV; therefore, you can specify the OVERID test as follows:

proc qlim data=a;
   model y1 = y2 z1 z2 / discrete overid(y2.z4);
   model y2 = z1 z2 z3 z4;
run;

In this case, PROC QLIM estimates the structural model for $y_1$, including the overidentifying IV $z_4$ as an additional explanatory variable in this model, jointly with the reduced form model for $y_2$. Then it uses the likelihood ratio test to test the hypothesis that the overidentifying IV is insignificant. Rejecting this hypothesis raises doubts about the validity of the instruments $z_3$ and $z_4$.

Note that, as long as you have continuous endogenous explanatory variables, the test result is invariant to which overidentifying IVs you specify in the test.

Last updated: June 19, 2025