Shared Concepts

Informative Missingness

This section applies to actions in the following action sets: phreg, quantreg, regression, and varReduce.

The concept of informative missingness is one way to account for missing values in statistical analyses and, in particular, statistical modeling. Missing values can be a problem because they reduce the amount of available data. When you work with classification variables (factors, which are levelized variables), you can treat a missing value as an actual level of the variable and allow it to participate in the analysis.

However, when a continuous variable has a missing value, the entire observation is removed from the analysis. In data that have many missing values, removing observations can greatly reduce the amount of available data, and the set of observations used in one model versus another can vary based on which variables are included in each model.

Of course, there are many reasons for missing values, and substituting values for missing values has to be done with caution. For example, the famous Framingham Heart Study data set contains 5,209 observations on subjects in a longitudinal study that helped researchers understand the relationship between smoking, cholesterol, and coronary heart disease. One of the variables in the data set is AgeCHDdiag. This variable represents the age at which a patient was diagnosed with coronary heart disease (CHD). If you include this variable in a statistical model, only 1,449 observations are available, because the value cannot be observed unless a patient has experienced CHD. Including this variable acts as a filter that reduces the analysis set to the subjects who have CHD. You cannot impute the value for subjects for whom the variable is missing, because you cannot impute an age at which someone who has not had CHD would have contracted coronary heart disease.

With informative missingness, you are not so much substituting imputed values for the missing values as you are modeling the missingness. Consider a simple linear regression model:

y = β₀ + β₁x + ε

Suppose that some of the values for the regressor variable x are missing. The fitted model uses only observations for which y and x have been observed.

In order to predict the outcome y for an observation that has a missing x, either you treat the prediction as missing or you substitute a value (such as the average value, x̄) for the missing x. Because the simple linear regression model contains an intercept, the fitted line passes through (x̄, ȳ), so the predicted value at x̄ is the average response of the nonmissing values, ȳ.
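This identity of least squares fits with an intercept can be checked numerically. The following is an illustrative Python sketch using NumPy on synthetic data (it is not part of the SAS software being documented):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 + 3.0 * x + rng.normal(size=50)

# Fit y = b0 + b1*x by ordinary least squares.
X = np.column_stack([np.ones_like(x), x])
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]

# An OLS line with an intercept passes through (x-bar, y-bar),
# so predicting at the mean of x recovers the mean of y.
pred_at_mean = b0 + b1 * x.mean()
assert np.isclose(pred_at_mean, y.mean())
```

This is why mean substitution for a missing x yields the average response ȳ as the prediction.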

With informative missingness, you extend the model by adding extra effects for each effect that contains at least one continuous variable. In the simple linear regression model, you add one column to the model and slightly change the content of the x variable:

y = β₀ + β₁x* + β₂x_miss + ε₁

The variable x* contains the original values of x if they are not missing, and the average of x otherwise:

x* = x    if x is not missing
   = x̄    otherwise

The variable x_miss is a dummy variable whose value is 1 when x is missing, and 0 otherwise:

x_miss = 1    if x is missing
       = 0    otherwise
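The construction of these two columns can be sketched in a few lines of Python with NumPy. This is an illustration of the encoding, not the implementation used by the action sets; the names x_star and x_miss are chosen to match the notation above:

```python
import numpy as np

x = np.array([1.2, np.nan, 3.4, np.nan, 5.0])

miss = np.isnan(x)                 # indicator of missingness
x_bar = x[~miss].mean()            # average of the observed values

x_star = np.where(miss, x_bar, x)  # x*: observed value, or x-bar if missing
x_miss = miss.astype(float)        # dummy: 1 if x is missing, 0 otherwise
```

Both columns then enter the design matrix in place of the original x column.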

The fitted model is not the same model that results from substituting x̄ for the missing values during training, because the model that simply substitutes x̄ for the missing values is

y = β₀ + β₁x* + ε₂

The informative missing model has an extra parameter, and unless all values of x_miss are 0 (in which case there are no missing values), the informative missing model has a higher R-square value, because the extra column picks up additional variation.

The parameter estimate for β₂ measures the amount by which the predicted value for an observation with a missing x differs from the predicted value at x̄.
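The comparison between the two models can be sketched with ordinary least squares on synthetic data. This Python/NumPy sketch is only an illustration of the nesting argument (the mean-substitution design is a restriction of the informative-missing design with β₂ = 0), so the informative-missing fit can never have a lower R-square:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x_full = rng.normal(size=n)
y = 1.0 + 2.0 * x_full + rng.normal(scale=0.5, size=n)

# Knock out roughly 30% of the x values at random.
x = x_full.copy()
x[rng.random(n) < 0.3] = np.nan

miss = np.isnan(x)
x_bar = np.nanmean(x)
x_star = np.where(miss, x_bar, x)

def r_square(X, y):
    """R-square of an OLS fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    tss = (y - y.mean()) @ (y - y.mean())
    return 1.0 - (resid @ resid) / tss

# Mean substitution only: y = b0 + b1*x*
X_sub = np.column_stack([np.ones(n), x_star])
# Informative missingness:  y = b0 + b1*x* + b2*x_miss
X_im = np.column_stack([np.ones(n), x_star, miss.astype(float)])

# The nested model cannot fit better than the extended model.
assert r_square(X_im, y) >= r_square(X_sub, y) - 1e-12
```

Because the missingness here is generated completely at random, the estimate of β₂ is small; when missingness is related to the response, the x_miss column carries real information and the gap in R-square widens.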

Last updated: March 05, 2026