This section applies to actions in the following action sets: phreg, quantreg, regression, and varReduce.
The concept of informative missingness is one way to account for missing values in statistical analyses and, in particular, statistical modeling. Missing values can be a problem because they reduce the amount of available data. When you work with classification variables (factors, which are levelized variables), you can treat a missing value as an actual level of the variable and allow it to participate in the analysis.
However, when continuous variables have missing values, the observation is removed from the analysis. In data that have many missing values, removing observations can reduce the amount of available data greatly, and the sets of observations used in one model versus another model can vary based on which variables are included in the model.
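The effect of this listwise deletion can be sketched numerically. The example below uses hypothetical data and pandas (an illustration choice, not part of the action sets described here) to show how the analysis set shrinks, and differs, depending on which variables a model includes:

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in two continuous regressors.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "y":  rng.normal(size=10),
    "x1": [1.0, np.nan, 3.0, 4.0, np.nan, 6.0, 7.0, 8.0, np.nan, 10.0],
    "x2": [np.nan, 2.0, 3.0, np.nan, 5.0, 6.0, np.nan, 8.0, 9.0, 10.0],
})

# Listwise deletion: each model uses a different subset of observations.
n_model_x1   = df[["y", "x1"]].dropna().shape[0]        # 7 observations
n_model_x1x2 = df[["y", "x1", "x2"]].dropna().shape[0]  # 4 observations
print(n_model_x1, n_model_x1x2)
```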
Of course, there are many reasons for missing values, and substituting values for missing values must be done with caution. For example, the famous Framingham Heart Study data set contains 5,209 observations on subjects in a longitudinal study that helped researchers understand the relationship between smoking, cholesterol, and coronary heart disease (CHD). One of the variables in the data set is AgeCHDdiag, which records the age at which a subject was diagnosed with CHD. If you include this variable in a statistical model, only 1,449 observations are available, because the value can be observed only for subjects who have experienced CHD. Including this variable acts as a filter that reduces the analysis set to the subjects who have CHD. You cannot impute the value for subjects where the variable is missing, because you cannot impute an age at which someone who has not had CHD would have contracted the disease.
With informative missingness, you are not so much substituting imputed values for the missing values as you are modeling the missingness. Consider a simple linear regression model:

$y = \beta_0 + \beta_1 x + \epsilon$
Suppose that some of the values for the regressor variable x are missing. The fitted model uses only observations for which y and x have been observed.
In order to predict the outcome y for an observation that has a missing x, either you assume that y is missing or you substitute a value (such as the average value, $\bar{x}$) for the missing x. Because the estimate for the intercept is $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$ in the simple linear regression model, the predicted value would be the average response of the nonmissing values:

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{x} = \bar{y}$
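This prediction-at-the-mean behavior can be checked numerically. The sketch below uses simulated data and NumPy's least-squares solver (both are illustration choices, not part of the action sets described here):

```python
import numpy as np

# Simple linear regression fit by least squares on hypothetical data.
# With an intercept in the model, the fitted line passes through the
# point of means, so predicting at the mean of x recovers the mean of y.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 + 3.0 * x + rng.normal(size=50)

X = np.column_stack([np.ones_like(x), x])
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]

yhat_at_mean = b0 + b1 * x.mean()
print(np.isclose(yhat_at_mean, y.mean()))  # True
```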
With informative missingness, you extend the model by adding extra effects for each effect that contains at least one continuous variable. In the simple linear regression model, you add one column to the model and slightly change the content of the x variable:

$y = \beta_0 + \beta_1 x^{*} + \beta_2 x_{\text{miss}} + \epsilon$

The variable $x^{*}$ contains the original values of x if they are not missing, and the average of x otherwise:

$x^{*} = \begin{cases} x & \text{if } x \text{ is not missing} \\ \bar{x} & \text{otherwise} \end{cases}$

The variable $x_{\text{miss}}$ is a dummy variable whose value is 1 when x is missing, and 0 otherwise:

$x_{\text{miss}} = \begin{cases} 1 & \text{if } x \text{ is missing} \\ 0 & \text{otherwise} \end{cases}$
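A minimal sketch of this recoding for a single regressor, using hypothetical values (the names x_star and x_miss mirror the variables described above):

```python
import numpy as np

# Informative-missing recoding for one continuous regressor.
x = np.array([1.0, 2.0, np.nan, 4.0, np.nan, 6.0])

miss = np.isnan(x)
xbar = x[~miss].mean()            # mean of the nonmissing values (3.25 here)
x_star = np.where(miss, xbar, x)  # original values, with missings set to the mean
x_miss = miss.astype(float)       # 1.0 where x was missing, 0.0 otherwise

print(x_star)  # missing entries replaced by 3.25
print(x_miss)
```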
The fitted model is not the same model that results from substituting $\bar{x}$ for the missing values during training, because the model that simply substitutes $\bar{x}$ for the missing values is

$y = \beta_0 + \beta_1 x^{*} + \epsilon$
The informative missing model has an extra parameter, and unless all values of x_miss are 0 (in which case there are no missing values), the informative missing model has an R-square value at least as high as the substitution model, because the additional column can explain additional variation.
The parameter estimate for $\beta_2$ measures the amount by which the predicted value for an observation with a missing value of x differs from a predicted value at $x = \bar{x}$.
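The two models can be compared on simulated data. The sketch below generates hypothetical data in which missingness is related to the response, fits both designs by least squares, and confirms that the informative missing model's R-square is at least as high (all names and data are illustration choices):

```python
import numpy as np

# Hypothetical data: missingness in x carries information about y.
rng = np.random.default_rng(42)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
miss = rng.random(n) < 0.3       # indicator of "x is missing"
y = y + 1.5 * miss               # response shifts when x is missing

xbar = x[~miss].mean()
x_star = np.where(miss, xbar, x)  # informative-missing recoding of x

def r2(X, y):
    # R-square of a least-squares fit (designs include an intercept,
    # so residuals are mean-zero and this is 1 - SSE/SST).
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return 1.0 - resid.var() / y.var()

X_sub = np.column_stack([np.ones(n), x_star])                      # mean substitution
X_inf = np.column_stack([np.ones(n), x_star, miss.astype(float)])  # adds x_miss

print(r2(X_sub, y) <= r2(X_inf, y))  # True: the nested model cannot fit better
```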