The analysis of correlated data that arise from repeated measurements has been studied extensively under the assumption that the measurements are multivariate normal. However, the normality assumption is not always reasonable; for example, when the responses are discrete and correlated, you must use a different methodology. Generalized estimating equations (GEEs) provide a practical method with reasonable statistical efficiency to analyze such data.
Liang and Zeger (1986) introduced GEEs as a method of dealing with correlated data when, except for the correlation among responses, the data can be modeled as a generalized linear model. For example, in many cases you can model correlated binary and count data in this way. For more information about GEEs, see Hardin and Hilbe (2003), Diggle, Liang, and Zeger (1994), and Lipsitz et al. (1994).
Let $y_{ij}$, $j = 1, \ldots, n_i$, $i = 1, \ldots, K$, represent the jth measurement on the ith subject. There are $n_i$ measurements on subject i and $\sum_{i=1}^{K} f_i\, n_i$ total measurements, where $f_i$ is the cluster frequency that you specify in the freq parameter. If you omit the freq parameter, then $f_i = 1$ for all observations. The frequencies must be the same for all observations within each cluster.
You model correlated data by using the same link function and linear predictor specification that you use to model independent data. You use the same variance functions as in the independent case, but you must also specify a structure for the correlation between the repeated measurements. In the following material, $v(\mu)$ denotes the variance function for the specified distribution.
Let the vector of measurements on the ith subject be $Y_i = (y_{i1}, \ldots, y_{i n_i})'$, where the corresponding vector of means is $\mu_i = (\mu_{i1}, \ldots, \mu_{i n_i})'$, and let $V_i$ be the covariance matrix of $Y_i$. Let the vector of independent, or explanatory, variables for the jth measurement on the ith subject be $x_{ij} = (x_{ij1}, \ldots, x_{ijp})'$. The generalized estimating equation of Liang and Zeger (1986) for estimating the $p \times 1$ vector of regression parameters $\beta$ is an extension of the independence estimating equation to correlated data and is given by

$$S(\beta) = \sum_{i=1}^{K} \frac{\partial \mu_i'}{\partial \beta}\, V_i^{-1} \left( Y_i - \mu_i(\beta) \right) = 0$$

The relationship between the linear predictor and the mean is specified by

$$g(\mu_{ij}) = x_{ij}' \beta$$

where g is the link function, and so the $p \times n_i$ matrix of partial derivatives of the mean with respect to the regression parameters for the ith subject is given by

$$\frac{\partial \mu_i'}{\partial \beta} = \begin{bmatrix} \dfrac{x_{i11}}{g'(\mu_{i1})} & \cdots & \dfrac{x_{i n_i 1}}{g'(\mu_{i n_i})} \\ \vdots & & \vdots \\ \dfrac{x_{i1p}}{g'(\mu_{i1})} & \cdots & \dfrac{x_{i n_i p}}{g'(\mu_{i n_i})} \end{bmatrix}$$
Let $R_i(\alpha)$ be an $n_i \times n_i$ "working" correlation matrix that is fully specified by the vector of parameters $\alpha$. The covariance matrix of $Y_i$ is modeled as

$$V_i = \phi\, A_i^{1/2}\, W_i^{-1/2}\, R_i(\alpha)\, W_i^{-1/2}\, A_i^{1/2}$$

where $A_i$ is an $n_i \times n_i$ diagonal matrix whose jth diagonal element is $v(\mu_{ij})$, and $W_i$ is an $n_i \times n_i$ diagonal matrix whose jth diagonal element is $w_{ij}$, where $w_{ij}$ is a weight that you specify in the weight parameter. If you omit the weight parameter, then $w_{ij} = 1$ for all i and j. If $R_i(\alpha)$ is the true correlation matrix of $Y_i$, then $V_i$ is the true covariance matrix of $Y_i$.
The working correlation matrix is usually unknown and must be estimated. You estimate it in the iterative fitting process by using the current value of the parameter vector $\beta$ to compute appropriate functions of the Pearson residual

$$e_{ij} = \frac{y_{ij} - \mu_{ij}}{\sqrt{v(\mu_{ij}) / w_{ij}}}$$

If you specify the working correlation as $R_0 = I$, which is the identity matrix, the GEE reduces to the independence estimating equation.
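For the binomial variance function $v(\mu) = \mu(1 - \mu)$ that applies to logistic models, the Pearson residual can be computed as in the following minimal NumPy sketch (the function name and data are illustrative only):

```python
import numpy as np

def pearson_residuals(y, mu, w=None):
    """Pearson residuals e_ij = (y_ij - mu_ij) / sqrt(v(mu_ij) / w_ij)
    for the binomial variance function v(mu) = mu * (1 - mu)."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    w = np.ones_like(y) if w is None else np.asarray(w, dtype=float)
    v = mu * (1.0 - mu)                    # binomial variance function
    return (y - mu) / np.sqrt(v / w)

# Three binary responses with fitted means 0.5, 0.25, and 0.8
e = pearson_residuals([1, 0, 1], [0.5, 0.25, 0.8])
```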
Table 6 shows the working correlation types that the logistic action supports and the estimators that are used to estimate the working correlations.
Table 6: Working Correlation Structures and Correlation Parameter Estimators
The following algorithm fits the specified model by using GEEs. Note that this is not in general a likelihood-based method of estimation; inferences based on likelihoods are not possible for GEE methods.

1. Compute an initial estimate of $\beta$ by using an ordinary generalized linear model that assumes independence.
2. Compute the working correlation matrix $R$ from the Pearson residuals, the current value of $\beta$, and the assumed structure of $R$.
3. Compute an estimate of the covariance, $V_i = \phi\, A_i^{1/2} W_i^{-1/2} R(\hat{\alpha}) W_i^{-1/2} A_i^{1/2}$.
4. Update the parameter estimates: $\beta_{r+1} = \beta_r + \left[ \sum_{i=1}^{K} \frac{\partial \mu_i'}{\partial \beta} V_i^{-1} \frac{\partial \mu_i}{\partial \beta} \right]^{-1} \left[ \sum_{i=1}^{K} \frac{\partial \mu_i'}{\partial \beta} V_i^{-1} (Y_i - \mu_i) \right]$
5. Repeat steps 2 through 4 until convergence.
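One common GEE fitting loop (an independence fit followed by alternating moment estimation of the correlation and Fisher-scoring updates of $\beta$) can be sketched in NumPy. This is an illustrative sketch only, assuming a logit link, an exchangeable working correlation, $\phi = 1$, and a simple moment average for $\alpha$; it is not the action's exact algorithm:

```python
import numpy as np

def fit_gee_logistic(X_list, y_list, n_iter=25):
    """Sketch of a logistic GEE fit with an exchangeable working
    correlation. X_list[i] is the (n_i x p) design matrix for cluster i,
    y_list[i] holds the binary responses. phi is fixed at 1 and no
    convergence test is performed."""
    p = X_list[0].shape[1]
    n_max = max(len(y) for y in y_list)
    lo = -0.9 / max(n_max - 1, 1)          # keep R positive definite
    beta = np.zeros(p)                     # independence starting point
    alpha = 0.0
    for _ in range(n_iter):
        # Estimate the common correlation from Pearson residuals
        # (a simple moment average over all within-cluster pairs).
        num, cnt = 0.0, 0
        for X, y in zip(X_list, y_list):
            mu = 1.0 / (1.0 + np.exp(-X @ beta))
            e = (y - mu) / np.sqrt(mu * (1.0 - mu))
            for j in range(len(e)):
                for k in range(j + 1, len(e)):
                    num += e[j] * e[k]
                    cnt += 1
        alpha = float(np.clip(num / max(cnt, 1), lo, 0.95))
        # Fisher-scoring update with V_i = A_i^{1/2} R A_i^{1/2}
        I0 = np.zeros((p, p))
        score = np.zeros(p)
        for X, y in zip(X_list, y_list):
            mu = 1.0 / (1.0 + np.exp(-X @ beta))
            a = mu * (1.0 - mu)            # binomial variance function
            n_i = len(y)
            R = (1.0 - alpha) * np.eye(n_i) + alpha * np.ones((n_i, n_i))
            V = np.sqrt(a)[:, None] * R * np.sqrt(a)[None, :]
            D = a[:, None] * X             # dmu_i/dbeta' for the logit link
            Vinv = np.linalg.inv(V)
            I0 += D.T @ Vinv @ D
            score += D.T @ Vinv @ (y - mu)
        beta = beta + np.linalg.solve(I0, score)
    return beta, alpha
```

The clipping of $\alpha$ is a safeguard so that the exchangeable matrix $R = (1-\alpha)I + \alpha J$ stays invertible; a production implementation would instead use the structure-specific moment estimator and a convergence criterion.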
See Diggle, Liang, and Zeger (1994, Chapter 11) for a discussion of missing values in longitudinal data. Suppose you intend to take measurements $y_{i1}, \ldots, y_{i n_i}$ for the ith unit. Missing values for which $y_{ij}$ is missing whenever $y_{ik}$ is missing for all $k \geq j$ are called dropouts. Otherwise, missing values that are intermixed with nonmissing values are called intermittent missing values. The logistic action can estimate the working correlation from data that contain both types of missing values by using the all available pairs method, in which all nonmissing pairs of data are used in the moment estimators of the working correlation parameters that are defined previously. The resulting covariances and standard errors are valid under the missing completely at random (MCAR) assumption.
For example, for the unstructured working correlation model,

$$\hat{\alpha}_{jk} = \frac{1}{(K' - p)\,\hat{\phi}} \sum_{i} e_{ij} e_{ik}$$

where the sum is over the units that have nonmissing measurements at times j and k, and $K'$ is the number of units that have nonmissing measurements at j and k. Estimates of the parameters for other working correlation types are computed in a similar manner, using available nonmissing pairs in the appropriate moment estimators.
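A sketch of the all available pairs computation for the unstructured structure, assuming residuals are arranged in a clusters-by-times array with NaN marking missing measurements (for brevity, the scale factor $(K' - p)\hat{\phi}$ is replaced by the plain pair count):

```python
import numpy as np

def unstructured_corr(E):
    """All available pairs moment estimate of an unstructured working
    correlation. E is a (K x T) array of Pearson residuals, one row per
    cluster, with NaN for missing measurements. Each off-diagonal entry
    averages e_ij * e_ik over the clusters observed at both j and k."""
    K, T = E.shape
    R = np.eye(T)
    for j in range(T):
        for k in range(j + 1, T):
            ok = ~np.isnan(E[:, j]) & ~np.isnan(E[:, k])
            if ok.any():                   # some nonmissing pairs exist
                R[j, k] = R[k, j] = float(np.mean(E[ok, j] * E[ok, k]))
    return R

E = np.array([[ 1.0,  1.0, np.nan],     # cluster 1: dropout at time 3
              [-1.0, -1.0,  1.0],       # cluster 2: complete
              [ 1.0, np.nan, 1.0]])     # cluster 3: intermittent missing
R = unstructured_corr(E)
```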
The contribution of the ith unit to the parameter update equation is computed by omitting the elements of $(Y_i - \mu_i)$, the columns of $\partial \mu_i' / \partial \beta$, and the rows and columns of $V_i$ that correspond to missing measurements.
The model-based estimator of $\mathrm{Cov}(\hat{\beta})$ is given by

$$\Sigma_m(\hat{\beta}) = I_0^{-1}$$

where

$$I_0 = \sum_{i=1}^{K} \frac{\partial \mu_i'}{\partial \beta}\, V_i^{-1}\, \frac{\partial \mu_i}{\partial \beta}$$

This is a consistent estimator of the covariance matrix of $\hat{\beta}$ if the mean model and the working correlation matrix are correctly specified.
The estimator

$$\Sigma_e = I_0^{-1}\, I_1\, I_0^{-1}$$

is called the empirical, or robust, estimator of the covariance matrix of $\hat{\beta}$, where

$$I_1 = \sum_{i=1}^{K} \frac{\partial \mu_i'}{\partial \beta}\, V_i^{-1}\, \mathrm{Cov}(Y_i)\, V_i^{-1}\, \frac{\partial \mu_i}{\partial \beta}$$

It has the property of being a consistent estimator of the covariance matrix of $\hat{\beta}$ even if the working correlation matrix is misspecified, that is, if $\mathrm{Cov}(Y_i) \neq V_i$. For further information about the robust variance estimate, see Zeger, Liang, and Albert (1988); Royall (1986); and White (1982). In computing $\Sigma_e$, the parameters $\beta$ and $\alpha$ are replaced by estimates, and $\mathrm{Cov}(Y_i)$ is replaced by the estimate

$$(Y_i - \mu_i(\hat{\beta}))(Y_i - \mu_i(\hat{\beta}))'$$
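The two covariance estimators can be assembled directly from the cluster-level quantities, as in this illustrative NumPy sketch (inputs are assumed to be evaluated at the final parameter estimates):

```python
import numpy as np

def gee_covariances(D_list, V_list, r_list):
    """Model-based and empirical (sandwich) covariance estimates.
    D_list[i]: (n_i x p) derivative matrix dmu_i/dbeta',
    V_list[i]: (n_i x n_i) working covariance matrix,
    r_list[i]: residual vector Y_i - mu_i at the estimates.
    Returns Sigma_m = I0^{-1} and Sigma_e = I0^{-1} I1 I0^{-1}."""
    p = D_list[0].shape[1]
    I0 = np.zeros((p, p))
    I1 = np.zeros((p, p))
    for D, V, r in zip(D_list, V_list, r_list):
        Vinv = np.linalg.inv(V)
        I0 += D.T @ Vinv @ D
        # Cov(Y_i) replaced by the outer product of the residuals
        I1 += D.T @ Vinv @ np.outer(r, r) @ Vinv @ D
    sigma_m = np.linalg.inv(I0)
    sigma_e = sigma_m @ I1 @ sigma_m
    return sigma_m, sigma_e
```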
The quasi-likelihood information criterion (QIC) was developed by Pan (2001) as a modification of the Akaike information criterion (AIC) to apply to models that are fit by GEEs.
Define the quasi-likelihood under the independence working correlation assumption, evaluated by using the parameter estimates under the working correlation R, as

$$Q(\hat{\beta}(R), \hat{\phi}) = \sum_{i=1}^{K} \sum_{j=1}^{n_i} Q(\hat{\beta}(R), \hat{\phi}; (y_{ij}, x_{ij}))$$

where the quasi-likelihood contribution of the jth observation in the ith cluster is defined in the section Quasi-likelihood Functions, and $\hat{\beta}(R)$ are the parameter estimates obtained from GEEs that use the working correlation R.

QIC is defined as

$$\mathrm{QIC}(R) = -2\,Q(\hat{\beta}(R), \hat{\phi}) + 2\,\mathrm{trace}\!\left(\hat{\Omega}_I \hat{V}_R\right)$$

where $\hat{V}_R$ is the robust covariance estimate and $\hat{\Omega}_I$ is the inverse of the model-based covariance estimate under the independence working correlation assumption, both evaluated at $\hat{\beta}(R)$, the parameter estimates obtained from GEEs that use the working correlation R.
The logistic action also computes an approximation to $\mathrm{QIC}(R)$, which Pan (2001) defines as

$$\mathrm{QIC}_u(R) = -2\,Q(\hat{\beta}(R), \hat{\phi}) + 2p$$

where p is the number of regression parameters.
Pan (2001) notes that QIC is appropriate for selecting regression models and working correlations, whereas $\mathrm{QIC}_u$ is appropriate only for selecting regression models.
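Given the quantities defined above, both criteria reduce to a few lines. The following sketch assumes you already have the quasi-likelihood value, the model-based covariance under independence, and the robust covariance estimate:

```python
import numpy as np

def qic_criteria(quasi_ll, model_cov_indep, robust_cov, p):
    """QIC(R) = -2 Q + 2 trace(Omega_I V_R) and QIC_u(R) = -2 Q + 2 p,
    after Pan (2001). Omega_I is the inverse of the model-based
    covariance estimate under the independence working correlation."""
    omega_I = np.linalg.inv(model_cov_indep)
    qic = -2.0 * quasi_ll + 2.0 * np.trace(omega_I @ robust_cov)
    qic_u = -2.0 * quasi_ll + 2.0 * p
    return qic, qic_u
```

When the working correlation is correctly specified, $\mathrm{trace}(\hat{\Omega}_I \hat{V}_R)$ is close to p, so the two criteria roughly agree.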
For detailed discussions of quasi-likelihood functions, see McCullagh and Nelder (1989) and Hardin and Hilbe (2003). The contribution of observation j in cluster i to the quasi-likelihood function, evaluated at the regression parameters $\beta$, is given by $Q(\beta, \phi; (y_{ij}, x_{ij})) = Q_{ij}/\phi$, where for the binomial (binary) responses that the logistic action models,

$$Q_{ij} = w_{ij} \left[ y_{ij} \log(\mu_{ij}) + (1 - y_{ij}) \log(1 - \mu_{ij}) \right]$$

These quantities are used in the computation of the quasi-likelihood information criterion (QIC) for the goodness of fit of models that are fit by using GEEs. The $w_{ij}$ are prior weights, if any, that you specify in the weight or freq parameter.
GEE analysis is not supported when you perform model selection or when you perform a partitioned analysis.
If you use multiple variables to define the within-subject effect, the order in which you specify the variables within this effect affects the analysis if the working correlation type depends on the ordering of the repeated measures. The autoregressive, m-dependent, and unstructured working correlation types use the levelization of the within-subject effect to locate observations within the correlation matrix, and they depend on the order that the levelization produces.
In particular, if you define the within-subject effect by using the interaction of two or more variables, then the hierarchy of levelization and the resulting ordering of the repeated measures will depend on the order in which you list the variables in the effects subparameter of the repeated parameter.
For example, assume that you are using the interaction of the variables w1 and w2 as the within-subject effect in a repeated measures analysis. Furthermore, assume that w1 takes the values 'A' and 'B' and w2 takes the values 'L' and 'H'. The following statements list these variables in the order w2, w1 in the effects subparameter of the repeated parameter:
```
regression.logistic
   table={name='health'},
   class={{vars={'subject', 'w1', 'w2'}}},
   model={depVars={{name='outcome', options={event='1'}}},
          effects={{vars={'age', 'smoke'}}}},
   repeated={{depVars={{name='outcome', options={event='1'}}},
              effects={{vars={'w2', 'w1'}, interaction='CROSS'}},
              type='REPEATED',
              subject={{vars={'subject'}}},
              corrtype='MDEP', mdepm=2}};
```
In this case, the order of the variables in the effects subparameter of the repeated parameter results in a levelization in which the value of w1 changes faster than the value of w2. Table 7 shows this levelization.
Table 7: Levelization Order When Variable Order Is w2, w1
| Value of w1 | Value of w2 | Levelization Order |
|---|---|---|
| A | L | 0 |
| B | L | 1 |
| A | H | 2 |
| B | H | 3 |
If you reverse the order of the variables in the effects subparameter of the repeated parameter, the levelization has a different hierarchy. In this hierarchy, the value of w2 changes faster than the value of w1. Table 8 shows that the levelization order that is assigned to each unique combination of the variables does not match the levelization order shown in Table 7.
Table 8: Levelization Order When Variable Order Is w1, w2
| Value of w1 | Value of w2 | Levelization Order |
|---|---|---|
| A | L | 0 |
| B | L | 2 |
| A | H | 1 |
| B | H | 3 |
Because the specified working correlation structure (m-dependent) relies on this ordering, the parameter estimates differ from those that are produced by the ordering shown in Table 7. The autoregressive and unstructured working correlation structures also result in different parameter estimates for the different orderings.
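The two hierarchies amount to a nested loop in which the last-listed variable varies fastest. This illustrative Python sketch reproduces the orderings in Table 7 and Table 8:

```python
from itertools import product

# effects vars listed as w2, w1: w1 (listed last) changes fastest
order_w2_w1 = list(product(['L', 'H'], ['A', 'B']))   # (w2, w1) pairs
# effects vars listed as w1, w2: w2 changes fastest instead
order_w1_w2 = list(product(['A', 'B'], ['L', 'H']))   # (w1, w2) pairs
# The index of each combination is its levelization order (0 through 3)
```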
If you use two variables to define the within-subject effect, you can use a nested effect to enforce a specific levelization hierarchy. However, the best practice is to use a single variable in the within-subject effect.
You can add group and order information to a table by using the partition action in the table action set. If you want to use the group and order information in a table to achieve a reproducible processing order and you also want to do a repeated measures analysis, the group information must follow specific rules. In particular, the variables that you specify in the groupBy parameter of the partition action must match the variables that you specify in the vars sub-subparameter of the subject subparameter of the repeated parameter.
There are some variations to this requirement, depending on which type of subject effect you specify and which other options you use. If you specify a single-variable subject effect or an interaction subject effect, you must use all these variables in the groupBy parameter of the partition action. If you specify a nested subject effect, you must include the nested variable in the groupBy parameter. You can also include the nesting variables in the groupBy parameter, but you cannot include any other variables in the groupBy parameter. However, if you specify the multipass parameter of the logistic action, you must include the nested variable and all nesting variables in the groupBy parameter. You cannot include any other variables in the groupBy parameter.
If you use an order-sensitive working correlation structure, you do not need to use the variables that form the within-subject effect in the orderBy parameter of the partition action. These sets of variables can overlap, but it is not required that the preexisting table order correspond to the ordering for the within-subject effect.