SAS Macros and Functions

BOXCOXAR Macro

The %BOXCOXAR macro finds the optimal Box-Cox transformation for a time series.

Transformations of the dependent variable are a useful way of dealing with nonlinear relationships or heteroscedasticity. For example, the logarithmic transformation is often used for modeling and forecasting time series that show exponential growth or that show variability proportional to the level of the series.

The Box-Cox transformation is a general class of power transformations that include the log transformation and no transformation as special cases. The Box-Cox transformation is

\[
Y_t =
\begin{cases}
\dfrac{(X_t + c)^{\lambda} - 1}{\lambda} & \text{for } \lambda \neq 0 \\[1ex]
\ln(X_t + c) & \text{for } \lambda = 0
\end{cases}
\]

The parameter $\lambda$ controls the shape of the transformation. For example, $\lambda = 0$ produces a log transformation, while $\lambda = 0.5$ results in a square root transformation. When $\lambda = 1$, the transformed series differs from the original series by $c - 1$.

The constant $c$ is optional. It can be used when some $X_t$ values are negative or 0. You choose $c$ so that the series $X_t$ is always greater than $-c$.
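As an illustration only, the transformation can be sketched in a few lines of Python (not SAS, and not the macro's own code). The function name `box_cox` and the sample values are hypothetical, and the sketch assumes that every shifted value is positive.

```python
import math

def box_cox(x, lam, c=0.0):
    """Box-Cox transform of a single value x with power lam and shift c."""
    shifted = x + c
    if shifted <= 0:
        raise ValueError("x + c must be positive; choose a larger c")
    if lam == 0:
        return math.log(shifted)           # log transformation
    return (shifted ** lam - 1.0) / lam    # general power transformation

series = [1.0, 4.0, 9.0, 16.0]             # hypothetical data

log_scale = [box_cox(x, 0.0) for x in series]          # natural logs
sqrt_scale = [box_cox(x, 0.5) for x in series]         # 2 * (sqrt(x) - 1)
shifted = [box_cox(x, 1.0, c=3.0) for x in series]     # x + c - 1
```

With a power of 1 and a shift of 3, each value moves by 2, which matches the statement that the transformed series differs from the original by the shift minus 1.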

The %BOXCOXAR macro tries a range of $\lambda$ values and reports which of the values tried produces the optimal Box-Cox transformation. To evaluate different $\lambda$ values, the %BOXCOXAR macro transforms the series with each $\lambda$ value and fits an autoregressive model to the transformed series. It is assumed that this autoregressive model is a reasonably good approximation to the true time series model appropriate for the transformed series. The likelihood of the data under each autoregressive model is computed, and the $\lambda$ value that produces the maximum likelihood over the values tried is reported as the optimal Box-Cox transformation for the series.

The %BOXCOXAR macro prints and optionally writes to a SAS data set all of the $\lambda$ values tried, the corresponding log-likelihood values, and related statistics for the autoregressive model.

You can control the range and number of $\lambda$ values tried. You can also control the order of the autoregressive models fit to the transformed series. You can difference the transformed series before the autoregressive model is fit.

Note that the Box-Cox transformation might be appropriate when the data have a common distribution (apart from heteroscedasticity) but not when groups of observations for the variable are quite different. Thus the %BOXCOXAR macro is more often appropriate for time series data than for cross-sectional data.

Syntax

The form of the %BOXCOXAR macro is

  %BOXCOXAR ( SAS-data-set, variable < , options > );

The first argument, SAS-data-set, specifies the name of the SAS data set that contains the time series to be analyzed. The second argument, variable, specifies the time series variable name to be analyzed. The first two arguments are required.

The following options can be used with the %BOXCOXAR macro. Options must follow the required arguments and are separated by commas.

AR=n

specifies the order of the autoregressive model fit to the transformed series. The default is AR=5.

CONST=value

specifies a constant c to be added to the series before transformation. Use the CONST= option when some values of the series are 0 or negative. The default is CONST=0.

DIF=( differencing-list )

specifies the degrees of differencing to apply to the transformed series before the autoregressive model is fit. The differencing-list is a list of positive integers separated by commas and enclosed in parentheses. For example, DIF=(1,12) specifies that the transformed series be differenced once at lag 1 and once at lag 12. For more information, see the section IDENTIFY Statement in Chapter 7, ARIMA Procedure.

LAMBDAHI=value

specifies the maximum value of lambda for the grid search. The default is LAMBDAHI=1. A large (in magnitude) LAMBDAHI= value can result in problems with floating point arithmetic.

LAMBDALO=value

specifies the minimum value of lambda for the grid search. The default is LAMBDALO=0. A large (in magnitude) LAMBDALO= value can result in problems with floating point arithmetic.

NLAMBDA=value

specifies the number of lambda values considered, including the LAMBDALO= and LAMBDAHI= option values. The default is NLAMBDA=2.

OUT=SAS-data-set

writes the results to an output data set. The output data set includes the lambda values tried (LAMBDA), and for each lambda value, the log likelihood (LOGLIK), the residual mean squared error (RMSE), Akaike’s information criterion (AIC), and Schwarz’s Bayesian criterion (SBC).

PRINT=YES | NO

specifies whether results are printed. The default is PRINT=YES. The printed output contains the lambda values and, for each lambda value, the log likelihood, the residual mean squared error, Akaike’s information criterion (AIC), and Schwarz’s Bayesian criterion (SBC).
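The meaning of the DIF= option and of the lambda grid options can be pictured with a short Python sketch. It is illustrative only and is not the macro's code: the helper names are hypothetical, the differences are assumed to be applied one after another in list order, and an evenly spaced grid is an assumption, since the option descriptions state only that the LAMBDALO= and LAMBDAHI= values are both included.

```python
def difference(series, lags):
    """Difference a series once at each lag in lags, in order."""
    out = list(series)
    for lag in lags:
        # Each pass shortens the series by `lag` observations.
        out = [out[t] - out[t - lag] for t in range(lag, len(out))]
    return out

def lambda_grid(lo=0.0, hi=1.0, n=2):
    """Assumed evenly spaced grid of n values from lo to hi, endpoints included."""
    if n < 2:
        raise ValueError("n must be at least 2 to include both endpoints")
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]

# Hypothetical quadratic series: lag-1 then lag-12 differencing, as in DIF=(1,12).
squares = [float(t * t) for t in range(20)]
seasonal_diff = difference(squares, (1, 12))

default_grid = lambda_grid()              # defaults: LAMBDALO=0, LAMBDAHI=1, NLAMBDA=2
wider_grid = lambda_grid(-1.0, 1.0, 5)
```

Under these assumptions the default settings try only the log transformation and no transformation, so a larger NLAMBDA= value is needed to examine intermediate powers.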

Results

The value of $\lambda$ that produces the maximum log likelihood is returned in the macro variable &BOXCOXAR. The value of &BOXCOXAR is "ERROR" if the %BOXCOXAR macro is unable to compute the best transformation because of errors. Such errors can result from large lambda values: the Box-Cox transformation raises the data to the power $\lambda$, so large lambda values can cause floating-point overflow.
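The overflow risk is easy to reproduce outside SAS. The following Python sketch is only an illustration of double-precision behavior, not of how the macro detects errors; the function name `safe_box_cox` and the sample values are hypothetical.

```python
import math

def safe_box_cox(x, lam, c=0.0):
    """Return the Box-Cox value, or None if raising to the power lam overflows."""
    shifted = x + c
    if lam == 0:
        return math.log(shifted)
    try:
        return (shifted ** lam - 1.0) / lam
    except OverflowError:
        # The power exceeded the largest representable double (about 1.8e308).
        return None

moderate = safe_box_cox(1.0e6, 2.0)    # 1e12 is well within range
extreme = safe_box_cox(1.0e6, 60.0)    # 1e360 is not representable
```

Here `extreme` is `None`, which is the kind of failure that a very large LAMBDAHI= value can provoke on a series with large values.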

Results are printed unless the PRINT=NO option is specified. Results are also stored in a SAS data set when the OUT= option is specified.

Details

Assume that the transformed series $Y_t$ is a stationary $p$th-order autoregressive process generated by independent normally distributed innovations:

\[
(1 - \Theta(B))(Y_t - \mu) = \epsilon_t
\]
\[
\epsilon_t \sim \text{iid } N(0, \sigma^2)
\]
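
For reference, the model can be written out term by term. This assumes that the Theta term is the usual autoregressive polynomial in the backshift operator $B$, where $B Y_t = Y_{t-1}$, which this section does not define explicitly. The coefficient names $\theta_1, \ldots, \theta_p$ are introduced here for illustration only.

\[
\Theta(B) = \theta_1 B + \theta_2 B^2 + \cdots + \theta_p B^p
\quad\Longrightarrow\quad
Y_t - \mu = \sum_{i=1}^{p} \theta_i \, (Y_{t-i} - \mu) + \epsilon_t
\]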

Given these assumptions, the log-likelihood function of the transformed data $Y_t$ is

\[
\begin{aligned}
l_Y(\cdot) = {} & -\frac{n}{2}\ln(2\pi) - \frac{1}{2}\ln(|\Sigma|) - \frac{n}{2}\ln(\sigma^2) \\
& -\frac{1}{2\sigma^2}\,(\mathbf{Y} - \mathbf{1}\mu)'\,\Sigma^{-1}\,(\mathbf{Y} - \mathbf{1}\mu)
\end{aligned}
\]

In this equation, $n$ is the number of observations, $\mu$ is the mean of $Y_t$, $\mathbf{1}$ is the $n$-dimensional column vector of 1s, $\sigma^2$ is the innovation variance, $\mathbf{Y} = (Y_1, \ldots, Y_n)'$, and $\Sigma$ is the covariance matrix of $\mathbf{Y}$.

The log-likelihood function of the original data $X_1, \ldots, X_n$ is

\[
l_X(\cdot) = l_Y(\cdot) + (\lambda - 1) \sum_{t=1}^{n} \ln(X_t + c)
\]

where c is the value of the CONST= option.
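
The extra term can be understood as the log of the Jacobian of the transformation. This is a standard change-of-variables argument that is not spelled out in this section. Because each transformed value depends only on the corresponding original value, the Jacobian matrix is diagonal, and the derivative below holds for the log case as well as for the power case:

\[
\frac{\partial Y_t}{\partial X_t} = (X_t + c)^{\lambda - 1}
\quad\Longrightarrow\quad
\ln \left| \det \frac{\partial \mathbf{Y}}{\partial \mathbf{X}} \right|
= \sum_{t=1}^{n} \ln (X_t + c)^{\lambda - 1}
= (\lambda - 1) \sum_{t=1}^{n} \ln(X_t + c)
\]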

For each value of $\lambda$, the maximum log-likelihood of the original data is obtained from the maximum log-likelihood of the transformed data given the maximum likelihood estimate of the autoregressive model.
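The overall search can be sketched in Python under strong simplifying assumptions. This is not the macro's algorithm: in place of exact maximum likelihood for an autoregressive model of the order given by AR=, the sketch fits a first-order autoregression by conditional least squares, so the likelihood is conditional on the first observation and the Jacobian sum skips it accordingly. All function names and the sample series are hypothetical.

```python
import math

def box_cox(x, lam, c=0.0):
    shifted = x + c
    return math.log(shifted) if lam == 0 else (shifted ** lam - 1.0) / lam

def ar1_conditional_loglik(y):
    """Gaussian log-likelihood of y[1:] given y[0], AR(1) with intercept fit by least squares."""
    prev, curr = y[:-1], y[1:]
    m = len(curr)
    mean_prev = sum(prev) / m
    mean_curr = sum(curr) / m
    sxx = sum((p - mean_prev) ** 2 for p in prev)
    sxy = sum((p - mean_prev) * (q - mean_curr) for p, q in zip(prev, curr))
    phi = sxy / sxx
    intercept = mean_curr - phi * mean_prev
    sse = sum((q - intercept - phi * p) ** 2 for p, q in zip(prev, curr))
    sigma2 = sse / m                      # ML estimate of the innovation variance
    # Log-likelihood with sigma2 concentrated out.
    return -0.5 * m * (math.log(2.0 * math.pi) + math.log(sigma2) + 1.0)

def best_lambda(x, lambdas, c=0.0):
    """Return ((lambda, loglik) with the largest loglik, full table of results)."""
    table = []
    for lam in lambdas:
        y = [box_cox(v, lam, c) for v in x]
        # Jacobian term for the observations that enter the conditional likelihood.
        jacobian = (lam - 1.0) * sum(math.log(v + c) for v in x[1:])
        table.append((lam, ar1_conditional_loglik(y) + jacobian))
    return max(table, key=lambda row: row[1]), table

# Hypothetical positive series with roughly exponential growth.
series = [math.exp(0.05 * t + 0.3 * math.sin(t)) for t in range(1, 41)]
best, table = best_lambda(series, [0.0, 0.5, 1.0])
```

Because the Jacobian term puts every candidate on the scale of the original data, the log-likelihood values in `table` are directly comparable across the powers tried.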

The maximum log-likelihood values are used to compute Akaike’s information criterion (AIC) and Schwarz’s Bayesian criterion (SBC) for each $\lambda$ value. The residual mean squared error based on the maximum likelihood estimator is also produced. To compute the mean squared error, the predicted values from the model are transformed again to the original scale (Pankratz 1983, pp. 256–258; Taylor 1986).
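As a rough illustration, the two criteria and the return to the original scale can be sketched in Python. The formulas for AIC and SBC are the standard textbook definitions. How the macro counts parameters is not stated here, so `n_params` is an assumption, and the inverse transformation shown is the plain algebraic inverse, without any adjustment that the cited references might describe. All names and numbers are hypothetical.

```python
import math

def information_criteria(loglik, n_params, n_obs):
    """Standard AIC and SBC from a maximized log-likelihood; smaller is better."""
    aic = -2.0 * loglik + 2.0 * n_params
    sbc = -2.0 * loglik + n_params * math.log(n_obs)
    return aic, sbc

def inverse_box_cox(y, lam, c=0.0):
    """Plain algebraic inverse of the Box-Cox transformation."""
    if lam == 0:
        return math.exp(y) - c
    return (lam * y + 1.0) ** (1.0 / lam) - c

# Hypothetical values: log-likelihood of -100, 7 parameters, 144 observations.
aic, sbc = information_criteria(-100.0, 7, 144)

# A prediction of 2.0 on the square-root scale maps back to 4.0.
back = inverse_box_cox(2.0, 0.5)
```

SBC penalizes each additional parameter by the log of the sample size rather than by 2, so it penalizes model size more heavily than AIC once there are more than about 7 observations.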

After differencing as specified by the DIF= option, the process is assumed to be a stationary autoregressive process. You can check for stationarity of the series with the %DFTEST macro. If the process is not stationary, differencing with the DIF= option is recommended. For a process with moving-average terms, a large value for the AR= option might be appropriate.

Last updated: June 19, 2025