This section applies to actions in the following action sets: gam, phreg, pls, quantreg, and regression.
When you have sufficient data, you can divide your data into three parts called the training, validation, and test data. During the selection process, models are fit on the training data, and their prediction errors are estimated by using the validation data. This validation error can be used to decide when to terminate the selection process and which model to select. Finally, after a model has been selected, the test data can be used to assess how well the selected model generalizes to data that played no role in selecting it.
In some cases, you might want to use only training and test data. For example, you might decide to use an information criterion to decide which effects to include and when to terminate the selection process. In this case, no validation data are required, but test data can still be useful in assessing the predictive performance of the selected model. In other cases, you might decide to use validation data during the selection process but forgo assessing the selected model on test data. Hastie, Tibshirani, and Friedman (2001) note that it is difficult to provide a general rule for how many observations you should assign to each role. They note that a typical split might be 50% for training and 25% each for validation and testing.
You use the partByFrac parameter to logically subdivide the input data table into separate roles. You specify the fractions of the data that you want to reserve as test data and validation data; the remaining observations are used for training. For example, the following CASL language statements randomly divide the input data table, reserving 50% for training and 25% each for validation and testing:
partByFrac={test=0.25, validate=0.25}
You can specify the seed subparameter in the partByFrac parameter to create the same partition data tables for a particular number of compute nodes. However, changing the number of compute nodes changes the initial distribution of data, resulting in different partition data tables.
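For example, the following CASL language statements extend the earlier fraction specification with a seed. (The value 12345 is arbitrary and shown only for illustration.)

partByFrac={test=0.25, validate=0.25, seed=12345}

Specifying the same seed with the same number of compute nodes reproduces the same partition data tables on each run.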
In some cases, you might need to exercise more control over the partitioning of the input data table. You can do this by using the partByVar parameter to name both a variable in the input data table and a formatted value of that variable for each role. For example, the following CASL language statements assign roles to the observations in the input data table based on the value of the variable Group in that data table. Observations whose value of Group is Group 1 are assigned to testing, and those whose value is Group 2 are assigned to training. All other observations are ignored.
partByVar={name='Group', test='Group 1', train='Group 2'}
When you have reserved observations for training, validation, and testing, a model that is fit on the training data is scored on the validation and test data, and statistics are computed separately for each of these subsets.