Data Science Pilot Action Set

Provides actions for automating data science workflows, including automatic machine learning pipeline exploration, execution, and ranking.

dsAutoMl Action

Automated machine learning pipeline exploration, execution, and ranking.

CASL Syntax

dataSciencePilot.dsAutoMl <result=results> <status=rc> /
ecdfTolerance=double,
event="string",
explorationPolicy={
cv={
lowMoment=double
lowRobust=double
},
dateTimeVariables={"variable-name-1" <, "variable-name-2", ...>},
dateVariables={"variable-name-1" <, "variable-name-2", ...>},
iqv={ },
nominal={
includeNegative=TRUE | FALSE
includeNonIntegral=TRUE | FALSE
intervals={"variable-name-1" <, "variable-name-2", ...>}
nominals={"variable-name-1" <, "variable-name-2", ...>}
},
timeVariables={"variable-name-1" <, "variable-name-2", ...>}
},
* featureOut={
caslib="string",
indexVars={"variable-name-1" <, "variable-name-2", ...>},
lifetime=64-bit-integer,
name="table-name",
promote=TRUE | FALSE,
replace=TRUE | FALSE,
},
hyperParameterOptimizer="MODELCOMPOSER" | "TUNEALL",
inputs={{
format="string",
formattedLength=integer,
label="string",
* name="variable-name",
nfd=integer,
nfl=integer
}, {...}},
kFolds=integer,
logLevel=integer,
misraGries=TRUE | FALSE,
modelTypes={"DECISIONTREE", "FOREST", "GLM", "GRADBOOST", "LOGISTIC", "NEURALNET"},
* pipelineOut={
caslib="string",
indexVars={"variable-name-1" <, "variable-name-2", ...>},
lifetime=64-bit-integer,
name="table-name",
promote=TRUE | FALSE,
replace=TRUE | FALSE,
},
sampleSize=integer,
saveState={casouttable} | {savePipelinesOptions},
screenPolicy={
constant=TRUE | FALSE,
groupRareLevels=TRUE | FALSE,
lowCv=TRUE | FALSE,
redundant=double
},
seed=integer,
* table={
caslib="string",
computedOnDemand=TRUE | FALSE,
computedVars={{
format="string",
formattedLength=integer,
label="string",
* name="variable-name",
nfd=integer,
nfl=integer
}, {...}},
dataSourceOptions={key-1=any-list-or-data-type-1 <, key-2=any-list-or-data-type-2, ...>},
importOptions={fileType="ANY" | "AUDIO" | "AUTO" | "BASESAS" | "CSV" | "DELIMITED" | "DOCUMENT" | "DTA" | "ESP" | "EXCEL" | "FMT" | "HDAT" | "IMAGE" | "JMP" | "LASR" | "PARQUET" | "SOUND" | "SPSS" | "VIDEO" | "XLS", fileType-specific-parameters},
* name="table-name",
singlePass=TRUE | FALSE,
vars={{
format="string",
formattedLength=integer,
label="string",
* name="variable-name",
nfd=integer,
nfl=integer
}, {...}},
where="where-expression",
whereTable={
casLib="string"
dataSourceOptions={adls_noreq-parameters | bigquery-parameters | cas_noreq-parameters | clouddex-parameters | db2-parameters | dnfs-parameters | esp-parameters | fedsvr-parameters | gcs_noreq-parameters | hadoop-parameters | hana-parameters | impala-parameters | informix-parameters | jdbc-parameters | mongodb-parameters | mysql-parameters | odbc-parameters | oracle-parameters | path-parameters | postgres-parameters | redshift-parameters | s3-parameters | sapiq-parameters | sforce-parameters | singlestore_standard-parameters | snowflake-parameters | spark-parameters | spde-parameters | sqlserver-parameters | ss_noreq-parameters | teradata-parameters | vertica-parameters | yellowbrick-parameters}
importOptions={fileType="ANY" | "AUDIO" | "AUTO" | "BASESAS" | "CSV" | "DELIMITED" | "DOCUMENT" | "DTA" | "ESP" | "EXCEL" | "FMT" | "HDAT" | "IMAGE" | "JMP" | "LASR" | "PARQUET" | "SOUND" | "SPSS" | "VIDEO" | "XLS", fileType-specific-parameters}
* name="table-name"
vars={{
format="string",
formattedLength=integer,
label="string",
* name="variable-name",
nfd=integer,
nfl=integer
}, {...}}
where="where-expression"
}
},
* target="variable-name",
topKPipelines=integer,
* transformationOut={
caslib="string",
indexVars={"variable-name-1" <, "variable-name-2", ...>},
lifetime=64-bit-integer,
name="table-name",
promote=TRUE | FALSE,
replace=TRUE | FALSE,
},
transformationPolicy={
cardinality=TRUE | FALSE,
entropy=TRUE | FALSE,
interaction=TRUE | FALSE,
iqv=TRUE | FALSE,
kurtosis=TRUE | FALSE,
missing=TRUE | FALSE,
outlier=TRUE | FALSE,
skewness=TRUE | FALSE
},
;
* indicates a required parameter
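
As a rough illustration of how the required parameters fit together, the following sketch assembles a parameter set as a plain Python dictionary and checks that every parameter marked as required in the syntax is present. The table names, the target name, and the helper missing_required are hypothetical placeholders. How the dictionary is submitted to the server depends on your client and is not shown.

```python
# Parameters marked as required in the dsAutoMl syntax.
REQUIRED = ("table", "target", "featureOut", "pipelineOut", "transformationOut")


def missing_required(params):
    """Return the required parameter names that are absent from params."""
    return [name for name in REQUIRED if name not in params]


# Hypothetical table and variable names, for illustration only.
params = {
    "table": {"name": "HMEQ"},
    "target": "BAD",
    "featureOut": {"name": "FEATURE_OUT", "replace": True},
    "pipelineOut": {"name": "PIPELINE_OUT", "replace": True},
    "transformationOut": {"name": "TRANSFORMATION_OUT", "replace": True},
    "kFolds": 5,
    "modelTypes": ["DECISIONTREE", "GRADBOOST"],
}

print(missing_required(params))  # []
```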

Summary: Input and Output Tables

If a row includes a subparameter, you can specify the name, caslib, and so on in the subparameter. Otherwise, you can specify the name, caslib, and so on in the parameter.

Parameters for Reading Input Tables

Parameter    Subparameter    Description
* table      (none)          specifies the table name, caslib, and other common parameters.

Parameters for Creating Output Tables

Parameter              Subparameter    Description
* featureOut           (none)          specifies the CAS table to store the feature transformation and generation pipelines.
* pipelineOut          (none)          specifies the CAS table to store the analysis results.
saveState              (none)          specifies the CAS table to store the analysis results.
* transformationOut    (none)          specifies the CAS table to store the feature transformation and generation pipelines.

Parameter Descriptions

distinctCountLimit=integer

specifies the distinct count limit. If the limit is exceeded, and the misraGries parameter is set to True, the Misra-Gries frequency sketch algorithm is used to estimate the frequency distribution. Otherwise, the distinct count operation is aborted.

Alias maxNVals
Default 10000
Minimum value 256
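
For readers unfamiliar with the Misra-Gries frequency sketch, the following is a minimal textbook version in Python. It is not the implementation that the action uses. The function name misra_gries and the counter budget k are assumptions made for illustration, and k is not the same thing as distinctCountLimit.

```python
def misra_gries(stream, k):
    """Approximate frequency counts using at most k - 1 counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all counters and drop the empty ones.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


stream = ["a"] * 6 + ["b"] * 3 + ["c", "d", "e"]
summary = misra_gries(stream, k=3)
print(summary)  # {'a': 3}
```

Each reported count underestimates the true count by at most n/k, where n is the stream length. In the example, the true count of "a" is 6 and the estimate is 3, which is within 12/3 = 4 of the true value.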

ecdfTolerance=double

specifies the tolerance value for the empirical cumulative distribution function. This value is used by the quantile sketch algorithm.

Default 0.001
Range 1E-06–0.1

event="string"

specifies the target variable level that you want to model. Multilevel classification problems are cast into a one-versus-all binary classification problem, where the value of the event parameter denotes the level that you are modeling.
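
The one-versus-all recoding described above can be sketched as follows. The helper to_binary_event and the level names are hypothetical. The action performs this recoding internally, so the sketch only illustrates the idea.

```python
def to_binary_event(levels, event):
    """Recode a multilevel target: 1 for the event level, 0 for all other levels."""
    return [1 if level == event else 0 for level in levels]


target = ["low", "high", "medium", "high", "low"]
print(to_binary_event(target, "high"))  # [0, 1, 0, 1, 0]
```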

explorationPolicy={avaptPolicy}

specifies the automatic variable analysis and grouping (AVAPT) policy.

Alias avaptPolicy

The avaptPolicy value can be one or more of the following:

cardinality={cardinalityAvaptPolicy}

specifies the automatic variable analysis and grouping (AVAPT) cardinality policy.

The cardinalityAvaptPolicy value can be one or more of the following:

lowMediumCutoff=double

specifies the cardinality threshold for the low-medium cutoff.

Default 32
Range 2–256
mediumHighCutoff=double

specifies the cardinality threshold for the medium-high cutoff.

Default 64
Range 2–1024
minNObsPerTargetLevel=double

specifies the minimum number of observations for each target level.

Default 10
Range 5–100
cv={cvAvaptPolicy}

specifies the automatic variable analysis and grouping (AVAPT) coefficient of variation policy.

Alias coefficientVariation

The cvAvaptPolicy value can be one or more of the following:

lowMoment=double

specifies the absolute value of the low-high percentage threshold for the moment coefficient of variation (CV).

Default 1
Minimum value 0
lowRobust=double

specifies the absolute value of the low-high percentage threshold for the robust coefficient of variation (CV).

Default 1
Minimum value 0
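
To make the two coefficient-of-variation thresholds more concrete, the sketch below computes a moment CV and a robust CV as percentages. The robust form shown here (scaled median absolute deviation divided by the median) is an assumption, because this section does not define the exact statistic that the action uses. The function names are hypothetical.

```python
import statistics


def moment_cv_percent(values):
    """Moment CV: sample standard deviation as a percentage of |mean|."""
    mean = statistics.fmean(values)
    if mean == 0:
        return float("inf")
    return 100.0 * statistics.stdev(values) / abs(mean)


def robust_cv_percent(values):
    """Robust CV (assumed form): scaled MAD as a percentage of |median|."""
    median = statistics.median(values)
    if median == 0:
        return float("inf")
    mad = statistics.median(abs(v - median) for v in values)
    return 100.0 * 1.4826 * mad / abs(median)


near_constant = [100.0, 100.5, 99.5, 100.0, 100.0]
varied = [10, 12, 11, 13, 9]

# With the default threshold of 1, the first column would count as low CV.
print(moment_cv_percent(near_constant) < 1)
print(moment_cv_percent(varied) < 1)
```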
dateTimeVariables={"variable-name-1" <, "variable-name-2", ...>}

specifies the datetime variables.

Alias dateTime
dateVariables={"variable-name-1" <, "variable-name-2", ...>}

specifies the date variables.

Alias date
entropy={entropyAvaptPolicy}

specifies the automatic variable analysis and grouping (AVAPT) entropy policy.

The entropyAvaptPolicy value can be one or more of the following:

giniLowMediumCutoff=double

specifies the Gini entropy threshold for the low-medium cutoff.

Default 0.25
Range 0–1
giniMediumHighCutoff=double

specifies the Gini entropy threshold for the medium-high cutoff.

Default 0.75
Range 0–1
shannonLowMediumCutoff=double

specifies the Shannon entropy threshold for the low-medium cutoff.

Default 0.25
Range 0–1
shannonMediumHighCutoff=double

specifies the Shannon entropy threshold for the medium-high cutoff.

Default 0.75
Range 0–1
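
Because the entropy cutoffs range from 0 to 1, one plausible reading is that the entropy measures are normalized. The sketch below assumes normalized Shannon entropy and a normalized Gini index, and it assigns a low, medium, or high group by using the default cutoffs. The normalization, the treatment of values that fall exactly on a cutoff, and the function names are all assumptions.

```python
import math
from collections import Counter


def _proportions(values):
    n = len(values)
    return [count / n for count in Counter(values).values()]


def normalized_shannon(values):
    """Shannon entropy divided by its maximum, log(number of levels)."""
    p = _proportions(values)
    if len(p) < 2:
        return 0.0
    return -sum(q * math.log(q) for q in p) / math.log(len(p))


def normalized_gini(values):
    """Gini index (1 - sum of squared proportions) divided by its maximum."""
    p = _proportions(values)
    if len(p) < 2:
        return 0.0
    return (1.0 - sum(q * q for q in p)) / (1.0 - 1.0 / len(p))


def band(value, low_medium=0.25, medium_high=0.75):
    """Assign a group; boundary handling is an assumption."""
    if value < low_medium:
        return "low"
    if value > medium_high:
        return "high"
    return "medium"


skewed = ["a"] * 8 + ["b", "c"]
print(band(normalized_shannon(skewed)), band(normalized_gini(skewed)))
```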
iqv={iqvAvaptPolicy}

specifies the automatic variable analysis and grouping (AVAPT) index of qualitative variation policy.

Alias qualitativeVariationIndex

The iqvAvaptPolicy value can be one or more of the following:

highTopBottom=double

specifies the low-high cutoff frequency ratio threshold between the most frequent and least frequent levels of a nominal variable.

Alias highTop1Bottom1
Default 100
Minimum value 1
highTopTwo=double

specifies the low-high cutoff frequency ratio threshold between the most frequent and second most frequent levels of a nominal variable.

Alias highTop1Top2
Default 10
Minimum value 1
highVariationRatio=double

specifies the variation ratio threshold for the low-high cutoff.

Alias highModVr
Default 0.5
Range (0–1]
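
The three statistics that these thresholds refer to can be computed from level frequencies, as in the sketch below. The variation ratio uses its standard definition (one minus the proportion of the modal level). How the action compares each statistic with its threshold is not stated here, so the sketch only computes the values. The function name is hypothetical.

```python
from collections import Counter


def iqv_statistics(levels):
    """Return (top/bottom ratio, top/second ratio, variation ratio)."""
    counts = sorted(Counter(levels).values(), reverse=True)
    n = sum(counts)
    top_bottom = counts[0] / counts[-1]
    # With a single level there is no second level to compare against.
    top_two = counts[0] / counts[1] if len(counts) > 1 else float("inf")
    variation_ratio = 1.0 - counts[0] / n
    return top_bottom, top_two, variation_ratio


levels = ["a"] * 60 + ["b"] * 30 + ["c"] * 10
top_bottom, top_two, variation_ratio = iqv_statistics(levels)
print(top_bottom, top_two, variation_ratio)
```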
kurtosis={kurtosisAvaptPolicy}

specifies the automatic variable analysis and grouping (AVAPT) kurtosis policy.

The kurtosisAvaptPolicy value can be one or more of the following:

momentLowMediumCutoff=double

specifies the absolute value of the moment kurtosis threshold for the low-medium cutoff.

Default 5
Minimum value 0
momentMediumHighCutoff=double

specifies the absolute value of the moment kurtosis threshold for the medium-high cutoff.

Default 10
Minimum value 0
robustLowMediumCutoff=double

specifies the absolute value of the robust kurtosis threshold for the low-medium cutoff.

Default 2
Minimum value 0
robustMediumHighCutoff=double

specifies the absolute value of the robust kurtosis threshold for the medium-high cutoff.

Default 3
Minimum value 0
missing={missingAvaptPolicy}

specifies the automatic variable analysis and grouping (AVAPT) missing grouping policy.

The missingAvaptPolicy value can be one or more of the following:

lowMediumCutoff=double

specifies the missing percentage threshold for the low-medium cutoff.

Default 5
Range 0–100
mediumHighCutoff=double

specifies the missing percentage threshold for the medium-high cutoff.

Default 25
Range 0–100
nominal={nominalAvaptPolicy}

specifies the automatic variable analysis and grouping (AVAPT) nominal policy.

The nominalAvaptPolicy value can be one or more of the following:

cardinalityRatio=double

specifies the AVAPT nominal policy cardinality ratio threshold.

Default 0.25
Range (0–1]
cardinalityThreshold=double

specifies the AVAPT nominal policy cardinality threshold.

Default 1024
Minimum value 32
includeNegative=TRUE | FALSE

when set to True, includes numeric variables with some negative values in the nominal analysis.

Default FALSE
includeNonIntegral=TRUE | FALSE

when set to True, includes numeric variables with some nonintegral values in the nominal analysis.

Default FALSE
intervals={"variable-name-1" <, "variable-name-2", ...>}

specifies variables to consider as intervals.

nominals={"variable-name-1" <, "variable-name-2", ...>}

specifies variables to consider as nominals.

outlier={outlierAvaptPolicy}

specifies the automatic variable analysis and grouping (AVAPT) outlier policy.

The outlierAvaptPolicy value can be one or more of the following:

momentLowMediumCutoff=double

specifies the z-score outlier percentage threshold for the low-medium cutoff.

Default 1
Range 0–100
momentMediumHighCutoff=double

specifies the z-score outlier percentage threshold for the medium-high cutoff.

Default 2.5
Range 0–100
robustLowMediumCutoff=double

specifies the modified interquartile range outlier percentage threshold for the low-medium cutoff.

Default 1
Range 0–100
robustMediumHighCutoff=double

specifies the modified interquartile range outlier percentage threshold for the medium-high cutoff.

Default 2.5
Range 0–100
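
As an illustration of an outlier percentage, the sketch below counts z-score outliers and interquartile-range outliers and expresses each count as a percentage of the rows. The cutoff of 3 standard deviations, the 1.5 x IQR fences, and the function names are assumptions. The modified interquartile range rule that the action applies is not defined in this section and might differ.

```python
import statistics


def zscore_outlier_percent(values, z_cut=3.0):
    """Percentage of values whose absolute z-score exceeds z_cut (assumed cutoff)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return 0.0
    flagged = sum(1 for v in values if abs(v - mean) / sd > z_cut)
    return 100.0 * flagged / len(values)


def iqr_outlier_percent(values, k=1.5):
    """Percentage of values outside the k * IQR fences (assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    lower, upper = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    flagged = sum(1 for v in values if v < lower or v > upper)
    return 100.0 * flagged / len(values)


values = list(range(1, 21)) + [100]
print(zscore_outlier_percent(values), iqr_outlier_percent(values))
```

In this example, one value out of 21 is flagged by both rules (about 4.76%), which would exceed the default medium-high cutoff of 2.5.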
skewness={skewnessAvaptPolicy}

specifies the automatic variable analysis and grouping (AVAPT) skewness policy.

The skewnessAvaptPolicy value can be one or more of the following:

momentLowMediumCutoff=double

specifies the moment skewness threshold for the low-medium cutoff.

Default 2
Range 0–100
momentMediumHighCutoff=double

specifies the moment skewness threshold for the medium-high cutoff.

Default 10
Range 0–100
robustLowMediumCutoff=double

specifies the robust skewness threshold for the low-medium cutoff.

Default 0.75
Range 0–3
robustMediumHighCutoff=double

specifies the robust skewness threshold for the medium-high cutoff.

Default 2
Range 0–3
timeVariables={"variable-name-1" <, "variable-name-2", ...>}

specifies the time variables.

Alias time

* featureOut={casouttable}

specifies the CAS table to store the feature transformation and generation pipelines.

Alias featuresOut
Long form featureOut={name="table-name"}
Shortcut form featureOut="table-name"

The casouttable value can be one or more of the following:

caslib="string"

specifies the name of the caslib for the output table.

indexVars={"variable-name-1" <, "variable-name-2", ...>}

specifies the list of variables to create indexes for in the output data.

lifetime=64-bit-integer

specifies the number of seconds to keep the table in memory after it is last accessed. The table is dropped if it is not accessed for the specified number of seconds.

Default 0
Minimum value 0
memoryFormat="DVR" | "INHERIT" | "STANDARD"

specifies the memory format for the output table.

Default INHERIT
DVR

use the duplicate value reduction memory format. This memory format can reduce the memory consumption and file size when the input data contains duplicate values.

INHERIT

use the default memory format that is set for the server. By default, the server uses the standard memory format. If an administrator sets the CAS_DEFAULT_MEMORY_FORMAT environment variable to DVR, then the DVR memory format is set as the default for the server.

STANDARD

use the standard memory format.

name="table-name"

specifies the name for the output table.

promote=TRUE | FALSE

when set to True, adds the output table with a global scope. This enables other sessions to access the table, subject to access controls. The target caslib must also have a global scope.

Default FALSE
replace=TRUE | FALSE

when set to True, overwrites an existing table that has the same name.

Default FALSE
tableRedistUpPolicy="DEFER" | "NOREDIST" | "REBALANCE"

specifies the table redistribution policy to apply when the number of worker pods increases on a running CAS server.

DEFER

defers the redistribution policy selection to a higher-level entity.

NOREDIST

does not redistribute table data when the number of worker pods changes on a running CAS server.

REBALANCE

rebalances table data when the number of worker pods changes on a running CAS server.

hyperParameterOptimizer="MODELCOMPOSER" | "TUNEALL"

specifies the method to use for hyperparameter optimization.

Alias hpOptimizer
Default TUNEALL

inputs={{casinvardesc-1} <, {casinvardesc-2}, ...>}

specifies the variables to use for the analysis. You can specify a subset of the variables from the input table.

For more information about specifying the inputs parameter, see the common casinvardesc parameter (Appendix A: Common Parameters).

Alias vars

kFolds=integer

specifies the number of folds for cross validation.

Default 5
Range 2–10
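
For reference, k-fold cross validation partitions the rows into k folds and uses each fold once for validation while training on the rest. The sketch below shows one generic way to produce such splits. It does not reflect how the action assigns rows to folds, and the function name and seeding behavior are assumptions.

```python
import random


def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (training, validation) row-index lists for each of k folds."""
    if not 2 <= k <= 10:
        raise ValueError("k must be in the documented range 2-10")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


splits = list(k_fold_indices(10, k=5))
for training, validation in splits:
    print(len(training), len(validation))
```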

logLevel=integer

specifies the logging level.

Default 0
Range 0–3

misraGries=TRUE | FALSE

when set to True, uses the Misra-Gries algorithm for the frequency distribution estimation, if the distinct count limit is exceeded.

Default TRUE

modelTypes={"DECISIONTREE", "FOREST", "GLM", "GRADBOOST", "LOGISTIC", "NEURALNET"}

specifies the values to control the types and classes of machine learning algorithms to include in the pipeline exploration.

DECISIONTREE

specifies the decision tree model.

FOREST

specifies the random forest model.

GLM

specifies the generalized linear model.

GRADBOOST

specifies the gradient boosting model.

LOGISTIC

specifies the logistic regression model.

NEURALNET

specifies the neural network model.

objective="ASE" | "AUC" | "F1" | "MAE" | "MCE" | "MSLE" | "RASE" | "RMAE" | "RMSLE"

specifies the model performance metric to use.

ASE

uses the average square error.

AUC

uses the area under the receiver operating characteristic curve.

F1

uses the F1 coefficient.

MAE

uses the mean absolute error.

MCE

uses the misclassification error.

Alias MCR
MSLE

uses the mean square logarithmic error.

RASE

uses the root average square error.

RMAE

uses the root mean absolute error.

RMSLE

uses the root mean square logarithmic error.
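
Several of these metrics are simple functions of the prediction errors. The sketch below uses common textbook definitions. The action's exact computations (for example, weighting or the handling of missing values) might differ, and the function names are hypothetical.

```python
import math


def ase(actual, predicted):
    """Average square error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def msle(actual, predicted):
    """Mean square logarithmic error, using log(1 + x)."""
    return sum((math.log1p(a) - math.log1p(p)) ** 2
               for a, p in zip(actual, predicted)) / len(actual)


def mce(actual_labels, predicted_labels):
    """Misclassification error: share of labels predicted incorrectly."""
    wrong = sum(1 for a, p in zip(actual_labels, predicted_labels) if a != p)
    return wrong / len(actual_labels)


y = [3.0, 5.0, 2.0]
p = [2.0, 5.0, 4.0]

# The root variants are square roots of the corresponding metrics.
rase = math.sqrt(ase(y, p))
rmae = math.sqrt(mae(y, p))
rmsle = math.sqrt(msle(y, p))
print(ase(y, p), rase, mae(y, p), rmae, rmsle)
```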

* pipelineOut={casouttable}

specifies the CAS table to store the analysis results.

Alias pipelinesOut
Long form pipelineOut={name="table-name"}
Shortcut form pipelineOut="table-name"

The casouttable value can be one or more of the following:

caslib="string"

specifies the name of the caslib for the output table.

indexVars={"variable-name-1" <, "variable-name-2", ...>}

specifies the list of variables to create indexes for in the output data.

lifetime=64-bit-integer

specifies the number of seconds to keep the table in memory after it is last accessed. The table is dropped if it is not accessed for the specified number of seconds.

Default 0
Minimum value 0
memoryFormat="DVR" | "INHERIT" | "STANDARD"

specifies the memory format for the output table.

Default INHERIT
DVR

use the duplicate value reduction memory format. This memory format can reduce the memory consumption and file size when the input data contains duplicate values.

INHERIT

use the default memory format that is set for the server. By default, the server uses the standard memory format. If an administrator sets the CAS_DEFAULT_MEMORY_FORMAT environment variable to DVR, then the DVR memory format is set as the default for the server.

STANDARD

use the standard memory format.

name="table-name"

specifies the name for the output table.

promote=TRUE | FALSE

when set to True, adds the output table with a global scope. This enables other sessions to access the table, subject to access controls. The target caslib must also have a global scope.

Default FALSE
replace=TRUE | FALSE

when set to True, overwrites an existing table that has the same name.

Default FALSE
tableRedistUpPolicy="DEFER" | "NOREDIST" | "REBALANCE"

specifies the table redistribution policy to apply when the number of worker pods increases on a running CAS server.

DEFER

defers the redistribution policy selection to a higher-level entity.

NOREDIST

does not redistribute table data when the number of worker pods changes on a running CAS server.

REBALANCE

rebalances table data when the number of worker pods changes on a running CAS server.

sampleSize=integer

specifies the maximum number of pipelines to sample.

Default 10
Minimum value 1

saveState={casouttable} | {savePipelinesOptions}

specifies the CAS table to store the analysis results.

Alias saveModel

The casouttable value can be one or more of the following:

caslib="string"

specifies the name of the caslib for the output table.

compress=TRUE | FALSE

when set to True, applies data compression to the table.

Default FALSE
indexVars={"variable-name-1" <, "variable-name-2", ...>}

specifies the list of variables to create indexes for in the output data.

label="string"

specifies the descriptive label to associate with the table.

lifetime=64-bit-integer

specifies the number of seconds to keep the table in memory after it is last accessed. The table is dropped if it is not accessed for the specified number of seconds.

Default 0
Minimum value 0
maxMemSize=64-bit-integer

specifies the maximum amount of memory, in bytes, that each thread should allocate for in-memory blocks before converting to a memory-mapped file. Files are written in the directories that are specified in the CAS_DISK_CACHE environment variable.

TIP You can enclose the value in quotation marks and specify B, K, M, G, or T as a suffix to indicate the units. For example, "8M" specifies eight megabytes.
memoryFormat="DVR" | "INHERIT" | "STANDARD"

specifies the memory format for the output table.

Default INHERIT
DVR

use the duplicate value reduction memory format. This memory format can reduce the memory consumption and file size when the input data contains duplicate values.

INHERIT

use the default memory format that is set for the server. By default, the server uses the standard memory format. If an administrator sets the CAS_DEFAULT_MEMORY_FORMAT environment variable to DVR, then the DVR memory format is set as the default for the server.

STANDARD

use the standard memory format.

name="table-name"

specifies the name for the output table.

promote=TRUE | FALSE

when set to True, adds the output table with a global scope. This enables other sessions to access the table, subject to access controls. The target caslib must also have a global scope.

Default FALSE
replace=TRUE | FALSE

when set to True, overwrites an existing table that has the same name.

Default FALSE
replication=integer

specifies the number of copies of the table to make for fault tolerance. Larger values result in slower performance and use more memory, but provide high availability for data in the event of a node failure. Data redundancy applies to distributed servers only.

Default 1
Minimum value 0
tableRedistUpPolicy="DEFER" | "NOREDIST" | "REBALANCE"

specifies the table redistribution policy to apply when the number of worker pods increases on a running CAS server.

DEFER

defers the redistribution policy selection to a higher-level entity.

NOREDIST

does not redistribute table data when the number of worker pods changes on a running CAS server.

REBALANCE

rebalances table data when the number of worker pods changes on a running CAS server.

threadBlockSize=64-bit-integer

specifies the number of bytes to use for blocks in the output table. The blocks are read by threads. Gradually increase this value when you have a large table with millions or billions of rows and you are tuning for performance. Larger values can increase performance with indexed tables. However, if the value is too large, then you can cause thread starvation due to too few blocks for threads to work on.

Alias blockSize
Default 1048576
Minimum value 0
TIP You can enclose the value in quotation marks and specify B, K, M, G, or T as a suffix to indicate the units. For example, "8M" specifies eight megabytes.
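
The unit suffixes in the tip can be read as in the sketch below. It assumes binary multiples (1K = 1024 bytes), which is consistent with the default of 1048576 bytes being one megabyte, but this section does not confirm that assumption. The helper parse_size is hypothetical and is not part of any SAS client.

```python
# Assumed binary multiples for each unit suffix.
UNITS = {"B": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}


def parse_size(value):
    """Convert an integer or a quoted value such as "8M" to a byte count."""
    if isinstance(value, int):
        return value
    text = value.strip().upper()
    if text and text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


print(parse_size("8M"))  # 8388608
```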
timeStamp="string"

specifies to add a timestamp column to the table. Support for timeStamp is action-specific. Specify the value in the form that is appropriate for your session locale.

where={"string-1" <, "string-2", ...>}

specifies one or more expressions for subsetting the output data. When multiple expressions are specified, the expressions are effectively combined using AND to form the final output filter. If an expression contains quoted values, use nested quotation marks.

The savePipelinesOptions value can be one or more of the following:

modelNamePrefix="string"

specifies the prefix to use for the names of the saved models.

replace=TRUE | FALSE

when set to True, overwrites existing models that have the same name.

Default TRUE
topK=integer

specifies the number of best-performing models to save.

Default 5
Minimum value 1

screenPolicy={sweeperPolicy}

specifies the variable screening policy to use for recommending that variables be screened out, transformed, or copied.

Alias sweeperPolicy

The sweeperPolicy value can be one or more of the following:

constant=TRUE | FALSE

when set to True, uses the variable screening policy to identify variables that have constant values.

Alias unique
Default TRUE
groupRareLevels=TRUE | FALSE

when set to True, uses the variable screening policy to identify nominal variables that have rare levels.

Alias groupRare
Default TRUE
leakagePercentThreshold=double

specifies the variable screening policy for variables that have a very high level of information about the target. Variables that have a greater target entropy percentage reduction than the specified threshold are flagged as leakage variables.

Alias leakagePercentageThreshold
Default 90
Range (0–100]
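
The target entropy percentage reduction can be illustrated as the relative drop from the entropy of the target to the conditional entropy of the target given a candidate variable. The sketch below uses that textbook definition for nominal values. Whether the action computes the statistic in exactly this way (for example, after binning interval variables) is an assumption, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy in bits."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def conditional_entropy(target, feature):
    """Entropy of the target within each feature level, weighted by level size."""
    groups = {}
    for x, y in zip(feature, target):
        groups.setdefault(x, []).append(y)
    n = len(target)
    return sum(len(ys) / n * entropy(ys) for ys in groups.values())


def entropy_reduction_percent(target, feature):
    h = entropy(target)
    if h == 0:
        return 0.0
    return 100.0 * (h - conditional_entropy(target, feature)) / h


target = [0, 0, 1, 1]
leaky = ["n", "n", "y", "y"]   # determines the target completely
noisy = ["a", "b", "a", "b"]   # carries no information about the target

threshold = 90  # default leakagePercentThreshold
print(entropy_reduction_percent(target, leaky) > threshold)
print(entropy_reduction_percent(target, noisy) > threshold)
```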
lowCv=TRUE | FALSE

when set to True, uses the variable screening policy to identify variables that have a low coefficient of variation (CV).

Alias lowCoefficientVariation
Default TRUE
lowMutualInformation=double

specifies the variable screening policy for variables that have a low level of information about the target.

Alias lowInformation
Default 0.05
Minimum value 0
missingIndicatorPercent=double

specifies the variable screening policy for generating missing indicator variables.

Alias missingIndicatorPercentage
Default 75
Range [10–100)
missingPercentThreshold=double

specifies the variable screening policy for identifying variables that have a very high missing rate.

Alias missingPercentageThreshold
Default 90
Range [10–100)
redundant=double

specifies the symmetric uncertainty (SU) threshold for identifying redundant variables. If the SU for two variables exceeds the threshold, the variable that has less information about the target is flagged as redundant.

Default 1
Range (0–1]
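
Symmetric uncertainty is commonly defined as twice the mutual information of two variables divided by the sum of their entropies, which gives a value between 0 and 1. The sketch below follows that standard definition for nominal values. It is not taken from the action's implementation, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy in bits; values can be any hashable items."""
    values = list(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """SU = 2 * I(X;Y) / (H(X) + H(Y))."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    mutual_information = hx + hy - entropy(zip(x, y))
    return 2.0 * mutual_information / (hx + hy)


a = [0, 0, 1, 1]
duplicate = ["u", "u", "v", "v"]   # same information as a
unrelated = [0, 1, 0, 1]           # independent of a

print(symmetric_uncertainty(a, duplicate))
print(symmetric_uncertainty(a, unrelated))
```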

seed=integer

specifies a seed value for random number generation. This value is used for repeatable random number generation in some scenarios.

Default 0

selectionPolicy={featureSelectOptions}

specifies the feature selection policy.

Long form selectionPolicy={criterion="CHISQ" | "CRAMERSV" | "ENTROPY" | "FTEST" | "G2" | "IV" | "MI" | "NORMMI" | "PEARSON" | "SU"}
Shortcut form selectionPolicy="CHISQ" | "CRAMERSV" | "ENTROPY" | "FTEST" | "G2" | "IV" | "MI" | "NORMMI" | "PEARSON" | "SU"

The featureSelectOptions value can be one or more of the following:

criterion="CHISQ" | "CRAMERSV" | "ENTROPY" | "FTEST" | "G2" | "IV" | "MI" | "NORMMI" | "PEARSON" | "SU"

specifies the filter feature selection criterion to use.

Alias stat
Default MI
CHISQ

uses the chi-square statistic.

CRAMERSV

uses Cramer's V.

ENTROPY

uses the entropy percentage decrease.

FTEST

uses the F test.

G2

uses the G2 statistic.

IV

uses the information value statistic.

MI

uses the mutual information statistic.

NORMMI

uses the normalized mutual information statistic.

PEARSON

uses the Pearson correlation.

SU

uses the symmetric uncertainty statistic.

topK=integer

specifies the number of features to select, taking those that have the highest filter selection criterion values. If both topK and topKPercent are specified, then topKPercent is used.

Default 50
Minimum value 1
topKPercent=double

specifies the percentage of features to select, taking those that have the highest filter selection criterion values. If both topK and topKPercent are specified, then topKPercent is used.

Alias topKPercentage
Range (0–100]
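
The precedence between topK and topKPercent can be sketched as follows. The rule that topKPercent wins when both are specified comes from the descriptions above. Rounding the percentage up to a whole number of features and capping at the number of available features are assumptions, and the function name is hypothetical.

```python
import math


def n_features_to_select(n_features, top_k=50, top_k_percent=None):
    """Return how many features are kept under the documented precedence."""
    if top_k_percent is not None:
        if not 0 < top_k_percent <= 100:
            raise ValueError("top_k_percent must be in the range (0, 100]")
        # Rounding up is an assumption; the reference does not state the rule.
        wanted = math.ceil(n_features * top_k_percent / 100)
        return min(n_features, max(1, wanted))
    return min(n_features, top_k)


print(n_features_to_select(200))                               # 50
print(n_features_to_select(200, top_k=50, top_k_percent=10))   # 20
```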

* table={castable}

specifies the table name, caslib, and other common parameters.

Long form table={name="table-name"}
Shortcut form table="table-name"

The castable value can be one or more of the following:

caslib="string"

specifies the caslib for the input table that you want to use with the action. By default, the active caslib is used. Specify a value only if you need to access a table from a different caslib.

computedOnDemand=TRUE | FALSE

when set to True, creates the computed variables when the table is loaded instead of when the action begins.

Alias compOnDemand
Default FALSE
computedVars={{casinvardesc-1} <, {casinvardesc-2}, ...>}

specifies the names of the computed variables to create. Specify an expression for each variable in the computedVarsProgram parameter. If you do not specify this parameter, then all variables from computedVarsProgram are automatically included.

Alias compVars

The casinvardesc value can be one or more of the following:

format="string"

specifies the format to apply to the variable.

formattedLength=integer

specifies the length of the format field plus the length of the format precision.

label="string"

specifies the descriptive label for the variable.

* name="variable-name"

specifies the name for the variable.

nfd=integer

specifies the length of the format precision.

nfl=integer

specifies the length of the format field.

computedVarsProgram="string"

specifies an expression for each computed variable that you include in the computedVars parameter.

Alias compPgm
dataSourceOptions={key-1=any-list-or-data-type-1 <, key-2=any-list-or-data-type-2, ...>}

specifies data source options.

Aliases options, dataSource
importOptions={fileType="ANY" | "AUDIO" | "AUTO" | "BASESAS" | "CSV" | "DOCUMENT" | "DTA" | "ESP" | "EXCEL" | "FMT" | "HDAT" | "IMAGE" | "JMP" | "LASR" | "PARQUET" | "SPSS" | "VIDEO" | "XLS", fileType-specific-parameters}

specifies the settings for reading a table from a data source.

Alias import

For more information about specifying the importOptions parameter, see the common importOptions parameter (Appendix A: Common Parameters).

* name="table-name"

specifies the name of the input table.

singlePass=TRUE | FALSE

when set to True, does not create a transient table on the server. Setting this parameter to True can be efficient, but the data might not have stable ordering upon repeated runs.

Default FALSE
vars={{casinvardesc-1} <, {casinvardesc-2}, ...>}

specifies the variables to use in the action.

The casinvardesc value can be one or more of the following:

format="string"

specifies the format to apply to the variable.

formattedLength=integer

specifies the length of the format field plus the length of the format precision.

label="string"

specifies the descriptive label for the variable.

* name="variable-name"

specifies the name for the variable.

nfd=integer

specifies the length of the format precision.

nfl=integer

specifies the length of the format field.

where="where-expression"

specifies an expression for subsetting the input data.

whereTable={groupbytable}

specifies an input table that contains rows to use as a WHERE filter. If the vars parameter is not specified, then all the variable names that are common to the input table and the filtering table are used to find matching rows. If the where parameter for the input table and this parameter are specified, then this filtering table is applied first.
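
Conceptually, the filtering table acts like a semi-join: an input row is kept when its values on the matching variables appear in some row of the filtering table. The sketch below models rows as Python dictionaries and applies the filtering table before a where predicate, following the order described above. The data, the function name, and the representation are hypothetical and say nothing about how the server evaluates the filter.

```python
def apply_where_table(rows, filter_rows, variables=None):
    """Keep rows whose values on the matching variables occur in filter_rows."""
    if not rows or not filter_rows:
        return []
    if variables is None:
        # Default: use the variable names common to both tables.
        variables = [name for name in rows[0] if name in filter_rows[0]]
    keys = {tuple(f[v] for v in variables) for f in filter_rows}
    return [r for r in rows if tuple(r[v] for v in variables) in keys]


rows = [
    {"id": 1, "region": "east", "amount": 10},
    {"id": 2, "region": "west", "amount": 20},
    {"id": 3, "region": "east", "amount": 30},
]
filter_rows = [{"region": "east"}]

kept = apply_where_table(rows, filter_rows)         # filtering table first
final = [r for r in kept if r["amount"] > 15]       # then the where expression
print([r["id"] for r in kept], [r["id"] for r in final])  # [1, 3] [3]
```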

The groupbytable value can be one or more of the following:

casLib="string"

specifies the caslib for the filter table. By default, the active caslib is used.

dataSourceOptions={adls_noreq-parameters | bigquery-parameters | cas_noreq-parameters | clouddex-parameters | db2-parameters | dnfs-parameters | esp-parameters | fedsvr-parameters | gcs_noreq-parameters | hadoop-parameters | hana-parameters | impala-parameters | informix-parameters | jdbc-parameters | mongodb-parameters | mysql-parameters | odbc-parameters | oracle-parameters | path-parameters | postgres-parameters | redshift-parameters | s3-parameters | sapiq-parameters | sforce-parameters | singlestore_standard-parameters | snowflake-parameters | spark-parameters | spde-parameters | sqlserver-parameters | ss_noreq-parameters | teradata-parameters | vertica-parameters | yellowbrick-parameters}

specifies data source options.

Aliases options, dataSource

For more information about specifying the dataSourceOptions parameter, see the common dataSourceOptions parameter (Appendix A: Common Parameters).

importOptions={fileType="ANY" | "AUDIO" | "AUTO" | "BASESAS" | "CSV" | "DOCUMENT" | "DTA" | "ESP" | "EXCEL" | "FMT" | "HDAT" | "IMAGE" | "JMP" | "LASR" | "PARQUET" | "SPSS" | "VIDEO" | "XLS", fileType-specific-parameters}

specifies the settings for reading a table from a data source.

Alias import

For more information about specifying the importOptions parameter, see the common importOptions parameter (Appendix A: Common Parameters).

* name="table-name"

specifies the name of the filter table.

vars={{casinvardesc-1} <, {casinvardesc-2}, ...>}

specifies the variable names to use from the filter table.

The casinvardesc value can be one or more of the following:

format="string"

specifies the format to apply to the variable.

formattedLength=integer

specifies the length of the format field plus the length of the format precision.

label="string"

specifies the descriptive label for the variable.

* name="variable-name"

specifies the name for the variable.

nfd=integer

specifies the length of the format precision.

nfl=integer

specifies the length of the format field.

where="where-expression"

specifies an expression for subsetting the data from the filter table.
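The matching behavior of a whereTable filter can be pictured with a small Python sketch: input rows are kept when their values agree with some row of the filter table on the chosen variables, and when no variables are given, the names common to both tables are used. The list-of-dictionaries representation, the function name, and the sample rows are illustrative assumptions and are not part of the action.

```python
def apply_where_table(input_rows, filter_table_rows, match_vars=None):
    """Keep input rows that match at least one filter row on match_vars."""
    if not input_rows or not filter_table_rows:
        return []
    if match_vars is None:
        # Default: every variable name common to both tables.
        match_vars = sorted(set(input_rows[0]) & set(filter_table_rows[0]))
    keys = {tuple(row[v] for v in match_vars) for row in filter_table_rows}
    return [row for row in input_rows
            if tuple(row[v] for v in match_vars) in keys]


# Hypothetical data, for illustration only.
input_table = [
    {"region": "east", "year": 2023, "sales": 10},
    {"region": "west", "year": 2023, "sales": 20},
    {"region": "east", "year": 2024, "sales": 30},
]
filter_table = [{"region": "east", "year": 2023}]

kept_common = apply_where_table(input_table, filter_table)
kept_region = apply_where_table(input_table, filter_table, match_vars=["region"])
```

Matching on both common variables keeps one row, whereas restricting the match to region keeps both rows for the east region.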

* target="variable-name"

specifies the target variable.

Alias evalVar

topKPipelines=integer

specifies the number of best-performing pipelines to save.

Default 10
Minimum value 1

* transformationOut={casouttable}

specifies the CAS table to store the feature transformation and generation pipelines.

Alias transformationsOut
Long form transformationOut={name="table-name"}
Shortcut form transformationOut="table-name"

The casouttable value can be one or more of the following:

caslib="string"

specifies the name of the caslib for the output table.

indexVars={"variable-name-1" <, "variable-name-2", ...>}

specifies the list of variables to create indexes for in the output data.

lifetime=64-bit-integer

specifies the number of seconds to keep the table in memory after it is last accessed. The table is dropped if it is not accessed for the specified number of seconds.

Default 0
Minimum value 0
memoryFormat="DVR" | "INHERIT" | "STANDARD"

specifies the memory format for the output table.

Default INHERIT
DVR

use the duplicate value reduction memory format. This memory format can reduce the memory consumption and file size when the input data contains duplicate values.

INHERIT

use the default memory format that is set for the server. By default, the server uses the standard memory format. If an administrator sets the CAS_DEFAULT_MEMORY_FORMAT environment variable to DVR, then the DVR memory format is set as the default for the server.

STANDARD

use the standard memory format.

name="table-name"

specifies the name for the output table.

promote=TRUE | FALSE

when set to True, adds the output table with a global scope. This enables other sessions to access the table, subject to access controls. The target caslib must also have a global scope.

Default FALSE
replace=TRUE | FALSE

when set to True, overwrites an existing table that has the same name.

Default FALSE
tableRedistUpPolicy="DEFER" | "NOREDIST" | "REBALANCE"

specifies the table redistribution policy to apply when the number of worker pods increases on a running CAS server.

DEFER

defers the choice of redistribution policy to a higher-level entity.

NOREDIST

does not redistribute table data when the number of worker pods changes on a running CAS server.

REBALANCE

rebalances table data when the number of worker pods changes on a running CAS server.

transformationPolicy={transformationSpace}

specifies the feature transformation and generation space in which the feature machine operates.

Alias transformationSpace

The transformationSpace value can be one or more of the following:

cardinality=TRUE | FALSE

when set to True, includes cardinality-reducing transformations.

Default TRUE
entropy=TRUE | FALSE

when set to True, includes transformations for the treatment of low entropy.

Default FALSE
interaction=TRUE | FALSE

when set to True, detects and generates interaction features.

Default FALSE
iqv=TRUE | FALSE

when set to True, includes transformations for the treatment of low indices of qualitative variation (IQV).

Default FALSE
kurtosis=TRUE | FALSE

when set to True, includes transformations for the treatment of high kurtosis.

Default FALSE
missing=TRUE | FALSE

when set to True, includes transformations for the treatment of missing values.

Default TRUE
outlier=TRUE | FALSE

when set to True, includes transformations for the treatment of outliers.

Default FALSE
skewness=TRUE | FALSE

when set to True, includes transformations for the treatment of high skewness.

Default TRUE
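As a minimal sketch, the documented defaults for the transformationSpace options can be written as a Python dictionary and combined with user overrides. The helper name and the override values are hypothetical, and how the action itself resolves defaults is not described here.

```python
# Defaults copied from the option descriptions above.
TRANSFORMATION_POLICY_DEFAULTS = {
    "cardinality": True,
    "entropy": False,
    "interaction": False,
    "iqv": False,
    "kurtosis": False,
    "missing": True,
    "outlier": False,
    "skewness": True,
}


def resolve_transformation_policy(overrides=None):
    """Return the effective policy: documented defaults plus overrides."""
    overrides = overrides or {}
    unknown = set(overrides) - set(TRANSFORMATION_POLICY_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown transformationPolicy options: {sorted(unknown)}")
    return {**TRANSFORMATION_POLICY_DEFAULTS, **overrides}


# Hypothetical override: enable outlier treatment, disable skewness treatment.
policy = resolve_transformation_policy({"outlier": True, "skewness": False})
enabled = sorted(name for name, on in policy.items() if on)
```

With these overrides, the enabled transformation families are cardinality, missing, and outlier.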

validationPartitionFraction=double

specifies the fraction of the input data to use for validation.

Default 0.3
Range 0.01–0.99
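To illustrate what a given value means in terms of row counts, the following Python sketch splits a row count according to the fraction. It only does the arithmetic: the rounding rule is an assumption, and the way the action actually selects validation rows is not stated in this section.

```python
def split_counts(n_rows, validation_fraction=0.3):
    """Return (training_rows, validation_rows) for a given fraction."""
    if not 0.01 <= validation_fraction <= 0.99:
        raise ValueError("validationPartitionFraction must be in 0.01-0.99")
    # Rounding to the nearest row is an assumption made for this sketch.
    n_validation = round(n_rows * validation_fraction)
    return n_rows - n_validation, n_validation


train_rows, validation_rows = split_counts(1000)  # documented default of 0.3
```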

dsAutoMl Action

Automated machine learning pipeline exploration, execution, and ranking.

Lua Syntax

results, info = s:dataSciencePilot_dsAutoMl{
ecdfTolerance=double,
event="string",
explorationPolicy={
cv={
lowMoment=double,
lowRobust=double
},
dateTimeVariables={"variable-name-1" <, "variable-name-2", ...>},
dateVariables={"variable-name-1" <, "variable-name-2", ...>},
iqv={ },
nominal={
includeNegative=true | false,
includeNonIntegral=true | false,
intervals={"variable-name-1" <, "variable-name-2", ...>},
nominals={"variable-name-1" <, "variable-name-2", ...>}
},
timeVariables={"variable-name-1" <, "variable-name-2", ...>}
},
required parameter featureOut={
caslib="string",
indexVars={"variable-name-1" <, "variable-name-2", ...>},
lifetime=64-bit-integer,
name="table-name",
promote=true | false,
replace=true | false,
},
hyperParameterOptimizer="MODELCOMPOSER" | "TUNEALL",
inputs={{
format="string",
formattedLength=integer,
label="string",
required parameter name="variable-name",
nfd=integer,
nfl=integer
}, {...}},
kFolds=integer,
logLevel=integer,
misraGries=true | false,
modelTypes={"DECISIONTREE", "FOREST", "GLM", "GRADBOOST", "LOGISTIC", "NEURALNET"},
required parameter pipelineOut={
caslib="string",
indexVars={"variable-name-1" <, "variable-name-2", ...>},
lifetime=64-bit-integer,
name="table-name",
promote=true | false,
replace=true | false,
},
sampleSize=integer,
saveState={casouttable} | {savePipelinesOptions},
screenPolicy={
constant=true | false,
groupRareLevels=true | false,
lowCv=true | false,
redundant=double
},
seed=integer,
required parameter table={
caslib="string",
computedOnDemand=true | false,
computedVars={{
format="string",
formattedLength=integer,
label="string",
required parameter name="variable-name",
nfd=integer,
nfl=integer
}, {...}},
dataSourceOptions={key-1=any-list-or-data-type-1 <, key-2=any-list-or-data-type-2, ...>},
importOptions={fileType="ANY" | "AUDIO" | "AUTO" | "BASESAS" | "CSV" | "DELIMITED" | "DOCUMENT" | "DTA" | "ESP" | "EXCEL" | "FMT" | "HDAT" | "IMAGE" | "JMP" | "LASR" | "PARQUET" | "SOUND" | "SPSS" | "VIDEO" | "XLS", fileType-specific-parameters},
required parameter name="table-name",
singlePass=true | false,
vars={{
format="string",
formattedLength=integer,
label="string",
required parameter name="variable-name",
nfd=integer,
nfl=integer
}, {...}},
where="where-expression",
whereTable={
casLib="string",
dataSourceOptions={adls_noreq-parameters | bigquery-parameters | cas_noreq-parameters | clouddex-parameters | db2-parameters | dnfs-parameters | esp-parameters | fedsvr-parameters | gcs_noreq-parameters | hadoop-parameters | hana-parameters | impala-parameters | informix-parameters | jdbc-parameters | mongodb-parameters | mysql-parameters | odbc-parameters | oracle-parameters | path-parameters | postgres-parameters | redshift-parameters | s3-parameters | sapiq-parameters | sforce-parameters | singlestore_standard-parameters | snowflake-parameters | spark-parameters | spde-parameters | sqlserver-parameters | ss_noreq-parameters | teradata-parameters | vertica-parameters | yellowbrick-parameters},
importOptions={fileType="ANY" | "AUDIO" | "AUTO" | "BASESAS" | "CSV" | "DELIMITED" | "DOCUMENT" | "DTA" | "ESP" | "EXCEL" | "FMT" | "HDAT" | "IMAGE" | "JMP" | "LASR" | "PARQUET" | "SOUND" | "SPSS" | "VIDEO" | "XLS", fileType-specific-parameters},
required parameter name="table-name",
vars={{
format="string",
formattedLength=integer,
label="string",
required parameter name="variable-name",
nfd=integer,
nfl=integer
}, {...}},
where="where-expression"
}
},
required parameter target="variable-name",
topKPipelines=integer,
required parameter transformationOut={
caslib="string",
indexVars={"variable-name-1" <, "variable-name-2", ...>},
lifetime=64-bit-integer,
name="table-name",
promote=true | false,
replace=true | false,
},
transformationPolicy={
cardinality=true | false,
entropy=true | false,
interaction=true | false,
iqv=true | false,
kurtosis=true | false,
missing=true | false,
outlier=true | false,
skewness=true | false
},
}
A parameter marked with "required parameter" or an asterisk (*) is required.
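The required parameters in the syntax above are table, target, featureOut, pipelineOut, and transformationOut. As a sketch, the Python snippet below builds a parameter dictionary that mirrors that structure and checks that none of the required entries is missing. The table names, the target name, and the helper function are hypothetical, and passing the dictionary to a CAS client is outside the scope of this sketch.

```python
# Parameters marked as required in the syntax listing.
REQUIRED_PARAMETERS = (
    "table", "target", "featureOut", "pipelineOut", "transformationOut",
)


def missing_required(params):
    """Return the required parameter names that are absent from params."""
    return [name for name in REQUIRED_PARAMETERS if name not in params]


# Hypothetical table and variable names, for illustration only.
params = {
    "table": {"name": "my_input_table"},
    "target": "my_target",
    "featureOut": {"name": "features_out", "replace": True},
    "pipelineOut": {"name": "pipelines_out", "replace": True},
    "transformationOut": {"name": "transformations_out", "replace": True},
    "kFolds": 5,
    "modelTypes": ["DECISIONTREE", "GRADBOOST"],
}

problems = missing_required(params)
```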

Summary: Input and Output Tables

If a row includes a subparameter, you can specify the name, caslib, and so on in the subparameter. Otherwise, you can specify the name, caslib, and so on in the parameter.

Parameters for Reading Input Tables

Parameter    Subparameter    Description
* table                      specifies the table name, caslib, and other common parameters.

Parameters for Creating Output Tables

Parameter              Subparameter    Description
* featureOut                           specifies the CAS table to store the feature transformation and generation pipelines.
* pipelineOut                          specifies the CAS table to store the analysis results.
saveState                              specifies the CAS table to store the analysis results.
* transformationOut                    specifies the CAS table to store the feature transformation and generation pipelines.

Parameter Descriptions

distinctCountLimit=integer

specifies the distinct count limit. If the limit is exceeded, and the misraGries parameter is set to True, the Misra-Gries frequency sketch algorithm is used to estimate the frequency distribution. Otherwise, the distinct count operation is aborted.

Alias maxNVals
Default 10000
Minimum value 256
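For readers unfamiliar with the Misra-Gries frequency sketch mentioned above, here is a minimal textbook version in Python. It is a generic illustration of the algorithm under the usual formulation with at most k - 1 counters, and it is not a description of how the action implements it.

```python
def misra_gries(stream, k):
    """Approximate item counts using at most k - 1 counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


stream = ["a"] * 5 + ["b"] * 3 + ["c", "d"]
summary = misra_gries(stream, k=3)
```

Each stored count underestimates the true frequency by at most n/k, where n is the stream length, so frequent levels survive while rare levels are dropped.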

ecdfTolerance=double

specifies the tolerance value for the empirical cumulative distribution function. This value is used by the quantile sketch algorithm.

Default 0.001
Range 1E-06–0.1

event="string"

specifies the target variable level that you want to model. Multilevel classification problems are cast into a one-versus-all binary classification problem, where the value of the event parameter denotes the level that you are modeling.
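The one-versus-all recoding can be sketched in a few lines of Python: every target value equal to the event level becomes 1 and every other level becomes 0. The function name and the sample labels are illustrative assumptions.

```python
def one_versus_all(target_values, event):
    """Recode a multilevel target as 1 for the event level and 0 otherwise."""
    return [1 if value == event else 0 for value in target_values]


labels = ["low", "high", "medium", "high"]  # hypothetical target levels
binary_target = one_versus_all(labels, event="high")
```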

explorationPolicy={avaptPolicy}

specifies the automatic variable analysis and grouping (AVAPT) policy.

Alias avaptPolicy

The avaptPolicy value can be one or more of the following:

cardinality={cardinalityAvaptPolicy}

specifies the automatic variable analysis and grouping (AVAPT) cardinality policy.

The cardinalityAvaptPolicy value can be one or more of the following:

lowMediumCutoff=double

specifies the cardinality threshold for the low-medium cutoff.

Default 32
Range 2–256
mediumHighCutoff=double

specifies the cardinality threshold for the medium-high cutoff.

Default 64
Range 2–1024
minNObsPerTargetLevel=double

specifies the minimum number of observations for each target level.

Default 10
Range 5–100
cv={cvAvaptPolicy}

specifies the automatic variable analysis and grouping (AVAPT) coefficient of variation policy.

Alias coefficientVariation

The cvAvaptPolicy value can be one or more of the following:

lowMoment=double

specifies the absolute value of the low-high percentage threshold for the moment coefficient of variation (CV).

Default 1
Minimum value 0
lowRobust=double

specifies the absolute value of the low-high percentage threshold for the robust coefficient of variation (CV).

Default 1
Minimum value 0
dateTimeVariables={"variable-name-1" <, "variable-name-2", ...>}

specifies the datetime variables.

Alias dateTime
dateVariables={"variable-name-1" <, "variable-name-2", ...>}

specifies the date variables.

Alias date
entropy={entropyAvaptPolicy}

specifies the automatic variable analysis and grouping (AVAPT) entropy policy.

The entropyAvaptPolicy value can be one or more of the following:

giniLowMediumCutoff=double

specifies the Gini entropy threshold for the low-medium cutoff.

Default 0.25
Range 0–1
giniMediumHighCutoff=double

specifies the Gini entropy threshold for the medium-high cutoff.

Default 0.75
Range 0–1
shannonLowMediumCutoff=double

specifies the Shannon entropy threshold for the low-medium cutoff.

Default 0.25
Range 0–1
shannonMediumHighCutoff=double

specifies the Shannon entropy threshold for the medium-high cutoff.

Default 0.75
Range 0–1
iqv={iqvAvaptPolicy}

specifies the automatic variable analysis and grouping (AVAPT) index of qualitative variation policy.

Alias qualitativeVariationIndex

The iqvAvaptPolicy value can be one or more of the following:

highTopBottom=double

specifies the low-high cutoff frequency ratio threshold between the most frequent and least frequent levels of a nominal variable.

Alias highTop1Bottom1
Default 100
Minimum value 1
highTopTwo=double

specifies the low-high cutoff frequency ratio threshold between the most frequent and second most frequent levels of a nominal variable.

Alias highTop1Top2
Default 10
Minimum value 1
highVariationRatio=double

specifies the variation ratio threshold for the low-high cutoff.

Alias highModVr
Default 0.5
Range (0–1]
kurtosis={kurtosisAvaptPolicy}

specifies the automatic variable analysis and grouping (AVAPT) kurtosis policy.

The kurtosisAvaptPolicy value can be one or more of the following:

momentLowMediumCutoff=double

specifies the absolute value of the moment kurtosis threshold for the low-medium cutoff.

Default 5
Minimum value 0
momentMediumHighCutoff=double

specifies the absolute value of the moment kurtosis threshold for the medium-high cutoff.

Default 10
Minimum value 0
robustLowMediumCutoff=double

specifies the absolute value of the robust kurtosis threshold for the low-medium cutoff.

Default 2
Minimum value 0
robustMediumHighCutoff=double

specifies the absolute value of the robust kurtosis threshold for the medium-high cutoff.

Default 3
Minimum value 0
missing={missingAvaptPolicy}

specifies the automatic variable analysis and grouping (AVAPT) missing grouping policy.

The missingAvaptPolicy value can be one or more of the following:

lowMediumCutoff=double

specifies the missing percentage threshold for the low-medium cutoff.

Default 5
Range 0–100
mediumHighCutoff=double

specifies the missing percentage threshold for the medium-high cutoff.

Default 25
Range 0–100
nominal={nominalAvaptPolicy}

specifies the automatic variable analysis and grouping (AVAPT) nominal policy.

The nominalAvaptPolicy value can be one or more of the following:

cardinalityRatio=double

specifies the AVAPT nominal policy cardinality ratio threshold.

Default 0.25
Range (0–1]
cardinalityThreshold=double

specifies the AVAPT nominal policy cardinality threshold.

Default 1024
Minimum value 32
includeNegative=true | false

when set to True, includes numeric variables with some negative values in the nominal analysis.

Default false
includeNonIntegral=true | false

when set to True, includes numeric variables with some nonintegral values in the nominal analysis.

Default false
intervals={"variable-name-1" <, "variable-name-2", ...>}

specifies variables to consider as intervals.

nominals={"variable-name-1" <, "variable-name-2", ...>}

specifies variables to consider as nominals.

outlier={outlierAvaptPolicy}

specifies the automatic variable analysis and grouping (AVAPT) outlier policy.

The outlierAvaptPolicy value can be one or more of the following:

momentLowMediumCutoff=double

specifies the z-score outlier percentage threshold for the low-medium cutoff.

Default 1
Range 0–100
momentMediumHighCutoff=double

specifies the z-score outlier percentage threshold for the medium-high cutoff.

Default 2.5
Range 0–100
robustLowMediumCutoff=double

specifies the modified interquartile range outlier percentage threshold for the low-medium cutoff.

Default 1
Range 0–100
robustMediumHighCutoff=double

specifies the modified interquartile range outlier percentage threshold for the medium-high cutoff.

Default 2.5
Range 0–100
skewness={skewnessAvaptPolicy}

specifies the automatic variable analysis and grouping (AVAPT) skewness policy.

The skewnessAvaptPolicy value can be one or more of the following:

momentLowMediumCutoff=double

specifies the moment skewness threshold for the low-medium cutoff.

Default 2
Range 0–100
momentMediumHighCutoff=double

specifies the moment skewness threshold for the medium-high cutoff.

Default 10
Range 0–100
robustLowMediumCutoff=double

specifies the robust skewness threshold for the low-medium cutoff.

Default 0.75
Range 0–3
robustMediumHighCutoff=double

specifies the robust skewness threshold for the medium-high cutoff.

Default 2
Range 0–3
timeVariables={"variable-name-1" <, "variable-name-2", ...>}

specifies the time variables.

Alias time
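Several of the policies above use a pair of cutoffs (low-medium and medium-high) to place a statistic into one of three groups. The Python sketch below shows that bucketing with default cutoffs taken from the missing and cardinality policies. Treating a value that equals a cutoff as belonging to the higher group is an assumption, because the boundary handling is not specified here.

```python
def group_level(value, low_medium_cutoff, medium_high_cutoff):
    """Place a statistic into the low, medium, or high group."""
    if low_medium_cutoff > medium_high_cutoff:
        raise ValueError("low-medium cutoff must not exceed medium-high cutoff")
    # Assumption: a value equal to a cutoff falls into the higher group.
    if value < low_medium_cutoff:
        return "low"
    if value < medium_high_cutoff:
        return "medium"
    return "high"


# Documented defaults: missing policy 5 and 25, cardinality policy 32 and 64.
missing_group = group_level(12.0, 5, 25)      # 12% missing values
cardinality_group = group_level(100, 32, 64)  # 100 distinct levels
```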

* featureOut={casouttable}

specifies the CAS table to store the feature transformation and generation pipelines.

Alias featuresOut
Long form featureOut={name="table-name"}
Shortcut form featureOut="table-name"

The casouttable value can be one or more of the following:

caslib="string"

specifies the name of the caslib for the output table.

indexVars={"variable-name-1" <, "variable-name-2", ...>}

specifies the list of variables to create indexes for in the output data.

lifetime=64-bit-integer

specifies the number of seconds to keep the table in memory after it is last accessed. The table is dropped if it is not accessed for the specified number of seconds.

Default 0
Minimum value 0
memoryFormat="DVR" | "INHERIT" | "STANDARD"

specifies the memory format for the output table.

Default INHERIT
DVR

use the duplicate value reduction memory format. This memory format can reduce the memory consumption and file size when the input data contains duplicate values.

INHERIT

use the default memory format that is set for the server. By default, the server uses the standard memory format. If an administrator sets the CAS_DEFAULT_MEMORY_FORMAT environment variable to DVR, then the DVR memory format is set as the default for the server.

STANDARD

use the standard memory format.

name="table-name"

specifies the name for the output table.

promote=true | false

when set to True, adds the output table with a global scope. This enables other sessions to access the table, subject to access controls. The target caslib must also have a global scope.

Default false
replace=true | false

when set to True, overwrites an existing table that has the same name.

Default false
tableRedistUpPolicy="DEFER" | "NOREDIST" | "REBALANCE"

specifies the table redistribution policy to apply when the number of worker pods increases on a running CAS server.

DEFER

defers the choice of redistribution policy to a higher-level entity.

NOREDIST

does not redistribute table data when the number of worker pods changes on a running CAS server.

REBALANCE

rebalances table data when the number of worker pods changes on a running CAS server.
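The shortcut form and the long form of featureOut describe the same output table. As a sketch, the Python helper below expands a shortcut string into the long form and fills in the defaults documented above for lifetime, memoryFormat, promote, and replace. The helper name and the table name are hypothetical.

```python
# Defaults copied from the casouttable descriptions above.
CASOUT_DEFAULTS = {
    "lifetime": 0,
    "memoryFormat": "INHERIT",
    "promote": False,
    "replace": False,
}


def normalize_casout(value):
    """Expand a shortcut table name into the long form with defaults."""
    if isinstance(value, str):
        value = {"name": value}  # shortcut form: featureOut="table-name"
    return {**CASOUT_DEFAULTS, **value}


from_shortcut = normalize_casout("features_out")
from_long_form = normalize_casout({"name": "features_out", "replace": True})
```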

hyperParameterOptimizer="MODELCOMPOSER" | "TUNEALL"

specifies the method to use for hyperparameter optimization.

Alias hpOptimizer
Default TUNEALL

inputs={{casinvardesc-1} <, {casinvardesc-2}, ...>}

specifies the variables to use for the analysis. You can specify a subset of the variables from the input table.

For more information about specifying the inputs parameter, see the common casinvardesc parameter (Appendix A: Common Parameters).

Alias vars

kFolds=integer

specifies the number of folds for cross validation.

Default 5
Range 2–10
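As a generic illustration of k-fold cross validation, the Python sketch below shuffles row indices and assigns each row to one of kFolds folds so that fold sizes differ by at most one. The shuffling scheme and the use of a Python random seed are assumptions of this sketch, not a description of how the action assigns folds.

```python
import random


def assign_folds(n_rows, k_folds=5, seed=0):
    """Return a fold number (0 to k_folds - 1) for each row."""
    if not 2 <= k_folds <= 10:
        raise ValueError("kFolds must be in the documented range 2-10")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # reproducible for a fixed seed
    folds = [0] * n_rows
    for position, row in enumerate(indices):
        folds[row] = position % k_folds
    return folds


folds = assign_folds(23, k_folds=5, seed=42)
fold_sizes = [folds.count(fold) for fold in range(5)]
```

Each fold serves once as the holdout set while the remaining folds are used for training.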

logLevel=integer

specifies the logging level.

Default 0
Range 0–3

misraGries=true | false

when set to True, uses the Misra-Gries algorithm for the frequency distribution estimation, if the distinct count limit is exceeded.

Default true

modelTypes={"DECISIONTREE", "FOREST", "GLM", "GRADBOOST", "LOGISTIC", "NEURALNET"}

specifies the types of machine learning algorithms to include in the pipeline exploration.

DECISIONTREE

specifies the decision tree model.

FOREST

specifies the random forest model.

GLM

specifies the generalized linear model.

GRADBOOST

specifies the gradient boosting model.

LOGISTIC

specifies the logistic regression model.

NEURALNET

specifies the neural network model.

objective="ASE" | "AUC" | "F1" | "MAE" | "MCE" | "MSLE" | "RASE" | "RMAE" | "RMSLE"

specifies the model performance metric to use.

ASE

uses the average square error.

AUC

uses the area under the receiver operating characteristic curve.

F1

uses the F1 coefficient.

MAE

uses the mean absolute error.

MCE

uses the misclassification error.

Alias MCR
MSLE

uses the mean square logarithmic error.

RASE

uses the root average square error.

RMAE

uses the root mean absolute error.

RMSLE

uses the root mean square logarithmic error.
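For orientation, a few of these metrics are shown below in Python using their common textbook definitions, with the root variants computed as the square root of the base metric as their names suggest. The exact formulas that the action applies (for example, how predictions are obtained for classification targets) are not given in this section, so treat this as a sketch; the sample values are made up.

```python
import math


def ase(actual, predicted):
    """Average square error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rase(actual, predicted):
    """Root average square error."""
    return math.sqrt(ase(actual, predicted))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mce(actual_labels, predicted_labels):
    """Misclassification error: the share of labels predicted incorrectly."""
    wrong = sum(a != p for a, p in zip(actual_labels, predicted_labels))
    return wrong / len(actual_labels)


actual = [1.0, 0.0, 1.0, 1.0]
predicted = [0.5, 0.5, 1.0, 0.0]
ase_value = ase(actual, predicted)
mae_value = mae(actual, predicted)
mce_value = mce([1, 0, 1, 1], [1, 0, 1, 0])
```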

* pipelineOut={casouttable}

specifies the CAS table to store the analysis results.

Alias pipelinesOut
Long form pipelineOut={name="table-name"}
Shortcut form pipelineOut="table-name"

The casouttable value can be one or more of the following:

caslib="string"

specifies the name of the caslib for the output table.

indexVars={"variable-name-1" <, "variable-name-2", ...>}

specifies the list of variables to create indexes for in the output data.

lifetime=64-bit-integer

specifies the number of seconds to keep the table in memory after it is last accessed. The table is dropped if it is not accessed for the specified number of seconds.

Default 0
Minimum value 0
memoryFormat="DVR" | "INHERIT" | "STANDARD"

specifies the memory format for the output table.

Default INHERIT
DVR

use the duplicate value reduction memory format. This memory format can reduce the memory consumption and file size when the input data contains duplicate values.

INHERIT

use the default memory format that is set for the server. By default, the server uses the standard memory format. If an administrator sets the CAS_DEFAULT_MEMORY_FORMAT environment variable to DVR, then the DVR memory format is set as the default for the server.

STANDARD

use the standard memory format.

name="table-name"

specifies the name for the output table.

promote=true | false

when set to True, adds the output table with a global scope. This enables other sessions to access the table, subject to access controls. The target caslib must also have a global scope.

Default false
replace=true | false

when set to True, overwrites an existing table that has the same name.

Default false
tableRedistUpPolicy="DEFER" | "NOREDIST" | "REBALANCE"

specifies the table redistribution policy to apply when the number of worker pods increases on a running CAS server.

DEFER

defers the choice of redistribution policy to a higher-level entity.

NOREDIST

does not redistribute table data when the number of worker pods changes on a running CAS server.

REBALANCE

rebalances table data when the number of worker pods changes on a running CAS server.

sampleSize=integer

specifies the maximum number of pipelines to sample.

Default 10
Minimum value 1

saveState={casouttable} | {savePipelinesOptions}

specifies the CAS table to store the analysis results.

Alias saveModel

The casouttable value can be one or more of the following:

caslib="string"

specifies the name of the caslib for the output table.

compress=true | false

when set to True, applies data compression to the table.

Default false
indexVars={"variable-name-1" <, "variable-name-2", ...>}

specifies the list of variables to create indexes for in the output data.

label="string"

specifies the descriptive label to associate with the table.

lifetime=64-bit-integer

specifies the number of seconds to keep the table in memory after it is last accessed. The table is dropped if it is not accessed for the specified number of seconds.

Default 0
Minimum value 0
maxMemSize=64-bit-integer

specifies the maximum amount of memory, in bytes, that each thread should allocate for in-memory blocks before converting to a memory-mapped file. Files are written in the directories that are specified in the CAS_DISK_CACHE environment variable.

TIP You can enclose the value in quotation marks and specify B, K, M, G, or T as a suffix to indicate the units. For example, "8M" specifies eight megabytes.
memoryFormat="DVR" | "INHERIT" | "STANDARD"

specifies the memory format for the output table.

Default INHERIT
DVR

use the duplicate value reduction memory format. This memory format can reduce the memory consumption and file size when the input data contains duplicate values.

INHERIT

use the default memory format that is set for the server. By default, the server uses the standard memory format. If an administrator sets the CAS_DEFAULT_MEMORY_FORMAT environment variable to DVR, then the DVR memory format is set as the default for the server.

STANDARD

use the standard memory format.

name="table-name"

specifies the name for the output table.

promote=true | false

when set to True, adds the output table with a global scope. This enables other sessions to access the table, subject to access controls. The target caslib must also have a global scope.

Default false
replace=true | false

when set to True, overwrites an existing table that has the same name.

Default false
replication=integer

specifies the number of copies of the table to make for fault tolerance. Larger values result in slower performance and use more memory, but provide high availability for data in the event of a node failure. Data redundancy applies to distributed servers only.

Default 1
Minimum value 0
tableRedistUpPolicy="DEFER" | "NOREDIST" | "REBALANCE"

specifies the table redistribution policy to apply when the number of worker pods increases on a running CAS server.

DEFER

defers the choice of redistribution policy to a higher-level entity.

NOREDIST

does not redistribute table data when the number of worker pods changes on a running CAS server.

REBALANCE

rebalances table data when the number of worker pods changes on a running CAS server.

threadBlockSize=64-bit-integer

specifies the number of bytes to use for blocks in the output table. The blocks are read by threads. Gradually increase this value when you have a large table with millions or billions of rows and you are tuning for performance. Larger values can increase performance with indexed tables. However, if the value is too large, then you can cause thread starvation due to too few blocks for threads to work on.

Alias blockSize
Default 1048576
Minimum value 0
TIP You can enclose the value in quotation marks and specify B, K, M, G, or T as a suffix to indicate the units. For example, "8M" specifies eight megabytes.
timeStamp="string"

specifies to add a timestamp column to the table. Support for timeStamp is action-specific. Specify the value in the form that is appropriate for your session locale.

where={"string-1" <, "string-2", ...>}

specifies one or more expressions for subsetting the output data. When multiple expressions are specified, the expressions are effectively combined using AND to form the final output filter. If an expression contains quoted values, use nested quotation marks.

The savePipelinesOptions value can be one or more of the following:

modelNamePrefix="string"

specifies the prefix to use for the names of the saved models.

replace=true | false

when set to True, overwrites existing models that have the same name.

Default true
topK=integer

specifies the number of best-performing models to save.

Default 5
Minimum value 1

screenPolicy={sweeperPolicy}

specifies the variable screening policy to use for recommending that variables be screened out, transformed, or copied.

Alias sweeperPolicy

The sweeperPolicy value can be one or more of the following:

constant=true | false

when set to True, uses the variable screening policy to identify variables that have constant values.

Alias unique
Default true
groupRareLevels=true | false

when set to True, uses the variable screening policy to identify nominal variables that have rare levels.

Alias groupRare
Default true
leakagePercentThreshold=double

specifies the variable screening policy for variables that have a very high level of information about the target. Variables that have a greater target entropy percentage reduction than the specified threshold are flagged as leakage variables.

Alias leakagePercentageThreshold
Default 90
Range (0–100]
lowCv=true | false

when set to True, uses the variable screening policy to identify variables that have a low coefficient of variation (CV).

Alias lowCoefficientVariation
Default true
lowMutualInformation=double

specifies the variable screening policy for variables that have a low level of information about the target.

Alias lowInformation
Default 0.05
Minimum value 0
missingIndicatorPercent=double

specifies the variable screening policy for generating missing indicator variables.

Alias missingIndicatorPercentage
Default 75
Range [10–100)
missingPercentThreshold=double

specifies the variable screening policy for identifying variables that have a very high missing rate.

Alias missingPercentageThreshold
Default 90
Range [10–100)
redundant=double

specifies the symmetric uncertainty (SU) threshold for identifying redundant variables. If the SU for two variables exceeds the threshold, the variable that has less information about the target is flagged as redundant.

Default 1
Range (0–1]
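The symmetric uncertainty used by the redundant option is commonly defined as SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)), where H is the Shannon entropy and I is the mutual information. The Python sketch below computes it for two discrete variables under that standard definition. Whether the action uses exactly this estimator is an assumption, and the sample data is made up.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy, in bits, of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value between 0 and 1."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0:
        return 0.0  # both variables constant; convention chosen for this sketch
    mutual_information = h_x + h_y - entropy(list(zip(x, y)))
    return 2 * mutual_information / (h_x + h_y)


x = [0, 0, 1, 1]
su_identical = symmetric_uncertainty(x, x)               # fully redundant
su_independent = symmetric_uncertainty(x, [0, 1, 0, 1])  # no shared information
```

Because SU never exceeds 1, a threshold of 1 flags no pair under a strict greater-than reading of "exceeds", so lower thresholds screen more aggressively.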

seed=integer

specifies a seed value for random number generation. This value is used for repeatable random number generation in some scenarios.

Default 0

selectionPolicy={featureSelectOptions}

specifies the feature selection policy.

Long form selectionPolicy={criterion="CHISQ" | "CRAMERSV" | "ENTROPY" | "FTEST" | "G2" | "IV" | "MI" | "NORMMI" | "PEARSON" | "SU"}
Shortcut form selectionPolicy="CHISQ" | "CRAMERSV" | "ENTROPY" | "FTEST" | "G2" | "IV" | "MI" | "NORMMI" | "PEARSON" | "SU"

The featureSelectOptions value can be one or more of the following:

criterion="CHISQ" | "CRAMERSV" | "ENTROPY" | "FTEST" | "G2" | "IV" | "MI" | "NORMMI" | "PEARSON" | "SU"

specifies the filter feature selection criterion to use.

Alias stat
Default MI
CHISQ

uses the chi-square statistic.

CRAMERSV

uses Cramer's V.

ENTROPY

uses the entropy percentage decrease.

FTEST

uses the F test.

G2

uses the G2 statistic.

IV

uses the information value statistic.

MI

uses the mutual information statistic.

NORMMI

uses the normalized mutual information statistic.

PEARSON

uses the Pearson correlation.

SU

uses the symmetric uncertainty statistic.

topK=integer

specifies the number of features to select, taking those that have the highest values of the filter selection criterion. If both topK and topKPercent are specified, then topKPercent is used.

Default 50
Minimum value 1
topKPercent=double

specifies the percentage of features to select, taking those that have the highest values of the filter selection criterion. If both topK and topKPercent are specified, then topKPercent is used.

Alias topKPercentage
Range (0–100]
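The precedence between topK and topKPercent can be sketched in Python as follows: features are ranked by their criterion value, and when topKPercent is given it determines the count even if topK is also set. The scores, the feature names, and the choice to round the percentage up are assumptions made for illustration.

```python
import math


def select_features(scores, top_k=50, top_k_percent=None):
    """Return the feature names with the highest criterion values."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    if top_k_percent is not None:
        if not 0 < top_k_percent <= 100:
            raise ValueError("topKPercent must be in the range (0, 100]")
        # topKPercent takes precedence over topK; rounding up is an assumption.
        count = math.ceil(len(ranked) * top_k_percent / 100)
    else:
        count = top_k
    return ranked[:count]


# Hypothetical criterion values (for example, mutual information).
scores = {"age": 0.40, "income": 0.35, "zip": 0.05, "tenure": 0.20}
by_count = select_features(scores, top_k=3)
by_percent = select_features(scores, top_k=3, top_k_percent=50)
```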

* table={castable}

specifies the table name, caslib, and other common parameters.

Long form table={name="table-name"}
Shortcut form table="table-name"

The castable value can be one or more of the following:

caslib="string"

specifies the caslib for the input table that you want to use with the action. By default, the active caslib is used. Specify a value only if you need to access a table from a different caslib.

computedOnDemand=true | false

when set to True, creates the computed variables when the table is loaded instead of when the action begins.

Alias compOnDemand
Default false
computedVars={{casinvardesc-1} <, {casinvardesc-2}, ...>}

specifies the names of the computed variables to create. Specify an expression for each variable in the computedVarsProgram parameter. If you do not specify this parameter, then all variables from computedVarsProgram are automatically included.

Alias compVars

The casinvardesc value can be one or more of the following:

format="string"

specifies the format to apply to the variable.

formattedLength=integer

specifies the length of the format field plus the length of the format precision.

label="string"

specifies the descriptive label for the variable.

* name="variable-name"

specifies the name for the variable.

nfd=integer

specifies the length of the format precision.

nfl=integer

specifies the length of the format field.

computedVarsProgram="string"

specifies an expression for each computed variable that you include in the computedVars parameter.

Alias compPgm
dataSourceOptions={key-1=any-list-or-data-type-1 <, key-2=any-list-or-data-type-2, ...>}

specifies data source options.

Aliases options, dataSource
importOptions={fileType="ANY" | "AUDIO" | "AUTO" | "BASESAS" | "CSV" | "DOCUMENT" | "DTA" | "ESP" | "EXCEL" | "FMT" | "HDAT" | "IMAGE" | "JMP" | "LASR" | "PARQUET" | "SPSS" | "VIDEO" | "XLS", fileType-specific-parameters}

specifies the settings for reading a table from a data source.

Alias import

For more information about specifying the importOptions parameter, see the common importOptions parameter (Appendix A: Common Parameters).

* name="table-name"

specifies the name of the input table.

singlePass=true | false

when set to True, does not create a transient table on the server. Setting this parameter to True can be efficient, but the data might not have stable ordering upon repeated runs.

Default false
vars={{casinvardesc-1} <, {casinvardesc-2}, ...>}

specifies the variables to use in the action.

The casinvardesc value can be one or more of the following:

format="string"

specifies the format to apply to the variable.

formattedLength=integer

specifies the length of the format field plus the length of the format precision.

label="string"

specifies the descriptive label for the variable.

* name="variable-name"

specifies the name for the variable.

nfd=integer

specifies the length of the format precision.

nfl=integer

specifies the length of the format field.

where="where-expression"

specifies an expression for subsetting the input data.

whereTable={groupbytable}

specifies an input table that contains rows to use as a WHERE filter. If the vars parameter is not specified, then all the variable names that are common to the input table and the filtering table are used to find matching rows. If the where parameter for the input table and this parameter are specified, then this filtering table is applied first.

The groupbytable value can be one or more of the following:

casLib="string"

specifies the caslib for the filter table. By default, the active caslib is used.

dataSourceOptions={adls_noreq-parameters | bigquery-parameters | cas_noreq-parameters | clouddex-parameters | db2-parameters | dnfs-parameters | esp-parameters | fedsvr-parameters | gcs_noreq-parameters | hadoop-parameters | hana-parameters | impala-parameters | informix-parameters | jdbc-parameters | mongodb-parameters | mysql-parameters | odbc-parameters | oracle-parameters | path-parameters | postgres-parameters | redshift-parameters | s3-parameters | sapiq-parameters | sforce-parameters | singlestore_standard-parameters | snowflake-parameters | spark-parameters | spde-parameters | sqlserver-parameters | ss_noreq-parameters | teradata-parameters | vertica-parameters | yellowbrick-parameters}

specifies data source options.

Aliases options, dataSource

For more information about specifying the dataSourceOptions parameter, see the common dataSourceOptions parameter (Appendix A: Common Parameters).

importOptions={fileType="ANY" | "AUDIO" | "AUTO" | "BASESAS" | "CSV" | "DOCUMENT" | "DTA" | "ESP" | "EXCEL" | "FMT" | "HDAT" | "IMAGE" | "JMP" | "LASR" | "PARQUET" | "SPSS" | "VIDEO" | "XLS", fileType-specific-parameters}

specifies the settings for reading a table from a data source.

Alias import

For more information about specifying the importOptions parameter, see the common importOptions parameter (Appendix A: Common Parameters).

* name="table-name"

specifies the name of the filter table.

vars={{casinvardesc-1} <, {casinvardesc-2}, ...>}

specifies the variable names to use from the filter table.

The casinvardesc value can be one or more of the following:

format="string"

specifies the format to apply to the variable.

formattedLength=integer

specifies the length of the format field plus the length of the format precision.

label="string"

specifies the descriptive label for the variable.

* name="variable-name"

specifies the name for the variable.

nfd=integer

specifies the length of the format precision.

nfl=integer

specifies the length of the format field.

where="where-expression"

specifies an expression for subsetting the data from the filter table.
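
As an illustration of how computedVars, computedVarsProgram, and where fit together, the following Python sketch builds a table value as a dictionary. The helper input_table, the table name LOANS, the caslib PUBLIC, and the column names are all hypothetical. The assignment form "name = expression;" used for the program string is an assumption about the expression syntax.

```python
def input_table(name, caslib=None, where=None, computed=None):
    """Build a table parameter value (hypothetical helper).

    computed maps each new variable name to the expression that defines it.
    """
    spec = {"name": name}
    if caslib is not None:
        spec["caslib"] = caslib
    if where is not None:
        spec["where"] = where
    if computed:
        # One casinvardesc entry per computed variable; only name is required.
        spec["computedVars"] = [{"name": var} for var in computed]
        # Assumed syntax: one "name = expression;" statement per variable.
        spec["computedVarsProgram"] = " ".join(
            f"{var} = {expr};" for var, expr in computed.items()
        )
    return spec


table = input_table(
    "LOANS",
    caslib="PUBLIC",
    where="income > 0",
    computed={"debt_ratio": "debt / income"},
)
print(table)
```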

* target="variable-name"

specifies the target variable.

Alias evalVar

topKPipelines=integer

specifies the number of best-performing pipelines to save.

Default 10
Minimum value 1

* transformationOut={casouttable}

specifies the CAS table to store the feature transformation and generation pipelines.

Alias transformationsOut
Long form transformationOut={name="table-name"}
Shortcut form transformationOut="table-name"

The casouttable value can be one or more of the following:

caslib="string"

specifies the name of the caslib for the output table.

indexVars={"variable-name-1" <, "variable-name-2", ...>}

specifies the list of variables to create indexes for in the output data.

lifetime=64-bit-integer

specifies the number of seconds to keep the table in memory after it is last accessed. The table is dropped if it is not accessed for the specified number of seconds.

Default 0
Minimum value 0
memoryFormat="DVR" | "INHERIT" | "STANDARD"

specifies the memory format for the output table.

Default INHERIT
DVR

use the duplicate value reduction memory format. This memory format can reduce the memory consumption and file size when the input data contains duplicate values.

INHERIT

use the default memory format that is set for the server. By default, the server uses the standard memory format. If an administrator sets the CAS_DEFAULT_MEMORY_FORMAT environment variable to DVR, then the DVR memory format is set as the default for the server.

STANDARD

use the standard memory format.

name="table-name"

specifies the name for the output table.

promote=true | false

when set to True, adds the output table with a global scope. This enables other sessions to access the table, subject to access controls. The target caslib must also have a global scope.

Default false
replace=true | false

when set to True, overwrites an existing table that has the same name.

Default false
tableRedistUpPolicy="DEFER" | "NOREDIST" | "REBALANCE"

specifies the table redistribution policy to apply when the number of worker pods increases on a running CAS server.

DEFER

defers the choice of redistribution policy to a higher-level entity.

NOREDIST

does not redistribute table data when the number of worker pods changes on a running CAS server.

REBALANCE

rebalances table data when the number of worker pods changes on a running CAS server.

transformationPolicy={transformationSpace}

specifies the feature transformation and generation space in which the feature machine operates.

Alias transformationSpace

The transformationSpace value can be one or more of the following:

cardinality=true | false

when set to True, includes cardinality-reducing transformations.

Default true
entropy=true | false

when set to True, includes transformations for the treatment of low entropy.

Default false
interaction=true | false

when set to True, detects and generates interaction features.

Default false
iqv=true | false

when set to True, includes transformations for the treatment of low indices of qualitative variation (IQV).

Default false
kurtosis=true | false

when set to True, includes transformations for the treatment of high kurtosis.

Default false
missing=true | false

when set to True, includes transformations for the treatment of missing values.

Default true
outlier=true | false

when set to True, includes transformations for the treatment of outliers.

Default false
skewness=true | false

when set to True, includes transformations for the treatment of high skewness.

Default true
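
The defaults listed above can be captured in a small Python sketch that merges user overrides into them and rejects unknown keys. The name transformation_policy is a hypothetical helper, not part of the action set.

```python
# Defaults as documented for each transformationPolicy subparameter.
DEFAULT_TRANSFORMATION_POLICY = {
    "cardinality": True,
    "entropy": False,
    "interaction": False,
    "iqv": False,
    "kurtosis": False,
    "missing": True,
    "outlier": False,
    "skewness": True,
}


def transformation_policy(**overrides):
    """Return the documented defaults with the given overrides applied."""
    unknown = set(overrides) - set(DEFAULT_TRANSFORMATION_POLICY)
    if unknown:
        raise ValueError(f"unknown transformationPolicy keys: {sorted(unknown)}")
    return {**DEFAULT_TRANSFORMATION_POLICY, **overrides}


policy = transformation_policy(outlier=True, interaction=True)
enabled = sorted(key for key, value in policy.items() if value)
print(enabled)
```

Only the subparameters that differ from their defaults need to be passed to the action. The merged dictionary is useful mainly for recording which transformations a run enabled.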

validationPartitionFraction=double

specifies the fraction of the input data to use for validation, expressed as a proportion between 0 and 1 rather than as a percentage.

Default 0.3
Range 0.01–0.99
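
A rough Python sketch of what the value implies for row counts follows. The function partition_sizes is hypothetical. The server performs the actual partitioning, possibly with random or stratified assignment, so the counts here are only approximate expectations.

```python
def partition_sizes(n_rows, validation_fraction=0.3):
    """Estimate (training rows, validation rows) for a given fraction."""
    if not 0.01 <= validation_fraction <= 0.99:
        raise ValueError("validationPartitionFraction must be in 0.01-0.99")
    n_valid = round(n_rows * validation_fraction)
    return n_rows - n_valid, n_valid


train_rows, valid_rows = partition_sizes(1000)
print(train_rows, valid_rows)
```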

dsAutoMl Action

Automated machine learning pipeline exploration, execution and ranking.

Python Syntax

results=s.dataSciencePilot.dsAutoMl(
ecdfTolerance=double,
event="string",
explorationPolicy={
"cv":{
"lowMoment":double,
"lowRobust":double
},
"dateTimeVariables":["variable-name-1" <, "variable-name-2", ...>],
"dateVariables":["variable-name-1" <, "variable-name-2", ...>],
"iqv":{
"highTopBottom":double,
"highTopTwo":double
},
"missing":{
"lowMediumCutoff":double
},
"nominal":{
"includeNegative":True | False,
"includeNonIntegral":True | False,
"intervals":["variable-name-1" <, "variable-name-2", ...>],
"nominals":["variable-name-1" <, "variable-name-2", ...>]
},
"timeVariables":["variable-name-1" <, "variable-name-2", ...>]
},
required parameter featureOut={
"caslib":"string",
"indexVars":["variable-name-1" <, "variable-name-2", ...>],
"lifetime":64-bit-integer,
"name":"table-name",
"promote":True | False,
"replace":True | False,
},
hyperParameterOptimizer="MODELCOMPOSER" | "TUNEALL",
inputs=[{
"format":"string",
"formattedLength":integer,
"label":"string",
required parameter "name":"variable-name",
"nfd":integer,
"nfl":integer
}<, {...}>],
kFolds=integer,
logLevel=integer,
misraGries=True | False,
modelTypes=["DECISIONTREE", "FOREST", "GLM", "GRADBOOST", "LOGISTIC", "NEURALNET"],
required parameter pipelineOut={
"caslib":"string",
"indexVars":["variable-name-1" <, "variable-name-2", ...>],
"lifetime":64-bit-integer,
"name":"table-name",
"promote":True | False,
"replace":True | False,
},
sampleSize=integer,
saveState={casouttable} | {savePipelinesOptions},
screenPolicy={
"constant":True | False,
"groupRareLevels":True | False,
"lowCv":True | False,
"redundant":double
},
seed=integer,
required parameter table={
"caslib":"string",
"computedOnDemand":True | False,
"computedVars":[{
"format":"string",
"formattedLength":integer,
"label":"string",
required parameter "name":"variable-name",
"nfd":integer,
"nfl":integer
}<, {...}>],
"computedVarsProgram":"string",
"dataSourceOptions":{"key-1":{any-list-or-data-type-1} <, "key-2":{any-list-or-data-type-2}, ...>},
"importOptions":{"fileType":"ANY" | "AUDIO" | "AUTO" | "BASESAS" | "CSV" | "DELIMITED" | "DOCUMENT" | "DTA" | "ESP" | "EXCEL" | "FMT" | "HDAT" | "IMAGE" | "JMP" | "LASR" | "PARQUET" | "SOUND" | "SPSS" | "VIDEO" | "XLS", fileType-specific-parameters},
required parameter "name":"table-name",
"singlePass":True | False,
"vars":[{
"format":"string",
"formattedLength":integer,
"label":"string",
required parameter "name":"variable-name",
"nfd":integer,
"nfl":integer
}<, {...}>],
"where":"where-expression",
"whereTable":{
"casLib":"string",
"dataSourceOptions":{adls_noreq-parameters | bigquery-parameters | cas_noreq-parameters | clouddex-parameters | db2-parameters | dnfs-parameters | esp-parameters | fedsvr-parameters | gcs_noreq-parameters | hadoop-parameters | hana-parameters | impala-parameters | informix-parameters | jdbc-parameters | mongodb-parameters | mysql-parameters | odbc-parameters | oracle-parameters | path-parameters | postgres-parameters | redshift-parameters | s3-parameters | sapiq-parameters | sforce-parameters | singlestore_standard-parameters | snowflake-parameters | spark-parameters | spde-parameters | sqlserver-parameters | ss_noreq-parameters | teradata-parameters | vertica-parameters | yellowbrick-parameters},
"importOptions":{"fileType":"ANY" | "AUDIO" | "AUTO" | "BASESAS" | "CSV" | "DELIMITED" | "DOCUMENT" | "DTA" | "ESP" | "EXCEL" | "FMT" | "HDAT" | "IMAGE" | "JMP" | "LASR" | "PARQUET" | "SOUND" | "SPSS" | "VIDEO" | "XLS", fileType-specific-parameters},
required parameter "name":"table-name",
"vars":[{
"format":"string",
"formattedLength":integer,
"label":"string",
required parameter "name":"variable-name",
"nfd":integer,
"nfl":integer
}<, {...}>],
"where":"where-expression"
}
},
required parameter target="variable-name",
topKPipelines=integer,
required parameter transformationOut={
"caslib":"string",
"indexVars":["variable-name-1" <, "variable-name-2", ...>],
"lifetime":64-bit-integer,
"name":"table-name",
"promote":True | False,
"replace":True | False,
},
transformationPolicy={
"cardinality":True | False,
"entropy":True | False,
"interaction":True | False,
"iqv":True | False,
"kurtosis":True | False,
"missing":True | False,
"outlier":True | False,
"skewness":True | False
},
)
The "required parameter" prefix indicates a required parameter.
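
The syntax above marks five parameters as required. The following Python sketch assembles a minimal argument set and checks that none of the required names is missing. The helper ds_automl_params and all table and column names are hypothetical. The commented call at the end assumes a live CAS session object named s with the dataSciencePilot action set already loaded.

```python
# Parameters marked as required in the syntax listing.
REQUIRED = ("table", "target", "featureOut", "pipelineOut", "transformationOut")


def ds_automl_params(**params):
    """Return params unchanged after checking the required names."""
    missing = [name for name in REQUIRED if name not in params]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")
    return params


params = ds_automl_params(
    table={"name": "LOANS"},                      # hypothetical input table
    target="default_flag",                        # hypothetical target column
    featureOut={"name": "FEATURES", "replace": True},
    pipelineOut={"name": "PIPELINES", "replace": True},
    transformationOut={"name": "TRANSFORMS", "replace": True},
    modelTypes=["DECISIONTREE", "GRADBOOST"],
    kFolds=5,
)

# With a live session `s` (an assumption), the call might look like this:
# results = s.dataSciencePilot.dsAutoMl(**params)
print(sorted(params))
```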

Summary: Input and Output Tables

If a row includes a subparameter, you can specify the name, caslib, and so on in the subparameter. Otherwise, you can specify the name, caslib, and so on in the parameter.

Parameters for Reading Input Tables

Parameter | Subparameter | Description
* table | (none) | specifies the table name, caslib, and other common parameters.

Parameters for Creating Output Tables

Parameter | Subparameter | Description
* featureOut | (none) | specifies the CAS table to store the feature transformation and generation pipelines.
* pipelineOut | (none) | specifies the CAS table to store the analysis results.
saveState | (none) | specifies the CAS table to store the analysis results.
* transformationOut | (none) | specifies the CAS table to store the feature transformation and generation pipelines.

Parameter Descriptions

distinctCountLimit=integer

specifies the distinct count limit. If the limit is exceeded, and the misraGries parameter is set to True, the Misra-Gries frequency sketch algorithm is used to estimate the frequency distribution. Otherwise, the distinct count operation is aborted.

Alias maxNVals
Default 10000
Minimum value 256
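
For readers unfamiliar with the Misra-Gries frequency sketch, the following Python sketch shows the textbook form of the algorithm. It is an illustration only and is not the server implementation. The counter budget k is a parameter of this sketch, and its relationship to distinctCountLimit is not specified here.

```python
def misra_gries(stream, k):
    """Textbook Misra-Gries summary that keeps at most k - 1 counters.

    Each reported count undercounts the true frequency by at most n / k,
    where n is the number of items in the stream.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop the zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


stream = ["a"] * 6 + ["b"] * 3 + ["c", "d", "e"]
summary = misra_gries(stream, k=3)
print(summary)
```

In this run the level "a" occurs 6 times out of 12 and is reported with a count of 3, which is within the guaranteed error of n / k = 4. Rare levels are dropped from the summary.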

ecdfTolerance=double

specifies the tolerance value for the empirical cumulative distribution function. This value is used by the quantile sketch algorithm.

Default 0.001
Range 1E-06–0.1
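
One common way to read a tolerance on the empirical cumulative distribution function (ECDF) is as a bound on rank error: a value is an acceptable estimate of the q-th quantile if its ECDF value is within the tolerance of q. The Python sketch below illustrates that reading. It is an assumption about the interpretation and does not reproduce the server's quantile sketch algorithm. The function names are hypothetical.

```python
from bisect import bisect_right


def ecdf(sorted_values, x):
    """Fraction of observations that are less than or equal to x."""
    return bisect_right(sorted_values, x) / len(sorted_values)


def within_tolerance(sorted_values, candidate, q, tolerance=0.001):
    """True if the candidate's ECDF value is within tolerance of q."""
    return abs(ecdf(sorted_values, candidate) - q) <= tolerance


data = list(range(1, 1001))  # the values 1 through 1000, already sorted
print(ecdf(data, 250))
print(within_tolerance(data, 510, 0.5))
print(within_tolerance(data, 510, 0.5, tolerance=0.05))
```

Under this reading, a smaller tolerance demands a more precise sketch and a larger tolerance accepts coarser quantile estimates.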

event="string"

specifies the target variable level that you want to model. Multilevel classification problems are cast into a one-versus-all binary classification problem, where the value of the event parameter denotes the level that you are modeling.
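
The one-versus-all recoding described above can be sketched in a few lines of Python. The function one_versus_all and the sample levels are hypothetical. The action performs this recoding internally, so the sketch is for illustration only.

```python
def one_versus_all(target_values, event):
    """Recode a multilevel target as 1 for the event level and 0 otherwise."""
    return [1 if str(value) == str(event) else 0 for value in target_values]


levels = ["setosa", "versicolor", "virginica", "setosa"]
binary = one_versus_all(levels, event="setosa")
print(binary)
```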

explorationPolicy={avaptPolicy}

specifies the automatic variable analysis and grouping (AVAPT) policy.

Alias avaptPolicy

The avaptPolicy value can be one or more of the following:

"cardinality":{cardinalityAvaptPolicy}

specifies the automatic variable analysis and grouping (AVAPT) cardinality policy.

The cardinalityAvaptPolicy value can be one or more of the following:

"lowMediumCutoff":double

specifies the cardinality threshold for the low-medium cutoff.

Default 32
Range 2–256
"mediumHighCutoff":double

specifies the cardinality threshold for the medium-high cutoff.

Default 64
Range 2–1024
"minNObsPerTargetLevel":double

specifies the minimum number of observations for each target level.

Default 10
Range 5–100
"cv":{cvAvaptPolicy}

specifies the automatic variable analysis and grouping (AVAPT) coefficient of variation policy.

Alias coefficientVariation

The cvAvaptPolicy value can be one or more of the following:

"lowMoment":double

specifies the absolute value of the low-high percentage threshold for the moment coefficient of variation (CV).

Default 1
Minimum value 0
"lowRobust":double

specifies the absolute value of the low-high percentage threshold for the robust coefficient of variation (CV).

Default 1
Minimum value 0
"dateTimeVariables":["variable-name-1" <, "variable-name-2", ...>]

specifies the datetime variables.

Alias dateTime
"dateVariables":["variable-name-1" <, "variable-name-2", ...>]

specifies the date variables.

Alias date
"entropy":{entropyAvaptPolicy}

specifies the automatic variable analysis and grouping (AVAPT) entropy policy.

The entropyAvaptPolicy value can be one or more of the following:

"giniLowMediumCutoff":double

specifies the Gini entropy threshold for the low-medium cutoff.

Default 0.25
Range 0–1
"giniMediumHighCutoff":double

specifies the Gini entropy threshold for the medium-high cutoff.

Default 0.75
Range 0–1
"shannonLowMediumCutoff":double

specifies the Shannon entropy threshold for the low-medium cutoff.

Default 0.25
Range 0–1
"shannonMediumHighCutoff":double

specifies the Shannon entropy threshold for the medium-high cutoff.

Default 0.75
Range 0–1
"iqv":{iqvAvaptPolicy}

specifies the automatic variable analysis and grouping (AVAPT) index of qualitative variation policy.

Alias qualitativeVariationIndex

The iqvAvaptPolicy value can be one or more of the following:

"highTopBottom":double

specifies the low-high cutoff frequency ratio threshold between the most frequent and least frequent levels of a nominal variable.

Alias highTop1Bottom1
Default 100
Minimum value 1
"highTopTwo":double

specifies the low-high cutoff frequency ratio threshold between the most frequent and second most frequent levels of a nominal variable.

Alias highTop1Top2
Default 10
Minimum value 1
"highVariationRatio":double

specifies the variation ratio threshold for the low-high cutoff.

Alias highModVr
Default 0.5
Range (0–1]
"kurtosis":{kurtosisAvaptPolicy}

specifies the automatic variable analysis and grouping (AVAPT) kurtosis policy.

The kurtosisAvaptPolicy value can be one or more of the following:

"momentLowMediumCutoff":double

specifies the absolute value of the moment kurtosis threshold for the low-medium cutoff.

Default 5
Minimum value 0
"momentMediumHighCutoff":double

specifies the absolute value of the moment kurtosis threshold for the medium-high cutoff.

Default 10
Minimum value 0
"robustLowMediumCutoff":double

specifies the absolute value of the robust kurtosis threshold for the low-medium cutoff.

Default 2
Minimum value 0
"robustMediumHighCutoff":double

specifies the absolute value of the robust kurtosis threshold for the medium-high cutoff.

Default 3
Minimum value 0
"missing":{missingAvaptPolicy}

specifies the automatic variable analysis and grouping (AVAPT) missing grouping policy.

The missingAvaptPolicy value can be one or more of the following:

"lowMediumCutoff":double

specifies the missing percentage threshold for the low-medium cutoff.

Default 5
Range 0–100
"mediumHighCutoff":double

specifies the missing percentage threshold for the medium-high cutoff.

Default 25
Range 0–100
"nominal":{nominalAvaptPolicy}

specifies the automatic variable analysis and grouping (AVAPT) nominal policy.

The nominalAvaptPolicy value can be one or more of the following:

"cardinalityRatio":double

specifies the AVAPT nominal policy cardinality ratio threshold.

Default 0.25
Range (0–1]
"cardinalityThreshold":double

specifies the AVAPT nominal policy cardinality threshold.

Default 1024
Minimum value 32
"includeNegative":True | False

when set to True, includes numeric variables with some negative values in the nominal analysis.

Default False
"includeNonIntegral":True | False

when set to True, includes numeric variables with some nonintegral values in the nominal analysis.

Default False
"intervals":["variable-name-1" <, "variable-name-2", ...>]

specifies variables to consider as intervals.

"nominals":["variable-name-1" <, "variable-name-2", ...>]

specifies variables to consider as nominals.

"outlier":{outlierAvaptPolicy}

specifies the automatic variable analysis and grouping (AVAPT) outlier policy.

The outlierAvaptPolicy value can be one or more of the following:

"momentLowMediumCutoff":double

specifies the z-score outlier percentage threshold for the low-medium cutoff.

Default 1
Range 0–100
"momentMediumHighCutoff":double

specifies the z-score outlier percentage threshold for the medium-high cutoff.

Default 2.5
Range 0–100
"robustLowMediumCutoff":double

specifies the modified interquartile range outlier percentage threshold for the low-medium cutoff.

Default 1
Range 0–100
"robustMediumHighCutoff":double

specifies the modified interquartile range outlier percentage threshold for the medium-high cutoff.

Default 2.5
Range 0–100
"skewness":{skewnessAvaptPolicy}

specifies the automatic variable analysis and grouping (AVAPT) skewness policy.

The skewnessAvaptPolicy value can be one or more of the following:

"momentLowMediumCutoff":double

specifies the moment skewness threshold for the low-medium cutoff.

Default 2
Range 0–100
"momentMediumHighCutoff":double

specifies the moment skewness threshold for the medium-high cutoff.

Default 10
Range 0–100
"robustLowMediumCutoff":double

specifies the robust skewness threshold for the low-medium cutoff.

Default 0.75
Range 0–3
"robustMediumHighCutoff":double

specifies the robust skewness threshold for the medium-high cutoff.

Default 2
Range 0–3
"timeVariables":["variable-name-1" <, "variable-name-2", ...>]

specifies the time variables.

Alias time
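
Several of the policies above use a low-medium cutoff and a medium-high cutoff to sort a variable into one of three groups. The Python sketch below shows that idea for the cardinality and missing policies, using their documented defaults and allowing overrides in the shape of an explorationPolicy dictionary. The function group is hypothetical. Treating a value equal to a cutoff as belonging to the higher group is an assumption.

```python
# Documented default cutoffs for two of the policies.
DEFAULTS = {
    "cardinality": {"lowMediumCutoff": 32, "mediumHighCutoff": 64},
    "missing": {"lowMediumCutoff": 5, "mediumHighCutoff": 25},
}


def group(kind, value, exploration_policy=None):
    """Classify value as low, medium, or high for the given policy kind."""
    overrides = (exploration_policy or {}).get(kind, {})
    cutoffs = {**DEFAULTS[kind], **overrides}
    if value < cutoffs["lowMediumCutoff"]:
        return "low"
    if value < cutoffs["mediumHighCutoff"]:
        return "medium"
    return "high"


policy = {"missing": {"lowMediumCutoff": 10}}
print(group("missing", 7))            # default cutoffs
print(group("missing", 7, policy))    # raised low-medium cutoff
print(group("cardinality", 100))
```

The same policy dictionary could be passed as the explorationPolicy argument of the action, so one definition can serve both the call and any client-side reasoning about the groups.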

* featureOut={casouttable}

specifies the CAS table to store the feature transformation and generation pipelines.

Alias featuresOut
Long form featureOut={"name":"table-name"}
Shortcut form featureOut="table-name"

The casouttable value can be one or more of the following:

"caslib":"string"

specifies the name of the caslib for the output table.

"indexVars":["variable-name-1" <, "variable-name-2", ...>]

specifies the list of variables to create indexes for in the output data.

"lifetime":64-bit-integer

specifies the number of seconds to keep the table in memory after it is last accessed. The table is dropped if it is not accessed for the specified number of seconds.

Default 0
Minimum value 0
"memoryFormat":"DVR" | "INHERIT" | "STANDARD"

specifies the memory format for the output table.

Default INHERIT
DVR

use the duplicate value reduction memory format. This memory format can reduce the memory consumption and file size when the input data contains duplicate values.

INHERIT

use the default memory format that is set for the server. By default, the server uses the standard memory format. If an administrator sets the CAS_DEFAULT_MEMORY_FORMAT environment variable to DVR, then the DVR memory format is set as the default for the server.

STANDARD

use the standard memory format.

"name":"table-name"

specifies the name for the output table.

"promote":True | False

when set to True, adds the output table with a global scope. This enables other sessions to access the table, subject to access controls. The target caslib must also have a global scope.

Default False
"replace":True | False

when set to True, overwrites an existing table that has the same name.

Default False
"tableRedistUpPolicy":"DEFER" | "NOREDIST" | "REBALANCE"

specifies the table redistribution policy to apply when the number of worker pods increases on a running CAS server.

DEFER

defers the choice of redistribution policy to a higher-level entity.

NOREDIST

does not redistribute table data when the number of worker pods changes on a running CAS server.

REBALANCE

rebalances table data when the number of worker pods changes on a running CAS server.
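
The long form and the shortcut form of an output table value can be treated uniformly on the client side. The Python sketch below normalizes either form to a dictionary and applies the documented minimum for lifetime. The helper casout and the table names are hypothetical.

```python
def casout(value, **options):
    """Normalize a shortcut string or a long-form dict to a casouttable dict."""
    spec = {"name": value} if isinstance(value, str) else dict(value)
    spec.update(options)
    if spec.get("lifetime", 0) < 0:
        raise ValueError("lifetime must be 0 or greater")
    return spec


short_form = casout("FEATURES")
long_form = casout({"name": "FEATURES", "caslib": "PUBLIC"}, replace=True)
print(short_form)
print(long_form)
```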

hyperParameterOptimizer="MODELCOMPOSER" | "TUNEALL"

specifies the method to use for hyperparameter optimization.

Alias hpOptimizer
Default TUNEALL

inputs=[{casinvardesc-1} <, {casinvardesc-2}, ...>]

specifies the variables to use for the analysis. You can specify a subset of the variables from the input table.

For more information about specifying the inputs parameter, see the common casinvardesc parameter (Appendix A: Common Parameters).

Alias vars

kFolds=integer

specifies the number of folds for cross validation.

Default 5
Range 2–10
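
As a reminder of what the fold count controls, the following Python sketch assigns rows to folds and produces one training and validation split. The round-robin assignment is an assumption chosen for simplicity. The server may assign rows to folds differently. The function names are hypothetical.

```python
def fold_assignments(n_rows, k_folds=5):
    """Assign each row index to a fold in round-robin order."""
    if not 2 <= k_folds <= 10:
        raise ValueError("kFolds must be in the range 2-10")
    return [index % k_folds for index in range(n_rows)]


def split(n_rows, k_folds, held_out):
    """Return (training indices, validation indices) for one held-out fold."""
    folds = fold_assignments(n_rows, k_folds)
    train = [i for i, fold in enumerate(folds) if fold != held_out]
    valid = [i for i, fold in enumerate(folds) if fold == held_out]
    return train, valid


train, valid = split(10, k_folds=5, held_out=0)
print(train, valid)
```

Each of the k folds is held out once, so every row is used for validation exactly one time across the k model fits.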

logLevel=integer

specifies the logging level.

Default 0
Range 0–3

misraGries=True | False

when set to True, uses the Misra-Gries algorithm for the frequency distribution estimation, if the distinct count limit is exceeded.

Default True

modelTypes=["DECISIONTREE", "FOREST", "GLM", "GRADBOOST", "LOGISTIC", "NEURALNET"]

specifies the values to control the types and classes of machine learning algorithms to include in the pipeline exploration.

DECISIONTREE

specifies the decision tree model.

FOREST

specifies the random forest model.

GLM

specifies the generalized linear model.

GRADBOOST

specifies the gradient boosting model.

LOGISTIC

specifies the logistic regression model.

NEURALNET

specifies the neural network model.

objective="ASE" | "AUC" | "F1" | "MAE" | "MCE" | "MSLE" | "RASE" | "RMAE" | "RMSLE"

specifies the model performance metric to use.

ASE

uses the average square error.

AUC

uses the area under the receiver operating characteristic curve.

F1

uses the F1 coefficient.

MAE

uses the mean absolute error.

MCE

uses the misclassification error.

Alias MCR
MSLE

uses the mean square logarithmic error.

RASE

uses the root average square error.

RMAE

uses the root mean absolute error.

RMSLE

uses the root mean square logarithmic error.
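
The error-based metrics above follow standard definitions, which the Python sketch below spells out. The formulas are the usual textbook ones, including the log(1 + x) form for the logarithmic errors, and are assumed rather than confirmed to match the server's exact computations. AUC and F1 are omitted to keep the sketch short.

```python
import math


def ase(actual, predicted):
    """Average square error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def msle(actual, predicted):
    """Mean square logarithmic error, using log(1 + x) (assumed form)."""
    return sum(
        (math.log1p(a) - math.log1p(p)) ** 2 for a, p in zip(actual, predicted)
    ) / len(actual)


def mce(actual_labels, predicted_labels):
    """Misclassification error: the fraction of labels predicted wrongly."""
    wrong = sum(a != p for a, p in zip(actual_labels, predicted_labels))
    return wrong / len(actual_labels)


METRICS = {
    "ASE": ase,
    "RASE": lambda a, p: math.sqrt(ase(a, p)),
    "MAE": mae,
    "RMAE": lambda a, p: math.sqrt(mae(a, p)),
    "MSLE": msle,
    "RMSLE": lambda a, p: math.sqrt(msle(a, p)),
    "MCE": mce,
}

actual, predicted = [1, 2, 3], [1, 2, 5]
for name in ("ASE", "RASE", "MAE"):
    print(name, METRICS[name](actual, predicted))
```

For all of the metrics in this sketch, lower values indicate a better pipeline. AUC and F1 work in the opposite direction.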

* pipelineOut={casouttable}

specifies the CAS table to store the analysis results.

Alias pipelinesOut
Long form pipelineOut={"name":"table-name"}
Shortcut form pipelineOut="table-name"

The casouttable value can be one or more of the following:

"caslib":"string"

specifies the name of the caslib for the output table.

"indexVars":["variable-name-1" <, "variable-name-2", ...>]

specifies the list of variables to create indexes for in the output data.

"lifetime":64-bit-integer

specifies the number of seconds to keep the table in memory after it is last accessed. The table is dropped if it is not accessed for the specified number of seconds.

Default 0
Minimum value 0
"memoryFormat":"DVR" | "INHERIT" | "STANDARD"

specifies the memory format for the output table.

Default INHERIT
DVR

use the duplicate value reduction memory format. This memory format can reduce the memory consumption and file size when the input data contains duplicate values.

INHERIT

use the default memory format that is set for the server. By default, the server uses the standard memory format. If an administrator sets the CAS_DEFAULT_MEMORY_FORMAT environment variable to DVR, then the DVR memory format is set as the default for the server.

STANDARD

use the standard memory format.

"name":"table-name"

specifies the name for the output table.

"promote":True | False

when set to True, adds the output table with a global scope. This enables other sessions to access the table, subject to access controls. The target caslib must also have a global scope.

Default False
"replace":True | False

when set to True, overwrites an existing table that has the same name.

Default False
"tableRedistUpPolicy":"DEFER" | "NOREDIST" | "REBALANCE"

specifies the table redistribution policy to apply when the number of worker pods increases on a running CAS server.

DEFER

defers the choice of redistribution policy to a higher-level entity.

NOREDIST

does not redistribute table data when the number of worker pods changes on a running CAS server.

REBALANCE

rebalances table data when the number of worker pods changes on a running CAS server.

sampleSize=integer

specifies the maximum number of pipelines to sample.

Default 10
Minimum value 1

saveState={casouttable} | {savePipelinesOptions}

specifies the CAS table to store the analysis results.

Alias saveModel

The casouttable value can be one or more of the following:

"caslib":"string"

specifies the name of the caslib for the output table.

"compress":True | False

when set to True, applies data compression to the table.

Default False
"indexVars":["variable-name-1" <, "variable-name-2", ...>]

specifies the list of variables to create indexes for in the output data.

"label":"string"

specifies the descriptive label to associate with the table.

"lifetime":64-bit-integer

specifies the number of seconds to keep the table in memory after it is last accessed. The table is dropped if it is not accessed for the specified number of seconds.

Default 0
Minimum value 0
"maxMemSize":64-bit-integer

specifies the maximum amount of memory, in bytes, that each thread should allocate for in-memory blocks before converting to a memory-mapped file. Files are written in the directories that are specified in the CAS_DISK_CACHE environment variable.

TIP You can enclose the value in quotation marks and specify B, K, M, G, or T as a suffix to indicate the units. For example, "8M" specifies eight megabytes.
"memoryFormat":"DVR" | "INHERIT" | "STANDARD"

specifies the memory format for the output table.

Default INHERIT
DVR

use the duplicate value reduction memory format. This memory format can reduce the memory consumption and file size when the input data contains duplicate values.

INHERIT

use the default memory format that is set for the server. By default, the server uses the standard memory format. If an administrator sets the CAS_DEFAULT_MEMORY_FORMAT environment variable to DVR, then the DVR memory format is set as the default for the server.

STANDARD

use the standard memory format.

"name":"table-name"

specifies the name for the output table.

"promote":True | False

when set to True, adds the output table with a global scope. This enables other sessions to access the table, subject to access controls. The target caslib must also have a global scope.

Default False
"replace":True | False

when set to True, overwrites an existing table that has the same name.

Default False
"replication":integer

specifies the number of copies of the table to make for fault tolerance. Larger values result in slower performance and use more memory, but provide high availability for data in the event of a node failure. Data redundancy applies to distributed servers only.

Default 1
Minimum value 0
"tableRedistUpPolicy":"DEFER" | "NOREDIST" | "REBALANCE"

specifies the table redistribution policy to apply when the number of worker pods increases on a running CAS server.

DEFER

defers the choice of redistribution policy to a higher-level entity.

NOREDIST

does not redistribute table data when the number of worker pods changes on a running CAS server.

REBALANCE

rebalances table data when the number of worker pods changes on a running CAS server.

"threadBlockSize":64-bit-integer

specifies the number of bytes to use for blocks in the output table. The blocks are read by threads. Gradually increase this value when you have a large table with millions or billions of rows and you are tuning for performance. Larger values can increase performance with indexed tables. However, if the value is too large, then you can cause thread starvation due to too few blocks for threads to work on.

Alias blockSize
Default 1048576
Minimum value 0
TIP You can enclose the value in quotation marks and specify B, K, M, G, or T as a suffix to indicate the units. For example, "8M" specifies eight megabytes.
"timeStamp":"string"

specifies to add a timestamp column to the table. Support for timeStamp is action-specific. Specify the value in the form that is appropriate for your session locale.

"where":["string-1" <, "string-2", ...>]

specifies one or more expressions for subsetting the output data. When multiple expressions are specified, the expressions are effectively combined using AND to form the final output filter. If an expression contains quoted values, use nested quotation marks.

The savePipelinesOptions value can be one or more of the following:

"modelNamePrefix":"string"

specifies the prefix to use for the names of the saved models.

"replace":True | False

when set to True, overwrites already existing models that have the same name.

Default True
"topK":integer

specifies the number of best-performing models to save.

Default 5
Minimum value 1

screenPolicy={sweeperPolicy}

specifies the variable screening policy to use for recommending that variables be screened out, transformed, or copied.

Alias sweeperPolicy

The sweeperPolicy value can be one or more of the following:

"constant":True | False

when set to True, uses the variable screening policy to identify variables that have constant values.

Alias unique
Default True
"groupRareLevels":True | False

when set to True, uses the variable screening policy to identify nominal variables that have rare levels.

Alias groupRare
Default True
"leakagePercentThreshold":double

specifies the variable screening policy for variables that have a very high level of information about the target. Variables that have a greater target entropy percentage reduction than the specified threshold are flagged as leakage variables.

Alias leakagePercentageThreshold
Default 90
Range (0–100]
"lowCv":True | False

when set to True, uses the variable screening policy to identify variables that have a low coefficient of variation (CV).

Alias lowCoefficientVariation
Default True
"lowMutualInformation":double

specifies the variable screening policy for variables that have a low level of information about the target.

Alias lowInformation
Default 0.05
Minimum value 0
"missingIndicatorPercent":double

specifies the variable screening policy for generating missing indicator variables.

Alias missingIndicatorPercentage
Default 75
Range [10–100)
"missingPercentThreshold":double

specifies the variable screening policy for identifying variables that have a very high missing rate.

Alias missingPercentageThreshold
Default 90
Range [10–100)
"redundant":double

specifies the symmetric uncertainty (SU) threshold for identifying redundant variables. If the SU for two variables exceeds the threshold, the variable that has less information about the target is flagged as redundant.

Default 1
Range (0–1]
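
The following sketch shows one way to assemble a screenPolicy value on the client side and check it against the defaults and ranges listed above before the action is called. The helper make_screen_policy and the two lookup tables are hypothetical conveniences, not part of the action set; the server performs its own validation.

```python
# Documented defaults for the sweeperPolicy subparameters.
SCREEN_POLICY_DEFAULTS = {
    "constant": True,
    "groupRareLevels": True,
    "leakagePercentThreshold": 90,
    "lowCv": True,
    "lowMutualInformation": 0.05,
    "missingIndicatorPercent": 75,
    "missingPercentThreshold": 90,
    "redundant": 1,
}

# (low, high, low_inclusive, high_inclusive), taken from the documented ranges.
SCREEN_POLICY_RANGES = {
    "leakagePercentThreshold": (0, 100, False, True),        # (0, 100]
    "lowMutualInformation": (0, float("inf"), True, False),  # minimum value 0
    "missingIndicatorPercent": (10, 100, True, False),       # [10, 100)
    "missingPercentThreshold": (10, 100, True, False),       # [10, 100)
    "redundant": (0, 1, False, True),                        # (0, 1]
}


def make_screen_policy(**overrides):
    """Merge overrides into the defaults and check the documented ranges."""
    unknown = set(overrides) - set(SCREEN_POLICY_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown screenPolicy keys: {sorted(unknown)}")
    policy = {**SCREEN_POLICY_DEFAULTS, **overrides}
    for key, (low, high, low_inc, high_inc) in SCREEN_POLICY_RANGES.items():
        value = policy[key]
        above = value >= low if low_inc else value > low
        below = value <= high if high_inc else value < high
        if not (above and below):
            raise ValueError(f"{key}={value} is outside the documented range")
    return policy


policy = make_screen_policy(redundant=0.9, missingPercentThreshold=80)
```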

seed=integer

specifies a seed value for random number generation. This value is used for repeatable random number generation in some scenarios.

Default 0

selectionPolicy={featureSelectOptions}

specifies the feature selection policy.

Long form selectionPolicy={"criterion":"CHISQ" | "CRAMERSV" | "ENTROPY" | "FTEST" | "G2" | "IV" | "MI" | "NORMMI" | "PEARSON" | "SU"}
Shortcut form selectionPolicy="CHISQ" | "CRAMERSV" | "ENTROPY" | "FTEST" | "G2" | "IV" | "MI" | "NORMMI" | "PEARSON" | "SU"

The featureSelectOptions value can be one or more of the following:

"criterion":"CHISQ" | "CRAMERSV" | "ENTROPY" | "FTEST" | "G2" | "IV" | "MI" | "NORMMI" | "PEARSON" | "SU"

specifies the filter feature selection criterion to use.

Alias stat
Default MI
CHISQ

uses the chi-square statistic.

CRAMERSV

uses Cramer's V.

ENTROPY

uses the entropy percentage decrease.

FTEST

uses the F test.

G2

uses the G2 statistic.

IV

uses the information value statistic.

MI

uses the mutual information statistic.

NORMMI

uses the normalized mutual information statistic.

PEARSON

uses the Pearson correlation.

SU

uses the symmetric uncertainty statistic.

"topK":integer

specifies the number of features to select. The features that have the highest values of the filter selection criterion are selected. If both topK and topKPercent are specified, then topKPercent is used.

Default 50
Minimum value 1
"topKPercent":double

specifies the percentage of features to select. The features that have the highest values of the filter selection criterion are selected. If both topK and topKPercent are specified, then topKPercent is used.

Alias topKPercentage
Range (0–100]
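
The relationship between the shortcut form, the long form, and the topK and topKPercent precedence rule can be sketched as follows. Both function names are hypothetical, and the criterion check in this sketch is case-sensitive, which might be stricter than the server.

```python
CRITERIA = {"CHISQ", "CRAMERSV", "ENTROPY", "FTEST", "G2",
            "IV", "MI", "NORMMI", "PEARSON", "SU"}


def normalize_selection_policy(policy):
    """Expand the shortcut form to the long form and validate the criterion."""
    if isinstance(policy, str):
        policy = {"criterion": policy}    # shortcut form -> long form
    policy = dict(policy)                 # copy so the caller's value is untouched
    policy.setdefault("criterion", "MI")  # documented default
    if policy["criterion"] not in CRITERIA:
        raise ValueError(f"unknown criterion: {policy['criterion']}")
    return policy


def effective_limit(policy):
    """Report which limit applies: topKPercent is used when both are specified."""
    if "topKPercent" in policy:
        return ("topKPercent", policy["topKPercent"])
    return ("topK", policy.get("topK", 50))  # documented default for topK


print(normalize_selection_policy("SU"))
print(effective_limit({"topK": 20, "topKPercent": 10}))
```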

* table={castable}

specifies the table name, caslib, and other common parameters.

Long form table={"name":"table-name"}
Shortcut form table="table-name"

The castable value can be one or more of the following:

"caslib":"string"

specifies the caslib for the input table that you want to use with the action. By default, the active caslib is used. Specify a value only if you need to access a table from a different caslib.

"computedOnDemand":True | False

when set to True, creates the computed variables when the table is loaded instead of when the action begins.

Alias compOnDemand
Default False
"computedVars":[{casinvardesc-1} <, {casinvardesc-2}, ...>]

specifies the names of the computed variables to create. Specify an expression for each variable in the computedVarsProgram parameter. If you do not specify this parameter, then all variables from computedVarsProgram are automatically included.

Alias compVars

The casinvardesc value can be one or more of the following:

"format":"string"

specifies the format to apply to the variable.

"formattedLength":integer

specifies the length of the format field plus the length of the format precision.

"label":"string"

specifies the descriptive label for the variable.

* "name":"variable-name"

specifies the name for the variable.

"nfd":integer

specifies the length of the format precision.

"nfl":integer

specifies the length of the format field.

"computedVarsProgram":"string"

specifies an expression for each computed variable that you include in the computedVars parameter.

Alias compPgm
"dataSourceOptions":{"key-1":{any-list-or-data-type-1} <, "key-2":{any-list-or-data-type-2}, ...>}

specifies data source options.

Aliases options, dataSource
"importOptions":{"fileType":"ANY" | "AUDIO" | "AUTO" | "BASESAS" | "CSV" | "DOCUMENT" | "DTA" | "ESP" | "EXCEL" | "FMT" | "HDAT" | "IMAGE" | "JMP" | "LASR" | "PARQUET" | "SPSS" | "VIDEO" | "XLS", fileType-specific-parameters}

specifies the settings for reading a table from a data source.

Alias import_

For more information about specifying the importOptions parameter, see the common importOptions parameter (Appendix A: Common Parameters).

* "name":"table-name"

specifies the name of the input table.

"singlePass":True | False

when set to True, does not create a transient table on the server. Setting this parameter to True can be efficient, but the data might not have stable ordering upon repeated runs.

Default False
"vars":[{casinvardesc-1} <, {casinvardesc-2}, ...>]

specifies the variables to use in the action.

The casinvardesc value can be one or more of the following:

"format":"string"

specifies the format to apply to the variable.

"formattedLength":integer

specifies the length of the format field plus the length of the format precision.

"label":"string"

specifies the descriptive label for the variable.

* "name":"variable-name"

specifies the name for the variable.

"nfd":integer

specifies the length of the format precision.

"nfl":integer

specifies the length of the format field.

"where":"where-expression"

specifies an expression for subsetting the input data.

"whereTable":{groupbytable}

specifies an input table that contains rows to use as a WHERE filter. If the vars parameter is not specified, then all the variable names that are common to the input table and the filtering table are used to find matching rows. If the where parameter for the input table and this parameter are specified, then this filtering table is applied first.

The groupbytable value can be one or more of the following:

"casLib":"string"

specifies the caslib for the filter table. By default, the active caslib is used.

"dataSourceOptions":{adls_noreq-parameters | bigquery-parameters | cas_noreq-parameters | clouddex-parameters | db2-parameters | dnfs-parameters | esp-parameters | fedsvr-parameters | gcs_noreq-parameters | hadoop-parameters | hana-parameters | impala-parameters | informix-parameters | jdbc-parameters | mongodb-parameters | mysql-parameters | odbc-parameters | oracle-parameters | path-parameters | postgres-parameters | redshift-parameters | s3-parameters | sapiq-parameters | sforce-parameters | singlestore_standard-parameters | snowflake-parameters | spark-parameters | spde-parameters | sqlserver-parameters | ss_noreq-parameters | teradata-parameters | vertica-parameters | yellowbrick-parameters}

specifies data source options.

Aliases options, dataSource

For more information about specifying the dataSourceOptions parameter, see the common dataSourceOptions parameter (Appendix A: Common Parameters).

"importOptions":{"fileType":"ANY" | "AUDIO" | "AUTO" | "BASESAS" | "CSV" | "DOCUMENT" | "DTA" | "ESP" | "EXCEL" | "FMT" | "HDAT" | "IMAGE" | "JMP" | "LASR" | "PARQUET" | "SPSS" | "VIDEO" | "XLS", fileType-specific-parameters}

specifies the settings for reading a table from a data source.

Alias import_

For more information about specifying the importOptions parameter, see the common importOptions parameter (Appendix A: Common Parameters).

* "name":"table-name"

specifies the name of the filter table.

"vars":[{casinvardesc-1} <, {casinvardesc-2}, ...>]

specifies the variable names to use from the filter table.

The casinvardesc value can be one or more of the following:

"format":"string"

specifies the format to apply to the variable.

"formattedLength":integer

specifies the length of the format field plus the length of the format precision.

"label":"string"

specifies the descriptive label for the variable.

* "name":"variable-name"

specifies the name for the variable.

"nfd":integer

specifies the length of the format precision.

"nfl":integer

specifies the length of the format field.

"where":"where-expression"

specifies an expression for subsetting the data from the filter table.
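
As an illustration of how the castable subparameters nest, the following sketch builds a table value that combines a computed variable, a WHERE expression, and a filter table. Every table, caslib, and column name is hypothetical, the computedVarsProgram string assumes an assignment-style expression, and check_table is an illustrative helper that only confirms that the required name subparameters are present.

```python
# All table, caslib, and column names below are hypothetical.
table = {
    "name": "customers",                     # required
    "caslib": "public",
    # Name the computed variable here and give its expression separately.
    "computedVars": [{"name": "log_income"}],
    "computedVarsProgram": "log_income = log(income + 1);",
    "where": "age >= 18",
    # Filter table: matching rows are found on the listed variable.
    "whereTable": {
        "name": "active_ids",                # required
        "vars": [{"name": "customer_id"}],
        "where": "status = 'active'",
    },
}


def check_table(spec):
    """Check that the required name subparameters are present."""
    if "name" not in spec:
        raise ValueError("table requires name")
    where_table = spec.get("whereTable")
    if where_table is not None and "name" not in where_table:
        raise ValueError("whereTable requires name")
    return True


print(check_table(table))
```

Because both where and whereTable are specified in this sketch, the filter table would be applied first, as described for the whereTable parameter.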

* target="variable-name"

specifies the target variable.

Alias evalVar

topKPipelines=integer

specifies the number of best-performing pipelines to save.

Default 10
Minimum value 1

* transformationOut={casouttable}

specifies the CAS table to store the feature transformation and generation pipelines.

Alias transformationsOut
Long form transformationOut={"name":"table-name"}
Shortcut form transformationOut="table-name"

The casouttable value can be one or more of the following:

"caslib":"string"

specifies the name of the caslib for the output table.

"indexVars":["variable-name-1" <, "variable-name-2", ...>]

specifies the list of variables to create indexes for in the output data.

"lifetime":64-bit-integer

specifies the number of seconds to keep the table in memory after it is last accessed. The table is dropped if it is not accessed for the specified number of seconds.

Default 0
Minimum value 0
"memoryFormat":"DVR" | "INHERIT" | "STANDARD"

specifies the memory format for the output table.

Default INHERIT
DVR

use the duplicate value reduction memory format. This memory format can reduce the memory consumption and file size when the input data contains duplicate values.

INHERIT

use the default memory format that is set for the server. By default, the server uses the standard memory format. If an administrator sets the CAS_DEFAULT_MEMORY_FORMAT environment variable to DVR, then the DVR memory format is set as the default for the server.

STANDARD

use the standard memory format.

"name":"table-name"

specifies the name for the output table.

"promote":True | False

when set to True, adds the output table with a global scope. This enables other sessions to access the table, subject to access controls. The target caslib must also have a global scope.

Default False
"replace":True | False

when set to True, overwrites an existing table that has the same name.

Default False
"tableRedistUpPolicy":"DEFER" | "NOREDIST" | "REBALANCE"

specifies the table redistribution policy to use when the number of worker pods increases on a running CAS server.

DEFER

defers the selection of the redistribution policy to a higher-level entity.

NOREDIST

does not redistribute table data when the number of worker pods changes on a running CAS server.

REBALANCE

rebalances table data when the number of worker pods changes on a running CAS server.

transformationPolicy={transformationSpace}

specifies the feature transformation and generation space in which the feature machine operates.

Alias transformationSpace

The transformationSpace value can be one or more of the following:

"cardinality":True | False

when set to True, includes cardinality-reducing transformations.

Default True
"entropy":True | False

when set to True, includes transformations for the treatment of low entropy.

Default False
"interaction":True | False

when set to True, detects and generates interaction features.

Default False
"iqv":True | False

when set to True, includes transformations for the treatment of low indices of qualitative variation (IQV).

Default False
"kurtosis":True | False

when set to True, includes transformations for the treatment of high kurtosis.

Default False
"missing":True | False

when set to True, includes transformations for the treatment of missing values.

Default True
"outlier":True | False

when set to True, includes transformations for the treatment of outliers.

Default False
"skewness":True | False

when set to True, includes transformations for the treatment of high skewness.

Default True
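
Because only cardinality, missing, and skewness are enabled by default, it can help to see the full effective transformationPolicy after a few overrides. The sketch below merges overrides into the documented defaults; the helper name make_transformation_policy is hypothetical.

```python
# Documented defaults for the transformationSpace subparameters.
TRANSFORMATION_POLICY_DEFAULTS = {
    "cardinality": True,
    "entropy": False,
    "interaction": False,
    "iqv": False,
    "kurtosis": False,
    "missing": True,
    "outlier": False,
    "skewness": True,
}


def make_transformation_policy(**overrides):
    """Return the effective policy after applying Boolean overrides."""
    for key, value in overrides.items():
        if key not in TRANSFORMATION_POLICY_DEFAULTS:
            raise KeyError(f"unknown transformationPolicy key: {key}")
        if not isinstance(value, bool):
            raise TypeError(f"{key} must be True or False")
    return {**TRANSFORMATION_POLICY_DEFAULTS, **overrides}


# Turn on interaction detection and outlier treatment; keep the other defaults.
transformation_policy = make_transformation_policy(interaction=True, outlier=True)
enabled = sorted(k for k, v in transformation_policy.items() if v)
print(enabled)
```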

validationPartitionFraction=double

specifies the fraction of the input data to use for validation.

Default 0.3
Range 0.01–0.99
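
The value is a fraction in the range 0.01 to 0.99, so the default of 0.3 holds out roughly 30 percent of the rows. The following sketch shows what such a split looks like with a seeded shuffle. It is an illustration only: the function split_rows is hypothetical and does not reproduce the sampling method that the action uses.

```python
import random


def split_rows(n_rows, validation_fraction=0.3, seed=0):
    """Split row indices into training and validation sets (illustrative only)."""
    if not 0.01 <= validation_fraction <= 0.99:
        raise ValueError("validationPartitionFraction must be in 0.01-0.99")
    rng = random.Random(seed)            # a fixed seed gives a repeatable split
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_valid = round(n_rows * validation_fraction)
    return sorted(indices[n_valid:]), sorted(indices[:n_valid])


train_rows, valid_rows = split_rows(100, validation_fraction=0.3, seed=1)
print(len(train_rows), len(valid_rows))
```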

dsAutoMl Action

Automated machine learning pipeline exploration, execution, and ranking.

R Syntax

results <- cas.dataSciencePilot.dsAutoMl(s,
ecdfTolerance=double,
event="string",
explorationPolicy=list(
cv=list(
lowMoment=double
lowRobust=double
),
dateTimeVariables=list("variable-name-1" <, "variable-name-2", ...>),
dateVariables=list("variable-name-1" <, "variable-name-2", ...>),
iqv=list( ),
missing=list( ),
nominal=list(
includeNegative=TRUE | FALSE
includeNonIntegral=TRUE | FALSE
intervals=list("variable-name-1" <, "variable-name-2", ...>)
nominals=list("variable-name-1" <, "variable-name-2", ...>)
),
timeVariables=list("variable-name-1" <, "variable-name-2", ...>)
),
required parameter featureOut=list(
caslib="string",
indexVars=list("variable-name-1" <, "variable-name-2", ...>),
lifetime=64-bit-integer,
name="table-name",
promote=TRUE | FALSE,
replace=TRUE | FALSE,
),
hyperParameterOptimizer="MODELCOMPOSER" | "TUNEALL",
inputs=list( list(
format="string",
formattedLength=integer,
label="string",
required parameter name="variable-name",
nfd=integer,
nfl=integer
) <, list(...)>),
kFolds=integer,
logLevel=integer,
misraGries=TRUE | FALSE,
modelTypes=list("DECISIONTREE", "FOREST", "GLM", "GRADBOOST", "LOGISTIC", "NEURALNET"),
required parameter pipelineOut=list(
caslib="string",
indexVars=list("variable-name-1" <, "variable-name-2", ...>),
lifetime=64-bit-integer,
name="table-name",
promote=TRUE | FALSE,
replace=TRUE | FALSE,
),
sampleSize=integer,
saveState=list(casouttable) | list(savePipelinesOptions),
screenPolicy=list(
constant=TRUE | FALSE,
groupRareLevels=TRUE | FALSE,
lowCv=TRUE | FALSE,
redundant=double
),
seed=integer,
required parameter table=list(
caslib="string",
computedOnDemand=TRUE | FALSE,
computedVars=list( list(
format="string",
formattedLength=integer,
label="string",
required parameter name="variable-name",
nfd=integer,
nfl=integer
) <, list(...)>),
dataSourceOptions=list(key-1=list(any-list-or-data-type-1) <, key-2=list(any-list-or-data-type-2), ...>),
importOptions=list(fileType="ANY" | "AUDIO" | "AUTO" | "BASESAS" | "CSV" | "DELIMITED" | "DOCUMENT" | "DTA" | "ESP" | "EXCEL" | "FMT" | "HDAT" | "IMAGE" | "JMP" | "LASR" | "PARQUET" | "SOUND" | "SPSS" | "VIDEO" | "XLS", fileType-specific-parameters),
required parameter name="table-name",
singlePass=TRUE | FALSE,
vars=list( list(
format="string",
formattedLength=integer,
label="string",
required parameter name="variable-name",
nfd=integer,
nfl=integer
) <, list(...)>),
where="where-expression",
whereTable=list(
casLib="string"
dataSourceOptions=list(adls_noreq-parameters | bigquery-parameters | cas_noreq-parameters | clouddex-parameters | db2-parameters | dnfs-parameters | esp-parameters | fedsvr-parameters | gcs_noreq-parameters | hadoop-parameters | hana-parameters | impala-parameters | informix-parameters | jdbc-parameters | mongodb-parameters | mysql-parameters | odbc-parameters | oracle-parameters | path-parameters | postgres-parameters | redshift-parameters | s3-parameters | sapiq-parameters | sforce-parameters | singlestore_standard-parameters | snowflake-parameters | spark-parameters | spde-parameters | sqlserver-parameters | ss_noreq-parameters | teradata-parameters | vertica-parameters | yellowbrick-parameters)
importOptions=list(fileType="ANY" | "AUDIO" | "AUTO" | "BASESAS" | "CSV" | "DELIMITED" | "DOCUMENT" | "DTA" | "ESP" | "EXCEL" | "FMT" | "HDAT" | "IMAGE" | "JMP" | "LASR" | "PARQUET" | "SOUND" | "SPSS" | "VIDEO" | "XLS", fileType-specific-parameters)
required parameter name="table-name"
vars=list( list(
format="string",
formattedLength=integer,
label="string",
required parameter name="variable-name",
nfd=integer,
nfl=integer
) <, list(...)>)
where="where-expression"
)
),
required parameter target="variable-name",
topKPipelines=integer,
required parameter transformationOut=list(
caslib="string",
indexVars=list("variable-name-1" <, "variable-name-2", ...>),
lifetime=64-bit-integer,
name="table-name",
promote=TRUE | FALSE,
replace=TRUE | FALSE,
),
transformationPolicy=list(
cardinality=TRUE | FALSE,
entropy=TRUE | FALSE,
interaction=TRUE | FALSE,
iqv=TRUE | FALSE,
kurtosis=TRUE | FALSE,
missing=TRUE | FALSE,
outlier=TRUE | FALSE,
skewness=TRUE | FALSE
),
)
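In the syntax above, the prefix "required parameter" indicates a required parameter. In the parameter descriptions, required parameters are marked with an asterisk (*).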

Summary: Input and Output Tables

If a row includes a subparameter, you can specify the name, caslib, and so on in the subparameter. Otherwise, you can specify the name, caslib, and so on in the parameter.

Parameters for Reading Input Tables

Parameter | Subparameter | Description
table (required) | (none) | specifies the table name, caslib, and other common parameters.

Parameters for Creating Output Tables

Parameter | Subparameter | Description
featureOut (required) | (none) | specifies the CAS table to store the feature transformation and generation pipelines.
pipelineOut (required) | (none) | specifies the CAS table to store the analysis results.
saveState | (none) | specifies the CAS table to store the analysis results.
transformationOut (required) | (none) | specifies the CAS table to store the feature transformation and generation pipelines.

Parameter Descriptions

distinctCountLimit=integer

specifies the distinct count limit. If the limit is exceeded, and the misraGries parameter is set to True, the Misra-Gries frequency sketch algorithm is used to estimate the frequency distribution. Otherwise, the distinct count operation is aborted.

Alias maxNVals
Default 10000
Minimum value 256
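
The fallback behavior described above can be sketched in Python as follows. The function misra_gries is a textbook version of the Misra-Gries summary, and frequency_distribution is a hypothetical wrapper that mimics the decision between exact counting, sketching, and aborting. The number of counters chosen here and the internal details of the action's implementation are assumptions, and the sketch does not enforce the documented minimum of 256.

```python
def misra_gries(stream, k):
    """Return approximate counts for at most k - 1 candidate frequent items."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


def frequency_distribution(values, distinct_count_limit=10000, use_misra_gries=True):
    """Count exactly under the limit; otherwise sketch or abort."""
    exact = {}
    for value in values:
        exact[value] = exact.get(value, 0) + 1
        if len(exact) > distinct_count_limit:
            if use_misra_gries:
                # Assumption: keep as many counters as the limit allows.
                return misra_gries(values, distinct_count_limit + 1)
            raise RuntimeError("distinct count limit exceeded")
    return exact


print(misra_gries(["a", "b", "a", "c", "a", "a", "d"], k=2))
```

Each reported count underestimates the true count by at most n/k, where n is the stream length, which is why the result is an estimate of the frequency distribution rather than an exact tally.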

ecdfTolerance=double

specifies the tolerance value for the empirical cumulative distribution function. This value is used by the quantile sketch algorithm.

Default 0.001
Range 1E-06–0.1

event="string"

specifies the target variable level that you want to model. Multilevel classification problems are cast into a one-versus-all binary classification problem, where the value of the event parameter denotes the level that you are modeling.
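
The one-versus-all recasting can be illustrated with a short Python sketch. The function binarize_target and the level names are hypothetical; the sketch shows only the idea of mapping the event level to 1 and every other level to 0, not how the action encodes the target internally.

```python
def binarize_target(levels, event):
    """Map the event level to 1 and all other levels to 0 (one versus all)."""
    return [1 if level == event else 0 for level in levels]


# A three-level target modeled with event="high".
target = ["low", "high", "medium", "high"]
print(binarize_target(target, event="high"))
```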

explorationPolicy=list(avaptPolicy)

specifies the automatic variable analysis and grouping (AVAPT) policy.

Alias avaptPolicy

The avaptPolicy value can be one or more of the following:

cardinality=list(cardinalityAvaptPolicy)

specifies the automatic variable analysis and grouping (AVAPT) cardinality policy.

The cardinalityAvaptPolicy value can be one or more of the following:

lowMediumCutoff=double

specifies the cardinality threshold for the low-medium cutoff.

Default 32
Range 2–256
mediumHighCutoff=double

specifies the cardinality threshold for the medium-high cutoff.

Default 64
Range 2–1024
minNObsPerTargetLevel=double

specifies the minimum number of observations for each target level.

Default 10
Range 5–100
cv=list(cvAvaptPolicy)

specifies the automatic variable analysis and grouping (AVAPT) coefficient of variation policy.

Alias coefficientVariation

The cvAvaptPolicy value can be one or more of the following:

lowMoment=double

specifies the absolute value of the low-high percentage threshold for the moment coefficient of variation (CV).

Default 1
Minimum value 0
lowRobust=double

specifies the absolute value of the low-high percentage threshold for the robust coefficient of variation (CV).

Default 1
Minimum value 0
dateTimeVariables=list("variable-name-1" <, "variable-name-2", ...>)

specifies the datetime variables.

Alias dateTime
dateVariables=list("variable-name-1" <, "variable-name-2", ...>)

specifies the date variables.

Alias date
entropy=list(entropyAvaptPolicy)

specifies the automatic variable analysis and grouping (AVAPT) entropy policy.

The entropyAvaptPolicy value can be one or more of the following:

giniLowMediumCutoff=double

specifies the Gini entropy threshold for the low-medium cutoff.

Default 0.25
Range 0–1
giniMediumHighCutoff=double

specifies the Gini entropy threshold for the medium-high cutoff.

Default 0.75
Range 0–1
shannonLowMediumCutoff=double

specifies the Shannon entropy threshold for the low-medium cutoff.

Default 0.25
Range 0–1
shannonMediumHighCutoff=double

specifies the Shannon entropy threshold for the medium-high cutoff.

Default 0.75
Range 0–1
iqv=list(iqvAvaptPolicy)

specifies the automatic variable analysis and grouping (AVAPT) index of qualitative variation policy.

Alias qualitativeVariationIndex

The iqvAvaptPolicy value can be one or more of the following:

highTopBottom=double

specifies the low-high cutoff frequency ratio threshold between the most frequent and least frequent levels of a nominal variable.

Alias highTop1Bottom1
Default 100
Minimum value 1
highTopTwo=double

specifies the low-high cutoff frequency ratio threshold between the most frequent and second most frequent levels of a nominal variable.

Alias highTop1Top2
Default 10
Minimum value 1
highVariationRatio=double

specifies the variation ratio threshold for the low-high cutoff.

Alias highModVr
Default 0.5
Range (0–1]
kurtosis=list(kurtosisAvaptPolicy)

specifies the automatic variable analysis and grouping (AVAPT) kurtosis policy.

The kurtosisAvaptPolicy value can be one or more of the following:

momentLowMediumCutoff=double

specifies the absolute value of the moment kurtosis threshold for the low-medium cutoff.

Default 5
Minimum value 0
momentMediumHighCutoff=double

specifies the absolute value of the moment kurtosis threshold for the medium-high cutoff.

Default 10
Minimum value 0
robustLowMediumCutoff=double

specifies the absolute value of the robust kurtosis threshold for the low-medium cutoff.

Default 2
Minimum value 0
robustMediumHighCutoff=double

specifies the absolute value of the robust kurtosis threshold for the medium-high cutoff.

Default 3
Minimum value 0
missing=list(missingAvaptPolicy)

specifies the automatic variable analysis and grouping (AVAPT) missing grouping policy.

The missingAvaptPolicy value can be one or more of the following:

lowMediumCutoff=double

specifies the missing percentage threshold for the low-medium cutoff.

Default 5
Range 0–100
mediumHighCutoff=double

specifies the missing percentage threshold for the medium-high cutoff.

Default 25
Range 0–100
nominal=list(nominalAvaptPolicy)

specifies the automatic variable analysis and grouping (AVAPT) nominal policy.

The nominalAvaptPolicy value can be one or more of the following:

cardinalityRatio=double

specifies the AVAPT nominal policy cardinality ratio threshold.

Default 0.25
Range (0–1]
cardinalityThreshold=double

specifies the AVAPT nominal policy cardinality threshold.

Default 1024
Minimum value 32
includeNegative=TRUE | FALSE

when set to True, includes numeric variables with some negative values in the nominal analysis.

Default FALSE
includeNonIntegral=TRUE | FALSE

when set to True, includes numeric variables with some nonintegral values in the nominal analysis.

Default FALSE
intervals=list("variable-name-1" <, "variable-name-2", ...>)

specifies variables to consider as intervals.

nominals=list("variable-name-1" <, "variable-name-2", ...>)

specifies variables to consider as nominals.

outlier=list(outlierAvaptPolicy)

specifies the automatic variable analysis and grouping (AVAPT) outlier policy.

The outlierAvaptPolicy value can be one or more of the following:

momentLowMediumCutoff=double

specifies the z-score outlier percentage threshold for the low-medium cutoff.

Default 1
Range 0–100
momentMediumHighCutoff=double

specifies the z-score outlier percentage threshold for the medium-high cutoff.

Default 2.5
Range 0–100
robustLowMediumCutoff=double

specifies the modified interquartile range outlier percentage threshold for the low-medium cutoff.

Default 1
Range 0–100
robustMediumHighCutoff=double

specifies the modified interquartile range outlier percentage threshold for the medium-high cutoff.

Default 2.5
Range 0–100
skewness=list(skewnessAvaptPolicy)

specifies the automatic variable analysis and grouping (AVAPT) skewness policy.

The skewnessAvaptPolicy value can be one or more of the following:

momentLowMediumCutoff=double

specifies the moment skewness threshold for the low-medium cutoff.

Default 2
Range 0–100
momentMediumHighCutoff=double

specifies the moment skewness threshold for the medium-high cutoff.

Default 10
Range 0–100
robustLowMediumCutoff=double

specifies the robust skewness threshold for the low-medium cutoff.

Default 0.75
Range 0–3
robustMediumHighCutoff=double

specifies the robust skewness threshold for the medium-high cutoff.

Default 2
Range 0–3
timeVariables=list("variable-name-1" <, "variable-name-2", ...>)

specifies the time variables.

Alias time
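
Several of the policies above share the same pattern: a low-medium cutoff and a medium-high cutoff divide a statistic into three groups. The following Python sketch illustrates that pattern with the documented default cutoffs. The function group_by_cutoffs is hypothetical, and the treatment of values that fall exactly on a cutoff is an assumption, because the descriptions above do not state it.

```python
def group_by_cutoffs(value, low_medium, medium_high):
    """Assign 'low', 'medium', or 'high' from two cutoffs.

    Assumption: a value equal to a cutoff falls into the higher group.
    """
    if low_medium > medium_high:
        raise ValueError("low-medium cutoff must not exceed medium-high cutoff")
    if value < low_medium:
        return "low"
    if value < medium_high:
        return "medium"
    return "high"


# Missing percentage with the missingAvaptPolicy defaults (5 and 25).
print(group_by_cutoffs(12.0, low_medium=5, medium_high=25))

# Cardinality with the cardinalityAvaptPolicy defaults (32 and 64).
print(group_by_cutoffs(200, low_medium=32, medium_high=64))
```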

* featureOut=list(casouttable)

specifies the CAS table to store the feature transformation and generation pipelines.

Alias featuresOut
Long form featureOut=list(name="table-name")
Shortcut form featureOut="table-name"

The casouttable value can be one or more of the following:

caslib="string"

specifies the name of the caslib for the output table.

indexVars=list("variable-name-1" <, "variable-name-2", ...>)

specifies the list of variables to create indexes for in the output data.

lifetime=64-bit-integer

specifies the number of seconds to keep the table in memory after it is last accessed. The table is dropped if it is not accessed for the specified number of seconds.

Default 0
Minimum value 0
memoryFormat="DVR" | "INHERIT" | "STANDARD"

specifies the memory format for the output table.

Default INHERIT
DVR

use the duplicate value reduction memory format. This memory format can reduce the memory consumption and file size when the input data contains duplicate values.

INHERIT

use the default memory format that is set for the server. By default, the server uses the standard memory format. If an administrator sets the CAS_DEFAULT_MEMORY_FORMAT environment variable to DVR, then the DVR memory format is set as the default for the server.

STANDARD

use the standard memory format.

name="table-name"

specifies the name for the output table.

promote=TRUE | FALSE

when set to True, adds the output table with a global scope. This enables other sessions to access the table, subject to access controls. The target caslib must also have a global scope.

Default FALSE
replace=TRUE | FALSE

when set to True, overwrites an existing table that has the same name.

Default FALSE
tableRedistUpPolicy="DEFER" | "NOREDIST" | "REBALANCE"

specifies the table redistribution policy to use when the number of worker pods increases on a running CAS server.

DEFER

defers the selection of the redistribution policy to a higher-level entity.

NOREDIST

does not redistribute table data when the number of worker pods changes on a running CAS server.

REBALANCE

rebalances table data when the number of worker pods changes on a running CAS server.

hyperParameterOptimizer="MODELCOMPOSER" | "TUNEALL"

specifies the method to use for hyperparameter optimization.

Alias hpOptimizer
Default TUNEALL

inputs=list( list(casinvardesc-1) <, list(casinvardesc-2), ...>)

specifies the variables to use for the analysis. You can specify a subset of the variables from the input table.

For more information about specifying the inputs parameter, see the common casinvardesc parameter (Appendix A: Common Parameters).

Alias vars

kFolds=integer

specifies the number of folds for cross validation.

Default 5
Range 2–10

logLevel=integer

specifies the logging level.

Default 0
Range 0–3

misraGries=TRUE | FALSE

when set to True, uses the Misra-Gries algorithm to estimate the frequency distribution if the distinct count limit is exceeded.

Default TRUE

modelTypes=list("DECISIONTREE", "FOREST", "GLM", "GRADBOOST", "LOGISTIC", "NEURALNET")

specifies the types of machine learning models to include in the pipeline exploration.

DECISIONTREE

specifies the decision tree model.

FOREST

specifies the random forest model.

GLM

specifies the generalized linear model.

GRADBOOST

specifies the gradient boosting model.

LOGISTIC

specifies the logistic regression model.

NEURALNET

specifies the neural network model.

objective="ASE" | "AUC" | "F1" | "MAE" | "MCE" | "MSLE" | "RASE" | "RMAE" | "RMSLE"

specifies the model performance metric to use.

ASE

uses the average square error.

AUC

uses the area under the receiver operating characteristic curve.

F1

uses the F1 coefficient.

MAE

uses the mean absolute error.

MCE

uses the misclassification error.

Alias MCR
MSLE

uses the mean square logarithmic error.

RASE

uses the root average square error.

RMAE

uses the root mean absolute error.

RMSLE

uses the root mean square logarithmic error.
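
For reference, the error-based metrics above can be written out in a few lines of Python. This is a sketch of the usual textbook definitions, not the action's implementation. In particular, the log(1 + x) form used for MSLE is an assumption, and AUC and F1 are omitted.

```python
import math


def ase(actual, predicted):
    """Average square error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def msle(actual, predicted):
    """Mean square logarithmic error, assuming the common log(1 + x) form."""
    return sum((math.log1p(a) - math.log1p(p)) ** 2
               for a, p in zip(actual, predicted)) / len(actual)


def mce(actual_levels, predicted_levels):
    """Misclassification error: the fraction of predictions that are wrong."""
    wrong = sum(a != p for a, p in zip(actual_levels, predicted_levels))
    return wrong / len(actual_levels)


# The "R" variants are square roots of the corresponding base metrics.
def rase(actual, predicted):
    return math.sqrt(ase(actual, predicted))


def rmae(actual, predicted):
    return math.sqrt(mae(actual, predicted))


def rmsle(actual, predicted):
    return math.sqrt(msle(actual, predicted))


y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 5.0]
print(ase(y_true, y_pred), rase(y_true, y_pred), mae(y_true, y_pred))
```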

* pipelineOut=list(casouttable)

specifies the CAS table to store the analysis results.

Alias pipelinesOut
Long form pipelineOut=list(name="table-name")
Shortcut form pipelineOut="table-name"

The casouttable value can be one or more of the following:

caslib="string"

specifies the name of the caslib for the output table.

indexVars=list("variable-name-1" <, "variable-name-2", ...>)

specifies the list of variables to create indexes for in the output data.

lifetime=64-bit-integer

specifies the number of seconds to keep the table in memory after it is last accessed. The table is dropped if it is not accessed for the specified number of seconds.

Default 0
Minimum value 0
memoryFormat="DVR" | "INHERIT" | "STANDARD"

specifies the memory format for the output table.

Default INHERIT
DVR

use the duplicate value reduction memory format. This memory format can reduce the memory consumption and file size when the input data contains duplicate values.

INHERIT

use the default memory format that is set for the server. By default, the server uses the standard memory format. If an administrator sets the CAS_DEFAULT_MEMORY_FORMAT environment variable to DVR, then the DVR memory format is set as the default for the server.

STANDARD

use the standard memory format.

name="table-name"

specifies the name for the output table.

promote=TRUE | FALSE

when set to True, adds the output table with a global scope. This enables other sessions to access the table, subject to access controls. The target caslib must also have a global scope.

Default FALSE
replace=TRUE | FALSE

when set to True, overwrites an existing table that has the same name.

Default FALSE
tableRedistUpPolicy="DEFER" | "NOREDIST" | "REBALANCE"

specifies the table redistribution policy to use when the number of worker pods increases on a running CAS server.

DEFER

defers the selection of the redistribution policy to a higher-level entity.

NOREDIST

does not redistribute table data when the number of worker pods changes on a running CAS server.

REBALANCE

rebalances table data when the number of worker pods changes on a running CAS server.

sampleSize=integer

specifies the maximum number of pipelines to sample.

Default 10
Minimum value 1

saveState=list(casouttable) | list(savePipelinesOptions)

specifies the CAS table to store the analysis results.

Alias saveModel

The casouttable value can be one or more of the following:

caslib="string"

specifies the name of the caslib for the output table.

compress=TRUE | FALSE

when set to True, applies data compression to the table.

Default FALSE
indexVars=list("variable-name-1" <, "variable-name-2", ...>)

specifies the list of variables to create indexes for in the output data.

label="string"

specifies the descriptive label to associate with the table.

lifetime=64-bit-integer

specifies the number of seconds to keep the table in memory after it is last accessed. The table is dropped if it is not accessed for the specified number of seconds.

Default 0
Minimum value 0
maxMemSize=64-bit-integer

specifies the maximum amount of memory, in bytes, that each thread should allocate for in-memory blocks before converting to a memory-mapped file. Files are written in the directories that are specified in the CAS_DISK_CACHE environment variable.

TIP You can enclose the value in quotation marks and specify B, K, M, G, or T as a suffix to indicate the units. For example, "8M" specifies eight megabytes.
memoryFormat="DVR" | "INHERIT" | "STANDARD"

specifies the memory format for the output table.

Default INHERIT
DVR

use the duplicate value reduction memory format. This memory format can reduce the memory consumption and file size when the input data contains duplicate values.

INHERIT

use the default memory format that is set for the server. By default, the server uses the standard memory format. If an administrator sets the CAS_DEFAULT_MEMORY_FORMAT environment variable to DVR, then the DVR memory format is set as the default for the server.

STANDARD

use the standard memory format.

name="table-name"

specifies the name for the output table.

promote=TRUE | FALSE

when set to True, adds the output table with a global scope. This enables other sessions to access the table, subject to access controls. The target caslib must also have a global scope.

Default FALSE
replace=TRUE | FALSE

when set to True, overwrites an existing table that has the same name.

Default FALSE
replication=integer

specifies the number of copies of the table to make for fault tolerance. Larger values result in slower performance and use more memory, but provide high availability for data in the event of a node failure. Data redundancy applies to distributed servers only.

Default 1
Minimum value 0
tableRedistUpPolicy="DEFER" | "NOREDIST" | "REBALANCE"

specifies the table redistribution policy to apply when the number of worker pods increases on a running CAS server.

DEFER

Defer redistribution policy selection to a higher-level entity.

NOREDIST

Do not redistribute table data when the number of worker pods changes on a running CAS server.

REBALANCE

Rebalance table data when the number of worker pods changes on a running CAS server.

threadBlockSize=64-bit-integer

specifies the number of bytes to use for blocks in the output table. The blocks are read by threads. Gradually increase this value when you have a large table with millions or billions of rows and you are tuning for performance. Larger values can increase performance with indexed tables. However, if the value is too large, then you can cause thread starvation due to too few blocks for threads to work on.

Alias blockSize
Default 1048576
Minimum value 0
TIP You can enclose the value in quotation marks and specify B, K, M, G, or T as a suffix to indicate the units. For example, "8M" specifies eight megabytes.
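
The suffix notation in the tip above can be sketched in Python as follows. This is only an illustration: the helper name parse_size is hypothetical, and the sketch assumes binary multiples (1K = 1024 bytes), which this reference does not state explicitly.

```python
# Hypothetical helper: converts values such as "8M" into a byte count.
# Assumption: suffixes are binary multiples (K = 1024 bytes).
_UNITS = {"B": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(value):
    """Return the number of bytes for an integer or a suffixed string."""
    if isinstance(value, int):
        return value
    text = value.strip().upper()
    if text and text[-1] in _UNITS:
        # Split the numeric part from the one-letter unit suffix.
        return int(text[:-1]) * _UNITS[text[-1]]
    return int(text)


print(parse_size("8M"))      # 8388608 under the binary-multiple assumption
print(parse_size(1048576))   # integers pass through unchanged
```

Under that assumption, "1M" equals 1048576, the documented default for threadBlockSize.
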
timeStamp="string"

specifies to add a timestamp column to the table. Support for timeStamp is action-specific. Specify the value in the form that is appropriate for your session locale.

where=list("string-1" <, "string-2", ...>)

specifies one or more expressions for subsetting the output data. When multiple expressions are specified, the expressions are effectively combined using AND to form the final output filter. If an expression contains quoted values, use nested quotation marks.

The savePipelinesOptions value can be one or more of the following:

modelNamePrefix="string"

specifies the prefix to use for the names of the saved models.

replace=TRUE | FALSE

when set to True, overwrites existing models that have the same name.

Default TRUE
topK=integer

specifies the number of best-performing models to save.

Default 5
Minimum value 1
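
As a rough illustration, the two alternative forms of saveState might be written as Python dictionaries like the ones below. The table name, caslib name, and model prefix are hypothetical placeholders, and check_top_k is an illustrative helper rather than part of any client library. Only the parameter names, defaults, and minimum come from the descriptions above.

```python
# Form 1: casouttable options (store the analysis results in a CAS table).
save_state_table = {
    "name": "automl_state",   # hypothetical table name
    "caslib": "casuser",      # hypothetical caslib
    "replace": True,          # documented default is FALSE
    "promote": False,         # documented default
}

# Form 2: savePipelinesOptions (save the best-performing models).
save_state_pipelines = {
    "modelNamePrefix": "pilot_",  # hypothetical prefix
    "replace": True,              # documented default
    "topK": 5,                    # documented default, minimum 1
}


def check_top_k(options):
    """Reject topK values below the documented minimum of 1."""
    if options.get("topK", 5) < 1:
        raise ValueError("topK must be at least 1")
    return options


check_top_k(save_state_pipelines)
```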

screenPolicy=list(sweeperPolicy)

specifies the variable screening policy to use for recommending that variables be screened out, transformed, or copied.

Alias sweeperPolicy

The sweeperPolicy value can be one or more of the following:

constant=TRUE | FALSE

when set to True, uses the variable screening policy to identify variables that have constant values.

Alias unique
Default TRUE
groupRareLevels=TRUE | FALSE

when set to True, uses the variable screening policy to identify nominal variables that have rare levels.

Alias groupRare
Default TRUE
leakagePercentThreshold=double

specifies the variable screening policy for variables that have a very high level of information about the target. Variables that have a greater target entropy percentage reduction than the specified threshold are flagged as leakage variables.

Alias leakagePercentageThreshold
Default 90
Range (0–100]
lowCv=TRUE | FALSE

when set to True, uses the variable screening policy to identify variables that have a low coefficient of variation (CV).

Alias lowCoefficientVariation
Default TRUE
lowMutualInformation=double

specifies the variable screening policy for variables that have a low level of information about the target.

Alias lowInformation
Default 0.05
Minimum value 0
missingIndicatorPercent=double

specifies the variable screening policy for generating missing indicator variables.

Alias missingIndicatorPercentage
Default 75
Range [10–100)
missingPercentThreshold=double

specifies the variable screening policy for identifying variables that have a very high missing rate.

Alias missingPercentageThreshold
Default 90
Range [10–100)
redundant=double

specifies the symmetric uncertainty (SU) threshold for identifying redundant variables. If the SU for two variables exceeds the threshold, the variable that has less information about the target is flagged as redundant.

Default 1
Range (0–1]
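
The defaults and ranges listed for the sweeperPolicy subparameters can be collected into a small client-side check, sketched below. The function name build_screen_policy and the two lookup tables are hypothetical. The server performs its own validation, so this sketch only mirrors the documented limits.

```python
# Documented defaults for the sweeperPolicy subparameters.
SCREEN_DEFAULTS = {
    "constant": True,
    "groupRareLevels": True,
    "leakagePercentThreshold": 90.0,
    "lowCv": True,
    "lowMutualInformation": 0.05,
    "missingIndicatorPercent": 75.0,
    "missingPercentThreshold": 90.0,
    "redundant": 1.0,
}

# (low, high, low_inclusive, high_inclusive); high=None means unbounded.
SCREEN_RANGES = {
    "leakagePercentThreshold": (0, 100, False, True),    # (0, 100]
    "lowMutualInformation": (0, None, True, True),       # minimum 0
    "missingIndicatorPercent": (10, 100, True, False),   # [10, 100)
    "missingPercentThreshold": (10, 100, True, False),   # [10, 100)
    "redundant": (0, 1, False, True),                    # (0, 1]
}


def build_screen_policy(**overrides):
    """Merge overrides into the defaults and check the documented ranges."""
    unknown = set(overrides) - set(SCREEN_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown screenPolicy options: {sorted(unknown)}")
    policy = {**SCREEN_DEFAULTS, **overrides}
    for key, (low, high, low_inc, high_inc) in SCREEN_RANGES.items():
        value = policy[key]
        low_ok = value >= low if low_inc else value > low
        high_ok = high is None or (value <= high if high_inc else value < high)
        if not (low_ok and high_ok):
            raise ValueError(f"{key}={value} is outside the documented range")
    return policy


policy = build_screen_policy(redundant=0.8, missingPercentThreshold=60)
```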

seed=integer

specifies a seed value for random number generation. This value is used for repeatable random number generation in some scenarios.

Default 0

selectionPolicy=list(featureSelectOptions)

specifies the feature selection policy.

Long form selectionPolicy=list(criterion="CHISQ" | "CRAMERSV" | "ENTROPY" | "FTEST" | "G2" | "IV" | "MI" | "NORMMI" | "PEARSON" | "SU")
Shortcut form selectionPolicy="CHISQ" | "CRAMERSV" | "ENTROPY" | "FTEST" | "G2" | "IV" | "MI" | "NORMMI" | "PEARSON" | "SU"

The featureSelectOptions value can be one or more of the following:

criterion="CHISQ" | "CRAMERSV" | "ENTROPY" | "FTEST" | "G2" | "IV" | "MI" | "NORMMI" | "PEARSON" | "SU"

specifies the filter feature selection criterion to use.

Alias stat
Default MI
CHISQ

uses the chi-square statistic.

CRAMERSV

uses Cramer's V.

ENTROPY

uses the entropy percentage decrease.

FTEST

uses the F test.

G2

uses the G2 statistic.

IV

uses the information value statistic.

MI

uses the mutual information statistic.

NORMMI

uses the normalized mutual information statistic.

PEARSON

uses the Pearson correlation.

SU

uses the symmetric uncertainty statistic.

topK=integer

specifies the number of features to select. The features that have the highest filter selection criterion values are selected. If both topK and topKPercent are specified, then topKPercent is used.

Default 50
Minimum value 1
topKPercent=double

specifies the percentage of features to select. The features that have the highest filter selection criterion values are selected. If both topK and topKPercent are specified, then topKPercent is used.

Alias topKPercentage
Range (0–100]
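
The shortcut form and the precedence of topKPercent over topK can be sketched as follows. Both function names are hypothetical. How the action converts a percentage into a whole number of features is not documented here, so the rounding-up step is an assumption made only for illustration.

```python
import math

CRITERIA = {"CHISQ", "CRAMERSV", "ENTROPY", "FTEST", "G2",
            "IV", "MI", "NORMMI", "PEARSON", "SU"}


def normalize_selection_policy(policy):
    """Expand the shortcut form (a bare criterion string) to the long form."""
    if isinstance(policy, str):
        policy = {"criterion": policy}
    result = dict(policy)
    result.setdefault("criterion", "MI")  # documented default
    if result["criterion"] not in CRITERIA:
        raise ValueError(f"unknown criterion: {result['criterion']}")
    return result


def features_to_keep(policy, n_features):
    """Number of features kept; topKPercent wins when both are present."""
    if "topKPercent" in policy:
        # Assumption: the percentage is rounded up to a whole feature.
        count = math.ceil(n_features * policy["topKPercent"] / 100)
        return min(n_features, count)
    return min(n_features, policy.get("topK", 50))  # documented default 50


policy = normalize_selection_policy("SU")
print(policy)  # {'criterion': 'SU'}
print(features_to_keep({"topK": 5, "topKPercent": 10}, 200))  # 20
```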

* table=list(castable)

specifies the table name, caslib, and other common parameters.

Long form table=list(name="table-name")
Shortcut form table="table-name"

The castable value can be one or more of the following:

caslib="string"

specifies the caslib for the input table that you want to use with the action. By default, the active caslib is used. Specify a value only if you need to access a table from a different caslib.

computedOnDemand=TRUE | FALSE

when set to True, creates the computed variables when the table is loaded instead of when the action begins.

Alias compOnDemand
Default FALSE
computedVars=list( list(casinvardesc-1) <, list(casinvardesc-2), ...>)

specifies the names of the computed variables to create. Specify an expression for each variable in the computedVarsProgram parameter. If you do not specify this parameter, then all variables from computedVarsProgram are automatically included.

Alias compVars

The casinvardesc value can be one or more of the following:

format="string"

specifies the format to apply to the variable.

formattedLength=integer

specifies the length of the format field plus the length of the format precision.

label="string"

specifies the descriptive label for the variable.

* name="variable-name"

specifies the name for the variable.

nfd=integer

specifies the length of the format precision.

nfl=integer

specifies the length of the format field.

computedVarsProgram="string"

specifies an expression for each computed variable that you include in the computedVars parameter.

Alias compPgm
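
The relationship between computedVars and computedVarsProgram can be illustrated with a hypothetical table specification. The table and column names below are invented, the assignment syntax in the program string is assumed to follow the usual SAS expression form, and missing_expressions is a deliberately naive helper that only splits on semicolons and equal signs.

```python
table = {
    "name": "customers",                       # hypothetical input table
    "computedVars": [{"name": "debt_ratio"}],  # names of variables to create
    # One assignment per computed variable; column names are hypothetical.
    "computedVarsProgram": "debt_ratio = debt / income;",
}


def computed_names(table_spec):
    """Names listed in computedVars, in order."""
    return [v["name"] for v in table_spec.get("computedVars", [])]


def missing_expressions(table_spec):
    """Listed names that never appear as an assignment target in the program."""
    program = table_spec.get("computedVarsProgram", "")
    targets = {
        stmt.split("=", 1)[0].strip()
        for stmt in program.split(";")
        if "=" in stmt
    }
    return [name for name in computed_names(table_spec) if name not in targets]


print(missing_expressions(table))  # [] because debt_ratio has an expression
```
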
dataSourceOptions=list(key-1=list(any-list-or-data-type-1) <, key-2=list(any-list-or-data-type-2), ...>)

specifies data source options.

Aliases options, dataSource
importOptions=list(fileType="ANY" | "AUDIO" | "AUTO" | "BASESAS" | "CSV" | "DOCUMENT" | "DTA" | "ESP" | "EXCEL" | "FMT" | "HDAT" | "IMAGE" | "JMP" | "LASR" | "PARQUET" | "SPSS" | "VIDEO" | "XLS", fileType-specific-parameters)

specifies the settings for reading a table from a data source.

Alias import

For more information about specifying the importOptions parameter, see the common importOptions parameter (Appendix A: Common Parameters).

* name="table-name"

specifies the name of the input table.

singlePass=TRUE | FALSE

when set to True, does not create a transient table on the server. Setting this parameter to True can be efficient, but the data might not have stable ordering upon repeated runs.

Default FALSE
vars=list( list(casinvardesc-1) <, list(casinvardesc-2), ...>)

specifies the variables to use in the action.

The casinvardesc value can be one or more of the following:

format="string"

specifies the format to apply to the variable.

formattedLength=integer

specifies the length of the format field plus the length of the format precision.

label="string"

specifies the descriptive label for the variable.

* name="variable-name"

specifies the name for the variable.

nfd=integer

specifies the length of the format precision.

nfl=integer

specifies the length of the format field.

where="where-expression"

specifies an expression for subsetting the input data.

whereTable=list(groupbytable)

specifies an input table that contains rows to use as a WHERE filter. If the vars parameter is not specified, then all the variable names that are common to the input table and the filtering table are used to find matching rows. If the where parameter for the input table and this parameter are specified, then this filtering table is applied first.

The groupbytable value can be one or more of the following:

casLib="string"

specifies the caslib for the filter table. By default, the active caslib is used.

dataSourceOptions=list(adls_noreq-parameters | bigquery-parameters | cas_noreq-parameters | clouddex-parameters | db2-parameters | dnfs-parameters | esp-parameters | fedsvr-parameters | gcs_noreq-parameters | hadoop-parameters | hana-parameters | impala-parameters | informix-parameters | jdbc-parameters | mongodb-parameters | mysql-parameters | odbc-parameters | oracle-parameters | path-parameters | postgres-parameters | redshift-parameters | s3-parameters | sapiq-parameters | sforce-parameters | singlestore_standard-parameters | snowflake-parameters | spark-parameters | spde-parameters | sqlserver-parameters | ss_noreq-parameters | teradata-parameters | vertica-parameters | yellowbrick-parameters)

specifies data source options.

Aliases options, dataSource

For more information about specifying the dataSourceOptions parameter, see the common dataSourceOptions parameter (Appendix A: Common Parameters).

importOptions=list(fileType="ANY" | "AUDIO" | "AUTO" | "BASESAS" | "CSV" | "DOCUMENT" | "DTA" | "ESP" | "EXCEL" | "FMT" | "HDAT" | "IMAGE" | "JMP" | "LASR" | "PARQUET" | "SPSS" | "VIDEO" | "XLS", fileType-specific-parameters)

specifies the settings for reading a table from a data source.

Alias import

For more information about specifying the importOptions parameter, see the common importOptions parameter (Appendix A: Common Parameters).

* name="table-name"

specifies the name of the filter table.

vars=list( list(casinvardesc-1) <, list(casinvardesc-2), ...>)

specifies the variable names to use from the filter table.

The casinvardesc value can be one or more of the following:

format="string"

specifies the format to apply to the variable.

formattedLength=integer

specifies the length of the format field plus the length of the format precision.

label="string"

specifies the descriptive label for the variable.

* name="variable-name"

specifies the name for the variable.

nfd=integer

specifies the length of the format precision.

nfl=integer

specifies the length of the format field.

where="where-expression"

specifies an expression for subsetting the data from the filter table.
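
The matching behavior described for whereTable can be pictured with a pure-Python sketch that runs on in-memory rows. It does not call the server, and the function name, column names, and data are hypothetical. It shows the two documented rules: when vars is not given, the variable names common to both tables are used, and the filtering table is applied before the where expression.

```python
def apply_where_table(rows, filter_rows, var_names=None):
    """Keep rows whose values match some filter row on the join variables."""
    if not rows or not filter_rows:
        return []
    if var_names is None:
        # No vars given: use every variable name common to both tables.
        var_names = sorted(set(rows[0]) & set(filter_rows[0]))
    wanted = {tuple(f[v] for v in var_names) for f in filter_rows}
    return [r for r in rows if tuple(r[v] for v in var_names) in wanted]


rows = [
    {"region": "EU", "year": 2023, "sales": 10},
    {"region": "US", "year": 2023, "sales": 20},
    {"region": "EU", "year": 2024, "sales": 30},
]
filter_rows = [{"region": "EU", "year": 2024}]

# The filtering table is applied first, then the where expression.
step1 = apply_where_table(rows, filter_rows)
step2 = [r for r in step1 if r["sales"] > 5]
print(step2)
```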

* target="variable-name"

specifies the target variable.

Alias evalVar

topKPipelines=integer

specifies the number of best-performing pipelines to save.

Default 10
Minimum value 1

* transformationOut=list(casouttable)

specifies the CAS table to store the feature transformation and generation pipelines.

Alias transformationsOut
Long form transformationOut=list(name="table-name")
Shortcut form transformationOut="table-name"

The casouttable value can be one or more of the following:

caslib="string"

specifies the name of the caslib for the output table.

indexVars=list("variable-name-1" <, "variable-name-2", ...>)

specifies the list of variables to create indexes for in the output data.

lifetime=64-bit-integer

specifies the number of seconds to keep the table in memory after it is last accessed. The table is dropped if it is not accessed for the specified number of seconds.

Default 0
Minimum value 0
memoryFormat="DVR" | "INHERIT" | "STANDARD"

specifies the memory format for the output table.

Default INHERIT
DVR

use the duplicate value reduction memory format. This memory format can reduce the memory consumption and file size when the input data contains duplicate values.

INHERIT

use the default memory format that is set for the server. By default, the server uses the standard memory format. If an administrator sets the CAS_DEFAULT_MEMORY_FORMAT environment variable to DVR, then the DVR memory format is set as the default for the server.

STANDARD

use the standard memory format.

name="table-name"

specifies the name for the output table.

promote=TRUE | FALSE

when set to True, adds the output table with a global scope. This enables other sessions to access the table, subject to access controls. The target caslib must also have a global scope.

Default FALSE
replace=TRUE | FALSE

when set to True, overwrites an existing table that has the same name.

Default FALSE
tableRedistUpPolicy="DEFER" | "NOREDIST" | "REBALANCE"

specifies the table redistribution policy to apply when the number of worker pods increases on a running CAS server.

DEFER

Defer redistribution policy selection to a higher-level entity.

NOREDIST

Do not redistribute table data when the number of worker pods changes on a running CAS server.

REBALANCE

Rebalance table data when the number of worker pods changes on a running CAS server.

transformationPolicy=list(transformationSpace)

specifies the feature transformation and generation space in which the feature machine operates.

Alias transformationSpace

The transformationSpace value can be one or more of the following:

cardinality=TRUE | FALSE

when set to True, includes cardinality-reducing transformations.

Default TRUE
entropy=TRUE | FALSE

when set to True, includes transformations for the treatment of low entropy.

Default FALSE
interaction=TRUE | FALSE

when set to True, detects and generates interaction features.

Default FALSE
iqv=TRUE | FALSE

when set to True, includes transformations for the treatment of low indices of qualitative variation (IQV).

Default FALSE
kurtosis=TRUE | FALSE

when set to True, includes transformations for the treatment of high kurtosis.

Default FALSE
missing=TRUE | FALSE

when set to True, includes transformations for the treatment of missing values.

Default TRUE
outlier=TRUE | FALSE

when set to True, includes transformations for the treatment of outliers.

Default FALSE
skewness=TRUE | FALSE

when set to True, includes transformations for the treatment of high skewness.

Default TRUE
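
The defaults above mean that only cardinality, missing, and skewness transformations are included unless other options are turned on. A minimal sketch of overriding those defaults on the client side follows. The helper names are hypothetical; the option names and default values are taken from the list above.

```python
# Documented defaults for the transformationSpace subparameters.
TRANSFORMATION_DEFAULTS = {
    "cardinality": True,
    "entropy": False,
    "interaction": False,
    "iqv": False,
    "kurtosis": False,
    "missing": True,
    "outlier": False,
    "skewness": True,
}


def build_transformation_policy(**overrides):
    """Merge user overrides into the documented defaults."""
    unknown = set(overrides) - set(TRANSFORMATION_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown transformationPolicy options: {sorted(unknown)}")
    return {**TRANSFORMATION_DEFAULTS, **overrides}


def enabled(policy):
    """Names of the transformation families that are switched on."""
    return sorted(name for name, is_on in policy.items() if is_on)


print(enabled(build_transformation_policy()))
# ['cardinality', 'missing', 'skewness']
print(enabled(build_transformation_policy(outlier=True, skewness=False)))
# ['cardinality', 'missing', 'outlier']
```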

validationPartitionFraction=double

specifies the fraction of the input data to use for validation.

Default 0.3
Range 0.01–0.99
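
To get a feel for the value, the sketch below estimates how many rows would fall into each partition for a given fraction. The function name is hypothetical, the range is treated as inclusive, and rounding to the nearest row is an assumption; the action might partition the data differently, for example by sampling.

```python
def validation_rows(n_rows, fraction=0.3):
    """Return approximate (training, validation) row counts."""
    if not 0.01 <= fraction <= 0.99:
        raise ValueError("fraction must be between 0.01 and 0.99")
    # Assumption: the validation count is rounded to the nearest row.
    n_valid = round(n_rows * fraction)
    return n_rows - n_valid, n_valid


print(validation_rows(1000))        # (700, 300) with the default of 0.3
print(validation_rows(1000, 0.25))  # (750, 250)
```
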
Last updated: November 23, 2025