The CCDM Procedure

Example 5.2 Using Externally Simulated Count Data

The COUNTREG procedure in SAS/ETS and the CNTSELECT procedure in SAS Econometrics enable you to estimate count regression models that are based on the most commonly used discrete distributions, such as the Poisson, negative binomial (both p = 1 and p = 2), and Conway-Maxwell-Poisson distributions. Those procedures also enable you to fit zero-inflated models that are based on Poisson, negative binomial (p = 2), and Conway-Maxwell-Poisson distributions. However, you might encounter situations in which you want to use a different method of fitting count regression models. For example, if you are modeling the number of loss events that are incurred by two financial instruments such that there is some dependency between the two, then you might use multivariate frequency modeling methods and simulate the counts for each instrument by using the dependency structure between the count model parameters of the two instruments. As another example, you might want to use different types of count models for different BY groups in your data; this is not possible in PROC COUNTREG or PROC CNTSELECT. So you need to simulate the counts for such BY groups externally. The CCDM procedure enables you to supply externally simulated counts by using the EXTERNALCOUNTS statement. PROC CCDM then does not need to simulate the counts internally; it simulates only the severity of each loss event by using the severity model estimates that you specify in the SEVERITYEST= data table or the SEVERITYSTORE= item store. The simulation process is described and illustrated in the section Simulation with External Counts.

Suppose that you are a bank, and as part of quantifying your operational risk, you want to estimate the aggregate loss distributions for two lines of business, retail banking and commercial banking, by using some key risk indicators (KRIs). Assume that your model fitting and model selection process has determined that the Poisson regression model and the negative binomial regression model are the best-fitting count models for the number of loss events that are incurred in the retail banking and commercial banking lines, respectively. Let CorpKRI1, CorpKRI2, CbKRI1, CbKRI2, and CbKRI3 be the KRIs that you use in the count regression model of the commercial banking line, and let CorpKRI1, RbKRI1, and RbKRI2 be the KRIs that you use in the count regression model of the retail banking line. Examples of corporate-level KRIs (CorpKRI1 and CorpKRI2 in this example) are the ratio of temporary to permanent employees and the number of security breaches that are reported during a year. Examples of KRIs that are specific to the commercial banking business (CbKRI1, CbKRI2, and CbKRI3 in this example) are the number of credit defaults, the proportion of financed assets that are movable, and the penalty claims against your bank because of processing delays. Examples of KRIs that are specific to the retail banking business (RbKRI1 and RbKRI2 in this example) are the number of credit cards that are reported stolen, the fraction of employees who have not undergone fraud detection training, and the number of forged drafts and checks that are presented in a year.

Let the severity of each loss event in the commercial banking business be dependent on two KRIs, CorpKRI1 and CbKRI2. Let the severity of each loss event in the retail banking business be dependent on three KRIs, CorpKRI2, RbKRI1, and RbKRI3. Note that for each line of business, the set of KRIs that are used for the severity model is different from the set of KRIs that are used for the count model, although the two sets overlap. Further, the severity model for retail banking includes a new regressor (RbKRI3) that is not used for any of the count models. Such use of different sets of KRIs for count and severity models is typical of real-world applications.

Let the parameter estimates of the negative binomial and Poisson regression models be available in the data sets Work.CountEstEx2NB2 and Work.CountEstEx2Poisson, respectively. The following statements produce them by using the OUTEST= option in the respective PROC COUNTREG statements:

/* Simulate count data */
data cntdataex2(keep=line corpKRI1 corpKRI2
                     cbKRI1 cbKRI2 cbKRI3 rbKRI1 rbKRI2 numloss);
   call streaminit(12345);
   array cx{7} corpKRI1 corpKRI2 cbKRI1 cbKRI2 cbKRI3 rbKRI1 rbKRI2;
   array cbetaR{8} _TEMPORARY_ (0.35 1 0 0 0 0 0.5 0.25);
   array cbetaC{8} _TEMPORARY_ (0.9 0.75 0.3 0.1 0.25 0.5 0 0);

   alpha = 0.3;
   theta = 1/alpha;
   do obs=1 to 5000;
      do i=1 to dim(cx);
         cx(i) = rand('NORMAL');
      end;

      line = 'CommercialBanking';
      xbeta = cbetaC(1);
      do i=1 to dim(cx);
         xbeta = xbeta + cx(i) * cbetaC(i+1);
      end;
      Mu = exp(xbeta);
      p = theta/(Mu+theta);
      numloss = rand('NEGB',p,theta);
      output;

      line = 'RetailBanking';
      xbeta = cbetaR(1);
      do i=1 to dim(cx);
         xbeta = xbeta + cx(i) * cbetaR(i+1);
      end;
      lambda = exp(xbeta);
      numloss = rand('POISSON', lambda);
      output;
   end;
run;

proc sort data=cntdataex2;
   by line;
run;

/* Fit negative binomial (p=2) regression model for each business line */
proc countreg data=cntdataex2 outest=countEstEx2NB2;
   by line;
   model numloss = corpKRI1 corpKRI2 cbKRI1 cbKRI2 cbKRI3
                   rbKRI1 rbKRI2 / dist=negbin;
run;

/* Fit Poisson regression model for each business line */
proc countreg data=cntdataex2 outest=countEstEx2Poisson;
   by line;
   model numloss = corpKRI1 corpKRI2 cbKRI1 cbKRI2 cbKRI3
                   rbKRI1 rbKRI2 / dist=poisson;
run;

Note: PROC CNTSELECT does not support the OUTEST= option, so unlike other examples that use PROC CNTSELECT to estimate count models, the preceding statements use PROC COUNTREG.
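The DATA step that simulates the count data converts the mean Mu and the dispersion alpha into the two arguments of RAND('NEGB', p, k). As a check on that conversion, the following derivation assumes the usual definition in which the negative binomial variate counts the failures before the k-th success, with mean k(1-p)/p and variance k(1-p)/p^2. Under that assumption, the simulated count has mean Mu and the quadratic (p = 2) variance form:

```latex
\begin{align*}
k &= \theta = \frac{1}{\alpha}, \qquad
p = \frac{\theta}{\mu + \theta}
\quad\Longrightarrow\quad \frac{1-p}{p} = \frac{\mu}{\theta} \\[4pt]
\mathrm{E}[N] &= \frac{k(1-p)}{p} = \theta \cdot \frac{\mu}{\theta} = \mu \\[4pt]
\mathrm{Var}[N] &= \frac{k(1-p)}{p^{2}} = \frac{\mu}{p}
                = \frac{\mu(\mu + \theta)}{\theta} = \mu + \alpha\,\mu^{2}
\end{align*}
```

The same conversion is reused later in the example when counts are simulated from the fitted estimates, with alpha taken from the _Alpha estimate.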

Let the parameter estimates of the best-fitting severity models, as determined by PROC SEVSELECT, be available in the data table mycas.SevEstEx2, as prepared by the following statements:

/* Simulate severity data */
data sevdataEx2(keep=line corpKRI1 corpKRI2 cbKRI2 rbKRI1 rbKRI3 lossValue);
   array sx{5} corpKRI1 corpKRI2 cbKRI2 rbKRI1 rbKRI3;
   array sbetaC{6} _TEMPORARY_ (5 1 0 0.3 0 0);
   array sbetaR{6} _TEMPORARY_ (3.5 0 0.5 0 -0.8 0.6);
   if (_n_=1) then call streaminit(67890);

   set cntdataex2(keep=line numloss corpKRI1 corpKRI2 cbKRI2 rbKRI1);
   sigma = 1;
   alpha = 2.5;
   if (numloss > 0) then do;
      sx(5) = rand('NORMAL'); /* simulate rbKRI3 value */

      if (line='CommercialBanking') then do;
         /* lognormal */
         Mu = sbetaC(1);
         do i=1 to dim(sx);
            Mu = Mu + sx(i) * sbetaC(i+1);
         end;
         lossValue = exp(Mu) * rand('LOGNORMAL')**Sigma;
      end;
      else do;
         /* gamma */
         Mu = sbetaR(1);
         do i=1 to dim(sx);
            Mu = Mu + sx(i) * sbetaR(i+1);
         end;
         lossValue = quantile('Gamma', rand('UNIFORM'), Alpha, exp(Mu));
      end;
      output;
   end;
run;

/* Load data into the CAS server */
data mycas.sevdataEx2;
  set sevdataEx2;
run;

/* Fit severity models for each business line */
proc sevselect data=mycas.sevdataEx2 outest=mycas.sevestEx2;
   by line;
   loss lossValue;
   scalemodel corpKRI1 corpKRI2 cbKRI2 rbKRI1 rbKRI3;
   dist logn gamma;
run;
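Reading off the DATA step that simulates the severity data (not the fitted estimates), the two generating severity distributions can be written as follows. The symbols X_C, X_R, mu_C, and mu_R are notation introduced here for illustration only, and Z denotes a standard normal variate:

```latex
\begin{align*}
\mu_C &= 5 + 1.0\,\mathrm{corpKRI1} + 0.3\,\mathrm{cbKRI2}, &
X_C &= e^{\mu_C}\,\bigl(e^{Z}\bigr)^{\sigma} = e^{\mu_C + \sigma Z},
       \quad \sigma = 1 \\
\mu_R &= 3.5 + 0.5\,\mathrm{corpKRI2} - 0.8\,\mathrm{rbKRI1}
         + 0.6\,\mathrm{rbKRI3}, &
X_R &\sim \mathrm{Gamma}\bigl(\text{shape} = 2.5,\ \text{scale} = e^{\mu_R}\bigr)
\end{align*}
```

In both cases the KRIs enter only through the scale of the distribution, which is consistent with listing them in the SCALEMODEL statement. The regressors whose generating coefficients are zero are still offered to the fit, so their estimates are expected to be near zero rather than exactly zero.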

Now, consider that you want to estimate the distribution of the aggregate loss for a scenario, which is represented by a specific set of KRI values. The following DATA step illustrates one such scenario:

/* Generate a scenario data table for a single operating condition */
data singleScenario (keep=corpKRI1 corpKRI2 cbKRI1 cbKRI2 cbKRI3
                          rbKRI1 rbKRI2 rbKRI3);
   array x{8} corpKRI1 corpKRI2 cbKRI1 cbKRI2 cbKRI3 rbKRI1 rbKRI2 rbKRI3;
   call streaminit(5151);
   do i=1 to dim(x);
      x(i) = rand('NORMAL');
   end;
   output;
run;

/* Load data into the CAS server */
data mycas.singleScenario;
  set singleScenario;
run;

The data table mycas.SingleScenario contains all the KRIs that are included in the count and severity models of both business lines. Note that if you standardize or scale the KRIs while fitting the count and severity models, then you must apply the same standardization or scaling method to the values of the KRIs that you specify in the scenario. In this particular example, all KRIs are assumed to be standardized.
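If your KRIs are not already standardized, one possible way to reuse the fitting-time location and scale for the scenario is sketched below. It assumes that SAS/STAT is available and uses PROC STDIZE; the data set names Work.KriHistory, Work.KriHistoryStd, Work.KriStats, Work.RawScenario, and Work.ScenarioStd are hypothetical and do not appear elsewhere in this example.

```sas
/* Hypothetical inputs:
     work.kriHistory  - unstandardized KRIs used to fit the models
     work.rawScenario - unstandardized KRIs for the scenario         */

/* Standardize the fitting data and save the location and scale */
proc stdize data=work.kriHistory method=std
            out=work.kriHistoryStd outstat=work.kriStats;
   var corpKRI1 corpKRI2 cbKRI1 cbKRI2 cbKRI3 rbKRI1 rbKRI2 rbKRI3;
run;

/* Apply the saved location and scale to the scenario values */
proc stdize data=work.rawScenario method=in(work.kriStats)
            out=work.scenarioStd;
   var corpKRI1 corpKRI2 cbKRI1 cbKRI2 cbKRI3 rbKRI1 rbKRI2 rbKRI3;
run;
```

The point of the second step is that the scenario is transformed with the statistics of the fitting data, not with statistics computed from the scenario itself.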

The following DATA step uses the scenario in the data table mycas.SingleScenario to simulate 10,000 replications of the number of loss events that you might observe for each business line and writes the simulated counts to the variable NumLoss of the data table mycas.LossCounts1:

/* Simulate multiple replications of the number of loss events that
   you can expect in the scenario being analyzed */
data lossCounts1 (keep=line corpKRI1 corpKRI2 cbKRI2 rbKRI1 rbKRI3 numloss);
   array cxR{3} corpKRI1 rbKRI1 rbKRI2;
   array cbetaR{4} _TEMPORARY_;
   array cxC{5} corpKRI1 corpKRI2 cbKRI1 cbKRI2 cbKRI3;
   array cbetaC{6} _TEMPORARY_;

   retain theta;
   if _n_ = 1 then do;
      call streaminit(5151);
      * read count model estimates *;
      set countEstEx2NB2(where=(line='CommercialBanking' and _type_='PARM'));
      cbetaC(1) = Intercept;
      do i=1 to dim(cxC);
         cbetaC(i+1) = cxC(i);
      end;
      alpha = _Alpha;
      theta = 1/alpha;

      set countEstEx2Poisson(where=(line='RetailBanking' and _type_='PARM'));
      cbetaR(1) = Intercept;
      do i=1 to dim(cxR);
         cbetaR(i+1) = cxR(i);
      end;
   end;

   set singleScenario;
   do iline=1 to 2;
      if (iline=1) then line = 'CommercialBanking';
      else line = 'RetailBanking';
      do repid=1 to 10000;
         * draw from count distribution *;
         if (iline=1) then do;
            xbeta = cbetaC(1);
            do i=1 to dim(cxC);
               xbeta = xbeta + cxC(i) * cbetaC(i+1);
            end;
            Mu = exp(xbeta);
            p = theta/(Mu+theta);
            numloss = rand('NEGB',p,theta);
         end;
         else do;
            xbeta = cbetaR(1);
            do i=1 to dim(cxR);
               xbeta = xbeta + cxR(i) * cbetaR(i+1);
            end;
            numloss = rand('POISSON', exp(xbeta));
         end;
         output;
      end;
   end;
run;

/* Load data into the CAS server */
data mycas.lossCounts1;
  set lossCounts1;
run;

The mycas.LossCounts1 data table contains the variable NumLoss in addition to the KRIs that the severity regression model uses, which PROC CCDM needs in order to simulate the aggregate loss.
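Before you pass the counts to PROC CCDM, it can be useful to summarize them. The following step is a minimal sketch that uses only the Work.LossCounts1 data set created by the preceding DATA step. For each business line, N should equal the number of replications, and Sum should agree with the Total Count value that PROC CCDM reports in its "Input Data Summary" table.

```sas
/* Summarize the externally simulated counts for each business line.
   Sum is the total number of loss events across all replications.  */
proc means data=lossCounts1 n mean var sum maxdec=3;
   class line;
   var numloss;
run;
```

For the Poisson counts of the retail banking line, the sample mean and variance should be close to each other; for the negative binomial counts of the commercial banking line, the variance should exceed the mean.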

Now, you are ready to estimate the aggregate loss distribution for each line of business by submitting the following PROC CCDM step, in which you specify the EXTERNALCOUNTS statement to request that external counts in the variable NumLoss of the DATA= data table be used for simulation of the aggregate loss:

/* Estimate the distribution of the aggregate loss for both
   lines of business by using the externally simulated counts */
proc ccdm data=mycas.lossCounts1 seed=13579 print=all
          severityest=mycas.sevestEx2;
   by line;
   externalcounts count=numloss;
   severitymodel logn gamma;
run;

Each observation in the mycas.LossCounts1 data table represents one replication of the external counts simulation process. For each such replication, the preceding PROC CCDM step makes as many severity draws from the severity distribution as the value of the NumLoss variable, and it adds together the severity values from those draws to compute one sample point of the aggregate loss. The severity distribution that is used for making the severity draws has a scale parameter value that is determined by the KRI values in the given observation and by the regression parameter values that are read from the mycas.SevEstEx2 data table.
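The computation of one sample point that the preceding paragraph describes can be written compactly as follows. The notation is introduced here for illustration: N_r is the value of NumLoss in replication r, and X_{r,j} is the j-th severity draw of that replication.

```latex
S_r \;=\; \sum_{j=1}^{N_r} X_{r,j},
\qquad r = 1, \dots, 10000,
\qquad S_r = 0 \ \text{ when } \ N_r = 0
```

A replication that has no loss events contributes an aggregate loss of zero, which is consistent with the minimum and the lower percentiles of 0 in the output tables of this example.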

The summary statistics and percentiles of the aggregate loss distribution for the commercial banking business, which uses the lognormal severity model, are shown in Output 5.2.1. The "Input Data Summary" table indicates that each of the 10,000 observations in the BY group is treated as one replication and that a total of 19,028 loss events were produced by all the replications together. For the scenario in the data table mycas.SingleScenario, you can expect the commercial banking line to incur an average aggregate loss of 600 units, as shown in the "Sample Summary Statistics" table, and the chance that the loss will exceed 4,428 units is 0.5%, as shown in the "Sample Percentiles" table.

Output 5.2.1: Aggregate Loss Summary for Commercial Banking Line

The CCDM Procedure

line=CommercialBanking

Input Data Summary
Name	LOSSCOUNTS1
Observations	10000
Valid Observations	10000
Replications	10000
Total Count	19028

The CCDM Procedure

Severity Model: Logn

Count Model: External

line=CommercialBanking

Sample Summary Statistics
Mean	599.56756	Median	344.00869
Standard Deviation	786.88364	Interquartile Range	785.04322
Variance	619185.9	Minimum	0
Skewness	2.80654	Maximum	11176.8
Kurtosis	13.58752	Sample Size	10000

line=CommercialBanking

Sample Percentiles
Percentile	Value
1	0
5	0
25	52.25800
50	344.00869
75	837.30121
95	2107.5
99	3637.3
99.5	4428.4
Percentile Method = 5

For the retail banking line, which uses the gamma severity model, the "Sample Percentiles" table in Output 5.2.2 indicates that the median operational loss of that line of business is about 70 units and the chance that the loss will exceed 378 units is about 1%.

Output 5.2.2: Aggregate Loss Percentiles for Retail Banking Line

line=RetailBanking

Sample Percentiles
Percentile	Value
1	0
5	0
25	0
50	70.46309
75	142.15990
95	271.76555
99	378.32158
99.5	437.12308
Percentile Method = 5

When you conduct simulation and estimation for a scenario that contains only one observation, you assume that the operating environment does not change over the period of time that is being analyzed. That assumption might be valid for shorter durations and stable business environments, but often the operating environments change, especially if you are estimating the aggregate loss over a longer period of time. So you might want to include in your scenario all the possible operating environments that you expect to see during the analysis time period. Each environment is characterized by its own set of KRI values. For example, the operating conditions might change from quarter to quarter, and you might want to estimate the aggregate loss distribution for the entire year. You start the estimation process for such scenarios by creating a scenario data table. The following DATA step creates the data table mycas.MultiConditionScenario, which consists of four operating environments, one for each quarter:

/* Generate a scenario data table for multiple operating conditions */
data multiConditionScenario (keep=opEnvId corpKRI1 corpKRI2
      cbKRI1 cbKRI2 cbKRI3 rbKRI1 rbKRI2 rbKRI3);
   array x{8} corpKRI1 corpKRI2 cbKRI1 cbKRI2 cbKRI3 rbKRI1 rbKRI2 rbKRI3;
   call streaminit(5151);
   do opEnvId=1 to 4;
      do i=1 to dim(x);
         x(i) = rand('NORMAL');
      end;
      output;
   end;
run;

/* Load data into the CAS server */
data mycas.multiConditionScenario;
  set multiConditionScenario;
run;

All four observations of the data table mycas.MultiConditionScenario together form one scenario. When you simulate the external counts for such multi-entity scenarios, one replication consists of the possible number of loss events that can occur as a result of all four operating environments. In any given replication, some operating environments might not produce any loss event, or all four operating environments might produce some loss events. The following DATA step creates the data table mycas.LossCounts2, which contains, for each business line, 10,000 replications of the loss counts, and which identifies each replication by using the variable RepId:

/* Simulate multiple replications of the number of loss events that you can
   expect for all operating environments in the scenario being analyzed */
data lossCounts2 (keep=line opEnvId corpKRI1 corpKRI2 cbKRI2
                       rbKRI1 rbKRI3 repid numloss);
   array cxR{3} corpKRI1 rbKRI1 rbKRI2;
   array cbetaR{4} _TEMPORARY_;
   array cxC{5} corpKRI1 corpKRI2 cbKRI1 cbKRI2 cbKRI3;
   array cbetaC{6} _TEMPORARY_;

   /* Read the count model estimates */
   retain theta;
   if _n_ = 1 then do;
      call streaminit(5151);
      set countEstEx2NB2(where=(line='CommercialBanking' and _type_='PARM'));
      cbetaC(1) = Intercept;
      do i=1 to dim(cxC);
         cbetaC(i+1) = cxC(i);
      end;
      alpha = _Alpha;
      theta = 1/alpha;

      set countEstEx2Poisson(where=(line='RetailBanking' and _type_='PARM'));
      cbetaR(1) = Intercept;
      do i=1 to dim(cxR);
         cbetaR(i+1) = cxR(i);
      end;
   end;

   /* Find the number of observations in the scenario data set */
   nobs = 0;
   do while(last=0);
     set multiConditionScenario end=last;
     nobs = nobs+1;
   end;
   nobstotal=nobs;

   do iline=1 to 2;
      if (iline=1) then line = 'CommercialBanking';
      else line = 'RetailBanking';
      do repid=1 to 10000;
         do nobs=1 to nobstotal;
            set multiConditionScenario point=nobs;
            /* Draw from the appropriate count distribution */
            if (line = 'CommercialBanking') then do;
               xbeta = cbetaC(1);
               do i=1 to dim(cxC);
                  xbeta = xbeta + cxC(i) * cbetaC(i+1);
               end;
               Mu = exp(xbeta);
               p = theta/(Mu+theta);
               numloss = rand('NEGB',p,theta);
            end;
            else if (line = 'RetailBanking') then do;
               xbeta = cbetaR(1);
               do i=1 to dim(cxR);
                  xbeta = xbeta + cxR(i) * cbetaR(i+1);
               end;
               numloss = rand('POISSON', exp(xbeta));
            end;
            output;
         end;
      end;
   end;
   /* All replications are written above; end the DATA step explicitly */
   stop;
run;

To use the replication identifier variable with PROC CCDM, you need to partition the input data table of counts. The following DATA step loads the data set Work.LossCounts2 into a data table in your CAS session that is associated with the mycas CAS engine libref. The PARTITION= data table option partitions the data table such that all observations that have the same values for the variables line and RepId are located on the same worker node. The DATA step assumes that your CAS engine libref is named mycas, but you can substitute any appropriately defined CAS engine libref.

/* Load data into the CAS server and partition by the specified variables. */
data mycas.lossCounts2(partition=(line repid));
  set lossCounts2;
run;

Output 5.2.3 shows some observations of the data table mycas.LossCounts2 for each business line. For the first replication (RepId=1) of the commercial banking line, only operating environments 3 and 4 incur loss events; the other environments incur no loss events. For the second replication (RepId=2), all operating environments incur at least one loss event. For the first replication (RepId=1) of the retail banking line, operating environments 2, 3, and 4 incur two, one, and three loss events, respectively.

Output 5.2.3: Snapshot of the External Counts Data with Replication Identifier

line	opEnvId	corpKRI1	corpKRI2	cbKRI2	rbKRI1	rbKRI3	repid	numloss
CommercialBanking	1	0.45224	0.40661	-0.33680	-1.08692	-2.20557	1	0
CommercialBanking	2	-0.03799	0.98670	-0.03752	1.94589	1.22456	1	0
CommercialBanking	3	-0.29120	-0.45239	0.98855	-0.37208	-1.51534	1	3
CommercialBanking	4	0.87499	-0.67812	-0.04839	-1.44881	0.78221	1	1
CommercialBanking	1	0.45224	0.40661	-0.33680	-1.08692	-2.20557	2	2
CommercialBanking	2	-0.03799	0.98670	-0.03752	1.94589	1.22456	2	5
CommercialBanking	3	-0.29120	-0.45239	0.98855	-0.37208	-1.51534	2	12
CommercialBanking	4	0.87499	-0.67812	-0.04839	-1.44881	0.78221	2	12
RetailBanking	1	0.45224	0.40661	-0.33680	-1.08692	-2.20557	1	0
RetailBanking	2	-0.03799	0.98670	-0.03752	1.94589	1.22456	1	2
RetailBanking	3	-0.29120	-0.45239	0.98855	-0.37208	-1.51534	1	1
RetailBanking	4	0.87499	-0.67812	-0.04839	-1.44881	0.78221	1	3
RetailBanking	1	0.45224	0.40661	-0.33680	-1.08692	-2.20557	2	2
RetailBanking	2	-0.03799	0.98670	-0.03752	1.94589	1.22456	2	2
RetailBanking	3	-0.29120	-0.45239	0.98855	-0.37208	-1.51534	2	0
RetailBanking	4	0.87499	-0.67812	-0.04839	-1.44881	0.78221	2	1
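To confirm that every replication contains one observation per operating environment, you can tabulate the structure of the counts data. The following query is a minimal sketch against the Work.LossCounts2 data set; the column names nObs, nRep, and totalCount are chosen here for illustration.

```sas
/* For each business line: number of observations, number of distinct
   replications, and total number of simulated loss events           */
proc sql;
   select line,
          count(*)              as nObs,
          count(distinct repid) as nRep,
          sum(numloss)          as totalCount
   from lossCounts2
   group by line;
quit;
```

With four operating environments and 10,000 replications, each business line should show 40,000 observations and 10,000 replications, and totalCount can be compared with the Total Count value in the "Input Data Summary" table of PROC CCDM.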

You can now use this simulated count data to estimate the distribution of the aggregate loss that is incurred in all four operating environments by submitting the following PROC CCDM step, in which you specify the replication identifier variable RepId in the ID= option of the EXTERNALCOUNTS statement:

/* Estimate the distribution of the aggregate loss for both
   lines of business by using the externally simulated counts
   for the multiple operating environments */
proc ccdm data=mycas.lossCounts2 seed=13579 print=all
          severityest=mycas.sevestEx2;
   by line;
   externalcounts count=numloss id=repid;
   severitymodel logn gamma;
run;

Within each BY group, for each value of the variable RepId, one point of the aggregate loss sample is simulated by using the process that is described in the section Simulation with External Counts.
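As a reading of how this example describes the process (not a quotation from the referenced section), one sample point for replication r sums the severity draws over all four operating environments. The notation is introduced here for illustration: N_{r,e} is the value of NumLoss for environment e in replication r, and x_e denotes the KRI values of environment e that determine the scale of the severity distribution F.

```latex
S_r \;=\; \sum_{e=1}^{4} \; \sum_{j=1}^{N_{r,e}} X_{r,e,j},
\qquad
X_{r,e,j} \sim F\bigl(\,\cdot\,;\ \mathrm{scale}(\mathbf{x}_e)\bigr),
\qquad r = 1, \dots, 10000
```

Because the inner sum is empty whenever N_{r,e} = 0, an environment without loss events adds nothing to that replication.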

The summary statistics and percentiles of the distribution of the aggregate loss, which is the aggregate of the losses across all four operating environments, are shown in Output 5.2.4 for the commercial banking line. The "Input Data Summary" table indicates that the BY group contains 10,000 replications and that a total of 145,721 loss events are generated across all replications. The "Sample Percentiles" table indicates that you can expect a median aggregate loss of 4,817 units and a worst-case loss, as defined by the 99.5th percentile, of 17,521 units from the commercial banking line when you combine losses from all four operating environments.

Output 5.2.4: Aggregate Loss Summary for the Commercial Banking Line in Multiple Operating Environments

The CCDM Procedure

line=CommercialBanking

Input Data Summary
Name	LOSSCOUNTS2
Observations	40000
Valid Observations	40000
Replications	10000
Total Count	145721

line=CommercialBanking

Sample Percentiles
Percentile	Value
1	771.75856
5	1507.8
25	3097.6
50	4816.6
75	7000.8
95	11346.5
99	15740.3
99.5	17520.5
Percentile Method = 5

Last updated: June 14, 2018