/* This program illustrates how to put together a dataset for credit scoring if you have information in the form of snapshots in time. Generally, in credit scoring an observation point is selected where you take a sample of all good performing accounts and append attribute information such as generic credit scores, balances, etc. Then you select the performance window - the length of time you want to use to examine payment performance. If an account does not meet this performance criteria over, say 12 months, then the account is considered "BAD" and a value of "1" is assigned, otherwise a "0". This "BAD" variable is considered the dependent variable in a logistic or probit regression model in order to develop your probability of going bad - i.e. your new credit score; The example here uses only a 2 month performance window for illustrative purposes. Note: Here, June_2002 is the observation point. The remaining files we use to observe performance; The performance criteria is passed as an arguement in the macro - the bad definition is 90+ days deliquent, where DAYS = Days deliquent. */ OPTIONS NOTES SOURCE SOURCE2 MPRINT MLOGIC MERROR SYMBOLGEN; *Put the names of the files you need to use to determine performance; data files; input names $32.; cards; june_2002 july_2002 august_2002 ; run; * Observation Point - Get attribute data and determine initial performance status; * Note - some accounts are currently bad and we need to eliminate them from our; * observation data which should only have accounts in good status; data June_2002; input Account ATTRIBUTE_1 ATTRIBUTE_2 DAYS; cards; 101 680 193000 94 102 555 183000 115 103 800 173000 1 104 645 163000 4 105 790 153000 0 106 766 143000 33 107 666 133000 4 108 784 123000 54 109 766 123000 12 110 788 223000 96 111 800 133000 0 112 645 143000 0 113 790 153000 1 114 766 163000 6 115 800 173000 13 116 645 183000 15 117 790 193000 15 118 766 153000 59 119 666 143000 17 120 784 123000 88 ; run; * Performance Period 1. Note how some accounts have aged in terms of DAYS delinquent; data July_2002; input Account ATTRIBUTE_1 ATTRIBUTE_2 DAYS; cards; 101 680 143000 124 102 555 153000 145 103 800 153000 3 104 645 163000 4 105 790 173000 6 106 766 163000 63 107 666 163000 4 108 784 123000 84 109 766 123000 42 110 788 123000 126 111 800 123000 3 112 645 123000 0 113 790 153000 4 114 766 163000 36 115 800 173000 43 116 645 183000 6 117 790 133000 8 118 766 143000 89 119 666 153000 15 120 784 123000 118 ; run; * Performance Period 2. Note how some accounts have aged even further in terms of DAYS delinquent; data August_2002; input Account ATTRIBUTE_1 ATTRIBUTE_2 DAYS; cards; 101 680 123000 154 102 555 163000 175 103 800 163000 4 104 645 143000 5 105 790 143000 16 106 766 163000 93 107 666 123000 14 108 784 123000 114 109 766 183000 72 110 788 123000 156 111 800 183000 4 112 645 123000 1 113 790 173000 6 114 766 193000 66 115 800 193000 73 116 645 193000 3 117 790 193000 9 118 766 183000 119 119 666 1723000 4 120 784 183000 148 ; run; title June 2002 (Observation Data); proc print data=june_2002;run; title July 2002 (Performance month 1);run; proc print data=july_2002;run; title August 2002 (Performance month 2); proc print data=august_2002;run; %macro MODEL_DATA(days_del); OPTIONS NOTES SOURCE SOURCE2 MPRINT MLOGIC MERROR SYMBOLGEN; *Getting the number of files to cycle through; data _null_; set files end=last; if last then call symput ('max',_n_); run; %do i=1 %to &max; data _null_; set files end=last; if _n_=&i then call symput ('myfile',names); if last then call symput ('max',_n_); run; *Assuming first file in your list is your Observation Point; *Start off with only GOOD accounts; %if &i=1 %then %do; data original_good (keep=account attribute_1 attribute_2 Days original); set june_2002; *Set criteria for Good performance (<=90 days delinquent); if days>&days_del then delete; original=1; run; data all; set original_good; run; proc sort data=all;by account;run; %end; *Now we loop through all the performance files...; %if &i>1 %then %do; data file_&i (keep=account bad_&i ); set &myfile; bad_&i=0; if days>&days_del then bad_&i=1; run; proc sort data=file_&i;by account;run; *Now putting all files together; data all; retain bad; merge all file_&i; by account; if original^=1 then delete; run; %end; %end; * The following counts an account as BAD if it goes bad during the performance window, even if it; * were to pay up and get current sometime thereafter. The variable DEPV_BAD would be the dependent variable; * in your logistic or probit scoring model.; data MODELING_DATA(drop=original); set all; count=0; %do i=2 %to &max; if bad_&i=1 then COUNT+1; %end; DEPV_BAD=0; if Count>0 then DEPV_BAD=1; run; Title Modeling Data; footnote Note: attribute data for modeling is from Observation Point, not performance periods.; proc print data=modeling_data;run; %mend MODEL_DATA; * Now we execute the macro with 90+ days as our bad definition; * The table MODELING_DATA can be used with logistic regression, for instance,; * except BAD_2, BAD_3, and COUNT wouldn't be ordinarily used for estimation.; * They are shown for illustrative purposes; %MODEL_DATA(90);