Determining Late Payments on Loan Applications Using Machine Learning: An Amateur Approach

Sean Yonathan T · Published in Analytics Vidhya · Jul 8, 2021 · 11 min read

[Image retrieved from https://consumercreditcardrelief.com/wp-content/uploads/2019/09/debt-relief-program.png]

Late payments have always been one of the main risks a financial company faces. Usually, credit analysts determine whether an application has a tendency to default. The purpose of identifying such applications is to reject them, or perhaps to devise a procedure for handling them, thus minimizing the company's loss. Thanks to technology, we're now able to use machine learning to automatically detect applications that will default. In this article, I would like to demonstrate how we can use machine learning to predict default loans.

Data

Machine learning needs data. There's a good analogy for how machine learning works with data. When we want to learn how to do something, say photography, we look for "patterns" in what a beautiful photo looks like. Should a colorful photo be categorized as beautiful, or should a symmetrical one? Machine learning uses the same approach: it needs data on past events to understand the pattern of clients who have a tendency to default.

The data that I’ll be using today is obtained from a recruitment test that I went through before for a position at a financial company. The original source of the data is anonymous but the data itself is quite interesting. I’ll leave a link to the full dataset at the end of the article if you guys are interested in trying out the data yourself. The data consists of 4 different tables and by tables, I mean 4 different .csv files. The 4 different tables are:

We’ll be evaluating the data step by step, starting from the installmentpayment.csv as because it’s in my personal interest to see the behavioral data first. We’ll load the data using the following script:

import pandas as pd
import numpy as np

# Load the four tables; each drops the leftover index column written by to_csv
train = pd.read_csv('app_train.csv')
train.drop('Unnamed: 0', axis=1, inplace=True)
test = pd.read_csv('app_test.csv')
test.drop('Unnamed: 0', axis=1, inplace=True)
prev = pd.read_csv('prev_app.csv')
prev.drop('Unnamed: 0', axis=1, inplace=True)
behavior = pd.read_csv('installment_payment.csv')
behavior.drop('Unnamed: 0', axis=1, inplace=True)

FIRST STEP (What to do with installment_payment.csv)

The installment_payment.csv dataset contains the following variables:

[Image: columns of the behavior dataset, namely LN_ID, SK_ID_PREV, INST_NUM, INST_DAYS, PAY_DAYS, AMT_INST, and AMT_PAY]

We’ll be dropping LN_ID since we’ll be joining the table with previous loan dataset, not the train dataset directly. The interesting about this table is that we can actually reduce INST_DAYS, PAY_DAYS, AMT_INST, and AMT_PAY into 2 new variables that talks about the same thing. We can create a variable namely PREV_LATENESS that describes the lateness of each previous payment. We can also create another variable that describes the difference between amount prescribed and amount pay. We’ll name this variable as PREV_PAY_DEFICIT as to measure the power of clients to complete the payment.

While we could create more variables, such as installment number versus client lateness, or the client's prescribed credit due date versus lateness and pay deficit, we'll stick with PREV_LATENESS and PREV_PAY_DEFICIT to describe the clients' behavior for now.

behavior.drop('LN_ID', axis=1, inplace=True)
behavior.describe()
# Work on an explicit copy to avoid chained-assignment warnings
behaviorengineered = behavior.copy()
# Difference between the prescribed installment amount and the amount actually paid
behaviorengineered['PREV_PAY_DEFICIT'] = behaviorengineered.AMT_INST - behaviorengineered.AMT_PAY
# Difference between the scheduled installment days and the actual payment days
behaviorengineered['PREV_LATENESS'] = behaviorengineered.INST_DAYS - behaviorengineered.PAY_DAYS
behaviorengineered.drop(['INST_DAYS','PAY_DAYS','AMT_INST','AMT_PAY'], axis=1, inplace=True)
# Fill remaining gaps with each column's median
behaviorengineered.fillna(behaviorengineered.median(), inplace=True)

SECOND STEP (Joining behavioral data with previous loan data)

We’ll be adding the behavioral data engineered with the previous loan data to describe previous clients’ loan application data and their behavior. The previous loan application dataset contains showed one row per unique SK_ID_PREV, meaning each row only describe one record of previous application. Meanwhile, the behavior dataset has multiple rows with the same SK_ID_PREV, meaning it shows actions of each SK_ID_PREV that forms the behavior of the respective SK_ID_PREV. Thus INST_NUM is the “n” record of an SK_ID_PREV behavior.

[Image: columns left in the behavior dataset: SK_ID_PREV, INST_NUM, PREV_PAY_DEFICIT, PREV_LATENESS]

We should group clients' behavior in the behavior dataset by SK_ID_PREV before adding it to the previous loan application dataset.

# Aggregate each SK_ID_PREV's behavior: median deficit and median lateness
prevpaydeficit = behaviorengineered.groupby('SK_ID_PREV')['PREV_PAY_DEFICIT'].agg(lambda x: x.median() if x.notnull().any() else np.nan)
prevlateness = behaviorengineered.groupby('SK_ID_PREV')['PREV_LATENESS'].agg(lambda x: x.median() if x.notnull().any() else np.nan)
# Map the aggregates onto the previous loan dataset; unmatched IDs become NaN
prev['PREV_PAY_DEFICIT'] = prev['SK_ID_PREV'].apply(lambda x: prevpaydeficit[x] if x in prevpaydeficit.index else np.nan)
prev['PREV_LATENESS'] = prev['SK_ID_PREV'].apply(lambda x: prevlateness[x] if x in prevlateness.index else np.nan)
[Image: number of missing values within the previous loan dataset, including the behavior columns]

The counts shown in the picture above indicate that we have a lot of missing information regarding previous loans. Also, almost half of our SK_ID_PREVs' behavior isn't described, due to the number of missing values in the behavior columns. We won't handle these missing values now, because this data only acts as a supplement to the train dataset; we'll take care of the missing values after we've combined everything into the train dataset. We also need to encode the categorical variables within the previous loan dataset.

from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# One encoder per categorical column, so each fitted mapping can be reused later
LEcontract_type = LabelEncoder()
LEweekdays_apply = LabelEncoder()
LEcontract_status = LabelEncoder()
LEyield_group = LabelEncoder()
prev['CONTRACT_TYPE'] = LEcontract_type.fit_transform(prev.CONTRACT_TYPE)
prev.WEEKDAYS_APPLY = LEweekdays_apply.fit_transform(prev.WEEKDAYS_APPLY)
prev.CONTRACT_STATUS = LEcontract_status.fit_transform(prev.CONTRACT_STATUS)
prev.YIELD_GROUP = LEyield_group.fit_transform(prev.YIELD_GROUP)
prev.describe()

Reducing variables used

There are 20 variables, or columns, within the previous loan dataset, including the behavior columns. It is possible to add all 20 to the train dataset, but that would be a waste of resources: some variables may be highly correlated with each other and can therefore be represented by other variables. The simplest way to reduce the data is to look at the correlation matrix of all variables, identify the correlated pairs, and keep only one variable from each pair. This method is usually avoided because it requires a deep understanding of the data being interpreted, but I'll use it here to keep the data preparation short. We'll be looking at variables with a correlation higher than 0.3 or lower than -0.3.

prev.drop(['SK_ID_PREV'], axis=1, inplace=True)
corrprevbhv = prev.corr()
# Keep only the upper triangle so each pair appears once (np.bool is deprecated; use bool)
corr_triuprevbhv = corrprevbhv.where(~np.tril(np.ones(corrprevbhv.shape)).astype(bool))
corr_triuprevbhv = corr_triuprevbhv.stack()
corr_triuprevbhv.name = 'Pearson Correlation Coefficient'
corr_triuprevbhv.index.names = ['First Var', 'Second Var']
corr_triuprevbhv[(corr_triuprevbhv > 0.3)|(corr_triuprevbhv < -0.3)].to_frame()
[Image: correlations of all columns exceeding the cutoff]

From the correlation result, I'm choosing several variables that, in my opinion, can represent the other variables they are highly correlated with.

prevbhvfinal = prev[['LN_ID','CONTRACT_TYPE','CONTRACT_STATUS','AMT_DOWN_PAYMENT','PRICE','WEEKDAYS_APPLY','HOUR_APPLY','DAYS_DECISION','PREV_PAY_DEFICIT','PREV_LATENESS','TERMINATION']]

Just like with the behavior dataset, we want to check whether LN_ID is unique per row here. We check its value counts and compare them to the number of rows, as sketched below.
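
A minimal sketch of that check, using the prevbhvfinal frame defined above:

# Sketch: if nunique() is smaller than the row count, some LN_IDs repeat
print(prevbhvfinal['LN_ID'].nunique(), 'unique LN_IDs in', len(prevbhvfinal), 'rows')
prevbhvfinal['LN_ID'].value_counts().head()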

This shows that some LN_IDs appear multiple times within the data. We should group the previous loan dataset by LN_ID before adding it to the train and test datasets.

THIRD STEP (Grouping previous loan data and adding them to train-test data)

from scipy.stats import mode

# Aggregate per LN_ID: mode for the categorical columns, median for the numeric ones
# (mode(x)[0][0] matches older scipy; newer versions return mode(x).mode instead)
contract_type = prevbhvfinal.groupby(['LN_ID'])['CONTRACT_TYPE'].agg(lambda x: mode(x)[0][0] if x.notnull().any() else np.nan)
contract_status = prevbhvfinal.groupby(['LN_ID'])['CONTRACT_STATUS'].agg(lambda x: mode(x)[0][0] if x.notnull().any() else np.nan)
amt_down_payment = prevbhvfinal.groupby(['LN_ID'])['AMT_DOWN_PAYMENT'].agg(lambda x: x.median() if x.notnull().any() else np.nan)
price = prevbhvfinal.groupby(['LN_ID'])['PRICE'].agg(lambda x: x.median() if x.notnull().any() else np.nan)
weekdays_apply = prevbhvfinal.groupby(['LN_ID'])['WEEKDAYS_APPLY'].agg(lambda x: mode(x)[0][0] if x.notnull().any() else np.nan)
hour_apply = prevbhvfinal.groupby(['LN_ID'])['HOUR_APPLY'].agg(lambda x: x.median() if x.notnull().any() else np.nan)
days_decision = prevbhvfinal.groupby(['LN_ID'])['DAYS_DECISION'].agg(lambda x: x.median() if x.notnull().any() else np.nan)
prev_pay_deficit = prevbhvfinal.groupby(['LN_ID'])['PREV_PAY_DEFICIT'].agg(lambda x: x.median() if x.notnull().any() else np.nan)
prev_lateness = prevbhvfinal.groupby(['LN_ID'])['PREV_LATENESS'].agg(lambda x: x.median() if x.notnull().any() else np.nan)
termination = prevbhvfinal.groupby(['LN_ID'])['TERMINATION'].agg(lambda x: x.median() if x.notnull().any() else np.nan)

After the grouping for the previous loan dataset is done, we can start adding those variables as additional columns to train-test dataset.

# Encoders for the application data; EDUCATION gets an explicit ordinal ordering
LEincometype = LabelEncoder()
LEeducation = OrdinalEncoder(categories=[['Academic degree','Lower secondary','Secondary / secondary special','Incomplete higher','Higher education']])
LEfamilystatus = LabelEncoder()
LEhousingtypes = LabelEncoder()
LEorganizationtype = LabelEncoder()
# Map the grouped previous-loan aggregates onto the train set by LN_ID
train['PREV_CONTRACT_TYPE'] = train['LN_ID'].apply(lambda x: contract_type[x] if x in contract_type.index else np.nan)
train['PREV_AMT_DOWN_PAYMENT'] = train['LN_ID'].apply(lambda x: amt_down_payment[x] if x in amt_down_payment.index else np.nan)
train['PREV_PRICE'] = train['LN_ID'].apply(lambda x: price[x] if x in price.index else np.nan)
train['PREV_WEEKDAYS_APPLY'] = train['LN_ID'].apply(lambda x: weekdays_apply[x] if x in weekdays_apply.index else np.nan)
train['PREV_HOUR_APPLY'] = train['LN_ID'].apply(lambda x: hour_apply[x] if x in hour_apply.index else np.nan)
train['PREV_DAYS_DECISION'] = train['LN_ID'].apply(lambda x: days_decision[x] if x in days_decision.index else np.nan)
train['PREV_PAY_DEFICIT'] = train['LN_ID'].apply(lambda x: prev_pay_deficit[x] if x in prev_pay_deficit.index else np.nan)
train['PREV_LATENESS'] = train['LN_ID'].apply(lambda x: prev_lateness[x] if x in prev_lateness.index else np.nan)
train['PREV_TERMINATION'] = train['LN_ID'].apply(lambda x: termination[x] if x in termination.index else np.nan)
train['PREV_CONTRACT_STATUS'] = train['LN_ID'].apply(lambda x: contract_status[x] if x in contract_status.index else np.nan)
# One-hot encode GENDER, then encode the remaining categoricals
fortraingenddummy = pd.get_dummies(train.GENDER)
train['GENDER_F'], train['GENDER_M'] = fortraingenddummy['F'], fortraingenddummy['M']
train.drop('GENDER', axis=1, inplace=True)
train['INCOME_TYPE'] = LEincometype.fit_transform(train['INCOME_TYPE'])
train['EDUCATION'] = LEeducation.fit_transform(train.loc[:,['EDUCATION']])
train['FAMILY_STATUS'] = LEfamilystatus.fit_transform(train['FAMILY_STATUS'])
train['HOUSING_TYPE'] = LEhousingtypes.fit_transform(train['HOUSING_TYPE'])
train['ORGANIZATION_TYPE'] = LEorganizationtype.fit_transform(train['ORGANIZATION_TYPE'])
# CONTRACT_TYPE and WEEKDAYS_APPLY reuse the encoders already fitted on prev
train['CONTRACT_TYPE'] = LEcontract_type.transform(train['CONTRACT_TYPE'])
train['WEEKDAYS_APPLY'] = LEweekdays_apply.transform(train['WEEKDAYS_APPLY'])

The same is done for the test dataset.
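
For completeness, a hedged sketch of the test-set version, reusing the encoders fitted on the train set (transform rather than fit_transform, so the test set shares the train set's category mapping; this assumes the test set contains no categories unseen in train):

# Sketch: mirror the train-set encoding on the test set with the fitted encoders
# (the PREV_* columns are added with the same .apply lookups shown above for train)
fortestgenddummy = pd.get_dummies(test.GENDER)
test['GENDER_F'], test['GENDER_M'] = fortestgenddummy['F'], fortestgenddummy['M']
test.drop('GENDER', axis=1, inplace=True)
test['INCOME_TYPE'] = LEincometype.transform(test['INCOME_TYPE'])
test['EDUCATION'] = LEeducation.transform(test.loc[:,['EDUCATION']])
test['FAMILY_STATUS'] = LEfamilystatus.transform(test['FAMILY_STATUS'])
test['HOUSING_TYPE'] = LEhousingtypes.transform(test['HOUSING_TYPE'])
test['ORGANIZATION_TYPE'] = LEorganizationtype.transform(test['ORGANIZATION_TYPE'])
test['CONTRACT_TYPE'] = LEcontract_type.transform(test['CONTRACT_TYPE'])
test['WEEKDAYS_APPLY'] = LEweekdays_apply.transform(test['WEEKDAYS_APPLY'])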

FOURTH STEP (Cleaning the train data, training models, and evaluation)

Regarding the missing values we encountered earlier, we can now evaluate them in the train dataset. We'll also import KNNImputer and StandardScaler to scale the data and impute the missing values.

from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
# Fraction of missing values per column in each set
train.isnull().sum()/len(train)
test.isnull().sum()/len(test)

The percentage of missing values is not as big as in the previous loan dataset. This indicates that most of the SK_ID_PREVs with missing values don't belong to the LN_IDs used in the train data. EXT_SCORE_1, however, contains 50% missing values, which is concerning. We're dropping this column because the large number of missing values makes it hard to include in the model.

train.drop('EXT_SCORE_1',axis = 1, inplace = True)
test.drop('EXT_SCORE_1',axis = 1, inplace=True)

We also need to check for class imbalance before feeding the train data to the model. We can check it simply by calling value_counts() on the TARGET variable.

train.TARGET.value_counts()

[Image: class imbalance within the train data]

We can handle the class imbalances later after we scale and impute the data.

scale = StandardScaler()
impute = KNNImputer()
xtrain = train.drop('TARGET', axis=1, inplace=False)
ytrain = train['TARGET']
xtest = test.drop('TARGET', axis=1, inplace=False)
ytest = test['TARGET']
# Fit the scaler and imputer on train only, then apply the fitted transforms to test
xtrainscaled = pd.DataFrame(scale.fit_transform(xtrain), columns=xtrain.columns)
xtrainscaledimpute = pd.DataFrame(impute.fit_transform(xtrainscaled), columns=xtrainscaled.columns)
xtestscaled = pd.DataFrame(scale.transform(xtest), columns=xtest.columns)
xtestscaledimpute = pd.DataFrame(impute.transform(xtestscaled), columns=xtestscaled.columns)

We need to check for correlations within the train data before handling class imbalances.

corr = train.corr()
# Upper triangle again, so each pair appears once (bool instead of the removed np.bool)
corr_triu = corr.where(~np.tril(np.ones(corr.shape)).astype(bool))
corr_triu = corr_triu.stack()
corr_triu.name = 'Pearson Correlation Coefficient'
corr_triu.index.names = ['First Var', 'Second Var']
corr_triu[(corr_triu > 0.3) | (corr_triu < -0.3)].to_frame()

From the result, we can drop PRICE and DAYS_WORK, then move on to handling the class imbalance. We'll be using SMOTE for this.

xtrainscaledimpute.drop(['PRICE','DAYS_WORK'], axis=1, inplace=True)
xtestscaledimpute.drop(['PRICE','DAYS_WORK'], axis=1, inplace=True)
from imblearn.over_sampling import SMOTE
# Oversample the minority class on the training data only; the test set stays untouched
sm = SMOTE(random_state=42)
xtrainfinal, ytrainfinal = sm.fit_resample(xtrainscaledimpute, ytrain)

Modeling

I'll be using Logistic Regression, Random Forest, and Perceptron here. We'll display the result of each model, tuned a little with GridSearchCV.

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Perceptron, LogisticRegression
from sklearn.metrics import classification_report, accuracy_score, roc_auc_score, f1_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

### Logistic Regression: grid-search the solver, then refit the best one
logreg = GridSearchCV(LogisticRegression(max_iter=300), dict(solver=['newton-cg','lbfgs','sag','saga']), scoring='roc_auc')
logreg.fit(xtrainfinal, ytrainfinal)
logreg.best_estimator_
logregmodel = LogisticRegression(max_iter=300, solver='sag')
logregmodel.fit(xtrainfinal, ytrainfinal)
ypredlogreg = (logregmodel.predict_proba(xtestscaledimpute)[:,1] >= 0.5).astype(int)
rocauclogreg = round(roc_auc_score(ytest, ypredlogreg), 3)
classreportlogreg = classification_report(ytest, ypredlogreg)
print('Logistic Regression Classification Report\n'+classreportlogreg+"\nROC AUC Score: "+str(rocauclogreg)+"\nF1-Score: "+str(f1_score(ytest, ypredlogreg)))
acclogreg = round(accuracy_score(ytest, ypredlogreg), 3)

### Perceptron (don't shadow the class name; None must be the object, not the string 'None')
percept = GridSearchCV(Perceptron(random_state=42), dict(penalty=['l2','l1','elasticnet',None], class_weight=['balanced',None]))
percept.fit(xtrainfinal, ytrainfinal)
ypredPercept = percept.predict(xtestscaledimpute)
rocaucPercept = round(roc_auc_score(ytest, ypredPercept), 3)
classreportPercept = classification_report(ytest, ypredPercept)
print('Perceptron Classification Report\n'+classreportPercept+"\nROC AUC Score: "+str(rocaucPercept)+"\nF1-Score: "+str(f1_score(ytest, ypredPercept)))
accPercept = round(accuracy_score(ytest, ypredPercept), 3)

### Random Forest: randomized search over a small grid, then refit the best params
RF = RandomizedSearchCV(RandomForestClassifier(random_state=42), dict(n_estimators=[100,150,200], criterion=['gini','entropy'], max_features=['sqrt','log2']), random_state=42, scoring='roc_auc')
RF.fit(xtrainfinal, ytrainfinal)
RF.best_params_
RFmodel = RandomForestClassifier(n_estimators=200, max_features='log2', criterion='entropy', random_state=42)
RFmodel.fit(xtrainfinal, ytrainfinal)
ypredrf = (RFmodel.predict_proba(xtestscaledimpute)[:,1] >= 0.5).astype(int)
rocaucrf = round(roc_auc_score(ytest, ypredrf), 3)
classreportrf = classification_report(ytest, ypredrf)
print('Random Forest Classification Report \n'+classreportrf+"\nROC AUC Score: "+str(rocaucrf)+"\nF1-Score: "+str(f1_score(ytest, ypredrf)))
accrf = round(accuracy_score(ytest, ypredrf), 3)
[Image: models trained using GridSearchCV (probability cutoff unchanged)]

We can see that, among all the models, Perceptron and Logistic Regression seem to be the best alternatives. We measure the performance of a model by paying attention to the F1-Score, ROC AUC score, and accuracy score.

NB: In this project, I chose to measure model performance using F1-Score, ROC AUC, and accuracy because the goal is to maximize the predictions made for both classes (clients who tend to be late on payments and clients who tend to pay on time). If the goal were to reduce the number of False Negatives (a client is late but we didn't detect it), we should use Recall to measure model performance instead.
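
For example, if Recall were the metric of choice, a quick check on the predictions already computed might look like this (a sketch, not part of the original analysis):

from sklearn.metrics import recall_score
# Recall = TP / (TP + FN): the share of actually-late clients that we flagged
print('Recall:', round(recall_score(ytest, ypredlogreg), 3))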

Although Perceptron and Logistic Regression might look like the best options, we haven't yet tried changing the cutoff for the predicted probabilities of Logistic Regression and Random Forest. Let's look at their performance after the cutoffs are changed.
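
The article doesn't show how the 0.62 and 0.318 cutoffs below were found; one plausible way, sketched here under the assumption that we optimize the F1-Score, is to sweep candidate thresholds over the predicted probabilities (ideally on a validation split rather than the test set, to avoid tuning on test data):

# Sketch: sweep thresholds and keep the one maximizing F1
probs = logregmodel.predict_proba(xtestscaledimpute)[:, 1]
thresholds = np.arange(0.05, 0.95, 0.005)
f1s = [f1_score(ytest, (probs >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(f1s))]
print('Best threshold:', round(best_t, 3), 'with F1:', round(max(f1s), 3))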

### Logistic Regression: same fitted model as before, higher cutoff
ypredlogreg = (logregmodel.predict_proba(xtestscaledimpute)[:,1] >= 0.62).astype(int)
rocauclogreg = round(roc_auc_score(ytest, ypredlogreg), 3)
classreportlogreg = classification_report(ytest, ypredlogreg)
print('Logistic Regression Classification Report\n'+classreportlogreg+"\nROC AUC Score: "+str(rocauclogreg)+"\nF1-Score: "+str(f1_score(ytest, ypredlogreg)))
acclogreg = round(accuracy_score(ytest, ypredlogreg), 3)

### Random Forest: same fitted model as before, lower cutoff
ypredrf = (RFmodel.predict_proba(xtestscaledimpute)[:,1] >= 0.318).astype(int)
rocaucrf = round(roc_auc_score(ytest, ypredrf), 3)
classreportrf = classification_report(ytest, ypredrf)
print('Random Forest Classification Report \n'+classreportrf+"\nROC AUC Score: "+str(rocaucrf)+"\nF1-Score: "+str(f1_score(ytest, ypredrf)))
accrf = round(accuracy_score(ytest, ypredrf), 3)
[Image: new performance of Logistic Regression and Random Forest after changing the cutoffs]

With the adjusted cutoffs, Logistic Regression and Random Forest perform far better than they did with the default cutoff. The accuracy of Random Forest dropped from 90% to 80%, but its ROC AUC score and F1-Score increased significantly. We pay more attention to the ROC AUC score and F1-Score because both metrics describe how well the model predicts each class (in this case, default loans and normal loans).

Conclusion

The model itself isn’t perfect. ROC AUC score of <0.7 is still considered poor. But the point of this writing is to described one of the approaches we can take to build a model that determines late payments of loan applications.

Things I could’ve tried to do but didn’t

As I’ve mentioned before, this is only one of the many approaches available out there towards this data. Few things that I would like to try outside of this post is:

  • We could get more creative with the behavioral data and create more variables beyond lateness and payment deficit.
  • We could try clustering the previous loan or behavior data. This could reveal unseen clusters that behave in certain ways, which could summarize the previous loan or behavior data while effectively minimizing the number of variables added to the model.
  • We could try PCA to reduce the number of variables; this is surely a better method than just eyeballing the correlation matrix (see the sketch after this list).
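
As a rough illustration of that last point, a PCA sketch on the scaled, imputed features might look like the following (the 95% explained-variance target is my arbitrary choice, not something from the original analysis):

from sklearn.decomposition import PCA
# Sketch: keep enough principal components to explain ~95% of the variance
pca = PCA(n_components=0.95, random_state=42)
xtrainpca = pca.fit_transform(xtrainscaledimpute)
xtestpca = pca.transform(xtestscaledimpute)  # reuse the fit from train
print(pca.n_components_, 'components retained')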

Thank you for reading!


Sean Yonathan T (Analytics Vidhya)
A beginner in the data science field who would very much like to learn from dear readers.