Bank Loan Status Prediction

Sivakumar V
7 min read · Jun 6, 2021
Loan is approved or Not

Problem Statement

This dataset contains details of applicants who have applied for a loan, including credit history, loan amount, income, number of dependents, etc.

There are six distinct phases of the mortgage loan process: pre-approval, house shopping, mortgage application, loan processing, underwriting, and closing.

Types of loans:

  • Personal Loans: Most banks offer personal loans to their customers and the money can be used for any expense like paying a bill or purchasing a new television. …
  • Credit Card Loans: …
  • Home Loans: …
  • Car Loans: …
  • Two-Wheeler Loans: …
  • Small Business Loans: …
  • Payday Loans: …
  • Cash Advances:

Using machine learning algorithms, we are going to train a model to predict whether a loan will be approved or not.

To predict the loan status, a lot of analysis and EDA was done.

Below are the features used to train the model.

  • Loan_ID — unique ID for each applicant
  • Gender — categorical variable (Male/Female)
  • Married — categorical variable (Married/Single)
  • Dependents — number of dependents
  • Education — categorical variable (Graduate/Not Graduate)
  • Self_Employed — Yes/No
  • ApplicantIncome — continuous variable
  • CoapplicantIncome — continuous variable
  • LoanAmount — value of the loan
  • Loan_Amount_Term — term of the loan
  • Credit_History — whether the applicant has a credit history (1/0)
  • Property_Area — categorical variable
  • Loan_Status — target variable (loan approved or not)

Importing the required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import zscore,boxcox
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
import warnings
warnings.filterwarnings('ignore')

Reading the dataset

df =pd.read_csv('loan_prediction.csv')
df.head()
  • The Loan_ID column is not useful for predicting the target variable.
  • Gender, Married, Education, Self_Employed, Property_Area, and Loan_Status are categorical, so we need to encode these values.

Data Analysis and EDA

Checking the null value percentage across the complete dataset

# List comprehension to grab columns with null values
feature_with_na = [feature for feature in df.columns if df[feature].isnull().sum() > 0]
for i in feature_with_na:
    print(i, np.round(df[i].isnull().mean() * 100, 2), "% missing values")
Gender 2.12 % missing values
Married 0.49 % missing values
Dependents 2.44 % missing values
Self_Employed 5.21 % missing values
LoanAmount 3.58 % missing values
Loan_Amount_Term 2.28 % missing values
Credit_History 8.14 % missing values

Size of the dataset

df.shape
(614, 13)
There are 614 rows and 13 columns in the dataset.

Checking unique values in the dataset

df.nunique()
Loan_ID 614
Gender 2
Married 2
Dependents 4
Education 2
Self_Employed 2
ApplicantIncome 505
CoapplicantIncome 287
LoanAmount 203
Loan_Amount_Term 10
Credit_History 2
Property_Area 3
Loan_Status 2
dtype: int64
Observation: categorical variables - 9, continuous variables - 4

Complete summary information of the dataset

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Loan_ID 614 non-null object
1 Gender 601 non-null object
2 Married 611 non-null object
3 Dependents 599 non-null object
4 Education 614 non-null object
5 Self_Employed 582 non-null object
6 ApplicantIncome 614 non-null int64
7 CoapplicantIncome 614 non-null float64
8 LoanAmount 592 non-null float64
9 Loan_Amount_Term 600 non-null float64
10 Credit_History 564 non-null float64
11 Property_Area 614 non-null object
12 Loan_Status 614 non-null object
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB

Null values

df.isnull().sum()
Loan_ID 0
Gender 13
Married 3
Dependents 15
Education 0
Self_Employed 32
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 22
Loan_Amount_Term 14
Credit_History 50
Property_Area 0
Loan_Status 0
dtype: int64
Observation:
  • Gender, Married, Dependents, Self_Employed, LoanAmount, Loan_Amount_Term, and Credit_History have null values that we need to handle.

Pre-Processing Pipeline.

Handling the null values in the dataset
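The exact imputation code isn't reproduced here. A minimal sketch of one common approach (mode imputation for the categorical/discrete columns, median imputation for LoanAmount; this strategy is an assumption, not necessarily what the original notebook used) could look like this:

# Assumed strategy: fill categorical/discrete columns with their most frequent value (mode)
for col in ['Gender', 'Married', 'Dependents', 'Self_Employed', 'Loan_Amount_Term', 'Credit_History']:
    df[col] = df[col].fillna(df[col].mode()[0])
# Fill the continuous LoanAmount column with its median
df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].median())

After imputation, the categorical columns are label-encoded: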

# Label-encoding the categorical columns
encoder = LabelEncoder()
df['Gender'] = encoder.fit_transform(df['Gender'])
df['Married'] = encoder.fit_transform(df['Married'])
df['Dependents'] = encoder.fit_transform(df['Dependents'])
df['Education'] = encoder.fit_transform(df['Education'])
df['Self_Employed'] = encoder.fit_transform(df['Self_Employed'])
df['Property_Area'] = encoder.fit_transform(df['Property_Area'])
df['Loan_Status'] = encoder.fit_transform(df['Loan_Status'])

df.drop('Loan_ID', axis=1, inplace=True)

Removing the outliers from the dataset

z = np.abs(zscore(df))
df_new = df[(z < 3).all(axis=1)]
df_new.shape  # data loss is small, so we can safely remove the outliers
(548, 12)
df.shape
(614, 12)

Checking correlation of the dataset
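The correlation heatmap itself isn't reproduced here. A minimal sketch of how the check could be done with the libraries already imported, assuming the cleaned dataframe df_new:

# Correlation heatmap of the cleaned dataset
plt.figure(figsize=(10, 8))
sns.heatmap(df_new.corr(), annot=True, cmap='coolwarm')
plt.show()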

Observation: 1. Credit_History is positively correlated with the target. 2. Some columns are negatively correlated.

Building Machine Learning Models

Splitting X and Y Dataset

x = df_new.iloc[:,0:-1]
y = df_new.iloc[:,-1]

x.shape
(548, 11)
y.shape
(548,)

Finding the best random state

The random_state parameter initializes the internal random number generator, which decides how the data is split into train and test indices. If random_state is None or np.random, a randomly-initialized RandomState object is used.

maxScore = 0
maxRS = 0
for i in range(1, 100):
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.23, random_state=i)
    lr = LogisticRegression()
    lr.fit(x_train, y_train)
    prec = lr.predict(x_test)
    acc = accuracy_score(y_test, prec)
    if acc > maxScore:
        maxScore = acc
        maxRS = i
print(f"Best random state :{maxRS} and accuracy_score:{maxScore}")

Best random state :29 and accuracy_score:0.889763779527559

Standard Scaler

scale = StandardScaler()
x =scale.fit_transform(x)

Splitting the training and testing data

x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.23,random_state=29)
# Size of the training and testing dataset
print("X_train",x_train.shape)
print("y_train",y_train.shape)
print("x_test",x_test.shape)
print("y_test",y_test.shape)
X_train (421, 11)
y_train (421,)
x_test (127, 11)
y_test (127,)

A function to check the metrics of a model.

def metrics_model(y_test, prec):
    print("Accuracy score is", accuracy_score(y_test, prec))
    print("."*80)
    print("Confusion matrix value is:\n", confusion_matrix(y_test, prec))
    print("."*80)
    print(classification_report(y_test, prec))

Logistic Regression

Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable, although many more complex extensions exist. In regression analysis, logistic regression (or logit regression) estimates the parameters of a logistic model (a form of binary regression).
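As a small illustration of the logistic function itself (a generic sketch, not part of the original notebook), the model passes a linear combination of the features through a sigmoid to get a probability between 0 and 1:

# The logistic (sigmoid) function maps any real-valued score to a probability in (0, 1)
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))     # 0.5, the decision boundary
print(sigmoid(2.0))   # ~0.88, likely approved
print(sigmoid(-2.0))  # ~0.12, likely rejected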

Code below:
# Logistic Regression
lr = LogisticRegression()
lr.fit(x_train,y_train)
prec= lr.predict(x_test)
metrics_model(y_test,prec)
Accuracy score is 0.889763779527559
................................................................................
Confusion matrix value is:
[[22 13]
[ 1 91]]
................................................................................
precision recall f1-score support

0 0.96 0.63 0.76 35
1 0.88 0.99 0.93 92

accuracy 0.89 127
macro avg 0.92 0.81 0.84 127
weighted avg 0.90 0.89 0.88 127

Cross validation function

Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation

# cross Validation
cross = cross_val_score(LogisticRegression(),x,y,cv=5)
cross.mean()
0.8212844036697249

Checking all other classification models now

Creating a list of classification model instances

dtc = DecisionTreeClassifier()
Kn = KNeighborsClassifier(n_neighbors=5)
sv = SVC()
# model instance list created
model = [dtc, Kn, sv]
for m in model:
    print(type(m))
    print("")
    m.fit(x_train, y_train)
    prec = m.predict(x_test)
    metrics_model(y_test, prec)
<class 'sklearn.tree._classes.DecisionTreeClassifier'>

Accuracy score is 0.7244094488188977
................................................................................
Confusion matrix value is:
[[25 10]
[25 67]]
................................................................................
precision recall f1-score support

0 0.50 0.71 0.59 35
1 0.87 0.73 0.79 92

accuracy 0.72 127
macro avg 0.69 0.72 0.69 127
weighted avg 0.77 0.72 0.74 127

<class 'sklearn.neighbors._classification.KNeighborsClassifier'>

Accuracy score is 0.889763779527559
................................................................................
Confusion matrix value is:
[[22 13]
[ 1 91]]
................................................................................
precision recall f1-score support

0 0.96 0.63 0.76 35
1 0.88 0.99 0.93 92

accuracy 0.89 127
macro avg 0.92 0.81 0.84 127
weighted avg 0.90 0.89 0.88 127

<class 'sklearn.svm._classes.SVC'>

Accuracy score is 0.8818897637795275
................................................................................
Confusion matrix value is:
[[21 14]
[ 1 91]]
................................................................................
precision recall f1-score support

0 0.95 0.60 0.74 35
1 0.87 0.99 0.92 92

accuracy 0.88 127
macro avg 0.91 0.79 0.83 127
weighted avg 0.89 0.88 0.87 127

Cross-validating all the models

# DecisionTreeClassifier
cross = cross_val_score(DecisionTreeClassifier(),x,y,cv=5)
cross.mean()
0.7319099249374478
# KNeighborsClassifier
cross = cross_val_score(KNeighborsClassifier(),x,y,cv=5)
cross.mean()
0.8030358632193495
# SVC
cross = cross_val_score(SVC(),x,y,cv=5)
cross.mean()
0.8175979983319432

Observation:
Comparing each model's test accuracy with its cross-validation score above, the Decision Tree classifier shows the smallest gap between the two, so we choose the Decision Tree classifier for hyperparameter tuning.

Ensemble methods

# Bagging
from sklearn.ensemble import RandomForestClassifier
rf =RandomForestClassifier(n_estimators=100,random_state=42)
rf.fit(x_train,y_train)
prec =rf.predict(x_test)
metrics_model(y_test,prec)
Accuracy score is 0.8661417322834646
................................................................................
Confusion matrix value is:
[[22 13]
[ 4 88]]
................................................................................
precision recall f1-score support

0 0.85 0.63 0.72 35
1 0.87 0.96 0.91 92

accuracy 0.87 127
macro avg 0.86 0.79 0.82 127
weighted avg 0.86 0.87 0.86 127
# RandomForestClassifier cross validation
cross = cross_val_score(RandomForestClassifier(),x,y,cv=5)
cross.mean()
# Boosting algorithm
from sklearn.ensemble import AdaBoostClassifier
adb =AdaBoostClassifier()
adb.fit(x_train,y_train)
prec=adb.predict(x_test)
metrics_model(y_test,prec)
Accuracy score is 0.84251968503937
................................................................................
Confusion matrix value is:
[[21 14]
[ 6 86]]
................................................................................
precision recall f1-score support

0 0.78 0.60 0.68 35
1 0.86 0.93 0.90 92

accuracy 0.84 127
macro avg 0.82 0.77 0.79 127
weighted avg 0.84 0.84 0.84 127
# AdaBoostClassifier cross validation
cross = cross_val_score(AdaBoostClassifier(),x,y,cv=5)
cross.mean()
0.7903085904920767

Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV
param = {'criterion': ['gini', 'entropy'],
         'splitter': ['best', 'random']}
gr = GridSearchCV(DecisionTreeClassifier(),param,cv=5)
gr.fit(x_train,y_train)
gr.best_params_
dct = DecisionTreeClassifier(criterion='entropy',splitter='best')
dct.fit(x_train,y_train)
prec = dct.predict(x_test)
metrics_model(y_test,prec)
Accuracy score is 0.7480314960629921
................................................................................
Confusion matrix value is:
[[25 10]
[22 70]]
................................................................................
precision recall f1-score support

0 0.53 0.71 0.61 35
1 0.88 0.76 0.81 92

accuracy 0.75 127
macro avg 0.70 0.74 0.71 127
weighted avg 0.78 0.75 0.76 127

Concluding Remarks:

The KNN model works well compared to all the other models.

So we choose the KNN algorithm for this dataset.

Metrics: AUC-ROC Curve

# lib
from sklearn.metrics import roc_curve,roc_auc_score,plot_roc_curve
Y_pred_pb =Kn.predict_proba(x_test)[:,1]
roc_auc_score(y_test,Y_pred_pb)
0.8523291925465839
plot_roc_curve(Kn,x_test,y_test)
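One caveat if you run this on a newer scikit-learn: plot_roc_curve was removed in scikit-learn 1.2, so on recent versions the equivalent call is roughly:

# Replacement for plot_roc_curve on scikit-learn >= 1.2
from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_estimator(Kn, x_test, y_test)
plt.show()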

Saving the model

import joblib
joblib.dump(dct,"Loan_Application.obj")
# loading the model from file
job = joblib.load("Loan_Application.obj")
job
prec = job.predict(x_test)
np.unique(prec)
