Bank Loan Status Prediction

Sivakumar V
7 min read · Jun 6, 2021
Loan is approved or Not

Problem Statement

This dataset contains details of applicants who have applied for a loan, including credit history, loan amount, income, number of dependents, etc.

There are six distinct phases of the mortgage loan process: pre-approval, house shopping, mortgage application, loan processing, underwriting, and closing.

Types of loans:

  • Personal Loans: Most banks offer personal loans to their customers and the money can be used for any expense like paying a bill or purchasing a new television. …
  • Credit Card Loans: …
  • Home Loans: …
  • Car Loans: …
  • Two-Wheeler Loans: …
  • Small Business Loans: …
  • Payday Loans: …
  • Cash Advances:

Using machine learning algorithms, we are going to train a model to predict whether a loan will be approved or not.

To predict the loan status, a lot of analysis and EDA was done.

Below are the features used to train the model.

  • Loan_ID — unique ID for each applicant
  • Gender — categorical variable (Male/Female)
  • Married — categorical variable (Married/Single)
  • Dependents — number of dependents
  • Education — categorical variable (Graduate/Not Graduate)
  • Self_Employed — Yes/No
  • ApplicantIncome — continuous variable
  • CoapplicantIncome — continuous variable
  • LoanAmount — value of the loan
  • Loan_Amount_Term — term of the loan
  • Credit_History — whether the applicant has a credit history (1/0)
  • Property_Area — categorical variable
  • Loan_Status — target variable (loan approved or not)

Importing the required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import zscore,boxcox
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
import warnings
warnings.filterwarnings('ignore')

Reading the dataset

df =pd.read_csv('loan_prediction.csv')
df.head()
  • The Loan_ID column is not useful for predicting the target variable.
  • Gender, Married, Education, Self_Employed, Property_Area, and Loan_Status are categorical, so we need to encode these values.

Data Analysis and EDA

Checking the null value percentage across the complete dataset

# List comprehension to grab columns with null values
feature_with_na = [feature for feature in df.columns if df[feature].isnull().sum() > 0]
for i in feature_with_na:
    print(i, np.round(df[i].isnull().mean() * 100, 2), "% missing values")
Gender 2.12 % missing values
Married 0.49 % missing values
Dependents 2.44 % missing values
Self_Employed 5.21 % missing values
LoanAmount 3.58 % missing values
Loan_Amount_Term 2.28 % missing values
Credit_History 8.14 % missing values

Size of the dataset

df.shape
(614, 13)
There are 614 rows and 13 columns in the dataset.

Checking unique values in the dataset

df.nunique()
Loan_ID 614
Gender 2
Married 2
Dependents 4
Education 2
Self_Employed 2
ApplicantIncome 505
CoapplicantIncome 287
LoanAmount 203
Loan_Amount_Term 10
Credit_History 2
Property_Area 3
Loan_Status 2
dtype: int64
Observation: categorical variables - 9, continuous variables - 4

Complete summary information of the dataset

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Loan_ID 614 non-null object
1 Gender 601 non-null object
2 Married 611 non-null object
3 Dependents 599 non-null object
4 Education 614 non-null object
5 Self_Employed 582 non-null object
6 ApplicantIncome 614 non-null int64
7 CoapplicantIncome 614 non-null float64
8 LoanAmount 592 non-null float64
9 Loan_Amount_Term 600 non-null float64
10 Credit_History 564 non-null float64
11 Property_Area 614 non-null object
12 Loan_Status 614 non-null object
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB

Null values

df.isnull().sum()
Loan_ID 0
Gender 13
Married 3
Dependents 15
Education 0
Self_Employed 32
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 22
Loan_Amount_Term 14
Credit_History 50
Property_Area 0
Loan_Status 0
dtype: int64
Observation:
  • Gender, Married, Dependents, Self_Employed, LoanAmount, Loan_Amount_Term, and Credit_History have null values that we need to handle.

Pre-Processing Pipeline.

Handling the null values in the dataset
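The exact imputation code isn't reproduced here. A minimal sketch of one common approach (mode imputation for the categorical/discrete columns, median imputation for LoanAmount; this strategy is an assumption, not necessarily what the original notebook used) could look like this:

# Assumed strategy: fill categorical/discrete columns with their most frequent value (mode)
for col in ['Gender', 'Married', 'Dependents', 'Self_Employed', 'Loan_Amount_Term', 'Credit_History']:
    df[col] = df[col].fillna(df[col].mode()[0])
# Fill the continuous LoanAmount column with its median
df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].median())

After imputation, the categorical columns are label-encoded: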

# Label-encoding the categorical columns
encoder = LabelEncoder()
df['Gender'] = encoder.fit_transform(df['Gender'])
df['Married'] = encoder.fit_transform(df['Married'])
df['Dependents'] = encoder.fit_transform(df['Dependents'])
df['Education'] = encoder.fit_transform(df['Education'])
df['Self_Employed'] = encoder.fit_transform(df['Self_Employed'])
df['Property_Area'] = encoder.fit_transform(df['Property_Area'])
df['Loan_Status'] = encoder.fit_transform(df['Loan_Status'])

df.drop('Loan_ID', axis=1, inplace=True)

Removing the outliers from the dataset

z = np.abs(zscore(df))
df_new = df[(z < 3).all(axis=1)]
df_new.shape  # data loss is small, so we can safely remove the outliers
(548, 12)
df.shape
(614, 12)

Checking correlation of the dataset
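The correlation heatmap itself isn't reproduced here. A minimal sketch of how the check could be done with the libraries already imported, assuming the cleaned dataframe df_new:

# Correlation heatmap of the cleaned dataset
plt.figure(figsize=(10, 8))
sns.heatmap(df_new.corr(), annot=True, cmap='coolwarm')
plt.show()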

Observation: 1. Credit_History is positively correlated with the target. 2. Some columns are negatively correlated.

Building Machine Learning Models

Splitting X and Y Dataset

x = df_new.iloc[:,0:-1]
y = df_new.iloc[:,-1]

x.shape
(548, 11)
y.shape
(548,)

Finding the best random state

The random_state parameter initializes the internal random number generator, which decides how the data is split into train and test indices. If random_state is None or np.random, a randomly-initialized RandomState object is used.

maxScore = 0
maxRS = 0
for i in range(1, 100):
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.23, random_state=i)
    lr = LogisticRegression()
    lr.fit(x_train, y_train)
    prec = lr.predict(x_test)
    acc = accuracy_score(y_test, prec)
    if acc > maxScore:
        maxScore = acc
        maxRS = i
print(f"Best random state :{maxRS} and accuracy_score:{maxScore}")

Best random state :29 and accuracy_score:0.889763779527559

Standard Scaler

scale = StandardScaler()
x =scale.fit_transform(x)

Splitting the training and testing data

x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.23,random_state=29)
# Size of the training and testing dataset
print("X_train",x_train.shape)
print("y_train",y_train.shape)
print("x_test",x_test.shape)
print("y_test",y_test.shape)
X_train (421, 11)
y_train (421,)
x_test (127, 11)
y_test (127,)

A function to check the metrics of a model.

def metrics_model(y_test, prec):
    print("Accuracy score is", accuracy_score(y_test, prec))
    print("."*80)
    print("Confusion matrix value is:\n", confusion_matrix(y_test, prec))
    print("."*80)
    print(classification_report(y_test, prec))

Logistic Regression

Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable, although many more complex extensions exist. In regression analysis, logistic regression (or logit regression) estimates the parameters of a logistic model (a form of binary regression).
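As a small illustration of the logistic function itself (a generic sketch, not part of the original notebook), the model passes a linear combination of the features through a sigmoid to get a probability between 0 and 1:

# The logistic (sigmoid) function maps any real-valued score to a probability in (0, 1)
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))     # 0.5, the decision boundary
print(sigmoid(2.0))   # ~0.88, likely approved
print(sigmoid(-2.0))  # ~0.12, likely rejected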

Code below:
# Logistic Regression
lr = LogisticRegression()
lr.fit(x_train,y_train)
prec= lr.predict(x_test)
metrics_model(y_test,prec)
Accuracy score is 0.889763779527559
................................................................................
Confusion matrix value is:
[[22 13]
[ 1 91]]
................................................................................
precision recall f1-score support

0 0.96 0.63 0.76 35
1 0.88 0.99 0.93 92

accuracy 0.89 127
macro avg 0.92 0.81 0.84 127
weighted avg 0.90 0.89 0.88 127

Cross validation function

Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation

# cross Validation
cross = cross_val_score(LogisticRegression(),x,y,cv=5)
cross.mean()
0.8212844036697249

Checking all other classification models now

Creating a list of classification model instances

dtc = DecisionTreeClassifier()
Kn = KNeighborsClassifier(n_neighbors=5)
sv = SVC()
# model instance list created
model = [dtc, Kn, sv]
for m in model:
    print(type(m))
    print("")
    m.fit(x_train, y_train)
    prec = m.predict(x_test)
    metrics_model(y_test, prec)
<class 'sklearn.tree._classes.DecisionTreeClassifier'>

Accuracy score is 0.7244094488188977
................................................................................
Confusion matrix value is:
[[25 10]
[25 67]]
................................................................................
precision recall f1-score support

0 0.50 0.71 0.59 35
1 0.87 0.73 0.79 92

accuracy 0.72 127
macro avg 0.69 0.72 0.69 127
weighted avg 0.77 0.72 0.74 127

<class 'sklearn.neighbors._classification.KNeighborsClassifier'>

Accuracy score is 0.889763779527559
................................................................................
Confusion matrix value is:
[[22 13]
[ 1 91]]
................................................................................
precision recall f1-score support

0 0.96 0.63 0.76 35
1 0.88 0.99 0.93 92

accuracy 0.89 127
macro avg 0.92 0.81 0.84 127
weighted avg 0.90 0.89 0.88 127

<class 'sklearn.svm._classes.SVC'>

Accuracy score is 0.8818897637795275
................................................................................
Confusion matrix value is:
[[21 14]
[ 1 91]]
................................................................................
precision recall f1-score support

0 0.95 0.60 0.74 35
1 0.87 0.99 0.92 92

accuracy 0.88 127
macro avg 0.91 0.79 0.83 127
weighted avg 0.89 0.88 0.87 127

Cross-validating all the models

# DecisionTreeClassifier
cross = cross_val_score(DecisionTreeClassifier(),x,y,cv=5)
cross.mean()
0.7319099249374478
# KNeighborsClassifier
cross = cross_val_score(KNeighborsClassifier(),x,y,cv=5)
cross.mean()
0.8030358632193495
# SVC
cross = cross_val_score(SVC(),x,y,cv=5)
cross.mean()
0.8175979983319432

Observation:
Comparing each model's test accuracy with its cross-validation score above, the Decision Tree classifier shows the smallest gap between the two, so we choose the Decision Tree classifier for hyperparameter tuning.

Ensemble methods

# Bagging
from sklearn.ensemble import RandomForestClassifier
rf =RandomForestClassifier(n_estimators=100,random_state=42)
rf.fit(x_train,y_train)
prec =rf.predict(x_test)
metrics_model(y_test,prec)
Accuracy score is 0.8661417322834646
................................................................................
Confusion matrix value is:
[[22 13]
[ 4 88]]
................................................................................
precision recall f1-score support

0 0.85 0.63 0.72 35
1 0.87 0.96 0.91 92

accuracy 0.87 127
macro avg 0.86 0.79 0.82 127
weighted avg 0.86 0.87 0.86 127
# RandomForestClassifier cross validation
cross = cross_val_score(RandomForestClassifier(),x,y,cv=5)
cross.mean()
# Boosting algorithm
from sklearn.ensemble import AdaBoostClassifier
adb =AdaBoostClassifier()
adb.fit(x_train,y_train)
prec=adb.predict(x_test)
metrics_model(y_test,prec)
Accuracy score is 0.84251968503937
................................................................................
Confusion matrix value is:
[[21 14]
[ 6 86]]
................................................................................
precision recall f1-score support

0 0.78 0.60 0.68 35
1 0.86 0.93 0.90 92

accuracy 0.84 127
macro avg 0.82 0.77 0.79 127
weighted avg 0.84 0.84 0.84 127
# AdaBoostClassifier cross validation
cross = cross_val_score(AdaBoostClassifier(),x,y,cv=5)
cross.mean()
0.7903085904920767

Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV
param = {'criterion': ['gini', 'entropy'],
         'splitter': ['best', 'random']}
gr = GridSearchCV(DecisionTreeClassifier(),param,cv=5)
gr.fit(x_train,y_train)
gr.best_params_
dct = DecisionTreeClassifier(criterion='entropy',splitter='best')
dct.fit(x_train,y_train)
prec = dct.predict(x_test)
metrics_model(y_test,prec)
Accuracy score is 0.7480314960629921
................................................................................
Confusion matrix value is:
[[25 10]
[22 70]]
................................................................................
precision recall f1-score support

0 0.53 0.71 0.61 35
1 0.88 0.76 0.81 92

accuracy 0.75 127
macro avg 0.70 0.74 0.71 127
weighted avg 0.78 0.75 0.76 127

Concluding Remarks:

The KNN model works well compared to all the other models.

So we choose the KNN algorithm for this dataset.

Metrics: AUC-ROC Curve

# lib
from sklearn.metrics import roc_curve,roc_auc_score,plot_roc_curve
Y_pred_pb =Kn.predict_proba(x_test)[:,1]
roc_auc_score(y_test,Y_pred_pb)
0.8523291925465839
plot_roc_curve(Kn,x_test,y_test)
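One caveat if you run this on a newer scikit-learn: plot_roc_curve was removed in scikit-learn 1.2, so on recent versions the equivalent call is roughly:

# Replacement for plot_roc_curve on scikit-learn >= 1.2
from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_estimator(Kn, x_test, y_test)
plt.show()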

Saving the model

import joblib
joblib.dump(dct,"Loan_Application.obj")
# loading the model from file
job = joblib.load("Loan_Application.obj")
job
prec = job.predict(x_test)
np.unique(prec)
