Case Study: Loan Payback Prediction

Introduction

LendingClub is a US peer-to-peer lending company, headquartered in San Francisco, California. It was the first peer-to-peer lender to register its offerings as securities with the Securities and Exchange Commission (SEC), and to offer loan trading on a secondary market. LendingClub is the world's largest peer-to-peer lending platform.

Problem:

  • Given historical data on loans that were issued, with information on whether or not the borrower defaulted (charged off), build a model that can predict whether or not a borrower will pay back their loan. This way, in the future, when a new potential customer applies, the company can assess whether or not they are likely to pay back the loan.

Dataset:

  • Historical data of LendingClub

Source: https://www.lendingclub.com/

Review

An artificial neural network (ANN) is an interconnected group of nodes, inspired by a simplification of neurons in a brain. In the usual diagram, each circular node represents an artificial neuron and an arrow represents a connection from the output of one artificial neuron to the input of another.

ANNs are composed of artificial neurons which are conceptually derived from biological neurons. Each artificial neuron has inputs and produces a single output which can be sent to multiple other neurons. The inputs can be the feature values of a sample of external data, such as images or documents, or they can be the outputs of other neurons. The outputs of the final output neurons of the neural net accomplish the task, such as recognizing an object in an image.
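
As a minimal illustration of the idea (a sketch, not part of the original notebook), a single artificial neuron computes a weighted sum of its inputs plus a bias and passes the result through an activation function such as the sigmoid:

# A single artificial neuron: weighted sum of inputs plus a bias,
# passed through a sigmoid activation (illustrative sketch only)
import numpy as np

def neuron(inputs, weights, bias):
    z = np.dot(inputs, weights) + bias  # weighted sum
    return 1 / (1 + np.exp(-z))         # sigmoid activation

# Example: three inputs feeding a single neuron
neuron(np.array([0.5, 0.1, 0.9]), np.array([0.4, -0.2, 0.7]), bias=0.1)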

Libraries And Data Importation

In [3]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [4]:
# Import data
df = pd.read_csv('lending_club_loan.csv')

Data Exploration

In [3]:
# Check dataframe
df.head()
Out[3]:
loan_amnt term int_rate installment grade sub_grade emp_title emp_length home_ownership annual_inc ... open_acc pub_rec revol_bal revol_util total_acc initial_list_status application_type mort_acc pub_rec_bankruptcies address
0 10000.0 36 months 11.44 329.48 B B4 Marketing 10+ years RENT 117000.0 ... 16.0 0.0 36369.0 41.8 25.0 w INDIVIDUAL 0.0 0.0 0174 Michelle Gateway\r\nMendozaberg, OK 22690
1 8000.0 36 months 11.99 265.68 B B5 Credit analyst 4 years MORTGAGE 65000.0 ... 17.0 0.0 20131.0 53.3 27.0 f INDIVIDUAL 3.0 0.0 1076 Carney Fort Apt. 347\r\nLoganmouth, SD 05113
2 15600.0 36 months 10.49 506.97 B B3 Statistician < 1 year RENT 43057.0 ... 13.0 0.0 11987.0 92.2 26.0 f INDIVIDUAL 0.0 0.0 87025 Mark Dale Apt. 269\r\nNew Sabrina, WV 05113
3 7200.0 36 months 6.49 220.65 A A2 Client Advocate 6 years RENT 54000.0 ... 6.0 0.0 5472.0 21.5 13.0 f INDIVIDUAL 0.0 0.0 823 Reid Ford\r\nDelacruzside, MA 00813
4 24375.0 60 months 17.27 609.33 C C5 Destiny Management Inc. 9 years MORTGAGE 55000.0 ... 13.0 0.0 24584.0 69.8 43.0 f INDIVIDUAL 1.0 0.0 679 Luna Roads\r\nGreggshire, VA 11650

5 rows × 27 columns

In [4]:
# Check data statistics
df.describe()
Out[4]:
loan_amnt int_rate installment annual_inc dti open_acc pub_rec revol_bal revol_util total_acc mort_acc pub_rec_bankruptcies
count 396030.000000 396030.000000 396030.000000 3.960300e+05 396030.000000 396030.000000 396030.000000 3.960300e+05 395754.000000 396030.000000 358235.000000 395495.000000
mean 14113.888089 13.639400 431.849698 7.420318e+04 17.379514 11.311153 0.178191 1.584454e+04 53.791749 25.414744 1.813991 0.121648
std 8357.441341 4.472157 250.727790 6.163762e+04 18.019092 5.137649 0.530671 2.059184e+04 24.452193 11.886991 2.147930 0.356174
min 500.000000 5.320000 16.080000 0.000000e+00 0.000000 0.000000 0.000000 0.000000e+00 0.000000 2.000000 0.000000 0.000000
25% 8000.000000 10.490000 250.330000 4.500000e+04 11.280000 8.000000 0.000000 6.025000e+03 35.800000 17.000000 0.000000 0.000000
50% 12000.000000 13.330000 375.430000 6.400000e+04 16.910000 10.000000 0.000000 1.118100e+04 54.800000 24.000000 1.000000 0.000000
75% 20000.000000 16.490000 567.300000 9.000000e+04 22.980000 14.000000 0.000000 1.962000e+04 72.900000 32.000000 3.000000 0.000000
max 40000.000000 30.990000 1533.810000 8.706582e+06 9999.000000 90.000000 86.000000 1.743266e+06 892.300000 151.000000 34.000000 8.000000
In [5]:
# Check more info
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 396030 entries, 0 to 396029
Data columns (total 27 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   loan_amnt             396030 non-null  float64
 1   term                  396030 non-null  object 
 2   int_rate              396030 non-null  float64
 3   installment           396030 non-null  float64
 4   grade                 396030 non-null  object 
 5   sub_grade             396030 non-null  object 
 6   emp_title             373103 non-null  object 
 7   emp_length            377729 non-null  object 
 8   home_ownership        396030 non-null  object 
 9   annual_inc            396030 non-null  float64
 10  verification_status   396030 non-null  object 
 11  issue_d               396030 non-null  object 
 12  loan_status           396030 non-null  object 
 13  purpose               396030 non-null  object 
 14  title                 394275 non-null  object 
 15  dti                   396030 non-null  float64
 16  earliest_cr_line      396030 non-null  object 
 17  open_acc              396030 non-null  float64
 18  pub_rec               396030 non-null  float64
 19  revol_bal             396030 non-null  float64
 20  revol_util            395754 non-null  float64
 21  total_acc             396030 non-null  float64
 22  initial_list_status   396030 non-null  object 
 23  application_type      396030 non-null  object 
 24  mort_acc              358235 non-null  float64
 25  pub_rec_bankruptcies  395495 non-null  float64
 26  address               396030 non-null  object 
dtypes: float64(12), object(15)
memory usage: 81.6+ MB

Data Visualization

In [6]:
# Visualize data
sns.countplot(x = 'loan_status', data = df)
plt.show()
In [7]:
# Visualize data
plt.figure(figsize = (12,4))
sns.distplot(df['loan_amnt'], kde = False, bins = 40)
plt.show()
In [8]:
# Check data correlation
df.corr()
Out[8]:
loan_amnt int_rate installment annual_inc dti open_acc pub_rec revol_bal revol_util total_acc mort_acc pub_rec_bankruptcies
loan_amnt 1.000000 0.168921 0.953929 0.336887 0.016636 0.198556 -0.077779 0.328320 0.099911 0.223886 0.222315 -0.106539
int_rate 0.168921 1.000000 0.162758 -0.056771 0.079038 0.011649 0.060986 -0.011280 0.293659 -0.036404 -0.082583 0.057450
installment 0.953929 0.162758 1.000000 0.330381 0.015786 0.188973 -0.067892 0.316455 0.123915 0.202430 0.193694 -0.098628
annual_inc 0.336887 -0.056771 0.330381 1.000000 -0.081685 0.136150 -0.013720 0.299773 0.027871 0.193023 0.236320 -0.050162
dti 0.016636 0.079038 0.015786 -0.081685 1.000000 0.136181 -0.017639 0.063571 0.088375 0.102128 -0.025439 -0.014558
open_acc 0.198556 0.011649 0.188973 0.136150 0.136181 1.000000 -0.018392 0.221192 -0.131420 0.680728 0.109205 -0.027732
pub_rec -0.077779 0.060986 -0.067892 -0.013720 -0.017639 -0.018392 1.000000 -0.101664 -0.075910 0.019723 0.011552 0.699408
revol_bal 0.328320 -0.011280 0.316455 0.299773 0.063571 0.221192 -0.101664 1.000000 0.226346 0.191616 0.194925 -0.124532
revol_util 0.099911 0.293659 0.123915 0.027871 0.088375 -0.131420 -0.075910 0.226346 1.000000 -0.104273 0.007514 -0.086751
total_acc 0.223886 -0.036404 0.202430 0.193023 0.102128 0.680728 0.019723 0.191616 -0.104273 1.000000 0.381072 0.042035
mort_acc 0.222315 -0.082583 0.193694 0.236320 -0.025439 0.109205 0.011552 0.194925 0.007514 0.381072 1.000000 0.027239
pub_rec_bankruptcies -0.106539 0.057450 -0.098628 -0.050162 -0.014558 -0.027732 0.699408 -0.124532 -0.086751 0.042035 0.027239 1.000000
In [9]:
# Visualize data correlation
plt.figure(figsize = (12, 7))
sns.heatmap(df.corr(), annot = True, cmap = 'viridis')
plt.show()
In [10]:
# Visualize data
sns.scatterplot(x = 'installment', y = 'loan_amnt', data = df)
plt.show()
In [11]:
# Visualize data
sns.boxplot(x = 'loan_status', y = 'loan_amnt', data = df)
plt.show()
In [12]:
# Check summary statistics
df.groupby('loan_status')['loan_amnt'].describe()
Out[12]:
count mean std min 25% 50% 75% max
loan_status
Charged Off 77673.0 15126.300967 8505.090557 1000.0 8525.0 14000.0 20000.0 40000.0
Fully Paid 318357.0 13866.878771 8302.319699 500.0 7500.0 12000.0 19225.0 40000.0
In [13]:
# Check for unique values
df['grade'].unique()
Out[13]:
array(['B', 'A', 'C', 'E', 'D', 'F', 'G'], dtype=object)
In [14]:
# Check for unique values
df['sub_grade'].unique()
Out[14]:
array(['B4', 'B5', 'B3', 'A2', 'C5', 'C3', 'A1', 'B2', 'C1', 'A5', 'E4',
       'A4', 'A3', 'D1', 'C2', 'B1', 'D3', 'D5', 'D2', 'E1', 'E2', 'E5',
       'F4', 'E3', 'D4', 'G1', 'F5', 'G2', 'C4', 'F1', 'F3', 'G5', 'G4',
       'F2', 'G3'], dtype=object)
In [15]:
# Visualize data
sns.countplot(x = 'grade', data = df, hue = 'loan_status')
plt.show()
In [16]:
# Visualize data
plt.figure(figsize = (12, 4))
subgrade_order = sorted(df['sub_grade'].unique())
sns.countplot(x = 'sub_grade', data = df, order = subgrade_order, palette = 'coolwarm')
plt.show()
In [17]:
# Visualize data
plt.figure(figsize = (12, 4))
subgrade_order = sorted(df['sub_grade'].unique())
sns.countplot(x = 'sub_grade', data = df, order = subgrade_order, palette = 'coolwarm', hue = 'loan_status')
plt.show()
In [18]:
# Visualize data
f_and_g = df[(df['grade'] == 'G') | (df['grade'] == 'F')]

plt.figure(figsize = (12, 4))
subgrade_order = sorted(f_and_g['sub_grade'].unique())
sns.countplot(x = 'sub_grade', data = f_and_g, order = subgrade_order, palette = 'coolwarm', hue = 'loan_status')
plt.show()
In [19]:
# Create new column
df['loan_repaid'] = df['loan_status'].map({'Fully Paid': 1, 'Charged Off': 0})
In [20]:
# Check dataframe
df[['loan_repaid', 'loan_status']]
Out[20]:
loan_repaid loan_status
0 1 Fully Paid
1 1 Fully Paid
2 1 Fully Paid
3 1 Fully Paid
4 0 Charged Off
... ... ...
396025 1 Fully Paid
396026 1 Fully Paid
396027 1 Fully Paid
396028 1 Fully Paid
396029 1 Fully Paid

396030 rows × 2 columns

In [21]:
# Visualize correlation impact
plt.figure(figsize = (8, 4))
df.corr()['loan_repaid'].sort_values().drop('loan_repaid').plot(kind = 'bar')
plt.show()

Data Preprocessing

Feature Engineering

In [22]:
# Check dataframe length
len(df)
Out[22]:
396030
In [23]:
# Check for missing values 
df.isnull().mean()
Out[23]:
loan_amnt               0.000000
term                    0.000000
int_rate                0.000000
installment             0.000000
grade                   0.000000
sub_grade               0.000000
emp_title               0.057892
emp_length              0.046211
home_ownership          0.000000
annual_inc              0.000000
verification_status     0.000000
issue_d                 0.000000
loan_status             0.000000
purpose                 0.000000
title                   0.004431
dti                     0.000000
earliest_cr_line        0.000000
open_acc                0.000000
pub_rec                 0.000000
revol_bal               0.000000
revol_util              0.000697
total_acc               0.000000
initial_list_status     0.000000
application_type        0.000000
mort_acc                0.095435
pub_rec_bankruptcies    0.001351
address                 0.000000
loan_repaid             0.000000
dtype: float64
In [24]:
# Check the percentage of missing data
(df.isnull().sum() / len(df)) * 100 
Out[24]:
loan_amnt               0.000000
term                    0.000000
int_rate                0.000000
installment             0.000000
grade                   0.000000
sub_grade               0.000000
emp_title               5.789208
emp_length              4.621115
home_ownership          0.000000
annual_inc              0.000000
verification_status     0.000000
issue_d                 0.000000
loan_status             0.000000
purpose                 0.000000
title                   0.443148
dti                     0.000000
earliest_cr_line        0.000000
open_acc                0.000000
pub_rec                 0.000000
revol_bal               0.000000
revol_util              0.069692
total_acc               0.000000
initial_list_status     0.000000
application_type        0.000000
mort_acc                9.543469
pub_rec_bankruptcies    0.135091
address                 0.000000
loan_repaid             0.000000
dtype: float64
In [25]:
# Check data unique counts
df['emp_title'].nunique()
Out[25]:
173105
In [26]:
# Check value counts
df['emp_title'].value_counts()
Out[26]:
Teacher                              4389
Manager                              4250
Registered Nurse                     1856
RN                                   1846
Supervisor                           1830
                                     ... 
Kforce-Lockheed Martin                  1
Analytical technician                   1
Florida Department of Agriculture       1
Solid surfe fabricator                  1
Cheyenne Mountain Zoo                   1
Name: emp_title, Length: 173105, dtype: int64
In [27]:
# Drop column
df = df.drop('emp_title', axis = 1)
In [28]:
# Check and sort data column
sorted(df['emp_length'].dropna().unique())
Out[28]:
['1 year',
 '10+ years',
 '2 years',
 '3 years',
 '4 years',
 '5 years',
 '6 years',
 '7 years',
 '8 years',
 '9 years',
 '< 1 year']
In [29]:
# Rearrange the categories in chronological order
emp_length_order = [
 '< 1 year',
 '1 year',
 '2 years',
 '3 years',
 '4 years',
 '5 years',
 '6 years',
 '7 years',
 '8 years',
 '9 years',
 '10+ years']
In [30]:
# Visualize data
plt.figure(figsize = (12, 3))
sns.countplot(x = 'emp_length', data = df, order = emp_length_order, hue = 'loan_status')
plt.show()
In [31]:
# Create filtered dataframe
emp_co = df[df['loan_status'] == 'Charged Off'].groupby('emp_length').count()['loan_status']

# Check dataframe
emp_co
Out[31]:
emp_length
1 year        5154
10+ years    23215
2 years       6924
3 years       6182
4 years       4608
5 years       5092
6 years       3943
7 years       4055
8 years       3829
9 years       3070
< 1 year      6563
Name: loan_status, dtype: int64
In [32]:
# Create filtered dataframe
emp_fp = df[df['loan_status'] == 'Fully Paid'].groupby('emp_length').count()['loan_status']

# Check dataframe
emp_fp
Out[32]:
emp_length
1 year        20728
10+ years    102826
2 years       28903
3 years       25483
4 years       19344
5 years       21403
6 years       16898
7 years       16764
8 years       15339
9 years       12244
< 1 year      25162
Name: loan_status, dtype: int64
In [33]:
# Check ratio
emp_co / emp_fp
Out[33]:
emp_length
1 year       0.248649
10+ years    0.225770
2 years      0.239560
3 years      0.242593
4 years      0.238213
5 years      0.237911
6 years      0.233341
7 years      0.241887
8 years      0.249625
9 years      0.250735
< 1 year     0.260830
Name: loan_status, dtype: float64
In [34]:
# Compute the fraction of charged-off loans per employment length
emp_len = emp_co / (emp_co + emp_fp)
In [35]:
# Visualize data
emp_len.plot(kind = 'bar')
plt.show()
In [36]:
# Drop column
df = df.drop('emp_length', axis = 1)
In [37]:
# Check data column
df['purpose'].head(10)
Out[37]:
0              vacation
1    debt_consolidation
2           credit_card
3           credit_card
4           credit_card
5    debt_consolidation
6      home_improvement
7           credit_card
8    debt_consolidation
9    debt_consolidation
Name: purpose, dtype: object
In [38]:
# Check data column
df['title'].head(10)
Out[38]:
0                   Vacation
1         Debt consolidation
2    Credit card refinancing
3    Credit card refinancing
4      Credit Card Refinance
5         Debt consolidation
6           Home improvement
7       No More Credit Cards
8         Debt consolidation
9         Debt Consolidation
Name: title, dtype: object
In [39]:
# Drop column
df = df.drop('title', axis = 1)
In [40]:
# Check data column value counts
df['mort_acc'].value_counts()
Out[40]:
0.0     139777
1.0      60416
2.0      49948
3.0      38049
4.0      27887
5.0      18194
6.0      11069
7.0       6052
8.0       3121
9.0       1656
10.0       865
11.0       479
12.0       264
13.0       146
14.0       107
15.0        61
16.0        37
17.0        22
18.0        18
19.0        15
20.0        13
24.0        10
22.0         7
21.0         4
25.0         4
27.0         3
23.0         2
32.0         2
26.0         2
31.0         2
30.0         1
28.0         1
34.0         1
Name: mort_acc, dtype: int64
In [41]:
# Check specified correlation
df.corr()['mort_acc'].sort_values()
Out[41]:
int_rate               -0.082583
dti                    -0.025439
revol_util              0.007514
pub_rec                 0.011552
pub_rec_bankruptcies    0.027239
loan_repaid             0.073111
open_acc                0.109205
installment             0.193694
revol_bal               0.194925
loan_amnt               0.222315
annual_inc              0.236320
total_acc               0.381072
mort_acc                1.000000
Name: mort_acc, dtype: float64
In [42]:
# Compare the mean of each numeric column, grouped by total_acc
df.groupby('total_acc').mean()
Out[42]:
loan_amnt int_rate installment annual_inc dti open_acc pub_rec revol_bal revol_util mort_acc pub_rec_bankruptcies loan_repaid
total_acc
2.0 6672.222222 15.801111 210.881667 64277.777778 2.279444 1.611111 0.000000 2860.166667 53.527778 0.000000 0.000000 0.777778
3.0 6042.966361 15.615566 198.728318 41270.753884 6.502813 2.611621 0.045872 3382.807339 49.991022 0.052023 0.015480 0.779817
4.0 7587.399031 15.069491 250.050194 42426.565969 8.411963 3.324717 0.041195 4874.231826 58.477400 0.066743 0.022951 0.785945
5.0 7845.734714 14.917564 256.190325 44394.098003 10.118328 3.921598 0.071499 5475.253452 56.890311 0.103289 0.041171 0.796844
6.0 8529.019843 14.651752 278.518228 48470.001156 11.222542 4.511119 0.104003 6546.374957 57.812483 0.151293 0.055077 0.784126
... ... ... ... ... ... ... ... ... ... ... ... ...
124.0 23200.000000 17.860000 587.370000 66000.000000 14.040000 43.000000 0.000000 25497.000000 75.400000 1.000000 0.000000 0.000000
129.0 25000.000000 7.890000 505.600000 200000.000000 8.900000 48.000000 0.000000 27659.000000 8.300000 1.000000 0.000000 1.000000
135.0 24000.000000 15.410000 576.140000 82000.000000 33.850000 57.000000 0.000000 35715.000000 50.800000 3.000000 0.000000 1.000000
150.0 35000.000000 8.670000 1107.630000 189000.000000 6.630000 40.000000 0.000000 39065.000000 44.400000 2.000000 0.000000 1.000000
151.0 35000.000000 13.990000 1196.050000 160000.000000 12.650000 26.000000 0.000000 46643.000000 71.500000 0.000000 0.000000 0.000000

118 rows × 12 columns

In [43]:
# Create a Series of the mean mort_acc for each total_acc value
total_acc_avg = df.groupby('total_acc').mean()['mort_acc']

# Check dataframe
total_acc_avg
Out[43]:
total_acc
2.0      0.000000
3.0      0.052023
4.0      0.066743
5.0      0.103289
6.0      0.151293
           ...   
124.0    1.000000
129.0    1.000000
135.0    3.000000
150.0    2.000000
151.0    0.000000
Name: mort_acc, Length: 118, dtype: float64
In [44]:
# Fill a missing mort_acc value with the average mort_acc
# of loans that have the same total_acc
def fill_mort_acc(total_acc, mort_acc):
    if np.isnan(mort_acc):
        return total_acc_avg[total_acc]
    else:
        return mort_acc
In [45]:
df['mort_acc'] = df.apply(lambda x: fill_mort_acc(x['total_acc'], x['mort_acc']), axis = 1) # Apply the function row-wise, passing both columns of each row
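
Note: the same imputation can also be written without the row-wise apply. A vectorized alternative (an equivalent sketch, not what was run above) maps each total_acc to its group average and uses it only where mort_acc is missing:

# Vectorized alternative (sketch): fill missing mort_acc values with the
# group average looked up from total_acc_avg
df['mort_acc'] = df['mort_acc'].fillna(df['total_acc'].map(total_acc_avg))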
In [46]:
# Check for missing values
df.isnull().mean()
Out[46]:
loan_amnt               0.000000
term                    0.000000
int_rate                0.000000
installment             0.000000
grade                   0.000000
sub_grade               0.000000
home_ownership          0.000000
annual_inc              0.000000
verification_status     0.000000
issue_d                 0.000000
loan_status             0.000000
purpose                 0.000000
dti                     0.000000
earliest_cr_line        0.000000
open_acc                0.000000
pub_rec                 0.000000
revol_bal               0.000000
revol_util              0.000697
total_acc               0.000000
initial_list_status     0.000000
application_type        0.000000
mort_acc                0.000000
pub_rec_bankruptcies    0.001351
address                 0.000000
loan_repaid             0.000000
dtype: float64
In [47]:
# Drop rows with remaining missing values
df = df.dropna()
In [48]:
# Check for missing values
df.isnull().mean()
Out[48]:
loan_amnt               0.0
term                    0.0
int_rate                0.0
installment             0.0
grade                   0.0
sub_grade               0.0
home_ownership          0.0
annual_inc              0.0
verification_status     0.0
issue_d                 0.0
loan_status             0.0
purpose                 0.0
dti                     0.0
earliest_cr_line        0.0
open_acc                0.0
pub_rec                 0.0
revol_bal               0.0
revol_util              0.0
total_acc               0.0
initial_list_status     0.0
application_type        0.0
mort_acc                0.0
pub_rec_bankruptcies    0.0
address                 0.0
loan_repaid             0.0
dtype: float64

Categorical Encoding

In [49]:
# Check string columns
df.select_dtypes(['object']).columns
Out[49]:
Index(['term', 'grade', 'sub_grade', 'home_ownership', 'verification_status',
       'issue_d', 'loan_status', 'purpose', 'earliest_cr_line',
       'initial_list_status', 'application_type', 'address'],
      dtype='object')
In [50]:
# Check value counts
df['term'].value_counts()
Out[50]:
 36 months    301247
 60 months     93972
Name: term, dtype: int64
In [51]:
# Extract the number of months from the term string
df['term'] = df['term'].apply(lambda term: int(term[:3]))
In [52]:
# Check data column
df['term'].value_counts()
Out[52]:
36    301247
60     93972
Name: term, dtype: int64
In [53]:
# Drop column
df = df.drop('grade', axis = 1)
In [54]:
# Get dummies
dummies = pd.get_dummies(df['sub_grade'], drop_first = True)

# Remove the original column and concat the dummies
df = pd.concat([df.drop('sub_grade', axis = 1), dummies], axis = 1)
In [55]:
# Check data columns
df.columns
Out[55]:
Index(['loan_amnt', 'term', 'int_rate', 'installment', 'home_ownership',
       'annual_inc', 'verification_status', 'issue_d', 'loan_status',
       'purpose', 'dti', 'earliest_cr_line', 'open_acc', 'pub_rec',
       'revol_bal', 'revol_util', 'total_acc', 'initial_list_status',
       'application_type', 'mort_acc', 'pub_rec_bankruptcies', 'address',
       'loan_repaid', 'A2', 'A3', 'A4', 'A5', 'B1', 'B2', 'B3', 'B4', 'B5',
       'C1', 'C2', 'C3', 'C4', 'C5', 'D1', 'D2', 'D3', 'D4', 'D5', 'E1', 'E2',
       'E3', 'E4', 'E5', 'F1', 'F2', 'F3', 'F4', 'F5', 'G1', 'G2', 'G3', 'G4',
       'G5'],
      dtype='object')
In [56]:
# Get dummies
dummies = pd.get_dummies(df[['verification_status', 'application_type', 'initial_list_status','purpose']], drop_first = True)

# Remove the original columns and concat the dummies
df = pd.concat([df.drop(['verification_status', 'application_type','initial_list_status','purpose'], axis = 1), dummies], axis = 1)
In [57]:
# Check value counts
df['home_ownership'].value_counts()
Out[57]:
MORTGAGE    198022
RENT        159395
OWN          37660
OTHER          110
NONE            29
ANY              3
Name: home_ownership, dtype: int64
In [58]:
# Replace values
df['home_ownership'] = df['home_ownership'].replace(['NONE', 'ANY'], 'OTHER')
In [59]:
# Check value counts
df['home_ownership'].value_counts()
Out[59]:
MORTGAGE    198022
RENT        159395
OWN          37660
OTHER          142
Name: home_ownership, dtype: int64
In [60]:
# Get dummies
dummies = pd.get_dummies(df['home_ownership'], drop_first = True)

# Remove the original column and concat the dummies
df = pd.concat([df.drop('home_ownership', axis = 1), dummies], axis = 1)
In [61]:
# Check dataframe
df.head()
Out[61]:
loan_amnt term int_rate installment annual_inc issue_d loan_status dti earliest_cr_line open_acc ... purpose_medical purpose_moving purpose_other purpose_renewable_energy purpose_small_business purpose_vacation purpose_wedding OTHER OWN RENT
0 10000.0 36 11.44 329.48 117000.0 Jan-2015 Fully Paid 26.24 Jun-1990 16.0 ... 0 0 0 0 0 1 0 0 0 1
1 8000.0 36 11.99 265.68 65000.0 Jan-2015 Fully Paid 22.05 Jul-2004 17.0 ... 0 0 0 0 0 0 0 0 0 0
2 15600.0 36 10.49 506.97 43057.0 Jan-2015 Fully Paid 12.79 Aug-2007 13.0 ... 0 0 0 0 0 0 0 0 0 1
3 7200.0 36 6.49 220.65 54000.0 Nov-2014 Fully Paid 2.60 Sep-2006 6.0 ... 0 0 0 0 0 0 0 0 0 1
4 24375.0 60 17.27 609.33 55000.0 Apr-2013 Charged Off 33.95 Mar-1999 13.0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 73 columns

In [62]:
# Check column
df['address']
Out[62]:
0            0174 Michelle Gateway\r\nMendozaberg, OK 22690
1         1076 Carney Fort Apt. 347\r\nLoganmouth, SD 05113
2         87025 Mark Dale Apt. 269\r\nNew Sabrina, WV 05113
3                   823 Reid Ford\r\nDelacruzside, MA 00813
4                    679 Luna Roads\r\nGreggshire, VA 11650
                                ...                        
396025     12951 Williams Crossing\r\nJohnnyville, DC 30723
396026    0114 Fowler Field Suite 028\r\nRachelborough, ...
396027    953 Matthew Points Suite 414\r\nReedfort, NY 7...
396028    7843 Blake Freeway Apt. 229\r\nNew Michael, FL...
396029        787 Michelle Causeway\r\nBriannaton, AR 48052
Name: address, Length: 395219, dtype: object
In [63]:
# Take last five characters
df['zip_code'] = df['address'].apply(lambda address:address[-5:])

# Check data column
df['zip_code']
Out[63]:
0         22690
1         05113
2         05113
3         00813
4         11650
          ...  
396025    30723
396026    05113
396027    70466
396028    29597
396029    48052
Name: zip_code, Length: 395219, dtype: object
In [64]:
# Check value counts
df['zip_code'].value_counts()
Out[64]:
70466    56880
22690    56413
30723    56402
48052    55811
00813    45725
29597    45393
05113    45300
11650    11210
93700    11126
86630    10959
Name: zip_code, dtype: int64
In [65]:
# Get dummies
dummies = pd.get_dummies(df['zip_code'], drop_first = True)

# Remove the original column and concat the dummies
df = pd.concat([df.drop('zip_code', axis = 1), dummies], axis = 1)

# Drop the original column
df = df.drop('address', axis = 1)
In [66]:
# Drop column
df = df.drop('issue_d', axis = 1)
In [67]:
# Check data column
df['earliest_cr_line']
Out[67]:
0         Jun-1990
1         Jul-2004
2         Aug-2007
3         Sep-2006
4         Mar-1999
            ...   
396025    Nov-2004
396026    Feb-2006
396027    Mar-1997
396028    Nov-1990
396029    Sep-1998
Name: earliest_cr_line, Length: 395219, dtype: object
In [68]:
# Extract the year (last four characters) as an integer
df['earliest_cr_line'] = df['earliest_cr_line'].apply(lambda date: int(date[-4:]))
In [69]:
# Check data column
df['earliest_cr_line']
Out[69]:
0         1990
1         2004
2         2007
3         2006
4         1999
          ... 
396025    2004
396026    2006
396027    1997
396028    1990
396029    1998
Name: earliest_cr_line, Length: 395219, dtype: int64

Data Splitting And Scaling

In [72]:
# Drop data column
df = df.drop('loan_status', axis = 1)
In [73]:
# Set the x and y variables
X = df.drop('loan_repaid', axis = 1).values
y = df['loan_repaid'].values
In [74]:
# Import library
from sklearn.model_selection import train_test_split

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
In [75]:
# Import library
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

# Normalize the data
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Model Creation

In [76]:
# Import libraries
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation,Dropout
from tensorflow.keras.constraints import max_norm
In [77]:
# Create model architecture
model = Sequential()

# Input layer
model.add(Dense(78,  activation='relu'))
model.add(Dropout(0.2))

# Hidden layer
model.add(Dense(39, activation='relu'))
model.add(Dropout(0.2))

# Hidden layer
model.add(Dense(19, activation='relu'))
model.add(Dropout(0.2))

# Output layer
model.add(Dense(units=1,activation='sigmoid'))

# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam')
In [78]:
# Train the model
model.fit(x=X_train, 
          y=y_train, 
          epochs=25,
          batch_size=256,
          validation_data=(X_test, y_test), 
          )
Epoch 1/25
1236/1236 [==============================] - 8s 7ms/step - loss: 0.3001 - val_loss: 0.2628
Epoch 2/25
1236/1236 [==============================] - 8s 7ms/step - loss: 0.2658 - val_loss: 0.2619
Epoch 3/25
1236/1236 [==============================] - 9s 7ms/step - loss: 0.2633 - val_loss: 0.2603
Epoch 4/25
1236/1236 [==============================] - 8s 7ms/step - loss: 0.2625 - val_loss: 0.2596
Epoch 5/25
1236/1236 [==============================] - 10s 8ms/step - loss: 0.2617 - val_loss: 0.2597
Epoch 6/25
1236/1236 [==============================] - 10s 8ms/step - loss: 0.2609 - val_loss: 0.2599
Epoch 7/25
1236/1236 [==============================] - 9s 7ms/step - loss: 0.2605 - val_loss: 0.2592
Epoch 8/25
1236/1236 [==============================] - 10s 8ms/step - loss: 0.2600 - val_loss: 0.2590
Epoch 9/25
1236/1236 [==============================] - 9s 8ms/step - loss: 0.2598 - val_loss: 0.2594
Epoch 10/25
1236/1236 [==============================] - 9s 7ms/step - loss: 0.2596 - val_loss: 0.2586
Epoch 11/25
1236/1236 [==============================] - 9s 8ms/step - loss: 0.2596 - val_loss: 0.2589
Epoch 12/25
1236/1236 [==============================] - 10s 8ms/step - loss: 0.2589 - val_loss: 0.2589
Epoch 13/25
1236/1236 [==============================] - 10s 8ms/step - loss: 0.2589 - val_loss: 0.2586
Epoch 14/25
1236/1236 [==============================] - 11s 9ms/step - loss: 0.2586 - val_loss: 0.2587
Epoch 15/25
1236/1236 [==============================] - 11s 9ms/step - loss: 0.2584 - val_loss: 0.2584
Epoch 16/25
1236/1236 [==============================] - 11s 9ms/step - loss: 0.2580 - val_loss: 0.2586
Epoch 17/25
1236/1236 [==============================] - 12s 10ms/step - loss: 0.2579 - val_loss: 0.2593
Epoch 18/25
1236/1236 [==============================] - 11s 9ms/step - loss: 0.2577 - val_loss: 0.2588
Epoch 19/25
1236/1236 [==============================] - 12s 10ms/step - loss: 0.2575 - val_loss: 0.2586
Epoch 20/25
1236/1236 [==============================] - 17s 14ms/step - loss: 0.2572 - val_loss: 0.2587
Epoch 21/25
1236/1236 [==============================] - 15s 12ms/step - loss: 0.2575 - val_loss: 0.2583
Epoch 22/25
1236/1236 [==============================] - 13s 10ms/step - loss: 0.2569 - val_loss: 0.2582
Epoch 23/25
1236/1236 [==============================] - 13s 11ms/step - loss: 0.2569 - val_loss: 0.2583
Epoch 24/25
1236/1236 [==============================] - 13s 10ms/step - loss: 0.2568 - val_loss: 0.2584
Epoch 25/25
1236/1236 [==============================] - 15s 12ms/step - loss: 0.2566 - val_loss: 0.2586
Out[78]:
<tensorflow.python.keras.callbacks.History at 0x1d612046d00>

Model Evaluation

In [79]:
# Check dataframe
pd.DataFrame(model.history.history)
Out[79]:
loss val_loss
0 0.300101 0.262768
1 0.265768 0.261880
2 0.263329 0.260314
3 0.262459 0.259649
4 0.261652 0.259654
5 0.260856 0.259863
6 0.260470 0.259188
7 0.259991 0.259018
8 0.259822 0.259445
9 0.259569 0.258644
10 0.259577 0.258945
11 0.258858 0.258921
12 0.258878 0.258647
13 0.258598 0.258740
14 0.258404 0.258425
15 0.258042 0.258552
16 0.257948 0.259319
17 0.257684 0.258779
18 0.257482 0.258589
19 0.257214 0.258653
20 0.257455 0.258279
21 0.256869 0.258192
22 0.256862 0.258293
23 0.256762 0.258402
24 0.256617 0.258631
In [82]:
# Put the training history into a dataframe
losses = pd.DataFrame(model.history.history)

# Plot the evaluation metrics
losses[['loss','val_loss']].plot()
plt.show()
In [85]:
# Import library
from sklearn.metrics import classification_report,confusion_matrix

# Set prediction
predictions = model.predict_classes(X_test)
print(classification_report(y_test,predictions))
              precision    recall  f1-score   support

           0       0.95      0.45      0.61     15421
           1       0.88      0.99      0.94     63623

    accuracy                           0.89     79044
   macro avg       0.92      0.72      0.77     79044
weighted avg       0.90      0.89      0.87     79044

In [86]:
# Check confusion matrix
confusion_matrix(y_test,predictions)
Out[86]:
array([[ 6991,  8430],
       [  355, 63268]], dtype=int64)

Sample Prediction

In [87]:
# Create sample
import random
random.seed(101)
random_ind = random.randint(0,len(df))

new_customer = df.drop('loan_repaid', axis=1).iloc[random_ind]
new_customer
Out[87]:
loan_amnt      25000.00
term              60.00
int_rate          18.24
installment      638.11
annual_inc     61665.00
                 ...   
30723              1.00
48052              0.00
70466              0.00
86630              0.00
93700              0.00
Name: 305323, Length: 78, dtype: float64
In [88]:
# Predict Sample
model.predict_classes(new_customer.values.reshape(1,78))
Out[88]:
array([[1]])
In [89]:
# True value
df.iloc[random_ind]['loan_repaid']
Out[89]:
1.0

Save Model

In [90]:
# Import library
from tensorflow.keras.models import load_model

# Save the model
model.save('full_data_project_model.h5')  

Conclusion

The model achieved 89% accuracy, but there was an issue: the dataset was imbalanced. There were far more 'Fully Paid' loans than 'Charged Off' ones, so simply guessing 'Fully Paid' for every loan would already yield about 80% accuracy. A possible solution is to re-sample the training dataset (under-sampling or over-sampling) and then apply k-fold cross-validation, as sketched below.
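
As a sketch of that idea (assuming the third-party imbalanced-learn package is available), the training set could be under-sampled to a balanced class distribution and the model evaluated with stratified k-fold cross-validation:

# Sketch: under-sample the majority class, then evaluate with stratified
# k-fold cross-validation (assumes imbalanced-learn is installed)
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import StratifiedKFold

rus = RandomUnderSampler(random_state=42)
X_bal, y_bal = rus.fit_resample(X_train, y_train)  # balanced 50/50 classes

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(X_bal, y_bal):
    # In practice, a fresh model should be built for each fold
    model.fit(X_bal[train_idx], y_bal[train_idx],
              epochs=25, batch_size=256,
              validation_data=(X_bal[val_idx], y_bal[val_idx]))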