Case Study: Loan Payback Prediction

Introduction

LendingClub is a US peer-to-peer lending company, headquartered in San Francisco, California. It was the first peer-to-peer lender to register its offerings as securities with the Securities and Exchange Commission (SEC), and to offer loan trading on a secondary market. LendingClub is the world's largest peer-to-peer lending platform.

Problem:

  • Given historical data on loans that were issued, with information on whether or not the borrower defaulted (charged off), build a model that can predict whether or not a borrower will pay back their loan. This way, in the future, when a new potential customer applies, the company can assess whether or not they are likely to pay back the loan.

Dataset:

  • Historical data of LendingClub

Source: https://www.lendingclub.com/

Review

An artificial neural network (ANN) is an interconnected group of nodes, inspired by a simplification of neurons in a brain. In the usual diagram, each circular node represents an artificial neuron and an arrow represents a connection from the output of one artificial neuron to the input of another.

ANNs are composed of artificial neurons which are conceptually derived from biological neurons. Each artificial neuron has inputs and produces a single output which can be sent to multiple other neurons. The inputs can be the feature values of a sample of external data, such as images or documents, or they can be the outputs of other neurons. The outputs of the final output neurons of the neural net accomplish the task, such as recognizing an object in an image.
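
As a minimal illustration of the idea (a sketch, not part of the original notebook), a single artificial neuron computes a weighted sum of its inputs plus a bias and passes the result through an activation function such as the sigmoid:

# A single artificial neuron: weighted sum of inputs plus a bias,
# passed through a sigmoid activation (illustrative sketch only)
import numpy as np

def neuron(inputs, weights, bias):
    z = np.dot(inputs, weights) + bias  # weighted sum
    return 1 / (1 + np.exp(-z))         # sigmoid activation

# Example: three inputs feeding a single neuron
neuron(np.array([0.5, 0.1, 0.9]), np.array([0.4, -0.2, 0.7]), bias=0.1)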

Libraries And Data Importation

In [3]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [4]:
# Import data
df = pd.read_csv('lending_club_loan.csv')

Data Exploration

In [3]:
# Check dataframe
df.head()
Out[3]:
loan_amnt term int_rate installment grade sub_grade emp_title emp_length home_ownership annual_inc ... open_acc pub_rec revol_bal revol_util total_acc initial_list_status application_type mort_acc pub_rec_bankruptcies address
0 10000.0 36 months 11.44 329.48 B B4 Marketing 10+ years RENT 117000.0 ... 16.0 0.0 36369.0 41.8 25.0 w INDIVIDUAL 0.0 0.0 0174 Michelle Gateway\r\nMendozaberg, OK 22690
1 8000.0 36 months 11.99 265.68 B B5 Credit analyst 4 years MORTGAGE 65000.0 ... 17.0 0.0 20131.0 53.3 27.0 f INDIVIDUAL 3.0 0.0 1076 Carney Fort Apt. 347\r\nLoganmouth, SD 05113
2 15600.0 36 months 10.49 506.97 B B3 Statistician < 1 year RENT 43057.0 ... 13.0 0.0 11987.0 92.2 26.0 f INDIVIDUAL 0.0 0.0 87025 Mark Dale Apt. 269\r\nNew Sabrina, WV 05113
3 7200.0 36 months 6.49 220.65 A A2 Client Advocate 6 years RENT 54000.0 ... 6.0 0.0 5472.0 21.5 13.0 f INDIVIDUAL 0.0 0.0 823 Reid Ford\r\nDelacruzside, MA 00813
4 24375.0 60 months 17.27 609.33 C C5 Destiny Management Inc. 9 years MORTGAGE 55000.0 ... 13.0 0.0 24584.0 69.8 43.0 f INDIVIDUAL 1.0 0.0 679 Luna Roads\r\nGreggshire, VA 11650

5 rows × 27 columns

In [4]:
# Check data statistics
df.describe()
Out[4]:
loan_amnt int_rate installment annual_inc dti open_acc pub_rec revol_bal revol_util total_acc mort_acc pub_rec_bankruptcies
count 396030.000000 396030.000000 396030.000000 3.960300e+05 396030.000000 396030.000000 396030.000000 3.960300e+05 395754.000000 396030.000000 358235.000000 395495.000000
mean 14113.888089 13.639400 431.849698 7.420318e+04 17.379514 11.311153 0.178191 1.584454e+04 53.791749 25.414744 1.813991 0.121648
std 8357.441341 4.472157 250.727790 6.163762e+04 18.019092 5.137649 0.530671 2.059184e+04 24.452193 11.886991 2.147930 0.356174
min 500.000000 5.320000 16.080000 0.000000e+00 0.000000 0.000000 0.000000 0.000000e+00 0.000000 2.000000 0.000000 0.000000
25% 8000.000000 10.490000 250.330000 4.500000e+04 11.280000 8.000000 0.000000 6.025000e+03 35.800000 17.000000 0.000000 0.000000
50% 12000.000000 13.330000 375.430000 6.400000e+04 16.910000 10.000000 0.000000 1.118100e+04 54.800000 24.000000 1.000000 0.000000
75% 20000.000000 16.490000 567.300000 9.000000e+04 22.980000 14.000000 0.000000 1.962000e+04 72.900000 32.000000 3.000000 0.000000
max 40000.000000 30.990000 1533.810000 8.706582e+06 9999.000000 90.000000 86.000000 1.743266e+06 892.300000 151.000000 34.000000 8.000000
In [5]:
# Check more info
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 396030 entries, 0 to 396029
Data columns (total 27 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   loan_amnt             396030 non-null  float64
 1   term                  396030 non-null  object 
 2   int_rate              396030 non-null  float64
 3   installment           396030 non-null  float64
 4   grade                 396030 non-null  object 
 5   sub_grade             396030 non-null  object 
 6   emp_title             373103 non-null  object 
 7   emp_length            377729 non-null  object 
 8   home_ownership        396030 non-null  object 
 9   annual_inc            396030 non-null  float64
 10  verification_status   396030 non-null  object 
 11  issue_d               396030 non-null  object 
 12  loan_status           396030 non-null  object 
 13  purpose               396030 non-null  object 
 14  title                 394275 non-null  object 
 15  dti                   396030 non-null  float64
 16  earliest_cr_line      396030 non-null  object 
 17  open_acc              396030 non-null  float64
 18  pub_rec               396030 non-null  float64
 19  revol_bal             396030 non-null  float64
 20  revol_util            395754 non-null  float64
 21  total_acc             396030 non-null  float64
 22  initial_list_status   396030 non-null  object 
 23  application_type      396030 non-null  object 
 24  mort_acc              358235 non-null  float64
 25  pub_rec_bankruptcies  395495 non-null  float64
 26  address               396030 non-null  object 
dtypes: float64(12), object(15)
memory usage: 81.6+ MB

Data Visualization

In [6]:
# Visualize data
sns.countplot(x = 'loan_status', data = df)
plt.show()
In [7]:
# Visualize data
plt.figure(figsize = (12,4))
sns.distplot(df['loan_amnt'], kde = False, bins = 40)
plt.show()
In [8]:
# Check data correlation
df.corr()
Out[8]:
loan_amnt int_rate installment annual_inc dti open_acc pub_rec revol_bal revol_util total_acc mort_acc pub_rec_bankruptcies
loan_amnt 1.000000 0.168921 0.953929 0.336887 0.016636 0.198556 -0.077779 0.328320 0.099911 0.223886 0.222315 -0.106539
int_rate 0.168921 1.000000 0.162758 -0.056771 0.079038 0.011649 0.060986 -0.011280 0.293659 -0.036404 -0.082583 0.057450
installment 0.953929 0.162758 1.000000 0.330381 0.015786 0.188973 -0.067892 0.316455 0.123915 0.202430 0.193694 -0.098628
annual_inc 0.336887 -0.056771 0.330381 1.000000 -0.081685 0.136150 -0.013720 0.299773 0.027871 0.193023 0.236320 -0.050162
dti 0.016636 0.079038 0.015786 -0.081685 1.000000 0.136181 -0.017639 0.063571 0.088375 0.102128 -0.025439 -0.014558
open_acc 0.198556 0.011649 0.188973 0.136150 0.136181 1.000000 -0.018392 0.221192 -0.131420 0.680728 0.109205 -0.027732
pub_rec -0.077779 0.060986 -0.067892 -0.013720 -0.017639 -0.018392 1.000000 -0.101664 -0.075910 0.019723 0.011552 0.699408
revol_bal 0.328320 -0.011280 0.316455 0.299773 0.063571 0.221192 -0.101664 1.000000 0.226346 0.191616 0.194925 -0.124532
revol_util 0.099911 0.293659 0.123915 0.027871 0.088375 -0.131420 -0.075910 0.226346 1.000000 -0.104273 0.007514 -0.086751
total_acc 0.223886 -0.036404 0.202430 0.193023 0.102128 0.680728 0.019723 0.191616 -0.104273 1.000000 0.381072 0.042035
mort_acc 0.222315 -0.082583 0.193694 0.236320 -0.025439 0.109205 0.011552 0.194925 0.007514 0.381072 1.000000 0.027239
pub_rec_bankruptcies -0.106539 0.057450 -0.098628 -0.050162 -0.014558 -0.027732 0.699408 -0.124532 -0.086751 0.042035 0.027239 1.000000
In [9]:
# Visualize data correlation
plt.figure(figsize = (12, 7))
sns.heatmap(df.corr(), annot = True, cmap = 'viridis')
plt.show()
In [10]:
# Visualize data
sns.scatterplot(x = 'installment', y = 'loan_amnt', data = df)
plt.show()
In [11]:
# Visualize data
sns.boxplot(x = 'loan_status', y = 'loan_amnt', data = df)
plt.show()
In [12]:
# Check summary statistics
df.groupby('loan_status')['loan_amnt'].describe()
Out[12]:
count mean std min 25% 50% 75% max
loan_status
Charged Off 77673.0 15126.300967 8505.090557 1000.0 8525.0 14000.0 20000.0 40000.0
Fully Paid 318357.0 13866.878771 8302.319699 500.0 7500.0 12000.0 19225.0 40000.0
In [13]:
# Check for unique values
df['grade'].unique()
Out[13]:
array(['B', 'A', 'C', 'E', 'D', 'F', 'G'], dtype=object)
In [14]:
# Check for unique values
df['sub_grade'].unique()
Out[14]:
array(['B4', 'B5', 'B3', 'A2', 'C5', 'C3', 'A1', 'B2', 'C1', 'A5', 'E4',
       'A4', 'A3', 'D1', 'C2', 'B1', 'D3', 'D5', 'D2', 'E1', 'E2', 'E5',
       'F4', 'E3', 'D4', 'G1', 'F5', 'G2', 'C4', 'F1', 'F3', 'G5', 'G4',
       'F2', 'G3'], dtype=object)
In [15]:
# Visualize data
sns.countplot(x = 'grade', data = df, hue = 'loan_status')
plt.show()
In [16]:
# Visualize data
plt.figure(figsize = (12, 4))
subgrade_order = sorted(df['sub_grade'].unique())
sns.countplot(x = 'sub_grade', data = df, order = subgrade_order, palette = 'coolwarm')
plt.show()
In [17]:
# Visualize data
plt.figure(figsize = (12, 4))
subgrade_order = sorted(df['sub_grade'].unique())
sns.countplot(x = 'sub_grade', data = df, order = subgrade_order, palette = 'coolwarm', hue = 'loan_status')
plt.show()
In [18]:
# Visualize data
f_and_g = df[(df['grade'] == 'G') | (df['grade'] == 'F')]

plt.figure(figsize = (12, 4))
subgrade_order = sorted(f_and_g['sub_grade'].unique())
sns.countplot(x = 'sub_grade', data = f_and_g, order = subgrade_order, palette = 'coolwarm', hue = 'loan_status')
plt.show()
In [19]:
# Create new column
df['loan_repaid'] = df['loan_status'].map({'Fully Paid': 1, 'Charged Off': 0})
In [20]:
# Check dataframe
df[['loan_repaid', 'loan_status']]
Out[20]:
loan_repaid loan_status
0 1 Fully Paid
1 1 Fully Paid
2 1 Fully Paid
3 1 Fully Paid
4 0 Charged Off
... ... ...
396025 1 Fully Paid
396026 1 Fully Paid
396027 1 Fully Paid
396028 1 Fully Paid
396029 1 Fully Paid

396030 rows × 2 columns

In [21]:
# Visualize correlation impact
plt.figure(figsize = (8, 4))
df.corr()['loan_repaid'].sort_values().drop('loan_repaid').plot(kind = 'bar')
plt.show()

Data Preprocessing

Feature Engineering

In [22]:
# Check dataframe length
len(df)
Out[22]:
396030
In [23]:
# Check for missing values 
df.isnull().mean()
Out[23]:
loan_amnt               0.000000
term                    0.000000
int_rate                0.000000
installment             0.000000
grade                   0.000000
sub_grade               0.000000
emp_title               0.057892
emp_length              0.046211
home_ownership          0.000000
annual_inc              0.000000
verification_status     0.000000
issue_d                 0.000000
loan_status             0.000000
purpose                 0.000000
title                   0.004431
dti                     0.000000
earliest_cr_line        0.000000
open_acc                0.000000
pub_rec                 0.000000
revol_bal               0.000000
revol_util              0.000697
total_acc               0.000000
initial_list_status     0.000000
application_type        0.000000
mort_acc                0.095435
pub_rec_bankruptcies    0.001351
address                 0.000000
loan_repaid             0.000000
dtype: float64
In [24]:
# Check the percentage of missing data
(df.isnull().sum() / len(df)) * 100 
Out[24]:
loan_amnt               0.000000
term                    0.000000
int_rate                0.000000
installment             0.000000
grade                   0.000000
sub_grade               0.000000
emp_title               5.789208
emp_length              4.621115
home_ownership          0.000000
annual_inc              0.000000
verification_status     0.000000
issue_d                 0.000000
loan_status             0.000000
purpose                 0.000000
title                   0.443148
dti                     0.000000
earliest_cr_line        0.000000
open_acc                0.000000
pub_rec                 0.000000
revol_bal               0.000000
revol_util              0.069692
total_acc               0.000000
initial_list_status     0.000000
application_type        0.000000
mort_acc                9.543469
pub_rec_bankruptcies    0.135091
address                 0.000000
loan_repaid             0.000000
dtype: float64
In [25]:
# Check data unique counts
df['emp_title'].nunique()
Out[25]:
173105
In [26]:
# Check value counts
df['emp_title'].value_counts()
Out[26]:
Teacher                              4389
Manager                              4250
Registered Nurse                     1856
RN                                   1846
Supervisor                           1830
                                     ... 
Kforce-Lockheed Martin                  1
Analytical technician                   1
Florida Department of Agriculture       1
Solid surfe fabricator                  1
Cheyenne Mountain Zoo                   1
Name: emp_title, Length: 173105, dtype: int64
In [27]:
# Drop column
df = df.drop('emp_title', axis = 1)
In [28]:
# Check and sort data column
sorted(df['emp_length'].dropna().unique())
Out[28]:
['1 year',
 '10+ years',
 '2 years',
 '3 years',
 '4 years',
 '5 years',
 '6 years',
 '7 years',
 '8 years',
 '9 years',
 '< 1 year']
In [29]:
# Rearrange the categories in chronological order
emp_length_order = [
 '< 1 year',
 '1 year',
 '2 years',
 '3 years',
 '4 years',
 '5 years',
 '6 years',
 '7 years',
 '8 years',
 '9 years',
 '10+ years']
In [30]:
# Visualize data
plt.figure(figsize = (12, 3))
sns.countplot(x = 'emp_length', data = df, order = emp_length_order, hue = 'loan_status')
plt.show()
In [31]:
# Create filtered dataframe
emp_co = df[df['loan_status'] == 'Charged Off'].groupby('emp_length').count()['loan_status']

# Check dataframe
emp_co
Out[31]:
emp_length
1 year        5154
10+ years    23215
2 years       6924
3 years       6182
4 years       4608
5 years       5092
6 years       3943
7 years       4055
8 years       3829
9 years       3070
< 1 year      6563
Name: loan_status, dtype: int64
In [32]:
# Create filtered dataframe
emp_fp = df[df['loan_status'] == 'Fully Paid'].groupby('emp_length').count()['loan_status']

# Check dataframe
emp_fp
Out[32]:
emp_length
1 year        20728
10+ years    102826
2 years       28903
3 years       25483
4 years       19344
5 years       21403
6 years       16898
7 years       16764
8 years       15339
9 years       12244
< 1 year      25162
Name: loan_status, dtype: int64
In [33]:
# Check ratio
emp_co / emp_fp
Out[33]:
emp_length
1 year       0.248649
10+ years    0.225770
2 years      0.239560
3 years      0.242593
4 years      0.238213
5 years      0.237911
6 years      0.233341
7 years      0.241887
8 years      0.249625
9 years      0.250735
< 1 year     0.260830
Name: loan_status, dtype: float64
In [34]:
# Compute the fraction of charged-off loans per employment length
emp_len = emp_co / (emp_co + emp_fp)
In [35]:
# Visualize data
emp_len.plot(kind = 'bar')
plt.show()
In [36]:
# Drop column
df = df.drop('emp_length', axis = 1)
In [37]:
# Check data column
df['purpose'].head(10)
Out[37]:
0              vacation
1    debt_consolidation
2           credit_card
3           credit_card
4           credit_card
5    debt_consolidation
6      home_improvement
7           credit_card
8    debt_consolidation
9    debt_consolidation
Name: purpose, dtype: object
In [38]:
# Check data column
df['title'].head(10)
Out[38]:
0                   Vacation
1         Debt consolidation
2    Credit card refinancing
3    Credit card refinancing
4      Credit Card Refinance
5         Debt consolidation
6           Home improvement
7       No More Credit Cards
8         Debt consolidation
9         Debt Consolidation
Name: title, dtype: object
In [39]:
# Drop column
df = df.drop('title', axis = 1)
In [40]:
# Check data column value counts
df['mort_acc'].value_counts()
Out[40]:
0.0     139777
1.0      60416
2.0      49948
3.0      38049
4.0      27887
5.0      18194
6.0      11069
7.0       6052
8.0       3121
9.0       1656
10.0       865
11.0       479
12.0       264
13.0       146
14.0       107
15.0        61
16.0        37
17.0        22
18.0        18
19.0        15
20.0        13
24.0        10
22.0         7
21.0         4
25.0         4
27.0         3
23.0         2
32.0         2
26.0         2
31.0         2
30.0         1
28.0         1
34.0         1
Name: mort_acc, dtype: int64
In [41]:
# Check specified correlation
df.corr()['mort_acc'].sort_values()
Out[41]:
int_rate               -0.082583
dti                    -0.025439
revol_util              0.007514
pub_rec                 0.011552
pub_rec_bankruptcies    0.027239
loan_repaid             0.073111
open_acc                0.109205
installment             0.193694
revol_bal               0.194925
loan_amnt               0.222315
annual_inc              0.236320
total_acc               0.381072
mort_acc                1.000000
Name: mort_acc, dtype: float64
In [42]:
# Compare the mean of each numeric column, grouped by total_acc
df.groupby('total_acc').mean()
Out[42]:
loan_amnt int_rate installment annual_inc dti open_acc pub_rec revol_bal revol_util mort_acc pub_rec_bankruptcies loan_repaid
total_acc
2.0 6672.222222 15.801111 210.881667 64277.777778 2.279444 1.611111 0.000000 2860.166667 53.527778 0.000000 0.000000 0.777778
3.0 6042.966361 15.615566 198.728318 41270.753884 6.502813 2.611621 0.045872 3382.807339 49.991022 0.052023 0.015480 0.779817
4.0 7587.399031 15.069491 250.050194 42426.565969 8.411963 3.324717 0.041195 4874.231826 58.477400 0.066743 0.022951 0.785945
5.0 7845.734714 14.917564 256.190325 44394.098003 10.118328 3.921598 0.071499 5475.253452 56.890311 0.103289 0.041171 0.796844
6.0 8529.019843 14.651752 278.518228 48470.001156 11.222542 4.511119 0.104003 6546.374957 57.812483 0.151293 0.055077 0.784126
... ... ... ... ... ... ... ... ... ... ... ... ...
124.0 23200.000000 17.860000 587.370000 66000.000000 14.040000 43.000000 0.000000 25497.000000 75.400000 1.000000 0.000000 0.000000
129.0 25000.000000 7.890000 505.600000 200000.000000 8.900000 48.000000 0.000000 27659.000000 8.300000 1.000000 0.000000 1.000000
135.0 24000.000000 15.410000 576.140000 82000.000000 33.850000 57.000000 0.000000 35715.000000 50.800000 3.000000 0.000000 1.000000
150.0 35000.000000 8.670000 1107.630000 189000.000000 6.630000 40.000000 0.000000 39065.000000 44.400000 2.000000 0.000000 1.000000
151.0 35000.000000 13.990000 1196.050000 160000.000000 12.650000 26.000000 0.000000 46643.000000 71.500000 0.000000 0.000000 0.000000

118 rows × 12 columns

In [43]:
# Create a Series of the mean mort_acc for each total_acc value
total_acc_avg = df.groupby('total_acc').mean()['mort_acc']

# Check dataframe
total_acc_avg
Out[43]:
total_acc
2.0      0.000000
3.0      0.052023
4.0      0.066743
5.0      0.103289
6.0      0.151293
           ...   
124.0    1.000000
129.0    1.000000
135.0    3.000000
150.0    2.000000
151.0    0.000000
Name: mort_acc, Length: 118, dtype: float64
In [44]:
# Fill a missing mort_acc value with the average mort_acc
# of loans that have the same total_acc
def fill_mort_acc(total_acc, mort_acc):
    if np.isnan(mort_acc):
        return total_acc_avg[total_acc]
    else:
        return mort_acc
In [45]:
df['mort_acc'] = df.apply(lambda x: fill_mort_acc(x['total_acc'], x['mort_acc']), axis = 1) # Apply the function row-wise, passing both columns of each row
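
Note: the same imputation can also be written without the row-wise apply. A vectorized alternative (an equivalent sketch, not what was run above) maps each total_acc to its group average and uses it only where mort_acc is missing:

# Vectorized alternative (sketch): fill missing mort_acc values with the
# group average looked up from total_acc_avg
df['mort_acc'] = df['mort_acc'].fillna(df['total_acc'].map(total_acc_avg))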
In [46]:
# Check for missing values
df.isnull().mean()
Out[46]:
loan_amnt               0.000000
term                    0.000000
int_rate                0.000000
installment             0.000000
grade                   0.000000
sub_grade               0.000000
home_ownership          0.000000
annual_inc              0.000000
verification_status     0.000000
issue_d                 0.000000
loan_status             0.000000
purpose                 0.000000
dti                     0.000000
earliest_cr_line        0.000000
open_acc                0.000000
pub_rec                 0.000000
revol_bal               0.000000
revol_util              0.000697
total_acc               0.000000
initial_list_status     0.000000
application_type        0.000000
mort_acc                0.000000
pub_rec_bankruptcies    0.001351
address                 0.000000
loan_repaid             0.000000
dtype: float64
In [47]:
# Drop rows with remaining missing values
df = df.dropna()
In [48]:
# Check for missing values
df.isnull().mean()
Out[48]:
loan_amnt               0.0
term                    0.0
int_rate                0.0
installment             0.0
grade                   0.0
sub_grade               0.0
home_ownership          0.0
annual_inc              0.0
verification_status     0.0
issue_d                 0.0
loan_status             0.0
purpose                 0.0
dti                     0.0
earliest_cr_line        0.0
open_acc                0.0
pub_rec                 0.0
revol_bal               0.0
revol_util              0.0
total_acc               0.0
initial_list_status     0.0
application_type        0.0
mort_acc                0.0
pub_rec_bankruptcies    0.0
address                 0.0
loan_repaid             0.0
dtype: float64

Categorical Encoding

In [49]:
# Check string columns
df.select_dtypes(['object']).columns
Out[49]:
Index(['term', 'grade', 'sub_grade', 'home_ownership', 'verification_status',
       'issue_d', 'loan_status', 'purpose', 'earliest_cr_line',
       'initial_list_status', 'application_type', 'address'],
      dtype='object')
In [50]:
# Check value counts
df['term'].value_counts()
Out[50]:
 36 months    301247
 60 months     93972
Name: term, dtype: int64
In [51]:
# Extract the number of months from the term string
df['term'] = df['term'].apply(lambda term: int(term[:3]))
In [52]:
# Check data column
df['term'].value_counts()
Out[52]:
36    301247
60     93972
Name: term, dtype: int64
In [53]:
# Drop column
df = df.drop('grade', axis = 1)
In [54]:
# Get dummies
dummies = pd.get_dummies(df['sub_grade'], drop_first = True)

# Remove the original column and concat the dummies
df = pd.concat([df.drop('sub_grade', axis = 1), dummies], axis = 1)
In [55]:
# Check data columns
df.columns
Out[55]:
Index(['loan_amnt', 'term', 'int_rate', 'installment', 'home_ownership',
       'annual_inc', 'verification_status', 'issue_d', 'loan_status',
       'purpose', 'dti', 'earliest_cr_line', 'open_acc', 'pub_rec',
       'revol_bal', 'revol_util', 'total_acc', 'initial_list_status',
       'application_type', 'mort_acc', 'pub_rec_bankruptcies', 'address',
       'loan_repaid', 'A2', 'A3', 'A4', 'A5', 'B1', 'B2', 'B3', 'B4', 'B5',
       'C1', 'C2', 'C3', 'C4', 'C5', 'D1', 'D2', 'D3', 'D4', 'D5', 'E1', 'E2',
       'E3', 'E4', 'E5', 'F1', 'F2', 'F3', 'F4', 'F5', 'G1', 'G2', 'G3', 'G4',
       'G5'],
      dtype='object')
In [56]:
# Get dummies
dummies = pd.get_dummies(df[['verification_status', 'application_type', 'initial_list_status','purpose']], drop_first = True)

# Remove the original columns and concat the dummies
df = pd.concat([df.drop(['verification_status', 'application_type','initial_list_status','purpose'], axis = 1), dummies], axis = 1)
In [57]:
# Check value counts
df['home_ownership'].value_counts()
Out[57]:
MORTGAGE    198022
RENT        159395
OWN          37660
OTHER          110
NONE            29
ANY              3
Name: home_ownership, dtype: int64
In [58]:
# Replace values
df['home_ownership'] = df['home_ownership'].replace(['NONE', 'ANY'], 'OTHER')
In [59]:
# Check value counts
df['home_ownership'].value_counts()
Out[59]:
MORTGAGE    198022
RENT        159395
OWN          37660
OTHER          142
Name: home_ownership, dtype: int64
In [60]:
# Get dummies
dummies = pd.get_dummies(df['home_ownership'], drop_first = True)

# Remove the original column and concat the dummies
df = pd.concat([df.drop('home_ownership', axis = 1), dummies], axis = 1)
In [61]:
# Check dataframe
df.head()
Out[61]:
loan_amnt term int_rate installment annual_inc issue_d loan_status dti earliest_cr_line open_acc ... purpose_medical purpose_moving purpose_other purpose_renewable_energy purpose_small_business purpose_vacation purpose_wedding OTHER OWN RENT
0 10000.0 36 11.44 329.48 117000.0 Jan-2015 Fully Paid 26.24 Jun-1990 16.0 ... 0 0 0 0 0 1 0 0 0 1
1 8000.0 36 11.99 265.68 65000.0 Jan-2015 Fully Paid 22.05 Jul-2004 17.0 ... 0 0 0 0 0 0 0 0 0 0
2 15600.0 36 10.49 506.97 43057.0 Jan-2015 Fully Paid 12.79 Aug-2007 13.0 ... 0 0 0 0 0 0 0 0 0 1
3 7200.0 36 6.49 220.65 54000.0 Nov-2014 Fully Paid 2.60 Sep-2006 6.0 ... 0 0 0 0 0 0 0 0 0 1
4 24375.0 60 17.27 609.33 55000.0 Apr-2013 Charged Off 33.95 Mar-1999 13.0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 73 columns

In [62]:
# Check column
df['address']
Out[62]:
0            0174 Michelle Gateway\r\nMendozaberg, OK 22690
1         1076 Carney Fort Apt. 347\r\nLoganmouth, SD 05113
2         87025 Mark Dale Apt. 269\r\nNew Sabrina, WV 05113
3                   823 Reid Ford\r\nDelacruzside, MA 00813
4                    679 Luna Roads\r\nGreggshire, VA 11650
                                ...                        
396025     12951 Williams Crossing\r\nJohnnyville, DC 30723
396026    0114 Fowler Field Suite 028\r\nRachelborough, ...
396027    953 Matthew Points Suite 414\r\nReedfort, NY 7...
396028    7843 Blake Freeway Apt. 229\r\nNew Michael, FL...
396029        787 Michelle Causeway\r\nBriannaton, AR 48052
Name: address, Length: 395219, dtype: object
In [63]:
# Take last five characters
df['zip_code'] = df['address'].apply(lambda address:address[-5:])

# Check data column
df['zip_code']
Out[63]:
0         22690
1         05113
2         05113
3         00813
4         11650
          ...  
396025    30723
396026    05113
396027    70466
396028    29597
396029    48052
Name: zip_code, Length: 395219, dtype: object
In [64]:
# Check value counts
df['zip_code'].value_counts()
Out[64]:
70466    56880
22690    56413
30723    56402
48052    55811
00813    45725
29597    45393
05113    45300
11650    11210
93700    11126
86630    10959
Name: zip_code, dtype: int64
In [65]:
# Get dummies
dummies = pd.get_dummies(df['zip_code'], drop_first = True)

# Remove the original column and concat the dummies
df = pd.concat([df.drop('zip_code', axis = 1), dummies], axis = 1)

# Drop the original column
df = df.drop('address', axis = 1)
In [66]:
# Drop column
df = df.drop('issue_d', axis = 1)
In [67]:
# Check data column
df['earliest_cr_line']
Out[67]:
0         Jun-1990
1         Jul-2004
2         Aug-2007
3         Sep-2006
4         Mar-1999
            ...   
396025    Nov-2004
396026    Feb-2006
396027    Mar-1997
396028    Nov-1990
396029    Sep-1998
Name: earliest_cr_line, Length: 395219, dtype: object
In [68]:
# Extract the year (last four characters) as an integer
df['earliest_cr_line'] = df['earliest_cr_line'].apply(lambda date: int(date[-4:]))
In [69]:
# Check data column
df['earliest_cr_line']
Out[69]:
0         1990
1         2004
2         2007
3         2006
4         1999
          ... 
396025    2004
396026    2006
396027    1997
396028    1990
396029    1998
Name: earliest_cr_line, Length: 395219, dtype: int64

Data Splitting And Scaling

In [72]:
# Drop data column
df = df.drop('loan_status', axis = 1)
In [73]:
# Set the x and y variables
X = df.drop('loan_repaid', axis = 1).values
y = df['loan_repaid'].values
In [74]:
# Import library
from sklearn.model_selection import train_test_split

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
In [75]:
# Import library
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

# Normalize the data
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Model Creation

In [76]:
# Import libraries
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation,Dropout
from tensorflow.keras.constraints import max_norm
In [77]:
# Create model architecture
model = Sequential()

# Input layer
model.add(Dense(78,  activation='relu'))
model.add(Dropout(0.2))

# Hidden layer
model.add(Dense(39, activation='relu'))
model.add(Dropout(0.2))

# Hidden layer
model.add(Dense(19, activation='relu'))
model.add(Dropout(0.2))

# Output layer
model.add(Dense(units=1,activation='sigmoid'))

# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam')
In [78]:
# Train the model
model.fit(x=X_train, 
          y=y_train, 
          epochs=25,
          batch_size=256,
          validation_data=(X_test, y_test), 
          )
Epoch 1/25
1236/1236 [==============================] - 8s 7ms/step - loss: 0.3001 - val_loss: 0.2628
Epoch 2/25
1236/1236 [==============================] - 8s 7ms/step - loss: 0.2658 - val_loss: 0.2619
Epoch 3/25
1236/1236 [==============================] - 9s 7ms/step - loss: 0.2633 - val_loss: 0.2603
Epoch 4/25
1236/1236 [==============================] - 8s 7ms/step - loss: 0.2625 - val_loss: 0.2596
Epoch 5/25
1236/1236 [==============================] - 10s 8ms/step - loss: 0.2617 - val_loss: 0.2597
Epoch 6/25
1236/1236 [==============================] - 10s 8ms/step - loss: 0.2609 - val_loss: 0.2599
Epoch 7/25
1236/1236 [==============================] - 9s 7ms/step - loss: 0.2605 - val_loss: 0.2592
Epoch 8/25
1236/1236 [==============================] - 10s 8ms/step - loss: 0.2600 - val_loss: 0.2590
Epoch 9/25
1236/1236 [==============================] - 9s 8ms/step - loss: 0.2598 - val_loss: 0.2594
Epoch 10/25
1236/1236 [==============================] - 9s 7ms/step - loss: 0.2596 - val_loss: 0.2586
Epoch 11/25
1236/1236 [==============================] - 9s 8ms/step - loss: 0.2596 - val_loss: 0.2589
Epoch 12/25
1236/1236 [==============================] - 10s 8ms/step - loss: 0.2589 - val_loss: 0.2589
Epoch 13/25
1236/1236 [==============================] - 10s 8ms/step - loss: 0.2589 - val_loss: 0.2586
Epoch 14/25
1236/1236 [==============================] - 11s 9ms/step - loss: 0.2586 - val_loss: 0.2587
Epoch 15/25
1236/1236 [==============================] - 11s 9ms/step - loss: 0.2584 - val_loss: 0.2584
Epoch 16/25
1236/1236 [==============================] - 11s 9ms/step - loss: 0.2580 - val_loss: 0.2586
Epoch 17/25
1236/1236 [==============================] - 12s 10ms/step - loss: 0.2579 - val_loss: 0.2593
Epoch 18/25
1236/1236 [==============================] - 11s 9ms/step - loss: 0.2577 - val_loss: 0.2588
Epoch 19/25
1236/1236 [==============================] - 12s 10ms/step - loss: 0.2575 - val_loss: 0.2586
Epoch 20/25
1236/1236 [==============================] - 17s 14ms/step - loss: 0.2572 - val_loss: 0.2587
Epoch 21/25
1236/1236 [==============================] - 15s 12ms/step - loss: 0.2575 - val_loss: 0.2583
Epoch 22/25
1236/1236 [==============================] - 13s 10ms/step - loss: 0.2569 - val_loss: 0.2582
Epoch 23/25
1236/1236 [==============================] - 13s 11ms/step - loss: 0.2569 - val_loss: 0.2583
Epoch 24/25
1236/1236 [==============================] - 13s 10ms/step - loss: 0.2568 - val_loss: 0.2584
Epoch 25/25
1236/1236 [==============================] - 15s 12ms/step - loss: 0.2566 - val_loss: 0.2586
Out[78]:
<tensorflow.python.keras.callbacks.History at 0x1d612046d00>

Model Evaluation

In [79]:
# Check dataframe
pd.DataFrame(model.history.history)
Out[79]:
loss val_loss
0 0.300101 0.262768
1 0.265768 0.261880
2 0.263329 0.260314
3 0.262459 0.259649
4 0.261652 0.259654
5 0.260856 0.259863
6 0.260470 0.259188
7 0.259991 0.259018
8 0.259822 0.259445
9 0.259569 0.258644
10 0.259577 0.258945
11 0.258858 0.258921
12 0.258878 0.258647
13 0.258598 0.258740
14 0.258404 0.258425
15 0.258042 0.258552
16 0.257948 0.259319
17 0.257684 0.258779
18 0.257482 0.258589
19 0.257214 0.258653
20 0.257455 0.258279
21 0.256869 0.258192
22 0.256862 0.258293
23 0.256762 0.258402
24 0.256617 0.258631
In [82]:
# Put the training history into a dataframe
losses = pd.DataFrame(model.history.history)

# Plot the evaluation metrics
losses[['loss','val_loss']].plot()
plt.show()
In [85]:
# Import library
from sklearn.metrics import classification_report,confusion_matrix

# Set prediction
predictions = model.predict_classes(X_test)
print(classification_report(y_test,predictions))
              precision    recall  f1-score   support

           0       0.95      0.45      0.61     15421
           1       0.88      0.99      0.94     63623

    accuracy                           0.89     79044
   macro avg       0.92      0.72      0.77     79044
weighted avg       0.90      0.89      0.87     79044

In [86]:
# Check confusion matrix
confusion_matrix(y_test,predictions)
Out[86]:
array([[ 6991,  8430],
       [  355, 63268]], dtype=int64)

Sample Prediction

In [87]:
# Create sample
import random
random.seed(101)
random_ind = random.randint(0,len(df))

new_customer = df.drop('loan_repaid', axis=1).iloc[random_ind]
new_customer
Out[87]:
loan_amnt      25000.00
term              60.00
int_rate          18.24
installment      638.11
annual_inc     61665.00
                 ...   
30723              1.00
48052              0.00
70466              0.00
86630              0.00
93700              0.00
Name: 305323, Length: 78, dtype: float64
In [88]:
# Predict Sample
model.predict_classes(new_customer.values.reshape(1,78))
Out[88]:
array([[1]])
In [89]:
# True value
df.iloc[random_ind]['loan_repaid']
Out[89]:
1.0

Save Model

In [90]:
# Import library
from tensorflow.keras.models import load_model

# Save the model
model.save('full_data_project_model.h5')  

Conclusion

The model achieved 89% accuracy, but there was an issue: the dataset was imbalanced. There were far more 'Fully Paid' loans than 'Charged Off' ones, so simply guessing 'Fully Paid' for every loan would already yield about 80% accuracy. A possible solution is to re-sample the training dataset (under-sampling or over-sampling) and then apply k-fold cross-validation, as sketched below.
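
As a sketch of that idea (assuming the third-party imbalanced-learn package is available), the training set could be under-sampled to a balanced class distribution and the model evaluated with stratified k-fold cross-validation:

# Sketch: under-sample the majority class, then evaluate with stratified
# k-fold cross-validation (assumes imbalanced-learn is installed)
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import StratifiedKFold

rus = RandomUnderSampler(random_state=42)
X_bal, y_bal = rus.fit_resample(X_train, y_train)  # balanced 50/50 classes

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(X_bal, y_bal):
    # In practice, a fresh model should be built for each fold
    model.fit(X_bal[train_idx], y_bal[train_idx],
              epochs=25, batch_size=256,
              validation_data=(X_bal[val_idx], y_bal[val_idx]))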