LendingClub is a US peer-to-peer lending company, headquartered in San Francisco, California. It was the first peer-to-peer lender to register its offerings as securities with the Securities and Exchange Commission (SEC), and to offer loan trading on a secondary market. LendingClub is the world's largest peer-to-peer lending platform.
Problem: Given historical LendingClub loan data, build a model that predicts whether a borrower will fully repay a loan or be charged off.
Dataset: A subset of publicly released LendingClub loan data (lending_club_loan.csv), containing borrower and loan features along with the final loan_status of each loan.
Source: https://www.lendingclub.com/
An artificial neural network is an interconnected group of nodes, inspired by a simplification of neurons in a brain. Here, each circular node represents an artificial neuron and an arrow represents a connection from the output of one artificial neuron to the input of another.
ANNs are composed of artificial neurons, which are conceptually derived from biological neurons. Each artificial neuron has inputs and produces a single output that can be sent to multiple other neurons. The inputs can be the feature values of a sample of external data, such as images or documents, or they can be the outputs of other neurons. The outputs of the final output neurons of the network accomplish the task, such as recognizing an object in an image.
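As a minimal sketch of this idea (illustrative only, not part of the LendingClub analysis; the input values below are made up), a single artificial neuron can be written as a weighted sum of its inputs plus a bias, passed through an activation function:
# A single artificial neuron: weighted sum of inputs plus bias, passed through a sigmoid activation
import numpy as np
def artificial_neuron(inputs, weights, bias):
    z = np.dot(inputs, weights) + bias   # weighted sum of inputs plus bias
    return 1 / (1 + np.exp(-z))          # sigmoid squashes the result into (0, 1)
# Hypothetical example: three input features feeding one neuron
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.1, -0.6])
b = 0.2
print(artificial_neuron(x, w, b))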
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Import data
df = pd.read_csv('lending_club_loan.csv')
# Check dataframe
df.head()
# Check data statistics
df.describe()
# Check more info
df.info()
# Visualize data
sns.countplot(x = 'loan_status' , data = df)
plt.show()
# Visualize data
plt.figure(figsize = (12, 4))
sns.histplot(df['loan_amnt'], bins = 40)
plt.show()
# Check data correlation
df.corr()
# Visualize data correlation
plt.figure(figsize = (12, 7))
sns.heatmap(df.corr(), annot = True, cmap = 'viridis')
plt.show()
# Visualize data
sns.scatterplot(x = 'installment', y = 'loan_amnt', data = df)
plt.show()
# Visualize data
sns.boxplot(x = 'loan_status', y = 'loan_amnt', data = df)
plt.show()
# Check summary statistics
df.groupby('loan_status')['loan_amnt'].describe()
# Check for unique values
df['grade'].unique()
# Check for unique values
df['sub_grade'].unique()
# Visualize data
sns.countplot(x = 'grade', data = df, hue = 'loan_status')
plt.show()
# Visualize data
plt.figure(figsize = (12, 4))
subgrade_order = sorted(df['sub_grade'].unique())
sns.countplot(x = 'sub_grade', data = df, order = subgrade_order, palette = 'coolwarm')
plt.show()
# Visualize data
plt.figure(figsize = (12, 4))
subgrade_order = sorted(df['sub_grade'].unique())
sns.countplot(x = 'sub_grade', data = df, order = subgrade_order, palette = 'coolwarm', hue = 'loan_status')
plt.show()
# Visualize data
f_and_g = df[(df['grade'] == 'G') | (df['grade'] == 'F')]
plt.figure(figsize = (12, 4))
subgrade_order = sorted(f_and_g['sub_grade'].unique())
sns.countplot(x = 'sub_grade', data = f_and_g, order = subgrade_order, palette = 'coolwarm', hue = 'loan_status')
plt.show()
# Create new column
df['loan_repaid'] = df['loan_status'].map({'Fully Paid': 1, 'Charged Off': 0})
# Check dataframe
df[['loan_repaid', 'loan_status']]
# Visualize correlation impact
plt.figure(figsize = (8, 4))
df.corr()['loan_repaid'].sort_values().drop('loan_repaid').plot(kind = 'bar')
plt.show()
# Check dataframe length
len(df)
# Check for missing values
df.isnull().mean()
# Check the percentage of missing data
(df.isnull().sum() / len(df)) * 100
# Check data unique counts
df['emp_title'].nunique()
# Check value counts
df['emp_title'].value_counts()
# Drop column (too many unique job titles to encode as a useful feature)
df = df.drop('emp_title', axis = 1)
# Check and sort the unique column values
sorted(df['emp_length'].dropna().unique())
# Rearrange column data
emp_length_order = [
'< 1 year',
'1 year',
'2 years',
'3 years',
'4 years',
'5 years',
'6 years',
'7 years',
'8 years',
'9 years',
'10+ years']
# Visualize data
plt.figure(figsize = (12, 3))
sns.countplot(x = 'emp_length', data = df, order = emp_length_order, hue = 'loan_status')
plt.show()
# Create filtered dataframe
emp_co = df[df['loan_status'] == 'Charged Off'].groupby('emp_length').count()['loan_status']
# Check dataframe
emp_co
# Create filtered dataframe
emp_fp = df[df['loan_status'] == 'Fully Paid'].groupby('emp_length').count()['loan_status']
# Check dataframe
emp_fp
# Check ratio
emp_co / emp_fp
# Compute the charge-off rate per employment length
emp_len = emp_co / (emp_co + emp_fp)
# Visualize data
emp_len.plot(kind = 'bar')
plt.show()
# Drop column
df = df.drop('emp_length', axis = 1)
# Check data column
df['purpose'].head(10)
# Check data column
df['title'].head(10)
# Drop column
df = df.drop('title', axis = 1)
# Check data column value counts
df['mort_acc'].value_counts()
# Check specified correlation
df.corr()['mort_acc'].sort_values()
# Compare the mean of each numeric column grouped by total_acc
df.groupby('total_acc').mean()
# Create a Series of the average mort_acc for each total_acc value
total_acc_avg = df.groupby('total_acc').mean()['mort_acc']
# Check dataframe
total_acc_avg
# Create function to fill missing mort_acc values with the average for that total_acc
def fill_mort_acc(total_acc, mort_acc):
    if np.isnan(mort_acc):
        return total_acc_avg[total_acc]
    else:
        return mort_acc
# Apply the function row-wise across the total_acc and mort_acc columns
df['mort_acc'] = df.apply(lambda x: fill_mort_acc(x['total_acc'], x['mort_acc']), axis = 1)
# Check for missing values
df.isnull().mean()
# Drop rows with remaining missing values
df = df.dropna()
# Check for missing values
df.isnull().mean()
# Check string columns
df.select_dtypes(['object']).columns
# Check value counts
df['term'].value_counts()
# Extract the numeric term length (e.g. ' 36 months' -> 36)
df['term'] = df['term'].apply(lambda term: int(term[:3]))
# Check data column
df['term'].value_counts()
# Drop column
df = df.drop('grade', axis = 1)
# Get dummies
dummies = pd.get_dummies(df['sub_grade'], drop_first = True)
# Remove the original column and concat the dummies
df = pd.concat([df.drop('sub_grade', axis = 1), dummies], axis = 1)
# Check dataframe columns
df.columns
# Get dummies
dummies = pd.get_dummies(df[['verification_status', 'application_type', 'initial_list_status','purpose']], drop_first = True)
# Remove the original columns and concat the dummies
df = pd.concat([df.drop(['verification_status', 'application_type','initial_list_status','purpose'], axis = 1), dummies], axis = 1)
# Check value counts
df['home_ownership'].value_counts()
# Replace values
df['home_ownership'] = df['home_ownership'].replace(['NONE', 'ANY'], 'OTHER')
# Check value counts
df['home_ownership'].value_counts()
# Get dummies
dummies = pd.get_dummies(df['home_ownership'], drop_first = True)
# Remove the original column and concat the dummies
df = pd.concat([df.drop('home_ownership', axis = 1), dummies], axis = 1)
# Check dataframe
df.head()
# Check column
df['address']
# Take last five characters
df['zip_code'] = df['address'].apply(lambda address:address[-5:])
# Check data column
df['zip_code']
# Check value counts
df['zip_code'].value_counts()
# Get dummies
dummies = pd.get_dummies(df['zip_code'], drop_first = True)
# Remove the original column and concat the dummies
df = pd.concat([df.drop('zip_code', axis = 1), dummies], axis = 1)
# Drop the original column
df = df.drop('address', axis = 1)
# Drop column
df = df.drop('issue_d', axis = 1)
# Check data column
df['earliest_cr_line']
# Get the last 4 characters
df['earliest_cr_line'] = df['earliest_cr_line'].apply(lambda date: int(date[-4:]))
# Check data column
df['earliest_cr_line']
# Drop the original label column (loan_repaid now encodes it)
df = df.drop('loan_status', axis = 1)
# Set the x and y variables
X = df.drop('loan_repaid', axis = 1).values
y = df['loan_repaid'].values
# Import library
from sklearn.model_selection import train_test_split
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
# Import library
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
# Normalize the data
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Import libraries
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation,Dropout
from tensorflow.keras.constraints import max_norm
# Create model architecture
model = Sequential()
# Input layer
model.add(Dense(78, activation='relu'))
model.add(Dropout(0.2))
# Hidden layer
model.add(Dense(39, activation='relu'))
model.add(Dropout(0.2))
# Hidden layer
model.add(Dense(19, activation='relu'))
model.add(Dropout(0.2))
# Output layer
model.add(Dense(units=1,activation='sigmoid'))
# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam')
# Train the model
model.fit(x = X_train,
          y = y_train,
          epochs = 25,
          batch_size = 256,
          validation_data = (X_test, y_test))
# Check dataframe
pd.DataFrame(model.history.history)
# Store the training history in a dataframe
losses = pd.DataFrame(model.history.history)
# Plot the evaluation metrics
losses[['loss','val_loss']].plot()
plt.show()
# Import library
from sklearn.metrics import classification_report,confusion_matrix
# Get class predictions for the test set (predict_classes was removed in newer TensorFlow)
predictions = (model.predict(X_test) > 0.5).astype('int32')
print(classification_report(y_test,predictions))
# Check confusion matrix
confusion_matrix(y_test,predictions)
# Create sample
import random
random.seed(101)
random_ind = random.randint(0, len(df) - 1)
new_customer = df.drop('loan_repaid', axis=1).iloc[random_ind]
new_customer
# Predict sample (scale it the same way as the training data first)
new_customer_scaled = scaler.transform(new_customer.values.reshape(1, 78))
(model.predict(new_customer_scaled) > 0.5).astype('int32')
# True value
df.iloc[random_ind]['loan_repaid']
# Import library
from tensorflow.keras.models import load_model
# Save the model
model.save('full_data_project_model.h5')
The model obtained 89% accuracy, but there was an issue: the dataset was imbalanced. There were far more 'Fully Paid' loans than 'Charged Off' loans, so simply guessing 'Fully Paid' for every loan would already be about 80% accurate. A possible improvement is to re-sample the training dataset (under-sampling or over-sampling) and then apply k-fold cross-validation.
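A minimal sketch of that idea follows, assuming the preprocessed df from earlier (before the train/test split). It under-samples the majority class with sklearn.utils.resample and builds stratified folds with StratifiedKFold; variable names such as df_majority and df_balanced are illustrative, and an over-sampling library such as imbalanced-learn could be used instead.
# Sketch: under-sample the majority class, then set up stratified k-fold cross-validation
from sklearn.utils import resample
from sklearn.model_selection import StratifiedKFold
# Split the preprocessed dataframe by class
df_majority = df[df['loan_repaid'] == 1]
df_minority = df[df['loan_repaid'] == 0]
# Down-sample 'Fully Paid' rows to match the number of 'Charged Off' rows
df_majority_down = resample(df_majority, replace = False, n_samples = len(df_minority), random_state = 42)
df_balanced = pd.concat([df_majority_down, df_minority])
X_bal = df_balanced.drop('loan_repaid', axis = 1).values
y_bal = df_balanced['loan_repaid'].values
# 5-fold stratified cross-validation over the balanced data
skf = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 42)
for train_idx, val_idx in skf.split(X_bal, y_bal):
    X_tr, X_val = X_bal[train_idx], X_bal[val_idx]
    y_tr, y_val = y_bal[train_idx], y_bal[val_idx]
    # Scale, build, and fit the same Keras model on each fold here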