Case Study: Credit Card Fraud Detection

Introduction

Fraud detection, one of many applications of anomaly detection, is an important problem in finance. Can a transaction be flagged as fraudulent based on the history of past transactions? In this project, a neural network is implemented to classify each transaction as fraudulent or legitimate, and its results are compared with tree-based models and with resampling techniques for handling class imbalance.

Problem:

  • Build a model that detects fraudulent transactions

Dataset:

  • The dataset contains 28 anonymized variables (V1–V28), an “Amount” variable, a “Time” variable, and a target variable, “Class”. The variables are anonymized to protect customer privacy because the dataset is in the public domain. The dataset is available from the source listed below. A “Class” value of ‘0’ corresponds to a legitimate transaction, whereas ‘1’ corresponds to a fraudulent one.

Source: ULB-ML Group

Libraries and Data Importation

In [22]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import keras

np.random.seed(0)
In [2]:
# Import data
data = pd.read_csv('project_data/creditcard.csv')

Data Exploration

In [3]:
data.head()
Out[3]:
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 ... V21 V22 V23 V24 V25 V26 V27 V28 Amount Class
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928 0.128539 -0.189115 0.133558 -0.021053 149.62 0
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 0.085102 -0.255425 ... -0.225775 -0.638672 0.101288 -0.339846 0.167170 0.125895 -0.008983 0.014724 2.69 0
2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461 0.247676 -1.514654 ... 0.247998 0.771679 0.909412 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752 378.66 0
3 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609 0.377436 -1.387024 ... -0.108300 0.005274 -0.190321 -1.175575 0.647376 -0.221929 0.062723 0.061458 123.50 0
4 2.0 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941 -0.270533 0.817739 ... -0.009431 0.798278 -0.137458 0.141267 -0.206010 0.502292 0.219422 0.215153 69.99 0

5 rows × 31 columns

In [4]:
# Check data dimension
data.shape
Out[4]:
(284807, 31)
In [5]:
# Check data info
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     284807 non-null  float64
 22  V22     284807 non-null  float64
 23  V23     284807 non-null  float64
 24  V24     284807 non-null  float64
 25  V25     284807 non-null  float64
 26  V26     284807 non-null  float64
 27  V27     284807 non-null  float64
 28  V28     284807 non-null  float64
 29  Amount  284807 non-null  float64
 30  Class   284807 non-null  int64  
dtypes: float64(30), int64(1)
memory usage: 67.4 MB
In [6]:
# Check for missing values
data.isnull().mean()
Out[6]:
Time      0.0
V1        0.0
V2        0.0
V3        0.0
V4        0.0
V5        0.0
V6        0.0
V7        0.0
V8        0.0
V9        0.0
V10       0.0
V11       0.0
V12       0.0
V13       0.0
V14       0.0
V15       0.0
V16       0.0
V17       0.0
V18       0.0
V19       0.0
V20       0.0
V21       0.0
V22       0.0
V23       0.0
V24       0.0
V25       0.0
V26       0.0
V27       0.0
V28       0.0
Amount    0.0
Class     0.0
dtype: float64
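
The target is highly imbalanced, which shapes every modeling decision that follows. A minimal sketch for inspecting the class balance (the counts shown are inferred from the 492 fraud cases identified later in the notebook):

# Count transactions per class; fraud ('1') is a tiny fraction of the data
print(data['Class'].value_counts())
# 0    284315
# 1       492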

Data Pre-processing

In [7]:
# Implement feature scaling
from sklearn.preprocessing import StandardScaler
data['NormalizedAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))

# Drop the original Amount column
data = data.drop(['Amount'], axis = 1)

# Drop the Time column
data = data.drop(['Time'], axis = 1)
In [8]:
# Check data
data.head()
Out[8]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ... V21 V22 V23 V24 V25 V26 V27 V28 Class NormalizedAmount
0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 0.098698 0.363787 0.090794 ... -0.018307 0.277838 -0.110474 0.066928 0.128539 -0.189115 0.133558 -0.021053 0 0.244964
1 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 0.085102 -0.255425 -0.166974 ... -0.225775 -0.638672 0.101288 -0.339846 0.167170 0.125895 -0.008983 0.014724 0 -0.342475
2 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461 0.247676 -1.514654 0.207643 ... 0.247998 0.771679 0.909412 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752 0 1.160686
3 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609 0.377436 -1.387024 -0.054952 ... -0.108300 0.005274 -0.190321 -1.175575 0.647376 -0.221929 0.062723 0.061458 0 0.140534
4 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941 -0.270533 0.817739 0.753074 ... -0.009431 0.798278 -0.137458 0.141267 -0.206010 0.502292 0.219422 0.215153 0 -0.073403

5 rows × 30 columns

In [9]:
# Create label for dependent and independent variable
X = data.iloc[:, data.columns != 'Class']
y = data.iloc[:, data.columns == 'Class']
In [10]:
# Check independent variables
X.head()
Out[10]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ... V20 V21 V22 V23 V24 V25 V26 V27 V28 NormalizedAmount
0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 0.098698 0.363787 0.090794 ... 0.251412 -0.018307 0.277838 -0.110474 0.066928 0.128539 -0.189115 0.133558 -0.021053 0.244964
1 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 0.085102 -0.255425 -0.166974 ... -0.069083 -0.225775 -0.638672 0.101288 -0.339846 0.167170 0.125895 -0.008983 0.014724 -0.342475
2 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461 0.247676 -1.514654 0.207643 ... 0.524980 0.247998 0.771679 0.909412 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752 1.160686
3 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609 0.377436 -1.387024 -0.054952 ... -0.208038 -0.108300 0.005274 -0.190321 -1.175575 0.647376 -0.221929 0.062723 0.061458 0.140534
4 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941 -0.270533 0.817739 0.753074 ... 0.408542 -0.009431 0.798278 -0.137458 0.141267 -0.206010 0.502292 0.219422 0.215153 -0.073403

5 rows × 29 columns

In [11]:
# Check dependent variable
y.head()
Out[11]:
Class
0 0
1 0
2 0
3 0
4 0
In [12]:
# Split the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
In [13]:
# Check train dataset dimensions
X_train.shape, y_train.shape
Out[13]:
((199364, 29), (199364, 1))
In [14]:
# Check test dataset dimensions
X_test.shape, y_test.shape
Out[14]:
((85443, 29), (85443, 1))
In [15]:
# Transform train dataset into array
X_train = np.array(X_train)
y_train = np.array(y_train)

# Transform test dataset into array
X_test = np.array(X_test)
y_test = np.array(y_test)

Deep Neural Network Model

In [16]:
# Import libraries
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout

# Build model
model = Sequential([
    Dense(units = 16, input_dim = 29, activation = 'relu'),
    Dense(units = 24, activation = 'relu'),
    Dropout(0.5),
    Dense(units = 24, activation = 'relu'),
    Dense(units = 24, activation = 'relu'),
    Dense(1, activation = 'sigmoid')
])
In [17]:
# Check model summary
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 16)                480       
_________________________________________________________________
dense_1 (Dense)              (None, 24)                408       
_________________________________________________________________
dropout (Dropout)            (None, 24)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 24)                600       
_________________________________________________________________
dense_3 (Dense)              (None, 24)                600       
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 25        
=================================================================
Total params: 2,113
Trainable params: 2,113
Non-trainable params: 0
_________________________________________________________________
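
Each Dense layer's parameter count is inputs × units + biases: 29 × 16 + 16 = 480 for the first layer, 16 × 24 + 24 = 408 for the second, 24 × 24 + 24 = 600 for each of the next two, and 24 × 1 + 1 = 25 for the sigmoid output, giving 2,113 in total. The Dropout layer adds no parameters; it randomly zeroes half of its inputs during training to reduce overfitting.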

Model Training

In [18]:
# Compile the model
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

# Fit the model and initiate training
model.fit(X_train, y_train, batch_size = 15, epochs = 5)
Epoch 1/5
13291/13291 [==============================] - 26s 2ms/step - loss: 0.0086 - accuracy: 0.9990
Epoch 2/5
13291/13291 [==============================] - 33s 2ms/step - loss: 0.0039 - accuracy: 0.9993
Epoch 3/5
13291/13291 [==============================] - 40s 3ms/step - loss: 0.0037 - accuracy: 0.9993
Epoch 4/5
13291/13291 [==============================] - 37s 3ms/step - loss: 0.0034 - accuracy: 0.9993
Epoch 5/5
13291/13291 [==============================] - 39s 3ms/step - loss: 0.0034 - accuracy: 0.9993
Out[18]:
<tensorflow.python.keras.callbacks.History at 0x10dcd126460>

Model Evaluation

In [19]:
# Check model score in test dataset
score = model.evaluate(X_test, y_test)
print(score)
2671/2671 [==============================] - 3s 1ms/step - loss: 0.0037 - accuracy: 0.9993
[0.003683381946757436, 0.9993445873260498]
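
Accuracy alone is a weak signal here: because more than 99.8% of transactions are legitimate, a model that never flags fraud would score almost as well. A minimal sketch of per-class metrics, using scikit-learn's classification_report on the same test predictions that the next cells examine through a confusion matrix:

from sklearn.metrics import classification_report

# Per-class precision and recall on the test set; predictions are rounded to 0/1 labels
print(classification_report(y_test.ravel(), model.predict(X_test).round(), digits=4))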
In [21]:
# Import libraries
import itertools
from sklearn.metrics import confusion_matrix

# Create a plot function
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()
In [24]:
# Get the predictions
y_pred = model.predict(X_test)

# Transform into dataframe
y_test = pd.DataFrame(y_test)
In [25]:
# Check confusion matrix for test dataset
cm =  confusion_matrix(y_test, y_pred.round())
print(cm)
[[85286    10]
 [   46   101]]
In [26]:
# Visualize confusion matrix for test dataset
plot_confusion_matrix(cm, classes = [0, 1])
plt.show()
Confusion matrix, without normalization
[[85286    10]
 [   46   101]]
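
Despite roughly 99.93% accuracy, the model recovers only 101 of the 147 frauds in the test set (recall ≈ 0.69), while 101 of the 111 flagged transactions are true frauds (precision ≈ 0.91). Closing this recall gap is the motivation for the resampling experiments below.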
In [28]:
# Evaluate the neural network on the entire dataset
y_pred = model.predict(X)
y_expected = pd.DataFrame(y)

# Create confusion matrix for entire dataset
cm = confusion_matrix(y_expected, y_pred.round())

# Visualize confusion matrix for entire dataset
plot_confusion_matrix(cm, classes = [0, 1])
plt.show()
Confusion matrix, without normalization
[[284286     29]
 [   169    323]]

Random Forest Model

In [37]:
# Apply random forest
from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier(n_estimators = 10)
random_forest.fit(X_train, y_train.values.ravel())

# Get predictions
y_pred = random_forest.predict(X_test)
In [38]:
# Check test score
random_forest.score(X_test, y_test)
Out[38]:
0.9994967405170698
In [43]:
# Check confusion matrix for test dataset
cm = confusion_matrix(y_test, y_pred)

# Print confusion matrix
plot_confusion_matrix(cm, classes = [0, 1])
plt.show()
Confusion matrix, without normalization
[[85290     6]
 [   37   110]]
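
On the test set the random forest catches 110 of 147 frauds (recall ≈ 0.75) with only 6 false positives (precision ≈ 0.95), a noticeably better balance than the neural network trained on the raw, imbalanced data.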
In [46]:
# Evaluate the random forest on the entire dataset
y_pred = random_forest.predict(X)

# Confusion matrix for entire dataset
cm = confusion_matrix(y, y_pred.round())

# Print confusion matrix
plot_confusion_matrix(cm, classes = [0, 1])
plt.show()
Confusion matrix, without normalization
[[284308      7]
 [    55    437]]

Decision Tree Model

In [48]:
# Import library
from sklearn.tree import DecisionTreeClassifier
decision_tree = DecisionTreeClassifier()

# Fit the model
decision_tree.fit(X_train, y_train.values.ravel())

# Get predictions
y_pred = decision_tree.predict(X_test)
In [49]:
# Check test score
decision_tree.score(X_test, y_test)
Out[49]:
0.9991807403766253
In [50]:
# Evaluate the decision tree on the entire dataset
y_pred = decision_tree.predict(X)

# Confusion matrix for entire dataset
cm = confusion_matrix(y, y_pred.round())

# Print confusion matrix
plot_confusion_matrix(cm, classes = [0, 1])
plt.show()
Confusion matrix, without normalization
[[284279     36]
 [    34    458]]

Undersampling Application

In [69]:
# Undersampling technique
# Normal indices
normal_indices = data[data['Class'] == 0].index

# Fraud Indices
fraud_indices = np.array(data[data['Class'] == 1].index)

# Get the number of fraud data
number_records_fraud = len(fraud_indices)
print(number_records_fraud)
492
In [70]:
# Generates a random sample from a given 1-D array
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace = False)

# Create an array
random_normal_indices = np.array(random_normal_indices)

# Check the size
print(len(random_normal_indices))
492
In [71]:
# Concatenate the fraud and sampled normal indices
under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])
print(len(under_sample_indices))
984
In [74]:
# Select the undersampled data by index
under_sample_data = data.iloc[under_sample_indices, :]
under_sample_indices
Out[74]:
array([   541,    623,   4920,   6108,   6329,   6331,   6334,   6336,
         6338,   6427,   6446,   6472,   6529,   6609,   6641,   6717,
         6719,   6734,   6774,   6820,   6870,   6882,   6899,   6903,
         6971,   8296,   8312,   8335,   8615,   8617,   8842,   8845,
         8972,   9035,   9179,   9252,   9487,   9509,  10204,  10484,
        10497,  10498,  10568,  10630,  10690,  10801,  10891,  10897,
        11343,  11710,  11841,  11880,  12070,  12108,  12261,  12369,
        14104,  14170,  14197,  14211,  14338,  15166,  15204,  15225,
        15451,  15476,  15506,  15539,  15566,  15736,  15751,  15781,
        15810,  16415,  16780,  16863,  17317,  17366,  17407,  17453,
        17480,  18466,  18472,  18773,  18809,  20198,  23308,  23422,
        26802,  27362,  27627,  27738,  27749,  29687,  30100,  30314,
        30384,  30398,  30442,  30473,  30496,  31002,  33276,  39183,
        40085,  40525,  41395,  41569,  41943,  42007,  42009,  42473,
        42528,  42549,  42590,  42609,  42635,  42674,  42696,  42700,
        42741,  42756,  42769,  42784,  42856,  42887,  42936,  42945,
        42958,  43061,  43160,  43204,  43428,  43624,  43681,  43773,
        44001,  44091,  44223,  44270,  44556,  45203,  45732,  46909,
        46918,  46998,  47802,  48094,  50211,  50537,  52466,  52521,
        52584,  53591,  53794,  55401,  56703,  57248,  57470,  57615,
        58422,  58761,  59539,  61787,  63421,  63634,  64329,  64411,
        64460,  68067,  68320,  68522,  68633,  69498,  69980,  70141,
        70589,  72757,  73784,  73857,  74496,  74507,  74794,  75511,
        76555,  76609,  76929,  77099,  77348,  77387,  77682,  79525,
        79536,  79835,  79874,  79883,  80760,  81186,  81609,  82400,
        83053,  83297,  83417,  84543,  86155,  87354,  88258,  88307,
        88876,  88897,  89190,  91671,  92777,  93424,  93486,  93788,
        94218,  95534,  95597,  96341,  96789,  96994,  99506, 100623,
       101509, 102441, 102442, 102443, 102444, 102445, 102446, 102782,
       105178, 106679, 106998, 107067, 107637, 108258, 108708, 111690,
       112840, 114271, 116139, 116404, 118308, 119714, 119781, 120505,
       120837, 122479, 123141, 123201, 123238, 123270, 123301, 124036,
       124087, 124115, 124176, 125342, 128479, 131272, 135718, 137705,
       140786, 141257, 141258, 141259, 141260, 142405, 142557, 143188,
       143333, 143334, 143335, 143336, 143728, 143731, 144104, 144108,
       144754, 145800, 146790, 147548, 147605, 149145, 149357, 149522,
       149577, 149587, 149600, 149869, 149874, 150601, 150644, 150647,
       150654, 150660, 150661, 150662, 150663, 150665, 150666, 150667,
       150668, 150669, 150677, 150678, 150679, 150680, 150684, 150687,
       150692, 150697, 150715, 150925, 151006, 151007, 151008, 151009,
       151011, 151103, 151196, 151462, 151519, 151730, 151807, 152019,
       152223, 152295, 153823, 153835, 153885, 154234, 154286, 154371,
       154454, 154587, 154633, 154668, 154670, 154676, 154684, 154693,
       154694, 154697, 154718, 154719, 154720, 154960, 156988, 156990,
       157585, 157868, 157871, 157918, 163149, 163586, 167184, 167305,
       172787, 176049, 177195, 178208, 181966, 182992, 183106, 184379,
       189587, 189701, 189878, 190368, 191074, 191267, 191359, 191544,
       191690, 192382, 192529, 192584, 192687, 195383, 197586, 198868,
       199896, 201098, 201601, 203324, 203328, 203700, 204064, 204079,
       204503, 208651, 212516, 212644, 213092, 213116, 214662, 214775,
       215132, 215953, 215984, 218442, 219025, 219892, 220725, 221018,
       221041, 222133, 222419, 223366, 223572, 223578, 223618, 226814,
       226877, 229712, 229730, 230076, 230476, 231978, 233258, 234574,
       234632, 234633, 234705, 235616, 235634, 235644, 237107, 237426,
       238222, 238366, 238466, 239499, 239501, 240222, 241254, 241445,
       243393, 243547, 243699, 243749, 243848, 244004, 244333, 245347,
       245556, 247673, 247995, 248296, 248971, 249167, 249239, 249607,
       249828, 249963, 250761, 251477, 251866, 251881, 251891, 251904,
       252124, 252774, 254344, 254395, 255403, 255556, 258403, 261056,
       261473, 261925, 262560, 262826, 263080, 263274, 263324, 263877,
       268375, 272521, 274382, 274475, 275992, 276071, 276864, 279863,
       280143, 280149, 281144, 281674,  91742, 284081, 270050,  28369,
       280757, 268307, 245137, 175528, 169025, 200382,  71397,  25218,
       184697,  64414, 265243, 148193,  68938,  88961, 177808,  59426,
       265288,   7059,  42168, 176574, 110490, 207957,  64434, 110176,
       208414, 223158, 265188, 102411,  63684, 263595, 260789,  74944,
        76612, 254912,  27890, 230842, 238531, 123655,  57774, 214045,
        61830, 162468, 107359, 134467, 194985, 232873, 107684, 241077,
       266637, 257288,  71567,   2643, 103053, 279714, 107441,   3211,
       228927, 156330,  81016,   8124, 166842, 133443, 219908, 190472,
       284364, 268074, 269552,  22317,  57282, 135846,    953, 111172,
       162973,  12607, 187510,  14963, 283838, 123015, 246540, 268944,
        39425,  87638, 226547, 127860,  12735, 159091, 124211, 280480,
        84060, 258975, 258739,  41168, 131939, 152850, 264315,  62658,
        85626,  42946, 282300, 245471,  71859, 216059,  64433, 162802,
        15598, 216712, 230925,  44173, 283205,  16029, 118210,  82168,
       191490, 179917,  65019, 121832, 136545, 149373,  63929,  30831,
       174466,  92021, 221340, 150664, 183878, 108777, 142851,  20280,
        40469, 236157,  98717, 257973,  14118, 208164, 141909, 141571,
        15285, 279727, 175789, 133570,  59713, 123811, 124501, 112784,
        34616, 110916,  12352, 142102, 156039, 151741, 145248, 217803,
        11844, 170098,   5924, 151365, 182694, 115433, 156844, 117745,
       218126, 125402, 199028,  36235, 102140,  44898, 272362, 160900,
       278930, 151874, 125323, 144004, 229736,  76218,  51705,  53845,
        61198,  56469, 137592, 232023,  20088, 232953,  71703,  22941,
        42863, 155292, 149015, 112868, 119711, 113066,  86847, 196540,
        26247, 191982, 136040, 165147, 154980, 184567, 186419, 132225,
        41135, 181796, 272659,   7532, 107102, 253872, 176741, 154069,
       161710,  39902,  91846,  19289, 234129,  27771, 245739, 231911,
       157360,  46078, 114041,  92408, 162429, 227865, 235301,  92660,
       268539, 136650,  30043,  88613,    605,   7745, 227714,  92558,
       236965,  73612, 151962, 181039,  94939,  26679,   9084, 216170,
        59667,  65985, 248754,  73843, 177364, 135226, 282689, 220435,
       270884, 255024, 210683, 130747, 155717, 196097, 272468, 158399,
        95640, 112027, 180760,  11608,  92690, 149025,  49415, 155303,
       102627, 215471,  17162, 278154,  93586, 111620, 249821, 262214,
        38892, 112682,  97547,  92404, 144136, 243741, 123609,  36466,
       157892, 214065, 275952, 132791, 272796, 217840, 233289,  33504,
        67165,   3747, 175382,  95762,  31943, 134759, 232374, 187416,
       183548, 103946,  13077, 101532, 177556, 190932,  67821, 163831,
       217020, 253971, 146949, 214520,  98916,  77960,  80592, 128722,
       196570,  75788, 225955,  46880,  79375,  30154, 123295,  70901,
        44036, 210617,  62475, 267397,  32230, 283784, 129453, 222533,
        93491,  49148, 153730,  78444, 200338, 257464, 207076, 169742,
       266629, 272883, 108961, 236210, 236151, 269501,  22684, 261105,
       254355,  56932,   1274,  42574, 228218, 199309, 251022,   2170,
       147046, 248966,  17758, 208691, 211726, 229346,  20565,  37762,
       161177,  29640, 251931,  50150,  49903, 207886, 120368, 227427,
       193754, 120820, 141921, 186056, 219232, 107234, 236219,   2055,
        28236, 280883, 246850, 222196, 279458,  53084, 217879,  95244,
        18534, 124039, 221811, 258814,  32812,  33735,  83356, 276454,
       197710, 184973, 277205,  44782, 219213, 222071,  19651, 135158,
        38885, 257203,  99854, 231904,  55891, 253009, 203479,  11529,
        38487,  37511, 185403,  22454, 147047, 271546,  66272,  22566,
        75042,  27014, 223665, 274943,  32523, 280230, 164254,  68588,
       179510,  19977, 277949,  53269,   7340, 116408, 229506,  73913,
        89948,  95242, 189540, 243705, 184835, 100980,  51670, 125952,
         6730,  80174, 251284, 196700, 134825, 239025,   5035, 130754,
       184202, 213471, 219083, 209870,  57177, 168599, 109582, 174163,
       187616, 173083, 210956, 135449, 254955,  30863, 118516,  18199,
       167112, 173922,  55308, 173031, 233205, 284773, 127637,  88292,
        78781,  63185, 230449,  97335,  99914,  38929, 141070,  86628,
       169160, 119555,  83881,  76056,  51735, 185827, 176650, 168550],
      dtype=int64)
In [75]:
# Features
X_undersample = under_sample_data.iloc[:, under_sample_data.columns != 'Class']

# Label
y_undersample = under_sample_data.iloc[:, under_sample_data.columns == 'Class']
In [77]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_undersample, y_undersample, test_size = 0.3)
In [78]:
# Transform into array
X_train = np.array(X_train)
X_test = np.array(X_test)

# Transform into array
y_train = np.array(y_train)
y_test = np.array(y_test)
In [80]:
# Check the model
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 16)                480       
_________________________________________________________________
dense_1 (Dense)              (None, 24)                408       
_________________________________________________________________
dropout (Dropout)            (None, 24)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 24)                600       
_________________________________________________________________
dense_3 (Dense)              (None, 24)                600       
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 25        
=================================================================
Total params: 2,113
Trainable params: 2,113
Non-trainable params: 0
_________________________________________________________________
In [81]:
# Fit the model and initiate training
model.fit(X_train, y_train, batch_size = 15, epochs = 5)
Epoch 1/5
46/46 [==============================] - 0s 2ms/step - loss: 0.2600 - accuracy: 0.9317
Epoch 2/5
46/46 [==============================] - 0s 1ms/step - loss: 0.1316 - accuracy: 0.9462
Epoch 3/5
46/46 [==============================] - 0s 1ms/step - loss: 0.1042 - accuracy: 0.9608
Epoch 4/5
46/46 [==============================] - 0s 1ms/step - loss: 0.1002 - accuracy: 0.9608
Epoch 5/5
46/46 [==============================] - 0s 2ms/step - loss: 0.0930 - accuracy: 0.9637
Out[81]:
<tensorflow.python.keras.callbacks.History at 0x10dd1040b80>
In [82]:
# Predict using the undersample dataset
y_pred = model.predict(X_test)
y_expected = pd.DataFrame(y_test)

# Visualize confusion matrix
cm = confusion_matrix(y_expected, y_pred.round())
plot_confusion_matrix(cm, classes = [0,1])
plt.show()
Confusion matrix, without normalization
[[150   1]
 [ 19 126]]
In [83]:
# Predict using the entire dataset
y_pred = model.predict(X)
y_expected = pd.DataFrame(y)

# Visualize confusion matrix
cm = confusion_matrix(y_expected, y_pred.round())
plot_confusion_matrix(cm, classes = [0,1])
plt.show()
Confusion matrix, without normalization
[[281537   2778]
 [    41    451]]
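
Training on the balanced undersample lifts fraud recall on the full dataset to 451 of 492 (≈ 0.92), but at the cost of 2,778 legitimate transactions flagged as fraud, a direct consequence of discarding most of the majority class during training.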

SMOTE Application

In [86]:
# Install library
!pip install -U imbalanced-learn
Collecting imbalanced-learn
  Downloading imbalanced_learn-0.7.0-py3-none-any.whl (167 kB)
Requirement already satisfied, skipping upgrade: joblib>=0.11 in c:\users\joseff\miniconda3\envs\joseff\lib\site-packages (from imbalanced-learn) (0.15.1)
Requirement already satisfied, skipping upgrade: scikit-learn>=0.23 in c:\users\joseff\miniconda3\envs\joseff\lib\site-packages (from imbalanced-learn) (0.23.1)
Requirement already satisfied, skipping upgrade: numpy>=1.13.3 in c:\users\joseff\miniconda3\envs\joseff\lib\site-packages (from imbalanced-learn) (1.19.0)
Requirement already satisfied, skipping upgrade: scipy>=0.19.1 in c:\users\joseff\miniconda3\envs\joseff\lib\site-packages (from imbalanced-learn) (1.4.1)
Requirement already satisfied, skipping upgrade: threadpoolctl>=2.0.0 in c:\users\joseff\miniconda3\envs\joseff\lib\site-packages (from scikit-learn>=0.23->imbalanced-learn) (2.1.0)
Installing collected packages: imbalanced-learn
Successfully installed imbalanced-learn-0.7.0
In [92]:
# Import library
from imblearn.over_sampling import SMOTE
In [93]:
# Resample the entire dataset with SMOTE
X_resample, y_resample = SMOTE().fit_resample(X, y.values.ravel())
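
SMOTE balances the classes by synthesizing new minority samples along line segments between existing fraud cases and their nearest neighbours, rather than simply duplicating them. A quick sanity check of the resampled class balance (a minimal sketch; the counts follow from oversampling the fraud class up to the 284,315 legitimate transactions):

# Both classes should now contain the same number of samples
print(np.bincount(y_resample))   # expected: [284315 284315]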
In [94]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_resample, y_resample, test_size = 0.3)
In [95]:
# Transform into array
X_train = np.array(X_train)
X_test = np.array(X_test)

# Transform into array
y_train = np.array(y_train)
y_test = np.array(y_test)
In [96]:
# Fit the model and initiate training
model.fit(X_train, y_train, batch_size = 15, epochs = 5)
Epoch 1/5
26537/26537 [==============================] - 110s 4ms/step - loss: 0.0282 - accuracy: 0.9904
Epoch 2/5
26537/26537 [==============================] - 119s 4ms/step - loss: 0.0144 - accuracy: 0.9961
Epoch 3/5
26537/26537 [==============================] - 89s 3ms/step - loss: 0.0117 - accuracy: 0.9968
Epoch 4/5
26537/26537 [==============================] - 116s 4ms/step - loss: 0.0111 - accuracy: 0.9973
Epoch 5/5
26537/26537 [==============================] - 103s 4ms/step - loss: 0.0096 - accuracy: 0.9975
Out[96]:
<tensorflow.python.keras.callbacks.History at 0x10d8003c790>
In [97]:
# Predict using the SMOTE test dataset
y_pred = model.predict(X_test)
y_expected = pd.DataFrame(y_test)

# Visualize confusion matrix
cm = confusion_matrix(y_expected, y_pred.round())
plot_confusion_matrix(cm, classes = [0,1])
plt.show()
Confusion matrix, without normalization
[[85014   336]
 [   54 85185]]
In [98]:
# Predict using the entire dataset
y_pred = model.predict(X)
y_expected = pd.DataFrame(y)

# Visualize confusion matrix
cm = confusion_matrix(y_expected, y_pred.round())
plot_confusion_matrix(cm, classes = [0,1])
plt.show()
Confusion matrix, without normalization
[[283268   1047]
 [     5    487]]
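
With SMOTE, the network recovers 487 of the 492 frauds in the full dataset (recall ≈ 0.99) while misclassifying 1,047 legitimate transactions, far fewer false positives than the undersampled model produced.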

Conclusion

Several models were tested: a deep neural network, a random forest, and a decision tree. Because the dataset is heavily imbalanced, undersampling was applied first, followed by SMOTE (Synthetic Minority Over-sampling Technique), the most widely used oversampling method.

The final model selected was the deep neural network, which achieved the lowest prediction error on the fraud class once the resampled data was used for training.