Case Study: Credit Card Fraud Detection

Introduction

Fraud detection, one of many applications of anomaly detection, is an important problem in finance. Can a transaction be flagged as fraudulent based on the history of past transactions? In this project, a neural network is implemented to classify each transaction as fraudulent or legitimate, and its results are compared with tree-based models and with resampling techniques for handling class imbalance.

Problem:

  • Build a model that detects fraudulent transactions

Dataset:

  • The dataset contains 28 anonymized variables (V1–V28), an “Amount” variable, a “Time” variable, and a target variable, “Class”. The variables are anonymized to protect customer privacy because the dataset is in the public domain. The dataset is available from the source listed below. A “Class” value of ‘0’ corresponds to a legitimate transaction, whereas ‘1’ corresponds to a fraudulent one.

Source: ULB-ML Group

Libraries and Data Importation

In [22]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import keras

np.random.seed(0)
In [2]:
# Import data
data = pd.read_csv('project_data/creditcard.csv')

Data Exploration

In [3]:
data.head()
Out[3]:
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 ... V21 V22 V23 V24 V25 V26 V27 V28 Amount Class
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928 0.128539 -0.189115 0.133558 -0.021053 149.62 0
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 0.085102 -0.255425 ... -0.225775 -0.638672 0.101288 -0.339846 0.167170 0.125895 -0.008983 0.014724 2.69 0
2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461 0.247676 -1.514654 ... 0.247998 0.771679 0.909412 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752 378.66 0
3 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609 0.377436 -1.387024 ... -0.108300 0.005274 -0.190321 -1.175575 0.647376 -0.221929 0.062723 0.061458 123.50 0
4 2.0 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941 -0.270533 0.817739 ... -0.009431 0.798278 -0.137458 0.141267 -0.206010 0.502292 0.219422 0.215153 69.99 0

5 rows × 31 columns

In [4]:
# Check data dimension
data.shape
Out[4]:
(284807, 31)
In [5]:
# Check data info
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     284807 non-null  float64
 22  V22     284807 non-null  float64
 23  V23     284807 non-null  float64
 24  V24     284807 non-null  float64
 25  V25     284807 non-null  float64
 26  V26     284807 non-null  float64
 27  V27     284807 non-null  float64
 28  V28     284807 non-null  float64
 29  Amount  284807 non-null  float64
 30  Class   284807 non-null  int64  
dtypes: float64(30), int64(1)
memory usage: 67.4 MB
In [6]:
# Check for missing values
data.isnull().mean()
Out[6]:
Time      0.0
V1        0.0
V2        0.0
V3        0.0
V4        0.0
V5        0.0
V6        0.0
V7        0.0
V8        0.0
V9        0.0
V10       0.0
V11       0.0
V12       0.0
V13       0.0
V14       0.0
V15       0.0
V16       0.0
V17       0.0
V18       0.0
V19       0.0
V20       0.0
V21       0.0
V22       0.0
V23       0.0
V24       0.0
V25       0.0
V26       0.0
V27       0.0
V28       0.0
Amount    0.0
Class     0.0
dtype: float64
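
The target is highly imbalanced, which shapes every modeling decision that follows. A minimal sketch for inspecting the class balance (the counts shown are inferred from the 492 fraud cases identified later in the notebook):

# Count transactions per class; fraud ('1') is a tiny fraction of the data
print(data['Class'].value_counts())
# 0    284315
# 1       492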

Data Pre-processing

In [7]:
# Implement feature scaling
from sklearn.preprocessing import StandardScaler
data['NormalizedAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))

# Drop the original Amount column
data = data.drop(['Amount'], axis = 1)

# Drop the Time column
data = data.drop(['Time'], axis = 1)
In [8]:
# Check data
data.head()
Out[8]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ... V21 V22 V23 V24 V25 V26 V27 V28 Class NormalizedAmount
0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 0.098698 0.363787 0.090794 ... -0.018307 0.277838 -0.110474 0.066928 0.128539 -0.189115 0.133558 -0.021053 0 0.244964
1 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 0.085102 -0.255425 -0.166974 ... -0.225775 -0.638672 0.101288 -0.339846 0.167170 0.125895 -0.008983 0.014724 0 -0.342475
2 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461 0.247676 -1.514654 0.207643 ... 0.247998 0.771679 0.909412 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752 0 1.160686
3 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609 0.377436 -1.387024 -0.054952 ... -0.108300 0.005274 -0.190321 -1.175575 0.647376 -0.221929 0.062723 0.061458 0 0.140534
4 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941 -0.270533 0.817739 0.753074 ... -0.009431 0.798278 -0.137458 0.141267 -0.206010 0.502292 0.219422 0.215153 0 -0.073403

5 rows × 30 columns

In [9]:
# Create label for dependent and independent variable
X = data.iloc[:, data.columns != 'Class']
y = data.iloc[:, data.columns == 'Class']
In [10]:
# Check independent variables
X.head()
Out[10]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ... V20 V21 V22 V23 V24 V25 V26 V27 V28 NormalizedAmount
0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 0.098698 0.363787 0.090794 ... 0.251412 -0.018307 0.277838 -0.110474 0.066928 0.128539 -0.189115 0.133558 -0.021053 0.244964
1 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 0.085102 -0.255425 -0.166974 ... -0.069083 -0.225775 -0.638672 0.101288 -0.339846 0.167170 0.125895 -0.008983 0.014724 -0.342475
2 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461 0.247676 -1.514654 0.207643 ... 0.524980 0.247998 0.771679 0.909412 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752 1.160686
3 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609 0.377436 -1.387024 -0.054952 ... -0.208038 -0.108300 0.005274 -0.190321 -1.175575 0.647376 -0.221929 0.062723 0.061458 0.140534
4 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941 -0.270533 0.817739 0.753074 ... 0.408542 -0.009431 0.798278 -0.137458 0.141267 -0.206010 0.502292 0.219422 0.215153 -0.073403

5 rows × 29 columns

In [11]:
# Check dependent variable
y.head()
Out[11]:
Class
0 0
1 0
2 0
3 0
4 0
In [12]:
# Split the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
In [13]:
# Check train dataset dimensions
X_train.shape, y_train.shape
Out[13]:
((199364, 29), (199364, 1))
In [14]:
# Check test dataset dimensions
X_test.shape, y_test.shape
Out[14]:
((85443, 29), (85443, 1))
In [15]:
# Transform train dataset into array
X_train = np.array(X_train)
y_train = np.array(y_train)

# Transform test dataset into array
X_test = np.array(X_test)
y_test = np.array(y_test)

Deep Neural Network Model

In [16]:
# Import libraries
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout

# Build model
model = Sequential([
    Dense(units = 16, input_dim = 29, activation = 'relu'),
    Dense(units = 24, activation = 'relu'),
    Dropout(0.5),
    Dense(units = 24, activation = 'relu'),
    Dense(units = 24, activation = 'relu'),
    Dense(1, activation = 'sigmoid')
])
In [17]:
# Check model summary
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 16)                480       
_________________________________________________________________
dense_1 (Dense)              (None, 24)                408       
_________________________________________________________________
dropout (Dropout)            (None, 24)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 24)                600       
_________________________________________________________________
dense_3 (Dense)              (None, 24)                600       
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 25        
=================================================================
Total params: 2,113
Trainable params: 2,113
Non-trainable params: 0
_________________________________________________________________
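
Each Dense layer's parameter count is inputs × units + biases: 29 × 16 + 16 = 480 for the first layer, 16 × 24 + 24 = 408 for the second, 24 × 24 + 24 = 600 for each of the next two, and 24 × 1 + 1 = 25 for the sigmoid output, giving 2,113 in total. The Dropout layer adds no parameters; it randomly zeroes half of its inputs during training to reduce overfitting.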

Model Training

In [18]:
# Compile the model
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

# Fit the model and initiate training
model.fit(X_train, y_train, batch_size = 15, epochs = 5)
Epoch 1/5
13291/13291 [==============================] - 26s 2ms/step - loss: 0.0086 - accuracy: 0.9990
Epoch 2/5
13291/13291 [==============================] - 33s 2ms/step - loss: 0.0039 - accuracy: 0.9993
Epoch 3/5
13291/13291 [==============================] - 40s 3ms/step - loss: 0.0037 - accuracy: 0.9993
Epoch 4/5
13291/13291 [==============================] - 37s 3ms/step - loss: 0.0034 - accuracy: 0.9993
Epoch 5/5
13291/13291 [==============================] - 39s 3ms/step - loss: 0.0034 - accuracy: 0.9993
Out[18]:
<tensorflow.python.keras.callbacks.History at 0x10dcd126460>

Model Evaluation

In [19]:
# Check model score in test dataset
score = model.evaluate(X_test, y_test)
print(score)
2671/2671 [==============================] - 3s 1ms/step - loss: 0.0037 - accuracy: 0.9993
[0.003683381946757436, 0.9993445873260498]
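
Accuracy alone is a weak signal here: because more than 99.8% of transactions are legitimate, a model that never flags fraud would score almost as well. A minimal sketch of per-class metrics, using scikit-learn's classification_report on the same test predictions that the next cells examine through a confusion matrix:

from sklearn.metrics import classification_report

# Per-class precision and recall on the test set; predictions are rounded to 0/1 labels
print(classification_report(y_test.ravel(), model.predict(X_test).round(), digits=4))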
In [21]:
# Import libraries
import itertools
from sklearn.metrics import confusion_matrix

# Create a plot function
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()
In [24]:
# Get the predictions
y_pred = model.predict(X_test)

# Transform into dataframe
y_test = pd.DataFrame(y_test)
In [25]:
# Check confusion matrix for test dataset
cm =  confusion_matrix(y_test, y_pred.round())
print(cm)
[[85286    10]
 [   46   101]]
In [26]:
# Visualize confusion matrix for test dataset
plot_confusion_matrix(cm, classes = [0, 1])
plt.show()
Confusion matrix, without normalization
[[85286    10]
 [   46   101]]
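
Despite roughly 99.93% accuracy, the model recovers only 101 of the 147 frauds in the test set (recall ≈ 0.69), while 101 of the 111 flagged transactions are true frauds (precision ≈ 0.91). Closing this recall gap is the motivation for the resampling experiments below.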
In [28]:
# Evaluate the neural network on the entire dataset
y_pred = model.predict(X)
y_expected = pd.DataFrame(y)

# Create confusion matrix for entire dataset
cm = confusion_matrix(y_expected, y_pred.round())

# Visualize confusion matrix for entire dataset
plot_confusion_matrix(cm, classes = [0, 1])
plt.show()
Confusion matrix, without normalization
[[284286     29]
 [   169    323]]

Random Forest Model

In [37]:
# Apply random forest
from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier(n_estimators = 10)
random_forest.fit(X_train, y_train.values.ravel())

# Get predictions
y_pred = random_forest.predict(X_test)
In [38]:
# Check test score
random_forest.score(X_test, y_test)
Out[38]:
0.9994967405170698
In [43]:
# Check confusion matrix for test dataset
cm = confusion_matrix(y_test, y_pred)

# Print confusion matrix
plot_confusion_matrix(cm, classes = [0, 1])
plt.show()
Confusion matrix, without normalization
[[85290     6]
 [   37   110]]
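
On the test set the random forest catches 110 of 147 frauds (recall ≈ 0.75) with only 6 false positives (precision ≈ 0.95), a noticeably better balance than the neural network trained on the raw, imbalanced data.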
In [46]:
# Evaluate the random forest on the entire dataset
y_pred = random_forest.predict(X)

# Confusion matrix for entire dataset
cm = confusion_matrix(y, y_pred.round())

# Print confusion matrix
plot_confusion_matrix(cm, classes = [0, 1])
plt.show()
Confusion matrix, without normalization
[[284308      7]
 [    55    437]]

Decision Tree Model

In [48]:
# Import library
from sklearn.tree import DecisionTreeClassifier
decision_tree = DecisionTreeClassifier()

# Fit the model
decision_tree.fit(X_train, y_train.values.ravel())

# Get predictions
y_pred = decision_tree.predict(X_test)
In [49]:
# Check test score
decision_tree.score(X_test, y_test)
Out[49]:
0.9991807403766253
In [50]:
# Evaluate the decision tree on the entire dataset
y_pred = decision_tree.predict(X)

# Confusion matrix for entire dataset
cm = confusion_matrix(y, y_pred.round())

# Print confusion matrix
plot_confusion_matrix(cm, classes = [0, 1])
plt.show()
Confusion matrix, without normalization
[[284279     36]
 [    34    458]]

Undersampling Application

In [69]:
# Undersampling technique
# Normal indices
normal_indices = data[data['Class'] == 0].index

# Fraud Indices
fraud_indices = np.array(data[data['Class'] == 1].index)

# Get the number of fraud data
number_records_fraud = len(fraud_indices)
print(number_records_fraud)
492
In [70]:
# Generates a random sample from a given 1-D array
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace = False)

# Create an array
random_normal_indices = np.array(random_normal_indices)

# Check the size
print(len(random_normal_indices))
492
In [71]:
# Concatenate the fraud and sampled normal indices
under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])
print(len(under_sample_indices))
984
In [74]:
# Select the undersampled data by index
under_sample_data = data.iloc[under_sample_indices, :]
under_sample_indices
Out[74]:
array([   541,    623,   4920,   6108,   6329,   6331,   6334,   6336,
         6338,   6427,   6446,   6472,   6529,   6609,   6641,   6717,
         6719,   6734,   6774,   6820,   6870,   6882,   6899,   6903,
         6971,   8296,   8312,   8335,   8615,   8617,   8842,   8845,
         8972,   9035,   9179,   9252,   9487,   9509,  10204,  10484,
        10497,  10498,  10568,  10630,  10690,  10801,  10891,  10897,
        11343,  11710,  11841,  11880,  12070,  12108,  12261,  12369,
        14104,  14170,  14197,  14211,  14338,  15166,  15204,  15225,
        15451,  15476,  15506,  15539,  15566,  15736,  15751,  15781,
        15810,  16415,  16780,  16863,  17317,  17366,  17407,  17453,
        17480,  18466,  18472,  18773,  18809,  20198,  23308,  23422,
        26802,  27362,  27627,  27738,  27749,  29687,  30100,  30314,
        30384,  30398,  30442,  30473,  30496,  31002,  33276,  39183,
        40085,  40525,  41395,  41569,  41943,  42007,  42009,  42473,
        42528,  42549,  42590,  42609,  42635,  42674,  42696,  42700,
        42741,  42756,  42769,  42784,  42856,  42887,  42936,  42945,
        42958,  43061,  43160,  43204,  43428,  43624,  43681,  43773,
        44001,  44091,  44223,  44270,  44556,  45203,  45732,  46909,
        46918,  46998,  47802,  48094,  50211,  50537,  52466,  52521,
        52584,  53591,  53794,  55401,  56703,  57248,  57470,  57615,
        58422,  58761,  59539,  61787,  63421,  63634,  64329,  64411,
        64460,  68067,  68320,  68522,  68633,  69498,  69980,  70141,
        70589,  72757,  73784,  73857,  74496,  74507,  74794,  75511,
        76555,  76609,  76929,  77099,  77348,  77387,  77682,  79525,
        79536,  79835,  79874,  79883,  80760,  81186,  81609,  82400,
        83053,  83297,  83417,  84543,  86155,  87354,  88258,  88307,
        88876,  88897,  89190,  91671,  92777,  93424,  93486,  93788,
        94218,  95534,  95597,  96341,  96789,  96994,  99506, 100623,
       101509, 102441, 102442, 102443, 102444, 102445, 102446, 102782,
       105178, 106679, 106998, 107067, 107637, 108258, 108708, 111690,
       112840, 114271, 116139, 116404, 118308, 119714, 119781, 120505,
       120837, 122479, 123141, 123201, 123238, 123270, 123301, 124036,
       124087, 124115, 124176, 125342, 128479, 131272, 135718, 137705,
       140786, 141257, 141258, 141259, 141260, 142405, 142557, 143188,
       143333, 143334, 143335, 143336, 143728, 143731, 144104, 144108,
       144754, 145800, 146790, 147548, 147605, 149145, 149357, 149522,
       149577, 149587, 149600, 149869, 149874, 150601, 150644, 150647,
       150654, 150660, 150661, 150662, 150663, 150665, 150666, 150667,
       150668, 150669, 150677, 150678, 150679, 150680, 150684, 150687,
       150692, 150697, 150715, 150925, 151006, 151007, 151008, 151009,
       151011, 151103, 151196, 151462, 151519, 151730, 151807, 152019,
       152223, 152295, 153823, 153835, 153885, 154234, 154286, 154371,
       154454, 154587, 154633, 154668, 154670, 154676, 154684, 154693,
       154694, 154697, 154718, 154719, 154720, 154960, 156988, 156990,
       157585, 157868, 157871, 157918, 163149, 163586, 167184, 167305,
       172787, 176049, 177195, 178208, 181966, 182992, 183106, 184379,
       189587, 189701, 189878, 190368, 191074, 191267, 191359, 191544,
       191690, 192382, 192529, 192584, 192687, 195383, 197586, 198868,
       199896, 201098, 201601, 203324, 203328, 203700, 204064, 204079,
       204503, 208651, 212516, 212644, 213092, 213116, 214662, 214775,
       215132, 215953, 215984, 218442, 219025, 219892, 220725, 221018,
       221041, 222133, 222419, 223366, 223572, 223578, 223618, 226814,
       226877, 229712, 229730, 230076, 230476, 231978, 233258, 234574,
       234632, 234633, 234705, 235616, 235634, 235644, 237107, 237426,
       238222, 238366, 238466, 239499, 239501, 240222, 241254, 241445,
       243393, 243547, 243699, 243749, 243848, 244004, 244333, 245347,
       245556, 247673, 247995, 248296, 248971, 249167, 249239, 249607,
       249828, 249963, 250761, 251477, 251866, 251881, 251891, 251904,
       252124, 252774, 254344, 254395, 255403, 255556, 258403, 261056,
       261473, 261925, 262560, 262826, 263080, 263274, 263324, 263877,
       268375, 272521, 274382, 274475, 275992, 276071, 276864, 279863,
       280143, 280149, 281144, 281674,  91742, 284081, 270050,  28369,
       280757, 268307, 245137, 175528, 169025, 200382,  71397,  25218,
       184697,  64414, 265243, 148193,  68938,  88961, 177808,  59426,
       265288,   7059,  42168, 176574, 110490, 207957,  64434, 110176,
       208414, 223158, 265188, 102411,  63684, 263595, 260789,  74944,
        76612, 254912,  27890, 230842, 238531, 123655,  57774, 214045,
        61830, 162468, 107359, 134467, 194985, 232873, 107684, 241077,
       266637, 257288,  71567,   2643, 103053, 279714, 107441,   3211,
       228927, 156330,  81016,   8124, 166842, 133443, 219908, 190472,
       284364, 268074, 269552,  22317,  57282, 135846,    953, 111172,
       162973,  12607, 187510,  14963, 283838, 123015, 246540, 268944,
        39425,  87638, 226547, 127860,  12735, 159091, 124211, 280480,
        84060, 258975, 258739,  41168, 131939, 152850, 264315,  62658,
        85626,  42946, 282300, 245471,  71859, 216059,  64433, 162802,
        15598, 216712, 230925,  44173, 283205,  16029, 118210,  82168,
       191490, 179917,  65019, 121832, 136545, 149373,  63929,  30831,
       174466,  92021, 221340, 150664, 183878, 108777, 142851,  20280,
        40469, 236157,  98717, 257973,  14118, 208164, 141909, 141571,
        15285, 279727, 175789, 133570,  59713, 123811, 124501, 112784,
        34616, 110916,  12352, 142102, 156039, 151741, 145248, 217803,
        11844, 170098,   5924, 151365, 182694, 115433, 156844, 117745,
       218126, 125402, 199028,  36235, 102140,  44898, 272362, 160900,
       278930, 151874, 125323, 144004, 229736,  76218,  51705,  53845,
        61198,  56469, 137592, 232023,  20088, 232953,  71703,  22941,
        42863, 155292, 149015, 112868, 119711, 113066,  86847, 196540,
        26247, 191982, 136040, 165147, 154980, 184567, 186419, 132225,
        41135, 181796, 272659,   7532, 107102, 253872, 176741, 154069,
       161710,  39902,  91846,  19289, 234129,  27771, 245739, 231911,
       157360,  46078, 114041,  92408, 162429, 227865, 235301,  92660,
       268539, 136650,  30043,  88613,    605,   7745, 227714,  92558,
       236965,  73612, 151962, 181039,  94939,  26679,   9084, 216170,
        59667,  65985, 248754,  73843, 177364, 135226, 282689, 220435,
       270884, 255024, 210683, 130747, 155717, 196097, 272468, 158399,
        95640, 112027, 180760,  11608,  92690, 149025,  49415, 155303,
       102627, 215471,  17162, 278154,  93586, 111620, 249821, 262214,
        38892, 112682,  97547,  92404, 144136, 243741, 123609,  36466,
       157892, 214065, 275952, 132791, 272796, 217840, 233289,  33504,
        67165,   3747, 175382,  95762,  31943, 134759, 232374, 187416,
       183548, 103946,  13077, 101532, 177556, 190932,  67821, 163831,
       217020, 253971, 146949, 214520,  98916,  77960,  80592, 128722,
       196570,  75788, 225955,  46880,  79375,  30154, 123295,  70901,
        44036, 210617,  62475, 267397,  32230, 283784, 129453, 222533,
        93491,  49148, 153730,  78444, 200338, 257464, 207076, 169742,
       266629, 272883, 108961, 236210, 236151, 269501,  22684, 261105,
       254355,  56932,   1274,  42574, 228218, 199309, 251022,   2170,
       147046, 248966,  17758, 208691, 211726, 229346,  20565,  37762,
       161177,  29640, 251931,  50150,  49903, 207886, 120368, 227427,
       193754, 120820, 141921, 186056, 219232, 107234, 236219,   2055,
        28236, 280883, 246850, 222196, 279458,  53084, 217879,  95244,
        18534, 124039, 221811, 258814,  32812,  33735,  83356, 276454,
       197710, 184973, 277205,  44782, 219213, 222071,  19651, 135158,
        38885, 257203,  99854, 231904,  55891, 253009, 203479,  11529,
        38487,  37511, 185403,  22454, 147047, 271546,  66272,  22566,
        75042,  27014, 223665, 274943,  32523, 280230, 164254,  68588,
       179510,  19977, 277949,  53269,   7340, 116408, 229506,  73913,
        89948,  95242, 189540, 243705, 184835, 100980,  51670, 125952,
         6730,  80174, 251284, 196700, 134825, 239025,   5035, 130754,
       184202, 213471, 219083, 209870,  57177, 168599, 109582, 174163,
       187616, 173083, 210956, 135449, 254955,  30863, 118516,  18199,
       167112, 173922,  55308, 173031, 233205, 284773, 127637,  88292,
        78781,  63185, 230449,  97335,  99914,  38929, 141070,  86628,
       169160, 119555,  83881,  76056,  51735, 185827, 176650, 168550],
      dtype=int64)
In [75]:
# Features
X_undersample = under_sample_data.iloc[:, under_sample_data.columns != 'Class']

# Label
y_undersample = under_sample_data.iloc[:, under_sample_data.columns == 'Class']
In [77]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_undersample, y_undersample, test_size = 0.3)
In [78]:
# Transform into array
X_train = np.array(X_train)
X_test = np.array(X_test)

# Transform into array
y_train = np.array(y_train)
y_test = np.array(y_test)
In [80]:
# Check the model
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 16)                480       
_________________________________________________________________
dense_1 (Dense)              (None, 24)                408       
_________________________________________________________________
dropout (Dropout)            (None, 24)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 24)                600       
_________________________________________________________________
dense_3 (Dense)              (None, 24)                600       
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 25        
=================================================================
Total params: 2,113
Trainable params: 2,113
Non-trainable params: 0
_________________________________________________________________
In [81]:
# Fit the model and initiate training
model.fit(X_train, y_train, batch_size = 15, epochs = 5)
Epoch 1/5
46/46 [==============================] - 0s 2ms/step - loss: 0.2600 - accuracy: 0.9317
Epoch 2/5
46/46 [==============================] - 0s 1ms/step - loss: 0.1316 - accuracy: 0.9462
Epoch 3/5
46/46 [==============================] - 0s 1ms/step - loss: 0.1042 - accuracy: 0.9608
Epoch 4/5
46/46 [==============================] - 0s 1ms/step - loss: 0.1002 - accuracy: 0.9608
Epoch 5/5
46/46 [==============================] - 0s 2ms/step - loss: 0.0930 - accuracy: 0.9637
Out[81]:
<tensorflow.python.keras.callbacks.History at 0x10dd1040b80>
In [82]:
# Predict using the undersample dataset
y_pred = model.predict(X_test)
y_expected = pd.DataFrame(y_test)

# Visualize confusion matrix
cm = confusion_matrix(y_expected, y_pred.round())
plot_confusion_matrix(cm, classes = [0,1])
plt.show()
Confusion matrix, without normalization
[[150   1]
 [ 19 126]]
In [83]:
# Predict using the entire dataset
y_pred = model.predict(X)
y_expected = pd.DataFrame(y)

# Visualize confusion matrix
cm = confusion_matrix(y_expected, y_pred.round())
plot_confusion_matrix(cm, classes = [0,1])
plt.show()
Confusion matrix, without normalization
[[281537   2778]
 [    41    451]]
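
Training on the balanced undersample lifts fraud recall on the full dataset to 451 of 492 (≈ 0.92), but at the cost of 2,778 legitimate transactions flagged as fraud, a direct consequence of discarding most of the majority class during training.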

SMOTE Application

In [86]:
# Install library
!pip install -U imbalanced-learn
Collecting imbalanced-learn
  Downloading imbalanced_learn-0.7.0-py3-none-any.whl (167 kB)
Requirement already satisfied, skipping upgrade: joblib>=0.11 in c:\users\joseff\miniconda3\envs\joseff\lib\site-packages (from imbalanced-learn) (0.15.1)
Requirement already satisfied, skipping upgrade: scikit-learn>=0.23 in c:\users\joseff\miniconda3\envs\joseff\lib\site-packages (from imbalanced-learn) (0.23.1)
Requirement already satisfied, skipping upgrade: numpy>=1.13.3 in c:\users\joseff\miniconda3\envs\joseff\lib\site-packages (from imbalanced-learn) (1.19.0)
Requirement already satisfied, skipping upgrade: scipy>=0.19.1 in c:\users\joseff\miniconda3\envs\joseff\lib\site-packages (from imbalanced-learn) (1.4.1)
Requirement already satisfied, skipping upgrade: threadpoolctl>=2.0.0 in c:\users\joseff\miniconda3\envs\joseff\lib\site-packages (from scikit-learn>=0.23->imbalanced-learn) (2.1.0)
Installing collected packages: imbalanced-learn
Successfully installed imbalanced-learn-0.7.0
In [92]:
# Import library
from imblearn.over_sampling import SMOTE
In [93]:
# Resample the entire dataset with SMOTE
X_resample, y_resample = SMOTE().fit_resample(X, y.values.ravel())
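
SMOTE balances the classes by synthesizing new minority samples along line segments between existing fraud cases and their nearest neighbours, rather than simply duplicating them. A quick sanity check of the resampled class balance (a minimal sketch; the counts follow from oversampling the fraud class up to the 284,315 legitimate transactions):

# Both classes should now contain the same number of samples
print(np.bincount(y_resample))   # expected: [284315 284315]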
In [94]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_resample, y_resample, test_size = 0.3)
In [95]:
# Transform into array
X_train = np.array(X_train)
X_test = np.array(X_test)

# Transform into array
y_train = np.array(y_train)
y_test = np.array(y_test)
In [96]:
# Fit the model and initiate training
model.fit(X_train, y_train, batch_size = 15, epochs = 5)
Epoch 1/5
26537/26537 [==============================] - 110s 4ms/step - loss: 0.0282 - accuracy: 0.9904
Epoch 2/5
26537/26537 [==============================] - 119s 4ms/step - loss: 0.0144 - accuracy: 0.9961
Epoch 3/5
26537/26537 [==============================] - 89s 3ms/step - loss: 0.0117 - accuracy: 0.9968
Epoch 4/5
26537/26537 [==============================] - 116s 4ms/step - loss: 0.0111 - accuracy: 0.9973
Epoch 5/5
26537/26537 [==============================] - 103s 4ms/step - loss: 0.0096 - accuracy: 0.9975
Out[96]:
<tensorflow.python.keras.callbacks.History at 0x10d8003c790>
In [97]:
# Predict using the SMOTE test dataset
y_pred = model.predict(X_test)
y_expected = pd.DataFrame(y_test)

# Visualize confusion matrix
cm = confusion_matrix(y_expected, y_pred.round())
plot_confusion_matrix(cm, classes = [0,1])
plt.show()
Confusion matrix, without normalization
[[85014   336]
 [   54 85185]]
In [98]:
# Predict using the entire dataset
y_pred = model.predict(X)
y_expected = pd.DataFrame(y)

# Visualize confusion matrix
cm = confusion_matrix(y_expected, y_pred.round())
plot_confusion_matrix(cm, classes = [0,1])
plt.show()
Confusion matrix, without normalization
[[283268   1047]
 [     5    487]]
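
With SMOTE, the network recovers 487 of the 492 frauds in the full dataset (recall ≈ 0.99) while misclassifying 1,047 legitimate transactions, far fewer false positives than the undersampled model produced.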

Conclusion

Several models were tested: a deep neural network, a random forest, and a decision tree. Because the dataset is heavily imbalanced, undersampling was applied first, followed by SMOTE (Synthetic Minority Over-sampling Technique), the most widely used oversampling method.

The final model selected was the deep neural network, which achieved the lowest prediction error on the fraud class once the resampled data was used for training.