Case Study: Spam Email Detection

Introduction

Spam emails can be very irritating when you check your inbox. Luckily, there is a solution, and big email providers such as Gmail and Outlook use it to help their users filter their inboxes: machine learning. A Naive Bayes classifier, in particular, can predict whether a message is spam or ham.

Problem:

Predict whether each message is spam or ham (legitimate).

Dataset:

The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains one set of 5,574 English SMS messages, each tagged as ham or spam.
The files contain one message per line. Each line is composed of two columns: v1 contains the label (ham or spam) and v2 contains the raw text.
Note that the file loaded in this notebook, project_data/emails.csv, differs from that description: it has 5,728 rows of email text in two columns, text (the raw message) and spam (1 for spam, 0 for ham).

Source: Kaggle Competition

Data and Libraries Importation

# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# Import data
spam_df = pd.read_csv('project_data/emails.csv')

Data Exploration

# show dataframe
spam_df.head()

# Check text
spam_df['text'][0]

"Subject: naturally irresistible your corporate identity  lt is really hard to recollect a company : the  market is full of suqgestions and the information isoverwhelminq ; but a good  catchy logo , stylish statlonery and outstanding website  will make the task much easier .  we do not promise that havinq ordered a iogo your  company will automaticaily become a world ieader : it isguite ciear that  without good products , effective business organization and practicable aim it  will be hotat nowadays market ; but we do promise that your marketing efforts  will become much more effective . here is the list of clear  benefits : creativeness : hand - made , original logos , specially done  to reflect your distinctive company image . convenience : logo and stationery  are provided in all formats ; easy - to - use content management system letsyou  change your website content and even its structure . promptness : you  will see logo drafts within three business days . affordability : your  marketing break - through shouldn ' t make gaps in your budget . 100 % satisfaction  guaranteed : we provide unlimited amount of changes with no extra fees for you to  be surethat you will love the result of this collaboration . have a look at our  portfolio _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ not interested . . . _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _"

# Describe dataset
spam_df.describe()

# More info
spam_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5728 entries, 0 to 5727
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    5728 non-null   object
 1   spam    5728 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 89.6+ KB

Data Visualization

# Check ham data
ham = spam_df[ spam_df['spam'] == 0]
ham

# Check spam data
spam = spam_df[ spam_df['spam'] == 1]
spam

# Check spam percentage
print('spam percentage:', (len(spam) / len(spam_df))*100, '%')

spam percentage: 23.88268156424581 %

# Check ham percentage
print('ham percentage:', (len(ham) / len(spam_df))*100, '%')

ham percentage: 76.11731843575419 %
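The two percentages above can also be read off in one step with pandas. The sketch below uses a small hypothetical frame (toy_df is not the real dataset) that only borrows the column names text and spam from spam_df.

```python
import pandas as pd

# Hypothetical toy frame with the same column names as spam_df
toy_df = pd.DataFrame({
    "text": ["win cash now", "see you at noon", "lunch tomorrow?", "report attached"],
    "spam": [1, 0, 0, 0],
})

# Share of each class as a percentage (0 = ham, 1 = spam)
class_share = toy_df["spam"].value_counts(normalize=True).mul(100)
print(class_share)
```

Applied to the real data, the same call would be spam_df['spam'].value_counts(normalize=True).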

# Create countplot for spam and ham dataset
sns.countplot(x='spam', data=spam_df)
plt.show()

Data Preprocessing

# Import library
from sklearn.feature_extraction.text import CountVectorizer

# Transform words into numbers by CountVectorizer
vectorizer = CountVectorizer()
spamham_countvectorizer = vectorizer.fit_transform(spam_df['text'])

# Get feature names (get_feature_names() was removed in recent scikit-learn releases)
vectorizer.get_feature_names_out();

# Check in array format
print(spamham_countvectorizer.toarray())

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [4 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

# Check dimension
spamham_countvectorizer.shape

(5728, 37303)
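To make the 5728 x 37303 matrix less abstract, here is a minimal sketch of what CountVectorizer does on three hypothetical messages (toy_messages is made up for illustration). Each column corresponds to one vocabulary token and each cell holds how often that token occurs in the message.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy messages, not taken from the dataset
toy_messages = ["free money now", "call me now", "free free prize"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_messages)

# vocabulary_ maps token -> column index; sort tokens by that index
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocab)
# ['call', 'free', 'me', 'money', 'now', 'prize']

print(toy_counts.toarray())
# [[0 1 0 1 1 0]
#  [1 0 1 0 1 0]
#  [0 2 0 0 0 1]]
```

The result is stored as a sparse matrix, which is why the notebook calls toarray() before printing it.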

Model Creation

# Import library
from sklearn.naive_bayes import MultinomialNB

# Create labels
label = spam_df['spam'].values
label

array([1, 1, 1, ..., 0, 0, 0], dtype=int64)

# Fit the model
NB_classifier = MultinomialNB()
NB_classifier.fit(spamham_countvectorizer, label)

MultinomialNB()
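For intuition about what the fitted model computes, the following is a rough from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing. The count matrix X_toy, the labels y_toy, and the helper predict_toy are all hypothetical; this is an illustration of the idea, not scikit-learn's actual implementation.

```python
import numpy as np

# Hypothetical count matrix: rows are messages, columns are the
# tokens ["free", "money", "pizza"]; labels are 1 = spam, 0 = ham
X_toy = np.array([[2, 1, 0],
                  [1, 1, 0],
                  [0, 0, 2],
                  [0, 1, 1]])
y_toy = np.array([1, 1, 0, 0])
alpha = 1.0  # smoothing strength (assumed add-one smoothing)

classes = np.unique(y_toy)
# Log prior: fraction of training messages in each class
log_prior = np.log(np.array([(y_toy == c).mean() for c in classes]))
# Smoothed per-class token counts, normalised to token probabilities
token_counts = np.array([X_toy[y_toy == c].sum(axis=0) for c in classes]) + alpha
log_likelihood = np.log(token_counts / token_counts.sum(axis=1, keepdims=True))

def predict_toy(x):
    """Return the class with the highest log posterior score for count vector x."""
    scores = log_prior + log_likelihood @ x
    return classes[np.argmax(scores)]

print(predict_toy(np.array([1, 1, 0])))  # "free money" -> 1 (spam)
print(predict_toy(np.array([0, 0, 1])))  # "pizza" -> 0 (ham)
```

Working in log space avoids multiplying many small probabilities, which would underflow for long messages.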

# Create a testing sample
sample = ['free money!!!!', 'Hi joseff, why are you so handsome?', "thanks for the pizza"]
sample

['free money!!!!',
 'Hi joseff, why are you so handsome?',
 'thanks for the pizza']

# Test the sample. The second message comes back as spam, so not handsome :(
sample_countvectorizer = vectorizer.transform(sample)
test_predict = NB_classifier.predict(sample_countvectorizer)
test_predict

array([1, 1, 0], dtype=int64)

# Define features (X) and labels (y)
X = spamham_countvectorizer
y = label

# Check dimension
X.shape, y.shape

((5728, 37303), (5728,))

# Split the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# Refit a fresh model on the training split only
NB_classifier = MultinomialNB()
NB_classifier.fit(X_train, y_train)

MultinomialNB()
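One caveat with the steps above is that the vectorizer was fitted on all messages before the split, so the vocabulary already includes words from the test set. A possible alternative, sketched here on a hypothetical toy corpus (texts and labels are made up), is to split the raw text first and wrap both steps in a scikit-learn pipeline. The stratify and random_state arguments are assumptions added for a balanced, repeatable split.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 1 = spam, 0 = ham
texts = [
    "win free cash now", "free prize claim now",
    "cheap loans free offer", "claim your free reward",
    "meeting moved to noon", "see you at lunch",
    "please review the report", "thanks for the update",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first so the vocabulary is learned from training data only
X_train_txt, X_test_txt, y_train_toy, y_test_toy = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# The pipeline fits the vectorizer and the classifier together
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
pipe.fit(X_train_txt, y_train_toy)

preds = pipe.predict(X_test_txt)
print(list(zip(X_test_txt, preds)))
```

A fitted pipeline also accepts raw strings directly, so new messages can be classified with pipe.predict([...]) without a separate transform call.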

Model Evaluation

# Import libraries
from sklearn.metrics import classification_report, confusion_matrix

# Initiate prediction for training data
y_predict_train = NB_classifier.predict(X_train)
y_predict_train

array([0, 0, 0, ..., 0, 0, 1], dtype=int64)

# Visualize confusion matrix for training data
cm = confusion_matrix(y_train, y_predict_train)
sns.heatmap(cm, annot = True, fmt = 'd')
plt.show()

# Initiate prediction for testing data
y_predict_test = NB_classifier.predict(X_test)
y_predict_test

array([0, 0, 1, ..., 1, 0, 0], dtype=int64)

# Visualize confusion matrix for testing data
cm = confusion_matrix(y_test, y_predict_test)
sns.heatmap(cm, annot = True, fmt = 'd')
plt.show()

# Classification report
print(classification_report(y_test, y_predict_test))

              precision    recall  f1-score   support

           0       1.00      0.99      0.99       882
           1       0.96      1.00      0.98       264

    accuracy                           0.99      1146
   macro avg       0.98      0.99      0.99      1146
weighted avg       0.99      0.99      0.99      1146
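The precision, recall and f1-score columns in the report follow directly from the confusion matrix. The sketch below recomputes them for the spam class on hypothetical labels (y_true_toy and y_pred_toy are made up, not the notebook's results).

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = spam, 0 = ham
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# For binary labels, ravel() yields the cells in this order
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = tp / (tp + fp)  # of messages flagged as spam, how many really are
recall = tp / (tp + fn)     # of real spam messages, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)         # 5 1 1 3
print(precision, recall, f1)  # 0.75 0.75 0.75
```

For a spam filter, false positives (ham flagged as spam) are usually the costlier error, so precision on class 1 deserves as much attention as overall accuracy.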

Conclusion

The Naive Bayes classifier achieved 99% accuracy on the held-out test set of 1,146 messages. It could be deployed to production to filter spam out from legitimate messages and increase user satisfaction.

Output of spam_df.head():
	text	spam
0	Subject: naturally irresistible your corporate...	1
1	Subject: the stock trading gunslinger fanny i...	1
2	Subject: unbelievable new homes made easy im ...	1
3	Subject: 4 color printing special request add...	1
4	Subject: do not have money , get software cds ...	1

Output of spam_df.describe():
	spam
count	5728.000000
mean	0.238827
std	0.426404
min	0.000000
25%	0.000000
50%	0.000000
75%	0.000000
max	1.000000

Output of the ham subset (spam == 0):
	text	spam
1368	Subject: hello guys , i ' m " bugging you " f...	0
1369	Subject: sacramento weather station fyi - - ...	0
1370	Subject: from the enron india newsdesk - jan 1...	0
1371	Subject: re : powerisk 2001 - your invitation ...	0
1372	Subject: re : resco database and customer capt...	0
...	...	...
5723	Subject: re : research and development charges...	0
5724	Subject: re : receipts from visit jim , than...	0
5725	Subject: re : enron case study update wow ! a...	0
5726	Subject: re : interest david , please , call...	0
5727	Subject: news : aurora 5 . 2 update aurora ve...	0

Output of the spam subset (spam == 1):
	text	spam
0	Subject: naturally irresistible your corporate...	1
1	Subject: the stock trading gunslinger fanny i...	1
2	Subject: unbelievable new homes made easy im ...	1
3	Subject: 4 color printing special request add...	1
4	Subject: do not have money , get software cds ...	1
...	...	...
1363	Subject: are you ready to get it ? hello ! v...	1
1364	Subject: would you like a $ 250 gas card ? do...	1
1365	Subject: immediate reply needed dear sir , i...	1
1366	Subject: wanna see me get fisted ? fist bang...	1
1367	Subject: hot stock info : drgv announces anoth...	1