Spam emails can be very irritating whenever you check your inbox. Luckily, there is a solution for that, and big companies such as Gmail and Outlook use it to help filter their users' inboxes: with the help of machine learning, specifically a Naive Bayes classifier, we can easily predict whether a message is spam or ham.
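Under the hood, a Naive Bayes filter simply applies Bayes' rule to word statistics: it compares how often a word shows up in spam versus ham messages. As a minimal sketch of that idea (the word "free" and every count below are made up purely for illustration):

# Illustrative only: hypothetical counts for the word "free" in a toy corpus
spam_msgs, ham_msgs = 500, 4500            # total messages per class
spam_with_free, ham_with_free = 200, 50    # messages containing "free"

p_spam = spam_msgs / (spam_msgs + ham_msgs)       # prior P(spam)
p_ham = 1 - p_spam                                # prior P(ham)
p_free_given_spam = spam_with_free / spam_msgs    # likelihood P("free" | spam)
p_free_given_ham = ham_with_free / ham_msgs       # likelihood P("free" | ham)

# Bayes' rule: P(spam | "free") = P("free" | spam) * P(spam) / P("free")
p_spam_given_free = (p_free_given_spam * p_spam) / (
    p_free_given_spam * p_spam + p_free_given_ham * p_ham
)
print(p_spam_given_free)  # ~0.8 with these made-up numbers

The MultinomialNB classifier used later does the same kind of computation for every word in the vocabulary at once.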
Problem:
Given a collection of labeled messages, build a classifier that predicts whether a new message is spam or ham (legitimate).
Dataset:
The SMS Spam Collection is a set of SMS messages that have been collected for SMS spam research. It contains 5,574 English SMS messages, each tagged as ham or spam.
The file contains one message per line. Each line is composed of two columns: v1 contains the label (ham or spam) and v2 contains the raw text.
Source: Kaggle (SMS Spam Collection Dataset)
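Note that the raw Kaggle download uses the v1/v2 column names described above, while the notebook below reads an emails.csv file with text and spam columns. If you are starting from the raw file, a conversion along these lines would produce that layout (the raw file name and the latin-1 encoding are assumptions about the downloaded copy):

import pandas as pd

# Assumed raw file name/encoding from the Kaggle download; adjust to your copy
raw_df = pd.read_csv('project_data/spam.csv', encoding='latin-1')[['v1', 'v2']]
raw_df.columns = ['label', 'text']                        # rename v1/v2
raw_df['spam'] = (raw_df['label'] == 'spam').astype(int)  # ham -> 0, spam -> 1
raw_df[['text', 'spam']].to_csv('project_data/emails.csv', index=False)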
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
# Import data
spam_df = pd.read_csv('project_data/emails.csv')
# show dataframe
spam_df.head()
# Check text
spam_df['text'][0]
# Describe dataset
spam_df.describe()
# More info
spam_df.info()
# Check ham data
ham = spam_df[spam_df['spam'] == 0]
ham
# Check spam data
spam = spam_df[spam_df['spam'] == 1]
spam
# Check spam percentage
print('spam percentage:', (len(spam) / len(spam_df))*100, '%')
# Check ham percentage
print('ham percentage:', (len(ham) / len(spam_df))*100, '%')
# Create countplot for spam and ham dataset
sns.countplot(x='spam', data=spam_df)
plt.title('Spam vs Ham')
plt.show()
# Import library
from sklearn.feature_extraction.text import CountVectorizer
# Transform words into numbers by CountVectorizer
vectorizer = CountVectorizer()
spamham_countvectorizer = vectorizer.fit_transform(spam_df['text'])
# Get feature names (get_feature_names_out in scikit-learn >= 1.0)
vectorizer.get_feature_names_out();
# Check in array format
print(spamham_countvectorizer.toarray())
# Check dimension
spamham_countvectorizer.shape
# Import library
from sklearn.naive_bayes import MultinomialNB
# Create labels
label = spam_df['spam'].values
label
# Fit the model
NB_classifier = MultinomialNB()
NB_classifier.fit(spamham_countvectorizer, label)
# Create a testing sample
sample = ['free money!!!!', 'Hi joseff, why are you so handsome?', "thanks for the pizza"]
sample
# Vectorize the sample and predict (1 = spam, 0 = ham)
sample_countvectorizer = vectorizer.transform(sample)
test_predict = NB_classifier.predict(sample_countvectorizer)
test_predict
# Define features and labels
X = spamham_countvectorizer
y = label
# Check dimension
X.shape, y.shape
# Split the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
# Fit the model
from sklearn.naive_bayes import MultinomialNB
NB_classifier = MultinomialNB()
NB_classifier.fit(X_train, y_train)
# Import libraries
from sklearn.metrics import classification_report, confusion_matrix
# Initiate prediction for training data
y_predict_train = NB_classifier.predict(X_train)
y_predict_train
# Visualize confusion matrix for training data
cm = confusion_matrix(y_train, y_predict_train)
sns.heatmap(cm, annot = True)
plt.show()
# Initiate prediction for testing data
y_predict_test = NB_classifier.predict(X_test)
y_predict_test
# Visualize confusion matrix for testing data
cm = confusion_matrix(y_test, y_predict_test)
sns.heatmap(cm, annot = True)
plt.show()
# Classification report
print(classification_report(y_test, y_predict_test))
The Naive Bayes classifier was able to achieve 99% accuracy on the test set. A model like this can be deployed to production to filter spam messages out from legitimate ones and increase user satisfaction.
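One way this could be carried toward production (a sketch, assuming joblib is installed and that incoming messages arrive as plain strings) is to persist the fitted vectorizer together with the classifier and reload them when scoring new messages:

import joblib

# Persist the fitted vectorizer and model together
joblib.dump({'vectorizer': vectorizer, 'model': NB_classifier}, 'spam_filter.joblib')

# Later, in the serving code
artifacts = joblib.load('spam_filter.joblib')
new_messages = ['Congratulations, you won a free prize! Click now']  # example input
counts = artifacts['vectorizer'].transform(new_messages)
print(artifacts['model'].predict(counts))  # 1 = spam, 0 = ham

Saving both objects matters: the classifier is only meaningful together with the exact vocabulary the CountVectorizer learned during training.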