Build Email Spam Classification Model (Naive Bayes Classifier)
Building an Email Spam Classification Model using the theory and concepts that I learned during my academics.
Have you ever encountered a mail that states- “Congratulations, you’ve won the $5000 “Bla Bla” Gift Card Grand Prize in our ‘$5000 Summer Bla Bla’ contest! To claim your prize, please click on this link”.
Yes, you guessed it right! It’s spam.
While learning classification algorithms, I always wondered how Gmail, Outlook, and other email services filter spam emails. So, I decided to build my own Email Spam Classification Model, as part of my academics, using the Multinomial Naive Bayes Classifier algorithm. In this article, we will discuss the types of spam emails and then create a simple Naive Bayes Classification Model to predict whether a particular message is spam or not.
Types of Spam
Spam is any unsolicited message sent in bulk through digital messaging services. According to Cisco, there are five basic types of spam messages.
1- Marketing and Advertisement
When you subscribe to a newsletter (sometimes by default) or share your email address with a business while registering on a new website, you start receiving advertisement emails. These emails consist of paid advertisements or offers to buy products. For example, ‘Chance to win a Free Macbook on every purchase. Click here to buy!’.
2- Antivirus Warning
These types of emails warn you about a computer virus infection and offer a solution, for example, an antivirus scan. But taking the bait and clicking the link can grant the spammer access to your system or may download a malicious file. If you suspect that your computer is infected, do not click a random email link.
3- Email Spoofing or Phishing Email
In these types of emails, hackers or spammers try to mimic legitimate corporate messages to get you to act and click on external links. So, before you reply or click anything, check the From line to make sure that the sender’s email address (not just the alias) is legitimate. When in doubt, contact the company to verify whether the email is real or spam email.
4- Sweepstakes Winners
Spammers often send emails claiming that you have won a sweepstake or a prize. For example, ‘Congratulations! You’ve won $3000 in a daily lucky draw. Click this link to claim’. They ask you to respond quickly to claim your prize by clicking on an external link or submitting some personal information. If you don’t recognize the email, don’t click any links or reply with any personal details.
5- Money Scams
In these types of emails, spammers fabricate a story about needing funds for a family emergency or a tragic life event.
What Is Email Spam Classification?
Classification is a method, or a set of operations, by which we assign the items of a given dataset to classes. For example, an email service classifies each email as spam or not spam. Email Spam Classification is an example of text data classification using Natural Language Processing (NLP).
Before jumping into the model training, let’s discuss the Multinomial Naive Bayes Classifier algorithm.
Naive Bayes Classifier: Multinomial Naive Bayes Classification Model
The Naïve Bayes classifier works on the concept of probability and has a wide range of applications, such as spam filtering, sentiment analysis, and document classification. Its principle is based on the work of Thomas Bayes (1702-1761) on Bayes’ theorem for conditional probability.
The Naïve Bayes classifier applies Bayes’ theorem with the naïve assumption that the presence of a particular feature in a class is completely unrelated to the presence of any other feature (the input features are independent of each other given the class). Bayes’ theorem states:
P(Ck|x) = P(x|Ck) * P(Ck) / P(x)
where:
P(Ck|x) is the posterior probability of class Ck for the given predictor x
P(Ck) is the prior probability of class Ck
P(x) is the prior probability of the predictor x
P(x|Ck) is the likelihood of the predictor x for a given class Ck
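To make the terms above concrete, here is a minimal sketch that plugs numbers into Bayes' theorem for a single word. All probabilities are hypothetical values chosen only for illustration, and the helper name `posterior` is not part of the model built later in this article.

```python
def posterior(prior_c, likelihood_x_given_c, evidence_x):
    """Bayes' theorem: P(C|x) = P(x|C) * P(C) / P(x)."""
    return likelihood_x_given_c * prior_c / evidence_x

# Hypothetical numbers, chosen only for illustration
p_spam = 0.2              # P(spam): prior share of spam messages
p_free_given_spam = 0.5   # P("free" | spam): likelihood of the word in spam
p_free_given_ham = 0.05   # P("free" | ham): likelihood of the word in ham

# P("free"), obtained with the law of total probability over both classes
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

p_spam_given_free = posterior(p_spam, p_free_given_spam, p_free)
print(round(p_spam_given_free, 4))
```

With these assumed values, seeing the word "free" raises the probability of spam from the prior of 0.2 to roughly 0.71.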
Model Building: Email Spam Classification Using Multinomial Naïve Bayes Classifier
1- Problem Statement:
The SMS Spam Collection is a set of tagged SMS messages that have been collected for SMS spam research. It contains 5,572 SMS messages in English, each tagged as either ham (legitimate) or spam. Using the given dataset, build a classification model that takes input text from the user and classifies it as spam or ham. You can download the dataset from here.
2- Import Necessary Libraries
# for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# for importing csv and working with dataframes
import pandas as pd
# for label encoding
from sklearn.preprocessing import LabelEncoder
# for working with text data
import string
import re
# natural language toolkit
import nltk
nltk.download('stopwords',quiet=True)
# for removing the stopwords from text data
from nltk.corpus import stopwords
# for creating word cloud using the list of words
from wordcloud import WordCloud
# for feature extraction
from sklearn.feature_extraction.text import CountVectorizer
# for splitting training and testing dataset
from sklearn.model_selection import train_test_split
# Multinomial Naïve Bayes classifier
from sklearn.naive_bayes import MultinomialNB
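# for checking the model accuracy, classification report, and confusion matrix
from sklearn.metrics import classification_report, accuracy_score, ConfusionMatrixDisplay
# plot_confusion_matrix was removed in scikit-learn 1.2; this alias keeps the
# plot_confusion_matrix(...) calls used later in this article working
plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator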
# for ignoring any warnings
import warnings
warnings.filterwarnings("ignore")
3- Importing Dataset and Renaming Columns
df = pd.read_csv('spamsms-1.csv',encoding='latin-1')
df.rename(columns = {'type':'label', 'text':'message'}, inplace=True)
df.head(5)
4- Remove Duplicates (if any) and Do Label Encoding
print("Shape of dataset is {}".format(df.shape))
print("Duplicates present {}".format(df.duplicated().sum()))
print('-'*50)
#remove duplicates
df.drop_duplicates(inplace=True)
print("Shape of dataset after removing duplicates is {}".format(df.shape))
print('-'*50)
#label encoding
Le = LabelEncoder()
df['label']=Le.fit_transform(df['label'])
print('0 means.... {}'.format(Le.inverse_transform([0])))
print('1 means.... {}'.format(Le.inverse_transform([1])))
5- Countplot of Target Labels
label_count = df.label.value_counts()
plt.figure(figsize=(4,4))
# pass the column name as a keyword argument; recent seaborn versions require x=
ax = sns.countplot(x='label', data=df)
plt.xticks(size = 12)
plt.xlabel('Labels')
plt.yticks(size = 12)
plt.ylabel('Count Of Labels')
plt.show()
print('-'*50)
print('Count of 0 and 1 is {0} and {1} respectively.'.format(label_count[0],label_count[1]))
print('-'*50)
From the above plot, we can observe that the target labels are not balanced in our dataset.
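The size of the imbalance can be quantified with a few lines of arithmetic. The sketch below uses the counts reported in this article (5169 messages after removing duplicates, 653 of them labeled 1) and also computes the accuracy of a trivial model that always predicts label 0; that baseline is an added point of comparison, not something the original notebook calculates.

```python
# Counts reported in this article after removing duplicates
total = 5169
spam = 653            # label 1
ham = total - spam    # label 0

spam_share = spam / total
# Accuracy of a trivial model that always predicts ham (label 0)
baseline_accuracy = ham / total

print(f"spam share: {spam_share:.2%}")
print(f"always-ham baseline accuracy: {baseline_accuracy:.2%}")
```

Because an always-ham model would already be right about 87% of the time, accuracy alone says little about how well spam is detected, so precision and recall for label 1 are worth checking as well.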
6- Create a function to pre-process the text data that performs the following tasks:
- Remove all punctuation
- Remove all stopwords
- Returns a list of the cleaned text
# build the stopword set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))

def preprocess_text(message):
    """
    Takes in a string of text, then performs the following:
    1. Remove all punctuation
    2. Remove all stopwords
    3. Returns a list of the cleaned text
    """
    # Keep only the characters that are not in string.punctuation
    without_punc = [char for char in message if char not in string.punctuation]
    # Join the characters again to form the string.
    without_punc = ''.join(without_punc)
    # Now just remove any stopwords and return the list of the cleaned text
    return [word for word in without_punc.split() if word.lower() not in stop_words]

# apply the function to message column
df['message'].apply(preprocess_text).head(2)
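To see what this kind of pre-processing returns for one message, here is a standard-library-only sketch. It assumes a tiny hand-picked stopword set (`STOPWORDS`) so that it runs without downloading the NLTK corpus, and the name `preprocess_text_sketch` is hypothetical; the function above uses the full NLTK English stopword list instead.

```python
import string

# Tiny hand-picked stopword set (an assumption for this sketch only);
# the article's function uses stopwords.words('english') from NLTK
STOPWORDS = {"a", "the", "to", "you", "your", "is", "on", "have"}

# Translation table that deletes every character in string.punctuation
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess_text_sketch(message):
    """Strip punctuation, then drop stopwords (case-insensitive match)."""
    without_punc = message.translate(PUNCT_TABLE)
    return [w for w in without_punc.split() if w.lower() not in STOPWORDS]

tokens = preprocess_text_sketch(
    "Congratulations! You have won a prize. Click on the link to claim."
)
print(tokens)
# ['Congratulations', 'won', 'prize', 'Click', 'link', 'claim']
```

Note that the original casing of the kept words is preserved; only the stopword comparison is lowercased.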
7- Visualize the Text Data Using Word Cloud (for both spam and ham messages)
spam_words = ' '.join(list(df[df['label'] == 1]['message']))
spam_wc = WordCloud(width = 512,height = 512).generate(spam_words)
plt.figure(figsize = (10, 8), facecolor = 'k')
plt.imshow(spam_wc)
plt.show()
ham_words = ' '.join(list(df[df['label'] == 0]['message']))
ham_wc = WordCloud(width = 512,height = 512).generate(ham_words)
plt.figure(figsize = (10, 8), facecolor = 'k')
plt.imshow(ham_wc)
plt.show()
8- Prepare the Train and Test Data
X = df['message'] #text data
y = df['label'] #target label
# CountVectorizer is used to transform a given text into a vector on the
# basis of the frequency (count) of each word that occurs in the entire text
cv = CountVectorizer()
X = cv.fit_transform(X)
# split the data into train and test sample.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
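As a rough illustration of what the count vectorization step produces, the sketch below builds a small bag-of-words matrix in plain Python. It mimics CountVectorizer's default behavior of lowercasing the text and keeping tokens of two or more word characters; the helper names (`tokenize`, `fit_vocabulary`, `transform`) and the three sample messages are made up for this example.

```python
import re
from collections import Counter

# Same idea as CountVectorizer's default token pattern: runs of 2+ word characters
TOKEN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    return TOKEN.findall(text.lower())

def fit_vocabulary(docs):
    """Map each distinct token to a column index, in sorted order."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(vocab)}

def transform(docs, vocab):
    """One row of counts per document; tokens outside the vocabulary are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(tok, 0) for tok in vocab])
    return rows

# Made-up messages, for illustration only
docs = ["Win a free prize now", "Are you free now", "Free free prize"]
vocab = fit_vocabulary(docs)
matrix = transform(docs, vocab)

print(list(vocab))   # ['are', 'free', 'now', 'prize', 'win', 'you']
for row in matrix:
    print(row)
```

Each row is one message and each column is one vocabulary word, which is the shape of input that the Multinomial Naive Bayes classifier expects. A common variation is to fit the vocabulary on the training split only and then transform the test split with it, so that no information from the test messages reaches the model.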
9- Model Building and Classification Report
# fit the train sample in MultinomialNB
classifier = MultinomialNB().fit(X_train, y_train)
# Check the Classification Report, Accuracy, and Confusion Matrix for training data
pred_train = classifier.predict(X_train)
print(classification_report(y_train, pred_train))
print('-'*50)
print('Accuracy : ',accuracy_score(y_train, pred_train))
print('-'*50)
print('Confusion Matrix:\n')
plot_confusion_matrix(classifier, X_train, y_train,cmap=plt.cm.Blues);
Here, we can observe that our model is giving an accuracy of 99.39%. Also, from the confusion matrix, we can state that our model is performing well.
Let’s check for the test data.
# Check the Classification Report, Accuracy, and Confusion Matrix for test data
pred_test = classifier.predict(X_test)
print(classification_report(y_test, pred_test))
print('-'*50)
print('Accuracy : ',accuracy_score(y_test, pred_test))
print('-'*50)
print('Confusion Matrix:\n')
plot_confusion_matrix(classifier, X_test, y_test,cmap=plt.cm.Blues);
Here, we can observe that our model gives an accuracy of 97.77% on the test sample. The confusion matrix also shows that the model classifies label 0 better than label 1. This is due to the imbalanced target class: out of the 5169 observations, only 653 correspond to label 1.
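The link between a confusion matrix and the numbers in a classification report can be sketched as follows. The four counts below are hypothetical values for a 1000-message test set, not the results of the model in this article, and `binary_metrics` is a helper written only for this example.

```python
def binary_metrics(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)   # of the messages flagged as spam, how many were spam
    recall = tp / (tp + fn)      # of the actual spam messages, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts, NOT the article's actual confusion matrix
acc, prec, rec, f1 = binary_metrics(tn=860, fp=10, fn=20, tp=110)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```

In this made-up case the accuracy is 0.97 while the recall for spam is only about 0.85, which shows how a high overall accuracy can hide weaker performance on the minority class.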
10- Predict the Label For the First 10 Samples of Test Data
# print the actual values
print('Actual Labels: {}'.format(y_test.iloc[0:10].values))
# print the predictions
print('Predicted Labels: {}'.format(pred_test[0:10]))
For the first 10 samples, the model predicted all the labels correctly. Cool! We have achieved 100% accuracy for the first 10 samples.
11- Function To Take Input From User And Predict The Label
def sms(text):
    # creating a list of labels
    lab = ['not a spam','a spam']
    # transform() expects an iterable of documents, so wrap a single string in a list
    docs = [text] if isinstance(text, str) else text
    # vectorize the message with the fitted CountVectorizer
    X = cv.transform(docs).toarray()
    # predict the label (0 or 1) of the message
    p = classifier.predict(X)
    a = int(p[0])
    # show out the final result
    res = "This message is " + lab[a]
    return res
Take the data from the user and predict its class.
From the above output, we can observe that the model is able to predict correct labels for unseen data.
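To show what happens inside a prediction on an unseen message, here is a from-scratch sketch of multinomial Naive Bayes with Laplace (add-one) smoothing. It is a simplified stand-in for illustration, not scikit-learn's implementation: the function names `train_mnb` and `predict_mnb`, the whitespace tokenization, and the four training messages are all assumptions made for this example.

```python
import math
from collections import Counter

def train_mnb(docs, labels, alpha=1.0):
    """Fit multinomial Naive Bayes with Laplace (add-alpha) smoothing."""
    classes = sorted(set(labels))
    vocab = sorted({w for d in docs for w in d.lower().split()})
    log_prior, log_likelihood = {}, {}
    for c in classes:
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d.lower().split())
        total = sum(counts.values()) + alpha * len(vocab)
        # smoothed log P(word | class) for every vocabulary word
        log_likelihood[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return classes, log_prior, log_likelihood

def predict_mnb(model, text):
    """Return the class with the highest log posterior score."""
    classes, log_prior, log_likelihood = model
    scores = {}
    for c in classes:
        score = log_prior[c]
        for w in text.lower().split():
            if w in log_likelihood[c]:   # words never seen in training are skipped
                score += log_likelihood[c][w]
        scores[c] = score
    return max(scores, key=scores.get)

# Toy training data, invented for illustration (0 = ham, 1 = spam)
train_docs = ["win free prize now", "free cash prize",
              "see you at lunch", "are you free for lunch"]
train_labels = [1, 1, 0, 0]
model = train_mnb(train_docs, train_labels)

print(predict_mnb(model, "free prize waiting"))   # 1 (spam)
print(predict_mnb(model, "see you at lunch"))     # 0 (ham)
```

Working with sums of log probabilities instead of products avoids numerical underflow on long messages, and the smoothing term keeps a word that never appeared in one class from forcing that class's probability to zero.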
12- Build The GUI Using Python’s Tkinter Library
Tkinter is Python’s standard library for building GUI-based applications.
How to create a Tkinter app is beyond the scope of this article, but you can refer to the official documentation for more information.
In the code below, I have used Tkinter to create a GUI. Please note that Tkinter will not work on Google Colab; you have to run this code on your local system/PC.
from tkinter import *
import tkinter as tk
gui = Tk()
gui.configure(background= 'light yellow')
gui.title('Spam Detection')
gui.geometry('450x150')
head = Label(gui,text = 'Type Your Message', font=('times',14,'bold'),bg='light yellow')
head.pack()
message = Entry(gui,width=400,borderwidth = 2)
message.pack()
result = Label(gui)
def sms():
    global result
    result.destroy()
    global message
    text = message.get()
    # creating a list of labels
    lab = ['not a spam','a spam']
    # perform tokenization
    X = cv.transform([text]).toarray()
    # predict the text
    p = classifier.predict(X)
    # convert the words in string with the help of list
    s = [str(i) for i in p]
    a = int("".join(s))
    # show out the final result
    res = str("This message is "+ lab[a])
    #print(text,res)
    result = Label(gui,text=res,font=('times',18,'bold'),fg = 'blue',bg='light yellow')
    result.pack()
b = Button(gui,text='Click To Check',font=('times',12,'bold'), fg = 'white',bg ='green',command = sms)
b.pack()
gui.mainloop()
Congratulations! We have successfully built an Email Spam Classification Model that can classify a text message into spam or ham.
You can find the complete Notebook file here.
If you want to explore more with other related projects, you can read my article on How To Build a Chatbot In Python Using NLTK.
Building an application is one thing, but deploying it on the web and making it scalable requires extra effort. I have explained, step by step, how to deploy a Python Flask Application here.
Conclusion
In this article, we have discussed spam and the types of spam emails. We also discussed what Email Spam Classification is and looked at the Naive Bayes Classifier algorithm. Further, we worked through a case study in which we built an Email Spam Classification Model using a Multinomial Naive Bayes Classifier. After that, we studied the classification report, accuracy, and confusion matrix. We observed that the accuracy of our model is 99.39% in training and 97.77% in testing. We also created a function that predicts the target label from user input. Finally, we wrapped our model in an application using the Python Tkinter module.
Thank you for reading this article!