
Step-by-Step Guide to Build an Email Spam Classification Model

Created by Vishal Verma in Articles 29 Feb 2024

In today's digital age, we receive thousands of spam messages in our inboxes, posing threats to privacy and productivity. However, with the power of Python and machine learning algorithms, we can develop robust spam detection systems to filter out unwanted messages. In this blog post, we'll walk through the process of building such a system step by step.


Step 1: Understanding the Problem


Spam detection involves identifying unsolicited or unwanted messages, typically sent in bulk, often for commercial purposes. These messages can include phishing attempts, advertisements, or malicious content. Our goal is to develop a system that can automatically classify incoming messages as either spam or legitimate (ham).


Step 2: Importing Necessary Libraries


Python offers a wide range of libraries for data analysis, natural language processing (NLP), and machine learning. We'll leverage libraries such as Pandas for data manipulation, NLTK for text preprocessing, and scikit-learn for building and evaluating machine learning models.


import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline  # Jupyter magic: remove this line when running as a plain script

from sklearn.preprocessing import LabelEncoder
import string
import re
import nltk
nltk.download('stopwords',quiet=True)

from nltk.corpus import stopwords
from wordcloud import WordCloud
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score

import warnings
warnings.filterwarnings("ignore")

Step 3: Data Acquisition and Exploration


The first task is to obtain a dataset containing labeled examples of spam and ham messages. We'll explore the dataset to understand its structure, features, and distribution of classes. Understanding the data is crucial for making informed decisions during the model building process.


df = pd.read_csv('spamsms-1.csv',encoding='latin-1')
df.head() # To display the top 5 rows
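
Before going further, it helps to check how balanced the two classes are, since spam datasets are usually dominated by ham. The sketch below uses only the standard library and a short, made-up list of labels (the `labels` list is hypothetical, standing in for the dataset's label column); with pandas, `df['type'].value_counts(normalize=True)` should give the same information.

```python
from collections import Counter

# Hypothetical labels standing in for the dataset's label column.
labels = ["ham", "ham", "spam", "ham", "ham", "ham", "spam", "ham"]

# Count how many messages fall into each class.
counts = Counter(labels)
total = sum(counts.values())

# Convert the raw counts into class shares.
shares = {label: count / total for label, count in counts.items()}

for label, count in counts.most_common():
    print(f"{label}: {count} messages ({shares[label]:.0%})")
```

A strong imbalance is a reason to look at precision and recall for the spam class rather than accuracy alone.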

Step 4: Data Preprocessing


Raw text data often contains noise, such as punctuation and stopwords, which can hinder model performance. We'll preprocess the text data by removing punctuation, converting text to lowercase, and eliminating stopwords. This step ensures that the model focuses on meaningful features.


df.rename(columns = {'type':'labels', 'text':'message'}, inplace=True)

df.drop_duplicates(inplace=True) # Remove any duplicates

# Build the stopword set once instead of reloading the list for every word
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(message):
    """
    Takes in a string of text, then performs the following:
    1. Remove all punctuation
    2. Convert to lowercase and remove all stopwords
    3. Return a list of the cleaned words
    """

    # Keep only the characters that are not punctuation
    without_punc = [char for char in message if char not in string.punctuation]

    # Join the characters again to form the string.
    without_punc = ''.join(without_punc)

    # Lowercase each word and drop the stopwords
    return [word.lower() for word in without_punc.split()
            if word.lower() not in STOP_WORDS]

# Preview the preprocessing on the first five messages
df['message'].head().apply(preprocess_text)

# Perform label encoding
Le = LabelEncoder()
df['labels']=Le.fit_transform(df['labels'])
df.head()
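
To make the effect of the cleaning steps concrete, here is a small worked example. It is a sketch under stated assumptions: the function name `clean` and the tiny `STOPWORDS` set are hypothetical and used only so the example runs without downloading the NLTK corpus, whose English stopword list is much longer.

```python
import string

# Tiny illustrative stopword set (an assumption for this example only).
STOPWORDS = {"you", "have", "a", "to", "the", "your", "this", "is"}

def clean(message):
    """Strip punctuation, lowercase the text, and drop stopwords."""
    # Delete every punctuation character in one pass.
    no_punct = message.translate(str.maketrans("", "", string.punctuation))
    # Lowercase, split on whitespace, and filter out the stopwords.
    return [word for word in no_punct.lower().split() if word not in STOPWORDS]

tokens = clean("Hurray! You have won a prize. Click the link to claim.")
print(tokens)  # ['hurray', 'won', 'prize', 'click', 'link', 'claim']
```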

Step 5: Feature Engineering


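Machine learning algorithms require numerical inputs, so we need to convert the text data into a numerical format. Here we use the Bag-of-Words (BoW) representation through scikit-learn's CountVectorizer, which turns each message into a vector of word counts; Term Frequency-Inverse Document Frequency (TF-IDF) is a common alternative that down-weights words appearing in many messages.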


X = df['message']
y = df['labels']

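# Pass preprocess_text as the analyzer so the Step 4 cleaning is applied to every message
cv = CountVectorizer(analyzer=preprocess_text)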
X = cv.fit_transform(X)

Step 6: Model Selection and Training


With the preprocessed data, we'll select a suitable machine learning algorithm for our task. Naive Bayes classifiers are commonly used for text classification tasks due to their simplicity and effectiveness. We'll train the model on the training data and evaluate its performance using appropriate metrics.


# Perform train test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

# Train a Multinomial Naive Bayes Classifier
classifier = MultinomialNB().fit(X_train, y_train)
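
For readers curious about what a multinomial Naive Bayes classifier computes, here is a minimal from-scratch sketch with additive (Laplace) smoothing. It is an illustration rather than scikit-learn's implementation: the function names `train_nb` and `predict_nb` and the four toy messages are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit multinomial Naive Bayes with additive smoothing alpha."""
    vocab = {token for doc in docs for token in doc.split()}
    docs_per_class = Counter(labels)

    # Count how often each token appears in each class.
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc.split())

    model = {}
    for label, n_docs in docs_per_class.items():
        total_tokens = sum(token_counts[label].values())
        denom = total_tokens + alpha * len(vocab)
        model[label] = {
            # log P(class)
            "log_prior": math.log(n_docs / len(docs)),
            # log P(token | class), smoothed so unseen tokens are not zero
            "log_lik": {
                token: math.log((token_counts[label][token] + alpha) / denom)
                for token in vocab
            },
        }
    return model

def predict_nb(model, doc):
    """Return the class with the highest log posterior score."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for token in doc.split():
            # Tokens never seen during training are skipped.
            if token in params["log_lik"]:
                score += params["log_lik"][token]
        scores[label] = score
    return max(scores, key=scores.get)

# Toy training data, made up for this example.
train_docs = ["win free prize now", "free cash prize",
              "lunch at noon", "see you at lunch"]
train_labels = ["spam", "spam", "ham", "ham"]

model = train_nb(train_docs, train_labels)
print(predict_nb(model, "free prize"))     # spam
print(predict_nb(model, "lunch at noon"))  # ham
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.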

Step 7: Model Evaluation


To assess the model's performance, we'll evaluate its accuracy, precision, recall, and F1-score on both the training and test datasets. Additionally, we'll visualize the results using confusion matrices to gain insights into the model's strengths and weaknesses.


# On training data
pred_train = classifier.predict(X_train)
print(classification_report(y_train, pred_train))
print('-'*50)
print('Accuracy : ',accuracy_score(y_train, pred_train))
print('-'*50)
print('Confusion Matrix:\n')
cm = confusion_matrix(y_train, pred_train, labels=classifier.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=classifier.classes_)
disp.plot()
plt.show()

# On testing data
pred_test = classifier.predict(X_test)
print(classification_report(y_test, pred_test))
print('-'*50)
print('Accuracy : ',accuracy_score(y_test, pred_test))
print('-'*50)
print('Confusion Matrix:\n')
cm = confusion_matrix(y_test, pred_test, labels=classifier.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=classifier.classes_)
disp.plot()
plt.show()
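
The numbers in the classification report follow directly from the four cells of the confusion matrix. The sketch below computes them by hand for a small made-up set of predictions, treating label 1 as spam (LabelEncoder assigns codes in sorted order, so ham becomes 0 and spam becomes 1). The function name `binary_metrics` and the sample labels are hypothetical.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute confusion-matrix cells and the metrics derived from them."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs)

    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

# Made-up labels: 3 spam messages and 5 ham messages.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Here one spam message is missed and one ham message is flagged, so precision and recall for spam are both 2/3 while accuracy is 0.75.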

Step 8: Building a GUI-Based Application


To make our spam detection system user-friendly, we'll create a graphical user interface (GUI) using the Tkinter library. The GUI will allow users to input text messages and receive real-time predictions on whether the messages are spam or ham.


# Create a UDF to predict the label
def sms(text):
    """Print whether each message in the list `text` is spam."""

    # Labels indexed by the encoded class: 0 = ham, 1 = spam
    lab = ['not a spam', 'a spam']

    # Convert the messages into count vectors with the fitted vectorizer
    x = cv.transform(text)

    # Predict the encoded label of each message
    p = classifier.predict(x)

    # Show the final result for every message
    for a in p:
        print("This message is " + lab[int(a)])

# Sample usage
sms(['Hurray! you have won $3000. Get your money using this link.'])

# Build GUI using Tkinter library
import tkinter as tk

gui = tk.Tk()
gui.configure(background='light yellow')
gui.title('Spam Detection - Baacumen')
gui.geometry('450x300')

head = tk.Label(gui, text='Type Your Message',
                font=('times', 14, 'bold'), bg='light yellow')
head.pack()

message = tk.Entry(gui, width=400, borderwidth=2)
message.pack()

def check_message():
    """Classify the text in the entry box and show the verdict."""
    # Labels indexed by the encoded class: 0 = ham, 1 = spam
    lab = ['not a spam', 'a spam']

    # Vectorize the typed message and predict its label
    x = cv.transform([message.get()])
    a = int(classifier.predict(x)[0])

    # Update the result label in place instead of recreating it
    result.config(text="This message is " + lab[a])

b = tk.Button(gui, text='Click To Check',
              font=('times', 12, 'bold'), fg='white', bg='green',
              command=check_message)
b.pack()

# The result label sits below the button and starts empty
result = tk.Label(gui, font=('times', 18, 'bold'),
                  fg='blue', bg='light yellow')
result.pack()

gui.mainloop()

Step 9: Deployment and Future Improvements


Once our model and GUI are ready, we can deploy the spam detection system for practical use. Continuous monitoring and feedback from users will help improve the system's accuracy and adaptability over time. Additionally, we can explore advanced techniques such as deep learning for further enhancing performance.

GitHub Repository: Click here to access the repository.

Not Spam: [screenshot]

Spam: [screenshot]

In conclusion, building a spam detection system with Python involves a systematic approach, from data preprocessing to model evaluation and deployment. By following this step-by-step guide, you can develop a reliable and efficient system for protecting against unwanted messages.
