Step-by-Step Guide to Build an Email Spam Classification Model

Created by Vishal Verma in Articles 29 Feb 2024

Spam messages flood our inboxes every day, posing threats to privacy and productivity. With Python and machine learning, we can build a spam detection system that filters out unwanted messages. In this blog post, we'll walk through the process of building such a system step by step.

Step 1: Understanding the Problem

Spam detection involves identifying unsolicited or unwanted messages, typically sent in bulk, often for commercial purposes. These messages can include phishing attempts, advertisements, or malicious content. Our goal is to develop a system that can automatically classify incoming messages as either spam or legitimate (ham).

Step 2: Importing Necessary Libraries

Python offers a wide range of libraries for data analysis, natural language processing (NLP), and machine learning. We'll leverage libraries such as Pandas for data manipulation, NLTK for text preprocessing, and scikit-learn for building and evaluating machine learning models.

import matplotlib.pyplot as plt
import pandas as pd
# Jupyter/IPython magic; remove the next line when running as a plain .py script
%matplotlib inline

from sklearn.preprocessing import LabelEncoder
import string
import re
import nltk
nltk.download('stopwords', quiet=True)

from nltk.corpus import stopwords
from wordcloud import WordCloud
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score

import warnings
warnings.filterwarnings("ignore")

Step 3: Data Acquisition and Exploration

The first task is to obtain a dataset containing labeled examples of spam and ham messages. We'll explore the dataset to understand its structure, features, and distribution of classes. Understanding the data is crucial for making informed decisions during the model building process.

df = pd.read_csv('spamsms-1.csv',encoding='latin-1')

df.head() # To display the top 5 rows
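A quick way to look at the distribution of classes is to count the labels. The sketch below uses a tiny hand-made DataFrame so it runs without the CSV file; the rows are illustrative only, and the column names `type` and `text` are assumed from the rename in Step 4.

```python
import pandas as pd

# Illustrative rows only; the real file has its own messages.
sample = pd.DataFrame({
    "type": ["ham", "ham", "ham", "spam"],
    "text": ["See you at 5", "Ok thanks", "Call me later", "WIN a free prize now"],
})

# Absolute and relative class frequencies
counts = sample["type"].value_counts()
shares = sample["type"].value_counts(normalize=True)

print(counts.to_dict())
print(shares.round(2).to_dict())
```

If one class dominates, accuracy alone can look good even when much of the spam is missed, which is why Step 7 also reports precision and recall.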

Step 4: Data Preprocessing

Raw text data often contains noise, such as punctuation and stopwords, which can hinder model performance. We'll preprocess the text data by removing punctuation, converting text to lowercase, and eliminating stopwords. This step ensures that the model focuses on meaningful features.

df.rename(columns={'type': 'labels', 'text': 'message'}, inplace=True)

df.drop_duplicates(inplace=True)  # Remove any duplicates

# Build the stopword set once; calling stopwords.words() for every word is slow
STOPWORDS = set(stopwords.words('english'))

def preprocess_text(message):
    """
    Takes in a string of text, then performs the following:
    1. Remove all punctuation
    2. Remove all stopwords
    3. Returns a list of the cleaned text
    """
    # Check characters to see if they are in punctuation
    without_punc = [char for char in message if char not in string.punctuation]

    # Join the characters again to form the string.
    without_punc = ''.join(without_punc)

    # Now just remove any stopwords
    return [word for word in without_punc.split()
            if word.lower() not in STOPWORDS]

# Preview the cleaned tokens for the first five messages.
# The result is not assigned, so df['message'] itself stays unchanged.
df['message'].head().apply(preprocess_text)

# Perform label encoding. LabelEncoder sorts the classes alphabetically,
# so assuming the raw labels are 'ham' and 'spam', ham becomes 0 and spam 1.
Le = LabelEncoder()
df['labels'] = Le.fit_transform(df['labels'])
df.head()
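To see the three cleaning operations on a single message, here is a minimal standalone sketch. It is not the function above: it lowercases the text explicitly, and it uses a small hand-picked stopword set (an assumption made so the example runs without downloading NLTK data) in place of NLTK's English list.

```python
import string

# Assumption: a tiny stand-in stopword list so the example runs without NLTK data.
STOPWORDS = {"a", "an", "the", "is", "to", "you", "have", "your", "this"}

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean(message):
    """Lowercase, strip punctuation, and drop stopwords; return a token list."""
    no_punct = message.lower().translate(PUNCT_TABLE)
    return [word for word in no_punct.split() if word not in STOPWORDS]

tokens = clean("Hurray! You have won $3000. Get your money using this link.")
print(tokens)
# ['hurray', 'won', '3000', 'get', 'money', 'using', 'link']
```

Using `str.translate` removes punctuation in one pass, and keeping the stopwords in a set makes each membership check cheap.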

Step 5: Feature Engineering

Machine learning algorithms require numerical inputs, so we need to convert the text data into a numerical format. Common choices are Bag-of-Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF). Here we use BoW through scikit-learn's CountVectorizer, which lowercases and tokenizes each raw message by default and turns it into a vector of word counts.

X = df['message']
y = df['labels']

cv = CountVectorizer()
X = cv.fit_transform(X)
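To make the Bag-of-Words idea concrete, the following sketch builds the count vectors by hand with the standard library. It is a simplification, not a reimplementation of CountVectorizer: it splits on whitespace only, returns plain lists instead of a sparse matrix, and the three documents are made up for illustration.

```python
from collections import Counter

docs = ["free prize free entry", "see you at lunch", "claim your free prize"]

# Vocabulary: every distinct token, sorted so the column order is stable
vocab = sorted({token for doc in docs for token in doc.split()})

def to_vector(doc):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

matrix = [to_vector(doc) for doc in docs]
print(vocab)
for row in matrix:
    print(row)
```

Each row has one entry per vocabulary word, so the first document gets a 2 in the `free` column. Word order is discarded, which is the defining trade-off of the Bag-of-Words representation.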

Step 6: Model Selection and Training

With the preprocessed data, we'll select a suitable machine learning algorithm for our task. Naive Bayes classifiers are commonly used for text classification tasks due to their simplicity and effectiveness. We'll train the model on the training data and evaluate its performance using appropriate metrics.

# Perform train test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

# Train a Multinomial Naive Bayes Classifier
classifier = MultinomialNB().fit(X_train, y_train)
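For intuition about what a multinomial Naive Bayes classifier computes, here is a small hand-rolled sketch using only the standard library. It scores a message as the log prior of a class plus the sum of Laplace-smoothed log word likelihoods (smoothing parameter alpha = 1). The four training messages and the helper names `log_score` and `predict` are made up for illustration; this is not scikit-learn's implementation.

```python
import math
from collections import Counter

train = [
    ("win free prize now", "spam"),
    ("free entry win cash", "spam"),
    ("see you at lunch", "ham"),
    ("are you free at noon", "ham"),
]

vocab = {w for text, _ in train for w in text.split()}
word_counts = {"spam": Counter(), "ham": Counter()}
doc_counts = Counter()
for text, label in train:
    word_counts[label].update(text.split())
    doc_counts[label] += 1

def log_score(text, label, alpha=1.0):
    """log P(label) + sum of log P(word | label), with Laplace smoothing."""
    total = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / len(train))
    for w in text.split():
        if w not in vocab:
            continue  # words never seen in training carry no information
        score += math.log((word_counts[label][w] + alpha)
                          / (total + alpha * len(vocab)))
    return score

def predict(text):
    """Return the class with the higher log score."""
    return max(("ham", "spam"), key=lambda label: log_score(text, label))

print(predict("win a free prize"))  # spam
print(predict("see you at noon"))   # ham
```

Working in log space avoids multiplying many small probabilities, and the smoothing term keeps a word that never appeared in one class from forcing that class's probability to zero.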

Step 7: Model Evaluation

To assess the model's performance, we'll evaluate its accuracy, precision, recall, and F1-score on both the training and test datasets. Additionally, we'll visualize the results using confusion matrices to gain insights into the model's strengths and weaknesses.

# On training data
pred_train = classifier.predict(X_train)
print(classification_report(y_train, pred_train))
print('-'*50)
print('Accuracy : ', accuracy_score(y_train, pred_train))
print('-'*50)
print('Confusion Matrix:\n')
cm = confusion_matrix(y_train, pred_train, labels=classifier.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=classifier.classes_)
disp.plot()
plt.show()

# On testing data
pred_test = classifier.predict(X_test)
print(classification_report(y_test, pred_test))
print('-'*50)
print('Accuracy : ', accuracy_score(y_test, pred_test))
print('-'*50)
print('Confusion Matrix:\n')
cm = confusion_matrix(y_test, pred_test, labels=classifier.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=classifier.classes_)
disp.plot()
plt.show()
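The numbers in the classification report follow directly from the four cells of the confusion matrix. The sketch below derives them by hand for the spam class; the counts are made up for illustration and are not results from the model above, and `binary_metrics` is a hypothetical helper name.

```python
def binary_metrics(tp, fp, fn, tn):
    """Derive accuracy, precision, recall and F1 for the positive (spam) class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up counts for illustration, not results from the model above
m = binary_metrics(tp=90, fp=10, fn=30, tn=870)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With these counts, accuracy is 0.96 even though recall is only 0.75, which shows why accuracy alone can hide missed spam. In scikit-learn's layout, rows of the confusion matrix are true labels and columns are predictions, so for a binary problem `tn, fp, fn, tp = cm.ravel()` yields the four inputs.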

Step 8: Building a GUI-Based Application

To make our spam detection system user-friendly, we'll create a graphical user interface (GUI) using the Tkinter library. The GUI will allow users to input text messages and receive real-time predictions on whether the messages are spam or ham.

# Create a user-defined function (UDF) to predict the label
def sms(text):
    # creating a list of labels (0 = ham, 1 = spam after label encoding)
    lab = ['not spam', 'spam']

    # vectorize the text with the fitted CountVectorizer
    x = cv.transform(text)

    # predict the label; text holds one message, so take the first prediction
    p = classifier.predict(x)
    a = int(p[0])

    # show the final result
    res = "This message is " + lab[a]
    print(res)

# Sample usage
sms(['Hurray! you have won $3000. Get your money using this link.'])



# Build GUI using Tkinter library
import tkinter as tk

gui = tk.Tk()
gui.configure(background='light yellow')
gui.title('Spam Detection - Baacumen')
gui.geometry('450x300')

head = tk.Label(gui, text='Type Your Message',
                font=('times', 14, 'bold'), bg='light yellow')
head.pack()

# Entry width is measured in characters, not pixels
message = tk.Entry(gui, width=50, borderwidth=2)
message.pack()

# The result label is created once and updated on every click
result = tk.Label(gui, font=('times', 18, 'bold'),
                  fg='blue', bg='light yellow')

def check_message():
    text = message.get()

    # creating a list of labels (0 = ham, 1 = spam after label encoding)
    lab = ['not spam', 'spam']

    # vectorize the text with the fitted CountVectorizer
    x = cv.transform([text])

    # predict the label and convert it to a list index
    a = int(classifier.predict(x)[0])

    # show the final result
    result.config(text="This message is " + lab[a])
    result.pack()

b = tk.Button(gui, text='Click To Check',
              font=('times', 12, 'bold'), fg='white', bg='green',
              command=check_message)
b.pack()

gui.mainloop()

Step 9: Deployment and Future Improvements

Once our model and GUI are ready, we can deploy the spam detection system for practical use. Continuous monitoring and feedback from users will help improve the system's accuracy and adaptability over time. Additionally, we can explore advanced techniques such as deep learning for further enhancing performance.
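One practical detail for deployment is that the fitted vectorizer and the classifier have to travel together, since the classifier only understands the vocabulary it was trained on. A minimal sketch, assuming scikit-learn's `make_pipeline` and the standard `pickle` module are acceptable for the use case, bundles both into one object and round-trips it through serialization. The toy messages are illustrative, not taken from the dataset.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Illustrative toy data; in practice this would be the training split
texts = ["win free prize now", "free entry win cash",
         "see you at lunch", "are you free at noon"]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

# The pipeline keeps the fitted vocabulary and the classifier together
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

# Serialize to bytes and restore, as a deployed app would do from a file
restored = pickle.loads(pickle.dumps(model))
prediction = restored.predict(["win a free prize"])
print(prediction)
```

Because the pipeline accepts raw strings, the application no longer needs a separate `transform` call. Fitting it on the training split only also keeps test messages out of the vocabulary. Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that created them.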

GitHub Repository: Click here to access the repository.

[Figures: application output for a "Not Spam" message and a "Spam" message]

In conclusion, building a spam detection system with Python involves a systematic approach, from data preprocessing to model evaluation and deployment. By following this step-by-step guide, you can develop a reliable and efficient system for protecting against unwanted messages.