In today's digital age, we receive thousands of spam messages in our inboxes, posing threats to privacy and productivity. However, with the power of Python and machine learning algorithms, we can develop robust spam detection systems to filter out unwanted messages. In this blog post, we'll walk through the process of building such a system step by step.
Step 1: Understanding Spam Detection
Spam detection involves identifying unsolicited or unwanted messages, typically sent in bulk, often for commercial purposes. These messages can include phishing attempts, advertisements, or malicious content. Our goal is to develop a system that can automatically classify incoming messages as either spam or legitimate (ham).
Step 2: Setting Up the Python Environment
Python offers a wide range of libraries for data analysis, natural language processing (NLP), and machine learning. We'll leverage libraries such as Pandas for data manipulation, NLTK for text preprocessing, and scikit-learn for building and evaluating machine learning models.
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
from sklearn.preprocessing import LabelEncoder
import string
import re
import nltk
nltk.download('stopwords',quiet=True)
from nltk.corpus import stopwords
from wordcloud import WordCloud
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings("ignore")
Step 3: Loading and Exploring the Dataset
The first task is to obtain a dataset containing labeled examples of spam and ham messages. We'll explore the dataset to understand its structure, features, and distribution of classes. Understanding the data is crucial for making informed decisions during the model building process.
df = pd.read_csv('spamsms-1.csv',encoding='latin-1')
df.head() # To display the top 5 rows
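Before preprocessing, it's worth checking how the classes are distributed and which words dominate the spam messages. The snippet below is a small exploratory sketch: it assumes the raw CSV columns are named 'type' and 'text' (as in the code above; we rename them in the next step) and that the labels are stored as 'ham' and 'spam', and it uses the WordCloud library we imported earlier.
# Class distribution: SMS spam datasets are usually heavily skewed toward ham
print(df['type'].value_counts())
# Word cloud of the spam messages (exploratory only)
spam_text = ' '.join(df[df['type'] == 'spam']['text'])
wc = WordCloud(width=600, height=400, background_color='white').generate(spam_text)
plt.figure(figsize=(8, 5))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()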
Step 4: Text Preprocessing
Raw text data often contains noise, such as punctuation and stopwords, which can hinder model performance. We'll preprocess the text data by removing punctuation, converting text to lowercase, and eliminating stopwords. This step ensures that the model focuses on meaningful features.
df.rename(columns = {'type':'labels', 'text':'message'}, inplace=True)
df.drop_duplicates(inplace=True) # Remove any duplicates
def preprocess_text(message):
    """
    Takes in a string of text, then performs the following:
    1. Removes all punctuation
    2. Removes all stopwords and lowercases the remaining words
    3. Returns a list of the cleaned tokens
    """
    # Keep only characters that are not punctuation
    without_punc = [char for char in message if char not in string.punctuation]
    # Join the characters again to form the string
    without_punc = ''.join(without_punc)
    # Drop stopwords and lowercase what is left
    return [word.lower() for word in without_punc.split()
            if word.lower() not in stopwords.words('english')]
# Preview the effect of preprocess_text on the first few messages
df['message'].head().apply(preprocess_text)
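As a quick sanity check, here's what the function returns for a made-up message (the example text and the expected output in the comment are ours, not from the dataset):
# Made-up example; expected output shown as a comment for illustration
preprocess_text('Free entry!! Text WIN to 80082 now')
# ['free', 'entry', 'text', 'win', '80082']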
# Perform label encoding
Le = LabelEncoder()
df['labels']=Le.fit_transform(df['labels'])
df.head()
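LabelEncoder assigns classes in alphabetical order, so assuming the raw labels are 'ham' and 'spam', ham should map to 0 and spam to 1. It's worth confirming this, because the label list we use for predictions later relies on that ordering:
# Inspect the encoding (expected: {'ham': 0, 'spam': 1})
print(dict(zip(Le.classes_, Le.transform(Le.classes_))))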
Step 5: Feature Extraction
Machine learning algorithms require numerical inputs, so we need to convert the text data into a numerical format. Two common techniques are Bag-of-Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF); here we'll build a Bag-of-Words representation with scikit-learn's CountVectorizer, plugging in the preprocessing function defined earlier.
X = df['message']
y = df['labels']
# Bag-of-Words features; use preprocess_text as the analyzer so the cleaning step above is actually applied
cv = CountVectorizer(analyzer=preprocess_text)
X = cv.fit_transform(X)
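We'll stick with the Bag-of-Words counts for the rest of this post, but if you'd rather use the TF-IDF weighting mentioned above, a drop-in alternative looks like this (a sketch; the variable names are just examples):
# Optional alternative: TF-IDF features instead of raw counts
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(analyzer=preprocess_text)
X_tfidf = tfidf.fit_transform(df['message'])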
Step 6: Model Selection and Training
With the preprocessed data, we'll select a suitable machine learning algorithm for our task. Naive Bayes classifiers are commonly used for text classification tasks due to their simplicity and effectiveness. We'll train the model on the training data and evaluate its performance using appropriate metrics.
# Perform train test split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.20, random_state=0)
# Train a Multinomial Naive Bayes Classifier
classifier = MultinomialNB().fit(X_train, y_train)
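Before looking at the held-out test set, an optional cross-validation run on the training split gives a feel for how stable the model is. This is an extra sanity check on top of the original pipeline:
# Optional: 5-fold cross-validation accuracy on the training data
from sklearn.model_selection import cross_val_score
scores = cross_val_score(MultinomialNB(), X_train, y_train, cv=5)
print('CV accuracy: %.3f (+/- %.3f)' % (scores.mean(), scores.std()))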
Step 7: Model Evaluation
To assess the model's performance, we'll evaluate its accuracy, precision, recall, and F1-score on both the training and test datasets. Additionally, we'll visualize the results using confusion matrices to gain insights into the model's strengths and weaknesses.
# On training data
pred_train = classifier.predict(X_train)
print(classification_report(y_train, pred_train))
print('-'*50)
print('Accuracy : ',accuracy_score(y_train, pred_train))
print('-'*50)
print('Confusion Matrix:\n')
cm = confusion_matrix(y_train, pred_train, labels=classifier.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
display_labels=classifier.classes_)
disp.plot()
plt.show()
# On testing data
pred_test = classifier.predict(X_test)
print(classification_report(y_test, pred_test))
print('-'*50)
print('Accuracy : ',accuracy_score(y_test, pred_test))
print('-'*50)
print('Confusion Matrix:\n')
cm = confusion_matrix(y_test, pred_test, labels=classifier.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
display_labels=classifier.classes_)
disp.plot()
plt.show()
Step 8: Building a GUI with Tkinter
To make our spam detection system user-friendly, we'll create a graphical user interface (GUI) using the Tkinter library. The GUI will allow users to input text messages and receive real-time predictions on whether the messages are spam or ham.
# Create a UDF to predict the label of a message
def sms(text):
    # Class names in the same order as the encoded labels (0 = ham, 1 = spam)
    lab = ['not a spam', 'a spam']
    # Vectorize the input message(s) with the fitted CountVectorizer
    x = cv.transform(text).toarray()
    # Predict the label
    p = classifier.predict(x)
    # Take the predicted class of the (single) message as an integer index
    a = int(p[0])
    # Show the final result
    res = "This message is " + lab[a]
    print(res)
# Sample usage
sms(['Hurray! you have won $3000. Get your money using this link.'])
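And a made-up legitimate message for comparison:
sms(['Are we still meeting for lunch at 1 pm today?'])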
# Build GUI using Tkinter library
from tkinter import *
gui = Tk()
gui.configure(background= 'light yellow')
gui.title('Spam Detection - Baacumen')
gui.geometry('450x300')
head = Label(gui,text = 'Type Your Message',
font=('times',14,'bold'),bg='light yellow')
head.pack()
message = Entry(gui, width=50, borderwidth=2)  # Entry width is measured in characters
message.pack()
result = Label(gui)
# GUI callback: reads the Entry widget, classifies the text, and shows the result
def sms():
    global result
    result.destroy()
    text = message.get()
    # Class names in the same order as the encoded labels (0 = ham, 1 = spam)
    lab = ['not a spam', 'a spam']
    # Vectorize the typed message with the fitted CountVectorizer
    x = cv.transform([text]).toarray()
    # Predict the label
    p = classifier.predict(x)
    a = int(p[0])
    # Display the final result in the window
    res = "This message is " + lab[a]
    result = Label(gui, text=res, font=('times', 18, 'bold'),
                   fg='blue', bg='light yellow')
    result.pack()
b = Button(gui, text='Click To Check',
           font=('times', 12, 'bold'), fg='white', bg='green', command=sms)
b.pack()
gui.mainloop()
Step 9: Deployment and Future Improvements
Once our model and GUI are ready, we can deploy the spam detection system for practical use. Continuous monitoring and feedback from users will help improve the system's accuracy and adaptability over time. Additionally, we can explore advanced techniques such as deep learning for further enhancing performance.
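One simple deployment option (a sketch, assuming pickle-based persistence via joblib is acceptable; the file names are just examples) is to save the fitted vectorizer and classifier so a script or web app can load them without retraining:
# Persist the fitted vectorizer and model for later use
import joblib
joblib.dump(cv, 'vectorizer.joblib')
joblib.dump(classifier, 'spam_classifier.joblib')
# Note: because the vectorizer uses the custom preprocess_text analyzer,
# that function must be importable wherever these files are loaded.
# Later: cv = joblib.load('vectorizer.joblib'); classifier = joblib.load('spam_classifier.joblib')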
[Screenshots: the GUI classifying a legitimate message (Not Spam) and a spam message (Spam)]
In conclusion, building a spam detection system with Python involves a systematic approach, from data preprocessing to model evaluation and deployment. By following this step-by-step guide, you can develop a reliable and efficient system for protecting against unwanted messages.