Building an Email Spam Detection Model – Supervised Learning – Session 7

Shubham Gupta, September 11, 2024

Introduction

In this guide, we’ll walk you through the process of building a supervised learning project to detect spam emails using the Naive Bayes algorithm. We’ll cover setting up the project, loading and preprocessing data, training the model, and evaluating it. By the end of this tutorial, you’ll have a fully functioning spam detection model.

Step 1: Setting Up the Environment

1.1 Create a Virtual Environment

To keep your project dependencies organized, it’s a good idea to set up a virtual environment. This ensures that your project’s libraries don’t conflict with others on your system.

Run the following commands in your terminal or command prompt:

# Create a virtual environment
python -m venv venv

# Activate the virtual environment (for Windows)
./venv/Scripts/activate

# For Linux/Mac, use:
# source ./venv/bin/activate

# Upgrade pip (works on Windows, Linux, and Mac once the environment is active)
python -m pip install --upgrade pip

1.2 Install Required Libraries

Once your virtual environment is activated, install the necessary libraries:

# Install required libraries
# (--cache-dir is optional; point it at a folder on your own machine or omit it)
pip install pandas scikit-learn numpy nltk --cache-dir "D:/internship/supervised_learning/email_spam_detection/.cache"

Step 2: Creating the Dataset from Email Files

If your dataset is made up of separate files for spam and ham emails, you’ll need to consolidate them into a CSV file.

SpamAssassin Public Corpus Dataset download

Description: A well-known dataset containing spam and ham emails, categorized into different folders.
Link: SpamAssassin Public Corpus
How to use: You can download individual archives of spam and ham emails, then extract and use them for your project.

A dataset is a collection of data that is typically organized in a structured format and used for analysis, training machine learning models, or solving specific problems. Datasets can come in various formats such as CSV (comma-separated values), Excel files, databases, text files, or even collections of images or sounds.

Key Features of a Dataset:

Data Instances (Rows):
- These are individual entries or records in the dataset. For example, in a dataset of emails, each email is a single instance.
Attributes or Features (Columns):
- These represent the characteristics of the data. For instance, in an email spam detection dataset, you might have columns like email (the content of the email) and label (whether it’s spam or not).
Labels or Target Variable:
- This is the outcome you are trying to predict. In supervised machine learning, the label is the actual result or output, such as spam or not spam in a spam detection task.
Types of Data in a Dataset:
- Numerical Data: Data that consists of numbers (e.g., age, price).
- Categorical Data: Data that falls into distinct categories (e.g., spam/not spam, color).
- Text Data: Free-form text, such as email bodies or document contents.
- Images/Sounds: Sometimes datasets consist of non-textual data like images, sounds, or videos.

Example of a Dataset (Email Spam Detection):

Email Content	Label (Spam or Not Spam)
“Congratulations! You’ve won a free iPhone.”	Spam
“Meeting at 3 PM. Please review the report.”	Not Spam
“Get cheap loans now with low interest rates!”	Spam
“Your Amazon order has been shipped.”	Not Spam

Email Content is the feature (input).
Label is the target variable or output that you want the model to predict.

Types of Datasets in Machine Learning:

Training Dataset:
- The dataset used to train a machine learning model. It contains both features (input) and labels (output).
Test Dataset:
- A separate dataset used to evaluate the performance of the trained model. It helps determine how well the model generalizes to unseen data.
Validation Dataset:
- Sometimes used to fine-tune models during training, ensuring that the model doesn’t overfit to the training data.

In the Context of Your Project:

For your email spam detection project, the dataset would typically consist of a collection of emails (the feature) along with labels (spam or not spam), which the model will use to learn patterns associated with spam emails.

In this case, a dataset might look like:

Email Content	Label
“Win $1000 now by clicking this link!”	Spam
“Reminder for the meeting tomorrow at 10 AM.”	Not Spam
“Hurry! Last chance to get 50% off on all items.”	Spam

You will use this data to train a machine learning model to classify new emails as spam or not spam based on their content.
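
As a rough illustration of the table above, the sketch below holds the same rows as (text, label) pairs and encodes the labels as 1 for spam and 0 for not spam, which is the encoding used later in this tutorial. The variable names are illustrative only.

```python
from collections import Counter

# Toy rows copied from the example table above: (email content, label)
rows = [
    ("Win $1000 now by clicking this link!", "Spam"),
    ("Reminder for the meeting tomorrow at 10 AM.", "Not Spam"),
    ("Hurry! Last chance to get 50% off on all items.", "Spam"),
]

# Encode text labels as integers: 1 = spam, 0 = not spam (ham)
label_to_int = {"Spam": 1, "Not Spam": 0}
features = [text for text, _ in rows]
labels = [label_to_int[label] for _, label in rows]

# Class balance is worth checking before training
class_counts = Counter(labels)
print(labels)
print(class_counts)
```

Checking the class counts early helps you notice when one class heavily outnumbers the other, which affects how accuracy should be read later.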

Difference between spam and ham

The difference between spam and ham lies in their classification as types of email:

Spam:

Definition: Spam refers to unwanted, unsolicited emails sent in bulk, often for advertising, phishing, or malicious purposes.
Content: Spam emails typically include promotions for products or services, deceptive offers, requests for personal information (phishing), or malicious attachments or links.
Purpose: The goal of spam is usually to persuade recipients to take an action, such as clicking a link, downloading malware, or buying a product. Many spam emails are sent out in bulk to a large number of recipients, often without their consent.
Examples:
- “Congratulations! You’ve won a prize! Click here to claim it.”
- “Get a 90% discount on all products now!”
- “Urgent: Verify your account information to avoid closure.”

Ham:

Definition: Ham refers to legitimate, wanted emails that are not spam. These are emails that you expect or have requested, and they are important for personal or professional communication.
Content: Ham emails are typically from people or organizations with whom you have a relationship, and the content is relevant to you. They can be personal messages, business emails, newsletters you’ve subscribed to, or any emails that aren’t spam.
Purpose: Ham emails serve genuine communication purposes such as business correspondence, notifications, transactional messages, or personal conversations.
Examples:
- “Reminder: Meeting at 3 PM tomorrow.”
- “Your Amazon order has been shipped.”
- “Family reunion this Saturday. Please RSVP.”

Key Differences Between Spam and Ham:

Feature	Spam	Ham
Solicitation	Unsolicited, sent without recipient’s consent	Expected or requested by the recipient
Content Type	Advertisements, phishing, malware, scams	Personal or professional communication, newsletters
Frequency	Often sent in bulk to many recipients	Typically sent to specific individuals or groups
Purpose	To promote, deceive, or spread malware	To communicate genuinely or provide relevant information
Legitimacy	Often violates anti-spam laws or service terms	Legitimate emails from known or trusted sources

In the Context of Spam Detection:

Spam: Your model will learn to identify patterns and keywords often associated with spam, like “Congratulations,” “Win,” “Click here,” etc.
Ham: The model will recognize normal, useful emails that are important and legitimate, like personal or work-related communication.

2.1 Load Email Files and Create a Dataset

A DataFrame is a two-dimensional, tabular data structure used primarily in the pandas library in Python. It is similar to a table in a database, an Excel spreadsheet, or a CSV file, with rows and columns. Each column in a DataFrame can have a different data type (e.g., integers, floats, strings, etc.), and the rows represent individual records or observations.

NLTK (Natural Language Toolkit) is a Python library used for working with human language, like analyzing and processing text. It helps you break down text into smaller parts (words or sentences), clean it up, and make sense of it for tasks like identifying the meaning of words, classifying text as positive or negative, or figuring out if an email is spam.

Stopwords are common words in a language that carry little meaningful information on their own, such as “the,” “is,” “in,” “on,” etc. These words are often removed during text preprocessing in natural language processing (NLP) tasks to focus on the more relevant words.

Use the following script to read all email files from their respective directories and store them in a DataFrame:

import os
import pandas as pd

# Define paths to the directories containing the spam and ham emails
spam_dir = 'D:/internship/supervised_learning/email_spam_detection/datasets/spam'
ham_dir = 'D:/internship/supervised_learning/email_spam_detection/datasets/easy_ham'

# Function to read all email files and store them in a list
def load_emails_from_directory(directory, label):
    emails = []
    for filename in os.listdir(directory):
        with open(os.path.join(directory, filename), 'r', encoding='latin-1') as file:
            email_content = file.read()
            emails.append((email_content, label))  # Tuple (email_content, label)
    return emails

# Load spam and ham emails
spam_emails = load_emails_from_directory(spam_dir, 1)  # 1 for spam
ham_emails = load_emails_from_directory(ham_dir, 0)    # 0 for ham

# Combine spam and ham into a single list
all_emails = spam_emails + ham_emails

# Create a DataFrame with two columns: 'email' and 'label'
df = pd.DataFrame(all_emails, columns=['email', 'label'])

# Save the DataFrame to a CSV file
df.to_csv('spam_ham_dataset.csv', index=False)

print(f'Dataset saved with {len(df)} emails.')

This script reads all spam and ham email files from their respective directories, creates a DataFrame, and saves it to a CSV file named spam_ham_dataset.csv.
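
Raw emails contain commas, quotation marks, and line breaks, so it is reasonable to wonder whether they survive being written to a CSV file. The sketch below uses Python's built-in csv module and an in-memory buffer to show that quoting keeps such a body intact; pandas applies comparable quoting by default when writing with to_csv. The email body here is made up for the example.

```python
import csv
import io

# A made-up email body containing commas, quotes, and line breaks
email_body = 'Subject: Hello, world\n\nShe said "click here", then left.\n'

# Write one (email, label) row to an in-memory CSV file
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["email", "label"])
writer.writerow([email_body, 1])

# Read it back and confirm the body survived unchanged
buffer.seek(0)
rows = list(csv.reader(buffer))
header, first_row = rows[0], rows[1]
round_trip_ok = first_row[0] == email_body
print(header, round_trip_ok)
```

Note that the label comes back as the string "1" with the csv module; pandas converts it back to an integer when it reads the file.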

Step 3: Loading and Preprocessing the Data

3.1 Load the CSV Dataset

Once you have your dataset saved as a CSV file, load it into your Python environment using pandas:

import pandas as pd

# Load dataset
df = pd.read_csv('spam_ham_dataset.csv')

# Display the first few rows of the dataset
print(df.head())

3.2 Preprocessing the Emails

Before feeding the email text into a machine learning model, we need to clean and preprocess it. The following functions will help:

Convert to lowercase: To make the text case-insensitive.
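Remove punctuation: To simplify the vocabulary. This is a simplification, since heavy punctuation such as “!!!” can itself hint at spam, but it keeps the example easy to follow.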
Remove stopwords: Common words like “the”, “is”, “and” that do not add value.

import string
import nltk
from nltk.corpus import stopwords

# Download stopwords
nltk.download('stopwords')

# Build the stopword set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

# Define preprocessing functions
def to_lowercase(text):
    return text.lower()

def remove_punctuation(text):
    return ''.join([char for char in text if char not in string.punctuation])

def remove_stopwords(text):
    return ' '.join([word for word in text.split() if word not in STOP_WORDS])

# Apply preprocessing steps in order: lowercase, strip punctuation, drop stopwords
df['cleaned_email'] = df['email'].apply(to_lowercase)
df['cleaned_email'] = df['cleaned_email'].apply(remove_punctuation)
df['cleaned_email'] = df['cleaned_email'].apply(remove_stopwords)

Now, the cleaned_email column contains preprocessed emails that are ready to be vectorized.
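
To see what the three steps do to a single message, here is a small self-contained sketch. It uses a tiny hand-written stopword set (an assumption made so the example runs without downloading NLTK data) and a hypothetical helper named clean_text that chains the steps in the same order as above.

```python
import string

# Tiny stand-in stopword list (an assumption for this sketch;
# the tutorial itself uses NLTK's full English list)
DEMO_STOP_WORDS = {"you", "have", "a", "to", "your", "the", "is"}

def clean_text(text):
    """Lowercase, strip ASCII punctuation, then drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in text.split() if w not in DEMO_STOP_WORDS)

sample = "Congratulations! You have won a FREE prize. Click here to claim your reward."
cleaned = clean_text(sample)
print(cleaned)  # congratulations won free prize click here claim reward
```

The order matters: stopword lists are lowercase, so lowercasing first is what lets a capitalised word like “You” be recognised and removed.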

Step 4: Vectorizing the Email Data

To convert text into a numerical format, we’ll use TF-IDF (Term Frequency-Inverse Document Frequency) vectorization, which helps the model understand the importance of each word.

Vectorizing the text data using TF-IDF refers to the process of converting raw text (like emails, reviews, or any unstructured text) into numerical features that a machine learning model can understand and work with. Since machine learning algorithms cannot directly interpret text data, we need to transform it into a format that they can process, and TF-IDF (Term Frequency-Inverse Document Frequency) is one of the most commonly used techniques for this purpose.
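
To make the idea less abstract, the sketch below computes TF-IDF weights by hand for three made-up documents. It follows the smoothed formula that scikit-learn documents for its default settings, idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation of each document vector, but it is a simplified illustration with whitespace tokenisation, not a replacement for TfidfVectorizer.

```python
import math
from collections import Counter

docs = [
    "win free prize now",
    "meeting agenda for monday",
    "free meeting room",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_vector(tokens):
    """Return {term: weight} for one document, scaled to unit L2 length."""
    counts = Counter(tokens)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vectors = [tfidf_vector(tokens) for tokens in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

In the first document, “win” receives a higher weight than “free” because “free” also appears in another document and is therefore less distinctive.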

from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TF-IDF vectorizer
tfidf = TfidfVectorizer(max_features=3000)

# Fit and transform the cleaned email text
X = tfidf.fit_transform(df['cleaned_email']).toarray()

# Target variable (spam or ham labels)
y = df['label']

Step 5: Splitting the Data into Training and Test Sets

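To evaluate the model’s performance, we split the dataset into training and testing sets. Note that the vectorizer above was fitted on the full dataset before the split, so the test score may be slightly optimistic; fitting TF-IDF on the training portion only avoids this.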

from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Explanation of the Code:

`train_test_split(X, y, test_size=0.2, random_state=42)`

X: The feature matrix. In your case, it contains the numerical TF-IDF vectors representing the emails (input data).
y: The target variable. In your case, y contains the labels for each email (whether it’s spam or not spam).
- 1 represents spam.
- 0 represents not spam (ham).
test_size=0.2: This specifies the proportion of the dataset that should be set aside for testing. Here, 20% of the data will be used for testing, and the remaining 80% will be used for training the model.
random_state=42: This is a seed value that ensures the random splitting of the data is reproducible. By setting random_state, you ensure that every time you run the code, the same split between training and testing data will occur. You can set this to any integer value, but using the same value ensures consistent results when testing the model.

Output Variables:

X_train: The training portion of the feature matrix (80% of the data). This is used to train the machine learning model.
X_test: The testing portion of the feature matrix (20% of the data). This is used to evaluate how well the model performs on unseen data.
y_train: The training portion of the target variable (y). These are the corresponding labels (spam or not spam) for the training emails.
y_test: The testing portion of the target variable (y). These are the corresponding labels for the test emails, which are used to evaluate the model’s predictions.

Purpose of Splitting the Data:

Training Set (X_train, y_train): The model is trained using this data, which consists of known inputs (emails) and outputs (whether they are spam or not).
Test Set (X_test, y_test): After the model has been trained, it is tested on this set of data that the model hasn’t seen before. The test set helps evaluate how well the model can generalize to new, unseen emails.
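
The effect of the seed can be illustrated without scikit-learn. The sketch below is a simplified, hypothetical helper (simple_train_test_split is not scikit-learn's implementation) that shuffles with a fixed seed and cuts off the test share, showing that the same seed gives the same split on every run.

```python
import random

def simple_train_test_split(items, test_size=0.2, seed=42):
    """Shuffle a copy of items with a fixed seed and cut off the test share."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

emails = [f"email_{i}" for i in range(10)]
train_a, test_a = simple_train_test_split(emails)
train_b, test_b = simple_train_test_split(emails)

print(len(train_a), len(test_a))   # sizes follow the 80/20 ratio
print(test_a == test_b)            # same seed, same split
```

Changing the seed produces a different but equally valid split, which is why reported accuracy can vary slightly between runs that use different seeds.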

Step 6: Training the Naive Bayes Model

Now, let’s train the Naive Bayes model, which is often used for text classification due to its simplicity and effectiveness.

The Naive Bayes model is a family of probabilistic machine learning algorithms based on Bayes’ Theorem. It is particularly effective for classification tasks like spam detection, sentiment analysis, and text classification. Naive Bayes is called “naive” because it assumes that the features (e.g., words in an email) are independent of each other, which is often not the case in real life but still works surprisingly well in practice.

Types of Naive Bayes Models:

Multinomial Naive Bayes: Used for discrete data like word counts. This is commonly used for text classification problems, such as spam detection, where the features are word frequencies or TF-IDF scores.
Bernoulli Naive Bayes: Used when features are binary (e.g., whether a particular word appears or not). This is also used for text data, but instead of counting word frequencies, it checks whether a word is present or absent.
Gaussian Naive Bayes: Used for continuous data, where features follow a normal (Gaussian) distribution.
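
To show what the multinomial variant computes, here is a from-scratch sketch on four made-up messages. It uses raw word counts and Laplace (add-one) smoothing; the data, function names, and the choice to skip unseen words are assumptions for illustration, and the tutorial itself relies on scikit-learn's MultinomialNB.

```python
import math
from collections import Counter, defaultdict

train = [
    ("win free prize now", 1),
    ("free cash click now", 1),
    ("meeting at noon", 0),
    ("project report attached", 0),
]

# Count words per class and documents per class
word_counts = defaultdict(Counter)
class_docs = Counter()
for text, label in train:
    class_docs[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def log_posterior(text, label, alpha=1.0):
    """Unnormalised log P(label | text) with Laplace (add-alpha) smoothing."""
    log_prob = math.log(class_docs[label] / sum(class_docs.values()))
    total = sum(word_counts[label].values())
    for word in text.split():
        if word not in vocab:
            continue  # ignore words never seen in training
        count = word_counts[label][word]
        log_prob += math.log((count + alpha) / (total + alpha * len(vocab)))
    return log_prob

def predict(text):
    """Pick the class with the higher log posterior."""
    return max(class_docs, key=lambda label: log_posterior(text, label))

print(predict("free prize"))       # 1 (spam)
print(predict("meeting report"))   # 0 (not spam)
```

Each word contributes its own factor independently of the others, which is exactly the “naive” independence assumption described above.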

from sklearn.naive_bayes import MultinomialNB

# Initialize and train the model
model = MultinomialNB()
model.fit(X_train, y_train)

Step 7: Evaluating the Model

Once the model is trained, you can evaluate its performance using various metrics, such as accuracy, confusion matrix, and classification report.

accuracy_score(y_test, y_pred) is used to calculate the accuracy of your machine learning model. The accuracy metric measures how well the model’s predictions match the actual labels for the test data.

What is Accuracy?

Accuracy is defined as the ratio of correctly predicted instances (both true positives and true negatives) to the total number of instances in the dataset. In simpler terms, it tells you the percentage of predictions the model got correct.

confusion_matrix(y_test, y_pred) is used to compute a confusion matrix, which is a summary of the prediction results for a classification problem. It shows how well your machine learning model is performing by comparing the predicted labels (y_pred) with the actual labels (y_test).

What is a Confusion Matrix?

A confusion matrix is a table that summarizes the performance of a classification model by showing the counts of:

True Positives (TP): Correctly predicted positives (e.g., correctly predicted spam emails).
True Negatives (TN): Correctly predicted negatives (e.g., correctly predicted non-spam emails).
False Positives (FP): Incorrectly predicted positives (e.g., predicting an email as spam when it is not).
False Negatives (FN): Incorrectly predicted negatives (e.g., predicting an email as not spam when it is actually spam).
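
To make the four counts concrete, the sketch below tallies them by hand on made-up labels and derives accuracy from them. The helper confusion_counts is hypothetical; scikit-learn's confusion_matrix reports the same counts arranged as [[TN, FP], [FN, TP]] for labels 0 and 1.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, tn, fp, fn) treating `positive` as the spam class."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn

# Made-up labels: 1 = spam, 0 = not spam
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tp, tn, fp, fn, accuracy)  # 3 3 1 1 0.75
```

For a spam filter, false positives (legitimate mail sent to the spam folder) are usually the more costly error, which is why the individual counts matter and not only the overall accuracy.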

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Predict on the test set
y_pred = model.predict(X_test)

# Print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')

# Print confusion matrix
print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred))

# Print classification report
print('Classification Report:')
print(classification_report(y_test, y_pred))

Example of a Classification Report

              precision    recall  f1-score   support

           0       0.75      1.00      0.86         3
           1       1.00      0.50      0.67         2

    accuracy                           0.80         5
   macro avg       0.88      0.75      0.76         5
weighted avg       0.85      0.80      0.78         5

Class 0 (Not Spam):
- Precision = 0.75: Out of all emails predicted as “not spam,” 75% were actually not spam.
- Recall = 1.00: The model correctly identified 100% of the “not spam” emails.
- F1-Score = 0.86: This is the harmonic mean of precision and recall for “not spam” emails.
- Support = 3: There are 3 “not spam” emails in the test set.
Class 1 (Spam):
- Precision = 1.00: Out of all emails predicted as “spam,” 100% were actually spam.
- Recall = 0.50: The model correctly identified 50% of the actual spam emails.
- F1-Score = 0.67: This is the harmonic mean of precision and recall for “spam” emails.
- Support = 2: There are 2 spam emails in the test set.
Overall Metrics:
- Accuracy = 0.80: The model’s overall accuracy is 80% (i.e., it correctly classified 80% of the emails).
- Macro Avg: This is the unweighted average of precision, recall, and F1-score across all classes.
- Weighted Avg: This is the weighted average of precision, recall, and F1-score, where each class’s contribution is weighted by its support (i.e., the number of occurrences in the test set).
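
The numbers in the example report can be reproduced by hand. The sketch below uses five labels chosen to be consistent with that report (three ham, two spam, with one spam email missed) and a hypothetical helper, per_class_metrics, that applies the standard definitions of precision, recall, and F1.

```python
def per_class_metrics(y_true, y_pred, cls):
    """Precision, recall, F1 and support for one class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    predicted = sum(1 for p in y_pred if p == cls)
    support = sum(1 for t in y_true if t == cls)
    precision = tp / predicted if predicted else 0.0
    recall = tp / support if support else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1, support

# Labels chosen to be consistent with the example report (1 = spam)
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1]

metrics = {cls: per_class_metrics(y_true, y_pred, cls) for cls in (0, 1)}
macro_f1 = sum(m[2] for m in metrics.values()) / len(metrics)
weighted_f1 = sum(m[2] * m[3] for m in metrics.values()) / len(y_true)

for cls, (p, r, f1, support) in metrics.items():
    print(cls, round(p, 2), round(r, 2), round(f1, 2), support)
print(round(macro_f1, 2), round(weighted_f1, 2))  # 0.76 0.78
```

The macro average treats both classes equally, while the weighted average leans toward class 0 because it has more test emails.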

Step 8: Testing the Model with a New Email

You can now test the model with a new email to see if it correctly classifies it as spam or not spam.

def check_spam(email_text, model, tfidf_vectorizer):
    # Apply the same steps, in the same order, as during training
    processed_email = remove_stopwords(remove_punctuation(to_lowercase(email_text)))
    email_vector = tfidf_vectorizer.transform([processed_email])
    prediction = model.predict(email_vector)
    return "SPAM" if prediction[0] == 1 else "NOT SPAM"

# Test the model with a new email
new_email = "Congratulations! You've won a free iPhone. Click here to claim your prize."
result = check_spam(new_email, model, tfidf)
print(result)

Complete Code

import os
import pandas as pd
import string
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Step 1: Set up the environment (only run these commands in the terminal)
# python -m venv venv
# ./venv/Scripts/activate
# python.exe -m pip install --upgrade pip
# pip install pandas scikit-learn numpy nltk --cache-dir "D:/internship/supervised_learning/email_spam_detection/.cache"

# Step 2: Load and process email files, create the dataset

# Define paths to the directories containing the spam and ham emails
spam_dir = 'D:/internship/supervised_learning/email_spam_detection/datasets/spam'
ham_dir = 'D:/internship/supervised_learning/email_spam_detection/datasets/easy_ham'

# Function to read all email files and store them in a list
def load_emails_from_directory(directory, label):
    emails = []
    for filename in os.listdir(directory):
        with open(os.path.join(directory, filename), 'r', encoding='latin-1') as file:
            email_content = file.read()
            emails.append((email_content, label))  # Tuple (email_content, label)
    return emails

# Load spam and ham emails
spam_emails = load_emails_from_directory(spam_dir, 1)  # 1 for spam
ham_emails = load_emails_from_directory(ham_dir, 0)    # 0 for ham

# Combine spam and ham into a single list
all_emails = spam_emails + ham_emails

# Create a DataFrame with two columns: 'email' and 'label'
df = pd.DataFrame(all_emails, columns=['email', 'label'])

# Save the DataFrame to a CSV file
df.to_csv('spam_ham_dataset.csv', index=False)
print(f'Dataset saved with {len(df)} emails.')

# Step 3: Preprocess the emails
nltk.download('stopwords')

# Define preprocessing functions
def to_lowercase(text):
    return text.lower()

def remove_punctuation(text):
    return ''.join([char for char in text if char not in string.punctuation])

def remove_stopwords(text):
    stop_words = set(stopwords.words('english'))
    return ' '.join([word for word in text.split() if word not in stop_words])

# Apply preprocessing
df['cleaned_email'] = df['email'].apply(lambda x: to_lowercase(x))
df['cleaned_email'] = df['cleaned_email'].apply(lambda x: remove_punctuation(x))
df['cleaned_email'] = df['cleaned_email'].apply(lambda x: remove_stopwords(x))

# Step 4: Vectorize the text data using TF-IDF
tfidf = TfidfVectorizer(max_features=3000)
X = tfidf.fit_transform(df['cleaned_email']).toarray()
y = df['label']

# Step 5: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 6: Train the Naive Bayes model
model = MultinomialNB()
model.fit(X_train, y_train)

# Step 7: Evaluate the model
y_pred = model.predict(X_test)

# Print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')

# Print confusion matrix
print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred))

# Print classification report
print('Classification Report:')
print(classification_report(y_test, y_pred))

# Step 8: Test the model with a new email
def check_spam(email_text, model, tfidf_vectorizer):
    # Apply the same steps, in the same order, as during training
    processed_email = remove_stopwords(remove_punctuation(to_lowercase(email_text)))
    email_vector = tfidf_vectorizer.transform([processed_email])
    prediction = model.predict(email_vector)
    return "SPAM" if prediction[0] == 1 else "NOT SPAM"

# Test the model with a new email
new_email = "Congratulations! You've won a free iPhone. Click here to claim your prize."
result = check_spam(new_email, model, tfidf)
print(result)

Code with model saving and loading from disk

import os
import pandas as pd
import string
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import joblib

# Step 1: Load and process email files, create the dataset

# Define paths to the directories containing the spam and ham emails
spam_dir = 'D:/internship/supervised_learning/email_spam/datasets/spam'
ham_dir = 'D:/internship/supervised_learning/email_spam/datasets/easy_ham'

# Function to read all email files and store them in a list
def load_emails_from_directory(directory, label):
    emails = []
    for filename in os.listdir(directory):
        with open(os.path.join(directory, filename), 'r', encoding='latin-1') as file:
            email_content = file.read()
            emails.append((email_content, label))  # Tuple (email_content, label)
    return emails

# Load spam and ham emails
spam_emails = load_emails_from_directory(spam_dir, 1)  # 1 for spam
ham_emails = load_emails_from_directory(ham_dir, 0)    # 0 for ham

# Combine spam and ham into a single list
all_emails = spam_emails + ham_emails

# Create a DataFrame with two columns: 'email' and 'label'
df = pd.DataFrame(all_emails, columns=['email', 'label'])

# Step 2: Preprocess the emails
nltk.download('stopwords')

# Define preprocessing functions
def to_lowercase(text):
    return text.lower()

def remove_punctuation(text):
    return ''.join([char for char in text if char not in string.punctuation])

def remove_stopwords(text):
    stop_words = set(stopwords.words('english'))
    return ' '.join([word for word in text.split() if word not in stop_words])

# Apply preprocessing
df['cleaned_email'] = df['email'].apply(lambda x: to_lowercase(x))
df['cleaned_email'] = df['cleaned_email'].apply(lambda x: remove_punctuation(x))
df['cleaned_email'] = df['cleaned_email'].apply(lambda x: remove_stopwords(x))

# Step 3: Vectorize the text data using TF-IDF
tfidf = TfidfVectorizer(max_features=3000)
X = tfidf.fit_transform(df['cleaned_email']).toarray()
y = df['label']

# Step 4: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 5: Train the Naive Bayes model
model = MultinomialNB()
model.fit(X_train, y_train)

# Step 6: Evaluate the model
y_pred = model.predict(X_test)

# Print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')

# Print confusion matrix
print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred))

# Print classification report
print('Classification Report:')
print(classification_report(y_test, y_pred))

# Step 7: Save the model and TF-IDF vectorizer
joblib.dump(model, 'spam_classifier_model.pkl')
joblib.dump(tfidf, 'tfidf_vectorizer.pkl')
print("Model and vectorizer saved.")

# Step 8: Load the model and vectorizer from file
loaded_model = joblib.load('spam_classifier_model.pkl')
loaded_tfidf = joblib.load('tfidf_vectorizer.pkl')
print("Model and vectorizer loaded.")

# Step 9: Test the model with a new email
def check_spam(email_text, model, tfidf_vectorizer):
    # Apply the same steps, in the same order, as during training
    processed_email = remove_stopwords(remove_punctuation(to_lowercase(email_text)))
    email_vector = tfidf_vectorizer.transform([processed_email])
    prediction = model.predict(email_vector)
    return "SPAM" if prediction[0] == 1 else "NOT SPAM"

# Test the loaded model with a new email
new_email = "Congratulations! You've won a free iPhone. Click here to claim your prize."
result = check_spam(new_email, loaded_model, loaded_tfidf)
print(result)

Step 9: Training Multiple Models

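Now we’ll train five additional models on the same training and test split so their results can be compared with Naive Bayes: Logistic Regression, K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Decision Trees, and Random Forests. The snippets below reuse X_train, X_test, y_train, y_test, and accuracy_score from the earlier steps.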

9.1 Logistic Regression

from sklearn.linear_model import LogisticRegression

logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)

# Evaluate
y_pred_logistic = logistic_model.predict(X_test)
print(f'Logistic Regression Accuracy: {accuracy_score(y_test, y_pred_logistic) * 100:.2f}%')

9.2 K-Nearest Neighbors (KNN)

from sklearn.neighbors import KNeighborsClassifier

knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, y_train)

# Evaluate
y_pred_knn = knn_model.predict(X_test)
print(f'KNN Accuracy: {accuracy_score(y_test, y_pred_knn) * 100:.2f}%')

9.3 Support Vector Machine (SVM)

from sklearn.svm import SVC

svm_model = SVC()
svm_model.fit(X_train, y_train)

# Evaluate
y_pred_svm = svm_model.predict(X_test)
print(f'SVM Accuracy: {accuracy_score(y_test, y_pred_svm) * 100:.2f}%')

9.4 Decision Trees

from sklearn.tree import DecisionTreeClassifier

decision_tree_model = DecisionTreeClassifier()
decision_tree_model.fit(X_train, y_train)

# Evaluate
y_pred_tree = decision_tree_model.predict(X_test)
print(f'Decision Tree Accuracy: {accuracy_score(y_test, y_pred_tree) * 100:.2f}%')

9.5 Random Forests

from sklearn.ensemble import RandomForestClassifier

random_forest_model = RandomForestClassifier()
random_forest_model.fit(X_train, y_train)

# Evaluate
y_pred_forest = random_forest_model.predict(X_test)
print(f'Random Forest Accuracy: {accuracy_score(y_test, y_pred_forest) * 100:.2f}%')
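
Since each model follows the same fit, predict, and score pattern, the comparison can also be written as a loop. The sketch below shows one possible way to do that; it uses synthetic data from make_classification purely so it runs on its own (in the tutorial, the features and labels come from the TF-IDF step), and the dictionary of models, the max_iter setting, and the variable names are choices made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data so the sketch runs on its own;
# in the tutorial, X and y come from the TF-IDF step
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

# Fit each model and record its accuracy on the held-out data
results = {}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    results[name] = accuracy_score(y_te, clf.predict(X_te))
    print(f"{name} Accuracy: {results[name] * 100:.2f}%")
```

Adding SVC or RandomForestClassifier to the dictionary extends the comparison without repeating the evaluation code.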

Complete code with all models

import os
import pandas as pd
import string
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Download stopwords
nltk.download('stopwords')

# Step 1: Load and process email files, create the dataset
spam_dir = 'D:/internship/supervised_learning/email_spam_ml/datasets/spam'
ham_dir = 'D:/internship/supervised_learning/email_spam_ml/datasets/easy_ham'

# Function to read email files and label them
def load_emails_from_directory(directory, label):
    emails = []
    for filename in os.listdir(directory):
        with open(os.path.join(directory, filename), 'r', encoding='latin-1') as file:
            email_content = file.read()
            emails.append((email_content, label))
    return emails

# Load spam and ham emails
spam_emails = load_emails_from_directory(spam_dir, 1)
ham_emails = load_emails_from_directory(ham_dir, 0)

# Combine spam and ham emails
all_emails = spam_emails + ham_emails

# Create DataFrame
df = pd.DataFrame(all_emails, columns=['email', 'label'])

# Step 2: Preprocess the emails
def to_lowercase(text):
    return text.lower()

def remove_punctuation(text):
    return ''.join([char for char in text if char not in string.punctuation])

def remove_stopwords(text):
    stop_words = set(stopwords.words('english'))
    return ' '.join([word for word in text.split() if word not in stop_words])

# Apply preprocessing
df['cleaned_email'] = df['email'].apply(lambda x: to_lowercase(x))
df['cleaned_email'] = df['cleaned_email'].apply(lambda x: remove_punctuation(x))
df['cleaned_email'] = df['cleaned_email'].apply(lambda x: remove_stopwords(x))

# Step 3: Vectorize the text data using TF-IDF
tfidf = TfidfVectorizer(max_features=3000)
X = tfidf.fit_transform(df['cleaned_email']).toarray()
y = df['label']

# Step 4: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 5: Model Implementations

# 5.1 Logistic Regression
from sklearn.linear_model import LogisticRegression

logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)
y_pred_logistic = logistic_model.predict(X_test)

print(f'Logistic Regression Accuracy: {accuracy_score(y_test, y_pred_logistic) * 100:.2f}%')
print('Logistic Regression Classification Report:')
print(classification_report(y_test, y_pred_logistic))

# 5.2 K-Nearest Neighbors (KNN)
from sklearn.neighbors import KNeighborsClassifier

knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, y_train)
y_pred_knn = knn_model.predict(X_test)

print(f'KNN Accuracy: {accuracy_score(y_test, y_pred_knn) * 100:.2f}%')
print('KNN Classification Report:')
print(classification_report(y_test, y_pred_knn))

# 5.3 Support Vector Machine (SVM)
from sklearn.svm import SVC

svm_model = SVC()
svm_model.fit(X_train, y_train)
y_pred_svm = svm_model.predict(X_test)

print(f'SVM Accuracy: {accuracy_score(y_test, y_pred_svm) * 100:.2f}%')
print('SVM Classification Report:')
print(classification_report(y_test, y_pred_svm))

# 5.4 Decision Trees
from sklearn.tree import DecisionTreeClassifier

decision_tree_model = DecisionTreeClassifier()
decision_tree_model.fit(X_train, y_train)
y_pred_tree = decision_tree_model.predict(X_test)

print(f'Decision Tree Accuracy: {accuracy_score(y_test, y_pred_tree) * 100:.2f}%')
print('Decision Tree Classification Report:')
print(classification_report(y_test, y_pred_tree))

# 5.5 Random Forest
from sklearn.ensemble import RandomForestClassifier

random_forest_model = RandomForestClassifier()
random_forest_model.fit(X_train, y_train)
y_pred_forest = random_forest_model.predict(X_test)

print(f'Random Forest Accuracy: {accuracy_score(y_test, y_pred_forest) * 100:.2f}%')
print('Random Forest Classification Report:')
print(classification_report(y_test, y_pred_forest))

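# Step 6: Test the model with a new email example
def check_spam(email_text, model, tfidf_vectorizer):
    # Apply the same steps, in the same order, as for the training data:
    # lowercase -> remove punctuation -> remove stopwords
    processed_email = remove_stopwords(remove_punctuation(to_lowercase(email_text)))
    email_vector = tfidf_vectorizer.transform([processed_email])
    prediction = model.predict(email_vector)
    # Assumes the label column encodes spam as 1 and ham as 0
    return "SPAM" if prediction[0] == 1 else "NOT SPAM"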

# Example of testing the model with a real-life example email
new_email = "Congratulations! You've won a free iPhone. Click here to claim your prize."
result = check_spam(new_email, logistic_model, tfidf)
print(f'Test email result: {result}')

Output

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\shubh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Logistic Regression Accuracy: 98.67%
Logistic Regression Classification Report:
              precision    recall  f1-score   support

           0       0.99      1.00      0.99       482
           1       0.99      0.94      0.97       119

    accuracy                           0.99       601
   macro avg       0.99      0.97      0.98       601
weighted avg       0.99      0.99      0.99       601

KNN Accuracy: 95.84%
KNN Classification Report:
              precision    recall  f1-score   support

           0       0.95      1.00      0.97       482
           1       0.99      0.80      0.88       119

    accuracy                           0.96       601
   macro avg       0.97      0.90      0.93       601
weighted avg       0.96      0.96      0.96       601

SVM Accuracy: 99.33%
SVM Classification Report:
              precision    recall  f1-score   support

           0       0.99      1.00      1.00       482
           1       0.99      0.97      0.98       119

    accuracy                           0.99       601
   macro avg       0.99      0.99      0.99       601
weighted avg       0.99      0.99      0.99       601

Decision Tree Accuracy: 99.33%
Decision Tree Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       482
           1       0.98      0.98      0.98       119

    accuracy                           0.99       601
   macro avg       0.99      0.99      0.99       601
weighted avg       0.99      0.99      0.99       601

Random Forest Accuracy: 99.67%
Random Forest Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       482
           1       0.99      0.99      0.99       119

    accuracy                           1.00       601
   macro avg       0.99      0.99      0.99       601
weighted avg       1.00      1.00      1.00       601

Test email result: SPAM
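
The reports above list precision, recall, and F1-score for each class. As a rough sketch of where those numbers come from, the snippet below computes them from raw counts of correct and incorrect predictions. The function name precision_recall_f1 and the ten labels are made up for illustration; they are not the tutorial's data.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one class from paired label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # spam caught
    fp = sum(t != positive and p == positive for t, p in pairs)  # ham flagged as spam
    fn = sum(t == positive and p != positive for t, p in pairs)  # spam missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels for ten emails (1 = spam, 0 = not spam); not the tutorial's data.
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

precision, recall, f1 = precision_recall_f1(y_true_demo, y_pred_demo)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Read this way, the KNN report's spam recall of 0.80 means roughly one in five of the 119 spam emails in the test set slipped through, even though its spam precision is 0.99.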

Here’s a simple explanation of each model used in the spam email classification project, along with real-life examples:

1. Logistic Regression

Explanation: Logistic Regression is used for binary classification (two classes). It estimates the probability of an event occurring by fitting the data to a logistic function. The output is always between 0 and 1, which can be interpreted as the probability of the input belonging to one class (e.g., spam) or the other (e.g., not spam).

Example: Imagine you receive an email saying, “Congratulations, you’ve won $1,000,000!” Logistic Regression will analyze the words and patterns in the email and assign a probability that it’s spam, like 0.95. Since this is closer to 1, it classifies the email as spam.

Example: Imagine you’re a doctor diagnosing whether a patient has a disease (yes/no). Logistic regression will take patient data (age, weight, etc.) and predict whether the patient likely has the disease. If the probability is higher than 0.5, it predicts “Yes,” otherwise “No.”

Real-life Use Case: It’s often used in detecting fraudulent activities or for predicting if a customer will purchase a product (yes/no).
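
To make the "probability between 0 and 1" idea concrete, here is a minimal sketch of the logistic function applied to a weighted sum of word features. The weights, bias, and helper names (sigmoid, spam_probability, classify) are hypothetical values chosen for illustration, not ones learned by the model in this tutorial.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights for three word features (illustrative values only).
weights = {"congratulations": 1.2, "won": 1.5, "meeting": -2.0}
bias = -0.5

def spam_probability(email_text):
    """Add up the weights of the known words present, then apply the sigmoid."""
    words = set(email_text.lower().split())
    z = bias + sum(w for word, w in weights.items() if word in words)
    return sigmoid(z)

def classify(email_text, threshold=0.5):
    """Label the email SPAM when its probability exceeds the threshold."""
    return "SPAM" if spam_probability(email_text) > threshold else "NOT SPAM"

p_spam = spam_probability("congratulations you won a prize")
p_ham = spam_probability("meeting moved to monday")
print(f"{p_spam:.2f}", classify("congratulations you won a prize"))
print(f"{p_ham:.2f}", classify("meeting moved to monday"))
```

With scikit-learn, the equivalent probabilities are available from a fitted LogisticRegression through its predict_proba method.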

2. K-Nearest Neighbors (KNN)

Explanation: KNN is a simple algorithm that classifies data points based on the majority class of their nearest neighbors. It works by comparing the new data point to the k nearest data points in the dataset and assigning it to the class with the most neighbors.

Example: Suppose KNN checks the five emails (neighbors) closest to the one you received. If 3 out of 5 neighbors are spam, KNN will classify the email as spam.

Example: Imagine you’re at a park and want to predict the type of tree you’re standing near. You look at the three closest trees around you (neighbors) and see that two are oak trees and one is a maple tree. Since the majority are oak trees, you predict that the tree you’re near is also an oak tree.

Real-life Use Case: KNN can be used in recommendation systems, such as suggesting products to users based on what similar users have liked.
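
The majority vote described above can be sketched in a few lines of plain Python. The two features per email (number of links, number of exclamation marks) and the function name knn_predict are assumptions made for this example; the tutorial's model uses TF-IDF vectors instead.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=5):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort training examples by Euclidean distance to the query.
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    # The most common label among the k neighbors wins.
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical features per email: (number of links, number of "!" characters).
train_points = [(5, 7), (6, 5), (4, 6), (0, 1), (1, 0), (0, 0), (1, 1)]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham", "ham"]

knn_result = knn_predict(train_points, train_labels, query=(4, 5), k=5)
print(knn_result)
```

For the query (4, 5), three of the five closest emails are spam and two are ham, so the vote goes to spam, mirroring the "3 out of 5" example above.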

3. Support Vector Machine (SVM)

Explanation: SVM tries to find the best boundary (or hyperplane) that separates the data points of different classes. It maximizes the margin between two classes, making it a robust choice when there’s a clear separation between classes.

Example: Imagine a plot of emails with spam and not spam as two groups of points. SVM will find the line that best separates these two groups. If the email you receive falls on the spam side of the line, SVM will classify it as spam.

Example: Imagine drawing a line on the ground between two groups of people (Group A and Group B) such that everyone in Group A stays on one side of the line, and everyone in Group B stays on the other side. SVM finds the best line to keep the groups separate.

Real-life Use Case: SVM is used in facial recognition, where it separates images of one person from images of others by finding the optimal boundary.
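
The "which side of the line" idea can be sketched with a linear SVM on a handful of 2-D points. Note that SVC() as used in this tutorial defaults to an RBF kernel, whose boundary is generally curved; kernel="linear" is chosen here only so the boundary is a straight line. The toy points are made up for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical 2-D features for six emails; label 1 = spam, 0 = not spam.
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                  [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel makes the decision boundary a straight line.
linear_svm = SVC(kernel="linear")
linear_svm.fit(X_toy, y_toy)

new_points = np.array([[0.5, 0.5], [3.5, 3.5]])
# decision_function returns a signed score: positive means the spam side.
scores = linear_svm.decision_function(new_points)
svm_labels = linear_svm.predict(new_points)

for point, score, label in zip(new_points, scores, svm_labels):
    side = "spam side" if score > 0 else "not-spam side"
    print(point, f"score={score:.2f}", side, "-> predicted", label)
```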

4. Decision Trees

Explanation: Decision Trees classify data by splitting it based on feature values. It creates a tree-like structure where each internal node represents a decision (based on features like certain words in an email), and each leaf node represents the final classification (spam or not spam).

Example: Imagine a decision tree where the first question is, “Does the email contain the word ‘prize’?” If yes, it might go further and ask, “Does it contain a link?” Based on the answers, it will classify the email as spam or not spam.

Example: Imagine you’re trying to decide whether to take an umbrella with you. You make decisions based on questions like, “Is it cloudy?” or “Is it raining?”. Based on the answers, you’ll eventually come to a conclusion (take an umbrella or not).

Real-life Use Case: Decision trees are used in credit scoring, where banks use them to decide whether to approve or deny loan applications based on various criteria (like income, credit score, etc.).
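
As a small sketch of the question-by-question structure, the snippet below fits a tree on two hypothetical yes/no features ("contains the word prize" and "contains a link") and prints the learned rules with scikit-learn's export_text. The six training rows are invented so that an email is spam only when both features are present.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical binary features per email: [contains "prize", contains a link].
X_rules = [[1, 1], [1, 1], [1, 0], [0, 1], [0, 0], [0, 0]]
y_rules = [1, 1, 0, 0, 0, 0]  # 1 = spam, 0 = not spam

rule_tree = DecisionTreeClassifier(random_state=0)
rule_tree.fit(X_rules, y_rules)

# Print the learned questions as indented if/else rules.
print(export_text(rule_tree, feature_names=["contains_prize", "contains_link"]))

# Classify three emails: both features, prize only, link only.
tree_predictions = rule_tree.predict([[1, 1], [1, 0], [0, 1]])
print(tree_predictions)
```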

5. Random Forest

Explanation: Random Forest is an ensemble of decision trees. It builds multiple decision trees using random subsets of the data and features, then combines their results. This reduces overfitting and increases accuracy by averaging the decisions of many trees.

Example: Suppose you have 100 decision trees. Each tree votes on whether the email is spam or not spam. If most trees say it’s spam, Random Forest classifies it as spam.

Example: Imagine you have 10 friends, and you ask each of them whether it will rain tomorrow. Eight of them say “yes” and two say “no,” so you go with the majority and expect rain.

Real-life Use Case: Random Forest is used in medical diagnosis to predict whether a patient has a certain disease based on symptoms. Multiple decision trees vote on the diagnosis, reducing the chances of errors.
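
The voting idea can be inspected directly, since a fitted scikit-learn forest exposes its individual trees in the estimators_ attribute. One detail worth knowing: RandomForestClassifier combines trees by averaging their predicted class probabilities rather than counting hard votes, which usually leads to the same label. The toy data below is made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical 2-D features; label 1 = spam, 0 = not spam.
X_rf = np.array([[0, 0], [1, 0], [0, 1], [1, 1],
                 [4, 4], [5, 4], [4, 5], [5, 5]], dtype=float)
y_rf = np.array([0, 0, 0, 0, 1, 1, 1, 1])

forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(X_rf, y_rf)

email_features = np.array([[4.5, 4.5]])

# Ask each tree separately for its spam probability (column 1).
per_tree_spam = [tree.predict_proba(email_features)[0, 1]
                 for tree in forest.estimators_]
print("Trees leaning spam:", sum(p > 0.5 for p in per_tree_spam),
      "of", len(per_tree_spam))

# The forest's probability is the mean of the per-tree probabilities.
forest_spam = forest.predict_proba(email_features)[0, 1]
forest_label = forest.predict(email_features)[0]
print("Forest spam probability:", forest_spam, "-> label", forest_label)
```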

Summary of Models with Simple Analogies

Logistic Regression: Like a weather forecast that gives a percentage chance of rain (spam). If it’s 90% likely to rain, you prepare for rain.
K-Nearest Neighbors (KNN): Like asking your five neighbors whether they think an email is spam, and you go with the majority opinion.
SVM: Like drawing a line on the ground that separates spam and non-spam emails, and checking which side your new email lands on.
Decision Trees: Like playing 20 questions, where each yes/no answer narrows down whether the email is spam or not.
Random Forest: Like having multiple decision trees and letting them vote. The majority decides if the email is spam or not.

Conclusion

Congratulations! You’ve successfully built a supervised learning model for email spam detection. Here’s a recap of what we’ve covered:

Set up the project environment.
Created a dataset from email files and saved it as CSV.
Preprocessed the email data by cleaning the text.
Converted text into numerical features using TF-IDF vectorization.
Trained a Naive Bayes model, then compared it with Logistic Regression, KNN, SVM, Decision Tree, and Random Forest classifiers.
Evaluated the model’s performance and tested it with new email data.

This project can be expanded further by experimenting with different algorithms, tuning hyperparameters, or adding more advanced text preprocessing steps like stemming and lemmatization.
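
As one possible starting point for hyperparameter tuning, the sketch below wraps the vectorizer and a classifier in a scikit-learn Pipeline and searches a small grid with GridSearchCV. The eight-email corpus, the step names tfidf and clf, and the candidate values are assumptions for illustration; in the real project you would pass df['email'] and df['label'] and use more folds.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus (1 = spam, 0 = not spam), for illustration only.
emails = [
    "win a free prize now", "claim your free iphone today",
    "free cash click here", "you won a lottery prize",
    "meeting moved to monday", "please review the attached report",
    "lunch at noon tomorrow", "project status update attached",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Bundling the vectorizer and the classifier means the same preprocessing
# is applied during training, cross-validation and prediction.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Candidate values are illustrative, not recommendations.
param_grid = {
    "clf__C": [0.1, 1.0, 10.0],
    "tfidf__ngram_range": [(1, 1), (1, 2)],
}

# cv=2 only because the toy corpus is so small.
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(emails, labels)

print("Best parameters:", search.best_params_)
new_prediction = search.predict(["free prize waiting, click to claim"])
print(new_prediction)
```

Because the fitted search object contains the whole pipeline, raw email text can be passed straight to predict without repeating the cleaning steps by hand.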

2 Comments

Durga Prasad

2 months ago

Does the program shown above work with Gmail for checking whether an email is spam or not?

Shubham Gupta

Author

Reply to Durga Prasad

In response to your question about whether the program shown uses the same algorithm as Gmail to detect spam emails, I would like to clarify that the dataset used in this program is relatively small. As I mentioned during the session, if a larger dataset were implemented, the model could perform similarly to Gmail’s spam detection system.

While Gmail uses more complex algorithms with larger datasets, this program can still be utilized in production for your specific needs. By scaling up the dataset and refining the model, it can provide effective spam detection for a variety of real-world applications.