Clustering Algorithms and Dimensionality Reduction Techniques: A Comprehensive Guide

1. Introduction

Clustering and dimensionality reduction are essential techniques in machine learning, aiding in understanding large datasets by grouping similar data points and simplifying data structures. This article introduces key clustering algorithms such as K-Means and Hierarchical Clustering, dimensionality reduction techniques like PCA and t-SNE, anomaly detection, and a hands-on customer segmentation project with live Python code.

2. Clustering Algorithms

2.1 K-Means Clustering

K-Means is an unsupervised learning algorithm that partitions the dataset into K distinct clusters. Each data point belongs to the cluster with the nearest mean.
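To make the "nearest mean" idea concrete, here is a minimal NumPy sketch of the two steps K-Means repeats: assign every point to its closest centroid, then move each centroid to the mean of its assigned points. The toy data and starting centroids are assumptions chosen purely for illustration; in practice you would use a library implementation such as scikit-learn's KMeans (shown later in this section).

import numpy as np

# Toy 2-D data and K = 2 starting centroids (illustrative assumptions only)
X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5]])
centroids = np.array([[1.0, 2.0], [8.0, 8.0]])

for _ in range(10):  # a handful of iterations is enough for this toy example
    # Assignment step: each point joins the cluster with the nearest centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Update step: each centroid moves to the mean of its assigned points
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(len(centroids))])

print(labels)     # cluster index of each point, e.g. [0 0 1 1]
print(centroids)  # final cluster means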

Examples

2.1.1 Customer Segmentation in Marketing
  • Scenario: A retail company wants to better understand its customers to create targeted marketing strategies. The company collects data on customer demographics (e.g., age, income) and purchasing behavior (e.g., purchase frequency, total spending).
  • Application of K-Means: K-Means clustering can be used to segment customers into different groups based on their buying behavior and demographic features. For example, clusters might include high-spending customers, occasional shoppers, or budget-conscious customers.
  • Benefit: By identifying these segments, the company can tailor its marketing strategies and promotions to meet the needs of each customer group, improving customer satisfaction and sales.
2.1.2 Image Compression
  • Scenario: In digital imaging, storing high-resolution images requires a lot of memory. To reduce file size, an image compression algorithm is needed while maintaining visual quality.
  • Application of K-Means: K-Means clustering is used in image compression by clustering the pixel values of an image into K clusters. Each pixel is then assigned to the nearest cluster centroid (mean color), reducing the number of unique colors in the image while preserving the overall appearance (a short code sketch of this idea appears after this list).
  • Benefit: This reduces the amount of memory required to store the image while retaining visual quality, making it useful for efficient storage and faster image transmission on websites or apps.
2.1.3 Document Clustering in Text Mining
  • Scenario: A company or organization has a large collection of documents, articles, or customer feedback that needs to be organized automatically into topics for better management or analysis.
  • Application of K-Means: By converting text documents into numerical feature vectors (e.g., using TF-IDF), K-Means clustering can group similar documents into clusters. For example, articles might be clustered into topics such as technology, sports, or finance based on their content.
  • Benefit: This helps in document categorization, search optimization, and content recommendation, allowing users to easily find relevant information in large text datasets.
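To make the image compression example (2.1.2) concrete, here is a rough sketch of K-Means color quantization. The file name 'photo.jpg' is only a placeholder; any image of your own will do, and reading JPEGs through Matplotlib requires Pillow to be installed.

# Rough sketch: compress an image by clustering its pixel colors with K-Means
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

img = plt.imread('photo.jpg')          # placeholder file name
if img.dtype == np.uint8:
    img = img / 255.0                  # scale 8-bit pixel values to [0, 1]
pixels = img[:, :, :3].reshape(-1, 3)  # one row per pixel, columns = R, G, B

# Cluster pixel colors into 16 representative colors, then replace each pixel
# with its nearest cluster center (for large images, fit on a random sample of pixels)
kmeans = KMeans(n_clusters=16, random_state=0).fit(pixels)
compressed = kmeans.cluster_centers_[kmeans.labels_].reshape(img.shape[0], img.shape[1], 3)

plt.imshow(compressed)
plt.title('Image quantized to 16 colors with K-Means')
plt.axis('off')
plt.show()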

Practical Example: Let’s consider a simple dataset with customer information based on two features: Annual Income and Spending Score.

# python -m venv venv
# venv\Scripts\activate   (Windows; on Linux/macOS use: source venv/bin/activate)
# pip install numpy pandas matplotlib scikit-learn

# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Sample data: Annual Income and Spending Score of customers
data = {'Annual Income': [15, 16, 17, 19, 20, 80, 85, 88, 90, 95],
        'Spending Score': [39, 81, 6, 77, 40, 5, 3, 90, 6, 77]}
df = pd.DataFrame(data)

# Scale the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)

# Applying K-Means clustering
kmeans = KMeans(n_clusters=3, random_state=0)
df['Cluster'] = kmeans.fit_predict(scaled_data)

# Visualizing the clusters
plt.scatter(df['Annual Income'], df['Spending Score'], c=df['Cluster'], cmap='viridis')
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.title('Customer Segments Based on K-Means')
plt.show()

Output: This plot will show customers clustered into different groups based on their income and spending behavior. The clusters could help businesses identify high spenders, average spenders, or low-budget customers.

2.2 Hierarchical Clustering

Hierarchical clustering arranges data into a tree-like structure (a dendrogram) based on similarity, which lets us find clusters at various levels of granularity.

Because data often forms naturally nested groups, hierarchical clustering is useful in many real-world scenarios. Here are a few examples of how it is applied to real-world problems:

2.2.1 Biological Taxonomy

  • Scenario: In biology, scientists study the evolutionary relationships between species. The goal is to classify organisms into a hierarchy of taxonomic groups such as genus, family, and species based on genetic or physical similarities.
  • Application of Hierarchical Clustering: Hierarchical clustering is used to construct phylogenetic trees, which show how species are related based on their genetic data or physical traits. Species that are genetically similar are grouped together, and the hierarchy shows how species diverged from common ancestors.
  • Benefit: It helps scientists visualize evolutionary relationships, study speciation, and make predictions about evolutionary patterns.

2.2.2 Market Basket Analysis in Retail

  • Scenario: A retail company wants to group its products based on customer purchasing behavior. Customers who purchase similar sets of products could belong to distinct buying groups.
  • Application of Hierarchical Clustering: By analyzing customer transaction data, hierarchical clustering can group products that are frequently bought together. These clusters might reveal natural product bundles, such as snacks and drinks, or personal care products.
  • Benefit: Retailers can use this insight to optimize store layout, design product bundles for promotions, or recommend complementary products to customers.

2.2.3 Document Clustering in Information Retrieval

  • Scenario: A company or an organization has thousands of documents, such as research papers or legal files, and needs to organize them based on topic similarity for easy retrieval.
  • Application of Hierarchical Clustering: Hierarchical clustering is applied to group documents that discuss similar topics based on the similarity of their text content. For example, documents could be clustered into topics like finance, technology, or healthcare. The hierarchical structure allows for subcategories, like different subfields within technology (e.g., AI, cloud computing).
  • Benefit: This hierarchical organization helps in information retrieval, allowing users to navigate large collections of documents efficiently by topic.

2.2.4 Customer Segmentation in E-commerce

  • Scenario: An e-commerce company wants to segment its customers based on their purchase behavior to deliver personalized marketing.
  • Application of Hierarchical Clustering: Customer data, including age, income, and purchasing habits, can be clustered hierarchically. At the top level, customers might be broadly divided into high spenders and budget-conscious customers. Within those categories, further clusters could reveal distinct shopping behaviors like frequent buyers versus infrequent buyers.
  • Benefit: This segmentation helps businesses create targeted marketing strategies and improve customer satisfaction by personalizing offers and promotions.

2.2.5 Genomics and Gene Expression Analysis

  • Scenario: Researchers in genomics need to analyze gene expression data to understand how different genes behave under certain conditions (e.g., diseases or treatments).
  • Application of Hierarchical Clustering: Hierarchical clustering groups genes with similar expression patterns across various samples or conditions, helping to identify gene families or co-expressed genes. This is particularly useful for identifying genes involved in the same biological pathways.
  • Benefit: This insight can help in understanding diseases like cancer, where certain genes might be overexpressed or underexpressed, and in identifying potential therapeutic targets.

2.2.6 Social Network Analysis

  • Scenario: A social media platform wants to analyze the connections between its users to identify communities or clusters of closely connected individuals.
  • Application of Hierarchical Clustering: Hierarchical clustering can be used to group users based on their interaction patterns, such as likes, comments, or follows. Larger clusters might represent communities (e.g., fans of a certain celebrity or interest group), and smaller subclusters might reveal friend groups or active participants in a specific event.
  • Benefit: This helps social media platforms suggest new connections, recommend content, or study the structure of online communities.

2.2.7 Fraud Detection in Financial Transactions

  • Scenario: A bank or financial institution wants to detect fraudulent transactions among millions of legitimate ones by identifying unusual patterns.
  • Application of Hierarchical Clustering: Transactions are grouped based on features like transaction amount, frequency, location, and time. By visualizing the dendrogram, fraud analysts can detect small, isolated clusters that represent anomalous behavior, potentially indicative of fraud.
  • Benefit: Early detection of fraud allows financial institutions to block suspicious transactions before significant damage is done, improving security and customer trust.

Practical Example: Let's build a dendrogram for the same customer data used in the K-Means example.

# python -m venv venv
# venv\Scripts\activate   (Windows; on Linux/macOS use: source venv/bin/activate)
# pip install numpy pandas matplotlib scikit-learn scipy

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.preprocessing import StandardScaler

# Sample data: Annual Income and Spending Score of customers
data = {'Annual Income': [15, 16, 17, 19, 20, 80, 85, 88, 90, 95],
        'Spending Score': [39, 81, 6, 77, 40, 5, 3, 90, 6, 77]}

# Creating a DataFrame
df = pd.DataFrame(data)

# Standardize the data (column-wise), consistent with the K-Means example
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df[['Annual Income', 'Spending Score']])

# Perform hierarchical clustering using the 'ward' method
linked = linkage(scaled_data, method='ward')

# Plot the dendrogram
plt.figure(figsize=(10, 7))
dendrogram(linked, orientation='top', labels=range(1, len(df) + 1), distance_sort='descending')
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Distance')
plt.show()

Output: The dendrogram shows how data points (customers) merge into clusters. You can visually decide the optimal number of clusters by cutting the dendrogram at a specific height.
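The dendrogram only visualizes the hierarchy; to actually assign each customer to a cluster, you can cut the tree programmatically with SciPy's fcluster, reusing the linked matrix from the code above. Choosing three clusters below is an illustrative assumption, not a recommendation.

from scipy.cluster.hierarchy import fcluster

# Cut the dendrogram into a fixed number of clusters (3 is an illustrative choice)
df['Cluster'] = fcluster(linked, t=3, criterion='maxclust')
print(df)

# Alternatively, cut at a chosen distance (height) on the dendrogram:
# df['Cluster'] = fcluster(linked, t=0.5, criterion='distance')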

3. Dimensionality Reduction Techniques

3.1 Principal Component Analysis (PCA)

PCA reduces the dimensionality of a dataset by transforming it into a set of uncorrelated variables (principal components).

3.1.1 Why is PCA needed?

Imagine you have a lot of information about your customers (like their age, income, spending habits, etc.). With all these details, it can be hard to visualize and understand patterns because there are so many features. This is where PCA comes in.

PCA is like taking all those complicated details and boiling them down to just 2 or 3 important things (called principal components) that still capture most of the important differences between customers.

In short: PCA simplifies the data, making it easier to see patterns or clusters in fewer dimensions (like a 2D plot).

3.1.2 How to Read a PCA Plot
  1. Dots = Data Points (Customers): Each dot on the plot is a customer. The closer two dots are, the more similar the customers are based on the information you have (age, income, spending score, etc.).
  2. Position on Axes (Principal Components)
    • The x-axis (Principal Component 1) and y-axis (Principal Component 2) are the new “summary features” that PCA has created.
    • These new axes are made by combining the original information (age, income, etc.) into just two components.
    • The more spread out the dots are on these axes, the more different the customers are.
  3. Colors = Clusters
    • If you’ve applied clustering (like KMeans), the colors show which customers belong to the same group.
    • Dots with the same color are more similar to each other.
3.1.3 Why use PCA?
  1. Simplification: It reduces many complex features (like age, income, etc.) into just two or three key dimensions that are easy to plot and understand.
  2. Visualization: PCA lets you plot high-dimensional data in 2D or 3D, which helps you see patterns like customer segments or groups.
3.1.4 Example:

Without PCA, imagine you have 10 different features about customers (age, income, spending habits, etc.). Trying to look at all 10 at once would be confusing. PCA takes those 10 features and reduces them to just 2, making it possible to plot and understand the overall patterns among your customers.

Practical Example:
data.csv
CustomerID,Age,AnnualIncome,SpendingScore,Cluster
1,19,15000,39,1
2,21,18000,81,2
3,20,22000,6,3
4,23,35000,77,2
5,31,45000,40,1
6,22,29000,76,2
7,35,40000,6,3
8,40,60000,94,2
9,50,75000,5,3
10,60,55000,79,2
11,27,48000,60,1
12,22,17000,95,2
13,35,20000,20,1
14,45,33000,15,3
15,28,39000,87,2
Program file
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load the data into a DataFrame (assuming a CSV file)
df = pd.read_csv('data.csv')  # data.csv shown above; replace with the path to your own dataset

# Ensure that the 'Cluster' column exists for color-coding; if not, remove or modify this part.
# Standardize the features, dropping the identifier and label columns:
# 'CustomerID' is just an ID and 'Cluster' is a pre-assigned label, so neither should feed into PCA
features = df.drop(columns=['CustomerID', 'Cluster'], errors='ignore')
scaler = StandardScaler()
scaled_data = scaler.fit_transform(features)

# Applying PCA to reduce data to 2 dimensions
pca = PCA(n_components=2)
pca_components = pca.fit_transform(scaled_data)

# Visualizing the data after PCA
plt.scatter(pca_components[:, 0], pca_components[:, 1], c=df.get('Cluster', None), cmap='viridis')  # Use Cluster if it exists
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA Visualization of Customer Data')
plt.show()

Output: This scatter plot shows the 2D representation of the customer data, where each point corresponds to a customer, and colors represent different clusters.
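To check how much of the original variation the two principal components actually retain, you can inspect the fitted PCA object's explained_variance_ratio_ attribute. This short sketch continues from the code above; the printed numbers depend on your data.

# Fraction of the total variance captured by each principal component
print(pca.explained_variance_ratio_)        # one value per component
print(pca.explained_variance_ratio_.sum())  # total fraction retained by the 2 components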

3.2 t-SNE

t-SNE (t-distributed Stochastic Neighbor Embedding) is another dimensionality reduction technique that is particularly useful for visualizing high-dimensional data.

3.2.1 What is t-SNE?

t-SNE is a technique that takes data with lots of features (high-dimensional data) and reduces it down to two or three dimensions so you can plot and visualize it. Unlike PCA, which focuses on capturing variance in the data, t-SNE is specifically designed for creating meaningful visualizations by preserving the local structure of the data (i.e., it tries to keep similar data points close together).

3.2.2 Why Use t-SNE?
  1. Better for Visualization: t-SNE is particularly good for visualizing clusters and groupings in complex datasets. It often provides clearer and more visually appealing results than PCA, especially when the data is not linearly separable.
  2. Focus on Local Structure: While PCA tries to capture the global structure (variance) in the data, t-SNE focuses more on local relationships, which means that data points that are similar stay close together in the visualization.
3.2.3 How Does t-SNE Work?

Without getting too technical:

  • t-SNE converts the relationships between points in high-dimensional space (your original dataset) into probabilities. It then arranges the data in lower dimensions (like 2D or 3D) while trying to maintain the structure from the high-dimensional data.
  • It places similar points close together and dissimilar points far apart, giving you a clear, visual separation of groups or clusters.
3.2.4 Example Use of t-SNE

Let’s say you have a dataset with 100 features, like customer age, spending, income, preferences, etc. Plotting all of that data directly would be impossible. t-SNE helps you visualize the clusters (groups of similar customers) in just 2 or 3 dimensions, making it easier to understand patterns, without losing too much important information.

3.2.5 Key Differences Between PCA and t-SNE:
  • PCA focuses on capturing the maximum variance in the data using linear combinations of features, and is often used for feature reduction and interpretation.
  • t-SNE is focused on visualizing data, often giving more meaningful and clearer clusters, but it doesn’t try to explain variance and is more computationally expensive than PCA.
3.2.6 When to Use t-SNE?
  • When your main goal is visualization, especially in situations where PCA doesn’t provide clear results.
  • When your data has complex, non-linear relationships that are hard to see in simpler dimensionality reduction methods like PCA.
3.2.7 In Summary
  • PCA is great for simplifying data and capturing large-scale patterns and variance.
  • t-SNE is excellent for visualizing complex, high-dimensional data and highlighting local clusters or groupings.

Practical Example:

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Load the CSV file into a pandas dataframe
df = pd.read_csv('diabetes.csv')  # Make sure the path is correct

# Print the first few rows to confirm the data is loaded
print(df.head())

# Using all features except for 'Outcome' for t-SNE
features = df.drop('Outcome', axis=1)

# Scaling the features
scaler = StandardScaler()
scaled_data = scaler.fit_transform(features)

# Applying t-SNE to reduce to 2 dimensions
tsne = TSNE(n_components=2, random_state=0)
tsne_components = tsne.fit_transform(scaled_data)

# Visualizing the data after t-SNE
plt.scatter(tsne_components[:, 0], tsne_components[:, 1], c=df['Outcome'], cmap='viridis')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.title('t-SNE Visualization of Diabetes Data')
plt.colorbar(label='Outcome (0 = No Diabetes, 1 = Diabetes)')
plt.show()

This t-SNE plot provides a two-dimensional visualization of the high-dimensional diabetes dataset, reducing the features down to two t-SNE components. Here’s how to interpret the plot:

  1. Axes (t-SNE Components 1 and 2):
    • The x-axis represents the first t-SNE component, and the y-axis represents the second t-SNE component.
    • These components are abstract and do not directly correspond to any of the original features (like glucose level, age, etc.). Instead, they are dimensions created by the t-SNE algorithm to help us visualize how similar or different data points are to each other.
  2. Color (Outcome):
    • The color of the points represents the diabetes outcome (0 or 1), with:
      • Purple (darker points) representing individuals with no diabetes (Outcome = 0).
      • Yellow (lighter points) representing individuals with diabetes (Outcome = 1).
    • The colorbar on the right shows this range from 0 (no diabetes) to 1 (diabetes).
  3. Clustering:
    • Points that are closer together in the t-SNE plot are more similar in terms of the original features (e.g., glucose levels, BMI, etc.).
    • Areas with dense clusters of similarly colored points may indicate groups with similar diabetes outcomes.
    • For example, you can see some clustering of purple points (no diabetes) and yellow points (diabetes), indicating some separation between the two classes. However, there is overlap, which may suggest that some features of individuals with and without diabetes are similar.
  4. Overlapping Regions:
    • In regions where purple and yellow dots are mixed, the features of individuals with and without diabetes overlap, suggesting that there isn’t a clear separation between these two groups for some data points. This could point to areas where diabetes prediction might be more challenging based on the given features.

Overall, the plot helps you visualize how the t-SNE algorithm has grouped individuals based on their features and how those groups relate to the diabetes outcome. If you want more distinct clusters, you might explore other techniques like clustering methods (e.g., K-Means) or experiment with different t-SNE parameters.

Output: The t-SNE plot displays the diabetes records positioned by similarity, making it easy to identify clusters and see how they relate to the Outcome label.
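As suggested above, you can experiment with t-SNE parameters; perplexity (roughly, the number of neighbors each point is compared against) usually has the biggest effect on the picture. Here is a hedged sketch reusing scaled_data and df from the example above; the perplexity values are illustrative, not recommendations.

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Compare embeddings for a few perplexity values side by side
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, perplexity in zip(axes, [5, 30, 50]):
    embedding = TSNE(n_components=2, perplexity=perplexity, random_state=0).fit_transform(scaled_data)
    ax.scatter(embedding[:, 0], embedding[:, 1], c=df['Outcome'], cmap='viridis', s=10)
    ax.set_title(f'perplexity = {perplexity}')
plt.show()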

4. Anomaly Detection

Anomaly detection can identify rare and unusual observations in a dataset, which could represent fraud, system failures, or other interesting outliers.

Techniques Commonly Used for Anomaly Detection:

  • Isolation Forest: As you’re using, it works by isolating data points that behave differently from the rest of the data.
  • Autoencoders: In deep learning, autoencoders can be trained to compress and reconstruct data. Inputs that don't follow the normal pattern produce high reconstruction errors, which signal an anomaly.
  • One-Class SVM: This technique finds a decision boundary that separates “normal” data points from potential anomalies.
  • Local Outlier Factor (LOF): LOF measures the local density of data points, labeling points in sparsely populated regions as anomalies.

Practical Example:

# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Example dataset (replace with your actual data)
data = {
    'Annual Income': [15, 16, 17, 18, 19, 20, 1000, 21, 22, 23],
    'Spending Score': [39, 81, 6, 77, 40, 76, 90, 14, 25, 73]
}

df = pd.DataFrame(data)

# Scaling the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df[['Annual Income', 'Spending Score']])

# Using Isolation Forest for anomaly detection
isolation_forest = IsolationForest(contamination=0.1, random_state=0)
df['Anomaly'] = isolation_forest.fit_predict(scaled_data)

# Visualizing anomalies
plt.scatter(df['Annual Income'], df['Spending Score'], c=df['Anomaly'], cmap='coolwarm')
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.title('Anomaly Detection in Customer Data')
plt.show()

The resulting scatter plot shows the results of anomaly detection using the Isolation Forest algorithm. Here's a breakdown of the visualization:

  1. Axes:
    • The x-axis represents “Annual Income” (ranging from 0 to around 1000).
    • The y-axis represents “Spending Score” (ranging from 0 to around 90).
  2. Points:
    • The data points in red represent “normal” data points, as identified by the Isolation Forest.
    • The blue point on the far right (Annual Income = 1000) represents an “anomaly,” which is likely due to its significantly higher income compared to the other data points.

Key observations:

  • The anomaly stands out because the “Annual Income” of the blue point is much larger than the rest of the data points, which are clustered between 15 and 25 in terms of income.
  • Isolation Forest identified this large income discrepancy as an anomaly.

If this matches expectations for the data, the model is working as intended. To tweak the sensitivity of the anomaly detection, adjust the contamination parameter of the IsolationForest model.

Output: Customers classified as anomalies will be colored differently, helping businesses detect unusual spending behavior.
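The techniques list earlier in this section also mentions Local Outlier Factor (LOF). As a rough comparison with Isolation Forest, here is a hedged sketch applying LOF to the same scaled data; the n_neighbors and contamination values are illustrative assumptions.

from sklearn.neighbors import LocalOutlierFactor

# LOF flags points in sparsely populated regions; -1 = anomaly, 1 = normal
lof = LocalOutlierFactor(n_neighbors=5, contamination=0.1)
df['LOF_Anomaly'] = lof.fit_predict(scaled_data)
print(df[['Annual Income', 'Spending Score', 'Anomaly', 'LOF_Anomaly']])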

5. Hands-on Project: Customer Segmentation Using K-Means

Objective:

The goal is to segment customers based on their purchasing behavior using K-Means and visualize the clusters.

Dataset:

We’ll use a customer dataset containing features like Annual Income, Spending Score, and Age to create customer segments.

Live Code Example:

# Import required libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Sample dataset
data = {
    'Age': [23, 45, 34, 54, 32, 65, 29, 40, 50, 61],
    'Annual Income': [15, 16, 17, 19, 20, 80, 85, 88, 90, 95],
    'Spending Score': [39, 81, 6, 77, 40, 5, 3, 90, 6, 77]
}
df = pd.DataFrame(data)

# Scale the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)

# Apply K-Means clustering
kmeans = KMeans(n_clusters=3, random_state=42)
df['Cluster'] = kmeans.fit_predict(scaled_data)

# Analyze clusters
print(df.groupby('Cluster').mean())

# Visualize the customer segments
plt.scatter(df['Annual Income'], df['Spending Score'], c=df['Cluster'], cmap='viridis')
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.title('Customer Segmentation with K-Means')
plt.show()

Output: This code segments customers into three distinct groups based on their income and spending patterns. Businesses can use this insight to design marketing strategies tailored to different customer groups.
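The project fixes the number of clusters at three. A quick way to sanity-check that choice is the silhouette score (values closer to 1 indicate better-separated clusters); this sketch simply reuses scaled_data from the code above.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Compare silhouette scores for a few candidate values of K
for k in range(2, 6):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(scaled_data)
    print(f'K = {k}: silhouette score = {silhouette_score(scaled_data, labels):.3f}')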

6. Conclusion

Clustering algorithms like K-Means and Hierarchical Clustering, paired with dimensionality reduction techniques like PCA and t-SNE, offer powerful ways to gain insights from data. By applying these techniques, businesses can segment customers, detect anomalies, and simplify high-dimensional data for visualization. The hands-on examples provided in this guide give you a practical understanding of how to implement these methods in real-world scenarios.
