An Introduction to Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) Networks: Modeling Sequential Data
Recurrent Neural Networks (RNNs):
Imagine your brain is like a very smart calculator. Normally, calculators can only work with one number or a few numbers at a time. But in real life, when you read a book or listen to a story, your brain remembers what happened earlier and uses that memory to understand what’s going on now.
RNNs are a bit like your brain—they are special computer programs that can “remember” information from earlier. When they look at a new word in a sentence, for example, they don’t just see it by itself. They also “remember” the words that came before, so they can understand the full meaning of the sentence.
Long Short-Term Memory (LSTM):
Now, RNNs are great at remembering things, but sometimes they forget things too quickly, like when you forget something you learned last week. That’s where LSTMs come in.
LSTMs are a special type of RNN that are really good at remembering important things for a long time, and they know when to forget things that aren’t needed anymore. It’s like having a superpower in your brain that helps you remember the important parts of a story and forget the rest.
Example:
Imagine you’re reading a story about a dog. At first, you learn the dog’s name is Max. A regular program might forget the name later in the story, but an LSTM would remember the name of the dog throughout the whole story, helping it make sense of everything that happens to Max!
In short, RNNs and LSTMs help computers understand and remember things that happen over time, like sentences in a story, or even songs and videos!
In Professional Terms
Recurrent Neural Networks (RNNs):
Recurrent Neural Networks (RNNs) are a class of artificial neural networks designed to recognize patterns in sequences of data, such as time series, natural language, or sequential events. Unlike traditional feedforward neural networks, which assume that inputs are independent of each other, RNNs leverage the concept of temporal dependencies, meaning that they have a form of memory. They use this memory to retain information about previous inputs in the sequence, allowing them to model sequential data more effectively.
The key idea behind an RNN is the recurrent loop: the hidden state computed at one time step is fed back into the network at the next time step. This allows the network to carry information across the sequence, which is particularly useful for tasks like speech recognition, language modeling, and video analysis.
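To make the recurrent loop concrete, here is a minimal NumPy sketch of a single RNN step; the weight names and sizes are illustrative rather than taken from any particular library.
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3
W_x = rng.normal(size=(hidden_size, input_size))   # input-to-hidden weights
W_h = rng.normal(size=(hidden_size, hidden_size))  # hidden-to-hidden (recurrent) weights
b = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    # One vanilla RNN step: mix the new input with the memory carried over from the previous step
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

h = np.zeros(hidden_size)                      # initial hidden state
for x_t in rng.normal(size=(5, input_size)):   # a toy sequence of 5 time steps
    h = rnn_step(x_t, h)                       # the same weights are reused at every step
print(h)                                       # the final hidden state summarizes the whole sequence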
Long Short-Term Memory (LSTM) Networks:
One of the limitations of traditional RNNs is their inability to capture long-range dependencies due to the problem of vanishing and exploding gradients during backpropagation. To address this, Long Short-Term Memory (LSTM) networks were introduced. LSTMs are a specialized form of RNN that incorporate memory cells and gating mechanisms (input, output, and forget gates) to regulate the flow of information through the network.
The core innovation of LSTMs is their ability to store important information for extended periods while selectively forgetting irrelevant data. The gates control when to allow new input, when to forget past data, and when to output information from the memory cell. This makes LSTMs highly effective for tasks that require long-term memory retention, such as long text sequences, speech generation, or financial data prediction.
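As a rough illustration of the gating idea, here is a minimal NumPy sketch of the computations inside one LSTM cell; the variable names and dimensions are illustrative, and real implementations differ in details such as how the weights are fused and initialized.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b hold one weight matrix / bias per gate: 'f' (forget), 'i' (input), 'g' (candidate), 'o' (output)
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])  # forget gate: what to erase from memory
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])  # input gate: what new information to admit
    g = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])  # candidate values to write into memory
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])  # output gate: what part of memory to expose
    c_t = f * c_prev + i * g       # cell state: the long-term memory
    h_t = o * np.tanh(c_t)         # hidden state: the output at this time step
    return h_t, c_t

rng = np.random.default_rng(1)
n_in, n_hid = 4, 3
W = {k: rng.normal(size=(n_hid, n_in)) for k in 'figo'}
U = {k: rng.normal(size=(n_hid, n_hid)) for k in 'figo'}
b = {k: np.zeros(n_hid) for k in 'figo'}
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.normal(size=n_in), h, c, W, U, b)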
Key Applications:
- Natural Language Processing (NLP): RNNs and LSTMs are widely used in machine translation, text generation, and sentiment analysis due to their ability to understand context over sequences of words.
- Speech Recognition: They are integral to systems that convert spoken language into text by modeling the temporal relationships in speech signals.
- Time Series Forecasting: RNNs and LSTMs are highly effective in predicting trends and patterns in sequential data such as stock prices or weather conditions.
In summary, RNNs and LSTMs are powerful tools for handling sequential data, with LSTMs being particularly well-suited for capturing long-range dependencies in complex tasks.
Introduction
In today’s era of big data and online communication, understanding the sentiment behind written text is a powerful capability. Whether analyzing movie reviews, tweets, or customer feedback, sentiment analysis helps businesses and researchers gain valuable insights. In this article, we will build a Long Short-Term Memory (LSTM) network using Python and TensorFlow to perform sentiment analysis on the IMDB movie review dataset.
LSTM networks are a type of Recurrent Neural Network (RNN) designed to retain information over long sequences of data, making them especially effective for understanding text sequences and classifying them based on context.
Prerequisites
Before diving into the project, you’ll need the following:
- Python 3.x installed on your machine.
- A virtual environment set up to isolate dependencies. If you haven’t created one, you can use the following commands:
python -m venv venv
.\venv\Scripts\activate # Windows
source venv/bin/activate # macOS/Linux
- Install the necessary libraries using the following command:
pip install tensorflow pandas numpy scikit-learn nltk matplotlib
Step 1: Data Loading and Preprocessing
We’ll be using the IMDB Movie Reviews Dataset, which consists of 50,000 movie reviews labeled as either positive or negative. Our goal is to clean the data, tokenize the text, and then feed it into the LSTM model for classification.
Code Implementation:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
import re
# Download NLTK stopwords
nltk.download('stopwords')
# Load dataset (Assuming a CSV file with 'review' and 'sentiment' columns)
df = pd.read_csv("datasets/IMDB Dataset.csv") # Replace with actual path
Step 2: Data Preprocessing
Data preprocessing is a crucial step to clean and prepare the text for input into the model. We will:
- Remove special characters and numbers.
- Convert all text to lowercase.
- Remove stopwords (common words like “the” or “is” that don’t add much meaning).
# Preprocessing function to clean and tokenize the text
def preprocess_text(text):
    # Remove special characters and digits
    text = re.sub(r'\W', ' ', text)
    text = re.sub(r'\d', ' ', text)
    # Lowercase the text
    text = text.lower()
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    text = ' '.join([word for word in text.split() if word not in stop_words])
    return text
# Apply preprocessing to the reviews
df['cleaned_review'] = df['review'].apply(preprocess_text)
Step 3: Tokenization and Padding
After cleaning the data, we need to tokenize the text (convert words into numerical indices) and pad the sequences to ensure uniform input length.
# Convert the labels into binary format (1 for positive, 0 for negative)
label_encoder = LabelEncoder()
df['sentiment'] = label_encoder.fit_transform(df['sentiment']) # 1 for positive, 0 for negative
# Tokenize and pad sequences
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(df['cleaned_review'])
# Convert text to sequences
X = tokenizer.texts_to_sequences(df['cleaned_review'])
# Pad sequences to make all input sequences the same length (100 words long)
X = pad_sequences(X, maxlen=100)
# Convert sentiment labels to numpy array
y = df['sentiment'].values
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Tokenize and pad sequences
tokenizer = Tokenizer(num_words=5000)
Purpose: This line initializes a `Tokenizer` object from Keras, which will be used to convert text into numerical tokens.
- `Tokenizer`: The tokenizer converts words in the text to integer values based on word frequency. Each word is assigned a unique index, and more frequent words get lower integer indices.
- `num_words=5000`: Only the top 5,000 most frequent words in the dataset will be tokenized; words beyond this are ignored. This focuses the model on the most relevant words and reduces computational complexity.
Fit tokenizer on texts
tokenizer.fit_on_texts(df['cleaned_review'])
Purpose: This line fits the tokenizer to the `cleaned_review` column of the dataframe, which contains the preprocessed reviews.
What it does: The tokenizer learns the vocabulary of the dataset by scanning all the text and building a word index that maps each word to a unique integer. Word frequencies are counted, and more frequent words receive lower indices.
Convert text to sequences
X = tokenizer.texts_to_sequences(df['cleaned_review'])
Purpose: This line converts the text reviews into sequences of integers, where each word in a review is replaced by its integer index from the tokenizer's word index.
What it does: After the tokenizer is fit, each word in a review is mapped to its assigned integer. A sentence like "This movie is great" might become [34, 78, 15, 202], where each number corresponds to a word in the tokenizer's dictionary.
Result: Each review is now represented as a sequence of numbers.
Pad sequences
X = pad_sequences(X, maxlen=100)
Purpose: This line pads all the sequences (the reviews converted to sequences of numbers) so that they all have the same length.
What it does: The `pad_sequences` function adds padding (zeros by default) so that every review is exactly 100 tokens long. Reviews shorter than 100 words are padded with zeros at the beginning (or the end, depending on settings), and reviews longer than 100 words are truncated.
- `maxlen=100`: Every sequence is truncated or padded to 100 tokens. A uniform sequence length is required before feeding the data into a neural network.
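If you want to see these three steps in isolation, here is a small, self-contained illustration on two toy sentences (the exact indices in the comments are typical but depend on word frequencies in your own data):
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

toy_reviews = ["great movie great acting", "terrible movie"]
toy_tokenizer = Tokenizer(num_words=5000)
toy_tokenizer.fit_on_texts(toy_reviews)
print(toy_tokenizer.word_index)   # e.g. {'great': 1, 'movie': 2, 'acting': 3, 'terrible': 4}
sequences = toy_tokenizer.texts_to_sequences(toy_reviews)
print(sequences)                  # e.g. [[1, 2, 1, 3], [4, 2]]
padded = pad_sequences(sequences, maxlen=5)
print(padded)                     # zeros are added at the front so both rows are exactly 5 tokens long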
Step 4: Building the LSTM Model
We will now build the LSTM network using the Sequential API from Keras. The model includes an embedding layer, an LSTM layer, and a dense output layer for binary classification.
# Build the LSTM model
model = Sequential([
    Embedding(input_dim=5000, output_dim=128, input_length=100),  # Embedding layer
    LSTM(128, return_sequences=False),  # LSTM layer
    Dense(1, activation='sigmoid')  # Output layer for binary classification
])
# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# Display the model architecture
model.summary()
Building the Model using Sequential API
model = Sequential([
    Embedding(input_dim=5000, output_dim=128, input_length=100),  # Embedding layer
    LSTM(128, return_sequences=False),  # LSTM layer
    Dense(1, activation='sigmoid')  # Output layer for binary classification
])
`Sequential()`: The Sequential API in Keras allows you to build a linear stack of layers, where one layer follows another. It's a straightforward way to create models when each layer feeds into the next in sequence.
Embedding Layer:
Embedding(input_dim=5000, output_dim=128, input_length=100)
Purpose: The `Embedding` layer converts the integer-encoded words (from tokenization) into dense vectors of fixed size. Instead of representing words as discrete integers, the embedding maps each word to a continuous vector in a lower-dimensional space.
- `input_dim=5000`: The size of the vocabulary, i.e., the top 5,000 most frequent words in the dataset (as defined in the tokenizer).
- `output_dim=128`: The size of the dense word vector, also called the embedding dimension. Each word is represented by a vector of 128 values; you can think of this as the number of features used to describe each word.
- `input_length=100`: The length of the input sequences (100 words per review, as determined by padding). The `Embedding` layer uses this to determine the shape of its output.
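As a quick sanity check of what the embedding produces, this illustrative snippet passes a fake batch of two padded reviews through a standalone Embedding layer and prints the resulting shape:
import numpy as np
from tensorflow.keras.layers import Embedding

embedding = Embedding(input_dim=5000, output_dim=128)
toy_batch = np.random.randint(0, 5000, size=(2, 100))  # 2 "reviews", each 100 word indices long
vectors = embedding(toy_batch)
print(vectors.shape)  # (2, 100, 128): every word index becomes a 128-dimensional vector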
LSTM Layer:
LSTM(128, return_sequences=False)
Purpose: The Long Short-Term Memory (LSTM) layer is the core of the model. It processes the sequence of word embeddings and learns long-term dependencies between the words in the review.
- `LSTM(128)`: The number 128 is the number of LSTM units, or "memory cells." Each unit captures relationships and patterns in the input sequence.
- `return_sequences=False`: The LSTM returns only the output of the last time step (the final hidden state). Since we are performing classification, we don't need the intermediate output at every time step (every word in the sequence). If set to True, the LSTM would return an output at every time step, which is useful in tasks like sequence-to-sequence modeling.
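For contrast, here is a hypothetical variant that stacks two LSTM layers; every LSTM except the last uses return_sequences=True so that the next layer still receives a full sequence (the layer sizes are illustrative):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

stacked = Sequential([
    Embedding(input_dim=5000, output_dim=128),
    LSTM(128, return_sequences=True),   # emits a 128-dimensional output at every time step
    LSTM(64, return_sequences=False),   # emits only the final hidden state
    Dense(1, activation='sigmoid')
])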
Dense Layer (Output Layer):
Dense(1, activation='sigmoid')
Purpose: The `Dense` layer is a fully connected layer that serves as the output layer. It produces the final prediction of the sentiment (positive or negative).
- `Dense(1)`: The layer has a single output unit because this is a binary classification problem (positive or negative sentiment).
- `activation='sigmoid'`: The sigmoid activation squashes the output to a value between 0 and 1, which suits binary classification. The model outputs a probability score indicating how likely the review is to be positive (closer to 1) or negative (closer to 0).
Compiling the Model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
- `optimizer='adam'`: The Adam optimizer is a widely used optimization algorithm that adapts the learning rate during training for faster convergence. It is popular for its robustness across a wide range of problems.
- `loss='binary_crossentropy'`: Binary cross-entropy is the standard loss function for binary classification. It measures the difference between the predicted probabilities and the actual labels (1 for positive, 0 for negative).
- `metrics=['accuracy']`: Tells the model to track accuracy during training and evaluation. Accuracy shows how often the model's predictions are correct.
Displaying the Model Architecture
model.summary()
Purpose: model.summary() prints a summary of the model's architecture, including each layer, the shape of its output, and the number of parameters (weights). It helps you verify the structure of the model.
Step 5: Training the Model
We’ll train the LSTM model for 5 epochs with a batch size of 64, holding out 20% of the training data as a validation set to monitor the model’s performance.
# Train the model
history = model.fit(X_train, y_train, epochs=5, batch_size=64, validation_split=0.2)
- `model.fit`: This method trains the model on the training data (`X_train`, `y_train`).
- `X_train`: The input data for training, typically a NumPy array or tensor of features.
- `y_train`: The target (label) data corresponding to `X_train`. It contains the correct output for each input.
- `epochs=5`: The number of times the training algorithm works through the entire training dataset. Here, the model passes over the dataset 5 times.
- `batch_size=64`: The number of samples per gradient update. Instead of updating the model's parameters after every single sample, the model updates them after seeing 64 samples (a batch). Larger batch sizes can speed up training, while smaller batch sizes often help the model generalize better.
- `validation_split=0.2`: Sets aside 20% of the training data for validation. After each epoch, the model checks its performance on this validation set to monitor overfitting and generalization.
- `history`: The `fit` method returns a `History` object that stores the details of the training process, including loss and accuracy for both training and validation at each epoch. You can access this information via `history.history`.
Step 6: Evaluating the Model
Once training is complete, we’ll evaluate the model on the test set and generate a classification report to measure the model’s accuracy, precision, recall, and F1-score.
# Evaluate the model on the test set
test_loss, test_accuracy = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {test_accuracy * 100:.2f}%")
# Predict on test data
y_pred = (model.predict(X_test) > 0.5).astype("int32")
# Print the classification report
print(classification_report(y_test, y_pred))
- `model.evaluate(X_test, y_test)`: Evaluates the trained model on the test data (`X_test`, `y_test`). It returns two values:
  - `test_loss`: The loss on the test set.
  - `test_accuracy`: The accuracy of the model on the test set, i.e., how often it predicts the correct labels.
- `print(f"Test Accuracy: {test_accuracy * 100:.2f}%")`: Prints the test accuracy as a percentage, rounded to two decimal places.
- `y_pred = (model.predict(X_test) > 0.5).astype("int32")`:
  - `model.predict(X_test)`: Generates predictions for the test dataset. Since this is a binary classification problem, each output is a probability (a value between 0 and 1) indicating the model's confidence that the input belongs to class 1.
  - `(model.predict(X_test) > 0.5)`: Converts the predicted probabilities into binary classes. Predictions above 0.5 are classified as 1, and the rest as 0.
  - `.astype("int32")`: Converts the boolean predictions (`True`/`False`) into integers (`1`/`0`).
Step 7: Visualizing the Results
To better understand the training process, we can plot the training and validation accuracy/loss over epochs.
# Plot training & validation accuracy values
plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(loc='upper left')
plt.show()
- `history.history['accuracy']`: The training accuracy for each epoch.
- `history.history['val_accuracy']`: The validation accuracy for each epoch.
- `plt.plot(...)`: Plots both the training and validation accuracy over the epochs.
- `plt.title('Model Accuracy')`: Sets the plot title to "Model Accuracy."
- `plt.ylabel('Accuracy')`: Labels the y-axis as "Accuracy."
- `plt.xlabel('Epoch')`: Labels the x-axis as "Epoch" (accuracy is recorded once per epoch).
- `plt.legend(loc='upper left')`: Adds a legend to distinguish training from validation accuracy, positioned at the upper left.
- `plt.show()`: Displays the plot.
This plot helps you understand whether the model is improving over time and how it performs on both the training and validation sets.
# Plot training & validation loss values
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(loc='upper left')
plt.show()
- `history.history['loss']`: The training loss for each epoch.
- `history.history['val_loss']`: The validation loss for each epoch.
- `plt.plot(...)`: Plots both the training and validation loss over the epochs.
- `plt.title('Model Loss')`: Sets the plot title to "Model Loss."
- `plt.ylabel('Loss')`: Labels the y-axis as "Loss."
- `plt.xlabel('Epoch')`: Labels the x-axis as "Epoch" (loss is recorded once per epoch).
- `plt.legend(loc='upper left')`: Adds a legend to distinguish training from validation loss, positioned at the upper left.
- `plt.show()`: Displays the plot.
This plot helps you monitor whether the model’s loss decreases over time. Ideally, both the training and validation loss should decrease, but if the validation loss increases while training loss decreases, it might indicate overfitting.
Complete code
# python -m venv venv
# .\venv\Scripts\activate
# pip install tensorflow pandas numpy scikit-learn nltk matplotlib
# Import necessary libraries
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
import re
# Download NLTK stopwords
nltk.download('stopwords')
# Load dataset (Assuming a CSV file with 'review' and 'sentiment' columns)
df = pd.read_csv("datasets/IMDB Dataset.csv") # Replace with actual path
# Display the first few rows of the dataset
df.head()
# Preprocessing function to clean and tokenize the text
def preprocess_text(text):
    # Remove special characters and digits
    text = re.sub(r'\W', ' ', text)
    text = re.sub(r'\d', ' ', text)
    # Lowercase the text
    text = text.lower()
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    text = ' '.join([word for word in text.split() if word not in stop_words])
    return text
# Apply preprocessing to the reviews
df['cleaned_review'] = df['review'].apply(preprocess_text)
# Convert the labels into binary format (1 for positive, 0 for negative)
label_encoder = LabelEncoder()
df['sentiment'] = label_encoder.fit_transform(df['sentiment']) # 1 for positive, 0 for negative
# Tokenize and pad sequences
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(df['cleaned_review'])
# Convert text to sequences
X = tokenizer.texts_to_sequences(df['cleaned_review'])
# Pad sequences to make all input sequences the same length (100 words long)
X = pad_sequences(X, maxlen=100)
# Convert sentiment labels to numpy array
y = df['sentiment'].values
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Build the LSTM model
model = Sequential([
    Embedding(input_dim=5000, output_dim=128, input_length=100),  # Embedding layer
    LSTM(128, return_sequences=False),  # LSTM layer
    Dense(1, activation='sigmoid')  # Output layer for binary classification
])
# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# Display the model architecture
model.summary()
# Train the model
history = model.fit(X_train, y_train, epochs=5, batch_size=64, validation_split=0.2)
# Evaluate the model on the test set
test_loss, test_accuracy = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {test_accuracy * 100:.2f}%")
# Predict on test data
y_pred = (model.predict(X_test) > 0.5).astype("int32")
# Print the classification report
print(classification_report(y_test, y_pred))
# Plot training & validation accuracy values
plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(loc='upper left')
plt.show()
# Plot training & validation loss values
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(loc='upper left')
plt.show()
Key Observations:
- Training Accuracy (blue line):
- The training accuracy steadily increases over the epochs. This suggests that the model is learning and improving its performance on the training data.
- By the 5th epoch, the training accuracy reaches roughly 96%.
- Validation Accuracy (orange line):
- The validation accuracy starts at around 88%, above the first epoch’s training accuracy, then slightly decreases over the epochs and falls below the training accuracy from the second epoch onward.
- By the end of the 5th epoch, the validation accuracy is about 86%, which is lower than where it started.
Interpretation:
- Overfitting: The graph suggests that the model is overfitting. This is because the training accuracy keeps increasing, while the validation accuracy declines. In overfitting, the model performs well on the training data but fails to generalize to unseen (validation) data, as evidenced by the drop in validation accuracy.
What you can do:
- Regularization techniques like dropout, early stopping, or L2 regularization could help reduce overfitting.
- Increasing the size of the validation set or using techniques like cross-validation might give a more robust validation accuracy.
- Simplifying the model architecture (reducing layers or parameters) can also help prevent overfitting.
The plot is useful in diagnosing that the model is likely memorizing the training data instead of learning generalizable patterns.
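As one possible (untuned) sketch of these suggestions, the variant below adds dropout inside and after the LSTM and uses Keras's EarlyStopping callback to halt training once the validation loss stops improving; it reuses X_train and y_train from the steps above.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

regularized_model = Sequential([
    Embedding(input_dim=5000, output_dim=128),
    LSTM(128, dropout=0.2, recurrent_dropout=0.2),  # dropout on the inputs and on the recurrent connections
    Dropout(0.5),                                   # extra dropout before the output layer
    Dense(1, activation='sigmoid')
])
regularized_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Stop when validation loss has not improved for 2 epochs and keep the best weights
early_stop = EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)
history = regularized_model.fit(X_train, y_train, epochs=20, batch_size=64,
                                validation_split=0.2, callbacks=[early_stop])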
Key Observations:
- Training Loss (blue line):
- The training loss decreases significantly over the epochs, which indicates that the model is becoming better at minimizing errors on the training data. By the end of the 5th epoch, the training loss is quite low, around 0.12.
- Validation Loss (orange line):
- The validation loss, on the other hand, starts around 0.30 and increases over the epochs. By the end of the 5th epoch, it reaches around 0.40, which is notably higher than where it started.
Interpretation:
- Overfitting: Similar to the accuracy plot, this loss graph suggests overfitting. The training loss decreases steadily, meaning the model fits the training data well, but the validation loss increases, showing that the model is not generalizing well to the unseen validation data.
- Diverging Loss Curves: Ideally, both the training and validation loss should decrease together. However, in this case, while the training loss is decreasing, the validation loss is increasing, indicating that the model is “memorizing” the training data rather than learning generalized patterns.
Possible Actions:
- Early Stopping: Implementing early stopping could be beneficial, as the validation loss starts increasing early in the training process.
- Regularization: Techniques such as dropout or L2 regularization could help reduce overfitting.
- Reduce Model Complexity: You might also try simplifying the model, such as reducing the number of layers or parameters, to prevent it from overfitting the training data.
Summary:
This plot highlights that the model performs very well on the training set but struggles to generalize on the validation set. The increasing validation loss as training continues is a key indicator of overfitting. Adjusting the model or using regularization methods would likely improve performance on unseen data.
Output
2024-10-22 16:29:30.470815: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-10-22 16:29:31.572983: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[nltk_data] Downloading package stopwords to
[nltk_data] C:\Users\shubh\AppData\Roaming\nltk_data...
[nltk_data] Package stopwords is already up-to-date!
D:\internship\neural_networks\venv\lib\site-packages\keras\src\layers\core\embedding.py:90: UserWarning: Argument `input_length` is deprecated. Just remove it.
warnings.warn(
2024-10-22 16:29:59.101637: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type) ┃ Output Shape ┃ Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ embedding (Embedding) │ ? │ 0 (unbuilt) │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ lstm (LSTM) │ ? │ 0 (unbuilt) │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense (Dense) │ ? │ 0 (unbuilt) │
└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘
Total params: 0 (0.00 B)
Trainable params: 0 (0.00 B)
Non-trainable params: 0 (0.00 B)
Epoch 1/5
500/500 ━━━━━━━━━━━━━━━━━━━━ 46s 89ms/step - accuracy: 0.7739 - loss: 0.4422 - val_accuracy: 0.8789 - val_loss: 0.2982
Epoch 2/5
500/500 ━━━━━━━━━━━━━━━━━━━━ 43s 87ms/step - accuracy: 0.9111 - loss: 0.2255 - val_accuracy: 0.8756 - val_loss: 0.3113
Epoch 3/5
500/500 ━━━━━━━━━━━━━━━━━━━━ 44s 87ms/step - accuracy: 0.9318 - loss: 0.1793 - val_accuracy: 0.8712 - val_loss: 0.3396
Epoch 4/5
500/500 ━━━━━━━━━━━━━━━━━━━━ 44s 87ms/step - accuracy: 0.9455 - loss: 0.1488 - val_accuracy: 0.8650 - val_loss: 0.3934
Epoch 5/5
500/500 ━━━━━━━━━━━━━━━━━━━━ 44s 88ms/step - accuracy: 0.9558 - loss: 0.1206 - val_accuracy: 0.8579 - val_loss: 0.3875
313/313 ━━━━━━━━━━━━━━━━━━━━ 7s 22ms/step - accuracy: 0.8540 - loss: 0.3835
Test Accuracy: 85.58%
313/313 ━━━━━━━━━━━━━━━━━━━━ 7s 23ms/step
Epoch Results:
- Epoch 1:
- Training Accuracy: 77.39%
- Training Loss: 0.4422
- Validation Accuracy: 87.89%
- Validation Loss: 0.2982
- The model starts with a relatively high validation accuracy, suggesting a good starting point.
- Epoch 2:
- Training Accuracy: 91.11%
- Training Loss: 0.2255
- Validation Accuracy: 87.56%
- Validation Loss: 0.3113
- Training accuracy improves significantly, but validation accuracy slightly decreases. Validation loss starts to increase, a sign of overfitting.
- Epoch 3:
- Training Accuracy: 93.18%
- Training Loss: 0.1793
- Validation Accuracy: 87.12%
- Validation Loss: 0.3396
- Training continues to improve, but the gap between training and validation accuracy is growing, and the validation loss continues to increase, reinforcing signs of overfitting.
- Epoch 4:
- Training Accuracy: 94.55%
- Training Loss: 0.1488
- Validation Accuracy: 86.50%
- Validation Loss: 0.3934
- The gap between training and validation performance continues to widen. The model is now overfitting, as validation accuracy decreases while training accuracy increases.
- Epoch 5:
- Training Accuracy: 95.58%
- Training Loss: 0.1206
- Validation Accuracy: 85.79%
- Validation Loss: 0.3875
- By the final epoch, the model shows clear overfitting. The training accuracy is very high, but the validation accuracy keeps dropping, and the validation loss remains high.
Test Results:
- Test Accuracy: 85.58%
- Test Loss: 0.3835
- The test accuracy of 85.58% is close to the final validation accuracy, which suggests that the model is performing consistently between validation and test sets. However, the model might not generalize as well as it could due to overfitting.
Classification Report:
The classification report shows the performance of the model on each class (0 and 1). Here’s what the metrics mean:
- Precision: Out of all predicted positives, how many were actually correct.
- Recall: Out of all actual positives, how many did the model correctly identify.
- F1-score: The harmonic mean of precision and recall, balancing the two.
- Support: The number of instances of each class in the test set.
Both classes (0 and 1) have very similar performance, with precision, recall, and F1-score around 0.86, indicating that the model is fairly balanced across both classes.
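For reference, all of these metrics come from the confusion matrix. The short sketch below computes precision, recall, and F1 for the positive class by hand; it assumes y_test and y_pred from the evaluation step above.
from sklearn.metrics import confusion_matrix

# Counts of true negatives, false positives, false negatives, and true positives
tn, fp, fn, tp = confusion_matrix(y_test, y_pred.ravel()).ravel()
precision = tp / (tp + fp)   # of everything predicted positive, how much really was positive
recall = tp / (tp + fn)      # of everything actually positive, how much the model found
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")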
Summary:
- Overfitting: The model is overfitting to the training data, as seen in the widening gap between training and validation performance. This can be addressed by techniques like early stopping, dropout, or using regularization.
- Good Test Performance: Despite the overfitting, the model achieves a reasonable test accuracy of 85.58%, and the classification report shows balanced performance between the two classes.
Consider implementing techniques to reduce overfitting to improve validation performance and generalization.
              precision    recall  f1-score   support

           0       0.86      0.85      0.85      4961
           1       0.86      0.86      0.86      5039

    accuracy                           0.86     10000
   macro avg       0.86      0.86      0.86     10000
weighted avg       0.86      0.86      0.86     10000
Here’s a breakdown of what each metric represents in this specific case:
1. Class 0 and Class 1:
- The two classes in this binary classification task are:
- Class 0: Negative sentiment (e.g., negative movie reviews).
- Class 1: Positive sentiment (e.g., positive movie reviews).
- The classification report provides detailed performance metrics for each class.
2. Precision:
- Class 0 (Negative): Precision = 0.86 → Out of all the predictions the model made as negative, 86% of them were correct (i.e., truly negative).
- Class 1 (Positive): Precision = 0.86 → Out of all the predictions made as positive, 86% were correct (i.e., truly positive).
- Precision is important when the cost of false positives is high. In this case, precision reflects how many of the reviews predicted as negative or positive were actually negative or positive.
3. Recall:
- Class 0 (Negative): Recall = 0.85 → Out of all the actual negative reviews in the test set, the model correctly identified 85% of them.
- Class 1 (Positive): Recall = 0.86 → Out of all the actual positive reviews in the test set, the model correctly identified 86% of them.
- Recall is important when the cost of false negatives is high. Here, recall shows how well the model identifies all instances of positive or negative sentiment in the dataset.
4. F1-score:
- The F1-score is the harmonic mean of precision and recall, giving a balanced measure when both precision and recall are equally important:
- Class 0 (Negative): F1-score = 0.85
- Class 1 (Positive): F1-score = 0.86
- A high F1-score means the model is performing well in terms of both precision and recall for both classes.
5. Support:
- Support refers to the number of actual occurrences of each class in the test set:
- Class 0 (Negative): There are 4961 negative reviews in the test set.
- Class 1 (Positive): There are 5039 positive reviews in the test set.
- The support values are nearly balanced, which makes the overall accuracy and performance metrics easier to interpret without needing specific weighting for one class over the other.
6. Accuracy:
- Overall Accuracy: 86% → The model correctly predicted the sentiment for 86% of the reviews in the test set, regardless of class.
- Accuracy is a good general measure of how well the model is performing, but it does not account for class imbalances (which are minimal in this case).
7. Macro Average:
- Macro avg of precision, recall, and F1-score: These averages are calculated by taking the mean of the scores for each class without weighting by the number of instances in each class. Since both classes have similar support, the macro average values (0.86 for precision, recall, and F1-score) are the same as the class-level scores.
8. Weighted Average:
- Weighted avg: This average is weighted by the number of instances in each class (i.e., support), providing a summary that reflects the class distribution. Since both classes are almost equally represented in the dataset, the weighted average also results in 0.86 for precision, recall, and F1-score.
Summary:
- Performance: The model performs equally well for both positive and negative sentiment classification, achieving an overall accuracy of 86%.
- Balanced Classes: The dataset seems fairly balanced (roughly equal support for both classes), making the macro and weighted averages equivalent.
- Good Generalization: The F1-scores of 0.85 and 0.86 suggest that the model is performing well in predicting both positive and negative sentiments with a good balance between precision and recall.
Conclusion
In this article, we built a sentiment analysis model using LSTM networks. By leveraging the IMDB movie review dataset, we were able to achieve high accuracy by utilizing the LSTM’s ability to retain and understand the sequential relationships between words in a text.
This model can be extended and improved by experimenting with different hyperparameters, using additional layers like Bidirectional LSTMs, or applying more sophisticated preprocessing techniques. Moreover, you can deploy the trained model for real-world applications, such as analyzing customer feedback or sentiment on social media.
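As a sketch of one such extension, the snippet below wraps the LSTM in Keras's Bidirectional layer so the model reads each review in both directions (the unit count is an illustrative choice, not a tuned value):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional

bi_model = Sequential([
    Embedding(input_dim=5000, output_dim=128),
    Bidirectional(LSTM(64)),            # forward and backward outputs are concatenated (128 values total)
    Dense(1, activation='sigmoid')
])
bi_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])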
By following these steps, you now have a solid foundation for working with LSTM networks on text-based data!