Sentiment Analysis with Logistic Regression: A Hands-On Approach
Sentiment analysis has become an essential tool for businesses to understand customer opinions and feedback. In this article, we’ll explore a simple implementation of sentiment analysis using logistic regression. The goal is to classify movie reviews as either positive or negative. We will use the IMDB dataset, train a logistic regression model, and evaluate its performance.
Step 1: Setting Up the Environment
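Before we dive into the code, we need to set up the environment. Run these commands in your terminal. They are written for Windows; on macOS or Linux, activate the environment with source venv/bin/activate instead. The --cache-dir option is optional and should point to a folder on your own machine: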
# Step 1: Set up the environment (only run these commands in the terminal)
python -m venv venv
./venv/Scripts/activate
python.exe -m pip install --upgrade pip
pip install scikit-learn pandas --cache-dir "D:\internship\supervised_learning\sentiment_analysis\.cache"
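You will also need the datasets, which can be downloaded from Kaggle:
https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?resource=download
https://www.kaggle.com/datasets/kazanova/sentiment140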
Step 2: Importing Necessary Libraries
We’ll use popular Python libraries such as pandas and scikit-learn to load the data, process it, and build our model.
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd
Step 3: Loading and Preprocessing Data
Here, we load the IMDB dataset and preprocess it by mapping the sentiment to binary values (1 for positive, 0 for negative). We then split the data into training and testing sets.
# Load dataset (IMDB Reviews)
data = pd.read_csv('IMDB Dataset.csv')
# Data Preprocessing
X = data['review']
y = data['sentiment'].map({'positive': 1, 'negative': 0}) # Mapping positive to 1, negative to 0
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
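One optional refinement, not part of the original code, is to pass stratify to train_test_split so that both splits keep the same share of positive and negative labels. The sketch below uses a made-up ten-item dataset (toy_texts and toy_labels are hypothetical names) to show the effect:

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: five negative (0) and five positive (1) labels
toy_texts = [f"review {i}" for i in range(10)]
toy_labels = [0] * 5 + [1] * 5

# stratify=toy_labels keeps the 50/50 class balance in both splits
texts_train, texts_test, labels_train, labels_test = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

print("Train labels:", sorted(labels_train))
print("Test labels:", sorted(labels_test))
```

With the real data, the equivalent call would add stratify=y to the split shown above.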
Step 4: Vectorizing the Text Data
We use TF-IDF (Term Frequency-Inverse Document Frequency) vectorization to convert the text into numerical features that a machine learning model can work with.
# TF-IDF Vectorization
tfidf = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
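To make the idea concrete, here is a small sketch on a made-up three-review corpus (toy_reviews and the other names are illustrative, not part of the tutorial code). It shows that each review becomes one row of numbers, and that a word appearing in fewer reviews receives a larger weight than a word appearing in many:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-review corpus, used only for illustration
toy_reviews = [
    "great movie great acting",
    "terrible movie",
    "great fun",
]

toy_tfidf = TfidfVectorizer()
toy_matrix = toy_tfidf.fit_transform(toy_reviews)

# One row per review, one column per distinct word in the vocabulary
print("Matrix shape:", toy_matrix.shape)
print("Vocabulary:", sorted(toy_tfidf.vocabulary_))

# In the second review, "movie" occurs in two reviews overall and "terrible"
# in only one, so "terrible" is treated as the more informative word
row = toy_matrix[1].toarray()[0]
movie_weight = row[toy_tfidf.vocabulary_["movie"]]
terrible_weight = row[toy_tfidf.vocabulary_["terrible"]]
print("movie:", round(movie_weight, 3), "terrible:", round(terrible_weight, 3))
```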
Step 5: Model Training and Evaluation
We will use Logistic Regression, a simple yet effective model for binary classification tasks. After training the model, we predict the sentiment on the test data and evaluate its performance.
# Model Training
model = LogisticRegression()
model.fit(X_train_tfidf, y_train)
# Prediction & Evaluation on Test Data
y_pred = model.predict(X_test_tfidf)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
Step 6: Predicting Sentiment for New Inputs
Finally, we can create a function to predict the sentiment of any new input review. This allows us to interactively predict whether a review is positive or negative.
# Function to Predict Sentiment for New Input
def predict_sentiment(review):
    review_tfidf = tfidf.transform([review])  # Transform input review using trained TF-IDF vectorizer
    prediction = model.predict(review_tfidf)[0]  # Predicted label for the single input (0 or 1)
    if prediction == 1:
        return "Positive"
    elif prediction == 0:
        return "Negative"
    else:
        return "Neutral"  # Fallback; not reached by a model trained on binary labels
# Input from user
while True:
    user_input = input("\nEnter a sentence for sentiment analysis (or type 'exit' to quit): ")
    if user_input.lower() == 'exit':
        break
    sentiment = predict_sentiment(user_input)
    print(f"The sentiment of the sentence is: {sentiment}")
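Because the model is trained on two labels, model.predict only ever returns 0 or 1, so the "Neutral" branch never fires. If a neutral outcome is wanted, one possible approach (an assumption on my part, not something the original code does) is to look at the predicted probability and treat values close to 0.5 as neutral. The sketch below uses a tiny made-up training set, and the names demo_vec, demo_model, predict_with_confidence, and neutral_band are all hypothetical:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy training data: two positive (1) and two negative (0) reviews
demo_texts = [
    "great film",
    "wonderful great acting",
    "awful film",
    "awful boring plot",
]
demo_labels = [1, 1, 0, 0]

demo_vec = TfidfVectorizer()
demo_model = LogisticRegression().fit(demo_vec.fit_transform(demo_texts), demo_labels)


def predict_with_confidence(review, neutral_band=0.1):
    """Return (label, probability of positive); near-0.5 probabilities count as Neutral."""
    proba = demo_model.predict_proba(demo_vec.transform([review]))[0]
    # Look up which probability column belongs to the positive class (label 1)
    p_positive = proba[list(demo_model.classes_).index(1)]
    if abs(p_positive - 0.5) < neutral_band:
        return "Neutral", p_positive
    return ("Positive" if p_positive > 0.5 else "Negative"), p_positive


for text in ["great", "awful", "xyzzy"]:
    print(text, "->", predict_with_confidence(text))
```

The width of the neutral band is an arbitrary choice here and would need tuning on real data.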
Results and Performance
After training the model, the accuracy score on the test data can be printed along with a classification report that provides more detailed performance metrics like precision, recall, and F1-score.
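As a worked example of how these metrics are derived, the sketch below computes precision, recall, and F1-score by hand from a confusion matrix. The labels are made up for illustration and are not results from the IMDB model:

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical true and predicted labels (1 = positive, 0 = negative)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_hat = [1, 1, 0, 0, 0, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of the reviews predicted positive, how many were right
recall = tp / (tp + fn)     # of the truly positive reviews, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print("By hand:", precision, recall, f1)
print("scikit-learn:", precision_score(y_true, y_hat), recall_score(y_true, y_hat), f1_score(y_true, y_hat))
```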
Complete code
# Step 1: Set up the environment (only run these commands in the terminal)
# python -m venv venv
# ./venv/Scripts/activate
# python.exe -m pip install --upgrade pip
# pip install scikit-learn pandas --cache-dir "D:\internship\supervised_learning\sentiment_analysis\.cache"
# https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?resource=download
# https://www.kaggle.com/datasets/kazanova/sentiment140
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd
# Load dataset (IMDB Reviews)
data = pd.read_csv('IMDB Dataset.csv')
# Data Preprocessing
X = data['review']
y = data['sentiment'].map({'positive': 1, 'negative': 0}) # Mapping positive to 1, negative to 0
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# TF-IDF Vectorization
tfidf = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
# Model Training
model = LogisticRegression()
model.fit(X_train_tfidf, y_train)
# Prediction & Evaluation on Test Data
y_pred = model.predict(X_test_tfidf)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
# Function to Predict Sentiment for New Input
def predict_sentiment(review):
    review_tfidf = tfidf.transform([review])  # Transform input review using trained TF-IDF vectorizer
    prediction = model.predict(review_tfidf)[0]  # Predicted label for the single input (0 or 1)
    if prediction == 1:
        return "Positive"
    elif prediction == 0:
        return "Negative"
    else:
        return "Neutral"  # Fallback; not reached by a model trained on binary labels
# Input from user
while True:
    user_input = input("\nEnter a sentence for sentiment analysis (or type 'exit' to quit): ")
    if user_input.lower() == 'exit':
        break
    sentiment = predict_sentiment(user_input)
    print(f"The sentiment of the sentence is: {sentiment}")
Sentiment Analysis of Twitter Data Using Logistic Regression: Key Differences and Insights
Sentiment analysis on social media platforms like Twitter is crucial for understanding public opinion and customer sentiment. In this article, we will explore a code implementation for analyzing the sentiment of tweets using Logistic Regression and compare it with a previously discussed sentiment analysis workflow on movie reviews. We will focus on differences in the dataset structure, data preprocessing, and handling of sentiments, all while keeping the same core model training process intact.
Step 1: Setting Up the Environment
Before diving into the code, let’s set up the environment by installing the required libraries. This remains the same as in the previous implementation.
Step 2: Loading and Preprocessing the Twitter Data
Unlike the previous IMDB dataset, which had a clearly defined structure, Twitter data often comes without headers, and we need to specify the encoding manually (in this case, ISO-8859-1). Let’s load the data and understand its structure:
import pandas as pd
# Load dataset with encoding specified and no header (since the file may not have a header row)
data = pd.read_csv('Twitter.csv', encoding='ISO-8859-1', header=None)
# Check the first few rows to understand the structure
print(data.head())
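The encoding matters because pandas reads files as UTF-8 by default, and bytes written in ISO-8859-1 are not always valid UTF-8. The short sketch below illustrates the problem with a single made-up word rather than the actual dataset:

```python
# "café" encoded as ISO-8859-1 stores "é" as the single byte 0xE9
raw_bytes = "café".encode("ISO-8859-1")

# The same bytes are not valid UTF-8, so decoding them as UTF-8 fails
try:
    raw_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the encoding the text was written in recovers the word
decoded = raw_bytes.decode("ISO-8859-1")

print("Decodes as UTF-8:", utf8_ok)
print("Decoded as ISO-8859-1:", decoded)
```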
Key Difference: Dataset Structure
- IMDB Reviews: In the previous implementation, the IMDB dataset had well-defined columns, where one column contained the movie reviews and another column contained the sentiment.
- Twitter Data: In this case, we do not have predefined headers, so the code assumes that the tweet text is in the 6th column (index 5), and sentiment is in the 1st column (index 0).
# Assuming the tweet text is in the 6th column (index 5) and sentiment in the 1st column (index 0)
X = data[5] # This column seems to contain the tweet text
y = data[0] # Assuming this column contains the sentiment
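Selecting columns by position works, but named columns are easier to read. As a sketch, assuming the file is the Sentiment140 dataset with its six columns in the order target, ids, date, flag, user, text (verify these names against the dataset description for your copy), read_csv can assign the names at load time. The two rows below are made up so that the example runs without the real file:

```python
import io

import pandas as pd

# Two made-up rows in the six-column, headerless layout
sample_csv = io.StringIO(
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","so tired today"\n'
    '"4","2","Mon Apr 06 22:19:49 PDT 2009","NO_QUERY","user_b","loving the sunshine"\n'
)

# Assumed column names; with the real file, also pass encoding='ISO-8859-1'
column_names = ["target", "ids", "date", "flag", "user", "text"]
sample = pd.read_csv(sample_csv, header=None, names=column_names)

tweets = sample["text"]      # same data as selecting column index 5
labels = sample["target"]    # same data as selecting column index 0
print(sample[["target", "text"]])
```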
Step 3: Handling Missing Data
Missing or NaN values are a common challenge in real-world datasets. Here, we check for missing values and drop the affected rows.
# Check for NaN values
print("Missing values in X:", X.isna().sum())
print("Missing values in y:", y.isna().sum())
# Drop rows where X or y is NaN
data.dropna(subset=[5, 0], inplace=True)
# Re-assign X and y after dropping NaNs
X = data[5]
y = data[0]
Key Difference: Data Preprocessing
- IMDB Reviews: The IMDB dataset was clean, and there was no need to handle missing values. The sentiment values were either “positive” or “negative,” which were easily mapped to binary values (1 or 0).
- Twitter Data: Here, we manually check for and remove any rows containing missing values in the tweet or sentiment columns. The sentiment labels may also need more complex handling since tweets often use a broader range of sentiment expressions.
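Beyond missing values, tweets often contain links, @mentions, and hashtags that add little sentiment signal. The original code does not clean these, so the following is only an optional sketch; clean_tweet is a hypothetical helper and its rules are one possible choice among many:

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")  # web links
MENTION_RE = re.compile(r"@\w+")               # @username mentions


def clean_tweet(text):
    """Remove links and mentions, drop the '#' sign, lowercase, and tidy spaces."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")  # keep the hashtag word, drop the symbol
    return " ".join(text.lower().split())


cleaned = clean_tweet("@bob Loving this! http://t.co/abc #Happy")
print(cleaned)
```

With the dataset loaded, this could be applied as X = X.apply(clean_tweet) before the train-test split.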
Step 4: Vectorizing the Text Data
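As in the IMDB implementation, we split the data into training and testing sets and then use TF-IDF vectorization to convert the tweet text into numerical features that can be used for model training.
# Train-Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# TF-IDF Vectorization
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)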
Step 5: Model Training and Evaluation
We use Logistic Regression to train the model and evaluate its performance. This process remains consistent with the previous implementation.
# Model Training
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train_tfidf, y_train)
# Prediction & Evaluation on Test Data
from sklearn.metrics import accuracy_score, classification_report
y_pred = model.predict(X_test_tfidf)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
Key Difference: Sentiment Labels
- IMDB Reviews: The sentiment labels were binary, positive (1) and negative (0), which made the classification task straightforward.
- Twitter Data: Twitter sentiment labels are typically more nuanced. In this dataset, the sentiment might be represented by numbers like 0 (negative), 4 (positive), and possibly other values for neutral sentiment.
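If you prefer the same 0/1 convention as the IMDB example, the numeric codes can be remapped before training. This step is not in the original code; the sketch below uses a made-up label column, and raw_labels and label_map are illustrative names. Any code outside the mapping is flagged rather than silently kept:

```python
import pandas as pd

# Hypothetical label column containing one unexpected code (2)
raw_labels = pd.Series([0, 4, 4, 0, 2])

# Map 0 -> 0 (negative) and 4 -> 1 (positive); anything else becomes NaN
label_map = {0: 0, 4: 1}
mapped = raw_labels.map(label_map)

unmapped_mask = mapped.isna()
print("Rows with unexpected labels:", int(unmapped_mask.sum()))

# Keep only the rows that mapped cleanly and store them as integers
binary_labels = mapped[~unmapped_mask].astype(int)
print(binary_labels.tolist())
```

If the labels are remapped this way, the prediction function should compare against 1 instead of 4.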
Step 6: Predicting Sentiment for New Input
Here, we write a function to predict the sentiment of a new tweet entered by the user. The logic is slightly adjusted to handle this dataset's sentiment labels, with 4 representing a positive tweet and 0 representing a negative tweet.
# Function to Predict Sentiment for New Input
def predict_sentiment(review):
    review_tfidf = tfidf.transform([review])  # Transform input review using trained TF-IDF vectorizer
    prediction = model.predict(review_tfidf)[0]  # Predicted label for the single input (0 or 4)
    if prediction == 4:
        return "Positive"
    elif prediction == 0:
        return "Negative"
    else:
        return "Neutral"  # Reached only if the training labels include other values
# Input from user
while True:
    user_input = input("\nEnter a sentence for sentiment analysis (type 'exit' to exit): ")
    if user_input.lower() == 'exit':
        break
    sentiment = predict_sentiment(user_input)
    print(f"The sentiment of the sentence is: {sentiment}")
Conclusion: Differences Between IMDB and Twitter Sentiment Analysis
- Data Structure: Twitter datasets often lack headers, requiring manual assignment of columns, unlike structured datasets like IMDB movie reviews.
- Data Preprocessing: Missing values and noise are more common in social media data, so handling missing values is an important step.
- Sentiment Labels: While IMDB reviews have simple binary labels (positive/negative), Twitter data can have more diverse sentiment labels, requiring more careful handling of classification.
- Text Length: Tweets are typically short and to the point, which can affect the quality and reliability of text vectorization methods like TF-IDF compared to the longer, more detailed IMDB reviews.
Both implementations follow the same steps of data preparation, vectorization, and training, but the nature of the datasets influences how we preprocess and handle the data for sentiment analysis.
Complete code
# Step 1: Set up the environment (only run these commands in the terminal)
# python -m venv venv
# ./venv/Scripts/activate
# python.exe -m pip install --upgrade pip
# pip install scikit-learn pandas --cache-dir "D:\internship\supervised_learning\sentiment_analysis\.cache"
# https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?resource=download
# https://www.kaggle.com/datasets/kazanova/sentiment140
import pandas as pd
# Load dataset with encoding specified and no header (since the file may not have a header row)
data = pd.read_csv('Twitter.csv', encoding='ISO-8859-1', header=None)
# Check the first few rows to understand the structure
print(data.head())
# Assuming the tweet text is in the 6th column (index 5) and sentiment in the 1st column (index 0)
X = data[5] # This column seems to contain the tweet text
y = data[0] # Assuming this column contains the sentiment
# Check for NaN values
print("Missing values in X:", X.isna().sum())
print("Missing values in y:", y.isna().sum())
# Drop rows where X or y is NaN
data.dropna(subset=[5, 0], inplace=True)
# Re-assign X and y after dropping NaNs
X = data[5]
y = data[0]
# Train-Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# TF-IDF Vectorization
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
# Model Training
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train_tfidf, y_train)
# Prediction & Evaluation on Test Data
from sklearn.metrics import accuracy_score, classification_report
y_pred = model.predict(X_test_tfidf)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
# Function to Predict Sentiment for New Input
def predict_sentiment(review):
    review_tfidf = tfidf.transform([review])  # Transform input review using trained TF-IDF vectorizer
    prediction = model.predict(review_tfidf)[0]  # Predicted label for the single input (0 or 4)
    if prediction == 4:
        return "Positive"
    elif prediction == 0:
        return "Negative"
    else:
        return "Neutral"  # Reached only if the training labels include other values
# Input from user
while True:
    user_input = input("\nEnter a sentence for sentiment analysis (type 'exit' to exit): ")
    if user_input.lower() == 'exit':
        break
    sentiment = predict_sentiment(user_input)
    print(f"The sentiment of the sentence is: {sentiment}")
By understanding these nuances, you can tailor your sentiment analysis approach depending on the platform and dataset you’re working with!