Introduction to Natural Language Processing in Python
In today’s digital age, sentiment analysis plays a crucial role in understanding the opinions and feelings expressed in written content. From customer reviews to social media posts, sentiment analysis helps businesses and individuals gain insights into public opinion, enabling data-driven decisions. One of the most common applications of sentiment analysis is in the movie industry, where analyzing movie reviews can provide valuable insights into audience feedback.
1. What is NLP?
- Definition: NLP is a field of AI that enables machines to understand, interpret, and generate human language. It bridges the gap between computers and human languages, making it possible for machines to perform tasks like translation, sentiment analysis, and summarization.
- Real-world applications: Voice assistants and chatbots (e.g., Siri, Alexa), sentiment analysis, machine translation, automated customer support, content generation.
2. Key Concepts in NLP
- Tokenization: The process of breaking text into smaller units, like words or sentences. For example, the sentence “I love AI” becomes [“I”, “love”, “AI”].
  - Types: Word-level tokenization, sentence-level tokenization.
- Stop Words: Commonly used words (e.g., “the”, “is”, “in”) that are usually removed because they carry little meaning in many tasks.
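- Stemming and Lemmatization:
  - Stemming: Reducing words to a root form by stripping suffixes with heuristic rules (e.g., “running” → “run”). The result is not always a real word.
  - Lemmatization: Converting words to their dictionary form (lemma) using vocabulary and part-of-speech information, which is usually more accurate than stemming (e.g., “better” → “good”).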
- Bag of Words (BoW): Represents text as a collection (bag) of words without considering the order. It creates a frequency distribution of words in the text.
- TF-IDF (Term Frequency-Inverse Document Frequency): A statistical measure used to evaluate how important a word is in a document relative to a collection of documents (corpus). It helps reduce the impact of common words and emphasizes important ones.
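The concepts above can be sketched in plain Python. This is a minimal illustration, not how NLTK or scikit-learn implement them: the stop word list is a tiny made-up sample, the helper names (`tokenize`, `remove_stop_words`, `bag_of_words`, `tf_idf`) are hypothetical, and the TF-IDF formula is the textbook one (term frequency times `log(N / document frequency)`). scikit-learn's `TfidfVectorizer` uses a smoothed, normalized variant, so its numbers will differ.

```python
import math
import string
from collections import Counter

# Illustrative stop word list (an assumption; NLTK's English list is much longer).
STOP_WORDS = {"i", "the", "is", "in", "a", "it", "was"}


def tokenize(text):
    """Word-level tokenization: lowercase, strip punctuation, split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [token for token in tokens if token not in STOP_WORDS]


def bag_of_words(tokens):
    """Count how often each token occurs, ignoring word order."""
    return Counter(tokens)


def tf_idf(docs):
    """Textbook TF-IDF: (count / doc length) * log(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


corpus = ["I love AI", "I love the movie", "The movie is boring"]
docs = [remove_stop_words(tokenize(text)) for text in corpus]
scores = tf_idf(docs)

print(tokenize("I love AI"))
print(bag_of_words(docs[0]))
print(scores[0])
```

In the first document, “ai” scores higher than “love” because “love” also appears in a second document, which is the effect TF-IDF is designed to produce.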
3. Text Processing Techniques
- Text Preprocessing:
  - Lowercasing: Convert all text to lowercase to standardize words (e.g., “AI”, “ai”, and “Ai” become “ai”).
  - Removing Punctuation: Cleans the text by stripping characters such as commas and periods.
  - Removing Numbers: Digits can be dropped when they carry no signal for the task.
  - Removing Stop Words: Eliminates frequent words that don’t add much meaning to the text.
- Word Embeddings:
  - Word2Vec/GloVe: Methods that map words into a continuous vector space where semantically similar words have similar vectors.
  - Example: “King” and “Queen” would have vectors that are closely related.
- Sentence Embeddings: Represent an entire sentence as a single vector, used in tasks like sentence classification or similarity matching.
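To make "similar vectors" concrete, the sketch below compares word vectors with cosine similarity. The three-dimensional vectors are invented for illustration (real Word2Vec or GloVe vectors have hundreds of dimensions and are learned from data), and averaging word vectors is only one simple baseline for building a sentence embedding. The names `cosine_similarity`, `toy_vectors`, and `sentence_embedding` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy vectors, invented for this example only.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "banana": [0.1, 0.0, 0.9],
}


def sentence_embedding(tokens):
    """Average the vectors of known tokens (a simple baseline, not a learned model)."""
    known = [toy_vectors[token] for token in tokens if token in toy_vectors]
    return [sum(values) / len(known) for values in zip(*known)]


royal_similarity = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
fruit_similarity = cosine_similarity(toy_vectors["king"], toy_vectors["banana"])

print(f"king vs queen:  {royal_similarity:.3f}")
print(f"king vs banana: {fruit_similarity:.3f}")
print(sentence_embedding(["king", "queen"]))
```

With these toy values, “king” and “queen” point in almost the same direction, while “king” and “banana” do not.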
4. NLP Pipeline
- Step 1: Text Preprocessing: Apply tokenization, stop word removal, and stemming or lemmatization.
- Step 2: Feature Extraction:
  - Bag of Words or TF-IDF for traditional methods.
  - Word Embeddings for semantic understanding.
- Step 3: Model Building: Use a machine learning or deep learning model depending on the task (e.g., Naive Bayes for text classification, LSTMs for sequence modeling).
- Step 4: Evaluation: Use metrics like accuracy, precision, recall, and F1 score to evaluate the model’s performance.
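The evaluation metrics in Step 4 can be computed by hand for a binary task, which helps when reading a classification report. The sketch below is a plain-Python illustration with made-up labels. The function name `binary_metrics` is hypothetical, and undefined ratios are set to 0.0 here as a simplifying assumption.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Assumption: treat undefined ratios (division by zero) as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels: 1 = positive review, 0 = negative review.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Precision answers "of the reviews predicted positive, how many really were?", while recall answers "of the truly positive reviews, how many were found?".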
5. Common NLP Tasks
- Text Classification: Categorizing text into predefined classes (e.g., spam detection in emails, sentiment analysis of reviews).
- Named Entity Recognition (NER): Identifying entities in text such as names, locations, organizations, etc.
- Part-of-Speech (POS) Tagging: Assigning word types to each token (e.g., noun, verb, adjective).
- Machine Translation: Automatically translating text from one language to another.
- Question Answering: Building systems that can answer questions posed in natural language (e.g., search engines).
- Sentiment Analysis: Determining the emotional tone of a text (e.g., positive, negative, neutral reviews).
6. Popular NLP Libraries
- NLTK (Natural Language Toolkit): A Python library for text processing that supports tasks like tokenization, stemming, and tagging.
- spaCy: A fast, industrial-strength NLP library used for large-scale processing.
- Transformers (Hugging Face): A library of state-of-the-art deep learning NLP models such as BERT and GPT.
- Gensim: Specialized in word embeddings and topic modeling.
Let’s build a simple text classification project that classifies movie reviews as positive or negative. We’ll use TF-IDF for feature extraction and the Naive Bayes classifier from scikit-learn. This will give a good introduction to NLP concepts, text processing, and model building.
We’ll use the following Python libraries:
- Pandas: For data manipulation.
- Scikit-learn: For machine learning models and TF-IDF vectorization.
- NLTK: For text preprocessing like stop word removal.
Here’s the code for a basic movie sentiment classification project:
# python -m venv env
# env/Scripts/activate
# python.exe -m pip install --upgrade pip
# pip install pandas scikit-learn nltk --cache-dir "D:/internship/movie_reviews/.cache"
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
import nltk
from nltk.corpus import stopwords
import string
# Download stopwords from NLTK
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
# Example dataset (you can replace this with the IMDB dataset or any other)
# Here's a sample for illustration
data = {
    'review': [
        "I loved the movie, it was fantastic!",
        "The movie was horrible, I hated it.",
        "Amazing movie with stunning visuals.",
        "Worst movie I've seen in years, boring.",
        "It was an okay movie, not the best.",
        "What a waste of time, terrible plot.",
        "I enjoyed the film, great acting.",
        "This movie is awful, wouldn't recommend."
    ],
    'sentiment': ['positive', 'negative', 'positive', 'negative',
                  'neutral', 'negative', 'positive', 'negative']
}
# Convert the data into a DataFrame
df = pd.DataFrame(data)
# Preprocessing function to clean the text
def preprocess_text(text):
    # Lowercase the text
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove stop words
    tokens = [word for word in text.split() if word not in stop_words]
    return ' '.join(tokens)
# Apply the preprocessing to the 'review' column
df['cleaned_review'] = df['review'].apply(preprocess_text)
# Splitting data into features (X) and labels (y)
X = df['cleaned_review']
y = df['sentiment']
# Convert labels into binary (positive = 1, negative/neutral = 0)
y = y.apply(lambda x: 1 if x == 'positive' else 0)
# Step 1: TF-IDF Vectorization
tfidf = TfidfVectorizer(max_features=500)
X_tfidf = tfidf.fit_transform(X)
# Step 2: Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)
# Step 3: Naive Bayes Classifier
model = MultinomialNB()
model.fit(X_train, y_train)
# Step 4: Make predictions
y_pred = model.predict(X_test)
# Step 5: Evaluation
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
print("Classification Report:\n", classification_report(y_test, y_pred, zero_division=1))
Project Explanation:
- Data Loading: A sample dataset is created using a dictionary. You can replace this with the actual IMDB dataset or any other review dataset in CSV format.
- Preprocessing: We convert the reviews to lowercase, remove punctuation, and eliminate stopwords using NLTK. This step ensures that the text is clean for the model to process.
- TF-IDF Vectorization: We convert the cleaned text data into numerical features using TF-IDF. This transformation emphasizes important words and reduces the influence of common ones.
- Model Building: We use a Naive Bayes classifier to train the model on the transformed features.
- Evaluation: The model’s accuracy is printed, and a detailed classification report (precision, recall, F1-score) is generated.
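One caveat about the steps above: the script fits the TF-IDF vectorizer on every review before splitting, so vocabulary and IDF statistics from the test reviews influence the features. A common way to avoid this is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline. The sketch below is one possible arrangement, not the article's original code. The review list, the binary labels, and `test_size=0.25` are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Illustrative data (an assumption): 1 = positive, 0 = negative.
reviews = [
    "I loved the movie, it was fantastic!",
    "The movie was horrible, I hated it.",
    "Amazing movie with stunning visuals.",
    "Worst movie I've seen in years, boring.",
    "I enjoyed the film, great acting.",
    "What a waste of time, terrible plot.",
    "Great story and wonderful acting.",
    "This movie is awful, wouldn't recommend.",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Split the raw text first so the test reviews stay unseen.
train_texts, test_texts, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training text only, then trains Naive Bayes.
pipeline = make_pipeline(TfidfVectorizer(max_features=500), MultinomialNB())
pipeline.fit(train_texts, y_train)

# At prediction time the same fitted vectorizer transforms the test text.
predictions = pipeline.predict(test_texts)
print(list(zip(test_texts, predictions)))
```

A pipeline object can also be saved with `joblib.dump` as a single file, which keeps the vectorizer and the model together.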
Output
[nltk_data] Downloading package stopwords to
[nltk_data] C:\Users\shubh\AppData\Roaming\nltk_data...
[nltk_data] Package stopwords is already up-to-date!
Accuracy: 50.00%
Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.50      0.67         2
           1       0.00      1.00      0.00         0

    accuracy                           0.50         2
   macro avg       0.50      0.75      0.33         2
weighted avg       1.00      0.50      0.67         2
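Dataset Note:
The sample dataset has only eight reviews, and just two of them are held out for testing, so the scores above say little about real performance. You can use a larger dataset for more comprehensive training, such as: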
- IMDB movie reviews dataset -> https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
- Sentiment140 dataset -> https://www.kaggle.com/datasets/kazanova/sentiment140
Complete code for training and saving model using IMDB dataset
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
import nltk
from nltk.corpus import stopwords
import string
import joblib
# Download stopwords from NLTK
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
# Load your dataset from CSV (update the path to your actual dataset location)
df = pd.read_csv("D:/internship/movie_reviews/datasets/IMDB Dataset.csv")
# Preprocessing function to clean the text
def preprocess_text(text):
    # Lowercase the text
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove stop words
    tokens = [word for word in text.split() if word not in stop_words]
    return ' '.join(tokens)
# Apply the preprocessing to the 'review' column
df['cleaned_review'] = df['review'].apply(preprocess_text)
# Convert sentiment labels to binary (assuming 'positive' or 'negative' sentiment values)
df['sentiment'] = df['sentiment'].apply(lambda x: 1 if x == 'positive' else 0)
# Splitting data into features (X) and labels (y)
X = df['cleaned_review']
y = df['sentiment']
# Step 1: TF-IDF Vectorization
tfidf = TfidfVectorizer(max_features=500)
X_tfidf = tfidf.fit_transform(X)
# Step 2: Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)
# Step 3: Naive Bayes Classifier
model = MultinomialNB()
model.fit(X_train, y_train)
# Step 4: Save the model and the TF-IDF vectorizer
joblib.dump(model, 'naive_bayes_model.pkl') # Save the trained model
joblib.dump(tfidf, 'tfidf_vectorizer.pkl') # Save the TF-IDF vectorizer
# Step 5: Make predictions
y_pred = model.predict(X_test)
# Step 6: Evaluation
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
print("Classification Report:\n", classification_report(y_test, y_pred, zero_division=1))
# Function to load model and vectorizer, and generate prediction from new data
def predict_sentiment(new_review):
    # Load the saved model and vectorizer
    model = joblib.load('naive_bayes_model.pkl')
    tfidf = joblib.load('tfidf_vectorizer.pkl')
    # Preprocess the new review
    cleaned_review = preprocess_text(new_review)
    # Transform the new review using the saved TF-IDF vectorizer
    X_new = tfidf.transform([cleaned_review])
    # Predict the sentiment
    prediction = model.predict(X_new)
    # Return the predicted sentiment (1 for positive, 0 for negative)
    return "positive" if prediction[0] == 1 else "negative"
# Example usage of the prediction function
new_review = "This movie was fantastic! I really enjoyed the story and the acting."
predicted_sentiment = predict_sentiment(new_review)
print(f"The predicted sentiment for the review is: {predicted_sentiment}")
Complete code for getting predictions from the saved model
import joblib
import string
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
def preprocess_text(text):
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = [word for word in text.split() if word not in stop_words]
    return ' '.join(tokens)

def predict_sentiment(new_review):
    model = joblib.load('naive_bayes_model.pkl')
    tfidf = joblib.load('tfidf_vectorizer.pkl')
    cleaned_review = preprocess_text(new_review)
    X_new = tfidf.transform([cleaned_review])
    prediction = model.predict(X_new)
    return "positive" if prediction[0] == 1 else "negative"
new_review = "This movie was fantastic! I really enjoyed the story and the acting."
predicted_sentiment = predict_sentiment(new_review)
print(f"The predicted sentiment for the review is: {predicted_sentiment}")
generate_output.py file for load testing
import joblib
import string
from nltk.corpus import stopwords
import time
stop_words = set(stopwords.words('english'))
def preprocess_text(text):
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = [word for word in text.split() if word not in stop_words]
    return ' '.join(tokens)

def predict_sentiment(new_review):
    # The model and vectorizer are reloaded on every call, so load time is part of each measurement
    model = joblib.load('naive_bayes_model.pkl')
    tfidf = joblib.load('tfidf_vectorizer.pkl')
    cleaned_review = preprocess_text(new_review)
    X_new = tfidf.transform([cleaned_review])
    prediction = model.predict(X_new)
    return "positive" if prediction[0] == 1 else "negative"

# Time 1000 consecutive predictions and print the duration of each one
for i in range(0, 1000):
    start_time = time.time()
    new_review = "This movie was fantastic! I really enjoyed the story and the acting."
    predicted_sentiment = predict_sentiment(new_review)
    print(f"The predicted sentiment for the review is: {predicted_sentiment}")
    end_time = time.time()
    time_difference = end_time - start_time
    print(time_difference)
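Printing 1000 individual timings is hard to read, so it can help to collect the durations and summarize them. The sketch below is one way to do that using only the standard library. `stand_in_predict` is a hypothetical placeholder that replaces `predict_sentiment` so the example runs without the saved model files, and `time_calls` and `summarize` are illustrative helper names. `time.perf_counter` is used because it is a high-resolution clock intended for measuring short durations.

```python
import statistics
import time


def time_calls(func, arg, repeats=1000):
    """Call func(arg) repeatedly and return the duration of each call in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(arg)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Reduce a list of durations to a few summary statistics."""
    ordered = sorted(durations)
    p95_index = min(len(ordered) - 1, int(0.95 * len(ordered)))
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[p95_index],
        "max": ordered[-1],
    }


def stand_in_predict(review):
    """Hypothetical placeholder for predict_sentiment (no model files needed)."""
    return "positive" if "fantastic" in review.lower() else "negative"


durations = time_calls(stand_in_predict, "This movie was fantastic!", repeats=100)
stats = summarize(durations)
for name, value in stats.items():
    print(f"{name}: {value * 1000:.4f} ms")
```

To measure the real model, pass `predict_sentiment` in place of `stand_in_predict`. Comparing a version that loads the model on every call with one that loads it once shows how much of each request is spent on loading.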
generate_output_flask.py
# pip install Flask
import joblib
import string
from nltk.corpus import stopwords
from flask import Flask, request, jsonify
# Initialize the Flask app
app = Flask(__name__)
# Load the pre-trained model and vectorizer
model = joblib.load('naive_bayes_model.pkl')
tfidf = joblib.load('tfidf_vectorizer.pkl')
# Set of stopwords
stop_words = set(stopwords.words('english'))
def preprocess_text(text):
    """Preprocess the input text by lowercasing, removing punctuation, and stop words."""
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = [word for word in text.split() if word not in stop_words]
    return ' '.join(tokens)

def predict_sentiment(new_review):
    """Predict sentiment using the loaded model and preprocessed review text."""
    cleaned_review = preprocess_text(new_review)
    X_new = tfidf.transform([cleaned_review])
    prediction = model.predict(X_new)
    return "positive" if prediction[0] == 1 else "negative"

@app.route('/predict', methods=['POST'])
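def predict():
    """API endpoint to predict sentiment from user input."""
    # silent=True returns None instead of raising when the body is not valid JSON
    data = request.get_json(silent=True)
    # Reject missing bodies, non-object JSON, and non-string reviews with a 400
    if not isinstance(data, dict) or not isinstance(data.get('review'), str):
        return jsonify({"error": "Review text is required"}), 400
    new_review = data['review']
    predicted_sentiment = predict_sentiment(new_review)
    return jsonify({"sentiment": predicted_sentiment})
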
if __name__ == '__main__':
    # Run the Flask app
    app.run(debug=True)
CURL
curl --location 'http://127.0.0.1:5000/predict' \
--header 'Content-Type: application/json' \
--data '{
"review": "Movie was great"
}'
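The same request can be made from Python with the standard library. This is a minimal sketch that mirrors the curl command above. The helper names `build_request` and `fetch_sentiment` are hypothetical, and `fetch_sentiment` only works while the Flask app is running locally on port 5000, so the example below builds and prints the request without sending it.

```python
import json
import urllib.request

API_URL = "http://127.0.0.1:5000/predict"  # same address as the curl example


def build_request(review, url=API_URL):
    """Build a POST request carrying the review as a JSON body."""
    payload = json.dumps({"review": review}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_sentiment(review, url=API_URL):
    """Send the request and return the 'sentiment' field (needs the server running)."""
    with urllib.request.urlopen(build_request(review, url), timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))["sentiment"]


# Inspect the request without sending it.
request = build_request("Movie was great")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# With the Flask app running, uncomment to call the API:
# print(fetch_sentiment("Movie was great"))
```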