Transcribing Video Files Using Whisper and Python

With the increasing use of multimedia content, automatic speech recognition (ASR) has become crucial in many applications, from generating subtitles to converting speech to text for various forms of analysis. In this guide, we demonstrate how to use OpenAI’s Whisper model to transcribe a video file, extract the audio, and generate subtitles in the form of an SRT file.

This method can be applied to different types of content, whether you’re working on educational videos, podcasts, or any other type of speech-heavy video content.

Step-by-Step Process

1. Set Up the Environment
Before diving into the script, make sure your environment is set up with the required packages. Use the following commands to create a virtual environment and install the dependencies:
python --version
python -m venv venv
venv/Scripts/activate
python.exe -m pip install --upgrade pip
pip install torch==2.2.1 transformers==4.40.0 datasets==2.18.0 moviepy==1.0.3 accelerate==0.30.1 librosa soundfile
pip uninstall numpy
pip install "numpy<2"
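
As an optional sanity check (a minimal sketch, not part of the original script), this snippet confirms that the key packages import correctly and whether a CUDA-capable GPU is visible:

import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())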

2. Environment Variables
Set up the environment variables needed for processing. HF_HOME points Hugging Face at a local cache directory for downloaded models, and PYTORCH_CUDA_ALLOC_CONF limits the size of CUDA memory blocks to reduce fragmentation:

import os
os.environ['HF_HOME'] = "D:/internship/speech_to_text/.cache"
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = "max_split_size_mb:128"

3. Loading the Model
The script uses the Whisper model (openai/whisper-small) to transcribe speech to text. The transformers library provides the utilities for loading and running the model:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Setting up device (GPU if available)
device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "openai/whisper-small"

# Load model and processor
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id).to(device)
processor = AutoProcessor.from_pretrained(model_id)

4. Speech-to-Text Pipeline
The Whisper model is wrapped in a pipeline to make transcription easier:

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,        # cap on generated tokens per chunk
    chunk_length_s=25,         # split long audio into 25-second chunks
    batch_size=8,              # number of chunks processed in parallel
    return_timestamps=True,    # needed for SRT timestamps
    device=device,
)

5. Processing the Video
The video is loaded with the moviepy library, and its audio track is extracted to a separate file:

import moviepy.editor as mp

def run(videoPath, audioPath, srtPath, pipe):
    video = mp.VideoFileClip(videoPath)
    audio = video.audio
    audio.write_audiofile(audioPath)
    video.close()
    # ... transcription and SRT writing continue in step 6

6. Generating Subtitles (SRT)
The audio file is passed through the ASR pipeline, which generates text and timestamps for each chunk of audio. The results are written to an SRT file, a subtitle format compatible with most video players. Before looking at the code, here is a brief overview of what an ASR pipeline actually is.

An ASR pipeline refers to a structured sequence of operations that automatically transcribe speech into text using Automatic Speech Recognition (ASR) technology. In the context of machine learning and natural language processing (NLP), a pipeline is a predefined workflow that processes data (in this case, audio) through several stages to achieve a desired output (such as a transcript).

Components of an ASR Pipeline

An ASR pipeline typically involves the following steps:

  1. Audio Input:
    The first step in the pipeline is to input the audio data that contains the speech. This audio could come from a video file, a podcast, a real-time recording, or any other speech-containing media.
  2. Preprocessing (Feature Extraction):
    Before the audio can be transcribed, it is preprocessed to extract important features, such as:
    • Spectrograms: Visual representations of the audio frequencies over time.
    • MFCC (Mel-frequency cepstral coefficients): A representation of the short-term power spectrum of sound, widely used in ASR systems.
    These features matter because they capture the patterns in the audio that correspond to different phonemes (the smallest units of sound in speech). A minimal feature-extraction sketch appears after this list.
  3. Speech Model:
    The core of the ASR pipeline is the speech recognition model. This model is trained on large datasets of speech and text pairs, and its task is to convert the audio input into textual output. The model:
    • Processes the audio features.
    • Identifies patterns that represent speech sounds.
    • Generates textual representations based on these sounds.
    Example models: The Whisper model by OpenAI, Wav2Vec2, or DeepSpeech.
  4. Text Generation:
    Once the speech model processes the audio input, it outputs a sequence of words or sentences, which are the transcription of the spoken content. This text is generated based on the model’s learned relationships between speech sounds and corresponding words.
  5. Post-Processing:
    After the text is generated, additional steps may be applied to improve the readability and accuracy of the transcription. These may include:
    • Punctuation correction: Adding punctuation marks like commas, periods, or question marks.
    • Capitalization: Ensuring proper use of upper and lower case.
    • Time Alignment (Optional): Assigning timestamps to each transcribed word or phrase for subtitle generation or detailed analysis.
  6. Output:
    The final step of the ASR pipeline is to deliver the transcription, which may be plain text, formatted for subtitles (such as SRT), or output into any custom format depending on the use case.

Why Use an ASR Pipeline?

ASR pipelines are essential for converting spoken language into written text, which has a wide range of applications:

  • Subtitles for videos: Automating the process of creating subtitles for videos.
  • Speech-to-text conversion: For dictation software, note-taking, or transcription services.
  • Voice-controlled applications: In smart assistants like Siri or Alexa.
  • Accessibility: Helping individuals with hearing impairments understand audio content through text.
  • Language translation: When combined with machine translation models, ASR pipelines can enable real-time translation of spoken languages.

Example of an ASR Pipeline in Python

In the context of this code, the ASR pipeline is built with the transformers library, using OpenAI's Whisper model for transcription. Here's how it works:

from transformers import pipeline

# Define an automatic speech recognition pipeline using the Whisper model
pipe = pipeline(
    "automatic-speech-recognition",  # The task type: speech-to-text
    model="openai/whisper-small",     # The pre-trained Whisper model
    tokenizer="openai/whisper-small", # The tokenizer corresponding to the model
    feature_extractor="openai/whisper-small",  # Audio preprocessing features
    device=device  # Specify whether to run on CPU or GPU
)

# Use the pipeline to transcribe an audio file
result = pipe(audio_path, return_timestamps=True)

In this pipeline:

  • Input: An audio file (audio_path).
  • Feature Extraction: The pipeline extracts features from the audio that are relevant to the Whisper model.
  • Model Inference: The audio is passed through the Whisper model, which transcribes the speech into text.
  • Output: The transcribed text and timestamps, which can be used for subtitles or other applications.
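
For reference, with return_timestamps=True the pipeline returns a dictionary containing the full transcript plus a list of timestamped chunks, roughly of the shape below (the values are made up for illustration):

# Illustrative structure of the ASR pipeline output
result = {
    "text": " Hello everyone and welcome to the lecture ...",
    "chunks": [
        {"timestamp": (0.0, 4.2), "text": " Hello everyone and welcome"},
        {"timestamp": (4.2, 9.8), "text": " to the lecture ..."},
    ],
}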

Key Benefits of ASR Pipelines

  1. Automated Speech-to-Text: Provides a fast and accurate way to convert spoken language into text without manual transcription.
  2. Multi-language Support: Many ASR pipelines, such as Whisper, support multiple languages and dialects.
  3. Accuracy: ASR pipelines have significantly improved, even in noisy environments or with varying accents.
  4. Customizable: Pipelines can be tailored to specific use cases, including generating subtitles, real-time transcriptions, or even integrating into voice-controlled systems.

In summary, an ASR pipeline is a sequence of processes designed to take an audio input and return its corresponding text transcription. This can be used for a variety of applications, including subtitles, voice-to-text services, and improving accessibility.

Returning to the run function from step 5, the extracted audio is transcribed and the chunk results are written to the SRT file:

result = pipe(audioPath, generate_kwargs={"language": "english"})

with open(srtPath, "w", encoding="utf-8") as f:
    for i, chunk in enumerate(result['chunks']):
        start = chunk["timestamp"][0]
        end = chunk["timestamp"][1]
        f.write(f"{i+1}\n")
        f.write(f"{seconds_to_srt_time(start)} --> {seconds_to_srt_time(end)}\n")
        f.write(chunk['text'].strip() + "\n\n")
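
The seconds_to_srt_time helper used above converts a floating-point number of seconds into the HH:MM:SS,mmm format that SRT requires. It is defined in the complete code below; the idea is simply:

def seconds_to_srt_time(seconds):
    """Convert time in seconds to SRT timestamp format (HH:MM:SS,ms)."""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    milliseconds = int((seconds % 1) * 1000)  # take the fractional part before truncating
    secs = int(seconds % 60)
    return f"{hours:02}:{minutes:02}:{secs:02},{milliseconds:03}"

print(seconds_to_srt_time(75.5))  # 00:01:15,500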

7. Example Usage
To transcribe a video and generate subtitles, call the run function like so:

run('video.mp4', 'audio.wav', 'audio.srt', pipe)

This setup lets you efficiently convert the speech in your videos into subtitles, making your content accessible to a wider audience. Whisper is robust across many kinds of speech, including noisy recordings and multiple speakers, although accuracy still depends on audio quality and language.

Complete Code

# python --version
# python -m venv venv
# venv/Scripts/activate
# python.exe -m pip install --upgrade pip

# pip install torch==2.2.1 transformers==4.40.0 datasets==2.18.0 moviepy==1.0.3 accelerate==0.30.1 librosa soundfile --cache-dir "D:/internship/speech_to_text/.cache"
# pip uninstall numpy
# pip install "numpy<2"

# Optional GPU setup:
# Install CUDA Toolkit 11.8 (https://developer.nvidia.com/cuda-downloads)
# Check the installed CUDA version: nvcc --version
# Check the GPU and driver: nvidia-smi
# Visual C++ Build Tools may be needed to build some packages from source
# Pick the matching CUDA-enabled PyTorch build: https://pytorch.org/get-started/locally/
# pip uninstall torch torchvision torchaudio

# pip install "D:/internship/speech_to_text/gpu_requirements/torch-2.4.1+cu121-cp312-cp312-win_amd64.whl"

import os
# Environment setup
os.environ['HF_HOME'] = "D:/internship/speech_to_text/.cache"
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = "max_split_size_mb:128"
import torch
torch.cuda.empty_cache()

import moviepy.editor as mp
import time

from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

# Setting up device and model
device = "cuda" if torch.cuda.is_available() else "cpu"
# device = 'cpu'
print(device)

torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "openai/whisper-small"

# Load the model and processor
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype).to(device)
processor = AutoProcessor.from_pretrained(model_id)

# Creating the pipeline for automatic speech recognition
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=25,
    batch_size=8,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

# Example dataset, but you can ignore this if you're just running on your video
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

def seconds_to_srt_time(seconds):
    """Convert time in seconds to SRT timestamp format (HH:MM:SS,ms)."""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    milliseconds = int((seconds % 1) * 1000)  # take the fractional part before truncating
    seconds = int(seconds % 60)
    return f"{hours:02}:{minutes:02}:{seconds:02},{milliseconds:03}"

def run(videoPath, audioPath, srtPath, pipe):
    # Load the video file
    video = mp.VideoFileClip(videoPath)
    # Extract the audio from the video
    audio = video.audio
    # Write the audio to a WAV file
    audio.write_audiofile(audioPath)
    # Close the video file
    video.close()
    
    start_time = time.time()
    
    # Run the speech-to-text pipeline on the extracted audio
    result = pipe(audioPath, generate_kwargs={"language": "english"})
    
    end_time = time.time()
    elapsed_time = end_time - start_time
    print(f"Elapsed time: {elapsed_time} seconds")
    
    # Write the results to an SRT file
    with open(srtPath, "w", encoding="utf-8") as f:
        for i, chunk in enumerate(result['chunks']):
            start = chunk["timestamp"][0]
            end = chunk["timestamp"][1]
            # The final chunk can occasionally have a missing end timestamp; fall back to its start
            if end is None:
                end = start
            
            # Format the timestamps in SRT time format
            start_time_formatted = seconds_to_srt_time(start)
            end_time_formatted = seconds_to_srt_time(end)
            
            # Write the SRT index, timestamps, and text
            f.write(f"{i+1}\n")
            f.write(f"{start_time_formatted} --> {end_time_formatted}\n")
            f.write(chunk['text'].strip() + "\n\n")

# Example call to the run function
run('video.mp4', "audio.wav", "audio.srt", pipe)