How to Convert Video to Text Using Python

Video meetings, webinars, and tutorials are now part of everyday work, and a text version of that content is often needed for accessibility, indexing, and search engine optimization (SEO). This guide shows how to convert a video file, such as an MKV, into text using Python and the Google Speech-to-Text API. The examples transcribe Bengali, but the same steps apply to any language the API supports.

Why Convert Video to Text?

Before diving into the process, let’s explore some of the key reasons to convert video content into text:

Accessibility: Text transcripts allow individuals with hearing disabilities to access video content.
SEO Benefits: Search engines can index text, making your content easier to discover.
Improved User Experience: Providing both video and text options allows users to consume content in their preferred format.
Content Repurposing: You can use the text version of your video for blog posts, eBooks, or social media snippets.

Now that we understand the importance of converting video to text, let’s walk through the step-by-step process using Python and the Google Speech-to-Text API.

Requirements for Converting Video to Text

Before starting, you will need the following:

Python: Installed on your computer. You can download it from the official Python website.
Google Cloud Account: Set up Google Speech-to-Text API and create service credentials in JSON format.
Required Python Libraries:
- google-cloud-speech
- moviepy
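- pydub (optional; the scripts in this guide do not import it, but it is useful for trimming or converting audio)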

These can be installed using pip:

pip install google-cloud-speech moviepy pydub

Step 1: Setting Up Google Cloud Speech-to-Text API

The Google Cloud Speech-to-Text API allows us to convert audio to text. It supports multiple languages, including Bengali, making it an ideal solution for our task.

Enable the API: Go to the Google Cloud Console, create a project, and enable the Speech-to-Text API.
Create Service Account Key: Go to the “Credentials” section, create a new service account, and download the key as a JSON file. This key file is essential for authentication.

Step 2: Extract Audio from the MKV Video File

Since Google Speech-to-Text works with audio files, the first step is to extract the audio from the MKV video file. For this, we will use the moviepy library.

Here’s how to extract audio from the MKV video file and save it as a WAV file:

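try:
    from moviepy import VideoFileClip  # moviepy 2.x
except ImportError:
    from moviepy.editor import VideoFileClip  # moviepy 1.x

def extract_audio_from_mkv(video_file, output_audio_file):
    """
    Extract audio from an MKV video file and save it as a 16 kHz mono WAV file.
    """
    video = VideoFileClip(video_file)
    try:
        if video.audio is None:
            raise ValueError(f"No audio track found in {video_file}")
        # 16-bit PCM, 16 kHz, one channel: the LINEAR16 format sent to the API
        video.audio.write_audiofile(
            output_audio_file,
            fps=16000,
            codec="pcm_s16le",
            ffmpeg_params=["-ac", "1"],  # downmix to mono
        )
    finally:
        video.close()
    print(f"Audio extracted and saved as {output_audio_file}")

# Example usage:
extract_audio_from_mkv("meeting_video.mkv", "output_audio.wav")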

This function extracts the audio from your MKV file and saves it as an uncompressed 16-bit PCM WAV file, a format the Speech-to-Text API accepts directly as LINEAR16.
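Before sending the file for transcription, it can help to confirm what was actually written. The sketch below reads the WAV header with Python's standard wave module. The helper name describe_wav is an assumption made for illustration; it is not part of moviepy or the Google client library.

```python
import wave

def describe_wav(path):
    """Return the basic format properties stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": rate,
            "bits_per_sample": wav.getsampwidth() * 8,
            # Duration follows from the frame count and the sample rate
            "duration_seconds": frames / rate if rate else 0.0,
        }

# Example usage (assumes output_audio.wav exists):
# print(describe_wav("output_audio.wav"))
```

If the reported channel count or sample rate differs from what the transcription request declares, the API is likely to reject the audio or return poor results, so this is a cheap check to run first.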

Step 3: Authenticate Google Cloud Speech-to-Text API

For the Google API to work, you need to authenticate your application. Set an environment variable pointing to your JSON key file:

For Linux/macOS:

export GOOGLE_APPLICATION_CREDENTIALS="path_to_your_service_account_json.json"

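For Windows (Command Prompt):

set GOOGLE_APPLICATION_CREDENTIALS=path_to_your_service_account_json.json

For Windows (PowerShell):

$env:GOOGLE_APPLICATION_CREDENTIALS="path_to_your_service_account_json.json"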
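A missing or mistyped path is a common reason for authentication errors, so it may be worth checking the variable from Python before creating a client. The following is a minimal sketch using only the standard library. The function name check_google_credentials is hypothetical, and the check only confirms that the file exists and is marked as a service account key, not that Google will accept it.

```python
import json
import os

def check_google_credentials(env=None):
    """Return the key file path if the variable is set and the file looks like a service account key."""
    env = os.environ if env is None else env
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Key file not found: {path}")
    with open(path, encoding="utf-8") as f:
        key = json.load(f)
    # Service account key files carry a "type" field with this value
    if key.get("type") != "service_account":
        raise ValueError("Key file does not have type 'service_account'")
    return path

# Example usage:
# print("Using credentials from", check_google_credentials())
```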

Step 4: Transcribe Audio Using Google Cloud Speech-to-Text API

Now that the audio is extracted and you are authenticated, the next step is to send the audio file to Google Speech-to-Text API for transcription.

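Here is the Python code for this step. It uses the synchronous recognize method, which is meant for short clips of up to about one minute: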

import wave
from google.cloud import speech

def transcribe_audio_google(audio_file):
    """
    Transcribe a short WAV file using Google Cloud Speech-to-Text API.
    """
    client = speech.SpeechClient()

    # Read the sample rate and channel count from the WAV header so the
    # request describes the file that was actually written
    with wave.open(audio_file, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        channel_count = wav_file.getnchannels()

    # Load the audio into memory
    with open(audio_file, "rb") as audio_file_content:
        audio_data = audio_file_content.read()

    # Configure audio settings for the request
    audio = speech.RecognitionAudio(content=audio_data)
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=sample_rate,
        audio_channel_count=channel_count,
        language_code="bn-BD"  # Bengali (Bangladesh)
    )

    # Transcribe the audio (synchronous call, short clips only)
    response = client.recognize(config=config, audio=audio)

    # Join the top alternative of each result, separated by spaces
    transcription = " ".join(
        result.alternatives[0].transcript.strip()
        for result in response.results
        if result.alternatives
    )

    return transcription

# Example usage:
transcription = transcribe_audio_google("output_audio.wav")
print(transcription)

# Optionally, save the transcription to a file
with open("transcription.txt", "w", encoding="utf-8") as f:
    f.write(transcription)
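For recordings longer than the roughly one-minute limit of a synchronous request, one option is to split the audio into shorter pieces and transcribe them one by one. The sketch below does the splitting with the standard wave module and yields raw 16-bit PCM, which can be sent as LINEAR16 content. The name split_wav_frames and the 55-second default are assumptions chosen for illustration. Cutting at fixed times can split a word in half, so treat this as a rough approach; the client's long_running_recognize method with audio stored in Cloud Storage is the alternative for long files.

```python
import wave

def split_wav_frames(path, chunk_seconds=55):
    """Yield (sample_rate, channels, pcm_bytes) for consecutive chunks of a 16-bit PCM WAV file."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("Expected 16-bit samples (LINEAR16)")
        rate = wav.getframerate()
        channels = wav.getnchannels()
        frames_per_chunk = int(rate * chunk_seconds)
        while True:
            pcm = wav.readframes(frames_per_chunk)
            if not pcm:
                break
            yield rate, channels, pcm

# Each chunk could then be sent in its own request, for example:
#   audio = speech.RecognitionAudio(content=pcm)
#   config = speech.RecognitionConfig(
#       encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
#       sample_rate_hertz=rate,
#       audio_channel_count=channels,
#       language_code="bn-BD",
#   )
#   response = client.recognize(config=config, audio=audio)
```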

Step 5: Putting It All Together

Let’s combine everything into one script. This script will take an MKV video file, extract the audio, send it to the Google API for transcription, and save the result as a text file.

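import os
import wave

from google.cloud import speech

try:
    from moviepy import VideoFileClip  # moviepy 2.x
except ImportError:
    from moviepy.editor import VideoFileClip  # moviepy 1.x

# Step 1: Extract audio from the MKV video file
def extract_audio_from_mkv(video_file, output_audio_file):
    """
    Extract audio from an MKV video file and save it as a 16 kHz mono WAV file.
    """
    video = VideoFileClip(video_file)
    try:
        if video.audio is None:
            raise ValueError(f"No audio track found in {video_file}")
        # 16-bit PCM, 16 kHz, one channel: the LINEAR16 format sent to the API
        video.audio.write_audiofile(
            output_audio_file,
            fps=16000,
            codec="pcm_s16le",
            ffmpeg_params=["-ac", "1"],  # downmix to mono
        )
    finally:
        video.close()
    print(f"Audio extracted and saved as {output_audio_file}")

# Step 2: Transcribe audio using Google Cloud Speech-to-Text API
def transcribe_audio_google(audio_file):
    """
    Transcribe a short WAV file using Google Cloud Speech-to-Text API.
    """
    client = speech.SpeechClient()

    # Read the sample rate and channel count from the WAV header so the
    # request describes the file that was actually written
    with wave.open(audio_file, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        channel_count = wav_file.getnchannels()

    # Load the audio into memory
    with open(audio_file, "rb") as audio_file_content:
        audio_data = audio_file_content.read()

    # Configure audio settings for the request
    audio = speech.RecognitionAudio(content=audio_data)
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=sample_rate,
        audio_channel_count=channel_count,
        language_code="bn-BD"  # Bengali (Bangladesh)
    )

    # Transcribe the audio (synchronous call, short clips only)
    response = client.recognize(config=config, audio=audio)

    # Join the top alternative of each result, separated by spaces
    transcription = " ".join(
        result.alternatives[0].transcript.strip()
        for result in response.results
        if result.alternatives
    )

    return transcription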

# Step 3: Convert MKV video to text
def convert_mkv_to_text(video_file, audio_file, credentials_file):
    """
    Convert an MKV video to text by extracting the audio and transcribing it using Google Cloud.
    """
    # Set Google credentials environment variable
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = credentials_file

    # Step 1: Extract audio from the MKV video
    extract_audio_from_mkv(video_file, audio_file)

    # Step 2: Transcribe the extracted audio to text using Google Cloud
    transcription = transcribe_audio_google(audio_file)

    # Step 3: Save the transcription to a text file
    with open("transcription.txt", "w", encoding="utf-8") as f:
        f.write(transcription)
    
    print("Transcription saved to transcription.txt")

# Example usage:
convert_mkv_to_text("your_video_file.mkv", "output_audio.wav", "your_service_account_json.json")

Conclusion

Converting video to text using Python and Google Speech-to-Text API is a powerful solution for creating transcripts for meetings, webinars, and other content. This guide walked you through the entire process, from extracting audio from an MKV file to transcribing the audio and saving the transcription to a file. By following these steps, you can quickly generate text versions of your video content in multiple languages, including Bengali, for better accessibility and SEO.

FAQs

1. What video formats are supported?

While this guide uses MKV files, moviepy reads video through FFmpeg, so formats such as MP4 and AVI generally work with the same code. Usually only the input file name needs to change.
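One small change that helps when handling several formats is deriving the output names from the input file instead of hard-coding them. A minimal sketch with pathlib follows; the helper name output_paths and the out_dir parameter are hypothetical and only illustrate the idea.

```python
from pathlib import Path

def output_paths(video_file, out_dir=None):
    """Return (wav_path, txt_path) named after the input video, whatever its extension."""
    video = Path(video_file)
    # Default to writing next to the source video
    target = Path(out_dir) if out_dir else video.parent
    return target / f"{video.stem}.wav", target / f"{video.stem}.txt"

# Example usage:
# wav_path, txt_path = output_paths("webinar.mp4")
```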

2. How accurate is Google Speech-to-Text for Bengali?

Accuracy for Bengali, as for any language, depends heavily on audio quality, background noise, and how clearly the speakers talk. Test the API on a sample of your own recordings before relying on the transcripts.

3. Can I use this method for live transcription?

Yes, but live transcription requires the API’s streaming recognition method (streaming_recognize in the Python client), which processes audio in real time as it arrives instead of reading a finished file.
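As a rough illustration of the streaming approach, the sketch below feeds small chunks of raw 16-bit mono PCM to the client's streaming_recognize method. The helper names audio_chunks and stream_transcripts are assumptions, as is the 3200-byte chunk size (about 100 ms of 16 kHz 16-bit mono audio). Capturing audio from a microphone is left out; any binary stream with a read method would do.

```python
def audio_chunks(stream, chunk_bytes=3200):
    """Yield fixed-size byte chunks from a binary stream until it is exhausted."""
    while True:
        chunk = stream.read(chunk_bytes)
        if not chunk:
            break
        yield chunk

def stream_transcripts(stream, language_code="bn-BD", sample_rate_hertz=16000):
    """Yield final transcripts for raw 16-bit mono PCM read from stream."""
    # Imported here so audio_chunks can be used without the client library
    from google.cloud import speech

    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=sample_rate_hertz,
        language_code=language_code,
    )
    streaming_config = speech.StreamingRecognitionConfig(
        config=config, interim_results=False
    )
    # One request per audio chunk, generated lazily as the stream is read
    requests = (
        speech.StreamingRecognizeRequest(audio_content=chunk)
        for chunk in audio_chunks(stream)
    )
    responses = client.streaming_recognize(config=streaming_config, requests=requests)
    for response in responses:
        for result in response.results:
            if result.is_final and result.alternatives:
                yield result.alternatives[0].transcript
```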