Introduction
In this tutorial, we will build a Streamlit app that lets users upload a video, processes it to understand its content, and then generates a text description of it using OpenAI's GPT-4 model with vision. This app is a practical example of integrating AI into web applications.
Prerequisites
Before we begin, make sure you have the following:
- Python (version 3.8 or later) installed on your system.
- An API key from OpenAI. Sign up at OpenAI's website to obtain one.
- Basic understanding of Python and working with APIs.
Step 1: Setting Up Your Environment
- Create a Project Directory: Open your terminal and run the following commands to create a new directory for your project:

```bash
mkdir video-understanding-app
cd video-understanding-app
```

- Install Required Libraries: In the same terminal window, install Streamlit, OpenCV, and the OpenAI client by running:

```bash
pip install streamlit opencv-python-headless openai
```

Note that `base64` and `tempfile` are part of Python's standard library, so they don't need to be installed with pip.
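Optionally, before installing, you can isolate the project's dependencies in a virtual environment. This is standard Python practice rather than a requirement of this tutorial:

```bash
python -m venv .venv
source .venv/bin/activate   # On Windows: .venv\Scripts\activate
```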
Step 2: Writing the App
- Create a Python Script: In the project directory, create a file named `app.py`. This file will hold your app's code.
- Import Libraries: At the top of `app.py`, import the necessary libraries.
- Initialize OpenAI Key: Set your OpenAI API key. Replace `'your_api_key'` with your actual key.
- Define the Main Function: Create a `main` function where the core logic of the app will reside.
- Video Upload Section: Inside the `main` function, add a section for users to upload their video.
- Process Uploaded Video: Write code to handle the uploaded video, save it temporarily, and display its first frame.
- Generate Video Description: Add functionality to generate a description of the video using OpenAI's API.
- Clean Up Temporary Files: Ensure the temporary files are deleted after use.
- Run the Main Function: At the end of `app.py`, add the following to ensure the main function is called when the script is executed:
```python
import streamlit as st
import cv2
import base64
import tempfile
import openai
import os

# Initialize the OpenAI key
os.environ['OPENAI_API_KEY'] = 'your_api_key'  # Replace with your actual key
openai.api_key = os.getenv('OPENAI_API_KEY')

def main():
    st.title("Video Understanding App")

    # Video upload section
    uploaded_video = st.file_uploader("Upload a video", type=["mp4", "avi", "mov"])

    if uploaded_video is not None:
        # Save the uploaded video to a temporary file
        tfile = tempfile.NamedTemporaryFile(delete=False)
        tfile.write(uploaded_video.read())
        tfile.close()

        # Process the video and display its first frame
        base64_frames = video_to_base64_frames(tfile.name)
        st.image(base64.b64decode(base64_frames[0]), caption="First frame of the video")

        # Generate a description of the video
        description = generate_description(base64_frames)
        st.write("Description:", description)

        # Clean up the temporary file
        os.unlink(tfile.name)

if __name__ == "__main__":
    main()
```
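Hard-coding the key works for a local demo, but if you plan to share the code it's safer to set the variable in your shell instead of in `app.py`. This is a common convention, not specific to this tutorial:

```bash
export OPENAI_API_KEY="sk-..."   # macOS/Linux; on Windows use `set OPENAI_API_KEY=...`
```

With the variable exported, you can drop the `os.environ['OPENAI_API_KEY'] = ...` line and keep only `openai.api_key = os.getenv('OPENAI_API_KEY')`.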
Step 3: Additional Functions
Convert Video to Base64 Frames
This function will convert the video into a series of base64-encoded frames. This encoding allows us to send frame data over the internet for AI analysis.
- Function Definition: Define the function `video_to_base64_frames`, which takes the path of the video file as an argument.
- Initialize Video Capture: Use OpenCV to capture the video from the given file path.
- Read and Encode Frames: Read each frame of the video, encode it as JPEG, and then convert it to a base64 string.
```python
def video_to_base64_frames(video_file_path):
    # Open the video file with OpenCV
    video = cv2.VideoCapture(video_file_path)
    base64_frames = []

    while video.isOpened():
        success, frame = video.read()
        if not success:
            break
        # Encode the frame as JPEG, then as a base64 string
        _, buffer = cv2.imencode('.jpg', frame)
        base64_frame = base64.b64encode(buffer).decode('utf-8')
        base64_frames.append(base64_frame)

    video.release()
    return base64_frames
```
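For long videos, reading and encoding every frame can consume a lot of memory. As a sketch of one possible optimization (a hypothetical variant, not part of the tutorial's code), you could keep only every Nth frame while reading:

```python
def video_to_base64_frames_sampled(video_file_path, every_n=15):
    # Hypothetical variant: keep only every `every_n`-th frame to save memory
    video = cv2.VideoCapture(video_file_path)
    base64_frames = []
    index = 0
    while video.isOpened():
        success, frame = video.read()
        if not success:
            break
        if index % every_n == 0:
            _, buffer = cv2.imencode('.jpg', frame)
            base64_frames.append(base64.b64encode(buffer).decode('utf-8'))
        index += 1
    video.release()
    return base64_frames
```

If you use this variant, the frames are already sampled, so you would pass them all to the API rather than slicing with `[0::15]` later.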
Generate Video Description
This function interacts with OpenAI's API to generate a text description of the video frames.
- Function Definition: Create the function `generate_description`, which takes the base64-encoded frames as input.
- Setup API Request: Construct the request to be sent to OpenAI's API. This includes a prompt for the AI and the encoded frames.
```python
def generate_description(base64_frames):
    try:
        # Build the prompt: a text instruction plus a sample of the frames
        prompt_messages = [
            {
                "role": "user",
                "content": [
                    "Generate a description for this sequence of video frames.",
                    *map(lambda x: {"image": x, "resize": 768}, base64_frames[0::15]),
                ],
            },
        ]
```
Let’s break this down:

The `try` Block

`try` is used in Python to catch and handle exceptions (errors) that may occur in the block of code within the `try` statement. If an error occurs, execution moves to the `except` block, allowing the program to handle the error gracefully.

The `prompt_messages` List

`prompt_messages` is a list that contains a dictionary. This dictionary is structured to conform to the requirements of the OpenAI API, specifically for generating descriptions of images (in this case, frames from a video).

The Dictionary Inside `prompt_messages`

- `"role": "user"`: This key-value pair specifies the role for the message; in this case, a "user" sending a request to the model.
- `"content"`: This key holds a list of items that form the content of the request.

The Content of the Request

- `"Generate a description for this sequence of video frames."`: This is a text instruction to the AI model, telling it what is expected. It's a prompt for the model to generate a description of the video frames.
- `base64_frames[0::15]`: This is a slicing operation on the `base64_frames` list, which contains the base64-encoded strings of the video frames. The slice `[0::15]` means "start from the first frame (index 0), and then take every 15th frame." This reduces the number of frames being processed to a manageable amount, as processing every single frame would be unnecessary and computationally expensive.
- `map(lambda x: {"image": x, "resize": 768}, base64_frames[0::15])`: The `map` function applies the given lambda function to each item in `base64_frames[0::15]`. The lambda takes a frame (`x`) and returns a dictionary with two keys: `"image"`, which is set to the frame itself, and `"resize"`, which is set to 768. This indicates that each image should be resized (for consistency and to manage the processing load).
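If the slicing syntax is unfamiliar, here is a quick illustration of what `[0::15]` selects, using a plain list of numbers as a stand-in for the frames:

```python
frames = list(range(60))   # stand-in for 60 encoded frames
sampled = frames[0::15]    # start at index 0, take every 15th element
print(sampled)             # [0, 15, 30, 45]
```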
- Make API Call: Use the `openai.ChatCompletion.create` method to send the request to OpenAI.
- Handle API Response: The response from OpenAI is processed and the generated description is returned. Error handling is included to manage any issues with the API call.
```python
        response = openai.ChatCompletion.create(
            model="gpt-4-vision-preview",
            messages=prompt_messages,
            max_tokens=400,
        )
        return response.choices[0].message.content
    except openai.error.OpenAIError as e:
        # Display an error message and offer a retry
        st.error("OpenAI API error encountered. Please try again.")
        st.write("Error details:", e)
        if st.button("Retry Request"):
            return generate_description(base64_frames)
        return None
```
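A button inside the `except` block is a simple way to surface retries in the UI. As an alternative sketch (a hypothetical variant using the same `openai.ChatCompletion.create` call as above, not part of the tutorial's code), you could retry automatically with exponential backoff instead:

```python
import time

def generate_description_with_backoff(base64_frames, max_retries=3):
    # Hypothetical variant: retry the request automatically instead of via a button
    for attempt in range(max_retries):
        try:
            response = openai.ChatCompletion.create(
                model="gpt-4-vision-preview",
                messages=[{
                    "role": "user",
                    "content": [
                        "Generate a description for this sequence of video frames.",
                        *map(lambda x: {"image": x, "resize": 768}, base64_frames[0::15]),
                    ],
                }],
                max_tokens=400,
            )
            return response.choices[0].message.content
        except openai.error.OpenAIError:
            time.sleep(2 ** attempt)  # wait 1s, 2s, 4s between attempts
    return None
```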
With these functions in place, your app can process video files and use AI to generate descriptions of their content.
Step 4: Running the App
- Start the App: In your terminal, navigate to the project directory and run:

```bash
streamlit run app.py
```

- Interact with the App: Your default web browser should open with the Streamlit interface. Upload a video and see the app generate a description!
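By default, Streamlit serves the app at http://localhost:8501. If that port is already in use, you can pass a different one:

```bash
streamlit run app.py --server.port 8502
```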
Conclusion
Congratulations! You've built an AI-powered app that understands videos and generates descriptions.
Here is what the full code should look like:

```python
import streamlit as st
import cv2
import base64
import tempfile
import openai
import os

# Initialize the OpenAI key
os.environ['OPENAI_API_KEY'] = 'your_api_key'  # Replace with your actual key
openai.api_key = os.getenv('OPENAI_API_KEY')

def main():
    st.title("Video Understanding App")

    # Video upload section
    uploaded_video = st.file_uploader("Upload a video", type=["mp4", "avi", "mov"])

    if uploaded_video is not None:
        # Save the uploaded video to a temporary file
        tfile = tempfile.NamedTemporaryFile(delete=False)
        tfile.write(uploaded_video.read())
        tfile.close()

        # Process the video and display its first frame
        base64_frames = video_to_base64_frames(tfile.name)
        st.image(base64.b64decode(base64_frames[0]), caption="First frame of the video")

        # Generate a description of the video
        description = generate_description(base64_frames)
        st.write("Description:", description)

        # Clean up the temporary file
        os.unlink(tfile.name)

def video_to_base64_frames(video_file_path):
    # Open the video file with OpenCV
    video = cv2.VideoCapture(video_file_path)
    base64_frames = []

    while video.isOpened():
        success, frame = video.read()
        if not success:
            break
        # Encode the frame as JPEG, then as a base64 string
        _, buffer = cv2.imencode('.jpg', frame)
        base64_frame = base64.b64encode(buffer).decode('utf-8')
        base64_frames.append(base64_frame)

    video.release()
    return base64_frames

def generate_description(base64_frames):
    try:
        # Build the prompt: a text instruction plus a sample of the frames
        prompt_messages = [
            {
                "role": "user",
                "content": [
                    "Generate a description for this sequence of video frames.",
                    *map(lambda x: {"image": x, "resize": 768}, base64_frames[0::15]),
                ],
            },
        ]
        response = openai.ChatCompletion.create(
            model="gpt-4-vision-preview",
            messages=prompt_messages,
            max_tokens=400,
        )
        return response.choices[0].message.content
    except openai.error.OpenAIError as e:
        # Display an error message and offer a retry
        st.error("OpenAI API error encountered. Please try again.")
        st.write("Error details:", e)
        if st.button("Retry Request"):
            return generate_description(base64_frames)
        return None

if __name__ == "__main__":
    main()
```