Introduction
In this tutorial, we will build a Streamlit app that lets users upload a video, processes it to understand its content, and then generates a text description of it using OpenAI's GPT-4 model with vision. This app is a practical example of integrating AI into web applications.
Prerequisites
Before we begin, make sure you have the following:
- Python (version 3.8 or later) installed on your system.
- An API key from OpenAI. Sign up at OpenAI's website to obtain one.
- Basic understanding of Python and working with APIs.
Step 1: Setting Up Your Environment
- Create a Project Directory: Open your terminal and run the following commands to create a new directory for your project:

```bash
mkdir video-understanding-app
cd video-understanding-app
```

- Install Required Libraries: In the same terminal window, install Streamlit, OpenCV, and the OpenAI client by running:

```bash
pip install streamlit opencv-python-headless openai
```

Note that `base64` and `tempfile` are part of Python's standard library, so they don't need to be installed with pip.
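Optionally, before installing, you can isolate the project's dependencies in a virtual environment. This is standard Python practice rather than a requirement of this tutorial:

```bash
python -m venv .venv
source .venv/bin/activate   # On Windows: .venv\Scripts\activate
```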
Step 2: Writing the App
- Create a Python Script: In the project directory, create a file named `app.py`. This file will hold your app's code.
- Import Libraries: At the top of `app.py`, import the necessary libraries.
- Initialize OpenAI Key: Set your OpenAI API key. Replace `'your_api_key'` with your actual key.
- Define the Main Function: Create a `main` function where the core logic of the app will reside.
- Video Upload Section: Inside the `main` function, add a section for users to upload their video.
- Process Uploaded Video: Write code to handle the uploaded video, save it temporarily, and display its first frame.
- Generate Video Description: Add functionality to generate a description of the video using OpenAI's API.
- Clean Up Temporary Files: Ensure the temporary files are deleted after use.
- Run the Main Function: At the end of `app.py`, add the following to ensure the main function is called when the script is executed:
```python
import streamlit as st
import cv2
import base64
import tempfile
import openai
import os

# Initialize the OpenAI key
os.environ['OPENAI_API_KEY'] = 'your_api_key'  # Replace with your actual key
openai.api_key = os.getenv('OPENAI_API_KEY')

def main():
    st.title("Video Understanding App")

    # Video upload section
    uploaded_video = st.file_uploader("Upload a video", type=["mp4", "avi", "mov"])

    if uploaded_video is not None:
        # Save the uploaded video to a temporary file
        tfile = tempfile.NamedTemporaryFile(delete=False)
        tfile.write(uploaded_video.read())
        tfile.close()

        # Process the video and display its first frame
        base64_frames = video_to_base64_frames(tfile.name)
        st.image(base64.b64decode(base64_frames[0]), caption="First frame of the video")

        # Generate a description of the video
        description = generate_description(base64_frames)
        st.write("Description:", description)

        # Clean up the temporary file
        os.unlink(tfile.name)

if __name__ == "__main__":
    main()
```
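Hard-coding the key works for a local demo, but if you plan to share the code it's safer to set the variable in your shell instead of in `app.py`. This is a common convention, not specific to this tutorial:

```bash
export OPENAI_API_KEY="sk-..."   # macOS/Linux; on Windows use `set OPENAI_API_KEY=...`
```

With the variable exported, you can drop the `os.environ['OPENAI_API_KEY'] = ...` line and keep only `openai.api_key = os.getenv('OPENAI_API_KEY')`.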
Step 3: Additional Functions
Convert Video to Base64 Frames
This function will convert the video into a series of base64-encoded frames. This encoding allows us to send frame data over the internet for AI analysis.
- Function Definition: Define the function `video_to_base64_frames`, which takes the path of the video file as an argument.
- Initialize Video Capture: Use OpenCV to capture the video from the given file path.
- Read and Encode Frames: Read each frame of the video, encode it as JPEG, and then convert it to a base64 string.
```python
def video_to_base64_frames(video_file_path):
    # Open the video file with OpenCV
    video = cv2.VideoCapture(video_file_path)
    base64_frames = []

    while video.isOpened():
        success, frame = video.read()
        if not success:
            break
        # Encode the frame as JPEG, then as a base64 string
        _, buffer = cv2.imencode('.jpg', frame)
        base64_frame = base64.b64encode(buffer).decode('utf-8')
        base64_frames.append(base64_frame)

    video.release()
    return base64_frames
```
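For long videos, reading and encoding every frame can consume a lot of memory. As a sketch of one possible optimization (a hypothetical variant, not part of the tutorial's code), you could keep only every Nth frame while reading:

```python
def video_to_base64_frames_sampled(video_file_path, every_n=15):
    # Hypothetical variant: keep only every `every_n`-th frame to save memory
    video = cv2.VideoCapture(video_file_path)
    base64_frames = []
    index = 0
    while video.isOpened():
        success, frame = video.read()
        if not success:
            break
        if index % every_n == 0:
            _, buffer = cv2.imencode('.jpg', frame)
            base64_frames.append(base64.b64encode(buffer).decode('utf-8'))
        index += 1
    video.release()
    return base64_frames
```

If you use this variant, the frames are already sampled, so you would pass them all to the API rather than slicing with `[0::15]` later.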
Generate Video Description
This function interacts with OpenAI's API to generate a text description of the video frames.
- Function Definition: Create the function `generate_description`, which takes the base64-encoded frames as input.
- Setup API Request: Construct the request to be sent to OpenAI's API. This includes a prompt for the AI and the encoded frames.
```python
def generate_description(base64_frames):
    try:
        # Build the prompt: a text instruction plus a sample of the frames
        prompt_messages = [
            {
                "role": "user",
                "content": [
                    "Generate a description for this sequence of video frames.",
                    *map(lambda x: {"image": x, "resize": 768}, base64_frames[0::15]),
                ],
            },
        ]
```
Let’s break this down:

The `try` Block

`try` is used in Python to catch and handle exceptions (errors) that may occur in the block of code within the `try` statement. If an error occurs, execution moves to the `except` block, allowing the program to handle the error gracefully.

The `prompt_messages` List

`prompt_messages` is a list that contains a dictionary. This dictionary is structured to conform to the requirements of the OpenAI API, specifically for generating descriptions of images (in this case, frames from a video).

The Dictionary Inside `prompt_messages`

- `"role": "user"`: This key-value pair specifies the role for the message; in this case, a "user" sending a request to the model.
- `"content"`: This key holds a list of items that form the content of the request.

The Content of the Request

- `"Generate a description for this sequence of video frames."`: This is a text instruction to the AI model, telling it what is expected. It's a prompt for the model to generate a description of the video frames.
- `base64_frames[0::15]`: This is a slicing operation on the `base64_frames` list, which contains the base64-encoded strings of the video frames. The slice `[0::15]` means "start from the first frame (index 0), and then take every 15th frame." This reduces the number of frames being processed to a manageable amount, as processing every single frame would be unnecessary and computationally expensive.
- `map(lambda x: {"image": x, "resize": 768}, base64_frames[0::15])`: The `map` function applies the given lambda function to each item in `base64_frames[0::15]`. The lambda takes a frame (`x`) and returns a dictionary with two keys: `"image"`, which is set to the frame itself, and `"resize"`, which is set to 768. This indicates that each image should be resized (for consistency and to manage the processing load).
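If the slicing syntax is unfamiliar, here is a quick illustration of what `[0::15]` selects, using a plain list of numbers as a stand-in for the frames:

```python
frames = list(range(60))   # stand-in for 60 encoded frames
sampled = frames[0::15]    # start at index 0, take every 15th element
print(sampled)             # [0, 15, 30, 45]
```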
- Make API Call: Use the `openai.ChatCompletion.create` method to send the request to OpenAI.
- Handle API Response: The response from OpenAI is processed and the generated description is returned. Error handling is included to manage any issues with the API call.
```python
        response = openai.ChatCompletion.create(
            model="gpt-4-vision-preview",
            messages=prompt_messages,
            max_tokens=400,
        )
        return response.choices[0].message.content
    except openai.error.OpenAIError as e:
        # Display an error message and offer a retry
        st.error("OpenAI API error encountered. Please try again.")
        st.write("Error details:", e)
        if st.button("Retry Request"):
            return generate_description(base64_frames)
        return None
```
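A button inside the `except` block is a simple way to surface retries in the UI. As an alternative sketch (a hypothetical variant using the same `openai.ChatCompletion.create` call as above, not part of the tutorial's code), you could retry automatically with exponential backoff instead:

```python
import time

def generate_description_with_backoff(base64_frames, max_retries=3):
    # Hypothetical variant: retry the request automatically instead of via a button
    for attempt in range(max_retries):
        try:
            response = openai.ChatCompletion.create(
                model="gpt-4-vision-preview",
                messages=[{
                    "role": "user",
                    "content": [
                        "Generate a description for this sequence of video frames.",
                        *map(lambda x: {"image": x, "resize": 768}, base64_frames[0::15]),
                    ],
                }],
                max_tokens=400,
            )
            return response.choices[0].message.content
        except openai.error.OpenAIError:
            time.sleep(2 ** attempt)  # wait 1s, 2s, 4s between attempts
    return None
```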
With these functions in place, your app can process video files and use AI to generate descriptions of their content.
Step 4: Running the App
- Start the App: In your terminal, navigate to the project directory and run:

```bash
streamlit run app.py
```

- Interact with the App: Your default web browser should open with the Streamlit interface. Upload a video and see the app generate a description!
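By default, Streamlit serves the app at http://localhost:8501. If that port is already in use, you can pass a different one:

```bash
streamlit run app.py --server.port 8502
```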
Conclusion
Congratulations! You've built an AI-powered app that understands videos and generates descriptions.
Here is what the full code should look like:

```python
import streamlit as st
import cv2
import base64
import tempfile
import openai
import os

# Initialize the OpenAI key
os.environ['OPENAI_API_KEY'] = 'your_api_key'  # Replace with your actual key
openai.api_key = os.getenv('OPENAI_API_KEY')

def main():
    st.title("Video Understanding App")

    # Video upload section
    uploaded_video = st.file_uploader("Upload a video", type=["mp4", "avi", "mov"])

    if uploaded_video is not None:
        # Save the uploaded video to a temporary file
        tfile = tempfile.NamedTemporaryFile(delete=False)
        tfile.write(uploaded_video.read())
        tfile.close()

        # Process the video and display its first frame
        base64_frames = video_to_base64_frames(tfile.name)
        st.image(base64.b64decode(base64_frames[0]), caption="First frame of the video")

        # Generate a description of the video
        description = generate_description(base64_frames)
        st.write("Description:", description)

        # Clean up the temporary file
        os.unlink(tfile.name)

def video_to_base64_frames(video_file_path):
    # Open the video file with OpenCV
    video = cv2.VideoCapture(video_file_path)
    base64_frames = []

    while video.isOpened():
        success, frame = video.read()
        if not success:
            break
        # Encode the frame as JPEG, then as a base64 string
        _, buffer = cv2.imencode('.jpg', frame)
        base64_frame = base64.b64encode(buffer).decode('utf-8')
        base64_frames.append(base64_frame)

    video.release()
    return base64_frames

def generate_description(base64_frames):
    try:
        # Build the prompt: a text instruction plus a sample of the frames
        prompt_messages = [
            {
                "role": "user",
                "content": [
                    "Generate a description for this sequence of video frames.",
                    *map(lambda x: {"image": x, "resize": 768}, base64_frames[0::15]),
                ],
            },
        ]
        response = openai.ChatCompletion.create(
            model="gpt-4-vision-preview",
            messages=prompt_messages,
            max_tokens=400,
        )
        return response.choices[0].message.content
    except openai.error.OpenAIError as e:
        # Display an error message and offer a retry
        st.error("OpenAI API error encountered. Please try again.")
        st.write("Error details:", e)
        if st.button("Retry Request"):
            return generate_description(base64_frames)
        return None

if __name__ == "__main__":
    main()
```