Renewal·마흔의 생활코딩

LLM | GPT 4o API Hands-On Beginning - 3. Video + Audio based QA

normalstory

A tutorial for using GPT-4o (Omni Model) through OpenAI's Vision and Text APIs. 

MY GPT 4o API Beginning Course
1. Image (multimodal)
2. Summary (Video + Audio)
3. QA (Video + Audio based chat) (this post)

 

This time, here is the hands-on code for a Video + Audio based QA (chat) setup: the API calls are arranged so that the LLM (GPT 4o) answers a question using the video frames and the audio transcript passed to it in the request.

 

 

GPT 4o API Hands-on 1. Video based Q&A
: Video-based Q&A 

- resource 

       *The same Audio and Video files used in the previous post's hands-on example      

- code

# Load environment variables (API key) from .env
from dotenv import load_dotenv
load_dotenv() 

# Load required libraries — OpenAI client library for interacting with the API
from openai import OpenAI 
# Reuse code from earlier; *base64Frames is the list of video frames converted to Base64 strings
from C04_Summary_Video import base64Frames

client = OpenAI()  # Initialize the client to connect to the OpenAI API.
MODEL = "gpt-4o"  # Model used to generate responses

# Define the question we want answered
QUESTION = "Question: Why do you emphasise the importance of demonstrating to really understand the capabilities of the Macintosh?"

# Generate a chat response using the OpenAI client library (chat.completions.create())
qa_visual_response = client.chat.completions.create(
    # Specify the model, system, and user messages for this request
    model=MODEL,
    messages=[
    {"role": "system", "content": "Use the video to answer the provided question. Respond in Markdown."},
    {"role": "user", "content": [
        {"type": "text", "text": "These are the frames from the video."},
        *map(lambda x: {"type": "image_url", "image_url": {"url": f'data:image/jpg;base64,{x}', "detail": "low"}}, base64Frames),
        {"type": "text", "text": QUESTION}
        ],
    }
    ],
    # Set temperature to 0 for factual, repeatable answers
    temperature=0,
)

# Print the generated response text (message.content); the system message asked for Markdown
print("\n\nVideo QA:\n" + qa_visual_response.choices[0].message.content)

- result

 

 

 
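The request above sends every entry of base64Frames as an image part, so the request grows with the length of the clip. Here is a minimal sketch of thinning the list before building the message, assuming base64Frames is a plain Python list of Base64 strings. sample_frames is a hypothetical helper, not part of the original post or of the OpenAI library, and if the list from the previous post is already sampled this step may be unnecessary.

```python
from typing import List


def sample_frames(frames: List[str], max_frames: int) -> List[str]:
    """Return at most max_frames items, evenly spaced across the list."""
    if max_frames <= 0:
        raise ValueError("max_frames must be positive")
    if len(frames) <= max_frames:
        return list(frames)
    # Spread the picks evenly over the whole list instead of taking the first N.
    step = len(frames) / max_frames
    return [frames[int(i * step)] for i in range(max_frames)]


# Stand-in for base64Frames; real entries would be Base64-encoded images.
fake_frames = [f"frame{i}" for i in range(100)]
subset = sample_frames(fake_frames, 10)
print(len(subset), subset[0], subset[-1])
```

In the hands-on code, the result of sample_frames(base64Frames, n) could be passed to map() in place of base64Frames. The value of n is a trade-off between cost and how much of the video the model sees.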

GPT 4o API Hands-on 2. Audio based Q&A
: Audio-based Q&A 

- resource

       *The same Audio and Video files used in the previous post's hands-on example      

- code

# Load environment variables (API key) from .env
from dotenv import load_dotenv
load_dotenv() 

# Load required libraries — OpenAI client library for interacting with the API
from openai import OpenAI  
# Reuse code from the previous post: transcription is the transcription result, and its .text attribute holds the audio converted to text
from C05_Summary_Audio import transcription 

client = OpenAI()  # Initialize the client to connect to the OpenAI API
MODEL="gpt-4o"  # Specify the model to generate responses

# Define the question we want answered
QUESTION = "Question: Why do you emphasise the importance of demonstrating to really understand the capabilities of the Macintosh?"

# Generate a chat response using the OpenAI client library (chat.completions.create())
qa_audio_response = client.chat.completions.create(
    # Specify the model, system, and user messages for this request
    model=MODEL,
    messages=[
    {"role": "system", "content":"""Use the transcription to answer the provided question. Respond in Markdown."""},
    {"role": "user", "content": f"The audio transcription is: {transcription.text}. \n\n {QUESTION}"},
    ],
    temperature=0,
)

# Print the content of the generated chat response
print("\n\nAudio QA:\n" + qa_audio_response.choices[0].message.content)

- result

 

 

 
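Both print calls so far concatenate message.content directly onto a string. In the OpenAI Python library that attribute is typed as optional, so the concatenation raises a TypeError whenever it is None. Here is a small defensive sketch. answer_text is a hypothetical helper, and the fake response is only a SimpleNamespace stand-in that mimics the shape of a chat completion, so no API call is made.

```python
from types import SimpleNamespace


def answer_text(response) -> str:
    """Return the first choice's text, or an empty string if there is none."""
    if not response.choices:
        return ""
    content = response.choices[0].message.content
    return content or ""


# Hypothetical stand-in for a response whose message has no text content.
fake = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content=None))]
)
print("\n\nAudio QA:\n" + answer_text(fake))
```

With a real response, answer_text(qa_audio_response) would take the place of qa_audio_response.choices[0].message.content in the print call.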

GPT 4o API Hands-on 3. Audio & Video based Q&A
: Combined video + audio Q&A 

- resource

       *The same Audio and Video files used in the previous post's hands-on example      

- code

# Load environment variables (API key) from .env
from dotenv import load_dotenv
load_dotenv()  

# Load required libraries
from openai import OpenAI  # OpenAI client library for interacting with the API
from C04_Summary_Video import base64Frames  # Reuse: list of video frames as Base64 strings
from C05_Summary_Audio import transcription  # Reuse: transcription result; .text holds the audio as text

client = OpenAI()   # Initialize the client to connect to the OpenAI API
MODEL="gpt-4o"  # Specify the model to generate responses
QUESTION = "Question: Why do you emphasise the importance of demonstrating to really understand the capabilities of the Macintosh?"

qa_both_response = client.chat.completions.create(
    model=MODEL,
    # As in the C06_Summary_VideoAudio example, use map() to combine the audio and video content into one
    messages=[
        {
            "role": "system", 
            "content":"""Use the video and transcription to answer the provided question."""
        },
        {
            "role": "user", 
            "content": [
                {"type": "text", "text": "These are the frames from the video."},
                *map(
                    lambda x: {
                        "type": "image_url",
                        "image_url": {"url": f'data:image/jpg;base64,{x}', "detail": "low"}
                    },
                    base64Frames
                ),
                {
                    "type": "text",
                    "text": f"The audio transcription is: {transcription.text}"
                },
                {"type": "text", "text": QUESTION}
            ],
        }
    ],
    temperature=0,
)
print("\n\nBoth QA:\n" + qa_both_response.choices[0].message.content)

- result

 

 

 

Personally, as someone with a humanities background rather than a developer, I don't find it easy to understand code. So whenever I can, I run things first to see the result, then restructure the material into code I can repeat as often as I need, so that it becomes something "familiar" through my own experience.

If you can read material you don't yet know, you start to think about it; if you keep thinking about it, you come to understand it; and once you understand it, it becomes interesting. To try out that approach, I built the GPT 4o API hands-on content in these three posts around the same basic skeleton: I broke very similar code samples into smaller pieces, changed each one only slightly so that the effect of each change is visible, and then went back and added detailed comments to each.
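The shared skeleton can also be written down once. The following is a sketch under assumptions rather than the code from the posts: build_qa_messages is a hypothetical helper that returns the messages list for the video-only, audio-only, or combined variant, depending on which inputs it receives. It uses explicit text parts throughout and adds "Respond in Markdown." to every system message.

```python
from typing import Any, Dict, List, Optional


def build_qa_messages(
    question: str,
    frames: Optional[List[str]] = None,
    transcript: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Build chat messages for video-only, audio-only, or combined Q&A."""
    if not frames and not transcript:
        raise ValueError("provide frames, a transcript, or both")

    sources: List[str] = []
    parts: List[Dict[str, Any]] = []
    if frames:
        sources.append("video")
        parts.append({"type": "text", "text": "These are the frames from the video."})
        # One image part per Base64-encoded frame, as in the hands-on code.
        parts.extend(
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpg;base64,{frame}", "detail": "low"},
            }
            for frame in frames
        )
    if transcript:
        sources.append("transcription")
        parts.append({"type": "text", "text": f"The audio transcription is: {transcript}"})
    parts.append({"type": "text", "text": question})

    system = (
        f"Use the {' and '.join(sources)} to answer the provided question. "
        "Respond in Markdown."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": parts},
    ]


# Placeholder inputs; real calls would pass base64Frames and transcription.text.
messages = build_qa_messages("Question: What is shown?", frames=["AAAA"], transcript="hello")
print(messages[0]["content"])
```

The returned list could then be passed as messages= to client.chat.completions.create(), which leaves the choice of inputs as the only difference between the three hands-on examples.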

 

I had originally planned to post some LangGraph hands-on code next, but this OpenAI announcement was so impressive that I switched the order and put up the related practice content instead.

This English version was translated by Claude.

Written by
친절한 찰쓰씨

Pleasant Charles — UI/UX researcher at AIT. Keeping notes on design, planning, and slow days here since 2010.
