Renewal·마흔의 생활코딩

LLM | GPT-4o API Practice Beginning - 1. Image (Multi-Modal)

May 15, 2024·4 min read


A tutorial for trying out GPT-4o (the "omni" model) through OpenAI's vision and text APIs.

My GPT-4o API Beginner Course
→ 1. Image (Multi-Modal)
2. Summary (Video + Audio)
3. QA (Video + Audio based chat)

After OpenAI's Spring Update (May 13, 2024), online media, from YouTube to the news, erupted in shock. The presentation and demo videos went viral, copied and reposted again and again. I too was stunned for a while, unable to believe the demonstrations. Then I tracked down the related code and immediately worked through a hands-on session with the GPT-4o API.

Setting Up the Practice Environment

01) Add the API key and the GitHub configuration files to the root folder

Write the .env file
OPENAI_API_KEY=YourOpenAIIssuedAPIKey
Write .gitignore (list .env in it so the API key is never committed)
Reference link - https://docs.github.com/ko/get-started/getting-started-with-git/ignoring-files
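
Before any API call is made, it can help to confirm that the key from the .env file actually reached the process. The following is a minimal sketch, assuming the key has already been loaded into the environment (for example by load_dotenv()). The helper name get_api_key is hypothetical and not part of the OpenAI package.

```python
import os


def get_api_key(env=None):
    """Return the OpenAI API key from the environment, or raise if it is missing.

    get_api_key is a hypothetical helper for this sketch, not an OpenAI API.
    """
    # Fall back to the real process environment when no mapping is passed in.
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; check the .env file in the root folder"
        )
    return key
```

Failing early with a clear message is usually easier to debug than an authentication error returned by the first request.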

02) Set up and run a separate virtual environment to avoid dependency conflicts with packages from other local projects.

Create virtual environment
python -m venv your-preferred-venv-name
Activate virtual environment
mac - source your-preferred-venv-name/bin/activate
win - your-preferred-venv-name\Scripts\activate
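
To check that activation worked, one option is to ask the interpreter itself. This is a small sketch that relies on the standard-library behavior of venv, where sys.prefix and sys.base_prefix differ inside an environment. The function name in_virtualenv is an assumption made for illustration.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter is running inside a venv-created environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter:", sys.executable)
```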

03) Install dependency packages
pip install -U openai opencv-python moviepy python-dotenv
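
After installation, a quick import check can catch a package that landed in the wrong environment. The sketch below uses only the standard library. The names REQUIRED_MODULES and missing_modules are hypothetical. Note that two of the packages are imported under a different name than the one passed to pip.

```python
import importlib.util

# Import names differ from the pip package names for two of the four:
# opencv-python is imported as cv2, python-dotenv as dotenv.
REQUIRED_MODULES = ["openai", "cv2", "moviepy", "dotenv"]


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("missing:", missing if missing else "none")
```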

GPT 4o API Practice 1. Basic Chat Setup

- code

# Load the load_dotenv function from the dotenv package
from dotenv import load_dotenv

# Run the function that loads environment variables from the .env file
load_dotenv()

# Add the OpenAI package
from openai import OpenAI

# Create an OpenAI client
client = OpenAI()

# Specify the chat model to use
MODEL = "gpt-4o"

# Create the completion object that stores the chat response returned by the OpenAI API
completion = client.chat.completions.create(
    model=MODEL,  # model to use
    messages=[  # pass message objects in chronological order
        {"role": "system", "content": "You are a helpful assistant. Help me with my math homework!"},
        {"role": "user", "content": "Hello! Could you solve 2+2?"},
    ],
)

# Print the first choice from the chatbot response in the "choices" property
print("Assistant: " + completion.choices[0].message.content)

- result
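
The basic chat call sends a single user turn. Because messages are passed in chronological order, a follow-up question means appending the assistant's reply and the next user message to the same list. Here is a rough sketch of that bookkeeping, with no API call involved. The helpers start_conversation and add_turn are hypothetical names, and the assistant text is a placeholder.

```python
def start_conversation(system_prompt):
    """Begin a message history with a single system message."""
    return [{"role": "system", "content": system_prompt}]


def add_turn(messages, role, content):
    """Return a new history with one more message appended at the end."""
    # The API accepts other roles as well; this sketch only covers these three.
    if role not in ("system", "user", "assistant"):
        raise ValueError(f"unsupported role: {role}")
    return messages + [{"role": role, "content": content}]


history = start_conversation("You are a helpful assistant. Help me with my math homework!")
history = add_turn(history, "user", "Hello! Could you solve 2+2?")
# After a real call, the reply text would come from
# completion.choices[0].message.content; this string is a placeholder.
history = add_turn(history, "assistant", "<assistant reply goes here>")
history = add_turn(history, "user", "And what is that result multiplied by 3?")
```

Passing history as the messages argument of the next request gives the model the earlier turns as context.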

GPT 4o API Practice 2. Multi-Modal - Local

: Recognize and interpret a local image, then solve a math problem

- resource

GPT 4o API study - multi modal(Image) input

- code

# Load environment variables
from dotenv import load_dotenv
load_dotenv()

# Import the OpenAI library and base64 for encoding the image
from openai import OpenAI
import base64

client = OpenAI()  # Initialize the OpenAI client
MODEL = "gpt-4o"  # specify the model
IMAGE_PATH = "resource/triangle.png"  # specify the image path

# Function definition: encode an image file as base64
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

# Encode the image
base64_image = encode_image(IMAGE_PATH)

# Create a conversational request, similar to a chat window.
# There are system and user roles, each holding a message.
# The user message includes a triangle image attachment.
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant that responds in Markdown. Help me with my math homework!"},
        {"role": "user", "content": [
            {"type": "text", "text": "What's the area of the triangle?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{base64_image}"}},
        ]},
    ],
    temperature=0.0,
)

# Print the response
print(response.choices[0].message.content)

- result

GPT 4o API study - multi modal(Image) output
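
The local-image example hardcodes image/png in the data URL, which only matches PNG files. One way to handle other formats is to guess the MIME type from the file name, sketched below. It uses only the standard library, and the helper name image_to_data_url is hypothetical.

```python
import base64
import mimetypes


def image_to_data_url(image_path):
    """Encode a local image file as a base64 data URL with a guessed MIME type."""
    # guess_type looks only at the file extension, not at the file contents.
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"not a recognized image file: {image_path}")
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string can be used as the "url" value of an image_url content part in place of the hand-built f-string.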

GPT 4o API Practice 3. Multi-Modal - Online

: Recognize an image at an online URL, then solve a math problem

- resource

GPT 4o API study - multi modal(sketch, url) input

- code

# Load environment variables
from dotenv import load_dotenv
load_dotenv()

# Import the OpenAI library
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"

# Use the client to create a conversation
# There are system and user roles, each holding messages
# Provide the system message to define the chatbot's role
# The user message includes a core question and a URL to an image of the triangle
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant that responds in Markdown. Help me with my math homework!"},
        {"role": "user", "content": [
            {"type": "text", "text": "What's the area of the triangle?"},
            {"type": "image_url", "image_url": {"url": ""}},  # put the image URL here
        ]},
    ],
    temperature=0.0,
)

# Print the response
print(response.choices[0].message.content)

- result

GPT 4o API study - multi modal(url) output
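
When the image comes from a URL, the user message can be assembled by a small helper that rejects an empty or non-http string before any request is sent. The sketch below makes some assumptions: the function names are hypothetical, the example.com address is a placeholder, and the optional "detail" field ("auto", "low", or "high") is worth checking against the current OpenAI documentation before relying on it.

```python
from urllib.parse import urlparse


def image_url_part(url, detail="auto"):
    """Build the image_url content part for a remote image."""
    if urlparse(url).scheme not in ("http", "https"):
        raise ValueError("expected an http(s) image URL")
    if detail not in ("auto", "low", "high"):
        raise ValueError(f"unsupported detail level: {detail}")
    return {"type": "image_url", "image_url": {"url": url, "detail": detail}}


def question_with_image(question, url, detail="auto"):
    """Build a user message that pairs a text question with one remote image."""
    return {
        "role": "user",
        "content": [{"type": "text", "text": question}, image_url_part(url, detail)],
    }


# Placeholder URL for illustration only; replace it with a real image address.
message = question_with_image(
    "What's the area of the triangle?", "https://example.com/triangle.png"
)
```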

Personally, while the math problem-solving was impressive, what truly shocked me was the image recognition and interpretation. The first image was created with Google Slides, but the second was hand-drawn on a Post-it note, and the picture even included the area around the note. The model still recognized everything and solved the problem.

With the arrival of ChatGPT, I understand that the English education market has been undergoing major changes. LLMs that previously struggled even with arithmetic now appear to have been substantially upgraded. More than that, this is not simply an advance; it has become mainstream. Other LLMs will surely level up to this standard soon, and I expect significant changes in the market for math-related products this time around.

This English version was translated by Claude.