Renewal·마흔의 생활코딩

LLM | Ollama Part 5: Applying Image Recognition

February 25, 2024·5 min read

- Ollama Part 1. Running it from a local terminal: Linux (wsl 2), MacOS
- Ollama Part 2. Running it in a local browser : open-webui
- Ollama Part 3. Running it in an online browser (on my own domain)
- Ollama Part 4. Applying RAG (Retrieval-Augmented Generation)
- Ollama Part 5. Applying image recognition
- (in preparation) Ollama Part 6. Applying the MOE (Mixture of Experts) approach

Image recognition model: llava:13b

Among the models Ollama provides, the one for image recognition (vision) is Llava. More concretely, the basic capability you can build on Llava is essentially an Image Annotator App. The model comes in three sizes: llava:7b, llava:13b, and llava:34b. Unlike the regular language models we usually use, the 7b version of Llava (an image model) doesn't perform as well as you'd hope, so in many cases people end up using the 13b.

1. Workflow
1) Get the file list from a folder
2) Load each file and convert it to bytes
3) Send the bytes to Llava 1.6 via Ollama
4) Save the result to the DataFrame
5) Save the DataFrame to csv

2. The code

from ollama import generate
import glob
import pandas as pd
from PIL import Image
import os
from io import BytesIO

def load_or_create_dataframe(filename):
    if os.path.isfile(filename):
        df = pd.read_csv(filename)
    else:
        df = pd.DataFrame(columns=['image_file', 'description'])
    return df

# Load the dataframe from the CSV file, or create a new one if the file does not exist.
df = load_or_create_dataframe('image_descriptions.csv')

def get_png_files(folder_path):
    return glob.glob(f"{folder_path}/*.png")

# Get the list of image files in the folder to process and sort it in ascending order
image_files = get_png_files("./images")
image_files.sort()
print(df.head())

# Process the image (with PIL), generate a description with the LLM, and append a row (record) to the dataframe
def process_image(image_file):
    print(f"\nProcessing {image_file}\n")
    with Image.open(image_file) as img:
        with BytesIO() as buffer:
            img.save(buffer, format='PNG')
            image_bytes = buffer.getvalue()

    full_response = ''
    # Generate a description of the image
    for response in generate(model='llava:13b', 
                             prompt='describe this image and make sure to include anything notable about it (include text you see in the image):', 
                             images=[image_bytes], 
                             stream=True):
        # Print each response chunk to the console and append it to the full response.
        print(response['response'], end='', flush=True)
        full_response += response['response']

    # Append a new row to the dataframe.
    df.loc[len(df)] = [image_file, full_response]

# For each file in the image folder's file list,
for image_file in image_files:
    # if its file name is not yet in the dataframe (a new file),
    if image_file not in df['image_file'].values:
        # run process_image() to generate a description for it
        process_image(image_file)

# Save the dataframe to the CSV file.
df.to_csv('image_descriptions.csv', index=False)

*By the way, there are two well-known ways to read images: OpenCV and PIL. This example uses PIL. (Side note 2: more recently, it is apparently also possible to read the file directly as raw binary (bytes) and process it that way, so it's worth checking out.)
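
The raw-bytes route from the side note can be sketched as follows. This is a minimal sketch, assuming the files are already valid PNGs so that no re-encoding through PIL is needed. `read_image_bytes` is a hypothetical helper name, not part of the script above.

```python
from pathlib import Path

# Every PNG file begins with this 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def read_image_bytes(image_file):
    """Return the file's raw bytes, checking only that it looks like a PNG.

    Hypothetical helper: it skips PIL entirely, so the image is read from
    disk as-is and is neither decoded nor re-encoded.
    """
    data = Path(image_file).read_bytes()
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError(f"{image_file} does not look like a PNG file")
    return data
```

The returned value could stand in for image_bytes inside process_image. PIL is still the better fit when the image has to be converted to another format or resized before it is sent.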

3. The result

(1) Running image recognition for the first time
- Process
1) Create an empty dataframe (a table with its columns defined but no records (rows) yet), since no csv file exists.
2) Pair the 'image description' generated by Llava with the 'image name' to create a dataframe row.
3) Save that content to the csv.
- Image description
1) It varies a bit each time you run it, but generally the description covers two angles: the content (depiction) of the image and the data (type) aspect of the image.
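
If the run-to-run variation is unwanted, one option is to pass sampling options along with the request. The sketch below is an assumption, not what the script above does: it relies on the `options` argument of the Python `ollama` client with `temperature` and `seed`, and both helper names are hypothetical. A fixed seed should make repeated runs more consistent, though identical output is not guaranteed.

```python
def build_generate_kwargs(image_bytes, seed=42):
    """Assemble keyword arguments for ollama.generate with fixed sampling options."""
    return {
        "model": "llava:13b",
        "prompt": "describe this image and make sure to include anything "
                  "notable about it (include text you see in the image):",
        "images": [image_bytes],
        # Assumption: temperature 0 plus a fixed seed reduces variation
        # between runs; it does not guarantee identical text.
        "options": {"temperature": 0, "seed": seed},
    }

def describe_image(image_bytes):
    """Return one description; needs a running Ollama server with llava:13b pulled."""
    # Imported here so build_generate_kwargs can be used without the package.
    from ollama import generate
    return generate(**build_generate_kwargs(image_bytes))["response"]
```

Keeping the argument building separate from the model call makes it easy to check the request without a server running.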

(2) Running it again with the same images in the folder
- Process
1) It prints the first few records previously saved in the csv (the output of df.head()) and skips the images that already have a description.

(+) When you add a new image to the folder, it generates a description for that image and saves it as an additional record.
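
The skip-or-describe behavior described in this section comes down to a membership check against the file names already stored. Here is a minimal sketch of just that check, with no model call. `find_new_files` is a hypothetical helper name, and the file names are made up for illustration.

```python
def find_new_files(image_files, described_files):
    """Return, in sorted order, the image files that have no description yet."""
    already_done = set(described_files)  # set membership is fast per file
    return sorted(f for f in image_files if f not in already_done)

# Made-up example: two images were described on an earlier run, one is new.
in_folder = ["./images/b.png", "./images/a.png", "./images/c.png"]
in_csv = ["./images/a.png", "./images/b.png"]
print(find_new_files(in_folder, in_csv))  # ['./images/c.png']
```

In the script from section 2, the second argument would be df['image_file'].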

(For reference, here are some of the image descriptions it produced. Quite something. The original post shows them translated into Korean via DeepL.)

This image features an illustration of a character that appears to be inspired by the style of the popular internet meme "Pepe the Frog." The character is characterized by distinctive facial features, including large eyes and a big nose. The character is in a thoughtful pose, with one hand raised near the chin, suggesting contemplation or deep thought. There is no text in the image itself.

This image shows a character from the film series "The Hunger Games." The character is portrayed by Jennifer Lawrence, who played Katniss Everdeen in the films. With a serious expression, she is dressed in a dark tunic with leather-like detailing and an outfit with a quiver of arrows on her back. Her hair is styled in a way that resembles Katniss's character in the books and films. This image is likely promotional material for the film series, or an illustration related to "The Hunger Games" series.

This image is a colorful illustration depicting Goku, a character from the animated and comic series "Dragon Ball." He is depicted in his Super Saiyan form, characterized by his pointed golden hair. Goku is wearing his signature orange martial arts outfit, with black bracelets on each wrist and a blue belt around his waist. Goku is in an action pose, looking forward with both arms outstretched to either side. The background of the image is a plain white, focused on the character. There is no text in the image.

This image features a man sitting at a desk in what appears to be an urban environment. He is wearing a bright red shirt and black trousers, with his hands resting on the desk in front of him. His expression is focused, and he appears to be looking down or at something off-frame.
Judging by the lighting and composition, this setting appears to be from a film or television production, which often indicates the professional staging of such media. There is blurred lighting in the background, indicating an indoor space with artificial lighting. On the desk is a computer monitor with a red icon visible, but the contents of the screen cannot be seen.
On the desk are also other items commonly seen in office environments, such as a pen holder, paper, and electronic devices or cables. The overall mood of the image appears somber or contemplative, due to the man's expression and the surrounding lighting.

This English version was translated by Claude.

#llava:13b #Image Annotator #ollama llava #ollama vision #PIL #image recognition #image processing

Written by

친절한 찰쓰씨

Pleasant Charles — UI/UX researcher at AIT. Keeping notes on design, planning, and slow days here since 2010.
