Slow Days·말로만 듣던 마흔

Image Classification & Captioning Effect

normalstory

Service Business Modeling 

 - B2B, B2G technical sales (refined data, solutions, APIs, etc.) 

 - B2BC user-participation reward platform 

 - Hyper-personalized (non-ad) value-added services

 

Service Scalability

 - Apply OCR to improve performance and expand the range of use

 - Read the text interpretation aloud (text-to-speech). 

 - Record comments on the photo by voice. The recorded content is fed back into model training and captioning. 

 - Expand the scope to sets of photos (videos).  

 - Accumulated data will eventually contribute to converting text into images that carry varied interpretations and intentions. 

 

Service Core Value 

 - The core is turning unstructured data into text (strings). 

 - Meaning-based digitalization 

 


 

The image classification & captioning effect starts from this: once data is extracted as strings, it becomes classifiable and searchable.

That is, material can be converted into data. 

This is similar in context to the effect you get from converting voice into text.
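A minimal sketch of what "classifiable and searchable" could look like once images have been turned into strings. It assumes some captioning model has already produced the captions; the file names, the captions, and the helper names `build_index` and `search` are made up for illustration and are not the service's actual implementation.

```python
from collections import defaultdict

# Hypothetical captions, as an image-captioning model might emit them.
captions = {
    "img_001.jpg": "a child reading a picture book on a sofa",
    "img_002.jpg": "a dog running on the beach at sunset",
    "img_003.jpg": "a child playing with a dog in the park",
}


def build_index(captions):
    """Map each lowercase word to the set of images whose caption contains it."""
    index = defaultdict(set)
    for image_id, text in captions.items():
        for word in text.lower().split():
            index[word].add(image_id)
    return index


def search(index, query):
    """Return the images whose captions contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)


index = build_index(captions)
print(search(index, "child dog"))
```

The same index doubles as a crude classifier: every word is a label, and the set stored under it is the class of images carrying that label.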

 

Phenomenon or situation -> recognition / recording / saving -> unstructured material -> structured data.

And the moment that data meets the person who needs it, it turns into information. 

 

Records people transcribed by hand, facts photographed with a smartphone camera, situations filmed with a smartphone camera: all of these become searchable as data.

The key point is not simple digitalization, but being able to find it when you need it. 

The point is that it is not merely an array of 0s and 1s: meaning can be assigned to each array.

 

To sum up so far: through this, a touchpoint has been created between image/video information and people.

Add time and location, and it becomes true personalization. 

 

As you know, photos already carry location information. 

All that is left is the right timing, since photos already contain not only location and time but also people (social connections) and surrounding context.
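As a rough sketch of combining the caption text with time and location, the example below filters captioned photos by a keyword, a time window, and a latitude/longitude box. The `Photo` record, the `find` helper, and all sample values are hypothetical; in practice the timestamp and coordinates would probably be read from the photo's own metadata (for example EXIF), which this sketch does not do.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Photo:
    name: str
    caption: str          # text produced by a captioning step (assumed)
    taken_at: datetime    # when the photo was taken
    lat: float            # where it was taken
    lon: float


# Made-up sample records for illustration only.
photos = [
    Photo("img_001.jpg", "lunch at a noodle shop", datetime(2024, 5, 1, 12, 30), 37.57, 126.98),
    Photo("img_002.jpg", "sunset over the river", datetime(2024, 5, 1, 19, 10), 37.52, 126.93),
    Photo("img_003.jpg", "lunch on the beach", datetime(2024, 8, 3, 12, 5), 35.16, 129.16),
]


def find(photos, word, start, end, lat_range, lon_range):
    """Names of photos whose caption has the word, inside the time window and the lat/lon box."""
    results = []
    for p in photos:
        in_text = word.lower() in p.caption.lower().split()
        in_time = start <= p.taken_at < end
        in_box = (lat_range[0] <= p.lat <= lat_range[1]
                  and lon_range[0] <= p.lon <= lon_range[1])
        if in_text and in_time and in_box:
            results.append(p.name)
    return results


hits = find(photos, "lunch", datetime(2024, 5, 1), datetime(2024, 5, 2),
            (37.0, 38.0), (126.0, 127.5))
print(hits)
```

Only the photo that matches on all three axes (what, when, where) comes back, which is the sense in which time and location narrow a search down to one person's situation.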

 

What's even more striking is that the effect named in the title is only just getting started.

First, different people like and dislike images (and videos) differently, and the same image can be read differently depending on the situation. In other words, their interpretation is strongly taste-driven and ambiguous. An objective interpretation is impossible, but that is exactly what makes them a very precise indicator of personal taste.

Second, images (and videos) are different from sentences: turning a sentence into an image and turning an image into a sentence carry different nuances. The number of possible conversions in either direction may approach infinity, but the number of ways each piece of information lands on the reader (or viewer) is the opposite. I think the saying "a picture is worth a thousand words" sums this up.

These two effects spread divergence and convergence along the x, y, z... axes. 
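To make the first effect a little more concrete, here is a small sketch of how like/dislike reactions to captioned images could be folded into a per-person taste profile. The tags, the reactions, and the `taste_profile` helper are all assumptions for illustration; a real system would likely use something richer than a plus-or-minus-one count.

```python
from collections import Counter

# Hypothetical reactions from one person: (tags extracted from the image, liked?)
reactions = [
    (["beach", "sunset"], True),
    (["beach", "crowd"], False),
    (["sunset", "mountain"], True),
]


def taste_profile(reactions):
    """Add 1 to a tag's score for each liked image carrying it, subtract 1 for each disliked one."""
    scores = Counter()
    for tags, liked in reactions:
        for tag in tags:
            scores[tag] += 1 if liked else -1
    return scores


profile = taste_profile(reactions)
print(profile["sunset"], profile["beach"], profile["crowd"])
```

No single reaction says anything objective about an image, but the accumulated scores start to describe the person doing the reacting.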

 

 

 


The solution can also be used for model training. Perhaps the most natural case would be data collected through children's picture books. On top of images (classification, interpretation), the parent's voice and the child's feedback get recorded. Many different parents' voices and many different children's feedback converge on the same image. 
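The picture-book case can be sketched as a simple grouping step: many spoken comments, already transcribed to text, are gathered under the image they refer to. The record layout, the page names, and the `group_by_image` helper are hypothetical, and the transcription itself is assumed to have happened elsewhere.

```python
from collections import defaultdict

# Made-up records: (image, who spoke, transcribed text)
feedback = [
    ("page_03.png", "parent", "the bear is looking for honey"),
    ("page_03.png", "child", "the bear is hungry"),
    ("page_03.png", "parent", "a bear walks through the forest"),
    ("page_07.png", "child", "the moon is sleeping"),
]


def group_by_image(records):
    """Collect every transcribed comment under its image, split by speaker role."""
    grouped = defaultdict(lambda: defaultdict(list))
    for image_id, speaker, text in records:
        grouped[image_id][speaker].append(text)
    # Convert to plain dicts so missing keys raise instead of silently appearing.
    return {image_id: dict(by_speaker) for image_id, by_speaker in grouped.items()}


grouped = group_by_image(feedback)
print(len(grouped["page_03.png"]["parent"]), len(grouped["page_03.png"]["child"]))
```

Each image ends up with several independent descriptions attached, which is the kind of varied labeling a captioning model could be trained on.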

 

 

Oh, wow.

So AI (?) can dream too..!

This English version was translated by Claude.

Written by 친절한 찰쓰씨

Pleasant Charles — UI/UX researcher at AIT. Keeping notes on design, planning, and slow days here since 2010.
