Automation6 min read

GPT-4o: What the Omni Model Means for Commerce

OpenAI's new model processes text, audio, and images natively in a single pass. The voice demos got all the coverage. The more interesting story for retail is narrower and more actionable.

Sarah Chen

Senior Editor

—20 May 2024

OpenAI announced GPT-4o on 13 May 2024 and the demos were genuinely impressive. The live voice conversations felt different in kind from anything that had come before. Someone held up a maths problem on paper and the model read it and explained it in real time. Someone else showed it a piece of code and it debugged it verbally while the person asked follow-up questions.

The coverage was predictably dominated by the voice stuff, because voice is easy to demo and impressive to watch. But if you work in ecommerce and you watched those announcements, there is a different set of questions worth asking.

The "o" in GPT-4o stands for "omni". According to OpenAI's announcement, the model doesn't just also handle images. It was trained end-to-end across text, vision, and audio as a single neural network, processing all inputs and outputs through the same model. That is architecturally different from the previous approach, which ran three separate models in sequence: one to transcribe audio, GPT-4 to reason, and another to convert text back to speech. The pipeline lost tone, emotional cues, and anything that wasn't words. GPT-4o collapses that into one pass.

What this means for commerce is still emerging. But here is where the near-term case is actually credible.

The Visual Search Problem Has a Cleaner Path Now

Image-based product search ("I saw someone wearing this, what is it?") has been a promised feature of ecommerce for years. Google Lens and Pinterest Lens are the closest implementations we have had, and they work reasonably well for identifying category and style but struggle with specific product matching.

The reason is contextual richness. A good visual search result requires understanding not just "this is a bag" but something closer to: "this is a leather tote with brass fittings, styled casually, likely in a mid-to-premium price range." The gap between those two descriptions is where visual search has historically broken down.

Native multimodal processing doesn't close that gap on its own. OpenAI reports that GPT-4o sets new benchmarks on vision understanding, and the architectural change (understanding image and text in a single pass rather than sequentially) does produce richer contextual interpretation in testing. But building a good visual search experience requires considerably more than a good model. You need a structured product catalogue with high-quality, contextually rich imagery, and most large catalogues are not in good shape on either count. The model being better does not fix that.

The Customer Service Image Case Is More Immediately Actionable

The demo that stuck with me most from the Spring Update coverage wasn't the voice conversations. It was something that didn't make most headlines: the customer service use case.

A customer takes a photo of a damaged product and sends it during a returns or complaints chat. Previously, an agent (human or AI) had to work from a text description of what went wrong. "Faulty" or "not as expected" covers an enormous range of actual situations: a seam split, a dye fault, wear and tear that shouldn't have happened yet, damage that was clearly prior to receipt. A text description of any of these is often vague and loses information.

Mid-market UK apparel retailers handle significant returns volumes where the stated reason is vague. Image-based intake, where the model can assess what is actually wrong from a photograph and route or resolve accordingly, is a concrete operational improvement. Not revolutionary, but the kind of thing that would have a measurable effect on handling time and routing accuracy.

Whether that requires GPT-4o specifically, or whether a well-tuned earlier model would do the same job adequately, is a fair question. The point is that the capability is now available at a cost and speed that makes it worth evaluating.

Product Photography Has a New Reader

This is the angle I see discussed least, but it matters for how retailers think about catalogue investment.

The standard advice for product photography (clean backgrounds, multiple angles, consistent lighting) was optimised for human eyes and for image classification systems that were essentially asking "what category is this thing in?" The implicit goal was: show the product clearly so the customer can recognise it.

A model that understands context, texture, scale relationships, and use-case cues from an image is reading your product photography in a substantially richer way. The question shifts from "can I see what this is?" to "can I understand what this is for, and who it is for?"

A product shot that shows a jumper flat on a white background tells a different story from the same jumper worn in a loosely styled way that signals it is an oversized, casual-fit piece. A model like GPT-4o will understand the difference. Whether your catalogue's imagery communicates what you want it to is worth checking, because increasingly that imagery is being read by models, not just people.

The Speed and Cost Change the Economics

OpenAI's announcement stated that GPT-4o is 2x faster and half the price of GPT-4 Turbo in the API. That sounds like an incremental performance metric. In ecommerce, where the volume of interactions (product queries, content generation, search queries, customer service chats) is large, those numbers change the economics of what is viable to automate.

A process that was borderline cost-effective at GPT-4 Turbo pricing may now genuinely pencil out. Several things that were "we'd need to see costs come down" conversations in early 2024 may now be worth revisiting.

What Is Worth Watching

The practical near-term opportunities are narrower than the demos suggest, but they are real.

Image-based returns triage is the most immediately actionable for mid-market retailers. The operational improvement is concrete and the model capability is now available at a workable cost point.

Richer product description generation is a lower-risk first step for most teams: using the model to generate descriptions that are more contextually specific and use-case aware. What good AI-era product content actually looks like was becoming clearer through 2024 and into 2025.

Visual search for consumers is the most interesting long-term direction but requires catalogue infrastructure most retailers are not ready for. The model being better doesn't change that.

The underlying shift is real: a model can now process images and text as a genuinely integrated thing rather than separate tasks bolted together. The demos were impressive for good reason. The question, as always, is what you can actually build with it today versus what looks good in a controlled environment.

That gap is still meaningful. It is just smaller than it was.