AI Content at Scale: When Good Enough Isn't

AI-generated product content is now standard for large catalogues. The tooling works. The quality variance is the problem nobody planned for, and on a 50,000-SKU catalogue, even a 1% error rate is 500 wrong product descriptions.

Sarah Chen

Senior Editor

26 January 2026

I want to talk about the part that doesn't make it into the vendor case studies.

AI content generation for product catalogues works. The speed advantages are real. A retailer that previously spent three weeks getting product descriptions written, reviewed, and approved for a new range can now get a first draft in hours and spend the remaining time on quality control rather than generation. For large catalogues, this has been genuinely transformational for time-to-market and for teams that were previously drowning in backlog.

And then there's the other part.

Earlier generations of AI tools had meaningfully higher error rates when producing product specifications: wrong materials, invented certifications, dimensions that didn't match reality. Current frontier models, in 2026, have brought that down considerably, with informed estimates putting the figure at around 1 to 2 percent. Which sounds like an enormous improvement, and it is. But on a catalogue of 50,000 SKUs, 1 percent is still 500 product descriptions that contain something factually incorrect. Not obviously wrong (not "this television can fly" wrong), but quietly wrong in the specific detail that makes a product description actually useful. The wrong material composition. An invented certification. A size that doesn't exist. "Italian leather" on something that is not Italian and is not leather.
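The arithmetic is worth making explicit, because it scales linearly with catalogue size. A minimal sketch in Python: the function name is illustrative, and the 1 to 2 percent range is the estimate quoted above, not a measured figure.

```python
def expected_wrong_descriptions(sku_count: int, error_rate: float) -> int:
    """Expected number of descriptions containing a factual error."""
    return round(sku_count * error_rate)


# The 1 to 2 percent range is an estimate, so show both ends.
for rate in (0.01, 0.02):
    count = expected_wrong_descriptions(50_000, rate)
    print(f"{rate:.0%} of {50_000:,} SKUs -> {count:,} descriptions")

# 1% of 50,000 SKUs -> 500 descriptions
# 2% of 50,000 SKUs -> 1,000 descriptions
```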

The model is optimising for "plausible product description", not "accurate product description." Those are not the same thing, and the distinction matters rather a lot when customers are making purchasing decisions based on what you've written.

The Blandification Effect

The specification problem is the one that gets you into trouble with trading standards and customer returns. But there's a subtler quality problem that's arguably more damaging for brand-led retailers: AI copy regresses to the mean.

The training data for these models is the entire corpus of product description writing that exists on the internet, most of which is neither particularly good nor particularly distinctive. The output reflects that. Quirky brands produce content that sounds like every other quirky brand. Technical brands lose the specific precision that makes their content credible to their actual audience. The distinctive voice that someone spent years developing gets sanded smooth because the model has learned that "smooth" is what product descriptions sound like.

This is not a problem you can solve by prompting alone, though prompt engineering helps at the margins. The deeper issue is that brand voice is accumulated institutional knowledge. Knowing which words the brand would never use. Knowing the rhythm of a sentence that fits the aesthetic. Knowing when to be funny and when not to be. That knowledge doesn't transfer cleanly to a context window.

Some brands have addressed this by building brand voice frameworks into their generation workflow: style guides, approved vocabulary lists, examples of copy the brand would never write, and specific guidance on how the brand handles technical specifications. This helps significantly with voice consistency. The brands that make it work have usually spent years producing good human-written copy and have enough of it to establish real patterns. The ones starting from scratch with no archive and no voice documentation are generating into a void.
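One part of such a framework is easy to automate: checking generated copy against a list of words the brand would never use. The sketch below is an assumption about how that might look, not a description of any particular retailer's tooling; the function name and the example terms are hypothetical.

```python
import re


def find_banned_terms(copy: str, banned_terms: list[str]) -> list[str]:
    """Return the banned terms present in the copy.

    Matching is case-insensitive and on word boundaries, so "elevate"
    does not match inside "elevated".
    """
    found = []
    for term in banned_terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, copy, flags=re.IGNORECASE):
            found.append(term)
    return found


# Hypothetical vocabulary list taken from a brand style guide.
banned = ["luxurious", "game-changing", "elevate"]
draft = "Elevate your everyday with this luxurious tote."
print(find_banned_terms(draft, banned))  # ['luxurious', 'elevate']
```

A check like this only catches exact terms. It says nothing about rhythm or tone, which is why it supplements an editor rather than replacing one.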

The Editorial Gap

The thing that often gets cut when AI content generation goes into production at scale is the editorial layer. The logic is understandable: if you're generating content ten times faster, you should be able to review it ten times faster too, so the net headcount is the same. This is not how it works in practice.

Reviewing AI-generated content for factual accuracy requires domain expertise, not just editorial instinct. A human reviewer who isn't a subject matter expert can catch obvious errors and clunky phrasing, but won't necessarily catch the confidently stated wrong specification. You need someone who actually knows what the product is meant to do and what it's made of: a buyer, a product manager, or someone who's been briefed specifically. Not just someone with good grammar and a deadline.

The retailers who are getting this right have generally built their workflows around an exception-based model. AI generates; a rule-based system flags anything that deviates from expected ranges (materials that don't match the product type, dimensions that seem implausible, certifications that require verification); and human reviewers focus their time on the flagged items rather than reading everything. The unflagged items still get spot-checked: with an error rate this low, a random sample is enough to show whether the rules are missing a whole class of error, while the systematic review stays targeted.
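The exception-based routing described above could look something like the following. This is a sketch under stated assumptions: the `Draft` fields, the rule tables, and the 5 percent spot-check rate are all hypothetical, and a real system would load its rules from product data rather than hard-coding them.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Draft:
    sku: str
    category: str
    materials: set[str]
    longest_side_cm: float
    certifications: set[str] = field(default_factory=set)


# Hypothetical rule tables, for illustration only.
EXPECTED_MATERIALS = {
    "handbag": {"leather", "microfibre", "canvas"},
    "mug": {"ceramic", "stoneware", "glass"},
}
MAX_LONGEST_SIDE_CM = {"handbag": 70.0, "mug": 25.0}


def flag_reasons(draft: Draft) -> list[str]:
    """Return the reasons a draft needs expert review (empty if none)."""
    reasons = []
    # An unknown category has no expected materials, so it is always flagged.
    unexpected = draft.materials - EXPECTED_MATERIALS.get(draft.category, set())
    if unexpected:
        reasons.append(f"unexpected materials: {sorted(unexpected)}")
    limit = MAX_LONGEST_SIDE_CM.get(draft.category)
    if limit is not None and not (0 < draft.longest_side_cm <= limit):
        reasons.append(f"implausible dimension: {draft.longest_side_cm} cm")
    if draft.certifications:
        reasons.append(f"certifications to verify: {sorted(draft.certifications)}")
    return reasons


def route(drafts: list[Draft], spot_check_rate: float = 0.05, seed: int = 0):
    """Split drafts into flagged items and a random spot-check sample."""
    flagged: dict[str, list[str]] = {}
    unflagged: list[str] = []
    for draft in drafts:
        reasons = flag_reasons(draft)
        if reasons:
            flagged[draft.sku] = reasons
        else:
            unflagged.append(draft.sku)
    # Sample at least one unflagged item whenever any exist.
    sample_size = 0
    if unflagged:
        sample_size = min(len(unflagged), max(1, round(len(unflagged) * spot_check_rate)))
    spot_checks = random.Random(seed).sample(unflagged, sample_size)
    return flagged, spot_checks
```

Every certification claim is flagged here regardless of content, on the assumption that a certification always has to be checked against a source document before it is published.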

This is a more expensive workflow than "AI generates, light human review, publish." It's also considerably cheaper than the original all-human process, and it produces higher-quality output than "AI generates, auto-publish", which some teams are running in the name of speed.

The Search and AI Visibility Dimension

There's a machine-readability angle to this that I wrote about last week. AI-generated content that contains errors and inconsistencies scores poorly on the machine-readability metrics that determine whether AI discovery systems surface your products accurately. This creates a compounding problem: not only do you have wrong product descriptions visible to customers who land on your site, but those wrong descriptions also degrade your visibility in the AI-mediated discovery layer that's driving an increasing share of your traffic.

A product page that says "Italian leather" when the product is synthetic microfibre gets served to people looking for leather goods, who then return it, or worse, review it poorly. It also potentially gets served to people asking an AI assistant for Italian leather products. Which is a separate version of the same problem.

The quality problem radiates outward from the content itself.

What This Actually Requires

None of the solutions here are exotic. They're mostly just discipline:

Clean source data in the PIM before generation. If your raw product data has inconsistencies and gaps, the AI will fill those gaps with plausible-sounding invented content. The generation step cannot fix upstream data quality problems.

Verification workflows that are proportionate to product category risk. High-value products, technically complex products, and products with regulatory requirements (food, cosmetics, children's products) need more rigorous review. Sports accessories need less.

Periodic audit of published AI-generated content against actual product specifications. Not just before publishing but after, because product specifications change and generated content doesn't update itself.

Brand voice documentation that's specific enough to be useful in a prompt. Not "write in a warm, conversational tone" but examples of approved sentences, lists of words the brand would and wouldn't use, and specific guidance on how the brand handles technical specifications.
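Two of these checks are mechanical enough to sketch: the gap check before generation and the audit after publishing. The code below assumes that a snapshot of the spec fields is stored alongside each published description. That storage scheme, the field names, and the function names are all hypothetical.

```python
# Hypothetical list of fields a description must never be generated without.
REQUIRED_FIELDS = ("material", "dimensions", "origin")


def missing_fields(record: dict, required=REQUIRED_FIELDS) -> list[str]:
    """Fields that are absent or blank in the source record.

    These are the gaps a model would otherwise fill with invented detail.
    """
    return [name for name in required if not str(record.get(name) or "").strip()]


def stale_skus(published_snapshots: dict, current_specs: dict) -> dict:
    """Map each SKU to the spec fields that changed since its copy was generated."""
    stale = {}
    for sku, snapshot in published_snapshots.items():
        current = current_specs.get(sku, {})
        changed = sorted(
            key
            for key in snapshot.keys() | current.keys()
            if snapshot.get(key) != current.get(key)
        )
        if changed:
            stale[sku] = changed
    return stale


# Before generation: hold back records with gaps.
print(missing_fields({"material": "ceramic", "dimensions": ""}))
# ['dimensions', 'origin']

# After publishing: find copy written against specs that have since changed.
snapshots = {"BAG-1": {"material": "leather", "origin": "Italy"}}
current = {"BAG-1": {"material": "microfibre", "origin": "Italy"}}
print(stale_skus(snapshots, current))
# {'BAG-1': ['material']}
```

The audit compares structured spec fields, not the prose itself, so it shows which descriptions need regenerating or re-reading but does not confirm that the published wording is accurate.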

The AI content machine is running. For most retailers at scale, it needs to keep running; going back to fully manual content production isn't the answer. But "it's running" and "it's running well" are meaningfully different, and the gap between them is not going away on its own.
