Among the many steps along the road to high-performance AI, one of the most important was taken in 2007 by Fei-Fei Li, then an assistant professor in Princeton’s computer science department. Using Amazon’s Mechanical Turk service to amass many millions of small acts of human judgment, Li built a vast database of hand-labelled images.
“We settled on a goal of 1,000 different photographs of every single object category,” she writes in her autobiography The Worlds I See. “One thousand different photographs of violins. One thousand different photographs of German shepherds.”
The database, ImageNet, was released in 2009, and Li started a competition for researchers to build the best image-recognition algorithms. A few years later, a graduate student named Alex Krizhevsky, advised by AI pioneer Geoffrey Hinton, trained a neural network on ImageNet — and blew the competition away.
Neural networks had been languishing for decades: a clever idea, but computers were too slow and datasets too small. But Li’s dataset was different. It had seemed foolishly, grandiosely, uselessly large; it turned out to be the perfect input for a neural net. It was a demonstration of the power of data combined with the power of neural nets, and a vindication of Li’s idea of using human judgment to apply millions of labels to a vast collection of images. The lesson: if you can measure it, you can automate it.
But image recognition neural nets proved brittle in unexpected ways. A 2015 paper, “Deep Neural Networks are Easily Fooled”, asked a state-of-the-art system to classify example after example of pure static. “Robin,” said the network, with more than 99.5 per cent certainty, as it looked at random noise. “Armadillo.” “Peacock.” The problem was that the network had only ever seen meaningful images, and confidently identified meaning where there was none.
This is an example of the “jagged frontier” of AI capability: models can be stunningly good at one task and gravely disappointing at another, as with neural nets confronted with static.
That jagged capability is not a problem in itself. “All technologies are good at some things and bad at others,” says Joshua Gans, economist and co-author of Prediction Machines. It’s best to use can openers to open soup cans and hammers to drive nails into walls, and not the other way round. But, adds Gans, “the difficulty is that with AI, we don’t know which is which”.
This raises the question: how do we know that the AI is doing a good job? It was easy to see the problem when a neural net was labelling static an armadillo. But how impressive is the response to that request to create an image of Joan of Arc in the style of Edward Hopper? Did that agent actually make a restaurant reservation, or did it reserve nothing except a space in my calendar? Are the business plan and pitch deck I requested persuasive, or full of holes, or — perhaps the worst case — persuasive and full of holes?
The most problematic cases are the ones where it is hard to know whether the AI has done a good job, and expensive if it turns out that it has not. If AI writes buggy code or clumsy prose, that can be spotted and fixed. If the code contains hidden security vulnerabilities, the prose is packed with fabricated facts or plagiarised phrases, or the structural engineering calculations seem fine but the building will collapse in the first storm, that is a problem. It is still a problem even if the mistakes are rare and the average quality excellent. These difficulties only become more acute as AI becomes more capable, because more challenging tasks are often more difficult to evaluate.
Two new working papers address the tricky issue of verifying quality. In “Some Simple Economics of AGI”, Christian Catalini, Xiang Hui and Jane Wu (assisted, sometimes gratingly, by generative AI) propose the inevitable 2×2 matrix in which economic activity can be easy to automate, easy to verify, both, or neither. Automatable, verifiable output is the stuff that computers do for us. The non-automatable stuff remains reassuringly artisanal.
The difficult quadrant is where tasks are easy to automate but hard to verify. Catalini, Hui and Wu call this the “runaway risk zone”. It is not a reassuring label and it is not meant to be. The problem of verifying quality is not a new one: think about building contractors, second-hand cars or a restaurant in a tourist hotspot. In such contexts, low quality often takes over the market like knotweed, because the best providers struggle to prove that they are the best.
Solutions include reviews, word of mouth, or long-trusted brands. (Not for nothing do familiar brands such as Durex and Trojan dominate the market for condoms. Nobody wants an unpredictable condom.) In big projects with high stakes, it can help to have the option to sue some counterparty with deep pockets. But none of these solutions is ideal, and the danger is that AI produces plausible slop in such vast quantities that it outpaces our capacity to check it. Create enough hallucinated legal arguments, flawed engineering calculations and backdoor-ridden code, and the vats fill faster than we can tell good work from bad.
In the second paper, “A Model of Artificial Jagged Intelligence”, Joshua Gans offers an analogy in which asking AI to perform a task is like trying to cross a river over a network of planks supported by occasional pylons. The jagged frontier is represented by the fact that some planks are long and wobbly, while others are short and sturdy. Problem one: even if the planks are typically sturdy, the wobbly planks will require most of your time and attention. Problem two: if you can’t predict in advance which planks will let you down, you may quite sensibly prefer to eschew the AI entirely and row yourself across the old-fashioned way.
As Gans rightly points out, Silicon Valley’s AI firms have mostly been trying to raise the average performance of AI systems — that is, to make all the planks sturdier. It might be better, instead, to focus on stiffening the wobbliest ones. But that assumes you know which they are, which points to a third approach: improve the predictability of the system. If you know in advance where the wobbly planks are, they’re not nearly as dangerous.
If.
Written for and first published in the Financial Times on 18 March 2026.
I’m running the London Marathon in April in support of a very good cause. If you felt able to contribute something, I’d be extremely grateful.
