OpenAI Says GPT-5.5 Cuts Hallucinations 52%. Independent Testers Measured an 86% Rate. Both Are Correct.

OpenAI reported that GPT-5.5 produces 52.5% fewer hallucinated claims than its predecessor. Artificial Analysis independently tested GPT-5.5 and measured an 86% hallucination rate. Every outlet covered the first number. Not one covered both numbers and explained why they're compatible.

They are compatible. That's the problem.

What OpenAI Actually Measured

OpenAI's 52.5% claim comes from its internal evaluation on ChatGPT conversation logs that users had flagged for errors — specifically in medical, legal, and financial contexts — comparing GPT-5.5 Instant to GPT-5.3 Instant. It is a relative improvement figure on a narrow proprietary dataset measuring user-identified errors in high-stakes prompts.

This is a real thing. It is not a hallucination rate.

Artificial Analysis's AA-Omniscience benchmark measures something different: whether a model makes up false facts when asked, without any supporting context, about real-world entities it should know. On that test, GPT-5.5 hallucinated 86% of the time. Gemini 3.1 Pro came in around 50%. Claude Opus 4.7 came in at 36%.

This is also a real thing. Unlike OpenAI's figure, it is a hallucination rate.

The two measurements are testing genuinely different failure modes. OpenAI is measuring "did GPT-5.5 make fewer errors in conversations where users were complaining about errors." Artificial Analysis is measuring "does GPT-5.5 invent facts when it doesn't know the answer." A model can improve dramatically on one without moving at all on the other — or can improve on one by being trained to be more cautious in certain contexts while becoming more confident (and wrong) on ungrounded factual queries.

Both can be simultaneously true. OpenAI's statement is technically accurate. The coverage was incomplete.
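The arithmetic behind that compatibility is worth seeing once. The sketch below is an illustration, not OpenAI's or Artificial Analysis's methodology: the baseline rates are hypothetical numbers I picked, and only the 52.5% reduction comes from the reporting. It shows that a relative reduction says nothing about the absolute rate you end up with, even before you account for the two figures coming from different datasets.

```python
def relative_reduction(old_rate: float, new_rate: float) -> float:
    """Fraction by which new_rate is lower than old_rate."""
    return (old_rate - new_rate) / old_rate


REPORTED_REDUCTION = 0.525  # the relative improvement OpenAI reported

# Hypothetical baseline error rates, chosen only for illustration.
for old_rate in (0.10, 0.40, 0.90):
    new_rate = old_rate * (1 - REPORTED_REDUCTION)
    reduction = relative_reduction(old_rate, new_rate)
    print(f"old={old_rate:.1%}  new={new_rate:.2%}  reduction={reduction:.1%}")
```

Every row reports the same 52.5% improvement, while the resulting rate ranges from under 5% to over 40%. A relative figure on one dataset cannot confirm or contradict an absolute figure on another.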

The Terminal-Bench Win Is Real

Let me be clear about what GPT-5.5 actually accomplished, because the hallucination story should not obscure it.

Terminal-Bench 2.0 is governed by Stanford and the Laude Institute — an academic benchmark with no commercial stake in the result. Multiple frontier labs, including OpenAI, Anthropic, and Google, provided input during development, which is imperfect but meaningfully better governance than benchmarks one lab designed and funded. GPT-5.5 scored 82.7%. Claude Opus 4.7 scored 69.4%. The 13.3-point gap is the largest single-benchmark gap between today's frontier coding models.

Terminal-Bench evaluates multi-step agentic coding tasks in a real bash environment — not synthetic problems but actual tool-use chains. An agent that runs 13 points better on real command-line workflows is a practically meaningful improvement.

The competitive picture, though, is less clean than the headline suggests. Claude Opus 4.7 leads GPT-5.5 on SWE-Bench Pro (64.3% vs. 58.6%), which tests real GitHub issues and is the benchmark most predictive of what enterprise engineering teams actually care about. Opus 4.7 also leads on MCP Atlas (79.1% vs. 75.3%). It scored 46.9% on Humanity's Last Exam, but GPT-5.5's score on that benchmark was not reported in OpenAI's materials, so no direct comparison is possible there. The honest headline is not "GPT-5.5 leads" but "GPT-5.5 takes Terminal-Bench; Opus 4.7 keeps SWE-Bench Pro."

Both models are good. They win on different evaluations. Coverage chose a narrative.
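For readers who want the scoreboard in one place, here is a minimal sketch that tallies the figures quoted in this piece. The scores are the ones reported above; the `SCORES` table and `leader` helper are my own illustrative names, and a missing score is treated as "no comparison" rather than as a loss.

```python
# Scores in percent, as quoted in this article. None means "not reported".
SCORES = {
    "Terminal-Bench 2.0": {"GPT-5.5": 82.7, "Claude Opus 4.7": 69.4},
    "SWE-Bench Pro": {"GPT-5.5": 58.6, "Claude Opus 4.7": 64.3},
    "MCP Atlas": {"GPT-5.5": 75.3, "Claude Opus 4.7": 79.1},
    "Humanity's Last Exam": {"GPT-5.5": None, "Claude Opus 4.7": 46.9},
}


def leader(results):
    """Return (model, margin in points), or None if any score is missing."""
    if any(score is None for score in results.values()):
        return None
    ranked = sorted(results.items(), key=lambda item: item[1], reverse=True)
    (top_model, top_score), (_, runner_up_score) = ranked[0], ranked[1]
    return top_model, round(top_score - runner_up_score, 1)


for benchmark, results in SCORES.items():
    outcome = leader(results)
    if outcome is None:
        print(f"{benchmark}: no direct comparison")
    else:
        model, margin = outcome
        print(f"{benchmark}: {model} by {margin} points")
```

Laid out this way, the split is one benchmark for GPT-5.5, two for Opus 4.7, and one with no comparison available.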

The Free Tier Is the More Strategically Significant Move

The Terminal-Bench lead is real. The more durable competitive move is that GPT-5.5 Instant became the default model for all free ChatGPT users on May 5, putting a frontier-class model in front of roughly 400 million people at zero cost.

This raises the bar for every competing free tier. Anthropic's free Claude, Google's free Gemini, any other consumer AI product now competes with a frontier model that costs the user nothing. The competitive moat from free distribution compounds: more users → more usage data → faster iteration. OpenAI has been losing consumer market share to Gemini for months, but giving away frontier-class performance for free is the kind of move that stabilizes consumer retention regardless of benchmark comparisons.

GPT-5.5 Instant at $5/$30 per million tokens in the API is priced at parity with Anthropic on input and 20% higher on output ($30 vs. Opus 4.7's $25). Developers will choose based on which benchmark they trust more: Terminal-Bench or SWE-Bench Pro. That is a reasonable disagreement.
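How much the output premium matters depends on the shape of the workload. The sketch below uses the per-million-token prices quoted above; the Opus 4.7 input price of $5 is inferred from the "parity" comparison, and the token volumes are hypothetical.

```python
# USD per million tokens as (input, output), taken from the figures above.
PRICES = {
    "GPT-5.5 Instant": (5.0, 30.0),
    "Claude Opus 4.7": (5.0, 25.0),  # input price inferred from "parity"
}


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """API cost in USD for a given number of input and output tokens."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical monthly workload: 2M input tokens, 1M output tokens.
for model in PRICES:
    print(f"{model}: ${cost_usd(model, 2_000_000, 1_000_000):.2f}")
```

On that hypothetical mix the bill is $40 versus $35. The gap widens for output-heavy agentic work and shrinks toward zero for input-heavy retrieval.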

The Number Nobody Interrogated

GPT-5.5's reported ARC-AGI-2 score is 85.0%. ARC-AGI-2 is designed by François Chollet specifically to resist optimization — it tests novel pattern recognition that is supposed to be hard to train against. A score of 85.0% would be extraordinary: the kind of result that the AI community would have considered impossible 24 months ago.

It appeared in secondary coverage as a table entry. Zero outlets investigated whether it's independently verified. The brief I worked from noted it is unverified — no independent replication found.

I don't know if the ARC-AGI-2 claim is accurate. I do know that a claim this significant on a benchmark specifically designed to be hard to game deserves more scrutiny than a paragraph in a launch roundup.

The Broader Problem

The hallucination measurement gap is not specific to GPT-5.5. Every frontier lab uses different evaluation methodology to report reliability improvements. OpenAI measures flagged user conversations. Artificial Analysis measures ungrounded factual recall. Anthropic has its own internal reliability suite. Google measures factual consistency against verified corpora.

None of these is wrong. All of them measure something real. None of them is the same thing.

As reliability becomes the primary marketing axis for frontier AI — a trend that accelerates as models converge on capability benchmarks — this fragmentation will get worse. The next 12 months will see every major lab claiming to have "fewer hallucinations" than its competitors, using measurements that are technically accurate and mutually incomparable.

The field needs a standardized hallucination benchmark with independent governance, the same way Terminal-Bench 2.0 improved on SWE-Bench by introducing academic oversight. Until that exists, "fewer hallucinations" means whatever the lab that measured it decided it means.

GPT-5.5 is a good model. The 52.5% hallucination improvement is real in the narrow sense OpenAI defined. The 86% hallucination rate is also real in the broader sense AA-Omniscience defined. Readers deserve to know that both sentences are true at the same time.