GPT-5 'Predicted' an Immunology Experiment. The Word 'Predicted' Is Doing a Lot of Work.

OpenAI's most compelling science story this year is sourced exclusively from OpenAI.

Derya Unutmaz, an immunologist at The Jackson Laboratory for Genomic Medicine, used GPT-5 Pro to simulate a month-long experiment: engineering human CD8+ memory T cells to carry an anti-CD19 chimeric antigen receptor with a doxycycline-inducible activation mechanism targeting lymphoma. GPT-5 predicted the outcome — a boost in CD8+ cells' ability to kill lymphoma cells. Unutmaz says the prediction was "astonishingly accurate." OpenAI published this as the lead case in its "science acceleration" white paper. Every outlet covering it described this as AI predicting unpublished science.

Unutmaz is a real scientist with a real lab and real rigor. He runs parallel human-only analysis streams alongside his AI-assisted work — a genuine control, not a performance. He's not an AI hype merchant. The experiment is described with specificity that would survive basic scientific scrutiny. These are credibility signals that matter, and they distinguish this story from most "AI helps science" PR.

They don't settle the contamination question. Nobody is asking the contamination question.

What contamination means here

The claim is that GPT-5 predicted a result "before publication" — implying the result couldn't have been in the model's training data. That inference requires one additional step: establishing that the result wasn't derivable from published literature that was in the training data.

Anti-CD19 CAR-T therapy targeting B-cell lymphoma is one of the most extensively published treatment areas in oncology. Kymriah and Yescarta — commercial anti-CD19 CAR-T products — are cited in thousands of papers. CD8+ T cell activation dynamics, Tet-inducible expression systems, and doxycycline-mediated CAR engagement have individual bodies of literature running to hundreds of papers each. The intersection of these approaches in lymphoma models has been studied across multiple research groups for over a decade.

The question is not whether GPT-5 "saw" Unutmaz's specific unpublished result. It's whether the mechanistic knowledge required to predict his result was available in the published literature that GPT-5 was trained on. If a senior immunologist with deep CAR-T expertise and PubMed access could predict the same outcome from existing literature — without seeing GPT-5's response — then GPT-5's "prediction" is sophisticated retrieval and synthesis, not generalization beyond the training distribution.

That is a different capability than the one being described. It is still genuinely useful. It is not "AI reasons about biology from first principles."

OpenAI's white paper does not include a methodology for ruling out literature derivability. The blog post does not address it. No coverage has raised it. The verification chain is: Unutmaz said it was surprising → OpenAI published his account → coverage accepted the framing.

What independent verification would look like

The experiment is sufficiently described that a test is possible. Take three senior immunologists with deep CAR-T expertise, give them the experimental design Unutmaz described, and ask them to predict the outcome from their knowledge of the published literature alone. No GPT-5, no internet: just domain knowledge and whatever is in their heads from years of reading in this area.

If they can't predict it reliably, GPT-5's result becomes substantially more interesting. The model would be demonstrating something beyond literature retrieval — mechanistic reasoning that outperforms expert intuition in a specialized domain.

If they can predict it about as reliably as GPT-5 did, the story changes: GPT-5 is matching expert intuition at lower cost and with faster turnaround. That is still commercially valuable: it means researchers with less CAR-T specialization can access expert-level predictions without an expert collaborator. But it's a different claim than "AI predicts unpublished science."

Neither outcome has been tested. Until one is, the coverage is epistemically ahead of the evidence.
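
The decision logic of the panel test described above is simple enough to write down. What follows is a minimal sketch, not a protocol anyone has run: the names PanelResult and interpret_panel, and the majority threshold of 0.5, are assumptions made for illustration. With three experts and a single experiment, the output is a rough reading, not a statistical result.

```python
from dataclasses import dataclass


@dataclass
class PanelResult:
    """Outcome of a hypothetical blind expert panel."""
    expert_hits: int   # experts whose blind prediction matched the observed result
    panel_size: int    # number of experts polled


def interpret_panel(result: PanelResult, majority: float = 0.5) -> str:
    """Map a panel result onto the two readings discussed in the text.

    The `majority` cutoff is an illustrative assumption, not an
    established standard for this kind of comparison.
    """
    if result.panel_size <= 0:
        raise ValueError("panel_size must be positive")
    if not 0 <= result.expert_hits <= result.panel_size:
        raise ValueError("expert_hits must be between 0 and panel_size")

    hit_rate = result.expert_hits / result.panel_size
    if hit_rate > majority:
        # Most experts got there from the literature alone.
        return "literature-derivable: the model matches expert synthesis"
    # Experts could not reliably predict the outcome unaided.
    return "not reliably derivable: the model's result is more interesting"


if __name__ == "__main__":
    print(interpret_panel(PanelResult(expert_hits=3, panel_size=3)))
    print(interpret_panel(PanelResult(expert_hits=0, panel_size=3)))
```

A real version would need many more experiments than one, so that expert and model hit rates could be compared rather than a single yes or no.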

Where this sits in the bigger picture

Last week I covered NatureBench, a benchmark showing that frontier AI coding agents beat published SOTA on 17.8% of real scientific tasks. I argued that the Unutmaz case represents the conditions behind that best-case 17.8%: expert supervision, focused domain, curated context, guided workflow. Both things can be true at once: AI performs well in expert-supervised, narrow-domain settings and struggles across diverse autonomous research tasks.

The UCL study (November 2024) found that LLMs outperform human experts at predicting neuroscience experimental results in a systematic evaluation. That's a peer-reviewed finding with a blind protocol and a meaningful sample. It raises the prior probability that Unutmaz's case is real. It doesn't establish that this specific case represents the same phenomenon.

AlphaFold's track record is the comparison that puts the Unutmaz case in context. DeepMind has published peer-reviewed papers on AlphaFold's predictions, submitted them to independent experimental validation, and built a systematic track record of protein structure predictions that held up. The verification chain runs from prediction → blind experimental test → publication. The Unutmaz case, as currently documented, runs from prediction → researcher says it was accurate → OpenAI says so too.

What would make this a landmark

If Unutmaz put this through a peer-reviewed publication protocol (submitting the prediction before additional experiments, having an independent group verify the blind comparison, and publishing the methodology alongside the result), the epistemic status of this story would change completely. If that paper lands, it's the most significant single-lab demonstration of LLM scientific reasoning published so far.

If it doesn't reach peer review, the anecdote will remain compelling and unresolved. OpenAI selected it as their strongest case; if the strongest case is an N=1 with no blind protocol and no independent verification, the "AI accelerates science" thesis rests on softer ground than the coverage implies.
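
One way to make "submitting the prediction before additional experiments" checkable is a hash commitment: publish a digest of the prediction up front, then reveal the text once the result is in. The sketch below is an illustration under stated assumptions, not something Unutmaz or OpenAI is known to have done. The function names commit and verify are hypothetical, and a real pre-registration would also need a trusted timestamp from a third party.

```python
import hashlib
import hmac
import secrets
from typing import Optional, Tuple


def commit(prediction: str, nonce: Optional[str] = None) -> Tuple[str, str]:
    """Return (digest, nonce) for a prediction.

    The digest can be published before the experiment; the nonce and
    prediction text stay private until the result is known. The random
    nonce prevents anyone from guessing short predictions by brute force.
    """
    if nonce is None:
        nonce = secrets.token_hex(16)
    digest = hashlib.sha256(f"{nonce}:{prediction}".encode("utf-8")).hexdigest()
    return digest, nonce


def verify(prediction: str, nonce: str, digest: str) -> bool:
    """Check that a revealed prediction and nonce match the published digest."""
    candidate = hashlib.sha256(f"{nonce}:{prediction}".encode("utf-8")).hexdigest()
    return hmac.compare_digest(candidate, digest)


if __name__ == "__main__":
    # Placeholder text only; not the actual prediction from the case.
    text = "example prediction text"
    published_digest, private_nonce = commit(text)
    print(verify(text, private_nonce, published_digest))         # True
    print(verify("edited later", private_nonce, published_digest))  # False
```

The commitment only proves the prediction was not edited after the fact. It says nothing about whether the outcome was derivable from published literature, which is the separate question the panel test addresses.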

Unutmaz said the experience might cut discovery timelines "from years to weeks." I think that's plausible — and not just because GPT-5 predicted his result correctly, whatever the mechanism. The ability to run in-silico simulations, generate hypotheses quickly, and reduce the set of experiments worth running has genuine value even if the underlying capability is literature synthesis rather than biological reasoning. The economic case for AI in drug discovery doesn't require first-principles generalization. It requires accuracy high enough to reduce failed experiments.

But the epistemic case — "AI predicts unpublished science in a way that couldn't have been derived from existing literature" — requires the contamination question to be answered. It hasn't been.