Andrew Gelman on Uncertainty in AI Evaluations
Very serious organizations doing some unserious things
AI is full of “hard to measure” risk, i.e. uncertainty. So back in October, we published a post entitled “Risk without Uncertainty? OpenAI would like us to think so,” which examined how AI model evaluations often fail to account for statistical uncertainty despite assessing hard-to-measure capabilities and catastrophic risks. I asked Andrew Gelman, whose lively Bayesian data analysis course I took during my economics Ph.D., what he thought of this weird omission. He said he would publish his thoughts on his blog in the coming months when a slot opened up, and that time has finally come.
The Problem of Neglected Uncertainty
In a nutshell, our argument was this: given how uncertain generative AI model outputs (and the capabilities inferred from them) are, and given how uncertain the existential risks being tested for are, the statistical tests conducted by AI companies are inadequate because they ignore that uncertainty. Their evaluations of LLMs typically provide single point estimates without any accompanying interval to quantify the uncertainty. This omission is particularly alarming when they try to evaluate extreme, low-probability risks like “the AI will gain autonomy”.
Accounting for uncertainty is the centerpiece of statistics: we work with samples rather than complete populations, and our understanding of complex, changing phenomena is inherently imprecise.
So what did Andrew Gelman have to say (emphasis added)?
I don’t have any answer of my own to the question, “Why don’t machine learning and large language model evaluations report uncertainty?”, for the simple reason that I don’t know enough about machine learning and large language model evaluations in the first place. I imagine they do report uncertainty in some settings. My general recommendation to people running machine learning models is to replicate using different starting points. This won’t capture all uncertainty, not even all statistical uncertainty (you can see this by considering a simple example such as linear least squares, which converges to the same solution no matter where you start it, so in that case my try-different-starting-values trick won’t get you anywhere at all); rather, I think of this as a way to get an approximate lower bound on Monte Carlo uncertainty of your output. To get more than that, I think you need some explicit or implicit modeling of the data collection process. An explicit model goes into a likelihood function which goes into a Bayesian analysis which produces uncertainty. An implicit model could be instantiated by repeating the computation using different datasets as obtained by cross validation or bootstrapping.
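To make the bootstrapping suggestion concrete, here is a minimal sketch in Python. The 0/1 grading results are synthetic stand-ins for a real benchmark run; the point is only that resampling questions with replacement turns a single accuracy number into an interval.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic grading results: 1 = question answered correctly, 0 = not.
# In a real evaluation these would come from an actual benchmark run.
results = rng.binomial(1, 0.72, size=200)

point_estimate = results.mean()

# Nonparametric bootstrap: resample questions with replacement, recompute accuracy.
boot_accuracies = [
    rng.choice(results, size=results.size, replace=True).mean()
    for _ in range(10_000)
]
low, high = np.percentile(boot_accuracies, [2.5, 97.5])

print(f"accuracy = {point_estimate:.3f}")
print(f"95% bootstrap interval = [{low:.3f}, {high:.3f}]")
```

None of this is sophisticated, which is rather the point: the resampling takes a few lines and a few seconds of compute, yet most reported eval results still stop at the point estimate.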
Other Perspectives
As always, the comments responding to Gelman on his blog are instructive; that’s where the discussion gets rich. Here are two choice comments, the first from Ted Sanders, who works at OpenAI:
I don’t think there’s any deep answer – it’s a mix of bad historical convention and laziness and trying to write for a worldwide audience. Internally we of course measure error bars on our LLM evaluation results. But as with all errors, there can be nuance and confusion over what they mean. E.g., if you want to put error bars on a GPQA diamond result, in which you, say:
– have 3 topics
– have 150 questions
– have 10 samples per question
– have 1 model trained.
Then even a notion of “sample size” can be ambiguous, and you need to make sure any error bars you publish appropriately communicate your model of clustered errors, etc.
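Sanders’s clustering point deserves unpacking. With 10 samples per question, the 1,500 resulting scores are not 1,500 independent observations, because samples drawn from the same question share that question’s difficulty, so a naive standard error will be too small. Here is a rough sketch (synthetic data, and certainly not OpenAI’s internal methodology) of one standard fix: treat the question, not the individual sample, as the unit of analysis.

```python
import numpy as np

rng = np.random.default_rng(1)

n_questions, n_samples = 150, 10   # a GPQA-style eval with repeated sampling

# Synthetic data: each question gets its own difficulty, so the 10 samples
# drawn for the same question are correlated with one another.
question_difficulty = rng.beta(2, 2, size=n_questions)
scores = rng.binomial(1, question_difficulty[:, None], size=(n_questions, n_samples))

accuracy = scores.mean()

# Naive SE: pretends all 1,500 scores are independent draws.
naive_se = scores.std(ddof=1) / np.sqrt(scores.size)

# Clustered SE: average within each question first, then compute the standard
# error across questions, the level at which draws are actually independent.
question_means = scores.mean(axis=1)
clustered_se = question_means.std(ddof=1) / np.sqrt(n_questions)

print(f"accuracy     = {accuracy:.3f}")
print(f"naive SE     = {naive_se:.4f}")
print(f"clustered SE = {clustered_se:.4f}")
```

With meaningful between-question variation, the clustered standard error comes out noticeably larger than the naive one, which is exactly the nuance Sanders is flagging.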
Jessica Hullman, from Northwestern University, points in the comments to a useful paper she co-authored with Andrew and others, which frames the issue more broadly as a reproducibility problem in machine learning and AI, requiring disclosure of a model’s training runs, training data, hyperparameters, and more (see comment). The paper argues that “many of the errors recently discussed in ML expose the cracks in long held beliefs that optimizing predictive accuracy using huge datasets absolves one from having to consider a true data generating process or formally represent uncertainty in performance claims”. The concluding Section 5 of the paper is worth a read for those fluent in the language of machine learning (ML) and statistics.
Signs of Progress?
Gelman’s suggestions to replicate with different starting points and to report bootstrapped intervals seem to be slowly seeping into LLM evaluations. Shortly after our Substack post, Anthropic published a research paper entitled “Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations”. In the accompanying blog post they call for:
Repeated sampling
Reporting robust standard errors
Accounting for sensitivity to how queries are phrased
Ensuring statistical power through sufficient test questions
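The last item, statistical power, is easy to illustrate. The sketch below is not Anthropic’s method, just the textbook unpaired two-proportion calculation, but it shows why a benchmark with a few hundred questions cannot reliably separate models whose true accuracies differ by a couple of percentage points.

```python
from math import ceil
from statistics import NormalDist

def questions_needed(p1: float, p2: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate number of eval questions needed to detect the gap between
    two models with true pass rates p1 and p2 (unpaired two-proportion z-test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2)

# A 2-point gap near 80% accuracy takes thousands of questions to resolve...
print(questions_needed(0.80, 0.82))   # roughly 6,000
# ...while a 10-point gap takes only a few hundred.
print(questions_needed(0.70, 0.80))   # roughly 300
```

A paired design, scoring both models on the same questions, usually needs fewer questions than this unpaired approximation suggests, but the order of magnitude is the sobering part.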
But in general, AI model system cards still lack uncertainty intervals or measures of robustness. This great 2023 paper from scholars at The Hebrew University of Jerusalem and the University of Haifa is a bit old but illustrates very effectively how “brittle” LLM evaluation results are to variations in query phrasing. More recently, an October 2024 paper by researchers at Apple demonstrates the fragility of both LLM benchmarks (not robust at all) and LLM reasoning capabilities. Taken together, this sort of model uncertainty and evaluation uncertainty should be reflected in how we evaluate and report LLM testing. But it mostly isn’t.
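Even without new statistical machinery, reporting could reflect this brittleness. Here is a bare-bones sketch, with invented numbers, of what a phrasing-sensitivity line in a system card might look like: run the same benchmark under several paraphrased prompt templates and report the spread rather than a single figure.

```python
import numpy as np

# Invented accuracies from re-running the same benchmark under five
# paraphrased prompt templates (one full eval run per template).
accuracy_by_template = np.array([0.74, 0.69, 0.77, 0.62, 0.71])

print(f"mean accuracy across templates = {accuracy_by_template.mean():.3f}")
print(f"min-max across templates       = "
      f"[{accuracy_by_template.min():.2f}, {accuracy_by_template.max():.2f}]")
print(f"std across templates           = {accuracy_by_template.std(ddof=1):.3f}")
```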
Seriously?
Where does this leave us? As an economist rather than an LLM engineer, I'm reminded of Paul Krugman’s concept of “very serious people” (VSP), used “to capture the way respectable opinion keeps demanding utterly foolish policies” often “no matter the evidence”. It’s also shorthand for people who say one thing and do another. The very serious people in this context are the AI companies, who keep telling us just how seriously we all need to be taking AI’s future existential risks to humanity while simultaneously conducting somewhat unserious tests,1 reporting somewhat unserious results, and constantly taking extreme risks themselves...seriously?!
AI companies’ enthusiastic marketing aside (“we really believe in product safety”), it shouldn’t be their obligation to come up with thorough testing standards and methods for their own products. Nor do they have a strong incentive to, given that testing standards and methods are closer in character to a public good. That’s what commonly agreed upon, and eventually compulsory, industry standards are for, on top of which third-party regulatory markets can arise.2
Others agree. Technology industry bodies and AI advocacy groups are not happy about the gutting of NIST, the U.S. government’s standards body, noting in a recent letter that “downsizing NIST or eliminating these initiatives will have ramifications for the ability of the American AI industry to continue to lead globally”. But NIST guidelines need to incorporate corresponding industry best practices in order to evolve into something concrete and useful. That’s why disclosures are such a crucial bridge in the AI regulatory landscape, connecting broad guidelines with actual best practices. So instead of focusing on very serious people who discuss very serious existential risks while sometimes doing unserious (and even irresponsible) things, maybe it’s time we developed robust industry best practices.
Red-teaming and third-party testing are pretty serious, even if largely unconstrained and undirected in their problem space. And reporting is often patchy.
Unless, that is, they don’t share their closed models with others for testing and auditing.