Can we have AI Model Risk Evaluation without Uncertainty?
AI model evaluations, such as those conducted by OpenAI in its GPT system cards, aim to quantify model risks but often fail to account for uncertainty.
AI model developers release system cards to evaluate their AI models for risks. For example, here is OpenAI’s system card for its latest “strawberry” model, also known as GPT o1-preview and o1-mini. If someone had no prior knowledge of AI model risk evaluation, they might reasonably expect the system card to cover one or more of the following types of risks:
1. Model risk as uncertainty. Uncertainty is “hard to measure” risk, as Nate Silver notes.
2. Model risk as technological capabilities. This also relates to risks stemming from AI as a “dual-use” (military) technology, along with “bad actor” risks.
3. Model risk as systemic connections and dependencies, including risks flowing to or from the ecosystems built on top of the model, the networks it connects to, and the infrastructure it relies on.
4. Model risk as concentrated power. This is a form of ownership risk, stemming from concentrated corporate control of socially vital technologies. (We wouldn’t expect the corporate owners themselves to evaluate this!)
Model system cards are largely evaluations of item 2, “model risk as technological capabilities”. These evaluations tend to focus on highly uncertain future risks related to the AI model becoming autonomous (“it’s alive”), rather than on practical assessments of the model’s performance in typical, often commercial, environments.
“Model risk as uncertainty” (item 1 above) is largely absent from AI model evaluation results, including OpenAI’s latest o1 model system card. How can we tell? Well, risk assessments in the o1 system card boil down to single figures most of the time, without any indication of how certain these estimates are, e.g. model accuracy is “0.38”, model hallucination rate is “0.61”. As Andrew Gelman puts it when discussing statistical research: “a lot of effort gets put into avoiding or denying uncertainty”.
This omission by OpenAI is especially notable because its system card tries to evaluate highly uncertain future risks, such as potential model autonomy, but based on the model’s current capabilities. Moreover, it tries to measure specific model behaviors, such as deception, that might only emerge in certain contexts or at certain frequencies.
LLM behavior is also inherently uncertain. A model’s responses are highly sensitive to factors like the query, the sampling hyperparameters, and the context, all of which introduce variability in its outputs. Using another LLM as the evaluator introduces further uncertainty, just as human evaluation introduces its own uncertainties (e.g., which humans, which topics, which contexts, and what sample size?).
As an interesting aside, LLMs are, in principle, computationally deterministic in their outputs (even if practical factors, such as floating-point arithmetic and batching on GPUs, complicate this): given the same input and conditions, the model should generate the same probabilities for the next token. The variability we see in outputs stems largely from the sampling methods applied on top of these probabilities, such as top-k sampling or temperature sampling. These techniques introduce randomness, producing different outputs for the same input.
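To make the sampling point concrete, here is a minimal Python sketch (with made-up logits standing in for a real model’s output) of how temperature and top-k sampling turn the same deterministic next-token probabilities into varying outputs:

```python
# Minimal sketch: the next-token logits below are invented for illustration,
# but the temperature / top-k logic is the standard recipe.
import numpy as np

rng = np.random.default_rng()

def sample_next_token(logits, temperature=1.0, top_k=None):
    """Sample a token index from raw logits with temperature and optional top-k."""
    logits = np.asarray(logits, dtype=float) / temperature  # temperature rescales the logits
    if top_k is not None:
        cutoff = np.sort(logits)[-top_k]                 # keep only the top_k highest logits
        logits = np.where(logits >= cutoff, logits, -np.inf)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                 # softmax -> deterministic probabilities
    return rng.choice(len(probs), p=probs)               # randomness enters only at this step

fixed_logits = [2.0, 1.5, 0.3, -1.0]                     # the same "model output" every time
print([sample_next_token(fixed_logits, temperature=0.8, top_k=3) for _ in range(10)])
```

Running this repeatedly with the same `fixed_logits` produces different token choices, even though the underlying probabilities never change.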
But even without this sampling layer, uncertainty should persist in LLM evaluation results because it’s impractical to test all possible model input-output combinations. The space of potential queries is vast, and testing can only cover a small sample of interactions. The limited sampling of the model’s potential predictions, whether by humans or automated methods, inevitably introduces uncertainty into the model evaluation process.
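As a rough, back-of-the-envelope illustration of how limited test coverage translates into uncertainty: treating each evaluated prompt as a pass/fail trial, the standard error of an observed accuracy shrinks only with the square root of the number of prompts (the accuracy figure and sample sizes below are illustrative, not taken from any system card):

```python
# Back-of-the-envelope: uncertainty in an observed accuracy from a finite test set.
import math

observed_accuracy = 0.38                       # illustrative headline figure
for n_prompts in (100, 1_000, 10_000):
    se = math.sqrt(observed_accuracy * (1 - observed_accuracy) / n_prompts)
    lo, hi = observed_accuracy - 1.96 * se, observed_accuracy + 1.96 * se
    print(f"n={n_prompts:>6}: 0.38 ± {1.96 * se:.3f}  (~95% interval [{lo:.3f}, {hi:.3f}])")
```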
Calculating and showing model uncertainty usually means providing an interval – a likely range of estimates – such as a confidence or credible interval (sometimes generated using bootstrapping), rather than just a single number. Another approach is to assess the model’s performance out of sample, on entirely new data not seen during training. This relates to model calibration, which tries to ensure that the model’s predicted probabilities align with actual outcomes.
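For example, a bootstrap confidence interval over per-prompt evaluation outcomes is one common way to attach a range to a headline accuracy number. A minimal sketch, using simulated pass/fail results rather than any real evaluation data:

```python
# Bootstrap sketch: attach a confidence interval to a headline accuracy number.
# The per-prompt outcomes are simulated here; in practice they would be the
# pass/fail results of the actual evaluation set.
import numpy as np

rng = np.random.default_rng(0)
outcomes = rng.binomial(1, 0.38, size=500)        # 500 simulated pass/fail results

boot_means = [rng.choice(outcomes, size=outcomes.size, replace=True).mean()
              for _ in range(5_000)]              # resample with replacement, 5,000 times
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"accuracy {outcomes.mean():.3f}, 95% bootstrap CI [{low:.3f}, {high:.3f}]")
```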
So, why the omission by OpenAI of uncertainty from most of its model evaluations? Maybe computer scientists aren’t always familiar with common statistical practice, something Rumman Chowdhury confirmed to Ilan based on her past experiences (she also sent this really useful applied discussion). Ilan also checked the leading AI textbook by Stuart Russell and Peter Norvig (4th edition). There is an entire chapter on “Quantifying Uncertainty”, but it is devoted largely to the uncertainty an AI system faces in its external environment, rather than to uncertainty in evaluations of the model itself.
Given how large LLMs are, is there a way to introduce measures of uncertainty into their evaluations? Somewhat mysteriously, Andrew Gelman generally recommends: “studying the process, not just the particular dataset,” including through regularization techniques (like partial pooling). By itself, though, this isn’t enough. So I e-mailed Andrew to ask why he thinks computer scientists in general, and LLM risk evaluators in particular, do not report uncertainty levels in their model evaluations. I look forward to reading the response on his blog (which he said is forthcoming; we will link to it when it’s out).
Lastly, one interesting possible approach to quantifying LLM risk was given to us by Michał Oleszak, and comes from finance’s Value at Risk (VaR) model, which estimates the loss in the left tail of a portfolio’s return distribution over a given time horizon, at a specified confidence level. Despite a VaR model’s much-debated shortcomings, perhaps similar approaches could be designed to assess the potential for LLMs to produce harmful content or behaviors within a certain time frame, based on historical prompt data? Seems worth exploring further.
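As a purely hypothetical sketch of what such a VaR-style summary might look like for an LLM, one could treat daily counts of flagged harmful outputs as the “losses” and report a high quantile of their distribution (all numbers below are simulated, not from any real deployment):

```python
# Hypothetical VaR-style summary for an LLM: instead of daily portfolio losses,
# use daily counts of flagged harmful outputs. All numbers here are simulated.
import numpy as np

rng = np.random.default_rng(1)
daily_harmful_outputs = rng.poisson(lam=3.0, size=365)   # one simulated year of daily counts

confidence = 0.99
var_99 = np.quantile(daily_harmful_outputs, confidence)  # 99th percentile of the "loss" tail
print(f"On 99% of days, flagged harmful outputs stayed at or below {var_99:.0f}; "
      f"the remaining 1% of days exceeded that level.")
```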
Either way, to take AI model risk evaluations seriously, model developers need to introduce reliable methods to quantify model uncertainty. Ignoring how uncertain we are about a model’s range of potential outcomes doesn’t make those outcomes any less likely to occur — it just makes us worry less about them.