Is the Internet's Business Model Finally Bust? (Weekly Roundup)
Weekly Roundup 7 May, 2025: Matthew Prince makes a splash, OpenAI juices its benchmarks, 4o’s sycophancy shows the need for post-deployment monitoring of behavioral risks, and more.
Good morning! This week’s roundup covers: Matthew Prince on the internet’s broken business model, Google’s latest model card (4.5 months later), OpenAI juicing its benchmarks, Chatbot Arena scandals, and ChatGPT going full sycophant without OpenAI monitoring for it.
Is the internet’s business model officially bust? Cloudflare CEO Matthew Prince slipped a few statistics into a recent discussion hosted by the Council on Foreign Relations that show just how dramatically AI is breaking the fundamental social contract underpinning the internet. Cloudflare helps websites run faster and stay safe by acting as a middleman between visitors and website providers, Prince notes. Because an immense amount of network traffic is routed through its servers, Cloudflare has unique insight into online activity. Prince begins by noting that (starting at 36:30):
“The business model of the web for the last fifteen years has been search. One way or another, search drives everything that happens online. And if you look back 10 years ago, if you did a search on Google you got back a list of ten blue links. And we have data on how Google processed those ten blue links. And the answer was that for every two pages of a website that Google scraped they would send you one visitor, right? So scrape two pages, get one visitor. And that was the trade….
So it was two to one [pages scraped to visitors] 10 years ago for Google. It’s six to one today. What do you think it is for OpenAI? 250 to one. What do you think it is for Anthropic? 6,000 to one … The business model of the web can’t survive unless there’s some change, because more and more the answers to the questions that you ask won’t lead you to the original source, it will be some derivative of that source. And if content creators can’t derive value from what they’re doing, then they’re not going to create original content.”
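To make the arithmetic concrete: Prince’s ratios are just pages crawled divided by visitors referred back. Here is a minimal Python sketch, with made-up log counts standing in for real crawl data (the crawler labels and numbers are illustrative assumptions, not Cloudflare’s figures):

```python
# Illustrative only: made-up log counts, not Cloudflare data.
# Prince's "X to one" figures are pages crawled per visitor referred back.
crawl_stats = {
    # crawler label: (pages_crawled, visitors_referred) -- hypothetical numbers
    "googlebot_2015": (2_000_000, 1_000_000),    # ~2:1
    "googlebot_2025": (6_000_000, 1_000_000),    # ~6:1
    "openai_crawler": (2_500_000, 10_000),       # ~250:1
    "anthropic_crawler": (6_000_000, 1_000),     # ~6,000:1
}

for crawler, (pages_crawled, visitors_referred) in crawl_stats.items():
    ratio = pages_crawled / visitors_referred
    print(f"{crawler}: {ratio:,.0f} pages crawled per referred visitor")
```

The point of the exercise: a site only gets paid (in traffic, and hence ads or subscriptions) on the right-hand side of that division, while the crawling on the left-hand side keeps growing.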
It’s in everyone’s interest to start contending with the new economic reality of an internet under attack from AI services, and these statistics highlight just how quickly that reality is changing. Without new internet protocols that facilitate attribution and payment to content creators, and without AI companies investing in these new business models, AI-generated content online might explode while human-created content dies off.
Google gives us a model card (finally): disappearing disclosures, but not disappearing risks. Google finally updated its model card, releasing one for its Gemini 2.5 Pro Preview model and ending a 4.5-month disclosure drought since Gemini 1.5’s technical report (version 5). The 17-page document restores some transparency with detailed pre-deployment testing and benchmarking, but quietly drops the “persuasion & deception” risk category previously highlighted in DeepMind’s own safety work, replacing it with “deceptive alignment” — a risk conveniently assessed pre-deployment. Meanwhile, AI products like OpenAI’s 4o continue failing due to inadequate attention to post-deployment issues, exactly what our latest research paper addresses. Google’s new model card mentions risks like misinformation but omits details on existing guardrails. Safety reporting is now compressed into the Frontier Safety Framework’s four domains, with Google apparently reserving dangerous-capability data for a “separate audit” — a convenient crutch. What’s clear is that Google’s public commitments to transparency and disclosure have not translated into practice.
o3 isn’t that good after all — or benchmarks aren’t that good after all? Speaking of OpenAI scandals, this one’s a little late, but we still think it’s worth highlighting. If you think way, way back to December of last year (about 5 million years ago in AI time), you might recall OpenAI’s newest model, o3, scoring a whopping 75%-88% on the notorious ARC-AGI test, a challenging benchmark designed to measure whether a model can learn new skills outside of its training data, which its creator says is the measure of true intelligence. OpenAI publicly released the o3 family of models a few months ago, and since then, independent third parties have not been able to replicate the reported benchmark scores. The best that the team who administers the ARC-AGI test could get using OpenAI’s model was 56%, around 20 percentage points lower than the reported score. EpochAI also found that o3 scored only 11% on FrontierMath, a challenging math benchmark, while OpenAI had reported 32%.
A generous interpretation of this discrepancy is that the science of evaluation is still young and unreliable. It would be good for everyone — AI hypesters, doomers, and pessimists alike — to put less faith in any single benchmark score. But it is exceedingly fishy that OpenAI appears to have juiced its reported performance by as much as 20 percentage points. All of this only further underscores the need for standardized, independent evaluations that include rules for compute usage during testing and, importantly, for the scaffolding a model can use — the external systems or code structures that enhance an LLM’s capabilities without modifying its internal architecture or weights.
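As one illustration of what standardized reporting could look like, here is a hedged Python sketch of an evaluation record that pins down the variables that made the o3 numbers hard to compare. The schema and field names are our own invention, not an existing standard; only the two ARC-AGI scores come from the discussion above, and the other values are placeholders.

```python
from dataclasses import dataclass, asdict

@dataclass
class EvalRecord:
    """Hypothetical schema for reporting a benchmark run reproducibly.
    Field names are illustrative, not part of any existing standard."""
    model: str
    benchmark: str
    score: float            # fraction of problems solved
    compute_budget: str     # e.g. token or sample limits per problem
    scaffolding: str        # external tools/harness wrapped around the model
    evaluator: str          # developer vs. independent third party

# Scores from the discussion above; remaining fields show what a standardized
# disclosure would need to pin down (values below are placeholders).
reported = EvalRecord("o3 (Dec 2024 preview)", "ARC-AGI", 0.75,
                      "undisclosed", "undisclosed", "OpenAI")
reproduced = EvalRecord("o3 (public release)", "ARC-AGI", 0.56,
                        "default API limits", "independent harness", "ARC-AGI team")

print(f"Gap: {100 * (reported.score - reproduced.score):.0f} percentage points")
for record in (reported, reproduced):
    print(asdict(record))
```

When the compute budget and scaffolding fields are “undisclosed,” a 19-point gap is impossible to interpret; that is exactly the problem.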
Chatbot Arena might not be so objective. If we want to keep talking about unreliable benchmarks and corporate meddling, look no further than the research paper released last week that calls into question the rankings on Chatbot Arena. Despite starting as essentially a side project, Chatbot Arena (also referred to as LMSYS, or LM Arena) is a respected resource that ranks LLMs based on blind user voting: anyone can log in, enter a prompt, and choose which model’s response they prefer. There has already been cause for concern. Meta’s latest model shot to the top of the charts, then fell after it emerged that Meta had submitted a specialized variant to Chatbot Arena for testing and had to swap it for the vanilla public release.
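Before getting to the paper’s findings, it helps to see how those blind pairwise votes become a leaderboard. Below is a minimal Elo-style update in Python; Chatbot Arena actually fits a Bradley-Terry model, and the constants here are conventional illustrative choices, not Arena’s exact setup. It also hints at why privately testing many variants and keeping only the best performer can skew a ranking: the ratings only reflect the votes a provider chooses to leave on the board.

```python
def elo_update(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    """One Elo-style update from a single blind pairwise vote.
    k and the 400-point scale are conventional defaults, not Arena's settings."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    rating_a += k * (score_a - expected_a)
    rating_b -= k * (score_a - expected_a)
    return rating_a, rating_b

# Toy example: two models start equal; a handful of votes moves them apart.
a, b = 1000.0, 1000.0
for vote_for_a in [True, True, False, True]:
    a, b = elo_update(a, b, vote_for_a)
print(round(a), round(b))
```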
This new research further calls into question the objectivity of the rankings, finding that “undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release and retract scores if desired” and that rankings reflect “overfitting to Arena-specific dynamics rather than general model quality.” While we think continuous, third-party testing by real users is important, it needs rigorous oversight. If these last two stories mean anything at all, we have a long way to go in creating apples-to-apples evaluations for LLMs.

4o gets over-optimized for flattery. In case you missed the thousands of unhinged ChatGPT screenshots circulating the internet last weekend, OpenAI released a new, bizarrely sycophantic model update to 4o — and then promptly rolled it back after users started complaining.


After rolling back the update, OpenAI released a full postmortem on the event — admirably transparent, though predictably light on the commercial incentives for releasing a more agreeable model. Going too far in the sycophantic direction just turns people off, but there is a fine line between agreeably compelling and unnervingly cringey. OpenAI also revealed some interesting details about its post-deployment monitoring — notably, that it didn’t have any for sycophancy:
“We also didn’t have specific deployment evaluations tracking sycophancy. While we have research workstreams around issues such as mirroring and emotional reliance, those efforts haven’t yet become part of the deployment process”.
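For a sense of what such a deployment evaluation could even look like, here is a hedged Python sketch of a post-deployment sycophancy monitor. The phrase list, window size, and alert threshold are all hypothetical assumptions for illustration; a production system would more likely use a trained classifier or an LLM judge than keyword matching, and nothing here reflects OpenAI’s actual tooling.

```python
import re
from collections import deque

# Hypothetical monitor: flag when the share of replies opening with strong
# flattery/agreement drifts above a threshold. Patterns, window size, and
# threshold are illustrative assumptions, not OpenAI's method.
FLATTERY_PATTERNS = [
    r"^(what a |such a )?(great|brilliant|amazing|fantastic) (question|idea|point)",
    r"^you('re| are) (absolutely|so) right",
    r"^i completely agree",
]
_COMPILED = [re.compile(p, re.IGNORECASE) for p in FLATTERY_PATTERNS]

class SycophancyMonitor:
    def __init__(self, window: int = 10_000, alert_rate: float = 0.15):
        self.recent = deque(maxlen=window)   # rolling window of 0/1 flags
        self.alert_rate = alert_rate         # alert if >15% of replies look sycophantic

    def observe(self, reply: str) -> None:
        flagged = any(p.search(reply.strip()) for p in _COMPILED)
        self.recent.append(1 if flagged else 0)

    def should_alert(self) -> bool:
        if not self.recent:
            return False
        return sum(self.recent) / len(self.recent) > self.alert_rate

monitor = SycophancyMonitor()
monitor.observe("What a brilliant question! You're absolutely right to ask.")
monitor.observe("Here is a step-by-step answer to your question.")
print(monitor.should_alert())  # True: 1 of 2 recent replies matched, above the 15% threshold
```

The specific heuristic matters less than the existence of some always-on check tied to a rollback trigger — which is precisely what the postmortem says was missing.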
We recommend this great write-up of the events from researcher Nathan Lambert, but we disagree with him that commercial incentives likely had nothing to do with it. As we noted in a recent post: “there’s a reason why this model update was pushed out so quickly, despite pre-deployment testing showing conflicting results. And that’s because OpenAI is competing for market share. This ‘move fast and break things’ market dynamic means that OpenAI increasingly lets competitive pressures dictate the pace of AI testing, often rushing past safety checks and exposing users — and society — to ‘commercialization risks’.”
For Matthew Yglesias’s take on these risks, with a focus on social media, see his recent blog post at Slow Boring.
Thanks for reading! If you liked this post and aren’t yet a subscriber, subscribe now.