Our Co-Director, Tim O'Reilly, gave a talk last month for the second convening of the Oxford Consultation on Copyright and AI. The full pre-recorded talk is below. In just 15 minutes, Tim provides a panoramic view of the future of copyright and AI from the perspective of a publisher, drawing on his experience as founder and CEO of O'Reilly Media.
Monolithic Models
Tim challenges the notion that the future of the internet is necessarily a parasitic one, where model developers can’t possibly pay for their training data or disclose its sources, and are unable to undertake real-time attribution and compensation of sources during inference (answer) time. Per Tim:
I’d like to talk to you about what I call the participatory, content-aware web of AI. I urge you to reject the notion of large monolithic models as the only way forward. Protocols such as Anthropic’s MCP, Google’s A2A, and NLWeb (which was just released yesterday by Microsoft) point forward to a world of cooperating AIs. This could be the solution for AI and copyright.
If a user were to ask ChatGPT to create an image in the style of Hayao Miyazaki, as someone did recently, the answer is not to find a loophole in copyright law – which is what OpenAI did – and say, “We can’t do that, but we can do it in the house style of Studio Ghibli,” which amounts to the same thing but with a legal fig leaf, and then to shamelessly profit from the resulting viral storm, acquiring millions of new users as if from a free endorsement deal.
What if instead an AI were to say, “Let me check with Studio Ghibli”? Protocols are the natural languages of networks. We need a protocol for copyright – an automated way for an AI to ask permission and for a content provider to offer it, or to refuse it, and to ask for payment.
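To make that idea concrete, here is a minimal sketch of what such a permission exchange could look like. No such standard exists today; the message fields, names, and the toy policy below are purely illustrative assumptions, not a proposal from Tim's talk.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical message shapes for a "copyright protocol".
# None of these field names come from an existing standard;
# they are illustrative assumptions only.

@dataclass
class PermissionRequest:
    requester: str        # the AI service making the request
    rights_holder: str    # e.g. "Studio Ghibli"
    use: str              # "style-imitation", "quotation", "training", ...
    scope: str            # e.g. "one generated image for one end user"

@dataclass
class PermissionResponse:
    granted: bool
    fee_usd: Optional[float] = None   # payment asked by the rights holder
    terms: Optional[str] = None       # attribution or other conditions

def rights_holder_policy(req: PermissionRequest) -> PermissionResponse:
    """Stand-in for the content provider's side of the exchange:
    grant, refuse, or grant-for-a-fee depending on the requested use."""
    if req.use == "style-imitation":
        return PermissionResponse(granted=False,
                                  terms="Style imitation not licensed.")
    if req.use == "quotation":
        return PermissionResponse(granted=True, fee_usd=0.0,
                                  terms="Must attribute and link to source.")
    return PermissionResponse(granted=True, fee_usd=0.05,
                              terms="Per-response licensing fee.")

if __name__ == "__main__":
    req = PermissionRequest(requester="some-ai-service",
                            rights_holder="Studio Ghibli",
                            use="style-imitation",
                            scope="one generated image")
    print(rights_holder_policy(req))
```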
Protocols
As we’ve mentioned previously on this Substack, Microsoft’s newly released NLWeb tool is one of several new technical solutions that facilitate a more decentralized AI economy. NLWeb is an open-source protocol designed to transform websites into AI-powered, conversational interfaces. By leveraging existing structured-data formats like Schema.org and RSS, NLWeb enables users to interact with web content using natural language, while also making sites accessible to AI agents through the Model Context Protocol (MCP). Essentially, it gives publishers an easy way to make their websites accessible to AI agents and to leverage their own site’s data with a bespoke model for their content.
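As a purely illustrative example of the kind of structured data involved, here is a short Python sketch that pulls Schema.org JSON-LD out of a page using only the standard library. This is not NLWeb’s actual code or API; it simply shows the sort of machine-readable markup that protocols like NLWeb can build on.

```python
import json
from html.parser import HTMLParser

# Example page with a Schema.org JSON-LD block (illustrative content only).
PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org",
 "@type": "Article",
 "headline": "The Participatory, Content-Aware Web of AI",
 "author": {"@type": "Person", "name": "Tim O'Reilly"},
 "datePublished": "2025-05-20"}
</script>
</head><body>...</body></html>
"""

class JsonLdExtractor(HTMLParser):
    """Collects the contents of <script type="application/ld+json"> tags."""
    def __init__(self):
        super().__init__()
        self.in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and ("type", "application/ld+json") in attrs:
            self.in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_jsonld = False

    def handle_data(self, data):
        if self.in_jsonld and data.strip():
            self.blocks.append(json.loads(data))

parser = JsonLdExtractor()
parser.feed(PAGE)
for block in parser.blocks:
    print(block["@type"], "-", block["headline"])
```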
Inference Time: From clicks to participation
NLWeb’s emphasis rests on the important distinction between training data and inference data – the data a model draws on when a user makes a query and the model goes about answering it. When publishers first started protesting AI’s use of their data, it was all about the training data, and model developers continue to round up large swaths of the internet to train their models on. But as AI search and inference-time techniques such as Retrieval Augmented Generation (RAG) have gained popularity, many models are drawing on third-party data at inference time too (see the sketch after Tim’s remarks below). As Tim notes:
Large language models return a synthetic result that is drawn from everything they’ve trained on. Quotation, though, can only be made by reference to a specific work and author. I like to say: you can turn a sirloin steak into a pretty good hamburger, but once it’s ground, you can’t turn it back into steak.
…Now, it seems to me that AI companies need continuous access to content creators’ data for inference. Model training isn’t sufficient. That means that model providers don’t have all the power. If LLMs are able to quote, they are also able to attribute — and they could be required to provide compensation. But more than that, model providers should want to provide compensation, as I do, so that creators are incentivized to produce more of the content they depend on.
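To make the training-versus-inference distinction concrete, here is a minimal retrieval sketch in Python. The corpus, URLs, and scoring are toy stand-ins, not how any production system works; the point is that the sources consulted at answer time are known to the system and can therefore be attributed and, in principle, compensated.

```python
# Minimal sketch of inference-time retrieval (RAG-style). The "answer" step is
# a stand-in for an LLM call; what matters is that consulted sources are tracked.

CORPUS = {
    "https://example.com/ai-traffic":
        "AI crawlers visit publisher pages but send back far fewer clicks.",
    "https://example.com/nlweb-launch":
        "NLWeb exposes site content to AI agents via MCP and Schema.org data.",
    "https://example.com/recipe":
        "A classic hamburger starts with freshly ground sirloin.",
}

def retrieve(query: str, k: int = 2):
    """Score documents by naive word overlap and return the top-k (url, text) pairs."""
    q_words = set(query.lower().split())
    scored = sorted(CORPUS.items(),
                    key=lambda item: len(q_words & set(item[1].lower().split())),
                    reverse=True)
    return scored[:k]

def answer_with_sources(query: str):
    sources = retrieve(query)
    # A real system would pass these passages to an LLM; here we simply
    # concatenate them so the source-tracking step stays visible.
    answer = " ".join(text for _, text in sources)
    return answer, [url for url, _ in sources]

if __name__ == "__main__":
    ans, consulted = answer_with_sources("How does NLWeb make sites visible to AI agents?")
    print(ans)
    print("Sources consulted at inference time:", consulted)
```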
As Tim notes further, models will now often cite specific sources in their responses. Indeed, when you use an application like Perplexity or ChatGPT in search mode, citations in the response are common. So what’s the problem? In forthcoming research, we examine exactly when models choose to cite at inference time and how often they do so. We find notable “attribution gaps” – that is, the gap between the number of websites a model visits when answering the user’s query and the number of websites it chooses to explicitly cite in its response. How to close this gap is the issue, and more citations alone are unlikely to be the answer: many of the websites checked by the AI during inference might not be useful to the user, making other forms of attribution and compensation important.
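As a back-of-the-envelope illustration (the URLs and counts below are made up, not figures from our study), the gap can be expressed as the share of consulted sources that go uncited:

```python
# Illustrative "attribution gap": sources a model consulted while answering
# versus sources it actually cited. All values here are invented examples.

visited = {
    "https://pub-a.example/story", "https://pub-b.example/analysis",
    "https://pub-c.example/data",  "https://pub-d.example/recipe",
    "https://pub-e.example/review",
}
cited = {"https://pub-a.example/story", "https://pub-c.example/data"}

uncited = visited - cited
gap_ratio = len(uncited) / len(visited)

print(f"Visited {len(visited)} sites, cited {len(cited)}.")
print(f"Attribution gap: {len(uncited)} uncited sources ({gap_ratio:.0%}).")
```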
Moreover, providing clickable citations in model responses is no guarantee that users will actually visit the source website; after all, a good model response may summarize the source entirely. Historically, websites have generated revenue from advertisements, but as we’ve been documenting, this business model is now expiring. Dotdash Meredith, the publisher behind sites like People and Food & Wine, reported during its latest earnings call that its Google-driven traffic has dropped to about half of what it was four years ago.1 Cloudflare CEO Matthew Prince detailed this clearly in a recent talk, showing how page visits by AI crawlers no longer translate into clicks. And this was before Google announced that it is releasing an AI-only search mode, which eliminates the list of blue links entirely.
“If you book them, they will come”
Destroying the market for up-to-date information on the internet — a real tragedy of the commons — serves no one's interests. Access to current data has become a key competitive advantage — and in turn a point of reliance — for integrated AI developers like OpenAI as they work to reduce hallucinations and compete with Google Search for market share. This ongoing reliance on real-time information by integrated model providers is precisely why the inference market may require consideration as a market separate from training data, with its own protocols and compensation model.
So, imagining a participatory, content-aware web for AI is in fact “easy if you try” – but Tim’s point is that it’s even easier if we construct the protocols and scaffolding necessary to support a more open, participatory architecture.
Thanks for reading! If you liked this post, subscribe now if you aren’t yet a subscriber.
Ignoring other factors that may be causing this shift.