AI overviews shouldn't be "one size fits all"
Establishing market-friendly norms for AI content attribution
The other day, I was looking for parking information at Dulles International Airport, and was delighted with the conciseness and accuracy of Google’s AI overview. It was much more convenient than being told that the information could be found at the flydulles.com website, visiting it, perhaps landing on the wrong page, and finding the information I needed after a few clicks. It’s also a win from the provider side. Dulles isn’t trying to monetize its website (except to the extent that it helps people choose to fly from there). The website is purely an information utility, and if AI makes it easier for people to find the right information, everyone is happy.
An AI overview of an answer derived from Wikipedia, whether through training or inference, is more problematic. The AI answer may lack some of the nuance and neutrality Wikipedia strives for. And while Wikipedia does make its information free for all, it depends on visitors not only for donations but also for the engagement that might lead people to become Wikipedia contributors or editors. The same may be true of other information utilities like GitHub and YouTube. Individual creators are incentivized to provide useful content by the traffic that YouTube directs to them and monetizes on their behalf.
And of course, an AI answer provided by illicitly crawling content that is behind a subscription paywall is the source of a great deal of contention, even lawsuits. So content runs the gamut from “no problem crawling” to “do not crawl.”
There are a lot of efforts to stop unwanted crawling, including Really Simple Licensing (RSL) and Cloudflare’s Pay Per Crawl. But we need a more systemic solution. Both of these approaches put the burden of expressing intent onto the creator of the content. It’s as if every school had to put up its own traffic signs saying “School Zone: Speed Limit 15 mph.” Even making “Do Not Crawl” the default puts a burden on content providers, since they must now affirmatively figure out what content to exclude from the default in order to be visible to AI.
Why aren’t we putting more of the burden on AI companies instead of putting all of it on the content providers? What if we asked companies deploying crawlers to observe common sense distinctions such as those that I suggested above? Most drivers know not to tear through city streets at highway speeds even without speed signs. Alert drivers take care around children even without warning signs. There are some norms that are self-enforcing. Drive at high speed down the wrong side of the road, and you will soon discover why it’s best to observe the national norm. But most norms aren’t that way. They work when there’s consensus and social pressure, which we don’t yet have in AI. And only when that doesn’t work do we rely on the safety net of laws and their enforcement.
As Larry Lessig pointed out at the beginning of the Internet era, starting with his book Code and Other Laws of Cyberspace, governance is the result of four forces: law, norms, markets, and architecture (which can refer to either physical or technical constraints).
So much of the thinking about the problems of AI seems to start with laws and regulations. What if instead, we started with an inquiry about what norms should be established? Rather than thinking about what should be illegal, what if we were thinking about what should be normal? What architecture would support those norms? And how might they enable a market, with laws and regulations mostly needed to restrain bad actors, rather than preemptively limiting those who are trying to do the right thing?
I think often of a quote from the Chinese philosopher Lao Tzu, who said something like:
Losing the way of life, men rely on goodness.
Losing goodness, they rely on laws.
I like to think that “the way of life” is not just a metaphor for a state of spiritual alignment, but rather, an alignment with what works. I first thought about this back in the late 90s as part of my open source advocacy. The Free Software Foundation started with a moral argument, which it tried to encode into a strong license (a kind of law) that mandated the availability of source code. Meanwhile, other projects like BSD and the X Window System relied on goodness, using a much weaker license that asked only for recognition of those who created the original code. But “the way of life” for open source was in its architecture.
Both Unix (the progenitor of Linux) and the World Wide Web have what I call an architecture of participation. They were made up of small pieces loosely joined by a communications protocol that allowed anyone to bring something to the table as long as they followed a few simple rules. Systems that were open source by license but had a monolithic architecture tended to fail despite their license and the availability of source code. Those with the right cooperative architecture (like Unix) flourished even under AT&T’s proprietary license, as long as it was loosely enforced. The right architecture enables a market with low barriers to entry, which also means low barriers to innovation, with flourishing widely distributed.
Architectures based on communication protocols tend to go hand in hand with self-enforcing norms, like driving on the same side of the street. The system literally doesn’t work unless you follow the rules. A protocol embodies both a set of self-enforcing norms and “code” as a kind of law.
What about markets? In a lot of ways, what we mean by “free markets” is not that they are free of government intervention. It is that they are free of the economic rents that accrue to some parties because of outsized market power, position, or entitlements bestowed on them by unfair laws and regulations. This is not only a more efficient market, but one that lowers the barriers for new entrants, typically making more room not only for widespread participation and shared prosperity but also for innovation.
Markets don’t exist in a vacuum. They are mediated by institutions. And when institutions change, markets change.
Consider the history of the early web. Free and open source web browsers, web servers, and a standardized protocol made it possible for anyone to build a website. There was a period of rapid experimentation, which led to the development of a number of successful business models: free content subsidized by advertising, subscription services, and ecommerce.
Nonetheless, the success of the open architecture of the web eventually led to a system of attention gatekeepers, notably Google, Amazon, and Meta. Each of them rose to prominence because it solved for what Herbert Simon called the scarcity of attention. Information had become so abundant that it defied manual curation. Instead, powerful, proprietary algorithmic systems were needed to match users with the answers, news, entertainment, products, applications, and services they seek. In short, the great internet gatekeepers each developed a proprietary algorithmic invisible hand to manage an information market. These companies became the institutions through which the market operates.
They initially succeeded because they followed “the way of life.” Consider Google. Its success began with insights about what made an authoritative site, understanding that every link to a site was a kind of vote, and that links from sites that were themselves authoritative should count more than others. Over time, the company found more and more factors that helped it to refine results so that those that appeared highest in the search results were in fact what their users thought were the best. Not only that, the people at Google thought hard about how to make advertising that worked as a complement to organic search, popularizing “pay per click” rather than “pay per view” advertising and refining its ad auction technology such that advertisers only paid for results, and users were more likely to see ads that they were actually interested in. This was a virtuous circle that made everyone – users, information providers, and Google itself – better off. In short, enabling an architecture of participation and a robust market is in everyone’s interest.
Amazon too enabled both sides of the market, creating value not only for its customers but for its suppliers. Jeff Bezos explicitly described the company strategy as the development of a flywheel: helping customers find the best products at the lowest price draws more customers, more customers draw more suppliers and more products, and that in turn draws in more customers.
Both Google and Amazon made the markets they participated in more efficient. Over time, though, they “enshittified” their services for their own benefit. That is, rather than continuing to treat the efficient allocation of the user’s scarce attention as their primary goal, they began to manipulate that attention. Rather than giving users what they wanted, they looked to increase engagement, or showed results that were more profitable for them even though they might be worse for the user. For example, Google took control over more and more of the ad exchange technology and began to direct the most profitable advertising to its own sites and services, which increasingly competed with the websites it had originally helped users to find. Amazon supplanted the primacy of its organic search results with advertising, vastly increasing its own profits, while the added cost of advertising gave suppliers the choice of reducing their own profits or increasing their prices. Our research in the Algorithmic Rents project at UCL found that Amazon’s top advertising recommendations are not only ranked far lower by its organic search algorithm, which looks for the best match to the user query, but are also significantly more expensive.
As I described in Rising Tide Rents and Robber Baron Rents, this process of replacing what is best for the user with what is best for the company is driven by the need to keep profits rising when the market for a company’s once-novel services stops growing and starts to flatten out. In economist Joseph Schumpeter’s theory, innovators can earn outsized profits as long as their innovations keep them ahead of the competition, but eventually these “Schumpeterian rents” get competed away through the diffusion of knowledge. In practice, though, if innovators get big enough, they can use their power and position to profit from more traditional extractive rents. Unfortunately, while this may deliver short term results, it ends up weakening not only the company but the market it controls, opening the door to new competitors at the same time as it breaks the virtuous circle in which not just attention but revenue and profits flow through the market as a whole.
In many ways, because of its insatiable demand for capital and the lack of a viable business model to fuel its scaling, the AI industry has gone in hot pursuit of extractive economic rents right from the outset. Seeking unfettered access to content, unrestrained by laws or norms, model developers have ridden roughshod over the rights of content creators, training not only on freely available content but also on content protected by good-faith signals like subscription paywalls, robots.txt, and do-not-crawl directives. During inference, they exploit loopholes such as the fact that a paywall that comes up for users on a human timeframe briefly leaves content exposed long enough for bots to retrieve it. As a result, the market they have enabled is one of third-party black- and gray-market crawlers that give them plausible deniability about the sources of their training and inference data, rather than the far more sustainable market that would come from discovering “the way of life” that balances the incentives of human creators and AI derivatives.
Here are some broad-brush norms that AI companies could follow, if they understand the need to support and create a participatory content economy.
For any query, use the intelligence of your AI to judge whether the information being sought is likely to come from a single canonical source, or from multiple competing sources. For example, for my query about parking at Dulles Airport, it’s pretty likely that flydulles.com is a canonical source. Note, however, that there may be alternative providers, such as off-airport parking operators, and if so, include them in the list of sources to consult.
Check for a subscription paywall, licensing signals such as RSL, a do-not-crawl directive or other indication in robots.txt, and if any of these exists, respect it. (A minimal sketch of what such a check might look like appears after this list.)
Ask yourself if you are substituting for a unique source of information. If so, responses should be context-dependent. For example, for long form articles, provide basic info but make clear there’s more depth at the source. For quick facts (hours of operation, basic specs), provide the answer directly with attribution. This is an area that really does call for nuance, though. For example, there is a lot of low quality how-to information online that buries useful answers in unnecessary material that is just there to provide additional surface area for advertising, or provides poor answers based on pay-for-placement. An AI summary can short-circuit that cruft. Much as Google’s early search breakthroughs required winnowing the wheat from the chaff, AI overviews can bring a search engine such as Google back to being as useful as it was in 2010, pre-enshittification.
If the site has high quality data that you want to train on or use for inference, pay the provider, not a black market scraper. If you can’t come to mutually agreed-on terms, don’t take it. This should be a fair market exchange, not a colonialist resource grab. AI companies pay for power and the latest chips without looking for black market alternatives. Why is it so hard to understand the need to pay fairly for content, which is an equally critical input?
Check whether the site is an aggregator of some kind. This can be inferred from the number of pages. A typical informational site such as a corporate or government website whose purpose is to provide public information about its products or services will have a much smaller footprint than an aggregator such as Wikipedia, GitHub, TripAdvisor, Goodreads, YouTube, or a social network. There are probably lots of other signals an AI could be trained to use. Recognize that competing directly with an aggregator using content scraped from that platform is unfair competition. Either come to a license agreement with the platform, or compete fairly without using their content to do so. If it is a community-driven platform such as Wikipedia or Stack Overflow, recognize that your AI answers might reduce contribution incentives, so in addition, support the contribution ecosystem. Share revenue, fund contribution programs, and provide prominent links that might convert some users into contributors. Make it easy to “see the discussion” or “view edit history” for queries where that context matters.
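To make the second of these norms concrete, here is a minimal sketch, in Python, of what honoring crawl signals might look like. The crawler name (example-ai-bot), the paywall markers, and the fail-closed default are all assumptions for illustration; a real implementation would also consult licensing metadata such as RSL, which isn’t modeled here.

```python
# A minimal sketch of the "check before you crawl" norm, using only the
# Python standard library. The crawler name and paywall markers are
# illustrative assumptions, not anyone's actual policy.
from urllib import robotparser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

USER_AGENT = "example-ai-bot"  # hypothetical crawler name


def may_crawl(url: str) -> bool:
    """Return False unless robots.txt permits this user agent for this URL."""
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    rp = robotparser.RobotFileParser()
    rp.set_url(urljoin(root, "/robots.txt"))
    try:
        rp.read()
    except OSError:
        # If robots.txt can't be fetched, err on the side of not crawling.
        return False
    return rp.can_fetch(USER_AGENT, url)


def looks_paywalled(html: str) -> bool:
    """Crude heuristic: look for common paywall markers in the page source.

    Real detection would honor the publisher's own signals (structured
    paywall metadata, licensing terms), not just string matching.
    """
    markers = ("isAccessibleForFree", "paywall", "subscribe to continue")
    return any(m.lower() in html.lower() for m in markers)


def crawl_decision(url: str) -> str:
    if not may_crawl(url):
        return "skip: robots.txt does not permit this agent"
    html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
    if looks_paywalled(html):
        return "skip: content appears to be behind a paywall"
    return "ok: fetch, recording attribution for downstream use"
```

The point is the posture, not the particulars: the crawler treats an unreadable robots.txt or an apparent paywall as a reason to stop, not as a loophole to exploit.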
As a concrete example, let’s imagine how an AI might treat content from Wikipedia:
Direct factual query (“When did the Battle of Hastings occur?”): 1066. No link needed, because this is common knowledge available from many sites.
More complex query for which Wikipedia is the primary source (“What led up to the Battle of Hastings?”): “According to Wikipedia, the Battle of Hastings was caused by a succession crisis after King Edward the Confessor died in January 1066 without a clear heir. [Link]”
Complex/contested topic: “Wikipedia’s article on [X] covers [key points]. Given the complexity and ongoing debate, you may want to read the full article and its sources: [link]”
For rapidly evolving topics: Note Wikipedia’s last update and link for current information.
Similar principles would apply to other aggregators: GitHub code snippets should link back to their repositories, and YouTube queries should direct users to the videos, not just summarize them.
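Sketched as code, the policy above might look something like the following. The category names, the answer templates, and the compose_answer() helper are hypothetical; in practice the classification would itself be an AI judgment, and the templates would be worked out with the sources’ participation.

```python
# An illustrative sketch of a context-dependent attribution policy,
# along the lines of the Wikipedia examples above. Categories and
# templates are assumptions, not a proposed standard.
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourcedAnswer:
    text: str
    link: Optional[str]  # None only when the fact is common knowledge


def compose_answer(category: str, summary: str, source_url: str,
                   last_updated: Optional[str] = None) -> SourcedAnswer:
    if category == "common_fact":
        # Widely available fact: answer directly; no link required.
        return SourcedAnswer(text=summary, link=None)
    if category == "source_specific":
        # The source did the work: attribute it and link to it.
        return SourcedAnswer(text=f"According to the source, {summary}",
                             link=source_url)
    if category == "contested":
        # Complex or debated topic: summarize briefly, push readers to the source.
        return SourcedAnswer(
            text=(f"{summary} Given the complexity and ongoing debate, "
                  "the full article and its sources are worth reading."),
            link=source_url)
    if category == "rapidly_evolving":
        note = f" (source last updated {last_updated})" if last_updated else ""
        return SourcedAnswer(
            text=f"{summary}{note}; check the source for current information.",
            link=source_url)
    raise ValueError(f"unknown category: {category}")
```

What matters is that attribution behavior becomes an explicit, per-query output of the system rather than one size fits all.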
These examples are not market-tested, but they do suggest directions that could be explored if AI companies took the same pains to build a sustainable economy that they do to reduce bias and hallucination in their models. What if we had a sustainable business model benchmark that AI companies competed on just as they do on other measures of quality?
Finding a business model that compensates the creators of content is not just a moral imperative, it’s a business imperative. Economies flourish better through exchange than extraction. AI has not yet found true product-market fit. That doesn’t just require users to love your product (and yes, people do love AI chat). It requires the development of business models that create a rising tide for everyone.
Many advocate for regulation; we advocate for self-regulation. This starts with an understanding by the leading AI platforms that their job is not just to delight their users but to enable a market. They have to remember that they are not just building products, but institutions that will enable new markets and that they themselves are in the best position to establish the norms that will create flourishing AI markets. So far, they have treated the suppliers of the raw materials of their intelligence as a resource to be exploited rather than cultivated. The search for sustainable win-win business models should be as urgent to them as the search for the next breakthrough in AI performance.



