Did OpenAI train on copyrighted book content?
In a new research paper, we find that OpenAI likely did.
We've just completed a study at the AI Disclosures Project examining whether OpenAI's language models were trained on copyrighted content that sits behind paywalls. The results raise important questions about how AI companies are developing their models, and why that has to change.
Our Key Findings
Using a legally obtained dataset of 34 copyrighted O'Reilly Media books, we applied the DE-COP membership inference attack method to investigate whether OpenAI's models were trained on book content that wasn't publicly accessible. We also compared the models' knowledge of publicly available, free-to-read samples from each O'Reilly book with their knowledge of paywalled (non-public) chapters from the same book. Here is what we found:
GPT-4o shows strong recognition of paywalled content: OpenAI’s most advanced model demonstrated significant familiarity with non-public, paywalled O'Reilly book content (82% AUROC score), compared to more modest recognition of publicly available content from the same books (64% score).** We would expect the opposite, since public data is more easily accessible and repeated across the internet.
This also highlights the strong value that paywalled, high-quality content adds to a model's training, giving model developers an incentive to actively seek it out.
Earlier models were more selective: GPT-3.5 Turbo, trained two years earlier than GPT-4o, showed greater relative recognition of publicly accessible content than private content, suggesting a shift in OpenAI’s data acquisition approach over time.
Smaller models show less recognition: GPT-4o Mini, despite having the same training cutoff date as GPT-4o, showed minimal recognition of both public and private O'Reilly content, potentially due to its smaller size and more limited capacity to memorize training data. This suggests that testing smaller models with the methods we employ may be less informative.
Method. These findings were based on a membership inference attack called DE-COP, which quizzes each of OpenAI's LLMs on its ability to distinguish a human-written passage (in this case, from an O'Reilly book) from machine-written variations of the same passage, as a proxy for whether the LLM saw the human-written text during its training ("pre-training" stage). We then calculated AUROC scores to ensure that our findings were not due to chance. Our study design also controlled for the possibility that the model was simply good at guessing which text in the quiz was human-written, rather than recognizing text it was familiar with from its training.
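For readers who want a concrete picture of how such a quiz works, here is a minimal, illustrative Python sketch. It assumes a generic OpenAI-style chat-completion client and a hypothetical set of book passages paired with machine-paraphrased alternatives; the prompt wording, helper names, and scoring are simplifications, not the paper's exact implementation.

```python
# Illustrative DE-COP-style quiz: can the model pick the verbatim passage
# out of a lineup of paraphrases? (Hypothetical helper names; simplified prompt.)
import random
from sklearn.metrics import roc_auc_score


def quiz_model(client, model, original, paraphrases):
    """Return 1 if the model identifies the human-written passage, else 0."""
    options = paraphrases + [original]
    random.shuffle(options)
    labels = [chr(ord("A") + i) for i in range(len(options))]
    correct = labels[options.index(original)]
    prompt = (
        "One of the passages below appears verbatim in a published book; "
        "the others are paraphrases. Answer with the letter of the verbatim "
        "passage only.\n\n"
        + "\n\n".join(f"{label}. {text}" for label, text in zip(labels, options))
    )
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    guess = reply.choices[0].message.content.strip()[:1].upper()
    return int(guess == correct)


def auroc(member_scores, nonmember_scores):
    """AUROC over per-passage quiz scores: ~0.5 is chance, 1.0 is perfect separation."""
    y_true = [1] * len(member_scores) + [0] * len(nonmember_scores)
    y_score = list(member_scores) + list(nonmember_scores)
    return roc_auc_score(y_true, y_score)
```

In this framing, an AUROC near 50% would indicate chance-level recognition, while the 82% score reported above for GPT-4o on paywalled content indicates strong separation between content the model appears to recognize and content it does not.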

Why This Matters
Our research provides empirical evidence that OpenAI may be training its models on non-public, copyrighted content without proper authorization. As AI companies increasingly rely on vast datasets to improve their models, questions about fair compensation for content creators become more urgent.
If AI systems extract value from content creators’ work without fair compensation, they risk depleting the very resources upon which their systems depend – potentially creating what we describe as an “extractive dead end” for the internet’s content ecosystem.
Looking Forward
Our findings underscore the need for increased transparency regarding training data sources. While technical detection methods like those we used are valuable, they are not a substitute for detailed disclosures from AI companies about their training data. Such disclosures should be built into workable commercial marketplaces for content, where data can be procured and paid for by AI model developers and services in a highly automated fashion — in a similar way to today's online ad exchanges, where advertising space is bid on and sold using algorithms and established protocols for communication.
Given that paywalled and copyrighted content is often higher quality and more up-to-date, AI model developers and services will continue to need access to this proprietary data for their models to deliver relevant responses to users. This means that establishing commercial marketplaces for content used in AI training and inference will eventually be difficult to avoid, despite current resistance from model developers.
Ensuring IP holders know when their work has been used in model training represents a crucial first step toward establishing robust AI markets for content creator data. Liability regimes could be a useful initial push to get such marketplaces up and running as soon as possible. But the mere fact that so much high-quality data is still beyond the reach of model developers means that pressure already exists for such commercial arrangements to arise, with notable licensing deals having been agreed upon by OpenAI, Anthropic, and others. A centralized marketplace, though, has certain advantages that a string of individual licensing deals between AI model developers and large publishers and platforms does not. (But more on that another time.)
The full research paper, "Beyond Public Access in LLM Pre-Training Data: Non-public Book Content in OpenAI's Models," which provides our detailed methodology and findings, can be found here.
Feedback is invited!
**An AUROC score of 50% is equivalent to random guessing, so scores near that level are only weak evidence and could be due to chance.
When you say non-public works were detected in the likely training set, does this imply that the paywall was broken to reach the material, perhaps by accessing the O'Reilly library through a subscription and extracting the content?
One of the takeaways at the end of the report was about sustainability: "If left unaddressed, the current disregard for IP rights could ultimately harm AI developers themselves, even if its use is ruled legally permissible. Sustainable ecosystems need to be designed so that both creators and developers can benefit from generative AI. Otherwise, model developers are likely to rapidly plateau in their progress, especially as newer content becomes produced less and less by humans."
I wonder if you have produced, or would be willing to produce, an analysis of what effect the use of these 34 titles has had on expected sales over time? Or, given that payment is via licensing deals that are long-term enough that these changes wouldn't have any impact on revenue just yet, have you seen any changes in behind-the-paywall usage? E.g., a drop-off in users accessing content because they can get it (or enough of it) outside the paywall via an AI system to satisfy their need? There is scant data available on users' willingness to substitute paid-for access with AI tools, and such data could be valuable.