There is a lot of hope that AI will advance the progress of science, but unfortunately, the collision between AI and scientific publishing has not gone well.
"AI is very good at extracting structured information from unstructured text". While this is true, the error rate is still high. AI is good at giving answers that look complete but it has been trained to value completeness over accuracy - the very same attribute that leads to hallucinations. Any AI contributions to science need to be properly vetted by humans and attributed.
When AI extractively builds on scientific infrastructure, the error rate only matters to the people and applications that rely on its output. When AI contributes to scientific infrastructure, the error rate could cause a feedback loop that poisons and progressively degrades the accuracy of science.
I worry that many people who are excited about AI use in science are excited about the potential volume of contribution, but they may not be thinking about the potential negative impacts if that volume causes a reduction in accuracy. AI contributions to science should be welcomed on the condition that they improve accuracy rather than simply increase the volume of contributions.
One thing missing from this discussion is the topic of asset identification.
AI companies usually do not have access to reliable title metadata, creator information, or legacy identifiers such as ISBN, ISRC, DOI, or other platform-specific records once content starts circulating downstream across the web. In practice, many assets arrive without trustworthy metadata at all.
That is where the International Standard Content Code (ISCC – ISO 24138:2024) becomes relevant. ISCC creates content-derived fingerprints directly from the media itself. Technically, ISCC codes can also be understood as lightweight vectors derived from the content, making them particularly well suited for AI-related use cases such as similarity search, dataset management, provenance tracking, deduplication, and large-scale discovery.
Without persistent, content-derived identification, it becomes extremely difficult to reliably reconnect content used downstream to rights, provenance, licensing, and AI transparency metadata or flags once files have been reformatted or stripped of metadata.
I want to add one other data point. The investment into AI companies has provided them with significant resources, and they are using that to crawl and scrape scholarly content at scale. At BMJ in the first half of April we had 5.7B bot hits, that's many more times the level of annual traffic that we might have expected pre-AI. Most of that is deflected at the cloudflare level, but managing this is consuming a lot of time and effort.
This is a good start to having AI provide better answers and how it can provide value to the services trying to keep the garbage out of the space.
When I ask Google's Gemini for an answer to a science question, I always include in teh prompt a request that every statement has a supporting reference, and how to locate the place that supports the claim, usually a sentence. It works best when providing context documents, which I have already selected for relevance and quality.
If AI could also find teh best documents using teh tools mentioned, this would be very encouraging. I see no reason why AI cannot help with tools to detect garbage papers, manipulated data in tables, and images, and report that back to humans for validation. AI might help by validating the logic of arguments made in conclusions based on the data and other supporting evidence. I can certainly see that good AI tools to support reviewers could reduce the burden on reviewers and improve the quality of peer review.
"Value flows in one direction". So true and that's the sentence that connects this piece to a broader argument.
You're describing extraction from scientific commons: AI trains on decades of peer-reviewed work, generates products worth billions, and contributes almost nothing back to the infrastructure that made the knowledge. The same logic applies to the human creative commons more broadly. AI was trained on everything humanity ever wrote, drew, composed, and recorded. The extraction-without-contribution pattern isn't limited to scientific publishing — it's the structure of the entire current moment.
The AI Pledge for Humanity tries to make the mechanism design argument you're gesturing at here: that there needs to be a public commitment to direct a share of AI's gains back toward the people whose work made it possible, in the form of universal basic income as an AI dividend. The question you close with, "ensuring everyone, not just AI companies, benefits from the result," is exactly the purpose of the pledge.
I agree that there needs to be a giveback, but I don't agree that it should be as an AI dividend, partly because there are no profits to share (and may never be) but more importantly because targeted investments in affordances for the human/AI economy are a better mechanism for sharing the gains. In addition, there's a whole lot of what AI was trained on that has been *given* value by AI, or could be given value were the right mechanisms in place. There is good historical precedent. Consider the video creator economy enabled by YouTube. When the music companies were asking YouTube to take down any content accompanied by copyrighted music, YouTube came up with a better answer: "How about we help you monetize it instead?" They made a new market where one didn't previously exist. There are so many opportunities for the same to happen with AI if the AI platforms rethink their approach.
I think that this needs to be framed inside agreed narratives that help clarify goals. Classical economics has much of what is consensus:
The division of labor is a product of science automating previously manual tasks
Labor as a commodity required agricultural revolution in techniques etc. We know all this.
The portion of human time dedicated to labor shrinks as the techniques automating tasks allows, and labor has always fought to benefit from that.
Less labor and more time is a societal good.
Society has evolved on top of these economic drivers and standards of living measured by PPP have grown. Services have evolved with good outcomes in college attendance, health, and beyond. But most striking, leisure.
AI is a continuation and acceleration of all of this. Labor as a commodity required for producing anything will shrink even more
Humans get this increased leisure time as a reward for all previous and current progress.
But… we need a plan for it. UBI is actually too small and possibly the wrong framing - and here opinion kicks in. The word ‘basic’ is the issue, associating it with help or welfare.
We need to approve of less labor, more time and access to the results of abundance as key planks of an accelerated capability created by us
Thank you for the thoughtful reply, Tim, but I don't think what you're suggesting is even close to a sufficient substitute, and I think we also need to acknowledge that for the past 50 years, productivity growth decoupled from employment income and started concentrating at the top. Jobs as a productivity distribution mechanism already broke down and we could and should have started a small UBI way back when to distribute that productivity growth directly instead of increasing inequality.
Yes, what YouTube did to help share revenue with creators was better than not doing it, but that can't be replicated for AI models. Take me for example, I know my entire blog went into LLM model training. How do you propose I be compensated? How can we ever in any way realistically measure my own contribution to the outputs that LLMs provide? We can't.
And what of all the unpaid care that enables all the paid work? That unpaid care subsidized everything that went into the models. How do we compensate that?
I don't understand why you don't look at this as a great opportunity to do what we should have long ago already done, simply because UBI works better than the existing safety net to enable far better outcomes across the board.
The 'one-way value flow' framing maps directly onto how energy infrastructure gets extracted in resource-poor regions. Same asymmetry: cost socialized, upside privatized, and the conversation reframed as 'inevitable'.
"AI is very good at extracting structured information from unstructured text". While this is true, the error rate is still high. AI is good at giving answers that look complete but it has been trained to value completeness over accuracy - the very same attribute that leads to hallucinations. Any AI contributions to science need to be properly vetted by humans and attributed.
When AI extractively builds on scientific infrastructure, the error rate only matters to the people and applications that rely on its output. When AI contributes to scientific infrastructure, the error rate could cause a feedback loop that poisons and progressively degrades the accuracy of science.
I worry that many people who are excited about AI use in science are excited about the potential volume of contribution, but they may not be thinking about the potential negative impacts if that volume causes a reduction in accuracy. AI contributions to science should be welcomed on the condition that they improve accuracy rather than simply increase the volume of contributions.
One thing missing from this discussion is the topic of asset identification.
AI companies usually do not have access to reliable title metadata, creator information, or legacy identifiers such as ISBN, ISRC, DOI, or other platform-specific records once content starts circulating downstream across the web. In practice, many assets arrive without trustworthy metadata at all.
That is where the International Standard Content Code (ISCC – ISO 24138:2024) becomes relevant. ISCC creates content-derived fingerprints directly from the media itself. Technically, ISCC codes can also be understood as lightweight vectors derived from the content, making them particularly well suited for AI-related use cases such as similarity search, dataset management, provenance tracking, deduplication, and large-scale discovery.
Without persistent, content-derived identification, it becomes extremely difficult to reliably reconnect content used downstream to rights, provenance, licensing, and AI transparency metadata or flags once files have been reformatted or stripped of metadata.
I want to add one other data point. The investment into AI companies has provided them with significant resources, and they are using that to crawl and scrape scholarly content at scale. At BMJ in the first half of April we had 5.7B bot hits, that's many more times the level of annual traffic that we might have expected pre-AI. Most of that is deflected at the cloudflare level, but managing this is consuming a lot of time and effort.
This is a good start to having AI provide better answers and how it can provide value to the services trying to keep the garbage out of the space.
When I ask Google's Gemini for an answer to a science question, I always include in teh prompt a request that every statement has a supporting reference, and how to locate the place that supports the claim, usually a sentence. It works best when providing context documents, which I have already selected for relevance and quality.
If AI could also find teh best documents using teh tools mentioned, this would be very encouraging. I see no reason why AI cannot help with tools to detect garbage papers, manipulated data in tables, and images, and report that back to humans for validation. AI might help by validating the logic of arguments made in conclusions based on the data and other supporting evidence. I can certainly see that good AI tools to support reviewers could reduce the burden on reviewers and improve the quality of peer review.
"Value flows in one direction". So true and that's the sentence that connects this piece to a broader argument.
You're describing extraction from scientific commons: AI trains on decades of peer-reviewed work, generates products worth billions, and contributes almost nothing back to the infrastructure that made the knowledge. The same logic applies to the human creative commons more broadly. AI was trained on everything humanity ever wrote, drew, composed, and recorded. The extraction-without-contribution pattern isn't limited to scientific publishing — it's the structure of the entire current moment.
The AI Pledge for Humanity tries to make the mechanism design argument you're gesturing at here: that there needs to be a public commitment to direct a share of AI's gains back toward the people whose work made it possible, in the form of universal basic income as an AI dividend. The question you close with, "ensuring everyone, not just AI companies, benefits from the result," is exactly the purpose of the pledge.
https://actionnetwork.org/petitions/the-ai-pledge-for-humanity
I agree that there needs to be a giveback, but I don't agree that it should be as an AI dividend, partly because there are no profits to share (and may never be) but more importantly because targeted investments in affordances for the human/AI economy are a better mechanism for sharing the gains. In addition, there's a whole lot of what AI was trained on that has been *given* value by AI, or could be given value were the right mechanisms in place. There is good historical precedent. Consider the video creator economy enabled by YouTube. When the music companies were asking YouTube to take down any content accompanied by copyrighted music, YouTube came up with a better answer: "How about we help you monetize it instead?" They made a new market where one didn't previously exist. There are so many opportunities for the same to happen with AI if the AI platforms rethink their approach.
I think that this needs to be framed inside agreed narratives that help clarify goals. Classical economics has much of what is consensus:
The division of labor is a product of science automating previously manual tasks
Labor as a commodity required agricultural revolution in techniques etc. We know all this.
The portion of human time dedicated to labor shrinks as the techniques automating tasks allows, and labor has always fought to benefit from that.
Less labor and more time is a societal good.
Society has evolved on top of these economic drivers and standards of living measured by PPP have grown. Services have evolved with good outcomes in college attendance, health, and beyond. But most striking, leisure.
AI is a continuation and acceleration of all of this. Labor as a commodity required for producing anything will shrink even more
Humans get this increased leisure time as a reward for all previous and current progress.
But… we need a plan for it. UBI is actually too small and possibly the wrong framing - and here opinion kicks in. The word ‘basic’ is the issue, associating it with help or welfare.
We need to approve of less labor, more time and access to the results of abundance as key planks of an accelerated capability created by us
Discuss :-)
Thank you for the thoughtful reply, Tim, but I don't think what you're suggesting is even close to a sufficient substitute, and I think we also need to acknowledge that for the past 50 years, productivity growth decoupled from employment income and started concentrating at the top. Jobs as a productivity distribution mechanism already broke down and we could and should have started a small UBI way back when to distribute that productivity growth directly instead of increasing inequality.
Yes, what YouTube did to help share revenue with creators was better than not doing it, but that can't be replicated for AI models. Take me for example, I know my entire blog went into LLM model training. How do you propose I be compensated? How can we ever in any way realistically measure my own contribution to the outputs that LLMs provide? We can't.
And what of all the unpaid care that enables all the paid work? That unpaid care subsidized everything that went into the models. How do we compensate that?
I don't understand why you don't look at this as a great opportunity to do what we should have long ago already done, simply because UBI works better than the existing safety net to enable far better outcomes across the board.
The 'one-way value flow' framing maps directly onto how energy infrastructure gets extracted in resource-poor regions. Same asymmetry: cost socialized, upside privatized, and the conversation reframed as 'inevitable'.