Discussion about this post

User's avatar
Jeff Lang's avatar

"AI is very good at extracting structured information from unstructured text". While this is true, the error rate is still high. AI is good at giving answers that look complete but it has been trained to value completeness over accuracy - the very same attribute that leads to hallucinations. Any AI contributions to science need to be properly vetted by humans and attributed.

When AI extractively builds on scientific infrastructure, the error rate only matters to the people and applications that rely on its output. When AI contributes to scientific infrastructure, the error rate could cause a feedback loop that poisons and progressively degrades the accuracy of science.

I worry that many people who are excited about AI use in science are excited about the potential volume of contribution, but they may not be thinking about the potential negative impacts if that volume causes a reduction in accuracy. AI contributions to science should be welcomed on the condition that they improve accuracy rather than simply increase the volume of contributions.

Sebastian Posth's avatar

One thing missing from this discussion is the topic of asset identification.

AI companies usually do not have access to reliable title metadata, creator information, or legacy identifiers such as ISBN, ISRC, DOI, or other platform-specific records once content starts circulating downstream across the web. In practice, many assets arrive without trustworthy metadata at all.

That is where the International Standard Content Code (ISCC – ISO 24138:2024) becomes relevant. ISCC creates content-derived fingerprints directly from the media itself. Technically, ISCC codes can also be understood as lightweight vectors derived from the content, making them particularly well suited for AI-related use cases such as similarity search, dataset management, provenance tracking, deduplication, and large-scale discovery.

Without persistent, content-derived identification, it becomes extremely difficult to reliably reconnect content used downstream to rights, provenance, licensing, and AI transparency metadata or flags once files have been reformatted or stripped of metadata.

7 more comments...

No posts

Ready for more?