
Data Commons for Generative Artificial Intelligence: Our Growing Repository of Use Cases – August Update

Posted on August 5, 2025 by Claire Skatrud, Hannah Chafetz, and Stefaan Verhulst


Data commons (collaboratively governed data ecosystems) are becoming critical infrastructure in the age of AI. When designed responsibly, they can provide access to high-quality, AI-ready datasets for use in the public interest. Yet questions remain: What data commons currently exist? Where are they being developed? Which data commons are needed most?

Over the last six months, The GovLab’s Open Data Policy Lab (ODPL) has sought to answer these questions by curating and documenting examples of data commons for AI from across the globe. Our Data Commons for Generative AI Repository now contains 60 real-world examples from over 20 countries across five continents.

Below, we provide a summary of the latest additions to the repository and the key themes that have emerged.

Latest Additions

Our latest update includes 11 new data commons and updates to two previously documented ones. These include:

  • AI4Culture: An initiative supporting cultural heritage in the European Union through a collection of tools and open multimodal datasets for training AI. The resources include verified translations, speech and document transcriptions, and European artwork classification data

  • Common Pile v0.1: EleutherAI’s public domain dataset, previously known as “The Pile v2” in its development phase, was released in June 2025 as a successor to The Pile dataset

  • Data Foundry Scotland: An open data platform from the National Library of Scotland that makes its digital collections available in machine learning-ready formats

  • Data Vatika - Bhashini: A hub of high-quality language datasets for AI training, including crowdsourced content for all 22 of India’s official languages

  • EUCAIM (Cancer Image Europe): A platform for sharing de-identified cancer-related medical imaging across the European Union

  • European Open Science Cloud: A platform by the European Commission to make research data from across disciplines accessible and reusable throughout the European Union

  • FineWeb 2: A publicly available pretraining dataset of multilingual text derived from Common Crawl

  • Institutional Books 1.0: An AI-ready dataset of nearly one million public domain books published by Harvard Law School Library as part of the Institutional Data Initiative

  • Medical Imaging & Data Resource Center (MIDRC) Data Commons: A repository of queryable AI-ready medical imaging data

  • Norwegian Colossal Corpus: A vast collection of publicly available textual data in Norway’s two official languages

  • OpenPLACSP: A collection of open datasets on contracting bodies and tenders from Spain’s Public Sector Procurement Platform

  • Royal Spanish Academy’s Data Bank: Several Spanish language corpora with text and linguistic notations for contemporary and historical Spanish

  • The Public Interest Corpus: Formerly known as the Public-Interest Book Training Commons, the Public Interest Corpus aims to be a large-scale, structured book dataset for AI training. The repository now includes recent development updates, such as business model deliberation and partnership planning from a workshop earlier this year.

Key Themes

We identified four underlying themes across these additions: 

  • Public Domain Books as Training Data: Several data commons consist of public domain books and other texts. A recent example is Institutional Books 1.0, a project by the Institutional Data Initiative that published nearly one million public domain books scanned from Harvard Library to enable large language model training and testing. Similarly, The Common Pile v0.1 was released by EleutherAI in June 2025 as one of the largest training sets of openly licensed and public domain text to date. EleutherAI developed The Common Pile v0.1 as a successor to its dataset of unlicensed text, The Pile, from five years earlier. 

  • Data for Lower-Resource Languages: Data commons are being developed to address gaps in training data for low-resource languages. For instance, Norwegian Colossal Corpus is compiling textual data in Norway’s two official languages. Similarly, Data Vatika is expanding access to data in all 22 of India’s official languages, enabling AI-powered translation and transcription tools on a national scale.

  • Libraries as Key Actors: Libraries, both national and university-based, are critical players in the creation of data commons. For instance, Data Foundry Scotland hosts machine learning-ready data from the digitized collections of the National Library of Scotland. Other examples include Institutional Books 1.0 by the Harvard Law School Library and the Norwegian Colossal Corpus, which includes textual data from the National Library of Norway.

  • Advancing Medical and Scientific Research: Initiatives like the Medical Imaging & Data Resource Center’s Data Commons and Cancer Image Europe (EUCAIM) pool anonymized medical data that can be used to advance AI-based medical modeling, diagnosis, and detection tools. Other initiatives, such as the European Open Science Cloud (EOSC), provide FAIR data ready for AI development and collaboration through AI4EOSC.

***

Interested in setting up a data commons? Learn more by reading our Blueprint to Unlock New Data Commons for AI or exploring our ongoing New Commons Challenge, which will fund two data commons that enable AI development for local decision-making or humanitarian response.


Please email [email protected] if you know of a data commons that should be featured in our repository or if you are interested in collaborating with us.

