The Data Commons Landscape: An Analysis of our Data Commons for Generative Artificial Intelligence Repository
Posted on 7th of August 2025 by Claire Skatrud, Hannah Chafetz, Stefaan Verhulst
Over the last six months, The GovLab’s Open Data Policy Lab has documented use cases of data commons (collectively governed data ecosystems) that provide critical infrastructure for responsible AI development around the world. By providing access to high-quality, AI-ready datasets, these initiatives are unlocking new possibilities for solving pressing public challenges. Through our Data Commons for Generative AI Repository, we have identified 60 examples of data commons, ranging from cultural and language preservation initiatives to biomedical imaging archives for cancer research.
We conducted a quantitative analysis of the full repository (60 use cases) to understand trends in existing efforts and where additional support is needed. Below we summarize these trends. One caveat: our search for data commons was conducted in English, so it likely misses examples from countries where most initiatives are documented in other languages.
Geographic Scope
The largest share of the data commons identified (almost 40% of the repository) aims to support AI development across the globe rather than in a specific region. Nonetheless, we also found a substantial number of data commons (21 use cases) focused on countries across Europe, the Middle East, and Africa.
The countries represented in the repository
The regional breakdown of data commons by where they are implemented
The regional breakdown of data commons by where they are developed
Domains
The largest share of use cases (34% of the repository) focuses on culture, language, and knowledge preservation. Many of the earlier initiatives, dating from the 2000s or early 2010s and since adapted for AI development, are also dedicated to culture, language, and knowledge preservation, or to health and humanitarian work. However, we identified 29 use cases that focus on other domains, including national statistics, mobility, and environmental data. While all data commons in our repository can support AI development, 13 use cases were created specifically to advance AI itself: they were designed by technical communities to improve training data, benchmarking, and model performance.
Data Commons by Domain. Some initiatives belong to several domains and are reflected in multiple categories above.
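Because an initiative can sit in more than one domain, the per-domain counts can add up to more than the number of use cases. The short Python sketch below illustrates that kind of multi-label tally. The records, domain labels, and the `domain_shares` helper are hypothetical and for illustration only; they are not entries from the repository or the method used in our analysis.

```python
from collections import Counter

# Hypothetical records for illustration only; NOT entries from the repository.
records = [
    {"name": "Example A", "domains": ["culture_language", "ai_development"]},
    {"name": "Example B", "domains": ["health"]},
    {"name": "Example C", "domains": ["culture_language"]},
    {"name": "Example D", "domains": ["environment", "mobility"]},
]


def domain_shares(records):
    """Return {domain: (count, percent of all records)}.

    A record is counted once for every distinct domain it lists,
    so counts can sum to more than the number of records.
    """
    counts = Counter(d for r in records for d in set(r["domains"]))
    total = len(records)
    return {d: (n, round(100 * n / total, 1)) for d, n in counts.items()}


shares = domain_shares(records)
for domain, (n, pct) in sorted(shares.items(), key=lambda kv: -kv[1][0]):
    print(f"{domain}: {n} of {len(records)} ({pct}%)")
```

In this toy example, four records produce six domain assignments, which is why percentages computed this way should be read as the share of use cases touching each domain rather than as parts of a whole.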
Data Types
Many of the data commons in our repository include multiple types and sources of data. Language and textual data are included in most examples, which reflects current training needs of large language models.
The count of data commons in the repository that contain each data type. Many data commons include multiple data types.
***
Read about our latest additions to the Data Commons for Generative AI Repository here.
Interested in setting up a data commons? Learn more by reading our Blueprint to Unlock New Data Commons for AI or exploring our ongoing New Commons Challenge, which will fund two data commons that enable AI development for local decision-making or humanitarian response.
Please email [email protected] if you know of a data commons that should be featured in our repository or if you are interested in collaborating with us.