BLOG POST
The Emergent Landscape of Data Commons: A Brief Survey and Comparison of Existing Initiatives
Posted on 18th of November 2024 by Stefaan Verhulst, Hannah Chafetz
With the increased attention on the need for data to advance AI, data commons initiatives around the world are redefining how data can be accessed and re-used for societal benefit. These initiatives focus on generating access to data from various sources for a public purpose and are governed by communities themselves. While diverse in focus–from health and mobility to language and environmental data–data commons are united by a common goal: democratizing access to data to fuel innovation and tackle global challenges.
This includes innovation in the context of artificial intelligence (AI). Data commons are providing the framework to make pools of diverse data available in machine-understandable formats for responsible AI development and deployment. By providing access to high-quality data sources with open licensing, data commons can help increase the quantity of training data in a less exploitative fashion, minimize AI providers’ reliance on data extracted across the internet without an open license, and increase the quality of AI output (while reducing misinformation).
Over the last few months, the Open Data Policy Lab (a collaboration between The GovLab and Microsoft) has conducted various research initiatives to explore these topics further and understand:
(1) how the concept of a data commons is changing in the context of artificial intelligence, and
(2) current efforts to advance the next generation of data commons.
In what follows we provide a summary of our findings thus far. We hope it inspires more data commons use cases for responsible AI innovation in the public’s interest.
A Patchwork
Each data commons represents a unique approach to solving critical issues through open data collaboration. For instance:
Empowering Communities with Language Data: Mozilla's Common Voice is revolutionizing AI-driven voice applications, focusing on underrepresented languages to ensure inclusivity. This initiative not only promotes linguistic diversity but also accelerates the adoption of AI technologies globally. Other efforts to advance multilingual AI technologies include data commons of written language phrases (the Lacuna Fund’s language data) and words (MLCommons Multilingual Spoken Words Dataset).
Advancing Health Research: Platforms like the All of Us Research Program, INSIGHT, and Nightingale Open Science are leveraging biomedical data to propel precision medicine and improve healthcare. Their secure, tiered data access models ensure ethical data use while fostering innovation in the health domain. In addition, existing data commons such as the National Cancer Institute’s (NCI) Cancer Research Data Commons are examining how to better prepare their data for AI biomedical research. Crowdsourcing is helping to ensure that the data is made available while prioritizing community interests.
Transforming Mobility and Environment: Initiatives like Germany’s Mobilithek, MLCommons’ Cognata Dataset, Georgetown University's Environmental Impact Data Collaborative (EIDC), and OpenStreetMap harness mobility, environmental, and mapping data to drive sustainable solutions. These projects exemplify how data commons can enhance urban planning, geospatial analysis, automotive technologies, and environmental justice.
Augmenting Humanitarian Work: Humanitarian OpenStreetMap is crowdsourcing mapping data through its open mapping technology to aid in emergency responses. This demonstrates how data commons can be used in the context of humanitarian work and to improve risk management strategies.
Advancing Sustainable Food Production: The Lacuna Fund’s agriculture data initiative is focused on advancing AI solutions to support sustainable food production. Also, the Ministry of Science and ICT of South Korea’s Data Dam provides access to a range of agricultural data sources. These efforts aim to support local food production and share learnings on sustainable practices.
Preserving Knowledge: Wikidata provides a multitude of data across domains and geographies. Previous efforts to provide access to data from books have been widely criticized by the public, media, open data advocates, and others for re-using copyrighted data unlawfully. Recent efforts such as the Common Corpus and EleutherAI’s The Pile v2 are seeking to make openly licensed and public domain books and newspapers available to preserve knowledge and increase the variety of knowledge in AI training sets.
Fostering Innovation in AI: From the MLCommons datasets enabling machine learning benchmarks, to PD12M’s image repository for AI, to Papers with Code’s research repository, to Google’s DataGemma open models (connecting with its Data Commons initiative), to the expansive BigScience language models, these projects showcase the potential of collaborative AI development, particularly in addressing gaps in AI readiness and performance.
Image of Mozilla Common Voice’s user interface to contribute voice data
The Backbone: Trust and Governance
Key to the success of these initiatives is their governance frameworks. Projects like the INSIGHT repository and Gaia-X Data Space emphasize data sovereignty and ethical access. With governance bodies composed of stakeholders ranging from patient representatives to industry leaders, these platforms ensure data use aligns with community values and legal standards.
Another important aspect is how these projects provide access to data. Initiatives such as Nightingale Open Science and EIDC only allow researchers or approved community members to access and analyze data within their own secure platforms. The AIDA Data Hub has similar access mechanisms but charges associated membership fees. Commons-based licensing is another consideration. UC Irvine’s Machine Learning Repository, for example, requires all published datasets to be licensed under a Creative Commons Attribution 4.0 International license (CC BY 4.0).
Other initiatives focus on providing value to the data subject. For instance, from 2020 to 2024 the All of Us Research Program provided data subjects with information about their own health and DNA, based on the data they provided, in addition to publishing research based on their anonymized data. While the program is currently looking into new ways of providing research outcomes to data subjects, this approach has been used to help increase motivation to contribute to the data commons and maintain the community of participants.
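To make the licensing requirement concrete, here is a minimal sketch of how a repository could screen submitted datasets against a single required license. It is an illustration only, not how UC Irvine’s Machine Learning Repository or any other initiative actually implements its policy. The `DatasetRecord` type, its field names, and the helper functions are all hypothetical; the only detail taken from the text is the CC BY 4.0 requirement, written here with its SPDX identifier `CC-BY-4.0`.

```python
from dataclasses import dataclass

# SPDX identifier for Creative Commons Attribution 4.0 International.
REQUIRED_LICENSE = "CC-BY-4.0"


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical metadata record; the field names are illustrative."""
    name: str
    license_id: str


def normalize(license_id: str) -> str:
    """Normalize spelling variants such as 'cc by 4.0' to 'CC-BY-4.0'."""
    return license_id.strip().upper().replace(" ", "-")


def meets_license_policy(record: DatasetRecord,
                         required: str = REQUIRED_LICENSE) -> bool:
    """Return True if the record declares exactly the required license."""
    return normalize(record.license_id) == required


def split_by_policy(records):
    """Partition records into (accepted, rejected) under the policy."""
    accepted = [r for r in records if meets_license_policy(r)]
    rejected = [r for r in records if not meets_license_policy(r)]
    return accepted, rejected


if __name__ == "__main__":
    submissions = [
        DatasetRecord("street-signs", "CC BY 4.0"),
        DatasetRecord("clinic-notes", "CC BY-NC 4.0"),
    ]
    accepted, rejected = split_by_policy(submissions)
    print("accepted:", [r.name for r in accepted])
    print("rejected:", [r.name for r in rejected])
```

A real commons would likely pair a check like this with human review, since declared license metadata can be wrong or incomplete.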
The image above summarizes INSIGHT’s requests to use its data for research purposes
Outcomes Across Regions
The measurable and anticipated impacts of these commons are promising. The Lacuna Fund has catalyzed data development in low- and middle-income countries, while India’s forthcoming IndiaAI Datasets Platform aims to empower startups with accessible non-personal data. Meanwhile, initiatives like the Catena-X Data Space are enhancing resilience and sustainability in the automotive sector.
Each platform demonstrates the potential of data commons to drive innovation, improve quality of life, and foster collaboration. From advancing cancer research through the NCI Cancer Research Data Commons to the multilingual and cultural inclusivity of CLARIN, these initiatives are creating ripple effects far beyond their immediate communities.
Emerging Horizons
The emergence of platforms like Health Data Nexus and PD12M underscores a growing commitment to open, FAIR (Findable, Accessible, Interoperable, Reusable) data principles. As more sectors embrace this ethos, data commons will likely expand into new domains, bridging gaps in knowledge and innovation across the globe.
Together, these initiatives illustrate the potential of collective data stewardship. Through shared effort, they lay the foundation for a more equitable and innovative future—one where data truly serves as a public good for all.
Domain | Data Commons Examples |
Culture, Language, and Knowledge Preservation | |
Health and Humanitarian Work | |
Mobility and Environmental | |
AI and Machine Learning | |
National Statistics Data | |
The table above summarizes the examples of data commons in the context of AI identified thus far. These examples are organized by the domain to which they most closely relate.
***
Over the coming months, we plan to continue this work and solicit a series of use cases to unlock the next generation of data commons in the context of AI. Through this effort, we aim to develop a blueprint for new data commons for AI training and fine-tuning, supported by various resources.
Are you working on data commons for AI or have you come across any interesting examples from the field? If you have any questions or feedback or are interested in collaborating, please contact us at [email protected].
Interested in learning more about the thinking behind this work? Read our blog: “Data Commons”: Under Threat by or The Solution for a Generative AI Era? Rethinking Data Access and Re-use