Blog Post
Why Responsible Data Access Will Determine the Future of AI: The Increased Importance of Data Commons
Posted on 8th of February 2025 by Andrew Zahuranec, Hannah Chafetz, Stefaan Verhulst
At this moment, the AI Action Summit is taking place in Paris, bringing together heads of state and government, international organizations, civil society, and CEOs to look at how artificial intelligence can be most effectively harnessed by society. This meeting will provide an opportunity to discuss the future of AI and what is needed to steer it so that innovation and competition will benefit society.
Yet achieving these goals requires something more foundational: responsible data access. Without data to train, fine-tune, and augment them, there can be no AI systems.
However, access to data is declining or becoming increasingly contentious, and we find ourselves at risk of a “data winter”. As such, there is an urgent need to ensure continued and responsible access to unique, diverse, and high-quality datasets. Without access to AI-ready data, we risk ending the AI summer or developing AI technologies that produce biased, low-quality outputs and fail to address today’s societal needs. All this could stifle innovation and hinder economic success.
Figuring out how to responsibly pool data for AI is just as important as investing in computational resources and looking at the uses of AI itself.
The Value of Data Commons for AI
Over the last year, the Open Data Policy Lab (a collaboration between The GovLab and Microsoft) has been exploring these issues as part of its work on the Fourth Wave of Open Data, an approach to data openness that explores the intersections between open data from official sources and generative AI.
The way to unlock data responsibly for this fourth wave, we believe, lies with data commons—collaboratively governed data ecosystems designed to pool and provide responsible access to diverse, high-quality datasets across sectors.
A data commons approach focuses on providing access to data sourced in an ethical manner using participatory governance practices that can establish a social license for data re-use.
Data commons can benefit generative AI by:
Bringing together disparate and diverse datasets ready for AI training, improving the quality and reliability of the outputs of AI technologies;
Providing the infrastructure needed to standardize data in AI-ready formats to ensure datasets have the necessary metadata and documentation for constructive use;
Making data for and about underserved populations visible within AI systems, thereby reducing bias and more evenly distributing the benefits;
Increasing public participation in the development of AI technologies to ensure technology remains aligned with today’s most pressing needs.
Several data commons for AI are already making an impact. For instance, data commons for voice data are being used to fuel voice applications in underrepresented languages, crowdsourced mapping data is being harnessed for humanitarian response efforts, and agricultural datasets are being used to develop AI applications in low- and middle-income countries.
These examples highlight the many types of data that can be valuable for data commons in the AI era, for instance:
Text: Written language and notes from print media, research, and other literature;
Audio: Environmental and physiological sounds, music, and speeches;
Images: Knowledge graphs, GIS/satellite data, photographs, artwork, and data visualizations;
Video: Streams, lectures, and environmental feeds;
Statistics: Data from national statistical agencies, sensors, and supply chains;
Generative AI data: Data about or generated by generative AI models.
[Figure: Draft taxonomy. The full map is linked in the original post.]
Accelerating Data Commons for AI
Over the past few months, we’ve worked with multiple stakeholders to understand what opportunities data commons offer and what challenges need to be overcome before those opportunities can be seized.
Below we provide a few key takeaways from our work thus far. We hope these takeaways inspire further discussion about the potential of data commons at the AI Action Summit.
Increasing responsible access to AI-ready data: Much of the open data that exists is tabular or structured, rather than the unstructured data that is often more appropriate for AI, and it has not been published in a machine-understandable format. We need efforts to make open data AI-ready and to develop a taxonomy of unstructured data that can be made available through data commons.
Understanding how to set up data commons: While there has been growing recognition of the value of data commons in the AI era, there has been limited research and guidance on what is involved in practically setting up a data commons. A blueprint for new data commons could be valuable in guiding organizations through these processes and pointing them to the resources currently available.
The need for data commons use cases: While several examples of data commons are emerging, the concept of a data commons for AI remains nascent, and more work is needed to accelerate use cases in high-demand areas such as research and scientific discovery; education, training, and learning support; and content development and knowledge preservation. Moreover, practitioners should think about which AI functions a data commons can support, including pre-training, fine-tuning, or testing. A taxonomy of data commons use cases can be valuable in determining how exactly data commons can benefit society and where more work is needed.
***
We are currently working to operationalize all these takeaways. In the coming weeks, we will release a blueprint for data commons, along with a curated list of potential data commons use cases and a taxonomy of data sources, identifying ways organizations can act on the potential data commons provide.
If you would like to be informed when we release these deliverables, we encourage you to sign up for our Data Stewards Newsletter.