
New Publication

NEW REPORT: A Fourth Wave of Open Data? Exploring the Spectrum of Scenarios for Open Data and Generative AI

In the Open Data Policy Lab's new report, the team provides a framework and recommendations to support open data providers and other interested parties in making open data “ready” for generative AI.

Posted on May 8, 2024 by Hannah Chafetz and Stefaan Verhulst


Since late 2022, generative AI services and large language models (LLMs) have transformed how many individuals access and process information. However, how generative AI and LLMs can be augmented with open data from official sources, and how open data can be made more accessible with generative AI (potentially enabling a Fourth Wave of Open Data), remains an underexplored area.

For these reasons, the Open Data Policy Lab (a collaboration between The GovLab and Microsoft) decided to explore the possible intersections between open data from official sources and generative AI. Over the past year, the team has conducted a range of research initiatives on the potential of open data and generative AI, including a panel discussion, interviews, and the Open Data Action Labs, a series of design sprints with a diverse group of industry experts.

These initiatives informed our latest report, “A Fourth Wave of Open Data? Exploring the Spectrum of Scenarios for Open Data and Generative AI” (May 2024), which provides a new framework and recommendations to support open data providers and other interested parties in making open data “ready” for generative AI.

A Spectrum of Scenarios for Open Data and Generative AI

The report outlines five scenarios in which open data from official sources (e.g. open government and open research data) and generative AI can intersect. Each scenario includes case studies from the field and a specific set of requirements that open data providers can focus on to become ready for that scenario. The five scenarios are:

  1. Pretraining: Training the foundational layers of a generative AI model on vast amounts of open data 

    • Quality requirements: High-volume, diverse, unstructured data that is representative of the desired output domain and its stakeholders.

    • Metadata needs: Clear information on sourcing. 

  2. Adaptation: Fine-tuning or grounding a pre-trained model on specific open data for targeted tasks

    • Quality requirements: High accuracy, relevance to the target task, balanced distribution, tabular and/or unstructured data. 

    • Metadata needs: Clear labels, metadata about collection and annotation process, standardized schemas, knowledge graphs. 

  3. Inference and Insight Generation: Using a generative AI model to make inferences and extract insights from open data

    • Quality requirements: High quality, complete, and consistent tabular data. 

    • Metadata needs: Documented data collection methods, source information, and version control. 

  4. Data Augmentation: Leveraging open data to generate synthetic data or providing ontologies to expand training sets for specific tasks

    • Quality requirements: Accurate representation of real data, adherence to ethical considerations, tabular and/or unstructured data. 

    • Metadata needs: Transparency about the generation process (including privacy compliance and ethical reviews) and potential biases. 

  5. Open-Ended Exploration: Expanding the potential of open-ended data exploration through generative AI

    • Quality requirements: Diverse, comprehensive tabular and/or unstructured data.

    • Metadata needs: Clear information on sourcing and copyright, understanding potential biases and limitations, entity reconciliation.

Using this approach, we aim to help open data providers make progress toward individual scenarios and, in the long term, become ready for all possible scenarios.
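Several of the scenarios above share the same metadata needs: clear sourcing, documented collection methods, licensing, versioning, and notes on potential biases. As a purely illustrative sketch (the field names below are our own assumptions, not a schema from the report), a data provider could publish a small machine-readable documentation record alongside each dataset:

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class OpenDatasetRecord:
    """Illustrative documentation record for an open dataset.

    The field names are hypothetical; a real deployment would follow an
    established metadata standard (such as DCAT) rather than this sketch.
    """
    title: str
    publisher: str
    source_url: str                  # clear information on sourcing
    license: str                     # copyright and reuse terms
    collection_method: str           # documented data collection methods
    version: str                     # version control for downstream users
    known_limitations: list[str] = field(default_factory=list)  # potential biases

record = OpenDatasetRecord(
    title="City air quality measurements",
    publisher="Example City Open Data Portal",
    source_url="https://example.org/air-quality",
    license="CC-BY-4.0",
    collection_method="Hourly sensor readings, aggregated daily",
    version="2024-05-01",
    known_limitations=["Sensor coverage is sparse outside the city centre"],
)

# Serialise the record so both people and generative AI pipelines can read it.
print(json.dumps(asdict(record), indent=2))
```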


Recommendations for Advancing Open Data and Generative AI

While developing the “Spectrum of Scenarios,” we learned a great deal about the different requirements open data must meet to support generative AI. Drawing on these lessons, we developed the following data governance and management recommendations to help improve access to, and gain greater insight from, open data.

  1. Enhance Transparency and Documentation: Improving the transparency and documentation of open data not only fosters ethical and responsible use throughout the data lifecycle, but also helps data holders and users better evaluate the lineage, quality, and impact of the output.

  2. Uphold Quality and Integrity: Upholding data quality and integrity is key to advancing generative AI for open data inference and insight generation, data augmentation, and open-ended exploration.

  3. Promote Interoperability and Standards: Improving the interoperability of data and promoting the adoption of shared data and metadata standards would address many long-standing pain points in the open data ecosystem that prevent the efficient and effective use of open data.

  4. Improve Accessibility and Usability: Making open data easier to find, access, and use, including in machine-readable formats, becomes even more important as generative AI broadens who can work with open data and how.

  5. Address Ethical Considerations: As the uses of open data expand in light of developments in the generative AI ecosystem, it is more important than ever before to take action to protect data subjects and prevent harm.

***

Learn more by visiting our project website HERE.

Read the full report HERE.

If you have any questions or feedback or would like to share how this framework helped you, please contact us at [email protected].

