Toward a Fourth Wave of Open Data? Understanding how we can use generative AI for open data and how open data can inform generative AI
A summary of a discussion on the opportunities and challenges that generative AI tools present for the open data community
Posted on 6th of June 2023 by Stefaan Verhulst, Sampriti Saxena, Hannah Chafetz, Andrew Zahuranec, Adrienne Schmoeker
Generative AI tools such as ChatGPT, Bard, and Claude have attracted enormous attention. Yet, questions remain about whether these tools can support existing open data and Data for Good programs. How can generative AI help people better interface with open data? How can open data inform generative AI? How can we address the many risks and challenges facing these systems?
On May 17, 2023, the Open Data Policy Lab (a collaboration between The GovLab and Microsoft) hosted a first-ever discussion on the intersections between generative AI and open data and the ways in which generative AI could alter our existing conception of a third wave of open data. Specifically, we aimed to answer the following questions:
- How can generative AI facilitate new ways of engaging with open data?
- What approaches can we adopt to promote responsible use of generative AI systems and open data?
- Will generative AI enable a Fourth Wave of Open Data?
We brought together panelists with a variety of experiences with AI and open data, including:
- Stefaan Verhulst, Co-Founder and Chief Research and Development Officer, The GovLab (Moderator)
- Oliver Wise, Chief Data Officer, US Department of Commerce
- Holly Krambeck, Program Manager, World Bank Data Lab
- Hubert Beroche, Founder, Urban AI
- Sonia Cooper, Assistant General Counsel, Microsoft
In what follows we provide several takeaways from the discussion as well as a set of actions to help advance responsible generative AI use for open data.
The panelists provided several insights about the value propositions of generative AI for open data, key use cases, and areas of concern. Below we provide a few of the insights that stood out most.
- Oliver Wise spoke to the opportunities and risks of using generative AI within the public sector. Wise explained that generative AI provides a simple and intuitive interface that can help make open data more accessible and democratize insights. However, using generative AI on incomplete or low-quality datasets could produce misinformation, which could in turn erode trust in institutions. To promote a fourth wave of open data among institutions, Wise expressed the need for machine-interpretable data, with metadata and models that would enable AI tools.
- Holly Krambeck discussed how generative AI tools can be used to bridge the skill gap between coders and non-coders. Krambeck explained that these tools can help lower collaboration barriers and increase opportunities for new stakeholders to get involved in data-related processes. Additionally, Krambeck provided examples of how generative AI can be used to generate generic code (to be reviewed in the context of its use) and how Large Language Models can help make code more searchable.
- Hubert Beroche focused on how generative AI tools could be implemented within cities. Beroche explained that AI technologies are governed locally and their use is highly contextualized. Megacities (e.g. NYC) might develop their own generative AI systems, while smaller cities might form coalitions that share such systems or simply import those developed by other cities. Additionally, Beroche discussed the need for a social contract when implementing data-driven technologies at the local level.
- Sonia Cooper discussed Microsoft’s ‘co-pilot’ description of generative AI: tools that sit alongside us and make it easier to discover, process, and analyze data. Under this framing, human actors retain oversight and control over the use of the tools, which allows us to understand the provenance of data and to assess the outcomes. Cooper explained that, using this approach, Microsoft has implemented generative AI across several functions, including:
  - Bing Chat: Sourcing datasets, asking questions about and analyzing datasets, and identifying data provenance;
  - Excel: Automating the structuring and enrichment of data;
  - Power BI: Discovering trends in datasets and generating visualizations.
Generative AI has broad applications for open data, including but not limited to:
- Interface with Open Data Through AI Bots: Generative AI can facilitate the development of bots as an interface for user interaction with open data. These bots leverage natural language processing (NLP) and machine learning algorithms to provide answers to users' queries, thereby increasing data accessibility and usability. This minimizes the need for specialized skills or knowledge to access and interpret open data.
- Generation of Generic Code: Generative AI can produce generic code, which must be reviewed in the context where it will be used. Large Language Models can also make existing code easier to search.
- Automating Code Books: Generative AI can expedite the creation of code books or metadata documents describing datasets. This automation helps users understand the data better and significantly reduces the time and effort needed to put open data to use.
- Generation of Synthetic Data: Generative AI models, notably Generative Adversarial Networks (GANs), can produce synthetic data emulating real-world data. This capability can be extremely beneficial where data privacy is essential or the available data is insufficient.
- Improving Data Processing: Generative AI can potentially tag data for analysis, create metadata, and review or clean existing data, thereby improving data processing.
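To make the code book use case above more concrete, here is a minimal sketch of the deterministic part of the job: profiling a CSV file into a draft code book that a person, or a generative model under human review, could then annotate. It does not call any generative AI service. The function names (`infer_type`, `draft_codebook`), the coarse type labels, and the sample data are assumptions made for illustration and did not come from the panel.

```python
import csv
import io


def infer_type(values):
    """Guess a coarse type label from a list of non-empty strings."""
    if not values:
        return "unknown"
    for caster, label in ((int, "integer"), (float, "number")):
        try:
            for value in values:
                caster(value)
            return label
        except ValueError:
            continue
    return "text"


def draft_codebook(csv_text):
    """Return one draft code book entry per column of the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    codebook = []
    for name in reader.fieldnames or []:
        values = [(row[name] or "").strip() for row in rows]
        present = [value for value in values if value]
        codebook.append({
            "column": name,
            "type": infer_type(present),
            "missing": len(values) - len(present),
            "example": present[0] if present else None,
            # Left blank on purpose: drafting the description is the step
            # a generative model could assist with, subject to human review.
            "description": "",
        })
    return codebook


# Hypothetical sample data, invented for this example.
SAMPLE = "station_id,city,pm25\n101,Lagos,41.5\n102,,38\n103,Lima,\n"

for entry in draft_codebook(SAMPLE):
    print(entry)
```

Keeping the profiling step deterministic means the counts and types in the code book can be checked independently of whatever text a model later writes into the `description` field.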
While generative AI has the potential to be transformative, certain considerations and challenges demand attention:
- Improving Data Availability: For generative AI to operate optimally, open data must be structured and indexed so that AI algorithms can crawl it easily. Enhancements in data infrastructure and the implementation of standards for data formatting and metadata are essential.
- Importance of High-Quality Data: Generative AI heavily relies on the quality of data. Errors, missing values, or bias in data can considerably impact the performance and reliability of generative AI. Therefore, maintaining data quality and integrity is imperative.
- The Role of Questions to Steer Prompt Engineering: Designing and optimizing the prompts given to an AI model, known as prompt engineering, significantly influences the AI's output. It is important to identify the questions that matter so that they can steer the prompts that guide the AI.
- Documentation: Comprehensive documentation is critical to the trustworthiness and usability of generative AI applied to open data. Although automated tools can assist, they often struggle to supply the context that a human author provides.
- Data Provenance: Understanding the origin and quality of the data and how an AI model made a decision or prediction is crucial for user trust. Advancements in the field of explainable AI (XAI), which aims to make AI decision-making processes more transparent, are necessary to address this issue.
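The points above about metadata standards and provenance can be illustrated with a small completeness check run before a dataset is published or handed to an AI tool. This is a sketch under stated assumptions: the list of required fields is hypothetical and not drawn from any particular metadata standard, and the sample record is invented.

```python
import json

# Hypothetical list of required fields; a real programme would take
# these from whichever metadata standard it has adopted.
REQUIRED_FIELDS = ("title", "description", "license", "source", "last_updated")


def missing_metadata(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in the record."""
    return [
        field for field in required
        if not str(record.get(field) or "").strip()
    ]


# Invented sample record: no description, and an empty source.
record = json.loads(
    '{"title": "City air quality", "license": "CC-BY-4.0", '
    '"source": "", "last_updated": "2023-05-17"}'
)

gaps = missing_metadata(record)
print(gaps)  # ['description', 'source']
```

A check like this does not establish that the metadata is accurate; it only flags records where provenance information is missing altogether.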
To conclude, the panelists provided a set of key actions to be prioritized in order to accelerate generative AI for open data. These include:
- Understanding the value propositions of Generative AI;
- Investigating the current use cases of Generative AI within open data;
- Understanding the requirements of a social contract around Generative AI;
- Including crawler-friendly data within open government data; and
- Promoting data integrity standards within industries.
The full recording can be found here. In the coming weeks, we plan to further process these insights and develop a detailed action plan towards generative AI for open data. Stay up-to-date on the latest developments of this work by signing up for the Data Stewards Network Newsletter.