The Intersections of Open Data and Generative AI: New Additions to the Observatory — April
Posted on April 24, 2025 by Hannah Chafetz, Roshni Singh, and Stefaan Verhulst
The Open Data Policy Lab’s Observatory of Examples of How Open Data and Generative AI Intersect provides real-world use cases of how open data from official sources intersects with generative artificial intelligence (AI), building on insights from our report, A Fourth Wave of Open Data? Exploring the Spectrum of Scenarios for Open Data and Generative AI.
The Observatory now features over 100 examples* from diverse domains and geographies, focusing on sectors like humanitarian response, local decision-making, and agriculture.
We recently added 19 new examples highlighting several applications of generative AI. These include Ask ReliefWeb, a generative AI tool that helps humanitarian workers retrieve information from ReliefWeb’s repository of reports; AuroraGPT, an AI model trained on scientific papers and computational data to support research in areas like biology, cancer studies, and climate science; and InkubaLM, a language model that supports AI applications for African languages with limited digital resources, such as Swahili, Yoruba, and Hausa, handling tasks like text translation, sentiment analysis, and keyword recognition to improve AI inclusivity for underrepresented language communities.
What's New in the Observatory?
AgroLLM: a conversational tool that uses generative AI to provide farmers with advice on topics such as crop management, climate impact, and pest control, aiming to help them make informed decisions based on agricultural resources and open datasets.
ALIA-40b: a generative AI model trained on multilingual open datasets, including the Norwegian Colossal Corpus, the Estonian National Corpus, and the Danish Parliament Corpus. It is designed to perform tasks such as content generation, text summarization, conversational interactions, and translation in multiple languages.
Ask ReliefWeb: a conversational tool designed to assist humanitarian workers by retrieving information from ReliefWeb’s repository of humanitarian reports to support decision-making during crises.
AuroraGPT: a generative AI model trained on scientific papers and computational data to support research in biology, cancer studies, and climate science.
Bielik 7B v0.1: a generative AI model trained on Polish language texts and other multilingual datasets, designed for natural language processing (NLP) tasks like text generation, sentiment analysis, and question answering in Polish and English.
BLOOM, a BigScience Initiative: an open-access, multilingual large language model (LLM) to support researchers and institutions with AI development in several languages.
Dolma Dataset: a three-trillion-token open dataset spanning web content, academic publications, code, books, and encyclopedic material, aimed at supporting transparency, risk mitigation, and reproducibility for responsible AI development.
Generative AI Chatbot for Drilling and Production: a generative AI chatbot designed to analyze historical drilling and production reports, generate structured query language (SQL) queries, and provide recommendations to improve operations in the oil and gas industry worldwide.
GPT-SW3: an open-source generative AI model designed to support tasks for Nordic languages such as Swedish, Norwegian, Danish, and Icelandic, including content generation, translation, and digital assistant functions.
InkubaLM: a language model developed to support AI applications for African languages with limited digital resources, such as Swahili, Yoruba, and Hausa, helping with tasks like text translation, sentiment analysis, and keyword recognition in efforts to improve AI inclusivity for underrepresented language communities.
Inlook.AI: a conversational tool designed to help users access and visualize statistical data by querying official datasets using natural language, supporting multilingual queries for statistical offices and private companies.
Instruction Tuning for Low-Resource Languages: A Case Study in Kazakh: a dataset created to improve large language models’ ability to follow instructions in the Kazakh language, using open data from government and cultural sources to support better understanding of local governance and culture.
LawPal: a generative AI chatbot designed to make legal information more accessible by answering users' questions and providing insights on case law, statutes, and legal principles.
OLMo 2: an open-source generative AI language model designed to perform tasks such as instruction-following, text generation, and conversational AI, intended for a wide range of applications in research and development.
Queried: a research tool that uses generative AI to assist users in analyzing climate law and policy documents. It allows users to search for and extract specific information from documents in the Climate Change Laws of the World database, managed by Climate Policy Radar, which includes laws and policies on energy, transport, land use, climate resilience, and low-carbon transitions.
SatGPT: a conversational tool that integrates generative AI with Earth observation data, designed to generate readable descriptions from satellite imagery for applications such as flood monitoring, agricultural assessments, and urban planning, aiming to provide insights to support decision-making in these areas.
Scholastic AI: a document analysis tool that uses retrieval-augmented generation (RAG) to help users extract and interpret information from various documents, such as PDFs. It allows users to upload their own files and generate responses based on the content within them (a minimal sketch of this retrieval-and-generation pattern appears after this list).
SEA-LION: a family of open-source large language models trained on multilingual datasets from Southeast Asia, designed to improve cultural representation and support natural language processing (NLP) tasks like translation, summarization, and question answering in low-resource languages such as Thai, Vietnamese, and Bahasa Indonesia.
USAFacts: a platform that uses generative AI to process and standardize open government data from federal, state, and local sources, generating written content such as summaries and explanations based on official government data.
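Several of the tools above, including Scholastic AI and Ask ReliefWeb, follow the retrieval-augmented generation pattern: instead of relying on a model's training data alone, they first retrieve the passages most relevant to a user's question and include them in the prompt. The Python sketch below illustrates that pattern in minimal form; the TF-IDF retriever, function names, and prompt format are our illustrative assumptions, not any listed tool's actual implementation.

```python
# A minimal sketch of retrieval-augmented generation (RAG):
# split a document into chunks, rank chunks against the user's
# question, and pass only the best matches to a language model.
# The TF-IDF retriever and prompt format are illustrative
# assumptions, not any listed tool's actual implementation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def retrieve(chunks: list[str], question: str, k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    vectorizer = TfidfVectorizer()
    chunk_vectors = vectorizer.fit_transform(chunks)
    question_vector = vectorizer.transform([question])
    scores = cosine_similarity(question_vector, chunk_vectors)[0]
    top = scores.argsort()[::-1][:k]
    return [chunks[i] for i in top]


def build_prompt(chunks: list[str], question: str) -> str:
    """Assemble the prompt a RAG tool would send to its language model."""
    context = "\n\n".join(retrieve(chunks, question))
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"


# Example: in a tool like Scholastic AI, chunks would come from an uploaded PDF.
chunks = [
    "The 2023 annual report shows a 12% increase in rural broadband access.",
    "Appendix B lists the survey methodology and sample sizes.",
    "Urban broadband access remained flat year over year.",
]
print(build_prompt(chunks, "How did rural broadband access change?"))
```

Production systems typically swap the TF-IDF step for dense vector embeddings and send the assembled prompt to a large language model; the structure, retrieve then generate, stays the same.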
Key Themes
In what follows, we highlight a few key themes from these additions:
Multilingual and Cultural Inclusivity in Generative AI Solutions: We identified a few examples that seek to improve the capabilities of generative AI in underrepresented languages and cultures. For instance, SEA-LION’s LLMs focus on Southeast Asian languages such as Thai and Bahasa Indonesia, while InkubaLM seeks to support African languages like Swahili, Yoruba, and Hausa. Additionally, Instruction Tuning for Low-Resource Languages: A Case Study in Kazakh is a dataset created to improve language models' ability to follow instructions in the Kazakh language, using open data from government and cultural sources to support a better understanding of local governance and culture.
Advancements in Scientific and Environmental Research: Tools like AuroraGPT and SatGPT showcase how AI is being applied in scientific research and environmental monitoring, using open datasets to generate insights in areas such as climate science, biology, and agriculture. AuroraGPT is trained on scientific papers to assist with biology and cancer studies, while SatGPT uses satellite imagery for flood monitoring and urban planning.
Enhancing Accessibility and Service Delivery: Several of the initiatives identified use generative AI to address specific service gaps. Inlook.AI is designed to provide responses and visualizations of statistical data based on user queries, helping policymakers and researchers analyze official datasets. Additionally, LawPal helps individuals find legal information, with the goal of making legal advice more accessible.
Improving Humanitarian Response: Generative AI tools are being developed to support humanitarian efforts by improving data access and decision-making in crisis situations. For example, Ask ReliefWeb is a tool designed to help humanitarian workers retrieve relevant information from ReliefWeb’s repository of humanitarian reports to support real-time decision-making during disasters.
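ReliefWeb’s reports are accessible through a public API (api.reliefweb.int), which is the kind of open-data foundation a tool like Ask ReliefWeb can build on. The Python sketch below shows how the retrieval step might look; Ask ReliefWeb’s actual pipeline is not public, so the appname value and the decision to return only report titles are illustrative assumptions.

```python
# A sketch of the retrieval step behind a tool like Ask ReliefWeb:
# query ReliefWeb's public API for recent reports matching a crisis
# keyword. Ask ReliefWeb's actual pipeline is not public; the
# "appname" value and returning only titles are assumptions.
import requests


def fetch_reports(keyword: str, limit: int = 5) -> list[str]:
    """Fetch matching report titles from the public ReliefWeb API."""
    response = requests.get(
        "https://api.reliefweb.int/v1/reports",
        params={
            "appname": "example-demo",  # identifier required by the API
            "query[value]": keyword,
            "fields[include][]": "title",
            "limit": limit,
        },
        timeout=30,
    )
    response.raise_for_status()
    return [item["fields"]["title"] for item in response.json()["data"]]


if __name__ == "__main__":
    for title in fetch_reports("flood response Pakistan"):
        print("-", title)
```

A generative AI layer would then summarize or answer questions over the retrieved reports, rather than sending users to read each document themselves.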
***
[Image: a screenshot of the New Commons Challenge website]
Are you working to make open data ready for generative AI? Are you interested in setting up a data commons to enable AI-ready data? Or do you know someone who is?
Aligned with the Observatory, the Open Data Policy Lab is excited to invite global changemakers to propose data commons for generative AI that serve the public interest through the New Commons Challenge.
This challenge aims to enhance data diversity, quality, and provenance, unlocking AI’s potential to solve complex challenges. We are looking for proposals for data commons concepts that support localized decision-making or strengthen humanitarian interventions.
Winners will receive US $100,000 in funding as well as mentorship and technical support. If you're interested in working on impactful data commons that aim to improve decision-making or humanitarian efforts, we encourage you to apply by submitting a concept note by June 2, 2025. For more information, visit the New Commons Challenge website.
Do you know of any real-world examples of generative AI and open data that should be included in the Observatory? Submit an example by visiting our Observatory.
Have any suggestions for improvements or are interested in collaborating? Please contact us at [email protected].
***
*The Observatory aims to capture ongoing efforts to integrate generative AI with official open data. It does not assess the effectiveness or practices of these applications. Many of the examples presented raise important ethical considerations that warrant further examination and discussion.