Fourth Wave of Open Data Seminar: Making Open Data Conversational
Posted on 7th of May 2025 by Andrew Zahuranec
This blog is part of a continuing series on the Fourth Wave of Open Data. To read the first blog in the series, click here.
The Fourth Wave of Open Data, centered on the combination of open data and generative AI, offers significant potential. Brought together, they can make open datasets more accessible and conversational, and AI systems themselves can be better trained to answer questions.
These issues were the focus of The Open Data Policy Lab’s second panel on the Fourth Wave of Open Data, “Making Open Data Conversational”. Joined by a small group of AI practitioners—Anastasia Stasenko (CEO, pleias), Kari D’Elia (Head of Product, USAFacts), and Amra Dorjbayar (Co-Founder & CEO, Wobby.AI)—The GovLab’s Stefaan Verhulst discussed how AI might be made more accessible and used to democratize data access.
Using Federal Data
The discussion began with a question to Kari D’Elia of USAFacts. Stefaan asked Kari to explain what her organization does and how the evolving role of generative AI has changed it.
Kari explained that USAFacts emerged when former Microsoft CEO Steve Ballmer began focusing his philanthropy on ways to lift children out of poverty in the US. Being data-driven, Ballmer wanted numbers to know where he could make the most difference, but found that getting that data was very hard.
“The data was there, but it was hard to read. It was hard to access,” Kari said.
USAFacts, then, emerged to take open government data and make it genuinely accessible to the general public. This work began six years ago with a simple website, a database, and a search bar that let people look up data. Over time, it incorporated AI tools that were, at first, not conversational but instead focused on enhancing analysis.
“We put all the data tables [on public education funding] behind it, we put a generative AI application on top of it that's trained in our tone of voice and how we do research, and it can actually create an analysis for all 19,000 school districts in the country,” Kari said.
Based on that first success, USAFacts has spent the last few months applying that approach across its analysis. It is now introducing a conversational layer that’s not just a chatbot but a tool that allows readers to ask questions and change the analysis. This tool, they hope, will launch in the coming months. Much of the team is currently cleaning the data to get it into a format that AI systems can use reliably and work with deterministically.
Readiness and Interoperability
Stefaan then turned the discussion to Amra Dorjbayar, asking him to expand upon how Wobby started. Noting Amra’s data journalism background, he asked for an explanation of “what was the idea and where are you today?”
Amra responded by noting that he had learned a lot over the last two years about how AI works and the state of open data.
“For us, the ambition was initially very big. Our idea was, ‘Let's connect all the data in the world and make the data AI-ready,’” said Amra. “At the end of the last year, we kind of hit a wall. We realized that we were trying to solve too many problems at the same time.”
“So one problem to make open data conversational is to actually have access to all the data.”
As a small start-up in Belgium, Wobby.AI had to learn how to work at its scale. The organization offered its tools to users and journalists and realized there was a big difference in the types of questions that non-experts actually asked. Those questions were often very general, which required Wobby to think critically about the kind of data it collected and processed.
“It needs a lot of context. It needs to understand what each column means, but also what each definition and category means. It needs to understand what the limitations of the data are, so you need to feed that as well. But more than that, we realized that even if you have good quality data and you have amazing metadata in place, it still does not reach 100% accuracy reliably.”
In response, Wobby pivoted toward making data analysis reliable, instituting “guardrails” that control what kinds of questions users can ask so that they can trust the accuracy of the resulting analysis. This approach is a way of avoiding so-called “AI hallucinations” and other problematic outputs, an urgent need in the field.
Model Design
Stefaan then turned to Anastasia Stasenko to discuss pleias and its ongoing efforts to democratize AI and develop it as a tool.
Anastasia noted that pleias is built exclusively on open data, open in the strong sense of the word: it trains its models on open data with permissive licenses. It has relied on what The Open Data Policy Lab has described as the “first and second waves of open data”, namely governmental open data.
Much of this work has fed into the Common Corpus, the largest pre-training corpus available for training large language models. It consists of over two trillion tokens, is multilingual, carries permissive licenses, and contains a large amount of governmental open data from the French Open Data Program, the US federal government, and others.
She noted how this difference in inputs distinguishes pleias’s models from other approaches, which are largely built on social media data.
“We actually taught them to kind of speak legalese and to speak the language of governmental data. We have been working with the French public services and others. They want factual AI.”
She also noted difficulties in the space stemming from a lack of similar openness among other AI developers and a refusal to expand access. These limits constrain the ability to design new, accurate AI systems.
“A lack of transparency is slowing us down,” she said. “Most data is actually locked in horrible PDFs. It's unstructured open data. For example, we have collected over 100 million governmental PDFs not only with laws, but, you know, financial reports, lots of interesting and very important stuff, both upstream and downstream. Others process it. They get value out of very factual and good data, but then they don't contribute back.”
She closed by noting the need to preserve access to data, even as interest and funding wane.
“The very important part of this work is ensuring that open can stay open.”
***
These are just a few of the reflections offered in our second seminar on the Fourth Wave of Open Data. To follow the full discussion, watch the video.
We plan to host our third webinar, on “Future-Proofing Open Data”, in May. Stay tuned for the scheduling announcement.