New Publication
The Current State of Open Data and Data Stewardship: Global perspectives from the Data Stewards Alumni
Forecasting the openness of data in 2024, assessing the impact of generative AI on open data, and identifying trends in data stewardship
Posted on 17th of June 2024 by Adrienne Schmoeker
Since the advent of the Open Data Policy Lab in 2020, The GovLab has engaged dozens of guest faculty and students in a variety of courses on data stewardship via the Data Stewards Academy and its programs. Now that this community has grown to more than 100, we felt it would be insightful to regularly check the ‘pulse’ of this network to assess certain trends and the state of open data. This summary is the first recap of our inaugural Data Stewards Pulse survey issued in March 2024.
More than 20% of the community responded to five questions focused on trend forecasting for data openness in 2024, the impact of generative AI on the open data ecosystem and trends in data stewardship as a role and profession. Pulse contributors were anonymous if they did not want public attribution for their responses in summaries such as this one. We hope you enjoy this read and look forward to your thoughts and ideas for future iterations of the Data Stewards Pulse.
Summary of insights
The three subjects of this Data Stewards Pulse survey are interrelated, and together they paint a mixed picture of the future prospects for open data and data stewardship. The arrival of generative AI as a more mainstream consumer product in November 2022 helped increase lay literacy around AI. As this understanding continues to grow, so too does an understanding of the importance of data and the need for roles like data stewardship to properly govern and enable opportunities for data. The conversation around open data in the “Age of AI” is also gradually gaining traction: in May of 2024 our team published “A Fourth Wave of Open Data? Exploring the Spectrum of Scenarios for Open Data and Generative AI” to share what we’ve learned, researched and observed as opportunities for open data in this Age of AI.
However, as awareness around AI increases, so too does a realization of the need for governance structures and accompanying principles such as commitments to transparency, a code of ethics, and collaboration. While this last point may present an opportunity for established solutions such as open data and for exploring mechanisms for data sharing and data re-use, there is also a risk that “closed” data and/or regulation are seen as more prudent, and expedient, solutions by those wary of AI products and looking to protect the best interests of AI users and data subjects.
“It seems more attention than ever is on data as AI applications become commonplace across sectors and everyday life. However, with this rapid acceleration, attitudes towards open data feels like a mixed bag - in some cases industry is pushing for it, in others there is a sense that openness would create vulnerabilities for users or outside threats.” - Anonymous
“Governments are convinced open data is a good investment, but publish[ing] "quality" data is still a challenge for them.” — Jorge Alvarado, TrustRelay
Forecasting the openness of data for 2024
We asked Pulse contributors to account for trends they are seeing within their country or sector and to tell us how they foresee data openness evolving in 2024.
The average response was 3.4 out of 5, which we interpret as a forecast of near-stagnation: data openness advancing, but at a very slow rate. This is in line with what Dr Stefaan Verhulst of The GovLab predicted in his January 2024 post about an impending Data Winter.
“Generative AI represents a double-edged sword in the context of data accessibility. On the one hand, it offers immense potential for innovation and democratizing access to data and knowledge. On the other hand, the fear of misuse and the lack of clear regulatory frameworks around AI data reuse are contributing to a more restrictive data environment. This paradox highlights the need for a balanced approach that safeguards against misuse while not stifling data flows essential for the public interest and social good.” — Dr Stefaan Verhulst
Pulse contributors accompanied their forecast scores, which averaged 3.4, with insights, from which we’ve gleaned that while commercial data ecosystems are becoming more closed, there is a trend towards data openness across the public sector. Based on other insights from Pulse contributors, we can hypothesize that the latter trend is in part due to the evolution of the data literacy and regulatory landscape, specifically data privacy literacy and, most notably, legislation in Europe. Europe was mentioned frequently as a first mover on policy for data and AI, and the recent Data Act was cited as a catalyst in support of data re-use efforts. Federal-level open data policies were also mentioned as having a trickle-down adoption effect on subnational governments, helping further advance open data in the public sector.
“The data privacy conversation is gradually becoming more nuanced. The need for regulation is very clear; the discussion about how open data can benefit public policy is further behind, but gaining some traction.” - Bapu Vaitla, Data2X Data Fellow
Beyond these trends in the private and public sectors, various enablers for open data were cited by Pulse contributors. Climate change, and a thirst for data on the subject to facilitate collaboration and problem solving, was mentioned by a few as a driver for data openness, while others mentioned user demand, availability of funding opportunities and availability of staff capacity as key drivers. Public scandals (such as data breaches) were shared as an impactful mechanism for forcing a conversation around data, and adding open data indicators to existing institutional reporting was shared as another mechanism to spur open data adoption.
Lastly, the “AI effect” was mentioned as having an impact on the future state of data openness. Some mentioned that there is an appreciation that AI runs on data, but little more is known beyond that. Others framed this as an opportunity to introduce open data as the ‘responsible and ethical’ approach to some of the concerns being raised regarding how the digital technology industry currently operates its data and AI products. Finally, a comment was made predicting that once highly regulated industries (such as banking and law) adopt AI, this will drive policy for AI and subsequently open doors for data and open data alike.
Assessing the impact of generative AI on open data
It is important to note that as regards our series of questions about the impact of AI on the open data ecosystem and utilization of open data by generative AI, many Pulse contributors noted that it was too early to tell and that they did not know of any examples. Many cited the ongoing novelty of generative AI and associated hype - one respondent suggested that generative AI practices “fueled ‘anti-tech’ narratives by politicians warning of the issues with technology.”
Of those who did respond, statements echoed the predicted “AI effect” on data openness described above and suggested that there will be a “policy race” to be the first to regulate generative AI in particular.
Other respondents shared that within the public sector they are seeing AI Task Forces and government chatbots as drivers of open data, and a positive effect of AI on public sector workforce efficiency (when AI products are able to be used by government employees).
When asked for examples of how respondents are seeing generative AI use open data, a few examples were shared (see the end of this post for a summary of resources and examples) and others broadly stated that it depended on the type of open data. A Pulse contributor mentioned not seeing much uptake of geospatial data by generative AI; others mentioned that they saw opportunities for data with a focus on feature detection (e.g., number and types of trees, number and types of buildings, etc.). Other observed use cases focus on natural language model querying and summarization of open data, or on applying generative AI to support the development of data governance resources (for example, translating data sharing agreements or making them culturally relevant).
“As the need for high quality, unbiased, and representative data increases in order to meaningfully address underserved populations and have fair and inclusive AI models, there will be a forced shift towards more open data with the correct guardrails in place.” - Anonymous
“I think open data needs to become far more publicized as a way of working that is safe and responsible. There could be a way to frame this as an ethical way of working or also in terms of being a more environmentally responsible tech practice. Thinking specifically about how to connect this with bigger trends that are sweeping the tech industry at the moment. If open data practices could become as well known as Gen AI, you'd see groundswell uptake.” - Lisa Talia Moretti, AND Digital
Identifying trends in data stewardship
When we asked Pulse contributors whether they felt the more modern definition of data stewardship was becoming more widely recognized, the responses were across the spectrum, with most responding along the lines of “yes, however…”. Digging further into the responses, some notable trends emerged, mirroring those regarding data openness: increased adoption in the public sector, increased adoption in Europe thanks to new regulation such as the Data Act, and some increased awareness around the need for data stewardship thanks to generative AI.
“I perceive greater awareness of the importance of proper management of data beings and the value they entail. There is greater care in the application of legal restrictions associated with the exchange of information, as well as the ethical management of the information that is stored and processed for the generation of new knowledge.” - Sergio Carrera Riva Palacio, Digital Transformation Advisor
Some comments on data stewardship raised a topic not mentioned in other parts of the Data Stewards Pulse results: an increased level of dialogue around data “ownership,” particularly in areas of indigenous and government interaction, and a need to reconcile the definition of data sovereignty at a global level.
A continued trend mentioned by a few centered around the terminology of “data stewardship”: one contributor noted that there are too many competing terms in the field of data management, while others pointed to the observation that many already perform data stewardship roles despite not using the title of a “data steward.” Challenges around talent retention in government were cited, in addition to continued gaps in pay equity in technology-related roles more broadly. Finally, challenges in data interoperability, standardization and data sharing were mentioned as hindering the adoption of formalized data stewardship roles.
Resources and examples from Pulse contributors:
Advancements in data openness and resources for data openness tracking:
UK open data example: Blog Post: Using the power of linked data to understand factors preventing people from working
UK open data example: Postcode Address File
Slovenia has established a program for the acceleration of opening data in the public sector
The Swiss government has included data spaces as part of its strategy in domestic and international affairs
Notable advancements in Iran: 1. Data of large bank debtors on a quarterly and regular basis 2. Financial and accounting performance data of all public companies in a standard format with high granularity 3. Data on the status and number of government employees in various institutions 4. Parliament election data focusing on the state of citizens' participation in elections
Canada is making good progress in making equity, diversity, and inclusion data more available. For example, a new report shares much information, including a link to "a data visualization tool".
Inpher: They have developed a novel way to leverage cryptography to develop privacy-preserving systems so that companies can work on highly sensitive data without needing to reveal it, even to the teams working on a system.
Palestinian Central Bureau of Statistics (PCBS) Data Portal: data on this platform has become more open. Microdata that PCBS previously released as licensed data files is now available as public-use data files; users register once and can download what they want.
Open Data Inventory by Open Data Watch is a helpful resource for seeing trends and “openness” for various governments.
Open Earth Platform Initiative - International cooperation on shared climate action priorities enabled through a shared open policy and data standards.
UN Biodiversity Lab - a public/ private partnership to share open data around biodiversity.
Digital Public Infrastructure for Electoral Processes - building blocks for open data standards, portals and apps
Reductions in data openness:
Fears regarding data sovereignty (e.g., TikTok)
Changes to access to social media data, such as Twitter/X or Reddit.
In Slovenia, there are now restrictions in the transit sector on detailed car registration data, because the VIN is now defined as personal data.
In Europe, dropping open data from the DESI index measurement is under consideration
Fine-grained COVID-19 case data has become harder to access.
In Madagascar: Stagnation in the openness of public finance data, due to the national election
Examples of open data use or trends towards openness by generative AI:
Paper outlining aligned work by multiple organizations trying to address the openness and ethical questions around AI systems from OSI, DPGA, Mozilla, Open Future and many others.
TrustRelay is working with Swiss data and other data sources on the development of data sharing agreements across multiple languages and cultures.
Digital Green's use of farmer and extension-worker sourced data to create a chatbot to answer contextualized questions
An application called Pravko which helps end users understand legal requirements for certain situations, providing citations and built on open data
Data.org GenAI Challenge with Microsoft has 5 use cases
OTTAA (Chile) optimizes AI to ease communication for people with disabilities.
Afinidata (Guatemala) uses artificial intelligence to provide parents with a personal assistant that guides them with personalized and effortless early-childhood activities.
Avyantra (India) addresses the early diagnosis of neonatal sepsis by applying a machine learning model that captures data of newborns with neonatal sepsis conditions and their associated treatments.
SElf-SupERvised (SEER), developed by Meta, is a self-supervised computer vision model that can learn directly from random collections of images on the internet — without the careful data curation and labeling needed for conventional computer vision training — and then output an image embedding. SEER was developed by Meta’s AI Research team in 2023 with a focus on its ability to function in underrepresented locales worldwide. As a result, SEER delivers improved performance on fairness benchmarks across genders, apparent skin tones, and age groups, and understands images from around the world well enough to geo-localize them with unprecedented precision.
Southeast Asia’s sovereign LLM
Advancement of data stewardship:
Slovenia has established a Data Stewards Network
Curious to learn more and keep up with the Data Stewards Pulse?
Sign up for the Data Stewards Network newsletter, a weekly newsletter where The GovLab publishes insights, opportunities and news related to data stewardship and data re-use.