New Publication

“Data Commons”: Under Threat by or The Solution for a Generative AI Era? Rethinking Data Access and Re-use

Posted on 10th of May 2024 by Stefaan Verhulst, Hannah Chafetz, Andrew Zahuranec

This blog was originally published on the Data & Policy blog on May 9th, 2024. The original piece can be found here.

One of the great paradoxes of our datafied era is that we live amid both unprecedented abundance and scarcity. Even as data grows more central to our ability to promote the public good, so too does it remain deeply — and perhaps increasingly — inaccessible and privately controlled. In response, there have been growing calls for “data commons” — pools of data that would be (self-)managed by distinctive communities or entities operating in the public’s interest. These pools could then be made accessible and reused for the common good.

Data commons are typically the results of collaborative and participatory approaches to data governance [1]. They offer an alternative to the growing tendency toward privatized data silos or extractive re-use of open data sets, instead emphasizing the communal and shared value of data — for example, by making data resources accessible in an ethical and sustainable way for purposes in alignment with community values or interests such as scientific research, social good initiatives, environmental monitoring, public health, and other domains.

Data commons can today be considered (the missing) critical infrastructure for leveraging data to advance societal wellbeing. When designed responsibly, they offer potential solutions for a variety of wicked problems, from climate change to pandemics and economic and social inequities. However, the rapid ascent of generative artificial intelligence (AI) technologies is changing the rules of the game, leading both to new opportunities as well as significant challenges for these communal data repositories.

On the one hand, generative AI has the potential to unlock new insights from data for a broader audience (through conversational interfaces such as chats), foster innovation, and streamline decision-making to serve the public interest. Generative AI also stands out in the realm of data governance due to its ability to reuse data at a massive scale, which has been a persistent challenge in many open data initiatives. On the other hand, generative AI raises uncomfortable questions related to equitable access, sustainability, and the ethical re-use of shared data resources. Further, without the right guardrails, funding models and enabling governance frameworks, data commons risk becoming data graveyards: vast repositories of unused, and largely unusable, data.

In what follows, we lay out some of the challenges and opportunities posed by generative AI for data commons. We then turn to a ten-part framework to set the stage for a broader exploration on how to reimagine and reinvigorate data commons for the generative AI era. This framework establishes a landscape for further investigation; our goal is not so much to define what an updated data commons would look like but to lay out pathways that would lead to a more meaningful assessment of the design requirements for resilient data commons in the age of generative AI.

Data Commons and Generative AI: Opportunities and Challenges

Scholars point to the Charter of the Forest, issued in 1217 as a companion document to the Magna Carta, as the document that codified the concept of the commons. While the Magna Carta focused on the rights of barons and nobles, the Charter of the Forest addressed the rights of “commoners” to use common land for grazing livestock, collecting firewood, and harvesting other forest resources for their subsistence, laying the groundwork for the development of modern environmental and property laws.

Since then, the concept has been used for identifying and protecting common resources, such as cultural heritage, among others. As such, the concept of a data commons is also not new. Arguably, it predates the Internet, extending to the start of scientists and researchers pooling data and evidence while publishing their findings to advance public knowledge, often according to the FAIR principles of open science. Earlier research on the commons tended to focus on non-digital (i.e., non-datafied) incarnations.

In 1968, in a groundbreaking article in Science titled “The Tragedy of the Commons,” Garrett Hardin drew attention to the problems that arise from public access to, and overuse of, a finite resource (e.g., a pasture, or the environment). More recently, the “tragedy” has also been applied to intangible or seemingly infinite resources, such as digital artifacts, where scarcity may reside more in the trust, resources and capacities needed to create, maintain or provide access to them.

In 2009, Elinor Ostrom shared the Nobel Prize in Economics for her work on governance of the commons, showing how finite resources can be effectively managed or conserved by innovative frameworks that offer an alternative to traditional regulation or market-based processes. She did this by expanding the traditional economic framework of goods classification, which typically focused on the dichotomy between private and public goods. She introduced two additional categories: club goods and common-pool resources (CPRs), each with distinct characteristics and implications for management and policy. Her research demonstrated that communities could sustainably manage CPRs through collective action and self-governance, effectively challenging the assumption that only top-down approaches like government or private ownership could manage resources effectively. She identified several design principles crucial for this success, including: clearly defined boundaries, congruence with local conditions, collective decision-making, effective monitoring, graduated sanctions, accessible conflict resolution, recognized rights to organize, and structured nested enterprises. More recently, these principles have been used by several organizations like the Mozilla Foundation and the Ada Lovelace Institute to define how data resources should be managed (as CPRs).
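
The four-way classification of goods mentioned above can be sketched as a small lookup. This is only an illustration: the two axes used here (whether people can be excluded from a resource, and whether one person's use reduces what is left for others) are the standard textbook axes rather than something spelled out in this post, and the names `GoodType` and `classify_good` are hypothetical.

```python
from enum import Enum


class GoodType(Enum):
    PRIVATE = "private good"
    PUBLIC = "public good"
    CLUB = "club good"
    COMMON_POOL = "common-pool resource"


def classify_good(excludable: bool, rivalrous: bool) -> GoodType:
    """Place a resource in the two-by-two grid of excludability and rivalry."""
    if excludable and rivalrous:
        return GoodType.PRIVATE
    if excludable:
        # Excludable, but one person's use does not diminish another's.
        return GoodType.CLUB
    if rivalrous:
        # Hard to exclude anyone, yet use depletes the resource.
        return GoodType.COMMON_POOL
    return GoodType.PUBLIC


# Classic illustrative examples (assumptions, not taken from the post).
examples = {
    "loaf of bread": classify_good(excludable=True, rivalrous=True),
    "subscription-only archive": classify_good(excludable=True, rivalrous=False),
    "shared pasture": classify_good(excludable=False, rivalrous=True),
    "lighthouse signal": classify_good(excludable=False, rivalrous=False),
}

for name, good_type in examples.items():
    print(f"{name}: {good_type.value}")
```

Where a given data resource falls in this grid is itself open to debate, since scarcity in digital resources may lie in trust and capacity rather than in the data itself.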

However, definitions of the commons — and how they should be governed — are not static. Time and technological change have forced us to rethink our notions. In recent years we have seen increasing references to the need for “data commons” that could help preserve the public interest potential of large common data resources. Data commons itself has gone through several iterations, from the early days of the Internet (and Wikipedia) to the birth of the open data movement and growing concerns over privacy and extraction. The latest technology to disrupt pre-existing notions of the data commons is AI, especially generative AI.

As always, disruption offers both challenges and opportunities. The task confronting policymakers, civil society, technologists, and other stakeholders today is to enable the sustainability of existing data commons and the creation of new ones in a way that maximizes the societal value of the data resources (including by leveraging them for generative AI in a responsible way) while minimizing the risks (and the tragedy of the data commons).

The challenges posed by AI to data commons are multifaceted and intertwined. They encompass issues related to governance, data quality, accessibility, and potential misuse. Two in particular stand out:

Traditional models of data commons and open data are being stretched and tested by generative AI’s propensity to rapidly parse, learn from, and exploit shared data. This leads to renewed concerns over a “tragedy of the commons,” in which data resources may become extracted or enclosed by just a few.
In addition, the sheer volume of data — often unstructured and of questionable quality or relevance — also poses challenges. Without adequate safeguards, there is a very real risk that the data commons could be transformed into a “data graveyard” — vast repositories of unused and unusable data.

The flip-side of risk is often opportunity. The challenges offer a chance to reimagine and restructure the data commons, in ways that could harness the power of generative AI and lead to new possibilities for the public good. Two opportunities are important to highlight:

The disruptions being forced by generative AI offer a chance for reinvention. In particular, the moment is ripe for innovative governance frameworks, new sustainable business models, and the deployment of advanced technologies to ensure better data security and privacy.
In particular, the aggregation of diverse data sources into training datasets prompts a reimagining of the “data commons,” where disparate data sets are collectively viewed as a unified, extensive resource due to their integration in AI training.
The time is also right for a new approach that could enhance the inclusivity and equity of data and the data ecology, preventing data asymmetries from becoming larger and more problematic.
The need for reinvention could result in new frameworks in which data no longer simply serves the interests of a few but instead contributes to the greater public good, empowering researchers, policymakers, and marginalized communities.

In short, we find ourselves today at a crossroads, one which obliges us to confront and update existing notions of what the data commons is and how it should be governed. For all these challenges, there is also a chance to reinvent our conception of the data commons, in the process defusing some of our most contentious battles over data rights and usage, and helping ensure that shared digital resources serve as a wellspring of innovation and broad, equitably distributed social benefit.

In recent months, these challenges and opportunities have garnered a lot more attention from a variety of stakeholders eager to sustain responsible access to data for the greater good. Of particular significance is the recent white paper by Open Future, which outlines a commons-based approach for managing AI training data sets as public goods. It presents six principles designed to ensure that these data sets generate public value and resist commercial exploitation. They aim to balance openness with necessary restrictions, maintain transparency and documentation, respect the rights of data subjects and creators, protect the data commons, ensure data set quality, and involve trusted institutions and community engagement in governance.

However, while the contributions of Open Future and other initiatives like the AI and Data Commons project by the ITU have laid a robust groundwork, the journey towards fully addressing the challenges and opportunities of data commons is far from complete. Further exploration needs to involve a more structured approach that can diligently identify and address the nuanced complexities and unresolved issues that continue to pose challenges.

In this spirit, we propose a framework that scopes the challenges and pathways that need further exploration. The framework sets the stage for a series of studios that can delve deeper into what threats to data commons need to be resolved (and how) and how to position, structure, and govern data commons so they become part of the solution to minimize data asymmetries and increase access to data to support and enhance public interest decisions (through generative AI).

A 10-Part Framework to Explore Next Generation Issues of Data Commons

What are the unresolved issues that require thoughtful consideration and innovative solutions? What aspects can enhance the success and resilience of data commons, ensuring they are sustainable, effective, and aligned with the evolving needs of an AI-driven world? Below are the issues that need to be resolved to achieve the vision of data commons that are accessible, equitable, and sustainable, and that prevent data misuse and extraction while fostering a culture of productive data re-use toward the public good.

1. Update Governance Frameworks

One key area to consider is how to adapt existing governance frameworks around data commons to better fit today’s digital realities. This raises the question: How can we update Ostrom’s governance principles, originally designed for managing physical commons, to effectively govern digital data? When applying Ostrom’s work to digital data, it might be important to broaden the concept of communal governance to encompass various forms of collective and democratic governance, as digital communities are often more difficult to define. Part of that exploration will involve defining the roles of data stewards in managing (or “stewarding”) a 21st century data commons. It is essential to explore and possibly rework their responsibilities and authority, as well as the oversight mechanisms necessary for data stewards to be able to effectively safeguard communal data resources.

Additionally, we need new frameworks for data reuse and sharing. Current methods are often inadequate and antiquated, failing to reflect and respect the preferences, expectations, and norms of different communities. A new governance framework that explores the potential (as well as challenges) of social licenses will be essential. It is also vital that this framework move toward enhanced digital self-determination. Prioritizing social licenses can foster trust around the reuse of private data for public good and ensure data-driven organizations consider the needs of communities as well as those of individuals.

2. Equitable Access Mechanisms

The challenge of our era is not simply one of access to data but of more equitable access to data and the advanced algorithms and methods that use it. In our current data ecosystem, powerful datasets and tools are often held in silos. This is a problem as data access is fundamentally linked to economic opportunity, improved governance, better science, and citizen empowerment. To meaningfully act on data’s potential, we must ensure that the supply of data can be matched with the demand.

Creating mechanisms for more equitable access is a complex process that involves both technology and society (as well as politics and culture). We must keep in mind that questions surrounding equitable access to data occur within a broader socio-economic context and broader concerns about equity and inclusion. Therefore, even as we consider specific technical or process-based interventions (e.g. tiered access systems that differentiate among different uses, including commercial vs non-commercial, and user groups) there is also a need to build coalitions of stakeholders and different interest groups beyond these specific interventions.
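
As a purely illustrative sketch of the tiered access idea mentioned above, the snippet below maps a requester's user group and intended use to an access tier. The tier names, user groups, and policy table are all hypothetical; in practice they would be set through the participatory processes a community chooses, not by a hard-coded table.

```python
from dataclasses import dataclass

# Hypothetical tiers, ordered from least to most access.
TIERS = ["metadata_only", "aggregated", "full"]


@dataclass(frozen=True)
class AccessRequest:
    user_group: str  # e.g. "community_member", "researcher", "company"
    purpose: str     # "non_commercial" or "commercial"


# Hypothetical policy table: (user group, purpose) -> tier.
POLICY = {
    ("community_member", "non_commercial"): "full",
    ("researcher", "non_commercial"): "aggregated",
    ("company", "non_commercial"): "aggregated",
    ("company", "commercial"): "metadata_only",
}


def access_tier(request: AccessRequest) -> str:
    """Return the tier for a request, defaulting to the most restrictive one."""
    return POLICY.get((request.user_group, request.purpose), TIERS[0])


print(access_tier(AccessRequest("researcher", "non_commercial")))
print(access_tier(AccessRequest("unknown_group", "commercial")))
```

Defaulting to the most restrictive tier for unrecognized combinations is one possible design choice; a community could just as well route such requests to a data steward for review.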

Community engagement and other participatory methods are imperative. Policymakers and other stakeholders can consider the potential of participatory decision-making, for example through community workshops, outreach programs, citizen assemblies, and other mechanisms and bodies.

3. Enhance Data Reusability and Quality

Data reusability is critical for any data commons and for addressing larger data asymmetries. The extent to which data is reusable is linked to its quality and accuracy as well as to the existence of standards that promote interoperability and shareability. Technical approaches to these issues must be embedded within a broader culture and a set of norms that encourage responsible sharing and reuse among all stakeholders.

A variety of quality assurance mechanisms exist, involving both technical and human approaches from peer reviews to automated quality checks. It is essential that such steps be taken throughout the data lifecycle, helping to ensure transparency and trust among all participants in the data ecosystem.
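
To make the notion of an automated quality check concrete, here is a minimal sketch that counts incomplete and duplicate records in a small dataset. The function name `quality_report`, the field names, and the sample records are hypothetical, and real checks would typically cover far more (formats, ranges, provenance) alongside human review.

```python
def quality_report(records, required_fields):
    """Return simple completeness and duplication counts for a list of dict records."""
    incomplete = sum(
        1
        for record in records
        if any(record.get(field) in (None, "") for field in required_fields)
    )

    # A record counts as a duplicate if its required fields repeat an earlier record's.
    seen = set()
    duplicates = 0
    for record in records:
        key = tuple(record.get(field) for field in required_fields)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)

    return {"total": len(records), "incomplete": incomplete, "duplicates": duplicates}


# Hypothetical sample records for illustration only.
sample = [
    {"station": "A", "date": "2024-05-01", "pm25": 12.0},
    {"station": "A", "date": "2024-05-01", "pm25": 12.0},
    {"station": "B", "date": "", "pm25": 9.5},
]

print(quality_report(sample, ["station", "date"]))
```

Running such a report at each stage of the data lifecycle, and publishing the results, is one way a commons could support the transparency described above.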

4. Prevent a Tragedy of the Data Commons

Long-standing concerns over a tragedy of the data commons, marked by data exploitation and extraction, offer a cautionary parable for efforts to promote data commons in the AI era. The central challenge is going beyond existing notions of data as public and private goods (and exploring Ostrom’s aforementioned classification of club goods and common-pool resources). It is about ensuring that public access to data does not result in private (over-)exploitation. Perhaps our notions of responsible sharing need to be stretched so as to include not just protections for privacy and security but also an awareness of data as a finite (and precious) resource that needs to be nurtured and cared for even as it is harnessed and utilized.

Although data asymmetries pose unique challenges, these concerns are not new. A large body of theoretical and evidence-based frameworks around information asymmetries does exist, in the field of economics among others, and we should consider how this existing body of literature and research can be adapted to the current landscape.

5. Encourage Active Participation and Contribution

As noted above, participation and inclusion are key to healthy commons. Many of the steps outlined in this document are aimed at these goals. It is important to emphasize that merely creating avenues for participation does not necessarily lead to active participation and contributions. We therefore need to think beyond just formal mechanisms and toward a culture that proactively fosters inclusion and participation. We also need to take stock of recent innovations in engagement and assess their relevance to inclusive governance approaches for data commons.

Two key steps in this regard are raising awareness and fostering a sense of community among potential participants. Outreach, education, capacity building, as well as providing networking opportunities for community members to connect and collaborate can be considered. Overall, cultivating a vibrant and engaged community is key to sustaining a vital data commons ecosystem.

6. Navigate the Particular Impacts of Generative AI

Like all technological innovations, generative AI creates its own landscape and ecology, marked by specific (and often new) risks and opportunities. To navigate this landscape, we need to implement AI-specific governance measures: tailored governance frameworks that borrow and learn from existing models, but that also adapt them to evolving frontiers. Questions remain about the extent to which AI will require altogether new ethical and policy frameworks (e.g., for data reuse) or whether existing data governance frameworks can be built upon. In the context of AI, data governance is also intricately linked with AI model governance, highlighting the interconnected management of data and the AI models that utilize it. Similarly, we need to explore the potential of (and need for) new institutional structures that are adapted for the age of AI.

7. Rethink Data Commons Architecture

Building on the preceding point about new institutions and policy frameworks, it may be helpful to rethink structures or architectures for the data age. In particular, decentralized and federated models have shown promise for data sharing, and can play a valuable role in reducing dependency on central authorities. Distributed (or federated) control can also help increase equitable access to the data commons (and again builds upon Ostrom’s work on polycentrism).

When considering federated or decentralized governance models, we should be thinking about both policy (or institutions) and the role of technology itself. Robust, scalable data commons infrastructure can support dynamic and complex data ecosystems. However, viable funding models are necessary to encourage investment in such infrastructure.

8. Promote Sustainable Business Models

The history of efforts promoting public access to resources is littered with examples of well-intentioned but ultimately unsustainable initiatives. Any effort to promote an AI data commons must therefore place sustainability at the heart of its architecture, and consider what kinds of funding models or sources (such as membership fees, license and re-use fees, public-private partnerships, crowdfunding, and endowments) might best enable the broader goals of inclusion, participation, and responsible reuse and sharing.

In addition to considering funding models, it is also helpful to establish mechanisms and processes that clearly demonstrate value. Articulating — and, more important, showing — the case for high-quality, reusable data is one of the most effective ways to attract investment and financial support.

9. Foster a Culture of Data Sharing and Use

Promoting a data commons requires a complex interplay of technical and non-technical interventions. Alongside the various technical interventions we can imagine, a broader culture of sharing and reuse within organizations and also in society at large can transform how organizations operate. Education and advocacy are key; all stakeholders, including but not limited to policymakers, should be aware of the benefits of data sharing and the importance of a data commons (as well as the need to avoid a tragedy of the commons).

Various mechanisms can be considered. A repository of case studies and successful examples can help showcase success stories and inform stakeholders. Case studies can also help highlight lessons learned (both positive and negative) and demonstrate value to investors and funders. Other steps that can help foster a broader culture of sharing include community outreach, engaging civil society, and creating partnerships among the private and public sectors (where the vast majority of data resides) and the broader community of researchers and citizen groups.

10. Become Data Driven about the Data Commons

Finally, we need to harness data itself in the quest to establish a data commons. Better use of data can make us more accountable and responsive, and can introduce feedback loops and adaptation mechanisms that can help make data commons initiatives more inclusive and better suited to tackle wicked problems. Data can help us iterate and adopt agile approaches to creating new and responsive instances of the data commons.

Case studies, examples and models, and other forms of research generate their own forms of (perhaps more qualitative) data. These types of repositories are essential in order to understand best practices for data commons management, governance, and technology. What we need is a broader culture of research and enquiry into data use, reuse and sharing.

***

While this framework offers some possibilities, there are still many questions left to answer, questions that might drive an agenda for exploring data commons in the era of generative AI. The following questions are drawn from the above discussion, and are deliberately open-ended. Our goal is not to prescribe or pre-judge solutions, but to provide a framework for further investigation.

What new governance mechanisms and institutions are necessary to promote a data commons in the era of generative AI?
To what extent can existing governance frameworks be updated, or do we need to begin with a clean slate?
How best to prevent a tragedy of the data commons, one that would be marked by private over-exploitation of public data resources? What lessons (if any) does the existing literature and research on preventing a tragedy of the commons offer us?
How can the specific practices involved in AI development, such as fine-tuning, which are well-understood by developers but not widely discussed, be better understood and incorporated into the design of data commons for AI, considering the current lack of clarity surrounding many aspects of AI training and model development?
How can we promote equity and inclusion in efforts to create a data commons, and how can data be part of the solution (rather than the problem) to broader patterns of inequity and exclusion?
What models are best suited to promote sustainability of data commons initiatives?
How can we best demonstrate the value of data sharing and reuse to funders, policymakers, and (in order to build trust) citizens in general? In particular, what role does data itself play in achieving this goal?
What examples and case studies of data commons (either successes or failures) now exist, and how best can we showcase these to build on existing experiences and lessons learned?
How best to build awareness about the potential (and challenges) of data commons approaches, and what different forms of outreach or education might be necessary for different stakeholders (policymakers, private sector, researchers, citizens, etc.)?

These represent a small sampling of the questions that could be asked, and need to be asked. In the coming weeks, The GovLab’s Open Data Policy Lab will be hosting a studio series to dive deep into these questions and identify key issues and pathways to accelerating the data commons movement in the generative AI era.

We invite you to submit additional questions for consideration, or your answers to these questions by emailing us at [email protected].

Footnotes

[1] Data commons in this framework should not be mistaken for the graph system developed by Google under the banner of Data Commons Dot Org, which utilizes natural language processing to facilitate conversational data access but relies heavily on the availability or establishment of other data commons. See: https://datacommons.org/.

About the authors

Stefaan G. Verhulst is the co-founder of The Governance Lab and The DataTank, and editor-in-chief of Data & Policy

Hannah Chafetz and Andrew Zahuranec are research fellows at The Governance Lab

The authors would like to thank Gretchen Deo and Alek Tarkowski for their comments and suggestions on earlier versions.

***

This is the blog for Data & Policy (cambridge.org/dap), a peer-reviewed open access journal published by Cambridge University Press in association with the Data for Policy Community Interest Company. Read on for ways to contribute to Data & Policy.
