Blog Post
10 Data Commons for Cultural Knowledge and Preservation
Posted on 24th of November 2025 by Hannah Chafetz, Andrew Zahuranec, Stefaan Verhulst
Artificial Intelligence is opening unprecedented possibilities for how cultural heritage can be preserved, revitalized, and expressed—from revealing forgotten histories to enabling multilingual access and keeping endangered cultural practices alive. Yet realizing this potential requires access to rich and representative cultural data. This creates a unique tension: cultural data must be available enough to enable inclusive and culturally aware AI, while also being safeguarded against extractive use, misuse, and erasure.
Around the world, new cultural data commons are emerging to navigate this tension. These initiatives—spanning museum archives, public-domain literature, 3D scans of heritage sites, and contemporary artistic works—demonstrate how cultural data can be made accessible in ways that are collaborative, respectful, and aligned with community expectations. They show that access and protection are not opposing goals, but co-requirements for culturally responsible AI. In this blog we highlight 10 compelling cultural data commons. These initiatives are exemplary, not exhaustive, and are listed in alphabetical order. We then outline several commonalities across these examples and reflections on how they are designed and put into practice. We conclude with pathways for future research and initiatives.
The blog is part of a larger initiative to examine and illustrate how data commons (collaboratively governed data ecosystems) can enable responsible AI development. To complement our blueprint for using data commons and recent innovation challenge, we collected over 70 data commons from around the world that support AI in different ways. Among them, we found several initiatives that are supplying AI-ready cultural data to help developers adapt AI models for different cultures and contexts. These data commons range from museum archives to contemporary artworks to historical literature.
Data Commons Examples
Initiative | Country | Funding model | Data Types | Data access model |
AI4Culture is a public platform that offers AI tools and open datasets for training AI. Components on the platform are interoperable with the common European data space for cultural heritage. Some of the open datasets include verified translations, transcriptions of scanned handwritten documents, European artwork classification data, and 950,000 hours of speech data. The AI4Culture platform accepts contributions of datasets for review. The platform's tools and data can be used for AI-generated translations of cultural heritage metadata, multilingual subtitle generation, and multilingual text recognition in scanned documents. | European Union | Funded by the European Union through its Digital Europe Programme | Image data, translations, transcriptions, and speech data | Available through its online platform under a CC0 license |
Common Corpus is one of the largest public-domain datasets for LLM training, coordinated by Pleias (a technology company) in collaboration with Hugging Face, Occiglot, Eleuther, and Nomic AI. Its “Open Culture” dataset includes public domain books and newspapers in several languages from national libraries and archives, along with other sources. | Global | “The corpus was stored and processed with the generous support of the AI Alliance, Jean Zay (Eviden, Idris), Nvidia Inception program, Nebius AI and Tracto AI. It was built up with the support and concerted efforts of the state start-up LANGU:IA (start-up d’Etat), supported by the French Ministry of Culture and DINUM, as part of the prefiguration of the service offering of the Alliance for Language technologies EDIC (ALT-EDIC).” | Books, newspapers, other texts | Open access, available on Hugging Face |
European Books Data Commons is a proposal for a new initiative that builds on work done by Open Future, Proteus Strategies, and Creative Commons and recent convenings by Open Future and the Europeana Foundation. This initiative seeks to bring together and digitize public domain books across Europe’s libraries. Among its goals, it aims to develop a repository of books that can be used for AI training. This initiative has not yet launched. | Europe | The data commons will operate as an independent initiative that sits within Europeana and the common European Data Space for Cultural Heritage | Digitized books from European libraries | The data will be housed on an online storage system. The team is looking to harness aspects of the Institutional Data Initiative’s infrastructure given its similar focus. |
MetaBelgica is a shared management infrastructure maintained by federal scientific institutes in Belgium. It is seeking to develop a “Linked Data platform of Belgian entities [...] in the cultural heritage sector.” The data will be accessible via a shared platform with persistent identifiers, with the aim of providing trustworthy, high-quality reference data. This initiative has not yet fully launched. | Belgium | MetaBelgica is financed by the INFRA-RED call for proposals from Belspo, the government body in Belgium responsible for research policy. | The platform will provide cultural data from institutions such as Belgium’s Archives and Museum of Literature, the Study and Documentation Centre for War and Contemporary Society, the Belgian State Archives, the Dutch Heritage Network, FID Benelux, Belgium’s national library, Meta4Books, the Museum of Industry Ghent, the National Library of Germany, the National Library of Luxembourg, and many more. | “Research data and documents will be published with a DOI on our Zenodo community (in addition to the deposit on the Belgian repositories Orfeo and SODHA)”; “Code and technical specifications will be published via our GitHub organization” |
Open Heritage makes cultural heritage data available for non-commercial uses, such as research and academic purposes. Open Heritage operates as a consortium and is led by the Cultural Heritage Engineering Initiative (CHEI) at Qualcomm Institute and several other institutions. The Open Heritage Alliance, which consists of several institutions, makes decisions on matters ranging from how to disseminate 3D data to ethical implications. | Global | Open Heritage is supported by Western Digital, Google Arts and Culture, Kacyra Family Foundation, National Endowment for the Humanities, and Dropbox | 3D datasets of cultural heritage locations globally. Data types include: “LiDAR - Terrestrial, LiDAR - Aerial, Photogrammetry - Terrestrial, Photogrammetry - Aerial and Short Range Scans.” | Datasets are available for download through the Open Heritage website. The datasets are licensed using Creative Commons licensing, but the terms vary depending on specifications set by the data supplier. |
The common European data space for cultural heritage is the European Union’s leading data space to increase public participation in preserving Europe’s cultural heritage and promoting digital culture as a public good. It offers access to high-quality data, particularly 3D data, with a goal of promoting responsible, public reuse. | European Union | The data space is funded by a variety of EU projects, including DE-BIAS, EUreka3D, 5DCulture, and AI4Culture. It seeks to facilitate the implementation of the European data strategy. | The data space hosts 61,000,000 items sourced from 3,500 data providers. The focus is on items of cultural heritage, such as artworks and historical artifacts. | Hosted on the European Union’s Europeana.eu web portal for free access. |
The HathiTrust Research Center at Indiana University and the University of Illinois Urbana-Champaign provides access to its digital library of books for research purposes. Through its HathiTrust Digital Library, it provides access to several data analytics tools. The initiative has been criticized in the past for copyright infringement. | United States | Has received support from the National Endowment for the Humanities (NEH) and the Institute of Museum and Library Services (IMLS); also receives revenue from HathiTrust member institutions through annual fees | Books | Available through its Digital Library, with two tiers of access: members have full access to the library, while non-members have limited access |
The Institutional Data Initiative (IDI), an initiative at Harvard Law School Library, published a dataset of almost one million public domain books for AI training. The digitization of these books began in 2006 as part of the Google Books Project. Of the 1,075,000 books that were scanned for this project, 983,000 books are public domain and are published as the Institutional Books 1.0 dataset. This dataset, supported by Microsoft and OpenAI, can be used for training or evaluating LLMs, especially for multi-language processing or tasks that may involve historical language. With the release of the dataset, the team published a report documenting their data collection and processing methods. This initiative has not yet fully launched. | United States | Funding donated from OpenAI and Microsoft | “One million public domain books, scanned at Harvard Library; [...] millions of pages from hard-to-find historical newspapers [from Boston Public Library]” | Public access through Hugging Face |
TRANSFER Data Trust is a data cooperative for cultural and media preservation. The cooperative includes contemporary art and allows artists to preserve their work without institutions. The cooperative is owned by its members. Members are provided the opportunity to contribute to cooperative decisions through consultations. | North America and Europe | Initial fiscal sponsorship from Gray Area; funded by the Knight Art + Tech Expansion Fund, Filecoin Foundation for the Decentralized Web (FFDW), and Filecoin Foundation in June 2024; received additional funding from GSR Foundation in July 2025; through Open Collective, accepts donations and fees from those who wish to become a founding member or sponsor | Contemporary artwork from members, digital artwork | Data is accessed through “the stack”; version 1.0 is available only to its network (currently a proof of concept in private beta; the next announcement, planned for September 2025, is still to be announced) |
Wikimedia Commons GLAM is a data commons that includes data from cultural institutions around the world. It includes a range of cultural data sources along with their metadata. To secure the data, the Wikimedia Commons team sets up content partnerships with cultural institutions. As with Wikipedia, volunteers play a critical role in the commons. They help with several tasks, including outreach and maintenance. | Global | Wikimedia Commons GLAM is an initiative within Wikimedia Commons and part of GLAM-WIKI. Wikimedia Commons was initiated by the Wikimedia Foundation. Wikimedia Commons received an initial grant from the Alfred P. Sloan Foundation. | Cultural data from GLAM institutions including images, videos, sounds, and 3D models. | Wikimedia Commons GLAM makes data available using MediaWiki (Wikipedia’s software) and includes structured data following Wikidata’s approach. This addition of structured data was applied in part to support attribution by secondary users. |
Common Qualities
The examples above demonstrate several patterns in how cultural data commons are designed and implemented. These include:
Purpose
Initiatives such as the common European data space, AI4Culture, MetaBelgica, and the TRANSFER Data Trust are increasing access to cultural heritage data for preservation and research. These initiatives underscore the importance of digitizing cultural artifacts and of sharing these digital assets with cultural heritage communities to minimize the duplication of efforts. Wikimedia Commons GLAM emphasizes the importance of supporting cultural heritage institutions, researchers, and those contributing to the platform. Similarly, Open Heritage seeks to support research and education and specifies that the data can only be used for non-commercial purposes.
Other initiatives—the Institutional Data Initiative, European Books Data Commons, and the Common Corpus—aim to broaden access to cultural knowledge through the digitization of books, newspapers, and educational texts. These efforts cover several topics and materials.
Data Types
These data commons provide access to a range of cultural assets, from newspapers and artworks to audio recordings. Initiatives such as the common European data space and MetaBelgica are making cultural artifacts available, while the TRANSFER Data Trust focuses on contemporary artworks, enabling artists to preserve and share their work without relying on institutions.
Other efforts concentrate on cultural texts such as newspapers, books, and historical documents. For example, the HathiTrust Research Center offers access to an extensive corpus of books, ranging from historical volumes to works of literature in multiple languages.
Additionally, projects such as Open Heritage and Wikimedia Commons GLAM include 3D data. Open Heritage supplies 3D data from cultural heritage sites—including data collected from LiDAR sensors.
Funding Models
Setting up and maintaining a data commons for the cultural sector requires substantial investment. In Open Future’s recent publication, Outline for a European Books Data Commons, the team explains that “estimated annual operating costs [will be] between €500k and €750k” for their proposed books data commons, which underscores the importance of a funding model that is sustainable over time.
The funding models behind these initiatives echo those identified in previous analyses. Projects such as AI4Culture, MetaBelgica, and the common European data space are supported by government funding. Others rely on grants from private companies and philanthropic organizations. For example, Harvard’s Institutional Data Initiative is funded by Microsoft and OpenAI; the Common Corpus has received support from several institutions, including the Nvidia Inception program; and the TRANSFER Data Trust is backed by the Knight Art + Tech Expansion Fund. The TRANSFER Data Trust also supplements its funding with public donations. In contrast, the HathiTrust Research Center collects membership fees.
Data Access Models
AI4Culture, the HathiTrust Research Center, and the TRANSFER Data Trust have built their own platforms to facilitate access to cultural data with clear licenses. These projects illustrate the potential of data commons to provide not only secure access but also trusted environments where researchers, cultural heritage specialists, and others can process and analyze data responsibly.
Open Heritage’s data is available for download from its website. Much like other data commons, it uses Creative Commons licensing but allows each data supplier to select the type of license for its dataset. Each dataset is also accompanied by a DOI to help streamline the citation process.
Other initiatives are applying different strategies to support attribution. Wikimedia Commons GLAM, for instance, includes structured data elements that make it easier to attribute the data to the contributing institution.
Some initiatives, such as the HathiTrust Research Center, use tiered access models: members receive full access to the complete data corpus, while non-members have more limited access.
Other projects, including Common Corpus, leverage existing infrastructure, making their datasets available through commonly used platforms like GitHub and Hugging Face.
What emerges is not a uniform model but a shared orientation: data access structures that seek to balance openness with protection, and community agency with the practical needs of cultural preservation and technological innovation.
Governance and Operational Models
Across the examples, a distinctive governance landscape begins to take shape—one characterized far less by centralized authority than by distributed, collaborative arrangements.
Several cultural data commons are governed, managed and maintained in collaboration with partner institutions. These partners serve not only as data contributors but also as active participants in governance and decision-making. For instance, the common European data space is operated by the Europeana Foundation, which is composed of 19 partner institutions. MetaBelgica has established a “Follow-Up Committee” that brings together organizations such as the Dutch Heritage Network and Wikimedia Belgium to provide subject-matter expertise throughout implementation. Open Heritage is managed by a consortium of institutions.
Other initiatives rely upon robust community engagement and networks of volunteers. Wikimedia Commons GLAM, for instance, sets up data supply partnerships with GLAM institutions and harnesses its volunteers for maintenance needs and other tasks.
Reflections
As the examples above illustrate, data commons have the potential to transform how cultural knowledge is preserved and to make communities visible within AI applications. Alongside this potential, however, it is equally important to consider when it is appropriate to make community data available and when it should not be included in AI systems. Establishing a system to maintain a social license is critical not only for understanding community expectations, but also for actively involving affected groups in decision-making processes.
Below we provide a set of illustrative questions to guide future research. As interest in cultural data commons continues to grow, we hope these questions serve as jumping-off points for deeper exploration:
How can the governance of cultural data commons reflect and align with community values and expectations?
What strategies and mechanisms can cultural data commons adopt to establish and maintain the necessary social license to operate? In what ways can and do cultural data commons negotiate community expectations around what should or should not be made accessible for AI use?
How can data commons organizers ensure that the communities they represent meaningfully benefit from the resulting AI?
How can data commons organizers increase the use of cultural data commons in AI applications while reducing dependence on copyrighted data?
How can cultural institutions collaborate to create collective data commons and minimize duplication of effort?
How are data commons defining boundaries around the use of cultural data in AI (e.g., deepfakes, generative remixing, derivative works)?
***
Have questions or an interest in collaborating? Reach out to us at [email protected].