By Bee Lehman and Tom Van Nuenen
Over the first half of the 2020s, the landscape for researching almost any component of European studies changed dramatically. Among the larger events, OpenAI released ChatGPT in late 2022. Then, in 2023, a ransomware attack on the British Library shut down that repository for almost a year.[1] In addition, in 2024 the US federal government began to remove large swaths of data from publicly accessible interfaces, raising questions around the world about the long-term accessibility of Open Access resources. Long-standing issues of access, preservation, and digitization have become acute as governments restrict data, repositories falter under attack, and generative AI both depends on and obscures what remains. These crises expose a collapse long in motion: the basic conditions of scholarly research are now visibly insecure.
For repositories—libraries, archives, museums, and similar facilities—these questions are especially pressing. Academic libraries, particularly in the United States, have had to reevaluate the entire landscape and their role within it. The difficulty is that information professionals must now navigate both traditional library science and rapidly evolving digital technologies that automate and obscure the process of digitization. This article explores what gets digitized, how digital content is preserved, and how analysis takes shape in our heavily digital age. It concludes by examining what digitization and preservation mean for chatbots and generative AI, even when reasoning or “deep research” approaches are used.
Digitized and Digital
An impressive array of European government and heritage material now exists online. This material—born-digital or digitized—gives scholars around the world unprecedented opportunities to engage with European studies online, which is especially valuable in an era when many scholars face travel or funding constraints. However, just as with print and physical environments, scholars must understand the scope of what exists in a digital environment in order to remain conscious of inherent biases and interpretative limitations in their research.
Access has long been an issue in archival scholarship. Archives, libraries, museums, and other knowledge repositories committed to preserving and making documentation available are subject to limitations in what they can and will preserve. Space, funding, and organization lead to gaps in documentation, particularly for community records and data pertaining to marginalized groups. In consequence, scholars have long been aware that what is preserved and available often presents a top-down, elitist view of the past. Moreover, most scholars do not have the funding, time, or physical ability to travel across the European continent to access the full range of materials relevant to their research. While physical access is indeed complicated by geography, parallel challenges persist in the digital world. Digital repositories—online databases and platforms where institutions store and share their digitized collections—typically operate independently of one another. Each national library, university archive, museum, and heritage organization maintains its own repository with its own search interface, organization system, and access requirements. This dispersion means that—as in the physical world—scholars often have to visit multiple digital spaces and learn new discovery systems in order to find relevant material.
To address this fragmentation, several repositories have “bought in” to aggregating platforms. Among others, Europeana (https://www.europeana.eu/), HathiTrust (https://www.hathitrust.org), and the Internet Archive (https://archive.org/), as well as Wikimedia Commons (https://commons.wikimedia.org), serve as central hubs where individual institutions can share books, images, and other materials they own. These aggregators help scholars navigate and discover relevant content from across and about the continent without needing to know which specific institution holds which materials.[2] Yet these aggregators can only serve resources that exist in digital format. Perhaps unsurprisingly, digitization and preservation priorities reflect existing privilege structures, determining whose voices are considered worthy of digital preservation. Before 1990, the majority of materials focused on privileged elites and state institutions. The arrival of social media platforms has raised questions about whose voices are preserved and in what capacity, but the active preservation and retrieval of that information remain uncertain. Within European studies, this means that online access privileges English- and French-speaking states. Digitized materials perpetuate the colonial biases and power structures embedded in original collection practices.[3]
Furthermore, what gets digitized depends heavily on where the material is located and which government controls it. The British government, for example, has invested substantially in digitizing government and heritage material. In contrast, the Turkish government has not provided comparably extensive funding for digitizing or opening access to its state documents or cultural heritage material. In consequence, researchers seeking British material can find relevant digital materials easily, while scholars focused on the Ottoman Empire or Turkey frequently must visit the relevant countries and archives in person. This disparity means that digitally based research on Europe-wide or interstate conflicts often benefits from better access, albeit to limited perspectives that reinforce specific biases about the center and periphery of the continent.
The result is that digitized sources end up reinforcing structures of power already in place. While there is more content from marginalized and oppressed communities in repositories today than in the pre-internet world, the bulk of this material—particularly that held in large repositories—emanates from institutions that perpetuate ingrained cycles of social violence. The resources necessary to keep and preserve physical materials have long meant that socio-economic elites are usually the ones to establish and maintain archives, museums, and other repositories. Unsurprisingly, those elites were (and are) usually committed to supporting their own historical narratives, resulting in the preservation of records that both focus on society’s “upper” echelons and support their actions. Those limited records often lead to an imbalanced understanding of both the past and the present, suggesting the “non-existence” of marginalized communities and perpetuating historical erasure.[4] Thus, even as digitized and born-digital materials make a wider range of actors’ voices more accessible than twentieth-century scholars could have imagined, significant limitations endure within European studies, affecting what exists in the digital world in the first place, what people can easily find digitally, and how that material becomes available.
Preserving the Digital
Once material has reached digital format and is made available online, its longevity becomes a growing concern. The attack on the British Library underscored the problem. Not only did the attack take the entire interface offline for nearly a year, it also exposed the fragility of the infrastructure enabling access to some of the older digitized collections. Indeed, the material that scholars can access digitally, regardless of whether it was born-digital or digitized, needs an actively maintained infrastructure. Unlike physical material such as vellum books, which can be placed on a shelf or in an attic and ignored for centuries, digital material cannot survive without constant maintenance. This maintenance includes server space. Although data exists under the misleading metaphor of the “cloud,” it must physically reside on one or multiple servers.[5] To function, those servers need a team of people updating their hardware and supporting their networks. In addition, to make accessing the material possible, a team of humans must manage the systems designed to allow people to discover and retrieve content. In short, preserving digital resources demands continuous, active maintenance. Maintaining online access to digital collections requires a workforce actively supporting the relevant servers through constant checks, data migrations, and updates.[6] Funding those updates and paying the necessary staff is often prohibitive.
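What those “constant checks” mean in practice can be made concrete. The Python sketch below shows a minimal fixity check of the kind preservation teams run routinely: each file’s checksum is recomputed and compared against a stored manifest so that silent corruption or a failed migration is caught early. The manifest format and file paths are assumptions for illustration, not a description of any particular repository’s workflow.

```python
import hashlib
from pathlib import Path

def sha256_checksum(path: Path) -> str:
    """Compute a SHA-256 fixity checksum for one file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_collection(manifest: dict[str, str], root: Path) -> list[str]:
    """Return the files whose current checksum no longer matches the manifest."""
    failures = []
    for relative_path, expected in manifest.items():
        if sha256_checksum(root / relative_path) != expected:
            failures.append(relative_path)
    return failures

# Hypothetical usage: a manifest maps relative file paths to checksums recorded
# at ingest; any mismatch flags a file for investigation or restoration.
# failures = verify_collection(manifest, Path("/data/collections"))
```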
Part of the issue is that computer scientists and librarians or archivists differ fundamentally in how they conceive of “forever.” For the former, five to ten years of consistent function represents success.[7] In practice, the people building the infrastructure often focus on the questions “does it work now?” and “could it function in the future with support?” But hardware degrades over time, and software must be patched or updated. Neither is stable, because each depends on multiple layers of technology to function.[8] For the archivist, in contrast, the goal is for the material to last for centuries, or “in perpetuity.”[9] As a measure against some of these issues, most librarians and archivists support some degree of object redundancy. For example, in early 2025 the Bulgarian Archives State Agency’s website temporarily went offline, its material no longer directly accessible to scholars over the internet. Nonetheless, low-resolution material remained available through Europeana, and higher-resolution versions through Wikimedia Commons. Such redundancy meant that, even though direct access to the Bulgarian institution’s high-quality digitized collections was impossible, the collections remained available elsewhere.
In short, to promote digital preservation, scholars have to consider a series of minute details: robust server space, excellent metadata, redundancy in the form of backups, and so on. The recent US government removal of significant chunks of data from its public-facing repositories underscored the vulnerability of digital preservation. Without physical copies or clear offline and online redundancy, information can be permanently lost. Furthermore, digitizing heritage material is only one step in the process. Once physical material is digitized or born-digital material collected, it needs to be processed to allow for access and discovery. Without proper metadata and indexing, even digitized materials remain effectively invisible to the researchers who need them.
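To give a sense of what “proper metadata” involves, the sketch below shows a minimal descriptive record using Dublin Core-style fields in Python. The field names follow the Dublin Core element set; the values and the identifier scheme are invented for illustration.

```python
# A minimal, hypothetical descriptive record with Dublin Core-style fields.
# The values and the identifier scheme are invented examples.
record = {
    "title": "Letter from a provincial archive, 1872",
    "creator": "Unknown",
    "date": "1872",
    "language": "bg",
    "format": "image/tiff",
    "identifier": "example-archive:ms-0042",  # hypothetical persistent identifier
    "rights": "Public domain",
}

# Without at least this level of description, a digitized scan is effectively
# undiscoverable: search interfaces index metadata, not pixels.
```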
Digital Analysis
For programmers and business professionals, the emergence of artificial intelligence (AI) tools has opened unprecedented possibilities for development and analysis. For the social sciences and humanities, however, the availability of AI tools creates significant pitfalls alongside new opportunities. On the one hand, AI tools offer fascinating possibilities for document analysis and for equity of access. On the other hand, they compound faulty assumptions about accuracy and inclusiveness of access. Given the issues discussed above, for researchers in almost any field doing deep analysis, generative AI (GenAI) tools such as ChatGPT, Claude, and Gemini often cause as many problems as they solve. First, to use those tools well, scholars need to take the time to learn some form of prompt engineering—a skill that most scholars have never previously had to consider.[10] This involves defining clear success criteria, writing clear instructions, controlling and structuring output, providing examples, applying constraints, and assigning a role to the GenAI model.
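A brief illustration may help. The Python sketch below assembles a prompt containing the elements just listed—a role, clear instructions, an output constraint, and one worked example—using the system/user/assistant message convention common to chat-style interfaces. The document titles and wording are invented; this is a sketch of the practice, not a recommended template.

```python
# A sketch of basic prompt-engineering elements: role, instructions,
# an output constraint, and a single worked (few-shot) example.
messages = [
    {
        "role": "system",
        "content": (
            "You are a research assistant for a historian of European migration. "  # role
            "Summarize each document in at most three sentences and answer only "   # instructions
            "from the text provided; if the text does not contain the answer, "
            "say so explicitly."                                                     # constraint
        ),
    },
    # One worked example showing the expected output structure.
    {"role": "user", "content": "Document: [1923 consular report on emigration] ..."},
    {"role": "assistant", "content": "Summary: ... Key actors: ... Uncertainties: ..."},
    # The actual query.
    {"role": "user", "content": "Document: [transcribed 1907 parish register] ..."},
]
# `messages` would then be passed to whichever chat-style model the researcher uses.
```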
Furthermore, not only is there an increasing number of generative AI models to choose from, there are also many different parameters and options for querying those models. For example, even closed-source GenAI models typically offer opportunities for incorporating external documents through retrieval-augmented generation (RAG), allowing researchers to limit queries to specific journals or book collections.[11] Open-weight models, moreover, offer further possibilities through fine-tuning on specialized datasets. These approaches, however, mirror the field’s existing problems mentioned above: RAG systems can only retrieve from already digitized materials, while fine-tuning reproduces the biases present in training data.
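As an illustration, the Python sketch below shows the retrieval step of a RAG workflow: a researcher’s own small corpus is ranked against a query, and only the best-matching passage is passed to a generative model as context. TF-IDF similarity from scikit-learn stands in for the neural embeddings and vector databases that production systems typically use, and the corpus, query, and prompt wording are invented.

```python
# Minimal sketch of the retrieval step in RAG: rank a small corpus against a
# query and keep only the top match as context for a generative model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "Minutes of the 1919 peace delegation, discussing minority rights.",
    "An 1887 trade report on Danube shipping between Vienna and Budapest.",
    "Parish correspondence from 1872 concerning seasonal labour migration.",
]
query = "late nineteenth-century labour migration"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(corpus)
query_vector = vectorizer.transform([query])

# Rank documents by cosine similarity and keep the single best match.
scores = cosine_similarity(query_vector, doc_vectors)[0]
context = corpus[scores.argmax()]

prompt = f"Answer using only this source:\n{context}\n\nQuestion: {query}"
# `prompt` would then be sent to whichever generative model the researcher uses;
# note that retrieval can only surface what is already digitized and indexed.
```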
Even setting these problems aside, information access remains problematic. Despite many groups’ efforts, not all material has been digitized. Current GenAI systems work best with topics that appear in many sources across the training data—major historical events, canonical texts, well-documented figures.[12] But humanities research often centers on the exceptional rather than the typical: a single surviving letter, a unique manuscript, a rare document, a marginalized voice. These are the materials least likely to appear in training datasets—and even when they do, they rarely appear with sufficient frequency to be reliably reproduced or analyzed by the system. Consequently, because AI models generate statistically probable responses, they effectively produce consensus narratives. This is antithetical to research that seeks to uncover dissenting voices, offer alternative interpretations, and challenge dominant narratives. GenAI is oriented around an “average” historical interpretation, not a provocative new reading. It also risks foreclosing the kinds of “accidental discoveries” that are crucial to traditional archival research.
A related issue is that, when systems lack information, they will often generate plausible-sounding but false information. While these outputs are called “hallucinations,” they are not, in any real sense, different from the machine’s “successful” interpretations; both factually correct and incorrect responses are simply statistical predictions of likely word sequences based on training patterns.[13] The model has no mechanism for distinguishing truth from fiction; it only knows which patterns appear probable. Even training methods such as RLHF (reinforcement learning from human feedback), which rewards factual responses, or constitutional training, which explicitly discourages fabrication, are statistical adjustments to likelihood, not genuine verification. Unlike a library catalog, which simply returns “no results found” after checking against a database, GenAI systems produce fluent text regardless of knowledge gaps.
Even when information has been digitized, many generative AI programs face multiple hermeneutic barriers. First, these systems depend on optical character recognition (OCR) to process text, but OCR struggles with older materials (i.e., pre-1900) and non-standard Latin scripts containing annotations, flourishes, abbreviations, and other particularities. Photographs and visual materials present an even greater challenge. The result is that interpretive AI is biased toward recent materials written in standardized, computer-readable type. Second, the language of the training data creates another barrier. Current GenAI systems are mostly trained on English-language patterns, further marginalizing research on Eastern European, Iberian, or Nordic materials that might be digitized but remain underrepresented in AI training data. Multiple communities are working to develop large language models (LLMs) with non-English materials, but these efforts remain in early stages and face similar limitations regarding historical materials and non-standard scripts.[14] The modern, standardized, English-language text at which GenAI excels is precisely the opposite of what most European historical research requires. Third, GenAI systems treat all time periods as simultaneously present, lacking genuine understanding of historical change and context. They might anachronistically apply modern concepts to medieval texts, or miss how the meaning of terms evolved over centuries. For European studies, where understanding historical context is crucial, this is particularly problematic.
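To make the OCR dependency concrete, the Python sketch below runs a scanned page through pytesseract, a common wrapper around the Tesseract OCR engine. It assumes Tesseract and the relevant language packs are installed locally, and the file path is hypothetical; the point is that even choosing the right language model is nontrivial, and older scripts such as Fraktur or manuscript hands routinely defeat this kind of pipeline.

```python
# Minimal OCR sketch using pytesseract (a wrapper around the Tesseract engine).
# Assumes Tesseract and its language packs are installed; the path is hypothetical.
from PIL import Image
import pytesseract

page = Image.open("scans/1912_report_page_03.tif")  # hypothetical scan

# Specifying the language model matters: "deu" targets modern German, while a
# nineteenth-century Fraktur page would need a dedicated model, and even then
# the output usually requires substantial manual correction.
text = pytesseract.image_to_string(page, lang="deu")
print(text)
```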
A Call for Critical European Studies
While digital materials provide unprecedented access to information, scholars must carefully consider what they can access, for how long, and why. Current online availability offers no guarantee of permanence. The “forever” of physical archives has no digital equivalent. As digitization continues reshaping European studies research, scholars must remain critically aware of what gets digitized, how it is preserved, and how AI tools interact with these materials. The digital revolution has democratized access in many ways, but it has also introduced new forms of bias and vulnerability. Understanding these limitations is essential for conducting rigorous research in our digital age. Only by recognizing both the opportunities and constraints of digital scholarship can researchers navigate this evolving landscape effectively.
Bee Lehman is a literatures and digital humanities librarian at UC Berkeley, where they specialize in information literacy. They earned their MLIS from Simmons University in 2007 and PhD in history from UNC at Chapel Hill in 2017. Their research focuses on European migration, digital humanities, and travel literature.
Tom van Nuenen (PhD, Tilburg University) is a lecturer in advanced computing and senior data scientist at the University of California, Berkeley. His work bridges computational social science, AI ethics, and digital culture, with a focus on algorithmic bias and the use of computational methods for cultural analysis. He is the author of Traveling Through Video Games (Routledge, 2023) and Scripted Journeys (De Gruyter, 2021) and has published in IEEE Transactions on Knowledge and Data Engineering, ACM CSCW, and Tourist Studies.
[1] Learning Lessons from the Cyber-Attack: British Library Cyber Incident Review (British Library, 2024), 18, https://cdn.sanity.io/files/v5dwkion/production/99206a2d1e9f07b35712b78f7d75fbb09560c08d.pdf.
[2] Carlotta Capurro and Marta Severo, “Mapping European Digital Heritage Politics: An Empirical Study of Europeana as a Web-Based Network,” Heritage & Society (2023), 1–21, https://doi.org/10.1080/2159032X.2023.2266801.
[3] Gerben Zaagsma, “Digital History and the Politics of Digitization,” Digital Scholarship in the Humanities 38, no. 2 (2023): 830–51, https://doi.org/10.1093/llc/fqac050.
[4] For more on the history of the archive, see Markus Friedrich, The Birth of the Archive: A History of Knowledge (University of Michigan Press, 2018). For more on the importance of community archives, see Michelle Caswell et al., “‘To Suddenly Discover Yourself Existing’: Uncovering the Impact of Community Archives,” The American Archivist 79, no. 1 (June 2016): 56–81; Sarah Salter, “History, Activism, Erasure: Archival Paradox as Institutional Practice,” Journal of Feminist Scholarship 19, no. 19 (2021): 24–41.
[5] This is a gross oversimplification. For more, see Lisa Goddard and Dean Seeman, “Negotiating Sustainability: Building Digital Humanities Projects That Last,” in Doing More Digital Humanities: Open Approaches to Creation, Growth, and Development, ed. Constance Crompton and Richard J. Lane (Routledge, 2019).
[6] Lise Jaillant et al., “Introduction: Challenges and Prospects of Born-Digital and Digitized Archives in the Digital Humanities,” Archival Science 22, no. 3 (2022): 285–91, https://doi.org/10.1007/s10502-022-09396-1.
[7] Erin Baucom, “A Brief History of Digital Preservation,” in Digital Preservation in Libraries: Preparing for a Sustainable Future (An ALCTS Monograph), ed. Jeremy Myntti and Jessalyn Zoom (American Library Association, 2019).
[8] Amelia Acker, “Emulation Practices for Software Preservation in Libraries, Archives, and Museums,” Journal of the Association for Information Science and Technology 72, no. 9 (2021): 1148–60.
[9] That phrase is incredibly central to these efforts in part because of the legal framework built around it. For some considerations regarding scholarship, see Tony Horava, “Today and in Perpetuity: A Canadian Consortial Strategy for Owning and Hosting Ebooks,” The Journal of Academic Librarianship 39, no. 5 (2013): 423–28, https://doi.org/10.1016/j.acalib.2013.04.001.
[10] Bertalan Meskó, “Prompt Engineering as an Important Emerging Skill for Medical Professionals: Tutorial,” Journal of Medical Internet Research 25, no. 1 (2023): e50638, https://doi.org/10.2196/50638.
[11] Muhammad Arslan et al., “A Survey on RAG with LLMs,” Procedia Computer Science, 28th International Conference on Knowledge-Based and Intelligent Information and Engineering Systems (KES 2024), vol. 246 (January 2024): 3781–90, https://doi.org/10.1016/j.procs.2024.09.178.
[12] Lauren Klein et al., “Provocations from the Humanities for Generative AI Research,” arXiv:2502.19190, preprint, arXiv, February 26, 2025, https://doi.org/10.48550/arXiv.2502.19190.
[13] Ziwei Ji et al., “Towards Mitigating LLM Hallucination via Self Reflection,” in Findings of the Association for Computational Linguistics: EMNLP 2023, ed. Houda Bouamor et al. (Association for Computational Linguistics, 2023), https://doi.org/10.18653/v1/2023.findings-emnlp.123.
[14] Gabriel Nicholas and Aliya Bhatia, “Lost in Translation: Large Language Models in Non-English Content Analysis,” arXiv:2306.07377, preprint, arXiv, June 12, 2023, https://doi.org/10.48550/arXiv.2306.07377.
