Vanishing knowledge: Archiving science in the digital age

While scientific content appears more accessible than ever, the reality is that digital memory is riddled with gaps, leading to significant loss of digital information

Orr Peleg/Davidson Institute of Science
When students begin writing a report or conducting research, their first step is often to log on to Google Scholar in search of relevant articles and information. Have you noticed the line that appears beneath the search prompt - ‘Stand on the shoulders of giants’? This brief phrase encapsulates the essence of scientific progress, which largely builds upon the accumulated achievements and wisdom of multiple generations. There is no need to rediscover what has already been discovered. 
The knowledge is preserved, along with an understanding of how it was acquired. Both researchers and the general public can access the vast body of scientific knowledge accumulated throughout history. You can revisit it for further exploration, learn how specific experiments were designed and draw inspiration from the writings of great scientists. 
The wealth of knowledge that has accumulated over the years is so vast that we now face challenges in preserving all of it. This difficulty stems not only from the sheer volume of research and data but also from the methods of storage and preservation. The cumulative impact of the transition to digital storage, occurring in the current generation, is becoming increasingly apparent. While digital memory is accessible and rich, it is also fraught with gaps and limitations.
Old methods, new technologies
Just like knowledge itself, the way we preserve knowledge has gone through many transformations over the years. The invention of writing greatly facilitated the preservation of history, and the printing revolution further enhanced our ability to pass knowledge from generation to generation. 
By the late 20th century, the scientific community had largely adopted the practice of disseminating and preserving scientific knowledge mainly through periodical journals, which serve as the primary medium for publishing new research.
This method met several important needs. First, it allowed scientists to stay informed about the latest developments in their field. Second, research was published only after undergoing rigorous peer review by experts in the field, which helped ensure that experiments and studies were reported accurately and reliably. 
Equally important, the journals were preserved in printed copies in university libraries and archives. If a researcher wanted to learn about a specific study, they could locate the article by identifying the year, issue, and journal in which it was published, making it easy to access the article and its information.
Printed journals provide a good record of the continuous progress in all fields of science. However, their accessibility is limited due to the sheer volume of information that has accumulated over the years - and continues to grow. 
Rough estimates suggest that a physical archive containing a single copy of every issue of every scientific journal ever published would require tens of thousands of square meters of storage space. And that’s assuming it would hold only one copy of each issue, which would severely limit the availability of the content for reading. In other words, if every university were required to maintain an archive of this size, which would grow every month, there would be no room left for anything else.
The solution emerged in the 2000s with the advent of digital storage technology. Digitization freed up vast amounts of shelf space, replacing physical volumes with many terabytes of information. That information still ultimately has a physical presence: it is stored on large computer servers that occupy space and consume energy and resources.
A cloud carried by the wind
A significant portion of the world's scientific content is now stored exclusively in digital form, in what is commonly referred to as 'the cloud.' However, the term 'cloud' can be misleading, as it is essentially only a metaphor for decentralized technology that allows access to resources and services online without requiring users to understand the underlying infrastructure.
While the physical location of the data is not visible to end-users, this does not mean it lacks a physical presence. When we store information in the cloud, the data is stored on servers located in data centers owned by cloud service providers, who are responsible for their maintenance. In other words, online storage does save physical space in the world, but it is not entirely detached from physical existence or from the constraints of reality. 
The data stored in the cloud remains subject to physical limitations: these include factors such as the location of the data centers, local and international regulatory requirements, physical and network security requirements and the infrastructure hosting the data.
A recent study, which is naturally also stored in the cloud, found that cloud storage has introduced unexpected challenges. The research presents a highly alarming figure: about a quarter of the items stored digitally online are inadequately preserved.
When an article is uploaded to the web, it is assigned a digital identifier called a Digital Object Identifier (DOI). The DOI is a sequence of symbols, letters and numbers that serves as a unique stamp validating the article's existence online. However, as it turns out, the situation is not as straightforward as it may seem.
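To make the structure of a DOI concrete, here is a minimal Python sketch. It assumes the common DOI layout, a prefix beginning with "10." followed by a slash and a publisher-chosen suffix, and the public resolver at doi.org. The DOI in the example is made up for illustration, and the function names are hypothetical.

```python
from urllib.parse import quote


def split_doi(doi: str) -> tuple[str, str]:
    """Split a DOI into its prefix (registrant) and suffix (item) parts."""
    prefix, sep, suffix = doi.strip().partition("/")
    # Assumption: a well-formed DOI has a prefix starting with "10."
    # and a non-empty suffix after the first slash.
    if not (sep and prefix.startswith("10.") and suffix):
        raise ValueError(f"not a well-formed DOI: {doi!r}")
    return prefix, suffix


def resolver_url(doi: str) -> str:
    """Build the doi.org address that redirects to the article's current location."""
    prefix, suffix = split_doi(doi)
    # Percent-encode characters that are unsafe in a URL path, keeping "/" as-is.
    return f"https://doi.org/{prefix}/{quote(suffix, safe='/')}"


if __name__ == "__main__":
    example = "10.1234/example.5678"  # made-up DOI, for illustration only
    print(split_doi(example))
    print(resolver_url(example))
```

An address of this kind only points to wherever the article currently lives. It does not, by itself, guarantee that a copy is preserved in any archive.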
The researchers constructed a database from original archival sources and examined the preservation status of nearly 7.5 million DOIs, assessing their digital preservation and accessibility. The study found that 58 percent of the tested DOIs were indeed present in at least one digital archive, but an additional 28 percent, roughly two out of every seven articles, were seemingly unpreserved.
The remaining 14 percent were excluded from the study for being too recent or for having insufficient metadata to identify the archival source. In other words, more than a quarter of the tested articles with a DOI did not appear in any of the digital archives examined.
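To get a feel for the scale behind these percentages, the short Python sketch below converts them into approximate article counts. It assumes a round total of 7.5 million DOIs, although the study examined "nearly" that many, so the results are rough estimates rather than figures reported by the researchers. The variable and function names are illustrative.

```python
TOTAL_DOIS = 7_500_000  # assumption: rounded up from "nearly 7.5 million"

# Percentage shares as quoted from the study.
SHARES = {
    "present in at least one archive": 58,
    "seemingly unpreserved": 28,
    "excluded from the study": 14,
}


def approximate_counts(total: int, percent_shares: dict[str, int]) -> dict[str, int]:
    """Convert percentage shares into approximate item counts."""
    return {label: total * pct // 100 for label, pct in percent_shares.items()}


counts = approximate_counts(TOTAL_DOIS, SHARES)

if __name__ == "__main__":
    for label, count in counts.items():
        print(f"{label}: about {count:,} articles")
```

Under that assumption, the "seemingly unpreserved" share alone corresponds to roughly two million articles.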
So can the internet forget? Most of us have likely heard or read warnings about online personal information security more than once. We know that what we upload to the internet will remain there indefinitely, potentially resurfacing at the most inconvenient times to remind us of past recklessness and frivolity. While this concern is undoubtedly valid, it turns out that information does sometimes disappear - just not necessarily the kind we would prefer to forget.
The black hole of articles
There are a variety of reasons why information disappears from the web. Website owners may choose to remove articles from their websites or archives due to changes in editorial policy, licensing agreements, or business decisions. 
This can also happen when a journal ceases publication, changes ownership or undergoes reorganization. Updates or redesigns of websites hosting scientific articles can also affect the accessibility or availability of older content published prior to the upgrade.
The URLs (address lines) of web pages and files can change, which can result in broken links or make it difficult to locate specific articles. Websites may also encounter technical issues, such as server failures, database corruption or cyber-attacks. Such problems may result in the loss of articles or in their temporary unavailability. 
In certain cases, scientific articles can get lost due to neglect or abandonment. Website owners might abandon their sites, allowing the website and its content to gradually deteriorate from a lack of maintenance. 
Additionally, lawsuits and legal disputes over intellectual property and copyright infringement can lead to the removal of articles from online platforms. If publishers and authors cannot obtain the proper permissions or licensing for copyrighted content, they may be forced to remove it. 
Some scientific articles are also restricted behind paywalls or subscription barriers, limiting access to eligible users only. If access permissions change or licensing agreements expire, these articles may become inaccessible to the public. Finally, laws or court orders may mandate the removal of certain articles from the Internet due to defamation, privacy violations, or other content deemed illegal or harmful.
Lost treasures
The repercussions of losing knowledge can be severe. When a study is lost, it means that valuable scientific knowledge may also vanish. Such losses can hinder scientific progress, as researchers may be unable to build on or replicate previous findings. The ability to repeat experiments is a cornerstone of scientific research.
If original research articles disappear, it becomes difficult, if not impossible, for other researchers to confirm the findings or replicate the experiments. This can undermine the credibility of scientific findings and impede the advancement of our knowledge and understanding. 
The disappearance of research studies can also impact scientific education at all levels. Students and lecturers rely on access to scientific literature to stay updated on the latest findings and theories. When research articles vanish, the ability to teach and learn scientific concepts accurately may be compromised.
Moreover, scientific research plays a crucial role in government policy decision-making across key areas such as public health, environmental protection and technology development. The loss of scientific knowledge may restrict policymakers' access to essential evidence and data, potentially resulting in uninformed or ineffective policies.
The disappearance of research studies can also be financially harmful. Research drives innovation and economic growth in many industries. The loss of knowledge could stifle innovation by limiting access to basic knowledge, hindering the development of new technologies and products. 
And finally, from a broader, long-term perspective, scientific research plays a vital role in documenting humanity’s achievements and progress. Lost research can obscure the developmental trajectory of ideas, theories and methodologies over time, making it difficult to trace the course of scientific progress.
In the current era, any damage to the accessibility of scientific research and our ability to validate it could foster the spread of misinformation and ignorance. Therefore, efforts to preserve the entire body of scientific literature are crucial for maintaining the integrity, accessibility, and continuity of scientific knowledge for future generations.
Knowledge preservation
Keeping paper copies of research articles can serve as a form of insurance against their disappearance, but this solution is often impractical and unnecessary in the digital age, especially given the valuable shelf space it occupies. Instead, adopting digital preservation strategies and promoting open-source and open-access initiatives would be more effective. 
Such an approach would ensure that research articles remain freely available online to the general public without restriction. Articles published under shared licenses are typically hosted on platforms designed for long-term preservation, contributing to their continued availability.  
However, open-access journals are particularly vulnerable to the risk of shutting down. A study found that between 2000 and 2019, at least 174 such journals disappeared from the internet, highlighting that the model of free scientific publishing does not guarantee the long-term preservation of content.
One possible solution could be to combine digital storage with online publishing. Many libraries, archives and academic institutions offer digital storage services. These services ensure the long-term accessibility of digital content by storing copies in secure, redundant systems. 
Website owners can collaborate with such organizations to preserve their digital archives, including research papers, thereby ensuring that the content survives even if the original website is no longer available.
Content that has already been lost can sometimes be partially or fully recovered with the help of the Internet Archive - a huge non-profit digital library initiative designed to provide universal access to all information available online. 
The archive was founded in 1996, in the early days of the commercial internet, with the goal of offering continuous access to historically valuable collections existing in digital format. To achieve this, it backs up websites, software applications and games, music, movies, videos, and millions of books, journals, and documents.
One of the most significant roles of the Internet Archive is to combat the disappearance of online content, including research articles, a phenomenon also known as "link rot." The Internet Archive addresses this by periodically scanning the internet, creating copies of webpages and storing them for future reference. Thus, even if a web page is deleted or significantly altered, users can still access archived versions of the page.
The mechanism that enables this is the "Wayback Machine," where old web pages can be found by URL, subject or date, navigated through and even linked or bookmarked within the archive for future use, citation or reference. However, this tool has limitations: its web scanning mechanism may skip certain sites, pages within them, or some of the files stored on them. 
As a result, some content, especially dynamic or interactive web pages, may not be fully captured or may not function properly in the archived versions. Additionally, preservation efforts depend on the cooperation of website owners—content protected by a paywall, for example, will not be available to the public even in the Wayback Machine’s backup version. 
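As a rough illustration of looking up an old page by URL and date, the Python sketch below builds two Wayback Machine addresses: one that opens the archived copy closest to a given date, and one that asks whether any copy exists. The address patterns reflect the publicly used web.archive.org and archive.org/wayback/available formats as understood here and should be treated as assumptions. The function names and the example page are hypothetical, and the code only constructs addresses without contacting the archive.

```python
from urllib.parse import urlencode


def wayback_snapshot_url(page_url: str, timestamp: str) -> str:
    """Address of the archived copy closest to `timestamp` (e.g. YYYYMMDD)."""
    return f"https://web.archive.org/web/{timestamp}/{page_url}"


def availability_query_url(page_url: str, timestamp: str) -> str:
    """Address of the lookup that reports the closest snapshot, if one exists."""
    query = urlencode({"url": page_url, "timestamp": timestamp})
    return f"https://archive.org/wayback/available?{query}"


if __name__ == "__main__":
    page = "https://example.org/paper.html"  # hypothetical article page
    print(wayback_snapshot_url(page, "20150101"))
    print(availability_query_url(page, "20150101"))
```

Whether either address actually returns a copy depends on the limitations described above: pages the scanning mechanism skipped, or content behind a paywall, will not be found.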
Cooperation from publishers and researchers in publishing and storing content in the Internet Archive can help preserve science and the history of science for both current and future generations of scientists.
Content distributed by the Davidson Institute of Science Education. 