Digital storage is extremely compact and offers exact, rapid, and nearly cost-free replication across an unlimited number of copying generations, whether to another digital storage medium or to a playback device for human readers (or viewers, or listeners). This is obviously quite fantastic, and it has caused a great deal of existing and new content to move to digital storage and onto the Internet. Much less obvious is the potential for information loss that this ubiquitous digitization carries, compared to traditional physical artifacts such as books.
The most obvious downside of storing information on the Internet mirrors its great upside: it’s dynamic. At any time the owner can make changes that become globally visible almost immediately. This is very annoying when the owner changes or deletes content that others had come to rely on. Link rot and content drift, collectively known as “reference rot,” have become such pervasive problems that a bare URL reference tends to become meaningless within years. Jill Lepore’s The Cobweb reports:
But a 2013 survey of law- and policy-related publications found that, at the end of six years, nearly fifty per cent of the URLs cited in those publications no longer worked. According to a 2014 study conducted at Harvard Law School, “more than 70% of the URLs within the Harvard Law Review and other journals, and 50% of the URLs within United States Supreme Court opinions, do not link to the originally cited information.” […]
Last month, a team of digital library researchers based at Los Alamos National Laboratory reported the results of an exacting study of three and a half million scholarly articles published in science, technology, and medical journals between 1997 and 2012: one in five links provided in the notes suffers from reference rot. It’s like trying to stand on quicksand.
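The kind of survey these studies describe can be approximated mechanically. Below is a minimal sketch in Python (standard library only; the URL pattern and function names are my own, not taken from any of the cited studies) that extracts URLs from a document’s notes and reports their HTTP status:

```python
import re
import urllib.request
import urllib.error

# Rough pattern for http(s) URLs in plain text; real surveys use stricter parsing.
URL_PATTERN = re.compile(r'https?://[^\s<>"\]\)]+')

def extract_urls(text):
    """Pull candidate URLs out of plain text, trimming trailing punctuation."""
    return [u.rstrip('.,;:') for u in URL_PATTERN.findall(text)]

def check_url(url, timeout=10):
    """Return (url, status): an HTTP status code, or an error string."""
    req = urllib.request.Request(url, method="HEAD",
                                 headers={"User-Agent": "reference-rot-check/0.1"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return url, resp.status        # 2xx: the link still resolves
    except urllib.error.HTTPError as e:
        return url, e.code                 # e.g. 404: classic link rot
    except (urllib.error.URLError, OSError) as e:
        return url, str(e)                 # dead domain, DNS failure, timeout
```

Note that such a check only detects link rot, not content drift: a page that still returns 200 may no longer contain the originally cited information, which is why the studies above distinguish the two.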
Aside from deliberate changes and deletions, another fundamental problem with digital-only information is its need for active maintenance. Publishing information over the Internet requires all of the following:
- a big pile of hardware for network and servers,
- a continuous supply of electricity for both,
- periodic replacement of defective media and other hardware,
- periodic backups to avoid data losses in case of such defects.
One common reason for link rot is that someone decided to no longer pay an ISP’s charges for all this, or that the ISP itself went out of business. Moreover, any Internet publisher is automatically exposed to hostile requests that might attempt to make it inaccessible through denial-of-service attacks, or infiltrate it with destructive software. This further increases maintenance efforts.
Lepore proposes the Internet Archive as a solution to reference rot (more on which below), as well as a dedicated academic citation snapshot service, Perma.cc. These services do protect against individual authors, documents, or ISPs going offline, but digital storage gets messier when we go behind and beyond currently available Internet pages.
Vint Cerf, one of the “fathers of the Internet,” last year warned of a digital dark age due to an increasing inability of modern hardware and software to access old digital content stored offline. Companies are both permitted and economically required to keep changing their products, and eventually compatibility goes by the wayside. One interesting point made by Peter Suciu’s article is that there really is no simple technical solution to this problem.
Cerf suggested that an X-ray snapshot of the content, its application, and most importantly the operating system, could be one way to ensure that future generations are able to reproduce the technology to retrieve said data.
There may be two problems with this solution however. One is that the X-ray would have to be preserved – and that likely should not be digitally. The other issue is that it might involve a corporate entity to maintain the X-ray, which brings the entire problem full circle.
Commercial entities are indeed unlikely to bother with a task as unprofitable as ensuring access to legacy information. Google itself has scaled back such efforts, leaving the vast donation-funded Internet Archive as the sole universal deposit of archived web pages. A worthy effort, but subject to the same catch-22 as Cerf’s “X-ray”: attempting to preserve digital information digitally merely reiterates the same dangers on another level. Should funding dry up and someone find another use for all those servers, or should old storage media become unreadable, part or all of the Internet Archive will be gone.
Aside from Internet connectivity and format compatibility, digital storage media themselves require continuous active maintenance. To be sure, most digital media are less terrible than early consumer-grade magnetic disks whose “click of death” supported an entire cottage industry. But even today, server drives are expected to fail within 5–10 years under continuous heavy load, and thus are always deployed in redundant setups with multiple disks and removable backup media.
These include traditional mainframe storage tapes for which librarians estimate a life expectancy of several decades. Unfortunately, any given tape might fail after just one year – and that’s usually not obvious by looking at it. One must actually try to read it back. Then of course one must already have made a backup copy in case it doesn’t work, and then create a new backup copy to replace the failed one. Optical disks, mechanically pressed and correctly sealed, should be more reliable – at present the jury is still out on that, as this type of storage hasn’t been around long enough. The same applies to other new storage media such as high-capacity USB sticks. Fundamentally, they do not change the situation: they can fail in a visually non-obvious way, so periodic tests and renewed backups are necessary.
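The “periodic tests” called for above are essentially checksum audits. Here is a minimal sketch in Python (standard library only; the manifest format and function names are hypothetical, not drawn from any particular archival tool) that records a checksum for every file in a directory and later reports which files no longer read back correctly:

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(directory, manifest_path):
    """Record a checksum for every file under `directory`."""
    manifest = {str(p.relative_to(directory)): sha256_of(p)
                for p in Path(directory).rglob("*") if p.is_file()}
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))

def verify_manifest(directory, manifest_path):
    """Return the files whose contents no longer match the manifest."""
    manifest = json.loads(Path(manifest_path).read_text())
    damaged = []
    for rel, expected in manifest.items():
        p = Path(directory) / rel
        if not p.is_file() or sha256_of(p) != expected:
            damaged.append(rel)
    return damaged
```

In practice one would store the manifest on a different medium than the files it describes, and run the verification before each backup cycle, so that a failing copy is detected while a good one still exists to replace it.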
No Technical Solution
Way back in 1999, and now ironically available only through the Internet Archive’s “Wayback Machine,” Stewart Brand wrote Escaping the Digital Dark Age. Nearly 20 years ago the situation was already quite clear:
Digital storage is easy; digital preservation is not. Preservation means keeping the stored information cataloged, accessible, and usable on current media, which requires constant effort and expense. Furthermore, while contemporary information has economic value and pays its way, there is no business case for archives, so the creators or original collectors of digital information rarely have the incentive – or skills, or continuity – to preserve their material.
Brand recognizes that stored digital information requires continuous preservation, i.e. active maintenance, to remain accessible. Obviously it’s also in his professional interest as a long-term preservation advocate to push for such efforts. The downside is equally obvious: any technical solution that requires continuous professional attention to keep documents viable is not really a “solution,” but rather a workaround for an inherent and inevitable downside of digital storage. Brand has the right idea in this sentence:
And don’t forget atomic backup – while the durability of bits is still moot, the atoms in ink on paper have great stability.
Any digital storage of data that you want preserved must ultimately be backed up by human-readable physical artifacts. Printed paper remains usable for centuries and millennia, requiring nothing more than a storage place with environmental conditions that humans themselves consider clean and pleasant. Internet replication greatly reduces the need for physical storage of documents that one currently reads or writes, but it does not eliminate the need for physical long-term storage. I think it’s quite likely that many less popular books, having lost their digital representations for one reason or another, will have to be periodically scanned back in from their print editions. And if you rely on the content of some Internet link, you could surely do worse than print a hardcopy… just in case.
I have had the privilege of reading a five-hundred-year-old book, an experience hardly different from that of reading a modern book. Compare such robustness to the lifespan of electronic documents: some of the computer files of my manuscripts that are less than a decade old are now irretrievable.
– Nassim Nicholas Taleb, Antifragile, p.332
As a final thought, consider how a world of purely digital information would accommodate convenient, accidental, quiet censorship. Governments no longer need to send agents to gather forbidden books. Anything that requires an active web host vanishes forever with the click of a button, or with the host itself. Anything that the Zeitgeist abhors for some decades would likewise vanish, merely for lack of anyone with the will and resources to conserve it. The effect would be similar to the “Dark Ages” disappearance of Greek and Roman classics deemed unworthy by the Church to be preserved by continuous copying, except on a vastly accelerated timescale. Active maintenance is inevitably selective. This is why we should preserve our important documents in formats that require as little of it as possible.