
A new dark age
November 8, 2007
What happens when our data is no longer readable?

By Danny Bradbury

What a difference 19 years can make, especially when those years bridge a transition from analogue to digital data. In 1983, the Interactive Unit of the British Broadcasting Corp. (formed to explore the new world of digital multimedia) decided to create an interactive electronic version of the original Domesday Book. King William I had compiled the book, the first broad survey of English life, in 1086. The BBC engaged schools around the country to explore and document 12,000 blocks of land. It collected thousands of photo, text and video entries and stored them in part analogue and part digital format on Philips Laserdiscs—12-inch monsters that were one of the first optical storage media available. 

But in 2002, the U.K.’s National Archives began to worry about how to preserve the project. The BBC Micro computer (the platform for which the user interface had been written) was rapidly becoming obsolete; the user interface was proprietary and written in the BCPL computer language, and Laserdiscs were essentially dead, meaning players were becoming less available. Without players, it would be impossible to transfer the data to a new machine. If conditions had degraded this badly in just 19 years, who knows how readable the project files would be in another 20 years, or 50? The data produced by the project could soon become unreadable.

Old before its time
We are producing more data than ever before. The School of Information Management and Systems at UC Berkeley estimated the world produced approximately five exabytes of new information in 2002, or about 800 MB per person. In 1999, the figure was between one and two exabytes. Preservation of that data is of considerable concern to archivists, who worry we are entering a digital dark age.

“The biggest problem with digital data preservation is people don’t think it’s a problem. But then if you ask them whether they can access their first digital files, they say no and don’t connect the two questions,” said Alexander Rose, executive director of the Long Now Foundation, an organization promoting long-term thinking. “People have said that the computer age is moving so quickly that it isn’t even required to know its own history. It’s one of the few things we’ve seen that does that.”

King William I commissioned the Domesday Book in 1086, and the parchment it is written on is still readable. The same cannot be said for the version of the book created in 1983.

People only begin to realize the extent of the problem when negative stories come forth, Rose said. One comes from the U.S. Navy. Navy officials interviewed by Popular Mechanics in December 2006 complained that computer-aided design (CAD) files for complex carriers such as the USS Nimitz, when viewed with modern software, did not display data the same way they did when opened using older versions of the software. That can be a significant problem, especially with mission-critical data.

“In Texas, many oil companies have had air-conditioned warehouses filled with computer tapes that recalled their field explorations,” said James Porter, founder of data storage market research firm DiskTrend. “There have been millions of reels of computer tape from these explorations—from the desert, from the North Sea, you name it. They are stored in those warehouses because the oil companies think that maybe they will work out [new ways] to find oil.” But if that data becomes unreadable, the whole exercise becomes useless.

One of the biggest problems for people wanting to store data is the selection of a physical medium. There are no ratified standards detailing the archival quality of storage media technology, and advances in storage methods mean formats are essentially supplanted every 10 years. For example, DVD will shortly give way to new formats such as Blu-ray and HD DVD. Companies like InPhase are already developing holographic storage media which will significantly increase storage capacities, making high-definition DVDs yesterday’s news.

“Early CDs and DVDs had metal in the disk, and there have been stories of them failing pretty quickly. They have become much better at protecting the metals, but it is an Achilles heel,” said Kevin Curtis, InPhase founder and CTO. “Typically, you try to get to 20 or 30 years of storage in professional media.” He argues that his storage technology will hold up longer because it is based entirely on plastics.

Making copies
Part of the problem in preserving data digitally is that simulating conditions to test the corruption of bits over long periods is difficult, said David Rosenthal, chief scientist for LOCKSS, a Stanford University preservation initiative. This makes it difficult to prove industry is doing a good job of keeping data intact, and therefore it is difficult to produce a benchmark standard. Moreover, long-term data storage may not be at the top of the agenda for many storage firms. “Storage vendors are typically looking at an application where it is important to keep the data for a limited period of time, and then make sure it’s really gone. There are some regulatory areas where they’re talking about 100-year lifetimes, but most people in business simply say that they’re taking plenty of backups or crossing their fingers,” he said. “Even by expecting this media to last five years, or tapes to last more than 10 years, you are asking for trouble.”

Consequently, the only solution for people wanting to store data reliably on physical media beyond five or 10 years is to copy it across to another tape or disk periodically. However, this creates its own challenges in terms of cost, because each copy incurs an administrative overhead for issues such as maintenance and security.
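The periodic copying described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `file_digest` and `migrate` are hypothetical, SHA-256 is an arbitrary choice of checksum, and a real migration would also have to deal with the maintenance and security overheads mentioned above. The idea is simply to copy a file to the new medium and refuse to trust the copy unless its checksum matches the original.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 digest of a file, read in chunks to limit memory use."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def migrate(source: Path, destination: Path) -> str:
    """Copy source to destination and verify the copy bit-for-bit via its digest."""
    before = file_digest(source)
    shutil.copyfile(source, destination)
    after = file_digest(destination)
    if before != after:
        raise OSError(f"copy of {source} is corrupt: digests differ")
    return after


# Demonstration with throwaway files standing in for the old and new media.
with tempfile.TemporaryDirectory() as workdir:
    old_medium = Path(workdir) / "survey_1986.dat"
    new_medium = Path(workdir) / "survey_copy.dat"
    old_medium.write_bytes(b"field survey records\n" * 1000)
    checksum = migrate(old_medium, new_medium)
    print("verified copy, sha256 =", checksum)
```

Storing the returned checksum alongside the copy gives the next migration, five or 10 years on, something to check against.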

The LOCKSS project epitomizes Rosenthal’s approach to preserving physical data. LOCKSS stands for Lots of Copies Keeps Stuff Safe and the project targets only academic publications, an important but small subset of the world’s new data. LOCKSS is a network of cheap storage devices (essentially PCs with hard drives) spread across the globe. When an academic publisher signs up to the project, it gives LOCKSS permission to absorb its data into the network. A publication is carved up into many fragments, which are then run through a mathematical algorithm to create a hash (a digest of the data). Hashes are well understood tools to validate data—if the hash changes, you can assume the underlying bits of the data have also changed in some way.
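To make the hashing step concrete, here is a minimal Python sketch. It is not LOCKSS code: the fragment size, the sample text, the helper names and the choice of SHA-256 are all assumptions made for illustration, since the article does not say which algorithm the project uses. It shows a publication being carved into fragments, each fragment being hashed, and a single flipped bit producing a different hash.

```python
import hashlib


def split_into_fragments(data: bytes, size: int) -> list[bytes]:
    """Carve data into consecutive fragments of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def digest(fragment: bytes) -> str:
    """Return the SHA-256 digest of a fragment as a hexadecimal string."""
    return hashlib.sha256(fragment).hexdigest()


# A stand-in for a publication (232 bytes, so it splits into four fragments).
publication = b"Domesday survey, 1086: a record of English landholdings. " * 4
fragments = split_into_fragments(publication, 64)
digests = [digest(fragment) for fragment in fragments]

# Simulate decay on the storage medium: flip one bit in the second fragment.
damaged = bytearray(fragments[1])
damaged[0] ^= 0x01

# The digest no longer matches, so the damage is detectable.
print("fragment intact: ", digest(fragments[1]) == digests[1])
print("fragment damaged:", digest(bytes(damaged)) != digests[1])
```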

Fragments of the publication are distributed around various machines on the network, meaning no piece of data is ever stored in only one place.

Periodically, machines will verify hashes with each other for a particular piece of data. If a device’s hash doesn’t tally with those held by others, it will recopy the data from another machine.
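The audit described above can be approximated with a simple majority vote. The sketch below is a toy model, not the actual LOCKSS protocol: the peer names, the `audit_and_repair` function and the use of SHA-256 are assumptions for illustration. Each peer's copy is hashed, the most common hash is treated as correct, and any peer that disagrees recopies the data from a peer in the majority.

```python
import hashlib
from collections import Counter


def digest(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def audit_and_repair(replicas: dict[str, bytes]) -> list[str]:
    """Compare digests across peers and overwrite minority copies.

    Returns the names of the peers whose copies were replaced.
    """
    digests = {peer: digest(copy) for peer, copy in replicas.items()}
    majority_digest, votes = Counter(digests.values()).most_common(1)[0]
    if votes <= len(replicas) // 2:
        raise ValueError("no majority agreement; cannot tell which copy is good")

    # Any copy whose digest matches the majority serves as the repair source.
    good_copy = next(
        copy for peer, copy in replicas.items() if digests[peer] == majority_digest
    )
    repaired = []
    for peer in replicas:
        if digests[peer] != majority_digest:
            replicas[peer] = good_copy
            repaired.append(peer)
    return repaired


# Three hypothetical peers hold the same fragment; one copy has decayed.
peers = {
    "stanford": b"journal fragment 17",
    "leeds": b"journal fragment 17",
    "toronto": b"journal fragm\x00nt 17",
}
print("repaired peers:", audit_and_repair(peers))
```

Requiring a strict majority is a design choice in this sketch: if the copies split evenly, the function raises an error rather than guessing which version is the good one.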

“We built huge numbers of replicas of content and figured out ways of making them very suspicious of each other,” said Rosenthal, adding that it is still mathematically difficult to quantify the reliability of the system. “We don’t know how reliable it is, but we can guess it’s fairly reliable.”

Others are also pursuing persistence through mass duplication. The MIT Libraries and HP jointly developed DSpace, another distributed system for preserving scholarly works. UC Berkeley’s Computer Science Division is working on a prototype global-scale persistent data system called OceanStore.

Copy the software, too
But such systems, along with LOCKSS, only solve the problem of physical persistence. There is another, more politically charged problem to overcome. Even if you solve the physical preservation problem, people in 20 years may not have access to the software that created the data. Consequently, they may not be able to read it. Who is to say, for example, that Microsoft Office will still be available in its current form halfway through this century?

“You have to start from the premise that no format will last forever,” said Adrian Brown, head of digital preservation at the U.K.’s National Archives. “You could say that you preserved something as long as you have a current means of accessing the information, broadly as long as you have software that is capable of interpreting that format for you.”

One way to help guarantee that is to ensure data is stored in non-proprietary formats that are open and well published. Preserving photographs in JPEG format, rather than a proprietary format such as RAW, makes it more likely people in the future will be able to access them, for example. But even that may not stand the test of time. Ultimately, Rosenthal said, you have to keep a copy of the software somewhere safe. “You run the old binary of the application and it should all work,” he said.

“But there’s an assumption that you have someone around who knows how to operate a 20-year-old machine,” Rosenthal said. Or a 50-year-old one. Or one that hasn’t been used for more than a century.

One way around this could be to do everything in software, including the computer platform itself. “Virtualization is one force working in our favour, and it falls into the same category as emulation,” Rose said. Even if people cannot access Apple Mac hardware in 50 years, for example, a system could be written that will run OS X on a computer that hasn’t been invented yet.

And then lawyers step in
All this assumes the company that sold the application software in the first place allows people to keep copies of it, something that is far from certain in an age of rapidly shifting software licensing policies and Software as a Service arrangements. “One of the big problems around this is intellectual property law, not the technology at all,” Rosenthal said.

But intellectual property and technology are intrinsically linked, and digital rights management is locking up vast swathes of data. “Any kind of encryption technology that bars access by technological means is a problem not necessarily for preservation, but certainly for being able to use the preserved object,” said Chris Rusbridge, director of the U.K.’s Digital Curation Centre.

Unfortunately, lawyers don’t think that way. In the U.S., for example, the Digital Millennium Copyright Act 1998 made it illegal to try to break encryption, even for research purposes. With both information holders and software vendors protecting their intellectual property so aggressively, archivists are facing both technical and legal hurdles in documenting what must be one of the most exciting times in human history.

Little wonder, then, that years after the BBC’s Domesday project was rescued, you still can’t buy it off the shelves. When the BBC realized its data was dying, CAMiLEON stepped in. The project, formed by the universities of Michigan and Leeds, deals with long-term data preservation projects. It took an emulation approach to the Domesday challenge, using an open source BBC Micro emulator to run the software. It was still able to find Laserdisc readers that could transfer the data, and it also drew on data from the original analogue videotapes used for storage.

Many different copyright holders made data available to the project, and understanding who owns what and how it can be used has proved to be the biggest challenge of all. Consequently, although the data has been preserved and the community data put online, much that made the Domesday project such a valuable guide to British life in the 1980s is still unavailable to the public.

While you can’t go and take the original Domesday Book out of the library either, it does exist and you can browse and download the book online. That’s not bad for something produced almost 900 years before the first PC rolled off the assembly line. One wonders whether we will be able to say the same of the information produced today.

© 2006-2007 Backbone Magazine. All Rights Reserved. Privacy Policy | Terms of Use.