
Getting a handle on data

By Julian Perkin
Financial Times
Nov 17, 2004 (See article on-line at FT.com)


It is hard to imagine life without the worldwide web, which unleashed the power of the internet on an unsuspecting world. The web is far from perfect, but as individual users we tend to take the rough with the smooth, accepting the broken links and out-of-date information as irritants in exchange for the treasure trove of information, entertainment and cheap goods and services such as low-cost flights.

But things are more serious for governments, which risk failing in their duty to provide citizens with proper information and services, and for businesses which can lose customers and money simply because data have a habit of changing form and moving to new locations on the web.

Fortunately, new solutions are emerging. Most academic journals published online are now able to cross-reference each other reliably - if you click on a citation it is pretty well guaranteed to lead to a real published paper, even if it was published some years ago and is now hosted on a new owner's website. No broken links. No early draft versions.

These cross-references rely on Digital Object Identifiers, or DOIs. Proponents argue that DOIs, and a related technical innovation called the Handle System, will provide a quantum leap equivalent to that of the worldwide web, with important implications for governments as well as publishers and other media companies.

To be able to open up their systems and share information, governments and departments need to agree on standards enabling information to be correctly and uniquely identified on different systems and for these systems to inter-operate reliably - providing the necessary information flow while respecting the policies on classified material and other statutory exemptions. Demand, already high, is set to ramp up. More than 500,000 requests have been made by citizens, journalists and companies to the US government under its Freedom of Information Act since its enactment in 1966.

Two forceful waves of change are inducing government departments to change the way they manage their data. The first is the tide of events that demand a response through greater co-operation - from counter-terrorism imperatives following September 11, including the homeland security initiative in the US, and fall-out from the wars in Afghanistan and Iraq, through to failures of social security and police departments adequately to share information to protect the public. This was made manifest in the UK by the Bichard report published in June this year which showed how failures to pass on information from one police force to another meant standard checks were not made on Ian Huntley, the convicted Soham child murderer, who would otherwise have been barred from a job in a school.

The second wave is the trend in policy-making towards open government and greater transparency. More than 50 countries from Mexico to Indonesia and including the US and Canada, the European Union and central and eastern Europe have passed, or are in the process of passing, freedom of information (FoI) laws. The pace of change is increasing - the number of countries with FoI legislation has more than doubled in the past decade. In the UK, for example, the Freedom of Information Act (2000) comes fully into effect in January 2005. Despite accusations that the act has been watered down, the demands on government departments and their systems to meet new public rights to access government information will still be considerable.

This is where the Handle System and DOIs, two standards that are rising to prominence as digital identifiers, can play a key role. The Handle System is a comprehensive system for assigning, managing and resolving persistent identifiers, known as "handles," for digital information on the internet. It provides a global, standard method for uniquely and permanently identifying digital content on the internet. Content means anything from newspaper articles, official reports, photographs and illustrations, and tables of statistics through to music tracks and video libraries.

DOIs are a standard method, based on the Handle System, for identifying published digital content. They are primarily concerned with publication and are endorsed by the publishing industry as an international standard. The DOI lends itself to providing reliable linking and discovery of documents on the web.

The critical difference between DOIs and traditional links on the worldwide web is that DOIs identify the actual content, while the web references its location. Identifying something by its location, as we all know from experience of the web, has its drawbacks. Things frequently move and links then fail. You can never be quite sure whether you have the latest version, or the definitive copy. And, since duplicates are easily and frequently made, searches give you multiple instances - or, worse, different versions - of the same material.

Handle-based identifiers, including DOIs, are unique on a global basis, and persistent - that is, they will stand the course of time, unlike many web addresses. As a result, they are guaranteed to resolve to a real document and can be used, through a system of access via trusted intermediaries, to ensure that everyone gets the same, definitive version of a document such as a government report.

Identical copies of documents will have the same DOI, so the version found can be trusted, while different releases will have different DOI references that can be correlated through similar systems of access. Of course, there may be value in providing access to different versions of reports, or to supporting information and sources related to documents. DOIs can be used to link together such correlated sets of documents.

For images, audio and video clips, DOIs can be sewn into the fabric of the content, so references are carried with the object, even if it is cut and pasted into another document or application. This opens up interesting possibilities - for publishers and commercial organisations including those in the music, graphics arts and video production industries, as well as for governments - concerning copyright control. Embedded DOIs can be used either to police access to, or to track copies made of, copyright-protected material.

Add to this the potential to identify elements of content rather than whole documents - a chart in an official report, say, or a chapter of a book, or a track on a music CD - and there is clearly a range of new commercial opportunities for publishers and other information providers, not to mention some threats to existing business models.

This ability to identify and access separately component parts of documents - known in this field as "disaggregation" - can serve the needs of public information provision and disclosure to link together partial sources of information from different reports "federated" across the systems of multiple government departments: for example, to respond fully to requests under the Freedom of Information Act.

Sensitive information within reports can also be classified so that such reports can essentially be made public without compromising national security, endangering the innocent, or giving away state secrets - with blanked-out names, locations and so on.

A key strength is that DOIs will facilitate a degree of convergence between printed reports and on-line data. The DOI code will also be printed below the tables and charts in hard copy reports and books. This code can be typed into a web browser to access the same services - latest figures, more exhaustive statistics, access to source data etc.

DOIs will appear just like links that we have become used to. But they will be more reliable, greatly enhancing the online experience by blending in seamlessly behind the scenes. They will be like a turbo-charger under the bonnet of the web engine, rather than an alternative or competitor to the web.

In a second article, to be published in the next FT-IT on December 1, we look at some of the users of the new system.

Questions of identity

Who is behind DOIs and Handles?
DOIs are being developed and implemented jointly by the International DOI Foundation (IDF) and by the Corporation for National Research Initiatives (CNRI), a US-based academic institution with close links to the US government and responsible for many initiatives that underpinned the development of the web. DOIs are implemented on the Handle System architecture, developed by the CNRI for identifying resources on the internet. In many ways, this is designed to address known weaknesses in the current "domain name" system of web addressing. The team responsible for the Handle System includes Dr Robert Kahn of CNRI and other individuals involved in the creation of DNS.

Who is using the system?
Applications using handles have been developed by divisions of the US government's Department of Defence, exploiting internet channels without making documents publicly available on the web. Governments, along with publishers, are considering the use of DOIs for official publications, and the Handle system for sharing and disseminating internal information between departments and inter-governmentally.

How does it work?
Digital identifiers - DOIs and Handles - are allocated by registration agencies and held in one of five Local Handle Systems (LHS) in the US, Asia and Europe. Each is mirrored by a Global Handle System in Washington, partly as a back-up and also to look up identifiers that cannot be served locally. The registration agency holds a directory of pointers to the actual published material. When you click on a DOI or Handle link, the system goes first to the Local, and if necessary to the Global, Handle System, which points to the actual information. It is analogous to making a mobile phone call, where the number is sent to the mobile network operator, which finds the current location of the phone and routes the call via the nearest cell.
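
The local-then-global look-up described above can be sketched in a few lines of Python. This is only an illustration, not the Handle System's actual software: the two dictionaries stand in for a Local Handle System and the Global Handle System, and every identifier and web address in it is invented for the example.

```python
from typing import Optional

# Hypothetical stand-ins for a Local Handle System and the Global Handle
# System. The identifiers and URLs are made up for illustration only.
LOCAL_HANDLES = {
    "10.1000/report-2004": "https://example.org/reports/2004.pdf",
}
GLOBAL_HANDLES = {
    # The global system mirrors the local entries and holds the rest.
    "10.1000/report-2004": "https://example.org/reports/2004.pdf",
    "10.2000/stats-table-7": "https://example.net/stats/table7.html",
}


def resolve(identifier: str) -> Optional[str]:
    """Return the current location for an identifier, or None if unknown."""
    # Step 1: ask the local system first.
    location = LOCAL_HANDLES.get(identifier)
    if location is not None:
        return location
    # Step 2: fall back to the global system for anything not held locally.
    return GLOBAL_HANDLES.get(identifier)
```

Calling resolve("10.2000/stats-table-7") misses the local table and is answered by the global one, while an identifier known to neither returns None.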

What happens when content changes location or ownership?
Because content is accessed indirectly, movement of the information from one location (such as a web site address) to another can be accommodated with no discernible impact on the user. Changes in content ownership can also be accommodated since the registration agencies can change their look-up tables to reference the new content owner's systems and naming schemes.
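
A rough sketch of why this indirection works, again in Python and under stated assumptions: the Registry class, its method names and the addresses are all hypothetical, standing in for a registration agency's look-up table. The point is that the identifier readers hold never changes; only the pointer behind it does.

```python
class Registry:
    """Hypothetical look-up table mapping identifiers to current locations."""

    def __init__(self) -> None:
        self._pointers: dict[str, str] = {}

    def register(self, identifier: str, location: str) -> None:
        # Record where the content lives when it is first published.
        self._pointers[identifier] = location

    def move(self, identifier: str, new_location: str) -> None:
        # Content moved or changed owner: update the pointer, not the identifier.
        if identifier not in self._pointers:
            raise KeyError(identifier)
        self._pointers[identifier] = new_location

    def resolve(self, identifier: str) -> str:
        # Readers always ask by identifier and get the current location.
        return self._pointers[identifier]


registry = Registry()
registry.register("10.1000/annual-report", "https://old-owner.example/report.pdf")
registry.move("10.1000/annual-report", "https://new-owner.example/archive/report.pdf")
```

After the move, the same identifier resolves to the new owner's address, so links that quote the identifier keep working.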

How can we be sure the system will work, and what safeguards are there?
The registration agency acts as a trusted intermediary - trusted by the publisher and the user of the information. (DOI agencies tend to specialise in certain domains such as official government publications.) It guarantees continuous availability and can automatically resolve to alternative sources if the primary source is not available for any reason. The Handle system is proof against technical failure of any one central location and is not under any single authority's control. So it cannot be shut down by a maverick individual authority and is not under the control of any one nation.

What about existing identification schemes?
DOIs can accommodate publishers' existing identification systems such as ISBN for book publications. A DOI comprises a prefix and a suffix. The prefix might indicate that the publication is identified by an ISBN, while the suffix, which is not limited in length or format and references the specific content, would contain the specific ISBN. The advantage is that ISBNs and other existing identification schemes can be resolved online using the standard DOI resolution mechanism, which will be understood by many different systems and applications.
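
As an illustration of the prefix and suffix structure, the short Python sketch below splits an identifier at its first slash. The function name, the prefix "10.1000" and the ISBN-style suffix are invented for the example and do not refer to any real publication.

```python
from typing import NamedTuple


class Doi(NamedTuple):
    prefix: str
    suffix: str


def parse_doi(doi: str) -> Doi:
    """Split an identifier into prefix and suffix at the first slash."""
    # The suffix is free-form, so any later slashes stay inside it.
    prefix, separator, suffix = doi.partition("/")
    if not separator or not prefix or not suffix:
        raise ValueError(f"not a prefix/suffix identifier: {doi!r}")
    return Doi(prefix, suffix)


# Hypothetical identifier whose suffix carries an existing ISBN-style code.
example = parse_doi("10.1000/isbn-0-00-000000-0")
```

Here example.prefix is "10.1000" and example.suffix is "isbn-0-00-000000-0", so a publisher's existing code travels unchanged inside the identifier.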

What about copyright and access rights?
These can be managed via the trusted intermediary role. The DOI does not give direct access to a document, so rules can be applied at the look-up stage - checking for subscription payment or authority to access, or perhaps requiring the user to acknowledge the copyright terms.
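
One way to picture a rule applied at the look-up stage is the Python sketch below. It is an assumption-laden illustration rather than how any registration agency actually works: the Record type, the subscribers_only flag, the resolve_for function and the addresses are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Record:
    location: str
    subscribers_only: bool = False  # hypothetical access rule


# Made-up records held by the trusted intermediary.
RECORDS = {
    "10.1000/open-report": Record("https://example.org/open-report.pdf"),
    "10.1000/paid-journal": Record(
        "https://example.org/journal/article.pdf", subscribers_only=True
    ),
}


def resolve_for(identifier: str, user_is_subscriber: bool) -> str:
    """Return the location only if the access rule allows it."""
    record = RECORDS[identifier]
    # The check happens before any location is handed back to the user.
    if record.subscribers_only and not user_is_subscriber:
        raise PermissionError("subscription required")
    return record.location
```

Because the user only ever holds the identifier, a refused request reveals nothing about where the document is stored.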

Copyright Financial Times group