The 10 DOs and DON'Ts for persistent URIs
This document explores best practices on the publication of Uniform Resource Identifier (URI) sets, both in terms of format and of their design rules and management. The first two elements are linked in that a well-designed URI is more likely to persist than a badly designed one. Management issues are independent of URI design itself.
Why is URI persistence important?
When a book is published, if nowhere else, it should still be found in national libraries many years in the future. When a patent is lodged or a work copyrighted, that creates a legal status that can be referred to reliably both now and in the future. Books, patents and legal documents are matters of record and what is sought here is the equivalent for identifiers that lie at the heart of Web-based interoperable data.
The recent development of open data and the desire to increase its interoperability have led to an increased reliance on URIs as identifiers for a wide variety of concepts; everything from languages to buildings, government departments to currencies. Against this demand there is a natural human reluctance to depend on the Web - a system that is seen as being 'new.' This reluctance is entirely understandable given that the RFC that defines the URI syntax is only 14 years old. Furthermore, de-referenceable URIs depend on the provision of an online service, one that cannot be maintained without some agency funding the relevant server infrastructure. Such funding is itself ultimately dependent on a decision that the cost is less than the benefit, a balance that is very much subject to change in either direction over time.
Why are persistent and well-formed URIs important when public administrations exchange data?
During data exchange, there is a need for common identifiers for the resources (classes, properties, individuals, real world entities) exchanged. Even before the evolution of linked data, Peristeras et al. (2008*) emphasised in their work the need for common identifiers to support cross-border public service provision. Nowadays, the use of URIs as a means of assigning unique, global identifiers to resources can provide an effective solution to this. By referring to the same URI, different agents (be they human or machine) can easily reason that they are referring to the same resource, regardless of how this resource is modelled in national/regional/local information systems. In practice, this means that the use of persistent and well-formed URIs can help EU Member States to overcome semantic interoperability conflicts and provide their citizens and businesses with cross-border public services, thus supporting the Single Market and the mobility of people, information and goods in the EU.
* Peristeras V., Loutas N., Goudos S., Tarabanis K.: A Conceptual Analysis of Semantic Conflicts in Pan-European E-Government Services. In Journal of Information Science, vol. 34(6), pp. 877-891, 2008
The specific objectives for the study are to:
The stability of URIs depends on the way in which a given organisation prepares and manages them. In this context, there is an underlying dependency on the Internet infrastructure, specifically the Domain Name System (DNS). At the present time, this system underpins the entirety of the World Wide Web that itself underpins huge volumes of information exchange. Set against a historical timeline measured in centuries, the DNS system will undoubtedly evolve and be replaced. However, we cannot predict when and how such evolution will happen.
There are many types of URI. ISBN numbers seen on the back of books, for example, can be rendered as URIs (e.g. isbn:978-0-575-08360-8). Digital Object Identifiers (DOIs) can also be rendered as URIs and so on. The best known example though is of course HTTP URIs, those that begin with http://. These are de-referenceable. HTTP URIs can be put in a browser's address bar to return more information about the resource. Other URI schemes may not be de-referenceable in the same way, which is the key difference between RDF (which supports the use of any URI scheme) and Linked Data, which depends on HTTP. This document focuses entirely on HTTP URIs. However, it is noteworthy that services such as handle.net provide HTTP URIs for other persistent identifier schemes, such as DOI and ARK. These allow non-HTTP based identifiers to be appended to a common HTTP URI prefix and thus make them de-referenceable. Whilst noting the existence of such systems, particularly in the ANDS Case Study (section 3.2), this document will focus entirely on HTTP URIs that act as complete identifiers without the need to refer to separate identifier schemes. URI schemes such as mailto: and isbn: are not considered, again, as they are not de-referenceable.
This study intends to reach out to government officials and CIOs of governments as well as private companies, technology consultants and the research community, who:
This document is structured as follows:
URIs are a manifestation of a technical architecture - the World Wide Web - that is an application of a deeper system, the Internet. These technical foundations are important when designing identifiers that are intended to be used and re-used by persons unknown, into and perhaps beyond the foreseeable future. As noted in section 1.2, this document takes the persistence of the Web and the DNS system as a given but it is important to note that the architecture of the Web is well defined and based on a set of principles that are themselves built for long term persistence, including the evolution of relevant technologies.
It is within that technical framework that the following sub sections discuss the steps that can be taken to maximise, if not ensure, the long term stability of a URI.
A Uniform Resource Identifier (URI) is a compact sequence of characters that identifies an abstract or physical resource. (...) The following example URIs illustrate several URI schemes and variations in their common syntax components:
A URI can be further classified as a locator, a name, or both. The term "Uniform Resource Locator" (URL) refers to the subset of URIs that, in addition to identifying a resource, provide a means of locating the resource by describing its primary access mechanism (e.g., its network "location"). The term "Uniform Resource Name" (URN) has been used historically to refer to both URIs under the "urn" scheme (RFC2141), which are required to remain globally unique and persistent even when the resource ceases to exist or becomes unavailable, and to any other URI with the properties of a name.
Source: RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax
The aim when publishing URI sets is that a given URI can be de-referenced. That is, a user agent can make a request to that URI over the Internet and receive a meaningful response back. If the user agent is a Web browser, then what comes back should be a human readable HTML document. If the user agent is an RDF client then RDF should be returned from the same URI. In order for this to happen, it is important to consider the technology behind this, namely HTTP, and how this is implemented on a server.
In any discussion of URI persistence it is necessary to understand two facets of the discussion:
These facets of HTTP mean that we can immediately say that persistent URIs should not include file extensions or technologies.
Many URI sets will be published and de-referenced programmatically and this will be done using a particular technology. 15 years ago it would probably have been done using Perl, 10 years ago it would have been done with PHP, today it might be with Python, Ruby, ASP.Net or any number of alternatives. Even something as seemingly stable as .html should be avoided. A document might be published today in HTML but in 20 years' time, maybe HTML8 will be so different that the file extension .html8 becomes common and some important documents might get updated accordingly. File extensions often (although not necessarily) reveal the technology used to create the resource and few things change as rapidly as technology.
It follows that query strings should always be avoided too. So, something like http://example.com/getId.aspx?id=7 is almost guaranteed to be ephemeral. Better to establish a URI such as http://example.com/id/7 and let the server deconstruct it and return the relevant data through whatever technology is in use at the time, which can be updated as required with no change to the URI.
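As a minimal sketch of what "letting the server deconstruct" a path-based URI can look like, the following Python function splits the path and looks the key up in a store. The `RECORDS` dictionary and the `resolve` function are hypothetical stand-ins for whatever backend is in use; only the URI shape comes from the text above.

```python
from urllib.parse import urlsplit

# Hypothetical in-memory store; in practice this could be any backend.
RECORDS = {"7": {"label": "Example record 7"}}


def resolve(uri):
    """Return the record for a URI shaped like http://example.com/id/{key}.

    The URI carries no file extension and no query string, so the lookup
    technology behind this function can change without changing the URI.
    """
    path = urlsplit(uri).path
    segments = [s for s in path.split("/") if s]
    if len(segments) != 2 or segments[0] != "id":
        return None  # not an identifier URI in this sketch
    return RECORDS.get(segments[1])


print(resolve("http://example.com/id/7"))
```

Replacing `RECORDS` with a database or a static file lookup would leave every published URI untouched, which is the point of keeping implementation details out of the identifier.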
Another aspect of server configuration and the design of the HTTP protocol itself is content negotiation. As we have seen, a URI such as http://example.com/id/7 includes no information about the nature of the resource itself. It might be machine readable data in any number of formats, it may be a human readable document that is available as HTML and PDF in any number of human languages. A properly configured server can receive a request for a simple URI like that and return the correct representation of the resource based on the detail in the request. A user agent, be it a browser or something looking for data to process, will include information about what kinds of formats it can handle and the human languages its user can process. It is this data within the HTTP request that determines what the response will be. New representations of the resource may become available and these can be added to the resources available to the server with no change to the URI which continues to identify the specific resource.
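The selection step described above can be sketched in a few lines of Python. This is a simplified illustration, not a complete implementation of HTTP content negotiation: it honours `q` values and the `*/*` wildcard but ignores partial wildcards such as `text/*`. The `AVAILABLE` list and both function names are assumptions made for the example.

```python
# Representations this hypothetical server can produce, in order of preference.
AVAILABLE = ["text/html", "application/rdf+xml", "text/turtle"]


def parse_accept(header):
    """Parse an Accept header into (media_type, q) pairs, highest q first."""
    prefs = []
    for item in header.split(","):
        fields = [f.strip() for f in item.split(";")]
        media_type = fields[0]
        if not media_type:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs.append((media_type, q))
    # sorted() is stable, so entries with equal q keep their header order.
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def negotiate(accept_header, available=AVAILABLE):
    """Pick the media type to return for one URI, or None if nothing fits."""
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue
        if media_type == "*/*":
            return available[0]
        if media_type in available:
            return media_type
    return None


print(negotiate("text/turtle, text/html;q=0.5"))
```

Adding a new representation then means extending `AVAILABLE`; the URI being requested does not change.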
Tim Berners-Lee first addressed the issue of URI stability in his 1998 paper Cool URIs don't change. Essentially, the advice is to include data that will not change within the URI and to leave out anything that will. Very little is guaranteed never to change. Perhaps the only exception is the date of creation of a new resource such as a document. It may change status, author, owner, title etc. but its first creation date is something that can be included in a URI and this might be useful in some circumstances, but if it can be left out, then it should be. Where a URI identifies a previously-existing resource then its creation date cannot be known with sufficient certainty to include it in the identifier.
It may be considered that the subject of a document is something that will never change but this is not so. New drafts of the document may use a new title, a policy of using a particular taxonomy to describe subjects may be brought in and so on. So even something as 'stable' as a subject should not be included in any URI designed for long term survival.
A class of URI that deserves special mention is that of 'latest version.' Such a URI should be very stable but what it returns might vary frequently. The W3C publishing system gives a good example of this.
The URI https://www.w3.org/TR/vocab-org/ always points to the latest version of the Organisation Ontology. At the time of writing, the latest version is the one published on 23 October 2012 which has its own stable URI of https://www.w3.org/TR/2012/WD-vocab-org-20121023/ and that is the document returned from the short URI for now. When a new version is published, it will have its own identifier and the latest version URI will return that document when de-referenced.
https://www.w3.org/TR/vocab-org/ is a persistent URI and, like a news portal's homepage, is guaranteed to return the latest information, even when specific information has gone out of date.
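The relationship between a stable 'latest version' URI and its dated versions can be modelled as a simple registry. The sketch below is illustrative only: the `VersionRegistry` class and the example.org URIs are hypothetical, loosely modelled on the pattern described above, and say nothing about how the W3C publishing system is actually implemented.

```python
class VersionRegistry:
    """Map a stable 'latest version' URI to its dated, immutable versions."""

    def __init__(self):
        # latest-version URI -> list of dated URIs, in order of publication
        self._versions = {}

    def publish(self, latest_uri, dated_uri):
        """Record a new dated version; the latest-version URI is unchanged."""
        self._versions.setdefault(latest_uri, []).append(dated_uri)

    def resolve(self, uri):
        """Return the dated URI whose document is served for the given URI."""
        if uri in self._versions:
            return self._versions[uri][-1]
        return uri  # dated URIs always resolve to themselves


# Hypothetical URIs used purely for illustration.
LATEST = "http://example.org/TR/report/"
DATED_1 = "http://example.org/TR/2012/report-20121023/"
DATED_2 = "http://example.org/TR/2013/report-20130601/"

registry = VersionRegistry()
registry.publish(LATEST, DATED_1)
registry.publish(LATEST, DATED_2)
print(registry.resolve(LATEST))
```

Both kinds of URI persist: the short one keeps identifying "whatever is current", while each dated one keeps identifying exactly one document.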
This chapter sets out a series of case studies. The focus is very much on public sector use of URIs, particularly linked data. A variety of online sources were used including academic papers and official guidelines, and these were augmented by direct e-mail exchanges and communications.
This section reports on the persistent URI policies of EU agencies and services.
The Publications Office (OP) of the European Union began its CELLAR project in 2010 with the vision to make all the metadata and digital content it manages available at a single place in a harmonised and standardised way in order to:
From a presentation (PPT) given to the European Information Association by the head of the Enterprise Architecture Unit, Peter Schmitz, in March 2011, we can see that URI-based technologies are an integral part of CELLAR and that long term preservation was recognised as being important from the outset. When the project began, the OP was able to draw on in-house skills and expertise but to realise the project it was necessary to engage contractors. At the time of writing, initial data is being loaded into CELLAR ahead of its formal launch and further data will be added over the coming months.
Publishing information and committing to its long term preservation and stability has always been an important aspect of the OP's work. This organisation maintains several de facto schemas such as the EUR-Lex schema smartAPI that is already more than 10 years old and that can be used as a URI set. However, its use is no longer encouraged and, for reasons that will be discussed in later sections, URIs such as http://eur-lex.europa.eu/smartapi/cgi/sga_doc?smartapi!celexplus!prod!CELEXnumdoc&lg=en&numdoc=308R1008 must be considered brittle since they depend on a particular implementation. The adoption of linked data has put greater emphasis on the importance of URIs as identifiers and it is this change that has led to the development of CELLAR.
URIs need to be resolvable so that further information can be extracted from them and for information to be available in multiple languages and multiple formats. The URIs themselves are designed carefully for long term management and stability. Every URI begins with the same pattern:
http://publications.europa.eu/{type}/{subtype}
where there are just three possible values for {type}: resource, for content and metadata resources; ontology, for schemas; and webapi, for Web API services.
For example, editions of the Official Journal all have URIs beginning with:
http://publications.europa.eu/resource/oj/
where 'oj' acts as the subtype. Named Authority Lists begin with:
http://publications.europa.eu/resource/authority/
Beyond the second path component, the structure depends on the specific case. OJ editions all have identifiers based on their year of publication and edition within that year. Each edition of the OJ is available in multiple languages and these are included in the URI; so the first edition of the OJ from 1952, in German, is identified by:
http://publications.europa.eu/resource/oj/JOP_1952_001_R.DEU.
The entry for English in the Named Authority List for languages is
http://publications.europa.eu/resource/authority/language/ENG.
These URIs are inherently stable since any new Named Authority List can be added, with the list name as the third path element, and the specific entry in that list as the fourth. The name of a publication will always appear as the second path element and so on. The Publications Office consistently uses ISO 639-3 three-character language codes as the level of detail provided is the right one for OP’s multilingual environment.
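The fixed prefix and the three permitted {type} values lend themselves to a small validating builder. The following Python sketch is an illustration under stated assumptions, not the Publications Office's own tooling: the function name `op_uri` is invented here, and only the base URI, the type values and the two example URIs come from the text above.

```python
BASE = "http://publications.europa.eu"

# The three values of {type} listed in the text above.
ALLOWED_TYPES = {"resource", "ontology", "webapi"}


def op_uri(type_, subtype, *rest):
    """Build a URI of the form {BASE}/{type}/{subtype}/... and reject unknown types."""
    if type_ not in ALLOWED_TYPES:
        raise ValueError(f"unknown type: {type_!r}")
    if not subtype:
        raise ValueError("a subtype is required")
    return "/".join([BASE, type_, subtype, *rest])


# Entry for English in the Named Authority List for languages.
print(op_uri("resource", "authority", "language", "ENG"))
# First OJ edition of 1952, in German.
print(op_uri("resource", "oj", "JOP_1952_001_R.DEU"))
```

Because new authority lists only add a new third path element, a builder like this needs no change when a list is added.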
The OP makes extensive use of content negotiation (section 2.3) and language negotiation. Many items published by the OP are 'works' within the FRBR sense of the word. Each work has its own URI and CELLAR returns a specific manifestation of that work based on the HTTP Request headers.
CELLAR makes use of its HTTP server's native support for content negotiation for the first of these but not the second. That is, the server inspects the Accept header in the HTTP request and returns HTML (for humans), RDF or XML. (All HTTP requests include information about the device making the request. In the case of a Web browser, this will include the type of browser, operating system and language preferences.) One of these three formats is always returned. HTTP is less deterministic for languages so that if the request header specifies a language in which the particular work is not available, the server response can vary between implementations. The canonical response is either 'No Acceptable Variant' or 'Multiple Choices' - neither of which may be helpful for some users and so CELLAR uses its own software to always return a representation of the work. Each manifestation of a work, that is, a particular version of the requested resource in a specific language and specific data format, has its own URI and this can of course be accessed directly.
As noted, the Publications Office has designed its URIs to survive for the long term. The intention is clear: that URIs will persist and will continue to mean the same thing over the long term. However, no organisation can be certain about its future and a commitment to maintain a service indefinitely is very hard for anyone to make. The recent European Council Decision to support European Legislation Identifiers (ELI) was widely welcomed. However, Member States have not translated that into a firm commitment to guarantee the stability of ELIs over the long term. Similarly, like any organisation, the OP itself is subject to political change and it is clearly possible that the office might be reorganised, re-named, merged or split. Such eventualities cannot be foreseen or guaranteed against.
The best guarantee of persistence is usefulness. The OP is making its best effort to create a stable URI set, designed, managed and published with longevity in mind. The OP has a track record of maintaining URIs for more than 10 years and cannot today see any reason why the URIs embodied in CELLAR will not persist. The subdomain, publications.europa.eu, is as stable as any can be and was chosen deliberately for that reason. Even if the name of the institution were to change, the subdomain is sufficiently generic that it could easily survive. This would not be the case if, for example, the name CELLAR, i.e. the project that created it, had been used as the subdomain.
The systems behind the various publications handled by the OP vary depending on the type of publication itself. The OJ is developed as a separate document whereas the Named Authority Tables (reference tables used throughout the European Institutions) are managed directly by the OP and edited using Microsoft Excel. From there an XML document is created and this becomes the Master file. Various scripts are then used to generate the HTML and RDF versions of the tables. Importantly, the output of each of those scripts is a static file so that the resolution of each URI is not dependent on a dynamic process; this adds stability to the system.
Recently, several Directorates-General of the Commission have formed an informal working group on persistent URIs. This group was ‘officially’ launched in February 2012. Its objective is to propose a solution on how to proceed with the compilation of guidelines that can be used to create consistent URIs and a common URI assignment policy. The aim of such guidelines would be to define a working, scalable and performant URI model, that can be rolled out Commission-wide, to uniquely identify each physical item (e.g. data objects like 'toxic substance' or a NUTS region), each abstract concept (e.g. 'governance', styles, map layers) and each core vocabulary and dataset.
The WG focuses on three areas, which are reflected in the recommendations of chapter 4:
According to the WG, a URI model should follow the following form:
{URI Root}/{Resource path}/{ID}/{String}/{Options}
Where:
Only the URI Root and Resource Path are necessary for all resources. The URI root can have the following form: http://{Europa Home URI}. The standard URI Root can be:
http://ec.europa.eu/URI/.
The Resource Path is a hierarchical scope definition for the URI and can have the form:
{Policy Definition}/{Sub-domain Definition}*/{entity}
For example http://ec.europa.eu/URI/health/indicators/echi is made of the following components:
A URI structure with a fixed part and loosely defined part is sufficient to specify entities as well as instances in the case of integral resources (such as big datasets, collections and queries on them). The options part remains to be further defined but models like Open Data Protocol can serve to establish specific guidelines at a later stage.
N.B. This is an initial proposal under review (valid at the time that this report was compiled). It is expected to evolve through the discussions of the Working Group. In particular, it seems very likely that the Working Group will decide to remove the policy definition from the URI as this information is volatile and thus subject to change.
The WG also published a set of recommendations to be considered when EU agencies and services design URIs.
http://ec.europa.eu/opendata
europa.eu/health/guidelines not europa.eu/health/1235564798765465498;
europa.eu/health/… not europa.eu/healthy/;
An open question remains: will one DG become the single, central authority for managing the attribution of URI sets or will there be a set of authorities depending on the policy content?
Under the LATC Support Action, a team at DERI created a linked data version of the Eurostat data. This is a sophisticated flagship system that gets top marks in the 5 Stars of Linked Open Data devised by Tim Berners-Lee. As Figure 2 shows, 5 star data is published in RDF and linked to other data sets, all under an open licence. In addition to meeting the 5 star criteria, the data is updated each week and a policy is in place that makes a best effort to ensure that the system persists beyond the life of the project. The Eurostat Linked Data project not only created the data but also produced a number of freely available tools and was the basis of the MSc thesis of DERI’s Sarven Capadisli.
★ | Available on the web (whatever format) but with an open licence, to be Open Data |
★★ | Available as machine-readable structured data (e.g. excel instead of image scan of a table) |
★★★ | as (2) plus non-proprietary format (e.g. CSV instead of excel) |
★★★★ | All the above plus, Use open standards from W3C (RDF and SPARQL) to identify things, so that people can point at your stuff |
★★★★★ | All the above, plus: Link your data to other people’s data to provide context |
The base URI proposed for Eurostat’s data is:
http://eurostat.linked-statistics.org/
Capadisli and Hausenblas say that they decided to keep the same file name for the metadata and the actual dataset containing observation values as they appear in the original data and distinguish them by using dsd and data in the URI pattern. The code lists shared among all datasets are provided by using dic in the URI pattern. Hence the following URI patterns are defined for:
Metadata: http://eurostat.linked-statistics.org/dsd/{id}
where id is one of the dataset’s metadata files.
Datasets: http://eurostat.linked-statistics.org/data/{id}
where id is the filename of the dataset containing observation values.
Code lists: http://eurostat.linked-statistics.org/dic/{id}
where id is the filename of the dictionary.
Observations: http://eurostat.linked-statistics.org/data/{dataset}#{dimension1},{dimensionN}
where the order of dimension values in the URI space depends on the order of dimension values present in the dataset.
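The four patterns above can be expressed as small helper functions. This is a minimal Python sketch, not code from the Eurostat Linked Data project: the function names are invented for illustration, and the dataset identifier and dimension values in the usage line are hypothetical placeholders.

```python
BASE = "http://eurostat.linked-statistics.org"


def metadata_uri(id_):
    """URI of a dataset's metadata (data structure definition)."""
    return f"{BASE}/dsd/{id_}"


def dataset_uri(id_):
    """URI of the dataset containing observation values."""
    return f"{BASE}/data/{id_}"


def codelist_uri(id_):
    """URI of a code list (dictionary) shared among datasets."""
    return f"{BASE}/dic/{id_}"


def observation_uri(dataset, dimension_values):
    """URI of one observation.

    dimension_values must be given in the order in which the dimensions
    appear in the dataset, because that order is part of the identifier.
    """
    return f"{dataset_uri(dataset)}#{','.join(dimension_values)}"


# Hypothetical dataset id and dimension values, for illustration only.
print(observation_uri("example_dataset", ["A", "B", "C"]))
```

Note that the same file name yields both the metadata and the dataset URI; only the `dsd` or `data` path element distinguishes them.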
The Eurostat Linked Data project is also a rare example of a multilingual dataset. Although the language used to mint a URI may be apparent, technically, all URIs are dumb strings and therefore language-neutral (see section 2.2). Any number of labels may be attached to URIs in any language as set out by Jose Emilio Labra Gayo in Dublin at the Linked Open Data and MultilingualWeb workshop in June 2012. The Eurostat Linked Data Project home page includes some sample SPARQL queries. Queries like this are executed behind the scenes when URIs such as http://eunis.eea.europa.eu/species/124054/linkeddata are de-referenced.
The Web page seen in a browser offers a human reader a number of options for navigating around the data (note the tabs such as 'General Information' in Figure 3). When the same URI is de-referenced by a machine that accepts RDF, all the data is returned, including the vernacular names. The careful design and publication of persistent URIs does not imply the publication of multilingual 5 star linked data. Persistent URIs are also an important component in less sophisticated systems that target interoperability. However, if public administrations are to enjoy the maximum benefit of this powerful technology, for their own internal data management as much as reporting to the outside world, then the Eurostat Linked Data offers a showcase for what can be achieved and how to achieve it.
<SpeciesSynonym rdf:about="species/124054">
  <speciesCode>124054</speciesCode>
  <foaf:isPrimaryTopicOf rdf:resource="124054/general"/>
  <binomialName>Thunnus alalunga</binomialName>
  <validName rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean">true</validName>
  <eunisPrimaryName rdf:resource="species/124054"/>
  <taxonomicRank>Species</taxonomicRank>
  <taxonomy rdf:resource="taxonomy/2201"/>
  <dwc:scientificNameAuthorship>(Bonnaterre, 1788)</dwc:scientificNameAuthorship>
  <dwc:scientificName>Thunnus alalunga</dwc:scientificName>
  <rdfs:label>Thunnus alalunga (Bonnaterre, 1788)</rdfs:label>
  <dwc:genus>Thunnus</dwc:genus>
  <speciesGroup rdf:resource="speciesgroup/2"/>
  <dwc:nameAccordingToID rdf:resource="references/1785"/>
  <ignoreOnNameMatch rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean">false</ignoreOnNameMatch>
</SpeciesSynonym>
<rdf:Description rdf:about="species/124054"></rdf:Description>
<rdf:Description rdf:about="species/124054">
  <dwc:vernacularName xml:lang="sq">Ton pendgjate</dwc:vernacularName>
  <dwc:vernacularName xml:lang="de">Thun</dwc:vernacularName>
  <dwc:vernacularName xml:lang="de">Thunfisch</dwc:vernacularName>
  <dwc:vernacularName xml:lang="de">Weißer thun</dwc:vernacularName>
  <dwc:vernacularName xml:lang="da">Albacore</dwc:vernacularName>
  <dwc:vernacularName xml:lang="da">Hvid tun</dwc:vernacularName>
  <dwc:vernacularName xml:lang="da">Tun</dwc:vernacularName>
  <dwc:vernacularName xml:lang="es">Albacora</dwc:vernacularName>
  …
Example 1 - The EEA Linked Data concerning Thunnus alalunga returned to a user agent that accepts RDF at https://eunis.eea.europa.eu/species/124054/linkeddata
The UK was an early adopter of open data and of linked data in particular. Its data portal, data.gov.uk, went online on 30th September 2009, 4 months after the US portal at data.gov. An important difference between the two is that the British portal has from its inception put greater emphasis on the added value of linked data.
In fact, the UK portal has always given prominence to linked data as Figure 4 makes clear with its reference to SPARQL (these days the link is to 'Linked Data'). The personal involvement of Tim Berners-Lee and Nigel Shadbolt is an important factor here as both are staunch advocates of linked data and therefore the use of stable URIs.
It was against this background that a group within the UK government wrote a document called Designing URI Sets for the UK Public Sector (PDF), published in October 2009, less than 2 weeks after data.gov.uk went online. The document covers many design issues (see next section) to ensure persistence; these are copied below.
General
Requirements for URI sets that are promoted for re-use
http://education.data.gov.uk
This:
The first point highlights that a URI data set requires a resilient technical infrastructure. The first point under item 3 is stark: "expect to be maintained in perpetuity." That is a bold statement and goes well beyond what anyone can reasonably predict; however, the key word in that bullet point is 'expect.' What the guidelines are highlighting is the need to think for the long term and to show the intention that the URIs will persist.
The data.gov.uk domain is put forward as the long term domain to use although it has not been possible to identify a published URI persistence policy for the domain. Using that as the upper domain name means that sectors can then be defined as sub-domains rather than using a public-facing Web site that will be on a domain name that reflects the departmental name. Departments come and go - anecdotally, the average lifespan of a UK government department is 4 years. A case that might have been in the mind of the authors of Designing URI Sets for the UK Public Sector was the Department for Children, Schools and Families, known before June 2007 as the Department for Education and Skills and since May 2010 as the Department for Education.
The UK Government has now adopted a policy (PDF) that explicitly recognises the value of URIs as identifiers and it is clear from the policy that government URI data sets are expected to persist over the long term.
Designing URI Sets for the UK Public Sector suggests that URIs should include the following components:
URI type, for example one of:
These elements can be combined into a general URI pattern thus:
http://{domain}/{type}/{concept}/{reference}
where {domain} is a combination of the host (e.g. data.gov.uk, europa.eu etc.) and the relevant sector ('transport', 'education' etc. in the case of the UK, 'resource', 'ontology' or 'webapi' in the case of the Publications Office). It is a matter of choice whether the sector is defined as a sub-domain of the host or as the first component of the path, so that http://transport.data.gov.uk of the UK government and the example shown before, of the Publications Office, http://publications.europa.eu/resource are both valid values for the {domain} variable.
The {type} element will vary between different use cases. At data.gov.uk it will be one of the types listed immediately above; for Dublin Core (section 3.3.1) it will be one of 'terms', 'dcmitype' etc. and for CELLAR (section 3.1.1) it will be something like 'oj' or 'authority.' In Europeana (section 3.4.3) it will be simply 'item' meaning that the URI identifies an item in a collection.
The {concept} and {reference} elements are also found in different forms across many examples but there will be more variation depending on the specific case. For Dublin Core (section 3.3.1) terms it is enough to simply add the relevant term on to:
http://purl.org/dc/terms/{term}
The UK example already given shows how a URI such as:
http://transport.data.gov.uk/id/road/B3178
is minted (the B3178 is a road in Devon) and in Europeana we see URIs such as http://data.europeana.eu/item/00000/E2AAA3C6DF09F9FAA6F951FC4C4A9CC80B5D4154 where '00000' identifies the collection (this can be thought of as the {concept}) and the long string identifies the specific item within that collection (the {reference}).
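To make the general pattern concrete, the following Python sketch splits a URI into its four components. It is a simplification under stated assumptions: it only handles the case where the sector is part of the host name (as at data.gov.uk and Europeana), not the case where the sector is the first path component, and the `UriParts` type and `split_uri` function are names invented for this example.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class UriParts(NamedTuple):
    domain: str
    type: str
    concept: str
    reference: str


def split_uri(uri):
    """Split a URI of the form http://{domain}/{type}/{concept}/{reference}."""
    parts = urlsplit(uri)
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) != 3:
        raise ValueError(f"expected three path segments, got {len(segments)}")
    return UriParts(parts.netloc, *segments)


print(split_uri("http://transport.data.gov.uk/id/road/B3178"))
print(split_uri(
    "http://data.europeana.eu/item/00000/"
    "E2AAA3C6DF09F9FAA6F951FC4C4A9CC80B5D4154"
))
```

Nothing in the split depends on what the segments mean, which reflects the point that the same four-part shape recurs across otherwise unrelated URI sets.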
A final design feature captured in Designing URI Sets for the UK Public Sector and repeated elsewhere is the addition of a file extension at the end of a URI that reflects the likely media type. For example, pasting http://transport.data.gov.uk/id/road/B3178 (the URI for the B3178) into a Web browser gives a 303 redirect to the document that describes that road, which is at http://transport.data.gov.uk/doc/road/B3178 (note the substitution of 'doc' for 'id'). The data shown at that page was available in 7 different formats: HTML, CSV, JSON, RDF/XML, Text, Turtle and XML.
The particular version returned when http://transport.data.gov.uk/doc/road/B3178 is de-referenced will be decided through content negotiation (section 2.3) but it was possible to refer to a specific representation of the data by appending the relevant file extension, so that appending '.ttl' (http://transport.data.gov.uk/doc/road/B3178.ttl), for example, will return exactly the same data as seen as an HTML page in a regular Web browser but encoded as RDF and serialised in Turtle. Importantly, every representation of the document includes links to all the others. A similar set up can be seen, for example, at Open Corporates (see https://opencorporates.com/companies/gb/04285910 for example).
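The behaviour described above, a 303 redirect from the 'id' URI to the 'doc' URI plus an optional extension that pins a representation, can be sketched as follows. This is an illustrative Python outline rather than the Linked Data API or the data.gov.uk implementation; the `respond` function and the extension-to-media-type table are assumptions (only the '.ttl' extension is named in the text).

```python
# Assumed mapping from file extension to media type; only '.ttl' is
# mentioned in the text, the rest are plausible illustrative entries.
EXTENSIONS = {
    "html": "text/html",
    "csv": "text/csv",
    "json": "application/json",
    "rdf": "application/rdf+xml",
    "ttl": "text/turtle",
}


def respond(path):
    """Return (status, detail) for a request path.

    - /id/...       -> 303 and the matching /doc/... location
    - /doc/....ext  -> 200 and the media type fixed by the extension
    - /doc/...      -> 200 and None (format left to content negotiation)
    """
    segments = [s for s in path.split("/") if s]
    if not segments:
        return 404, None
    if segments[0] == "id":
        return 303, "/" + "/".join(["doc"] + segments[1:])
    if segments[0] == "doc":
        _stem, dot, ext = segments[-1].rpartition(".")
        if dot and ext in EXTENSIONS:
            return 200, EXTENSIONS[ext]
        return 200, None
    return 404, None


print(respond("/id/road/B3178"))
print(respond("/doc/road/B3178.ttl"))
```

The identifier itself never carries an extension; extensions appear only on document URIs, so the advice against file extensions in persistent identifiers still holds.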
This kind of functionality, and the ability to resolve well-designed URIs at different levels of thetree, is provided by the Linked Data API (broken link removed). Although adopted across data.gov.uk, it is not arecognised standard and is not used as widely as the designed principles on which it is based.
To persist, a URI must be designed to persist. In summary: that which is subject to changeshould not be included in a URI. That which is permanent should be included. Notes on what toleave out of URIs are discussed insection 2.4. This section focuses on what to include.
The UK Government's Designing URI Sets for the UK Public Sector paper again provides a key reference point. It was expanded upon shortly after its publication in an influential series of blog posts by Jeni Tennison, then lead developer at legislation.gov.uk and now Technical Director at the Open Data Institute. The ideas have been extended substantially by Leigh Dodds and Ian Davis in their book Linked Data Patterns. Like Jeni Tennison, Leigh Dodds and Ian Davis were directly involved in the development of Designing URI Sets for the UK Public Sector and, although their book is more recent (it was published in May 2012) and goes into considerably more depth, the thinking behind it is the same.
The core ideas in all these documents stem originally from a W3C Semantic Web Interest Group Note, Cool URIs for the Semantic Web, which in turn is a development of Tim Berners-Lee's original Cool URIs Don't Change document from 1998. The design principles have been collected, re-phrased and discussed in other publications, such as those of the American federal data portal, data.gov, and Tom Heath and Chris Bizer's book Linked Data: Evolving the Web into a Global Data Space, but the core ideas have not changed and can be seen repeated everywhere with little variation.
As discussed in previous sections, the choice of domain name is a critical one: the domain name itself must be one that is expected to persist for as far into the future as anyone can reasonably see. In many cases, this means establishing or using a service specifically established to provide that stability.
Designing other URI components requires an understanding of what a given URI actually refers to: what it identifies and what it does not. The use of HTTP URIs as identifiers is extremely powerful since it allows computers to look up information about a given 'thing' and to recognise that wherever the same URI is used, it identifies the same 'thing.' However, there is a useful distinction to be made between URIs that identify real world things as opposed to concepts and between a single piece of data and a set of data. There is a more fundamental difference between a real world object, such as a school or a person, and a document that describes that object. This difference is at the heart of a discussion that has been going on for more than a decade under the arcane title of HttpRange-14. W3C's Sandro Hawke gathered many references to the discussion when it first became 'old' - in 2003. The W3C Technical Architecture Group (TAG) - the permanent Working Group that oversees the overall direction of Web technologies - is still grappling with the issue. An article by Mike Bergman from January 2012, Give Me a Sign: What Do Things Mean on the Semantic Web?, offers a philosophical discussion of the issue.
In a nutshell, if a URI identifies a physical object that cannot be transmitted over the Internet, what should be returned instead when the URI is de-referenced?
The usual solution - that is, the solution agreed by the TAG in 2005 - is that where a URI refers to an information resource, i.e. something that can be represented as a stream of bytes, the data should be returned. Where the URI identifies a non information resource (like a building or a person) then the HTTP server should offer a 303 'See Other' re-direct to a different URI that identifies a document that describes the object. This solution has many adherents as it uses existing HTTP infrastructure and is faithful to the semantics. However, the additional round trip to the server to fetch the document describing the thing you asked about in the first place, and the fact that you very often can't copy and paste a URI from a browser window into another document (because it's changed), are things that many developers and practitioners are unhappy with - hence the endless discussion. Be that as it may, the resolution is embodied in all the URI design patterns seen in the course of compiling this document.
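The 303 resolution can be illustrated with a small, framework-free sketch. It assumes the '/id/' and '/doc/' path convention seen in the UK examples earlier; the function name, the in-memory document store and the fixed Content-Type are hypothetical simplifications rather than a description of any particular server.

```python
def resolve(path: str, documents: dict) -> tuple:
    """Return (status, headers, body) for a request path.

    Paths under /id/ name real-world things, which cannot be sent over
    HTTP, so the client is redirected to a describing document under /doc/.
    """
    if path.startswith("/id/"):
        location = "/doc/" + path[len("/id/"):]
        return 303, {"Location": location}, ""
    if path in documents:
        # An information resource: return its representation directly.
        return 200, {"Content-Type": "text/html"}, documents[path]
    return 404, {}, ""


# Hypothetical in-memory store holding one describing document.
store = {"/doc/road/B3178": "<html>Description of the B3178</html>"}

print(resolve("/id/road/B3178", store))   # redirect to the document
print(resolve("/doc/road/B3178", store))  # the document itself
```

The extra round trip that critics object to is visible here: a client asking about the road needs two calls before it receives any data.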
Estonia, a country known for its advanced use of e-Government services, has recently published its Framework of Websites version 1.0 document, part of the Interoperability Framework of the State Information System version 3.0. These documents do not address URIs in the sense discussed in this document, i.e. as identifiers used in data sets.
Priit Parmakson of the Estonian Information Systems Authority provided us with a comprehensive overview of initiatives in Estonia that deal with persistent URI design rules and management.
Although there is no official URI policy, Estonia's Framework of Websites document does offer guidelines on Web site construction. All public administrations in Estonia are required to publish a Web site and within it there are certain sections that must be present at defined URLs. For example, the contact information must be at http://{domain}/kontakt, the news must be at http://{domain}/uudised and so on. We can draw from these examples the following URI model:
{domain}/{type}/
The above pattern is not always used as there are examples of the use of URIs as identifiers within datasets. More information is provided in the next section.
The Estonian Land Cadastre, operated by the Estonian Land Board (Eesti Maa-amet), has a public interface where cadastral units can be accessed by permalinks in the form https://xgis.maaamet.ee/ky/72704:004:0430. This is an interesting case akin to ANDS (section 3.4.2) as the online service is being used to resolve a non-URI identifier. In this case, 72704:004:0430 is the permanent identifier. The bulk of the URI (http://xgis.maaamet.ee/ky/FindKYByT.asp?txtCU=) is clearly not designed for persistence because it names the script that handles the request (FindKYByT.asp) and includes a query string (?txtCU=), and thus the URI reveals a lot about the system behind it (it's looking up a value in a database). The replacement either of the database system or of ASP would almost certainly require a change in the URI and so it is unlikely to persist. We understand that URI persistence may not have been a documented requirement when this system was developed. It may also be that in this case the need for persistence is not considered particularly relevant.
A similar situation applies to the Estonian National Place Names Register (Riigikohanimeregister). Owned by the Ministry of Interior Affairs, and operated by the Estonian Land Board, the register has a publicly accessible service (avalik liides) where a comprehensive database of Estonian place names can be accessed by URLs in the form:
http://xgis.maaamet.ee/knravalik/knr?obj_id=3375
Here again we can surmise that the actual identifier is the number 3375 and that a query is being made to a relational database rather than data being provided directly through a URI such as http://xgis.maaamet.ee/knravalik/knr/3375, which would fit the pattern used elsewhere.
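The difference between the two URL styles is mechanical, as the following sketch suggests. It uses Python's standard urllib.parse module to lift the identifier out of the query string and into the path. The function name is hypothetical, and nothing here implies that the register itself supports the path-style form.

```python
from urllib.parse import parse_qs, urlsplit, urlunsplit


def path_style(url: str, param: str) -> str:
    """Rewrite '...?param=value' as '.../value', dropping the query string."""
    parts = urlsplit(url)
    values = parse_qs(parts.query).get(param)
    if not values:
        raise ValueError("parameter " + param + " not found in " + url)
    new_path = parts.path.rstrip("/") + "/" + values[0]
    # An empty query and fragment leave only scheme, host and path.
    return urlunsplit((parts.scheme, parts.netloc, new_path, "", ""))


print(path_style("http://xgis.maaamet.ee/knravalik/knr?obj_id=3375", "obj_id"))
```

A publisher adopting the path style would typically keep the old query-string URLs working by redirecting them, so that existing links do not break.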
The Estonian State Gazette (Riigi Teataja), the national database of legislation, owned by the Ministry of Justice, has made all Estonian law referenceable and de-referenceable, down to paragraph level. For example, the Public Information Act (Avaliku teabe seadus) is accessible at https://www.riigiteataja.ee/akt/122032011010?leiaKehtiv.
In this instance, the query string (?leiaKehtiv) is not being used as part of the identifier as such. Instead, it is an instruction to return the latest version. Although not directly in line with the recommended practice of assigning a stable URI that always points to the latest version with different URIs for each draft (section 2.5), it is closer in spirit than the other examples. It is also noteworthy that individual paragraphs within the legislation can be accessed directly using a fragment identifier, so that to refer to § 43¹ one can use:
https://www.riigiteataja.ee/akt/122032011010?leiaKehtiv#para43b1
There are several stable URI patterns in use in Estonia. The authentic description of each public sector information system has its own URI: for example, the Riiklik ehitisregister (Estonian Register of Buildings) is identified by https://riha.eesti.ee/riha/main/inf/riiklik_ehitisregister, and an ontology such as the Ontology of Health Insurance is identified as https://riha.eesti.ee/riha/onto/ravikindlustus/2008/r2, with individual terms appended so that the role of an Assistance Doctor (abiarst) is identified within the relevant ontology as:
http://riha.eesti.ee/riha/onto/toohoivejasotsiaalkysimused/ravikindlustus/2008/r1/abiarst
Although these URIs have obviously been carefully designed to persist, they are not fully conformant with the practices examined so far as they include the version number/date of the ontology, a practice which may create persistence problems (see section 3.3.1 for an example of where this goes wrong).
Estonia is already refining its URIs. An example of this is the requirement to identify objects with URIs in the Estonian Open Data Guidelines. In the summer of 2012, a programme of open data projects was launched with a budget of € 1.3 million. The 8-10 projects now under way are dealing with the establishment of URI schemes for different types of objects. For example, the national sports person database already has a convenient URI scheme in use: http://www.spordiinfo.ee/esbl/biograafia/Ilmar_Ruus. This is exactly in line with the persistent URI design discussed earlier, having the basic structure of:
http://{domain}/{type}/{concept}/{reference}
(section 3.2.1.1).
The introduction of URIs as identifiers in the Estonian public sector should be viewed in the larger context of naming and identification practices, which are recognised as being extremely important for interoperability and data quality. One current large programme, where identifier issues play a major part, is Estonia's Computerised Census 2020. This involves close linking of data from a number of base registers and several projects are under way that review and improve identifier systems in these registers.
The Agenzia per l’Italia Digitale has recently published a set of guidelines for achieving semantic interoperability in the public sector through Linked Open Data. As part of their Linked Open Government Data roadmap, the agency is planning to gradually open up data about administration, public contracts, geo-data and taxonomies, people with key positions in the public sector, public services and organisational units.
We have interviewed Dr. Giorgia Lodi and Mr. Antonio Maccioni to find out how the agency is dealing with persistent URI design rules and management for their LOGD. Mr. Maccioni said that the agency's URI strategy was inspired by the following two reference initiatives:
This is another indication of the great impact that these two initiatives have on URI design, which makes them important references.
Dr. Lodi, Mr. Maccioni and their team have specified URI structures for classes, properties, instances and individuals. The base URI for all resources is http://spcdata.digitpa.gov.it.
The URIs for classes must follow the following structure:
http://spcdata.digitpa.gov.it/{concept name}
Different values for {concept name} are possible depending on the type of the class, e.g. administration, organisational unit or public contract.
Unlike the suggestion of the UK guidelines, the Italian team decided not to use the government sector in the URI and to refer directly to the concept name. According to Mr. Maccioni, removing the public administration’s hierarchical structure from the URI structure will increase reusability across different sectors, which in turn fosters the vision of a unique global knowledge space known as the semantic web.
The URIs for properties must follow the following structure:
http://spcdata.digitpa.gov.it/{property name}
while the URIs for instances must be structured as follows:
http://spcdata.digitpa.gov.it/{concept name}/{natural key}
In this case, {concept name} refers to the class to which the instance belongs and {natural key} denotes a unique alphanumeric identifier of the instance, e.g. the legal identifier of a public administration agency or a post code (depending on the nature of the instance).
A key design decision of the team is that all terms in a URI will be in Italian. This is expected to limit the complexity of the URIs and make them more understandable and easier to interpret, especially among a non-technical audience. This is pursued while avoiding duplicate URIs and reusing external ontologies and vocabularies. In fact, when third party classes are used, class names are translated into Italian to be used as concept names. For instance, if the class http://purl.org/goodrelations/v1#ProductOrService is used for describing products (i.e. defined by natural keys prod-id1, prod-id2, etc.), the URIs of the products will be formed as follows:
<http://spcdata.digitpa.gov.it/Prodotto/prod-id1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/goodrelations/v1#ProductOrService> .
<http://spcdata.digitpa.gov.it/Prodotto/prod-id2> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/goodrelations/v1#ProductOrService> .
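Statements of this shape follow directly from the instance pattern, as the sketch below suggests. The function names are hypothetical, and percent-encoding the concept name and natural key is an assumption added for safety, not something the Italian guidelines are known to prescribe.

```python
from urllib.parse import quote

BASE = "http://spcdata.digitpa.gov.it"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def instance_uri(concept_name: str, natural_key: str) -> str:
    """Build {base}/{concept name}/{natural key}.

    Assumption: both parts are percent-encoded so that characters such as
    spaces or slashes cannot alter the URI structure.
    """
    return "/".join([BASE, quote(concept_name, safe=""), quote(natural_key, safe="")])


def type_triple(concept_name: str, natural_key: str, class_uri: str) -> str:
    """Return one N-Triples line typing the instance with an external class."""
    subject = instance_uri(concept_name, natural_key)
    return "<" + subject + "> <" + RDF_TYPE + "> <" + class_uri + "> ."


goodrelations = "http://purl.org/goodrelations/v1#ProductOrService"
for key in ("prod-id1", "prod-id2"):
    print(type_triple("Prodotto", key, goodrelations))
```

The Italian concept name appears only in the instance URI; the class URI from the external vocabulary is reused unchanged, which is what keeps the data interoperable.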
Finally, the URI structure for individuals is as follows:
http://spcdata.digitpa.gov.it/{individual name}
where {individual name} can for example be the name of a dataset.
Mr. Maccioni said that currently their persistent URI infrastructure does not support content negotiation but that this is part of their future plans.
Despite not yet having an official policy on persistent URIs in place, a number of EU Member States are aware of the importance of URI persistence and plan to create such a policy in the near future.
Peter Krantz provided us with a comprehensive overview of initiatives in Sweden that deal with persistent URI design rules and management. Sweden has seen a small number of linked data projects such as Swedish Open Cultural Heritage (SOCH) K-samsök and one at the National Library of Sweden described in Making a Library Catalogue Part of the Semantic Web. The latter uses URIs that are similar to some of the other initiatives:
http://libris.kb.se/resource/bib/{number}
for bibliographic records and for authority records:
http://libris.kb.se/resource/auth/{number}.
Notice the type of either bib or auth. Dereferencing an example URI of https://libris.kb.se/bib/7771917 gives either a Web page or RDF data depending on the user agent used (i.e. via content negotiation, see section 2.3). Although there are no (known) official Swedish government guidelines to refer to, Sveriges Domstolar, the Swedish Courts, have published guidance on publishing URIs for legal information. This paper repeats much of the advice we have already seen and cites many of the same reference documents.
<rdf:Description rdf:about="http://libris.kb.se/resource/bib/7771917">
  <dc:identifier rdf:resource="URN:ISBN:9188420566"/>
  <dc:description xml:lang="sv">Li:S</dc:description>
  <dc:publisher>Tranan</dc:publisher>
  <rdfs:isDefinedBy rdf:resource="http://data.libris.kb.se/open/bib/7771917.rdf"/>
  <dc:title xml:lang="sv">Vitlöksballaderna</dc:title>
  <dc:description xml:lang="sv">Första svenska uppl. 2001</dc:description>
  <dc:date>2001</dc:date>
  <dc:creator rdf:resource="http://libris.kb.se/resource/auth/205835"/>
  <bibo:isbn10>9188420566</bibo:isbn10>
  <dc:language rdf:resource="http://purl.org/NET/marccodes/languages/chi#lang"/>
  <dc:creator>Mo, Yan, 1955-</dc:creator>
  <dc:creator>Yan Mo</dc:creator>
  <dc:language rdf:resource="http://purl.org/NET/marccodes/languages/swe#lang"/>
  <dc:type>text</dc:type>
  <rdf:type rdf:resource="http://purl.org/ontology/bibo/Book"/>
  <dc:description xml:lang="sv">NB: En annan version av orig. titel, se 740</dc:description>
  <rda:placeOfPublication rdf:resource="http://purl.org/NET/marccodes/countries/sw#location"/>
</rdf:Description>
Example 2 - Some of the RDF data returned from http://libris.kb.se/bib/7771917
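How a server might choose between a Web page and RDF for the same URI can be sketched as a simplified reading of the HTTP Accept header. This is not the Libris implementation: the function is hypothetical, it ignores partial wildcards such as text/*, and falling back to the first available type when nothing matches is an assumption (a stricter server could answer 406 Not Acceptable instead).

```python
def negotiate(accept_header: str, available: list) -> str:
    """Pick the available media type that the client prefers most."""
    preferences = []
    for item in accept_header.split(","):
        fields = item.strip().split(";")
        media_type = fields[0].strip().lower()
        quality = 1.0  # default weight when no q parameter is given
        for field in fields[1:]:
            name, _, value = field.strip().partition("=")
            if name.strip() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        if media_type:
            preferences.append((quality, media_type))
    # sorted() is stable, so equal weights keep their order in the header.
    for quality, media_type in sorted(preferences, key=lambda p: -p[0]):
        if quality <= 0:
            continue
        if media_type in available:
            return media_type
        if media_type == "*/*":
            return available[0]
    return available[0]


formats = ["text/html", "application/rdf+xml"]
print(negotiate("text/html,application/xhtml+xml,*/*;q=0.8", formats))
print(negotiate("application/rdf+xml", formats))
```

A browser-style header selects the HTML page, while a client that asks only for application/rdf+xml receives the RDF, which matches the behaviour described above.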
Thodoris Papadopoulos of the Ministry of Administrative Reform and eGovernance, Greece, provided us with a comprehensive overview of initiatives in Greece that deal with persistent URI design rules and management.
Greece does not yet have a formal policy on persistent URIs, nor any official open data published as linked data. Like Estonia, it does have some rules on Web site URLs. However, Greece still lacks Linked Data building blocks. For example, the list of Official Government Categories and the Unique Lexical Identifiers (Latin identification description strings) for most of the Greek Public Organizations are published online but the URI in use is not designed for persistence:
http://www.ermis.gov.gr/portal/page/portal/ermis/egcl?p_topic=perivallon_kai_fysikoi_poroi
(the Environment and Natural Resources public services).
The above URI is tightly coupled to the particular portal implementation. This means that the URI will be impacted once the portal changes. With proper URI design this should not happen, because such a change affects every client application that uses the portal.
However, the query string perivallon_kai_fysikoi_poroi is perhaps a component of a future persistent URI scheme. It is noteworthy that the Greek authorities recognise the value in adopting a standards-based approach, an attitude that is being encouraged by non-governmental activities such as publicspending.gr. There is a willingness to participate in, for example, the ISA Programme and recognition that the design of URIs should take into account the following elements:
These are the factors highlighted in Designing URI Sets for the UK Public Sector, which again was cited in research for this document.
Finland does not at the moment have a national policy on persistent URIs. However, according to Ms. Kauhanen-Simanainen, Ministerial Advisor at the Finnish Ministry of Finance, Finland is planning to develop common guidelines for this in the near future.
The National Library of Finland recognises that URIs should always be persistent, both when publishing linked data and when using links to refer to documents and other online resources. To ensure this, they argue that persistent identifiers, such as URNs, should be used as URIs. In this case, mapping from a persistent identifier to current location(s) would be maintained centrally in a resolution service. By contrast, adopting a Linked Data approach, the Finnish research project FinnONTO, run by Aalto University, suggests the use of HTTP URIs rather than URNs and DOIs. The project has also delivered a tool for managing URIs, which is available through the ONKI service.
The authors understand that the subject of URI strategy is being discussed by the relevant authorities in the Netherlands but, as yet, nothing has been finalised or published. It is hoped that this document will prove useful in those discussions. In Denmark, all data is published in the format in which it is originally used, which at the moment doesn't include any linked data. Therefore the subject of URI persistence has not yet been discussed. Likewise the Czech Republic has not yet considered the issue. A policy is under development by CTIC on behalf of the government in Spain and follows the UK pattern closely.
This section reports on the persistent URI policies of standardisation bodies and other similar initiatives.
The Dublin Core Metadata Initiative has a very clear set of policies around URI design and persistence, set out on its Web site. It publishes 4 vocabularies, each one with its own namespace.
The URI pattern is shown below:
http://purl.org/dc/{vocabulary}/
More information is shown in the table below. Individual terms within those vocabularies follow the final / character. In common with recognised best practice, terms are written in camel case and classes begin with uppercase letters, properties with lower case.
Namespace | Notes |
---|---|
http://purl.org/dc/terms/ | All DCMI properties, classes and encoding schemes. This is the best known and most widely used namespace |
http://purl.org/dc/dcmitype/ | Classes in the DCMI Type Vocabulary |
http://purl.org/dc/dcam/ | Terms used in the DCMI Abstract Model |
http://purl.org/dc/elements/1.1/ | The Dublin Core Metadata Element Set, Version 1.1. This was the namespace of the original 15 elements. |
Two things stand out from the table above. First of all it is interesting to note that the original DC Elements namespace included the version number. A version number also occurs in the (equally widely used) FOAF namespace (http://xmlns.com/foaf/0.1/) despite the specification currently being on version 0.98. Both FOAF and DCMI date from the earliest days of the Semantic Web and so the inclusion of version numbers is an indication of this. At the time, the assumption was that as new versions of the vocabulary came out, new namespaces would need to be declared so that the specific semantics of any term could be updated without affecting previous versions that were already in use. That sounds eminently sensible on paper. However, the implication is that an application built today should use today's namespace - i.e. the latest one available - whilst older applications and data would be based on whatever was current at the time, even if the two versions of the vocabulary included the same terms. One might have seen modern day applications using:
http://purl.org/dc/elements/1.5/title
whilst older ones would use
http://purl.org/dc/elements/1.1/title
even though the meaning of title has not changed between the versions.
DCMI's move to a namespace without a version number avoided this unhelpful URI proliferation happening, as has FOAF's continued use of '0.1' in its namespace long after it has lost its meaning. Once a term is defined in the Dublin Core namespace, it is subject to extremely strict change control. Quoting from the policy:
Changes of definitions … will be reflected in the affected DCMI recommendation and/or DCMI term declaration. If, in the judgment of the DCMI Directorate, such changes of meaning are likely to have substantial impact on either machine processing of DCMI terms or the functional semantics of the terms, then these changes will be reflected in a change of URI for the DCMI term or terms in question. The URIs for any new DCMI namespaces resulting from such changes will conform to the DCMI namespace URI pattern defined above.
In other words, unless the change in definition is trivial and is very unlikely to affect running applications, a new definition means a new term with its own URI and the old one will persist.
The second thing to notice is DCMI's use of persistent URLs (purls) at purl.org.
Some historical context is useful here. OCLC, the Online Computer Library Center, is the organisation behind purl.org and was also the host of DCMI from 1995 to 2008. In that sense one can think of Dublin Core and purl.org as having a common ancestry in OCLC. Whether DCMI would have used purl.org without that common ancestry is unknowable. The idea behind purl.org is that it provides stable, persistent URIs that can be used even if the resources they ultimately resolve to move. The rationale for using it is that it is generally easier to hand on a service, like purl.org, than it is an organisational domain such as dublincore.org. If required, the purl.org service can more easily be taken on by another organisation. The use of purl.org can therefore be seen as a clear indication of the intention that the URIs will persist for as long into the future as can reasonably be foreseen. The question of whether DCMI needed to use purl.org arises since its own Web site, dublincore.org, is itself designed for long term stability.
The separation of namespace and definition that purl.org provides does allow DCMI to move things around on their site, publish updated schemas at new URIs and so on without affecting the rock solid stability of URIs like http://purl.org/dc/terms/creator on which so many applications depend.
purl.org itself is used by many organisations and therefore a lot of people regard it as important. It is that diversity of interest that is the best guarantor of purl.org's continued existence, even if ownership changes in future. Nevertheless, it is a service like any other and as its usage continues to grow, so does the cost of running it.
The W3C operates a strict policy with regard to the creation of its URIs. Certain sections of w3.org are very tightly controlled and there is a policy stating that once a URI has been minted, it should never cease to exist. Team members and document editors are generally authorised to publish and edit documents but do not have sufficient privileges to delete them.
Formally published documents, particularly documents that are evolving into standards, are all published at:
https://www.w3.org/TR/{shortname}
where {shortname} is something easy to remember like mobile-bp (Mobile Web Best Practices) or vocab-org for the Organisation Ontology. This will be the 'latest version URI' - i.e. the specific document returned will change as the document evolves (see section 2.5).
Individual documents have URIs in the form: http://www.w3.org/TR/{status}-{shortname}-{yyyymmdd} and so include an indication of the status of the document and the date on which it was published. Other parts of the w3.org namespace are covered by URIs for W3C Namespaces. It is noteworthy that the policy strongly encourages the publication of documents in 'dated space', i.e. http://www.w3.org/{year}/{month}/.
On a typical working day, 50 documents are created with URIs of this form, mostly minutes of meetings which are preserved in multiple formats and mostly with filenames beginning with the day of the month. For example, the minutes of the eGov Interest Group meeting held on 19th October are at http://www.w3.org/2012/10/19-egov-minutes.html. This simple use of the date and group name as part of the URI helps ensure uniqueness with little effort and is a big aid to persistence. The creation of documents outside dated space and the special area reserved for namespace documents (http://www.w3.org/ns/{shortname}) is rare and requires specific authorisation from W3C management.
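The three W3C patterns described above are simple enough to express as string templates. The helper names below are hypothetical, and the status code and date passed to the dated-version helper are illustrative values rather than a real publication; only the eGov minutes URI is taken from the text.

```python
from datetime import date


def latest_version_uri(shortname: str) -> str:
    """URI that always returns the most recent version of a document."""
    return "https://www.w3.org/TR/" + shortname


def dated_version_uri(status: str, shortname: str, published: date) -> str:
    """URI of one specific, immutable version of a document."""
    return f"http://www.w3.org/TR/{status}-{shortname}-{published:%Y%m%d}"


def dated_space_uri(created: date, name: str) -> str:
    """URI in 'dated space': year and month in the path, day in the filename."""
    return f"http://www.w3.org/{created:%Y}/{created:%m}/{created:%d}-{name}"


print(latest_version_uri("vocab-org"))
# Illustrative status and date only, not an actual publication.
print(dated_version_uri("WD", "vocab-org", date(2012, 10, 23)))
print(dated_space_uri(date(2012, 10, 19), "egov-minutes.html"))
```

Because the date is part of the URI, two groups meeting on different days can never collide, and nothing in the URI depends on how the site is organised internally.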
As well as operating a well-defined policy of URI creation, W3C also has a published policy on URI persistence. The policy takes the form of a pledge by the three host organisations (MIT, ERCIM and Keio University):
The policy was written personally by Tim Berners-Lee who sums it up as:
The intent is to set an example by reducing the failure of links due to clumsy management or inadequate commitment to information persistence, and to provide a stable reference base of information about W3C-related topics as a service to the community.
This combination of careful URI management and a stated persistence policy, including provision for what should happen should the organisation cease to exist, means that the community is right to have great confidence that resources on w3.org will persist for a long time. Although not covered by the persistence policy, the W3C mailing list archives are also managed and maintained for the long term. The very first archived mail from 28th October 1991, less than 3 months after the original WWW software was released, is still available online.
This section reports on the persistent URI policies of pioneers around the globe.
In November 2010, Data.gov announced that some of its datasets would also be available as Linked Data.
The URI template of data.gov, which is in line with that of data.gov.uk (in fact it was designed on the basis of the latter), is:
'http://' BASE '/' 'id' '/' ORG '/' CATEGORY ( '/' TOKEN )+
where id is used for identifying non information resources; ORG is a short token representing the agency, government, or organization that controls the identifier space; and CATEGORY and TOKEN identify the specific instance.
For example, in the case of US Government Agencies, the suggested URI template is:
http://BASE/id/us/fed/agency/NAME/SUBNAME
In that case the URI for the National Oceanic and Atmospheric Administration would be:
http://BASE/id/us/fed/agency/Commerce/National_Oceanic_and_Atmospheric_Administration
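The template can be read as a small grammar, which the sketch below turns into a builder function. The function name, the character set allowed in each segment and the way the agency example is split into ORG, CATEGORY and TOKEN values are all assumptions made for illustration; the template itself does not say where one component ends and the next begins.

```python
import re

# Assumed character set for one path segment: URI unreserved characters.
SEGMENT = re.compile(r"[A-Za-z0-9_.~-]+")


def data_gov_uri(base: str, org: str, category: str, *tokens: str) -> str:
    """Build 'http://' BASE '/id/' ORG '/' CATEGORY ( '/' TOKEN )+ ."""
    if not tokens:
        raise ValueError("the template requires at least one TOKEN")
    segments = [org, category, *tokens]
    for segment in segments:
        if not SEGMENT.fullmatch(segment):
            raise ValueError("invalid path segment: " + repr(segment))
    return "http://" + base + "/id/" + "/".join(segments)


# One possible reading of the agency example: ORG='us', CATEGORY='fed'.
print(data_gov_uri(
    "BASE", "us", "fed", "agency", "Commerce",
    "National_Oceanic_and_Atmospheric_Administration",
))
```

Rejecting segments that contain slashes or spaces keeps the hierarchy unambiguous, which is what allows such URIs to be resolved at different levels of the tree.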
Jim Hendler’s team at RPI undertook the task of designing the URIs for data.gov. Citing from their website, the data.gov URIs are:
In its own words, "ANDS is building the Australian Research Data Commons: a cohesive collection of research resources from all research institutions, to make better use of Australia's research data outputs."
It is an effort to manage its research output for easier discovery and re-use and, as part of this, it runs a service called 'Identify My Data' that acts as a Handle Service. On request from an authorised user (typically a researcher in an Australian scientific institute), it issues a Handle of the form:
102.100.100/nn
where the sequence 102.100.100 is made up of ‘102’ (Australia) dot ‘100’ (e-research) dot ‘100’ (ANDS), followed by a sequential number (nn). It is for the individual researcher to then associate that Handle with metadata about the resource which will typically include its location on the Web. Handles issued by ANDS can be resolved using handle.net, for example, http://hdl.handle.net/102.100.100/15.
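The structure of such a Handle can be unpacked as follows. This is a sketch only: the function names and dictionary keys are hypothetical, and the three-part prefix check reflects the ANDS layout described above rather than Handles in general.

```python
def parse_ands_handle(handle: str) -> dict:
    """Split an ANDS-style Handle into its prefix components and item number."""
    prefix, separator, suffix = handle.partition("/")
    if not separator or not suffix:
        raise ValueError("a Handle needs a prefix and a suffix: " + handle)
    parts = prefix.split(".")
    if len(parts) != 3:
        raise ValueError("expected a three-part prefix: " + prefix)
    country, domain, authority = parts
    return {
        "country": country,      # '102' stands for Australia
        "domain": domain,        # '100' stands for e-research
        "authority": authority,  # '100' stands for ANDS
        "item": suffix,          # the sequential number
    }


def resolver_uri(handle: str) -> str:
    """Return the handle.net URI through which the Handle is resolved."""
    return "http://hdl.handle.net/" + handle


print(parse_ands_handle("102.100.100/15"))
print(resolver_uri("102.100.100/15"))
```

The Handle itself carries no location information; it only becomes an HTTP URI when combined with a resolver, which is what lets the resolver change without the identifier changing.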
The documentation of this system is spread across three increasingly detailed documents:
The important aspect for the current discussion is that this is a dedicated service established to issue persistent identifiers to the Australian research community. ANDS is committed to maintaining the Persistent Identifiers for at least twenty years (section 4.6) and is organised in such a way that it could readily continue well beyond that time frame. (The link has since changed and now points to a new organisation's homepage; ANDS seems to no longer exist, so much for twenty years.) The two-part nature of the service means that the researcher him/herself also needs to maintain the data and to inform the 'Identify My Data' service of any changes (which can be done programmatically). They can move, update or delete their data but this is independent of the ANDS service.
The up side of this arrangement is that the identifiers are very likely to persist in terms of their structure, uniqueness and meaning, even when a given researcher moves on to new topics. The downside is that by shifting some of the responsibility for maintenance to an outside agency, researchers may be less motivated to maintain their data despite this being an important aspect of the system and so it could be argued that the problem of link rot is not really solved.
To mitigate this weakness, users of 'Identify My Data' are strongly encouraged to provide authority metadata, i.e. information about the identifier itself, who has/had responsibility for its creation and the object it identifies. ANDS recommends that contact information is provided and tied to a role, not an individual, to increase the likelihood of future discovery of the responsible researcher.
The final line of the policy statement is illuminating:
Because authority metadata is used when things go wrong, its availability should not be reliant on external systems: failure to access an external system may be why things have gone wrong to begin with. Contact data should therefore be stored directly in the identifier record, rather than linked through some external database.
This makes perfect sense but it emphasises the fact that the 'Identify My Data' service is designed and managed for a non linked data environment. Indeed, the term linked data doesn't occur anywhere in the relevant ANDS documentation. This is a service for assigning and managing persistent identifiers for resources that exist elsewhere, rather than a service for managing URIs as first class citizens of the datasphere.
Europeana, a comprehensive and growing portal to Europe's cultural heritage collections, was launched in November 2008. At that time it only used URIs as identifiers for the records about the millions of items held in cultural heritage collections across Europe. A pilot project that was described in a paper (PDF) at the 2011 Dublin Core Conference and released to the public in February 2012 made linked open data available about 2.4 million items.
The original record URIs were of the form:
http://www.europeana.eu/resolve/record/{collectionID}/{itemID}
and to create identifiers for the objects themselves, the project decided to use the related pattern:
http://data.europeana.eu/item/{collectionID}/{itemID}
As noted in section 3.2.1.1, this is in line with the UK's Designing URI Sets for the UK Public Sector, which is again cited as a reference.
However, during the pilot project a problem came to light. Europeana assigns URIs to records as those records are ingested from the relevant cultural heritage institutions. Individual items are assigned a hexadecimal number at the time of ingestion, a number that is incremented as each new item is added. This makes perfect sense as it is a largely automated system handling millions of records. However, during the linked open data pilot, the data from some collections was re-ingested and this generated a different set of identifiers. Projects and applications that had depended on the original URIs had to update their systems. The PATHS Project was one such.
It is evident that automated URI minting systems like this must have a means of avoiding such faux pas in future.
There are only two realistic ways of doing this:
The second option is most likely to apply to Europeana where efforts are being made to define a persistence policy but even so, some sort of DOI or ARK-based system is likely to be necessary to perform the matching process reliably.
The lack of integration between the production system and the Linked Data pilot at Europeana is a continuing source of resistance to the wholesale use of persistent URIs since the URIs in the production system are inherently ephemeral.
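The re-ingestion problem described above can be illustrated with a minimal sketch. It assumes that each record arrives with some stable key from the source institution (the hypothetical `source_key`), and it uses an in-memory dictionary where a real system would need a durable store. It is an illustration of the general idea only, not a description of Europeana's implementation.

```python
class UriMinter:
    """Mints item URIs from a counter but remembers earlier assignments."""

    def __init__(self, base: str):
        self.base = base.rstrip("/")
        # (collection_id, source_key) -> minted URI.
        # Assumption: a real service would persist this mapping.
        self._assigned = {}
        self._counter = 0

    def mint(self, collection_id: str, source_key: str) -> str:
        key = (collection_id, source_key)
        if key not in self._assigned:
            # Only unseen records advance the counter.
            self._counter += 1
            # Hexadecimal item number, mirroring the ingestion scheme above.
            self._assigned[key] = f"{self.base}/{collection_id}/{self._counter:X}"
        return self._assigned[key]


minter = UriMinter("http://data.example.org/item")
first = minter.mint("c1", "rec-1")
again = minter.mint("c1", "rec-1")  # simulated re-ingestion of the same record
print(first == again)
```

Because the lookup happens before the counter is incremented, re-ingesting a collection returns the URIs that were minted the first time rather than generating a new set.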
A discussion of persistent URIs should include at least one example that is avowedly not persistent. Wikipedia provides just such an example.
A page name in Wikipedia may begin with a namespace prefix – a string (ending with a colon) which the MediaWiki software recognizes as placing a page in a particular namespace. If the page name does not begin with any of the recognised prefixes, then it is considered to be in the main namespace. A full page name therefore takes one of the following forms:
URIs are created when new pages are created in Wikipedia. This process is automatic and is based on the page title or label. Since labels are inherently ambiguous, a commonly encountered feature of Wikipedia is its disambiguation pages such as:
At the time of writing, there are 18 pages all about 'Europe' - everything from the physical continent via several references from Greek Mythology to the national anthem of the republic of Kosovo. As the Wikimedia Foundation's notes on the subject make clear, the issue is well understood:
Note that the URLs for pages in Wikipedia are not persistent (although they are quite persistent, see e.g. Siorpaes/Hepp research on this issue). But still, they can change meaning: if a Kevin Smith should become US president some day, he will most certainly replace the author Kevin Smith from his place as the main topic of the Kevin Smith article. Also changes in a name - e.g. when a person marries or becomes pope - lead to changes of the URL. There will be adequate redirects and disambiguations in most cases, but although these are easy to follow and disambiguate for humans, this is not necessarily true for machines.
Wikipedia is a resource created by humans for humans. The derived linked data version, DBpedia, was a critical aspect of the development of linked data and has brought the issue of URI persistence into focus, as shown further down the page from which the quote was taken. In short: URIs need to become more persistent, especially now that the Wikimedia Foundation's new Wikidata project is underway. This project marks a move from a human-centric service to one that both humans and machines can use, making URI management all the more critical.
Having reviewed several case studies from the public sector and the key technical considerations, it is now possible to derive a set of best practices. The foundations of the best practices have not changed since the earliest days of the Web; however, experience has allowed their refinement and evolution. Table 2 offers a guide to the sources that document this evolution.
Status | Title | Authors and Date |
---|---|---|
Background | Cool URIs don't change | Tim Berners-Lee, 1998 |
Background | Cool URIs for the Semantic Web | Leo Sauermann, Richard Cyganiak, 2008 |
Background | Linked Data | Tim Berners-Lee, 2009 |
Key Source | Designing URI Sets for the UK Public Sector (PDF) | UK Chief Technology Officer Council, October 2009 |
Expansion | Creating Linked Data | Jeni Tennison, 2009 |
Expansion | Linked Data: Evolving the Web into a Global Data Space | Tom Heath & Christian Bizer, 2011 |
Expansion | Linked Data Patterns | Leigh Dodds & Ian Davis, 2012 |
Expansion | Best Practices for Multilingual Linked Open Data | Jose Emilio Labra Gayo, 2012 |
Detail | Statistical Linked Dataspaces | Sarven Capadisli, 2012 |
The figure below summarises the 10 DOs and DON’Ts of persistent URI design rules and management, which are detailed in the remainder of this chapter.
The 10 DOs and DON'Ts for persistent URIs
The recommended pattern for a URI designed for persistence is:
http://{domain}/{type}/{concept}/{reference}
This comes originally from Designing URI Sets for the UK Public Sector and has been repeated successfully with little variation in many different scenarios. A full explanation is provided in section 3.2.1.1 and can be summarised as:
{domain} is a combination of the host and the relevant sector. It is a matter of choice whether the sector is defined as a sub-domain of the host or as the first component of the path.
{type} should be one of a small number of possible values that declare the type of resource that is being identified. Typical examples include:
{concept} might be a collection (Europeana, 3.4.3), the type of real world object identified (e.g. road, 3.2.1.1), the name of the concept scheme (e.g. 'language', 3.1.1);
{reference} is a specific item, term or concept.
The URI template above does not include the name of the organisation or project that minted the URI. This makes it much less susceptible to change should the project end or the organisation be merged or renamed.
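The template can be turned into a small helper. The following is a minimal sketch, assuming that the allowed `{type}` values are limited to 'id' and 'doc' (the two values this report uses in its examples); the function name and the allowed set are illustrative, not part of any specification.

```python
from urllib.parse import quote

# Assumption: 'id' and 'doc' are the only permitted {type} values here.
ALLOWED_TYPES = {"id", "doc"}


def build_uri(domain: str, type_: str, concept: str, reference: str) -> str:
    """Fill the template http://{domain}/{type}/{concept}/{reference}."""
    if type_ not in ALLOWED_TYPES:
        raise ValueError(f"unexpected type: {type_!r}")
    # Percent-encode each path segment so that reserved characters such as
    # '/' or '?' cannot change the structure of the URI.
    return "http://{}/{}/{}/{}".format(
        domain, type_, quote(concept, safe=""), quote(reference, safe="")
    )


print(build_uri("education.data.example", "id", "school", "123456"))
# prints http://education.data.example/id/school/123456
```

Restricting `{type}` to a fixed set keeps the URI space predictable, which is what makes later rewrite rules between types straightforward.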
Although concept schemes, ontologies, taxonomies and vocabularies are likely to go through iterative cycles of change, version numbers and status information should not be included in the URIs. Rather, the URIs should remain stable between versions and new ones minted for new terms. URIs may be deprecated and their use discouraged but they should nevertheless be maintained both in terms of the actual URI and the resource they identify.
Where resources are already uniquely identified, those identifiers should be incorporated into the URI. For example, if schools are assigned integer identifiers, the URI for the school with identifier 123456 could be:
http://education.data.example/id/school/123456
Caution: when re-using an identifier, it is essential to re-use it without changing the original semantics. For example, a URI for a vehicle licence is not an identifier for the vehicle itself. Furthermore, it is important to re-use only identifiers that are themselves likely to be persistent.
Minting new URIs for large datasets will need to be automated and the process must be guaranteed to produce unique identifiers. One way to do this might be to simply increment a counter as each new URI is minted. Imagine that, in the example of the previous section, the integer identifiers for schools were assigned from such a counter. In that case, the following could be URIs for two different schools.
http://education.data.gov.uk/id/school/123456
http://education.data.gov.uk/id/school/123457
Although this approach is perfectly feasible, we would recommend it only if one of the following is true:
Query strings (e.g. ?param=value) are usually used in URIs as keys to look up terms in a database. This is brittle since it often relies on a particular implementation.
For similar reasons to the previous point, avoid file extensions in persistent URIs, particularly those that stem from the technology used such as .php or .py.
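The two rules above (no query strings, no file extensions) lend themselves to an automated check. The following is a rough sketch of such a check; the function name and the warning messages are illustrative assumptions, and the extension test is a heuristic that would also flag a reference containing a dot.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def persistence_warnings(uri: str) -> list:
    """Return a list of reasons why a URI may be a poor persistent URI."""
    parts = urlsplit(uri)
    warnings = []
    if parts.query:
        warnings.append("contains a query string")
    # Heuristic: anything after the last dot in the final path segment
    # is treated as a file extension.
    suffix = PurePosixPath(parts.path).suffix
    if suffix:
        warnings.append(f"ends with file extension {suffix}")
    return warnings


print(persistence_warnings("http://education.data.example/id/school/123456"))
print(persistence_warnings("http://example.org/lookup.php?id=42"))
```

The first call returns an empty list; the second reports both the query string and the .php extension.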
A persistent URI should identify a conceptual resource. Where that resource is an information resource - that is, something that can be transmitted as a stream of bytes - then different user agents should be able to access it in different formats. In particular, humans and machines should be able to access it in formats appropriate to their different needs. Typically this will mean that a resource such as
http://data.example.org/doc/foo/bar
can be returned in at least HTML and some serialisation of RDF.
Those specific representations of the resource should have their own URI and this should follow a predictable pattern, the simplest of which is to add the relevant file extension, i.e.
http://data.example.org/doc/foo/bar.html
and
http://data.example.org/doc/foo/bar.rdf
This does not break the best practice set out in 4.2.6 since these are not the persistent URIs (http://data.example.org/doc/foo/bar is).
Multiple representations of the same resource should all link to each other using a suitable method. In HTML, use a link element with the 'rel' value of 'alternate'; in RDF, use dcterms:hasFormat, etc.
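As an illustration of how a server might map the persistent URI to its format-specific URIs, here is a simplified sketch. The extension mapping and both function names are assumptions made for this example, and the Accept header handling deliberately ignores q-values, which a production server should honour.

```python
# Assumption: only HTML and RDF/XML are offered, using the extensions
# from the example URIs above.
EXTENSIONS = {"text/html": ".html", "application/rdf+xml": ".rdf"}


def representation_uri(persistent_uri: str, accept: str) -> str:
    """Pick the format-specific URI for a simple Accept header."""
    for item in accept.split(","):
        # Drop parameters such as ';q=0.9' and normalise the media type.
        media_type = item.split(";")[0].strip().lower()
        if media_type in EXTENSIONS:
            return persistent_uri + EXTENSIONS[media_type]
    # Fall back to HTML for unknown types or wildcards such as */*.
    return persistent_uri + ".html"


def alternate_link(persistent_uri: str, media_type: str) -> str:
    """Build an HTML link element pointing at another representation."""
    href = persistent_uri + EXTENSIONS[media_type]
    return f'<link rel="alternate" type="{media_type}" href="{href}">'


uri = "http://data.example.org/doc/foo/bar"
print(representation_uri(uri, "application/rdf+xml"))
print(alternate_link(uri, "application/rdf+xml"))
```

The persistent URI itself never carries an extension; only the representation URIs derived from it do.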
When de-referenced, URIs that identify real world objects that cannot be transmitted as a series of bytes (such as buildings, places and people) should redirect using HTTP response code 303 to a document that describes the object. This should be done in a consistent manner that can be written as a URI re-write rule, typically replacing the URI {type} of 'id' with 'doc'. See section 3.2.1.2 for details.
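The re-write rule can be expressed in a few lines. This is a minimal sketch, assuming that `{type}` is always the first path segment; the function name `see_other` is hypothetical, and a real deployment would more likely express the rule in the web server's own configuration.

```python
from urllib.parse import urlsplit, urlunsplit


def see_other(uri: str):
    """Return (status, location) for a URI that identifies a real world object.

    Only a leading /id/ path segment is rewritten to /doc/. Any other URI
    yields None, meaning that no redirect applies.
    """
    parts = urlsplit(uri)
    if not parts.path.startswith("/id/"):
        return None
    doc_path = "/doc/" + parts.path[len("/id/"):]
    return 303, urlunsplit(parts._replace(path=doc_path))


print(see_other("http://education.data.example/id/school/123456"))
# prints (303, 'http://education.data.example/doc/school/123456')
```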
Without exception, all the use cases discussed in section 3 where a policy of URI persistence has been adopted have used a dedicated service that is independent of the data originator. The Australian National Data Service uses a handle resolver and Dublin Core uses purl.org; the services data.gov.uk and publications.europa.eu are likewise independent of a specific government department and could readily be transferred and run by someone else if necessary. This does not imply that a single service should be adopted for multiple data providers. On the contrary - distribution is a key advantage of the Web. It simply means that the provision of persistent URIs should be independent of the data originator.
Multiple, small-scale services that are easily transferable, rather than a single point of failure, plus a continued demand for the service, are the greatest guarantors of persistence.
The authors would like to thank the following people for their valuable contribution to this study (in alphabetical order): Martin Alvarez-Espinar (CTIC, Spain), Adam Arndt (Danish Agency for Digitisation), Sarven Capadisli (DERI, NUI Galway), Makx Dekkers (formerly of DCMI), Kerstin Forsberg (Information architect, clinical trial data specialist), Giorgos Georgiannakis (DG SANCO, European Commission), Michael Hausenblas (DERI, NUI Galway), Josef Hruška (Ministry of Interior, Czech Republic), Aftab Iqbal (DERI, NUI Galway), Antoine Isaac (Europeana), Anne Kauhanen-Simanainen (Ministry of Finance, Finland), Peter Krantz (eGov Consultant, Sweden), Giorgia Lodi (Agenzia per l'Italia Digitale, Italy), Antonio Maccioni (Agenzia per l'Italia Digitale, Italy), Thodoris Papadopoulos (Ministry of Administrative Reform and eGovernance, Greece), Priit Parmakson (Estonian Information Systems Authority), Paul Suijkerbuilk (data.overheid.nl, The Netherlands).
Thumbs up and down icons from psdgraphics.com
Main authors/designers of this document apart from me were Nikos Loutas, Stijn Goedertier and Saky Kourtidis of PwC.
The primary reason for creating this Web copy of the PDF original is to make it easy to reference particular sections. Each heading, highlighted paragraph, table and example has its own unique identifier so that you can link directly to it. For example, the template URI defined in Designing URI Sets for the UK Public Sector can be referenced directly at http://philarcher.org/diary/2013/uripersistence/#basicUKuri.
To find the fragment identifiers, just mouse over the relevant section or heading (they're shown as element titles, not the more usual CSS and/or JavaScript jiggerypokery).