
May 31, 2008

Exposing Public Data

As the range of public services (e.g. RSS) offered by publishers has matured, a question arises: how can publishers expose their public data so that users can discover it? In particular, with DOI there is now a persistent link infrastructure in place for accessing primary content. How can publishers leverage that infrastructure to advantage?

Anyway, I offer this figure as to how I see the current lie of the land as regards DOI services and data.

doi-services.jpg
Legend - Current DOI service architecture showing data repositories, service access points, and open/closed data domains.

The figure above shows the three data repositories and service access points in the current DOI services architecture. At right and bottom of the figure are the two types of service (public services and private services) that together are instrumental in getting a user from a DOI-based link (on a third-party site) to the correct page of content (from the primary content provider). (Note that a fourth, private data repository – the institutional repository – comes into play when OpenURL user context-sensitive linking is added.)

At left of the figure are services operated by CrossRef on its own metadata database which support a) publisher lookups of DOI, and b) third-party metadata services (DOI-to-metadata and metadata-to-DOI conversions). These might best be labelled protected services since they are not freely available: the first is open to members at a cost, while the second is free but to associated organizations only – members, affiliates, etc.

The term open data is used here in the sense implied by the current W3C SWEO LOD (Linking Open Data) Project. Open data is public data unencumbered by any access restrictions. By contrast, closed data is data that has some access restrictions placed on it – even data that is open to affiliates. (This is not an issue that LOD addresses directly, although it is implied that data is globally ‘open’, i.e. public.)

The current DOI service architecture thus breaks down as:

  • Native DOI services – resolving the DOI token
    • Public – DOI Proxy Server (‘dx.doi.org’)
  • Related DOI services – using the DOI token
    • Protected – CrossRef
    • Private – Publisher

Note that a DOI is ‘resolved’ into state data registered with it, or as ISO CD 26324 puts it: “Resolution is the process of submitting a specific DOI name to the DOI system and receiving in return the associated values held in the DOI resolution record for one or more types of data relating to the object identified by that DOI name.”
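
To make the public entry point concrete, here is a minimal sketch in Python of how a DOI name maps onto a link through the DOI proxy server ('dx.doi.org'). The function name and the validation rule are illustrative assumptions, not part of any DOI specification, and the sketch only builds the URL. It does not contact the proxy or read the resolution record.

```python
from urllib.parse import quote

# Public DOI proxy server named in the breakdown above.
DOI_PROXY = "http://dx.doi.org/"


def doi_to_proxy_url(doi: str) -> str:
    """Build a proxy URL for a DOI name (hypothetical helper).

    Accepts either the bare form ("10.1038/nature06930") or the
    "doi:"-prefixed form, and percent-encodes any unsafe characters.
    """
    name = doi.strip()
    if name.lower().startswith("doi:"):
        name = name[4:]
    # Assumption: a DOI name starts with "10." and has a prefix/suffix slash.
    if not name.startswith("10.") or "/" not in name:
        raise ValueError("not a DOI name: %r" % doi)
    return DOI_PROXY + quote(name, safe="/")


if __name__ == "__main__":
    print(doi_to_proxy_url("doi:10.1038/nature06930"))
```

A browser following the resulting link is redirected by the proxy to the publisher's page. The other values held in the resolution record are not visible through this route.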

So, how might publishers best leverage this DOI service architecture to expose their public data?

May 29, 2008

Dark Side of the DOI

openhandle_p5_radial.jpg


For infotainment only (and because it's a pretty printing). Glimpse into the dark world of DOI. Here, the handle contents for doi:10.1038/nature06930 exposed as a standard OpenHandle 'Hello World' document. Browser image courtesy of Processing.js and Firefox 3 RC1.

Referencing OpenURL

So, why is it just so difficult to reference OpenURL?

Apart from the standard itself (hardly intended for human consumption - see abstract page here and PDF here - and don't even think to look at those links - they weren't meant to be cited!), it seems that the best reference is the Wikipedia page. There is the OpenURL Registry page at http://openurl.info/registry but this is just a workshop. Not much there beyond the OpenURL registered items. (And why does the page seem uncertain as to whether it's a "repository" or a "registry"? Is there no difference between those terms?) The only other links are to a mix of HTML and PDF resources. (There really should be a health warning on links to PDFs - they are just not browser-friendly documents.) And I do have to wonder at this: the registry page has a link to the unofficial 0.1 version but not to the 1.0 standard. Er, why? And don't even try this link: http://openurl.info/. Not much info there.

Where else to go? The NISO site allows a search on "openurl" which returns links to the standard and to other related documents.

And then there's the community site http://openurl.code4lib.org/ targeted at developers and its Planet OpenURL which is a useful source for current awareness.

Me, I'm sticking with the Wikipedia page as the best reference for OpenURL. How odd that OpenURL, aimed at improving linking on the Web, should not have its own simple access point. Thank heavens at least that DOI has a single reference point: http://doi.org/.

May 23, 2008

Tombstone

So, the big guns have decided that XRI is out. In a message from the TAG yesterday, variously noted as being "categorical" (Andy Powell, eFoundations) and a "proclamation" (Edd Dumbill, XML.com), the co-chairs (Tim Berners-Lee and Stuart Williams) had this to say:

"We are not satisfied that XRIs provide functionality not readily available from http: URIs. Accordingly the TAG recommends against taking the XRI specifications forward, or supporting the use of XRIs as identifiers in other specifications."

Alas, poor XRI. But what might this also mean for other URI schemes (note the reference above to "http: URIs")? Well, the message starts out with this:

"In The Architecture of the World Wide Web [1] the TAG sets out the reasons why http: URIs are the foundation of the value proposition for the Web, and should be used for naming on the Web."

Now I'm not sure that this is quite what AWWW actually says. I don't find it to be that insistent that "http: URIs ... should be used for naming on the Web" but I would need to read it more carefully. Certainly, "http: URIs" fit the bill and are top of the class. But there is also a general recognition that schemes other than "http:" do exist.

Interesting times anyway with a "winner takes all" approach to identification. I wonder what this all means for DOI.

May 20, 2008

Metadata Reuse Policies

Following on from yesterday's post about making metadata available on our Web pages, I wanted to ask here about "metadata reuse policies". Does anybody have a clue as to what might constitute a best practice in this area? I'm specifically interested in license terms, rather than how those terms would be encoded or carried. Increasingly we are finding more channels to distribute metadata (RSS, HTML, OAI-PMH, etc.) but don't yet have any clear statement for our customers as to how they might reuse that data.

Time to put the caveats aside and focus on the actuals.

May 19, 2008

Nature's Metadata for Web Pages

Well, we may not be the first but wanted anyway to report that Nature has now embedded metadata (HTML meta tags) into all its newly published pages including full text, abstracts and landing pages (all bar four titles which are currently being worked on). Metadata coverage extends back through the Nature archives (and depth of coverage varies depending on title). This conforms to the W3C's Guideline 13.2 in the Web Content Accessibility Guidelines 1.0 which exhorts content publishers to "provide metadata to add semantic information to pages and sites".

Metadata is provided in both DC and PRISM formats as well as in Google’s own bespoke metadata format. This generally follows the DCMI recommendation "Expressing Dublin Core metadata using HTML/XHTML meta and link elements", and the earlier RFC 2731 "Encoding Dublin Core Metadata in HTML". (Note that the schema name is normalized to lowercase.) Some notes:

  • The DOI is included in the "dc.identifier" term in URI form which is the CrossRef recommendation for citing DOI.
  • We could consider adding also "prism.doi" for disclosing the native DOI form. This requires the PRISM namespace declaration to be bumped to v2.0. We might consider synchronizing this change with our RSS feeds which are currently pegged at v1.2, although note that the RSS module mod_prism currently applies only to PRISM v1.2.
  • We could then also add in a "prism.url" term to link back (through the DOI proxy server) to the content site. The namespace issue listed above still holds.
  • The "citation_" terms are not anchored in any published namespace, which makes this term set problematic for application reuse. It would be useful to be able to reference a namespace (e.g. rel="schema.gs" href="...") for these terms and to cite them as e.g. "gs.citation_title".
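
As a rough illustration of the second and third notes, the following Python sketch emits the current "dc.identifier" element alongside the two proposed additions, "prism.doi" and "prism.url". The function name is hypothetical. The "prism.doi" and "prism.url" elements are shown only as they might look and are not something Nature currently publishes. The matching PRISM v2.0 namespace declaration is deliberately left out, since its exact form is not given here.

```python
from html import escape

# Public DOI proxy server, used for the link-back URL.
DOI_PROXY = "http://dx.doi.org/"


def doi_meta_tags(doi: str) -> list:
    """Return HTML meta elements carrying one DOI in three forms.

    `doi` is the native DOI form, e.g. "10.1038/nature06925".
    """
    terms = [
        ("dc.identifier", "doi:" + doi),  # form used on the pages today
        ("prism.doi", doi),               # proposed: native DOI form
        ("prism.url", DOI_PROXY + doi),   # proposed: link via the proxy
    ]
    return [
        '<meta name="%s" content="%s" />' % (name, escape(value, quote=True))
        for name, value in terms
    ]


if __name__ == "__main__":
    for tag in doi_meta_tags("10.1038/nature06925"):
        print(tag)
```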

The HTML metadata sets from an example landing page are presented below.

If you view the page source you should see something like the text below. (Note that you may have to scroll past whitespace which is emitted by the HTML template generator.)

<link title="schema(DC)" rel="schema.dc" href="http://purl.org/dc/elements/1.1/" />
<meta name="dc.publisher" content="Nature Publishing Group" />
<meta name="dc.language" content="en" />
<meta name="dc.rights" content="&#169; 2008 Nature Publishing Group" />
<meta name="dc.title" content="Crystal structure of squid rhodopsin" />
<meta name="dc.creator" content="Midori Murakami" />
<meta name="dc.creator" content="Tsutomu Kouyama" />
<meta name="dc.identifier" content="doi:10.1038/nature06925" />
					
<link title="schema(PRISM)" rel="schema.prism" href="http://prismstandard.org/namespaces/1.2/basic/" />
<meta name="prism.copyright" content="&#169; 2008 Nature Publishing Group" />
<meta name="prism.rightsAgent" content="permissions@nature.com" />
<meta name="prism.publicationName" content="Nature" />
<meta name="prism.issn" content="0028-0836" />
<meta name="prism.eIssn" content="1476-4687" />
<meta name="prism.volume" content="453" />
<meta name="prism.number" content="7193" />
<meta name="prism.startingPage" content="363" />
<meta name="prism.endingPage" content="367" />

<meta name="citation_journal_title" content="Nature" />
<meta name="citation_publisher" content="Nature Publishing Group" />
<meta name="citation_authors" content="Midori Murakami, Tsutomu Kouyama" />
<meta name="citation_title" content="Crystal structure of squid rhodopsin" />
<meta name="citation_volume" content="453" />
<meta name="citation_issue" content="7193" />
<meta name="citation_firstpage" content="363" />
<meta name="citation_doi" content="doi:10.1038/nature06925" />


While we do not expect search engines to index these terms directly, and no direct SEO is intended, we think there is enough value in applications being able to make use of these terms. The terms are reasonably accessible to simple scripts, etc. Note that even in RFC 2731 (published in 1999) there is a Perl script listed in Section 9 which allows the metadata name/value pairs to be easily pulled out. Running this over the example page yields the following output:

@(urc;
@|MISSING ELEMENT NAME; text/css
@|MISSING ELEMENT NAME; text/html; charset=iso-8859-1
@|robots; noarchive
@|keywords; Nature, science, science news, biology, physics, genetics, astronomy, astrophysics, quantum physics, evolution, evolutionary biology, geophysics, climate change, earth science, materials science, interdisciplinary science, science policy, medicine, systems biology, genomics, transcriptomics, palaeobiology, ecology, molecular biology, cancer, immunology, pharmacology, development, developmental biology, structural biology, biochemistry, bioinformatics, computational biology, nanotechnology, proteomics, metabolomics, biotechnology, drug discovery, environmental science, life, marine biology, medical research, neuroscience, neurobiology, functional genomics, molecular interactions, RNA, DNA, cell cycle, signal transduction, cell signalling.
@|description; Nature is the international weekly journal of science: a magazine style journal that publishes full-length research papers in all disciplines of science, as well as News and Views, reviews, news, features, commentaries, web focuses and more, covering all branches of science and how science impacts upon all aspects of society and life.
@|dc.publisher; Nature Publishing Group
@|dc.language; en
@|dc.rights; #169; 2008 Nature Publishing Group
@|dc.title; Crystal structure of squid rhodopsin
@|dc.creator; Midori Murakami
@|dc.creator; Tsutomu Kouyama
@|dc.identifier; doi:10.1038/nature06925
@|prism.copyright; © 2008 Nature Publishing Group
@|prism.rightsAgent; permissions@nature.com
@|prism.publicationName; Nature
@|prism.issn; 0028-0836
@|prism.eIssn; 1476-4687
@|prism.volume; 453
@|prism.number; 7193
@|prism.startingPage; 363
@|prism.endingPage; 367
@|citation_journal_title; Nature
@|citation_publisher; Nature Publishing Group
@|citation_authors; Midori Murakami, Tsutomu Kouyama
@|citation_title; Crystal structure of squid rhodopsin
@|citation_volume; 453
@|citation_issue; 7193
@|citation_firstpage; 363
@|citation_doi; doi:10.1038/nature06925
@)urc;
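
For comparison, the same name/value extraction can be sketched in a few lines of Python using only the standard library. This is not the RFC 2731 Perl script and does not reproduce its output format. The class and function names are hypothetical, and meta elements without a "name" attribute (such as http-equiv declarations) are simply skipped rather than reported.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect (name, content) pairs from <meta> elements."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name is not None and content is not None:
            self.pairs.append((name, content))


def extract_meta(html_text: str) -> list:
    """Return all named meta pairs in document order."""
    parser = MetaExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.pairs


if __name__ == "__main__":
    sample = (
        '<meta name="dc.title" content="Crystal structure of squid rhodopsin" />\n'
        '<meta name="dc.creator" content="Midori Murakami" />\n'
        '<meta name="dc.creator" content="Tsutomu Kouyama" />\n'
    )
    for name, content in extract_meta(sample):
        print("%s; %s" % (name, content))
```

Repeated terms such as "dc.creator" are kept as separate pairs, so an application can decide for itself whether to merge them.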

May 14, 2008

DOIs and PubMed Central - why no links?

Further to my previous post "NIH Mandate and PMCIDs" we've been looking into linking to articles on publishers' sites from PubMed Central (PMC). There are a couple of ways this happens currently (see details below) but these are complicated and will lead to broken links and more difficulty for PMC and publishers in managing the links. CrossRef is going to be putting together a briefing note for its members on this soon.

The main issue we are raising with PMC, and that we will encourage publishers to raise too, is why doesn't PMC just automatically link DOIs? Most of the articles in PMC have DOIs so this would require very little effort from PMC and no effort from publishers and would give readers a persistent link to the publisher's version of an article.

Current PMC linking methods. 1) Links on Author Manuscripts in PMC are pulled in from PubMed's LinkOut service which requires the publisher to register with PubMed and provide linking files. The DOI can be specified as the linking mechanism via LinkOut.

2) For final versions of articles in PMC the journal image at the top of the page can be linked to the journal homepage or can have a "this article" link to the publisher's site. The publisher has to sign up with PMC to specify the header graphic and the links. The PMC instructions (Word document) say "The static base (http://www.biomedcentral.com/) of the URLs for this link comes from the HTML template. PMC then dynamically completes the URL by adding an issn/vol/page." and then say that any item in the XML (such as the DOI) can be used.

Both of the approaches outlined above require extra work and will be difficult for smaller publishers. In addition, the links will be fragile because they are not based on DOIs. Publishers can specify that DOIs be used but it isn't easy. We'd like to leverage the resources that publishers have already put into the DOI system by automatically making the DOIs active links - it would be very easy.
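
To give a sense of how little is involved, here is a minimal Python sketch that turns DOI strings found in plain text into active links through the DOI proxy server. It is an illustration only and says nothing about how PMC's own systems work. The function name, the regular expression and the trailing-punctuation rule are all assumptions, and a production version would take the DOI from the article XML rather than pattern-match it.

```python
import re
from html import escape

# Assumed pattern: "10.", a 4-9 digit prefix, "/", then a suffix that
# runs up to whitespace, a quote or an angle bracket.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')


def linkify_dois(text: str, proxy: str = "http://dx.doi.org/") -> str:
    """Wrap each DOI found in plain text in an HTML anchor."""

    def replace(match):
        doi = match.group(0)
        trailing = ""
        # Assumption: punctuation at the very end belongs to the sentence,
        # not the DOI (this would mis-handle a DOI ending in ")").
        while doi and doi[-1] in ".,;)":
            trailing = doi[-1] + trailing
            doi = doi[:-1]
        href = escape(proxy + doi, quote=True)
        return '<a href="%s">%s</a>%s' % (href, escape(doi), trailing)

    return DOI_PATTERN.sub(replace, text)


if __name__ == "__main__":
    print(linkify_dois("See doi:10.1038/nature06925."))
```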