
November 28, 2011

Turning DOIs into formatted citations

Today two new content types were added to dx.doi.org resolution for CrossRef DOIs. These allow anyone to retrieve DOI bibliographic metadata as formatted bibliographic entries. To perform the formatting we're using the Citation Style Language (CSL) processor citeproc-js, which supports a shed load of citation styles and locales. In fact, all the styles and locales found in the CSL repositories are supported, including many common styles such as bibtex, apa, ieee, harvard, vancouver and chicago.

First off, if you'd like to try citation formatting without using content negotiation, there's a simple web UI that lets you enter a DOI and select a style and locale.

If you're more into accessing the web via your favorite programming language, have a look at these content negotiation curl examples. To make a request for the new "text/bibliography" content type:

$ curl -LH "Accept: text/bibliography; style=bibtex" http://dx.doi.org/10.1038/nrd842

@article{Atkins_Gershell_2002, title={From the analyst's couch: Selective anticancer drugs}, volume={1}, DOI={10.1038/nrd842}, number={7}, journal={Nature Reviews Drug Discovery}, author={Atkins, Joshua H. and Gershell, Leland J.}, year={2002}, month={Jul}, pages={491-492}}

A locale can be specified with the "locale" content type parameter, like this:

$ curl -LH "Accept: text/bibliography; style=mla; locale=fr-FR" http://dx.doi.org/10.1038/nrd842

Atkins, Joshua H., et Leland J. Gershell. « From the analyst's couch: Selective anticancer drugs ». Nature Reviews Drug Discovery 1.7 (2002): 491-492.
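For those working from a script rather than the command line, the two curl requests above can be sketched in Python using only the standard library. The helper name `bibliography_request` is an illustrative assumption, not part of any CrossRef API; the sketch only builds the request, so it runs without network access.

```python
from urllib.request import Request

# Resolver used in the curl examples above.
DOI_RESOLVER = "http://dx.doi.org/"


def bibliography_request(doi, style, locale=None):
    """Build (but do not send) a request for a formatted citation.

    The style and locale travel as parameters of the
    "text/bibliography" content type in the Accept header.
    """
    accept = f"text/bibliography; style={style}"
    if locale:
        accept += f"; locale={locale}"
    return Request(DOI_RESOLVER + doi, headers={"Accept": accept})


req = bibliography_request("10.1038/nrd842", "mla", locale="fr-FR")
print(req.full_url)
print(req.get_header("Accept"))

# To actually fetch the citation (needs network access):
#   from urllib.request import urlopen
#   print(urlopen(req).read().decode("utf-8"))
```

Passing the request to `urlopen` follows redirects, much as `curl -L` does.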

You may want to process metadata through CSL yourself. For this use case, there's another new content type, "application/citeproc+json" that returns metadata in a citeproc-friendly JSON form:

$ curl -LH "Accept: application/citeproc+json" http://dx.doi.org/10.1038/nrd842

{"volume":"1","issue":"7","DOI":"10.1038/nrd842","title":"From the analyst's couch: Selective anticancer drugs","container-title":"Nature Reviews Drug Discovery","issued":{"date-parts":[[2002,7]]},"author":[{"family":"Atkins","given":"Joshua H."},{"family":"Gershell","given":"Leland J."}],"page":"491-492","type":"article-journal"}
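As a rough illustration of working with this JSON yourself, the sketch below parses the record shown above and builds a very plain citation line. It is not a CSL processor, and the function name `short_citation` and its output layout are assumptions made for this example. Fields are looked up defensively because, as noted further down, the metadata can be patchy.

```python
import json

# The citeproc JSON returned in the example above.
SAMPLE = '''{"volume":"1","issue":"7","DOI":"10.1038/nrd842","title":"From the analyst's couch: Selective anticancer drugs","container-title":"Nature Reviews Drug Discovery","issued":{"date-parts":[[2002,7]]},"author":[{"family":"Atkins","given":"Joshua H."},{"family":"Gershell","given":"Leland J."}],"page":"491-492","type":"article-journal"}'''


def short_citation(item):
    """Return 'Authors (year). Title. Container.' and skip missing fields."""
    authors = ", ".join(a["family"] for a in item.get("author", []) if "family" in a)
    date_parts = item.get("issued", {}).get("date-parts") or [[]]
    year = date_parts[0][0] if date_parts[0] else "n.d."
    parts = [f"{authors} ({year})" if authors else f"({year})"]
    for key in ("title", "container-title"):
        if item.get(key):
            parts.append(item[key])
    return ". ".join(parts) + "."


record = json.loads(SAMPLE)
print(short_citation(record))
```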

Finally, to retrieve lists of supported styles and locales, either hit these URLs:

or check out the CSL style and locale repositories.

There's one big caveat to all this. The CSL processor will do its best with CrossRef metadata which can unfortunately be quite patchy at times. There may be pieces of metadata missing, inaccurate metadata or even metadata items stored under the wrong field, all resulting in odd-looking formatted citations. Most of the time, though, it works.

October 10, 2011

DataCite supporting content negotiation

In April CrossRef launched content negotiation support for its DOIs. At the time I cheekily called-out DataCite to start supporting content negotiation as well.

Edward Zukowski (DataCite's resident propeller-head) took up the challenge with gusto and, as of September 22nd, DataCite has also been supporting content negotiation for its DOIs. This means that one million more DOIs are now linked-data friendly. Congratulations to Ed and the rest of the team at DataCite.

We hope this is a trend. Back in June Knowledge Exchange organized a seminar on Persistent Object Identifiers. One of the outcomes of the meeting was the "Den Haag Manifesto", a document outlining five relatively simple steps that different persistent identifier systems could take in order to increase interoperability. Most of these steps involved adopting linked data principles, including support for content negotiation. We look forward to hearing about other persistent identifiers adopting these principles over the next year.

Having said that, this time I will refrain from calling-out anybody specifically...


April 19, 2011

Content Negotiation for CrossRef DOIs

So does anybody remember the posting DOIs and Linked Data: Some Concrete Proposals?

Well, we went with option "D."

From now on, DOIs, expressed as HTTP URIs, can be used with content-negotiation.

Let's get straight to the point. If you have curl installed, you can start playing with content-negotiation and CrossRef DOIs right away:

curl -D - -L -H "Accept: application/rdf+xml" "http://dx.doi.org/10.1126/science.1157784"

curl -D - -L -H "Accept: text/turtle" "http://dx.doi.org/10.1126/science.1157784"

curl -D - -L -H "Accept: application/atom+xml" "http://dx.doi.org/10.1126/science.1157784"

Or if you are already using CrossRef's "unixref" format:

curl -D - -L -H "Accept: application/unixref+xml" "http://dx.doi.org/10.1126/science.1157784" 

This will work with over 46 million CrossRef DOIs as of today, but the beauty of the setup is that from now on, any DOI registration agency can enable content negotiation for their constituencies as well. DataCite, we're looking at you ;-)

It also means that, as registration agency members (CrossRef publishers, for instance) start providing more complete and richer representations of their content, we can simply redirect content-negotiated requests directly to them.

We expect that this development will round out CrossRef's efforts to support standard APIs, including OpenURL and OAI-PMH, and we look forward to seeing DOIs increasingly used in linked data applications.

Finally, CrossRef would just like to thank the IDF and CNRI for their hard work on this as well as Tony Hammond and Leigh Dodds for their valuable advice and persistent goading.







March 25, 2010

DOIs and Linked Data: Some Concrete Proposals

Since last month's threads (here, here, here and here) talking about the issues involved in making the DOI a first-class identifier for linked data applications, I've had the chance to actually sit down with some of the thread's participants (Tony Hammond, Leigh Dodds, Norman Paskin) and we've been able to sketch out some possible scenarios for migrating the DOI into a linked data world.

I think that several of us were struck by how little actually needs to be done in order to fully address virtually all of the concerns that the linked data community has expressed about DOIs. Not only that- but in some of these scenarios we would put ourselves in a position to be able to semantically-enable over 40 million DOIs with what amounts to the flick of a switch.

Given the huge interest in linked data on the part of researchers and CrossRef members- it seems like it would be a fantastic boon to both the IDF (International DOI Foundation) and CrossRef if we were able to do something quickly here.

Anyway- The following are notes outlining several concrete proposals for addressing the limitations of DOIs as identifiers in linked data applications. They range in complexity/effort involved- with the simplest scenario providing minimal (yet functional) LD capabilities for just one RA's members (CrossRef's) and the most complex providing per-RA and per-RA-member configurability on how DOIs would behave for LD applications.

We'd appreciate comments, questions, suggestions, corrections, etc.

A: Simplest Scenario

What would need to be done?

  1. CrossRef implements a linked data service. For example, hosted at rdf.crossref.org.
  2. CrossRef recommends that any member publisher who wants to add rudimentary linked data capabilities to their site could simply insert some simple link elements into their landing pages. So, for instance, for the article with the DOI 10.5555/1234567 in the Journal of Psychoceramics, the publisher would put the following in the landing page for the article:
    <link rel="primarytopic" href="http://doi.crossref.org/10.5555/1234567" />
    <link rel="alternate" type="application/rdf+xml" href="http://rdf.crossref.org/metadata/10.5555/1234567.rdf" title="RDF/XML version of this document"/>
    <link rel="alternate" type="text/html" href="http://www.journalofpsychoceramics.org/10.5555/1234567.html" title="HTML version of this document"/>
    <link rel="alternate" type="application/json" href="http://rdf.crossref.org/metadata/10.5555/1234567.json" title="RDF/JSON version of this document"/>
    <link rel="alternate" type="text/turtle" href="http://rdf.crossref.org/metadata/10.5555/1234567.ttl" title="Turtle version of this document"/>

In the above snippet the HTML version of the document is the publisher's existing landing page.

How it would work

  1. A sem-web-enabled browser would query dx.doi.org/10.5555/1234567 and get a normal 302 redirect to the publisher's landing page. 
  2. The sem-web-enabled browser would sniff the page for the link elements and retrieve the representations it wanted from rdf.crossref.org
  3. The returned document would contain an appropriate representation of the metadata that the publisher has deposited with CrossRef. It would also assert that:

doi.crossref.org/10.5555/1234567 owl:sameAs dx.doi.org/10.5555/1234567 .
dx.doi.org/10.5555/1234567 owl:sameAs info:doi/10.5555/1234567 .
info:doi/10.5555/1234567 owl:sameAs doi:10.5555/1234567 .

Alternatively, the publisher could implement their own linked data support on their own domain using whatever appropriate method they want. So, for instance, a larger publisher could support content negotiation at their site and return different/enhanced metadata, etc.
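Step 2 of the flow above, in which a client sniffs the landing page for link elements, might be sketched as follows. This is a minimal illustration using Python's standard `html.parser`; the class name `AlternateLinkFinder` and the trimmed-down landing page are assumptions made for the example, and a real client would fetch the page over HTTP first.

```python
from html.parser import HTMLParser


class AlternateLinkFinder(HTMLParser):
    """Collect href values of <link rel="alternate"> elements, keyed by type."""

    def __init__(self):
        super().__init__()
        self.alternates = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <link ... /> elements are also routed through here.
        if tag != "link":
            return
        attr = dict(attrs)
        if attr.get("rel") == "alternate" and "type" in attr and "href" in attr:
            self.alternates[attr["type"]] = attr["href"]


# A cut-down landing page carrying two of the link elements proposed above.
LANDING_PAGE = """
<html><head>
<link rel="alternate" type="application/rdf+xml" href="http://rdf.crossref.org/metadata/10.5555/1234567.rdf" title="RDF/XML version of this document"/>
<link rel="alternate" type="text/turtle" href="http://rdf.crossref.org/metadata/10.5555/1234567.ttl" title="Turtle version of this document"/>
</head><body></body></html>
"""

finder = AlternateLinkFinder()
finder.feed(LANDING_PAGE)
print(finder.alternates.get("text/turtle"))
```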

Pros

  1. Doesn't require changes at DOI/Handle levels
  2. Is easy for publisher to opt-in or opt-out
  3. Requires minimal development on the part of CrossRef.

Cons

  1. Only applies to CrossRef DOIs.
  2. It depends on publishers taking action. Might be a long time before publishers add the needed links to their landing pages or support content negotiation.
  3. DOI system is still not strictly LD compliant (e.g. it is returning 302 redirects. Naive sem-web browsers might 'stop' after getting a 302. Should ideally use 303s, content negotiation, etc.)
  4. Doesn't work for DOIs that currently bypass landing pages and which go directly to content.

B: Simple + IDF Global Semantic Compliance

What would need to be done?

  1. Same as "Simplest Scenario"
  2. IDF globally changes dx.doi.org to return 303 redirect

How would it work?

Same as Simplest Scenario, except that, because the sem-web-enabled browser has been told it is being redirected to a NIR (via the 303), it would presumably be more likely to continue.

Pros

  1. All DOIs conform to expectations for LD identifiers
  2. Easy for publisher to opt-in or opt-out
  3. Requires minimal development on part of CrossRef
  4. Requires minimal work (?) on part of IDF

Cons

  1. Requires global change on part of IDF. Global change might conflict with requirements of other RAs.
  2. It depends on publishers taking action. Might be a long time before publishers add needed links to their landing pages or support content negotiation.
  3. Doesn't work for DOIs that currently bypass landing pages (e.g. OECD spreadsheets, UICR datasets, etc.)

C: Simple + IDF Global Semantic Compliance + RA CN Intercept

What would need to be done?

  1. Same as "B: Simple + IDF Global Semantic Compliance" Scenario
  2. IDF  changes dx.doi.org to redirect content-negotiated dx.doi.org queries to RA-controlled resolver depending on the preferences of the RA.
  3. RA implements DOI resolver (e.g. dx.crossref.org) that supports content negotiation. RA allows its members to specify to the RA  that they want either:
    1. RA to forward all requests to the member's site.
    2. RA to "intercept" content-negotiations for non-HTML representations and direct them appropriately (e.g. return appropriate representation from rdf.crossref.org)

How would it work?



Pros

  1. All DOIs conform to expectations for LD identifiers
  2. Allows RA to potentially LD-enable its members very quickly.
  3. Easy for ra-members to opt-in or opt-out
  4. Requires minimal development on part of CrossRef
  5. Would even work for DOIs that bypass landing pages

Cons

  1. Requires global change on part of IDF. Global change might conflict with requirements of other RAs.
  2. Requires change to add decision logic implementation on part of IDF. 
  3. Requires development of RA resolvers that implement per-member resolution logic (note- this would probably actually be done at DOI level)

D: Simple + IDF Selective Semantic Compliance + RA CN Intercept

What would need to be done?

  1. Same as Simplest Scenario
  2. IDF  changes dx.doi.org to return either 302 or 303 redirect depending on the preferences of the RA.
  3. IDF  changes dx.doi.org to redirect content-negotiated dx.doi.org queries to RA-controlled resolver depending on the preferences of the RA.
  4. RA implements DOI resolver (e.g. dx.crossref.org) that supports content negotiation. RA allows its members to specify to the RA  that they want either:
    1. RA to forward all requests to the member's site.
    2. RA to "intercept" content-negotiations for non-HTML representations and direct them appropriately (e.g. return appropriate representation from rdf.crossref.org)
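The decision logic in steps 2 to 4 can be sketched roughly as below. Everything here is hypothetical: the `RAPreferences` fields, the function names, and the rule for deciding what counts as a content-negotiated request are assumptions made to illustrate the proposal, not a description of how dx.doi.org or any RA resolver is built. The file extensions are the ones used in the link elements of the Simplest Scenario.

```python
from dataclasses import dataclass


@dataclass
class RAPreferences:
    """Hypothetical switches an RA and its member might set."""
    use_303: bool           # RA wants dx.doi.org to answer 303 rather than 302
    ra_resolver: str        # RA-controlled resolver, e.g. "http://dx.crossref.org/"
    member_intercept: bool  # member lets the RA answer non-HTML requests


# Extensions taken from the link elements in the Simplest Scenario.
EXTENSIONS = {
    "application/rdf+xml": ".rdf",
    "text/turtle": ".ttl",
    "application/json": ".json",
}


def is_content_negotiated(accept):
    """Assumption: anything but a missing, wildcard or HTML Accept header."""
    accept = accept.strip().lower()
    return accept not in ("", "*/*") and "text/html" not in accept


def idf_proxy(doi, accept, landing_page, prefs):
    """Steps 2 and 3: what dx.doi.org would answer, as (status, location)."""
    status = 303 if prefs.use_303 else 302
    if is_content_negotiated(accept):
        return status, prefs.ra_resolver + doi
    return status, landing_page


def ra_resolver(doi, accept, landing_page, prefs):
    """Step 4: forward to the member's site, or intercept non-HTML requests."""
    ext = EXTENSIONS.get(accept.strip().lower())
    if prefs.member_intercept and ext:
        return "http://rdf.crossref.org/metadata/" + doi + ext
    return landing_page


prefs = RAPreferences(use_303=True, ra_resolver="http://dx.crossref.org/", member_intercept=True)
page = "http://www.journalofpsychoceramics.org/10.5555/1234567.html"
print(idf_proxy("10.5555/1234567", "text/turtle", page, prefs))
print(ra_resolver("10.5555/1234567", "text/turtle", page, prefs))
```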

How would it work?



Pros

  1. Allows RA to potentially LD-enable its members very quickly.
  2. Easy for ra-members to opt-in or opt-out
  3. Requires minimal development on part of CrossRef
  4. Would even work for DOIs that bypass landing pages

Cons

  1. Only some DOIs conform to expectations for LD identifiers
  2. Requires change to add decision logic implementation on part of IDF. 
  3. Requires development of RA resolvers that implement per-member resolution logic (note- this would probably actually be done at DOI level)

February 13, 2010

Is FRBR the OSI for Web Architecture?

(This post is just a repost of a comment to Geoff's last entry made because it's already rather long, because it contains one original thought - FRBR as OSI - and because, well, it didn't really want to wait for moderation.)

Hi Geoff:

First off, there is no question but that CrossRef was established to take on the reference linking challenge for scholarly literature. (Hell, it's there, as you point out, in the organization name - PILA - as well as in the application name - CrossRef.)

But one should also remember that DOI as it was sold at the time was promising so much more. I disagree with you that the participants back then were as wholly innocent of the FRBR terms as you might suggest. Certainly there were ample presentations on DOI that sought to elucidate those relationships.

No matter. FRBR is a useful reference model to clarify some of these concepts. But not one that we are overly concerned with at this time. Nor even whether DOI maps one to one onto a given FRBR layer. What we are more concerned with on a pragmatic level is how DOI maps onto the Web architecture and especially how it plays along with Linked Data concepts.

(Aside: A propos FRBR we might be in danger of repeating the OSI mistake for standardizing the network layer model. Ultimately that was maintained as a reference model but dropped as a concrete model in favour of the TCP/IP stack. Could be that FRBR is our OSI and Linked Data is our TCP/IP stack? That is, we might have to settle on the coarser data model in order to get a coherent story out the door where all can agree.)

You say:

"we need a mechanism to distinguish between when we are getting the thing pointed to by the CrossRef DOI (the PDF , HTML, etc.) as opposed to "something about the thing" (e.g. the landing page, metadata record, etc.)"
But that is exactly what we were chasing up in the earlier posts (both my DOI: What Do We Got? and John Erickson's DOIs, URIs and Cool Resolution). You want to distinguish between a thing and a description about a thing. And Web architecture does just that: it distinguishes between Information Resources (i.e. the things) and Non-Information Resources (i.e. descriptions of the things).

Now is this something that CrossRef can truly distinguish and make apparent in its service architecture? If we retain the notion of landing page we are already essentially saying that a CrossRef HTTP URI identifies a description of the resource, i.e. a Non-Information Resource, or Other Resource, and that is properly indicated within the architecture by returning a "303 See Other" status code.

I think that's all we're saying at the moment as a first step.

Web architecture wants to know if the DOI HTTP URI is a thing or description of a thing. I say the latter. You seem to suggest in your comment the latter too. I wonder if we could get a vote on that.

And btw, I am not suggesting that CrossRef needs to dive into the business of "tracking compound documents in their entirety". Far from it. Let's just get a common resource architecture agreed publicly and then we can build on that.

This observation I received in a private email is something I fully support:

"The real problem is what doi http uri identify on the web. Everything flows from the answer to that Q."
Tony


February 11, 2010

Does a CrossRef DOI identify a "work?"

Tony's recent thread on making DOIs play nicely in a linked data world has raised an issue I've meant to discuss here for some time- a lot of the thread is predicated on the idea that CrossRef DOIs are applied at the abstract "work" level. Indeed, that is what it currently says in our guidelines. Unfortunately, this is a case where theory, practice and documentation all diverge.

When the CrossRef linking system was developed it was focused primarily on facilitating persistent linking amongst journals and conference proceedings. The system was quickly adapted to handle books and more recently to handle working papers, technical reports, standards and “components”- a catchall term used to refer to everything from individual article images to database records.

In practice the content outside of the core journals and conference proceedings has accounted for relatively low volume. However, we expect that over the next few years this will change and that books and databases will increasingly drive the future growth in CrossRef’s citation linking services. Interestingly, these content types all share characteristics that make them substantially different from the journals and conference proceedings that we have hitherto focused on.

Both books and databases introduce new challenges to the technology and policies of our citation linking service. The challenges revolve around two areas:

  • Structure: Both books and databases can have complex structures and the publishers of this content are likely to require granular identification of these content substructures along with a mechanism for documenting the relationship between these substructures (e.g. this section is part of this chapter which is part of this monograph which is part of this series)
  • Versioning: Unlike typical journals and conference proceedings, books and database records sometimes change over time.


When confronted with the issues of structure and versioning publishers are often tempted to take shortcuts and decide to simply assign DOIs at the highest level structure and to the “work” instead of a particular “manifestation” or version of that work. Indeed, section 5.5 of CrossRef's DOI Name Information and Guidelines recommends this. But this approach could have a negative impact on the integrity of the scholarly citation record that CrossRef is attempting to maintain.

Fundamentally, CrossRef DOIs are aimed at providing a persistent online citation infrastructure for scholarly and professional publishers. Consequently, decisions about where to apply CrossRef DOIs should be guided by common expectations about the way in which citations work. Citations are typically used to credit ideas or provide evidence. A reader follows a citation in order to obtain more detail or to verify that an author is accurately representing the item cited. A rule of thumb is that a reader has a reasonable expectation that when they follow a citation, they will be taken to what the author saw when creating the citation. Any divergent behavior could result in the reader concluding that the author was misrepresenting the item cited. A further implication of this is that any changes to content that are likely to affect the crediting or interpretation of the content should result in that changed content getting a new CrossRef DOI.

Typically, this means that CrossRef DOIs should probably be assigned at the expression level and different expressions should be assigned different CrossRef DOIs. This is because assigning a CrossRef DOI at the higher "work" level is generally not granular enough to guarantee that a reader following the citation will see what the author saw when creating the citation. For example, one translation of a work might be substantially different from another translation of the same work. Similarly a draft version of a work might be substantially different from the final published version of the work. In each case, resolving a citation to a different expression of the work than the expression that was originally cited might result in the reader interpreting the content differently than the citing author.

In general, different "equivalent manifestations" of the same work can safely be assigned the same CrossRef DOI. So, for instance, the HTML formatted version of an article and the PDF formatted version of an article can almost always be assigned the same CrossRef DOI. Any differences between the two are unlikely to affect the crediting of, or reader's interpretation of, the work. But sometimes it is even possible that different manifestations of an expression will differ enough to merit different CrossRef DOIs. For instance, a semantically enhanced version of an article might require new crediting (e.g. the parties responsible for adding the semantic information) and the resulting semantic enhancement may conceivably alter the reader's interpretation of the article.

Unfortunately, there is no hard and fast rule about where and when to assign new CrossRef DOIs. Instead there is only a guideline, namely:

"Assign new CrossRef DOIs to content in a way that will ensure that a reader following the citation will see something as close to what the original author cited as is possible."

The implications of this to publishers are important, especially when they are assigning DOIs to protean content types. For instance, it may mean that:

  • Book publishers should be expected to keep old editions of books available for link resolution purposes.
  • Publishers of content that can change rapidly (e.g. by the second) should provide facilities for creating frozen, archived snapshots of content for citation purposes.
  • All publishers of protean content should issue guidelines instructing researchers on when it is appropriate to cite a work, manifestation or version.

CrossRef needs to actively consider these issues as publishers start assigning CrossRef DOIs to more dynamic types of content. Minimally, we should be able to provide publishers with recommendations on how to make dynamic content citable. We may even want to consider enshrining certain types of behavior in our terms and conditions so as to ensure the future integrity of the scholarly citation record.

In short, we need to update our guidelines.

February 10, 2010

The Response Page

(Update - 2010.02.10: I just saw that I posted here on this same topic over a year ago. Oh well, I guess this is a perennial.)

I am opening a new entry to pick up one point that John Erickson made in his last comment to the previous entry:

"I am suggesting that one "baby step" might be to introduce (e.g.) RDFa coding standards for embedding the doi:D syntax."
Yea!

It might be worth consulting the latest CrossRef "DOI Name Information and Guidelines" (PDF) to see what that has to say about this. Section 6.3 - The response page has these two specific requirements for publishers:

  1. When metadata and DOIs are deposited with CrossRef, the publisher must have active response pages in place so that they can resolve incoming links.
  2. A minimal response page must contain a full bibliographic citation displayed to the user. A response page without bibliographic information should never be presented to a user.
What is truly shocking about these requirements is that they are purely user focussed. There is no mention whatsoever of machines. One might have thought that with the Linked Data gospel in full swing there would at least be a nod to machine-readable metadata. But there's none. I'm not saying that there should be any requirement, or even any recommendation. But a mention might have been useful to chivvy us all along.

I agree with John that publishers could be encouraged (or even just reminded) that machine-readable metadata could be made available through various mechanisms: HTML META tags (such as we currently provide at Nature - and as blogged here earlier), COinS objects, RDF/XML comments, or best of all RDFa markup as John mentions.

The Web is getting semantic. It's about time that CrossRef members joined the wave. And it would be helpful if CrossRef were there to help us with some new guidelines too!

February 9, 2010

DOI: What Do We Got?

[Diagram: doi-what-do-we-got.png]

Following the JISC seminar last week on persistent identifiers (#jiscpid on Twitter) there was some discussion about DOI and its role within a Linked Data context. John Erickson has responded with a very thoughtful post, DOIs, URIs and Cool Resolution, which ably summarizes the current problem with DOI: the way the DOI is implemented by the Handle HTTP proxy may not have kept pace with actual HTTP developments. (For example, John notes that the proxy is not capable of dealing with 'Accept' headers.) He has proposed a solution, and the post has attracted several comments.

I just wanted to offer here the above diagram in an attempt to corral some of the various facets relating to DOI that I am aware of. I realize that this may seem like an open invitation to flame on - and this is a very preliminary draft - but ... be kind!

So, this may be totally off the wall but it represents my best understanding of DOI as used by CrossRef.

I have distinguished three main contexts:

  1. Generic Data - A generalized information context where an object is identified with a DOI, an identifier system that is currently being ratified through the ISO process. This is the raw DOI number. (This definitely is not a first class object on the Web as it has no URI.)
  2. Web Data - An online information context (here I use the term 'Web' in its widest sense) where resources are identified by URI (not necessarily an HTTP URI). Here DOI is represented under two URI schemes: 'doi:' (unregistered but preferred by CrossRef), and 'info:' (registered and available for general URI use). Also it has a presence on the Web via an HTTP proxy (dx.doi.org) URL where it is used as a slug to create a permalink (as listed at 'A'). A simple HTTP redirect is used (with status code 302) to turn this permalink into the publisher response page http://example/1. (Note that typically a second redirect will occur on the publisher platform, here shown by the redirect to http://example/2.)
  3. Linked Data - An online information context where resources are identified by HTTP URI and conform to Linked Data principles. Now this is where a tension arises between the common publisher perspective and the strict semantic viewpoint. Implicit in the general Web context given above was the notion that the permalink ('A') was somehow related to the abstract object and the redirection service applied to it associated the abstract resource with concrete representations of the object.
So how do we relate the DOI HTTP URI with the abstract ('work') identifier listed at 'D' in the diagram?

Well the Architecture of the World Wide Web recognizes two distinct classes of resources: Information Resources (IR) and Non-Information Resources (NR). (Note: Only the term 'information resource' is used in AWWW.) IR are those that can be directly retrieved using HTTP, whereas NR are not directly retrievable but have an associated description which is retrievable and is itself a proxy for the real world object.

So either the HTTP URI denotes an IR (as listed at 'B') and is resolved (through HTTP status code '302 Found') to a default representation, which is the view that the Linked Data community would currently have of DOI. But this is at odds with the CrossRef position, which regards DOI as identifying the abstract work. Alternatively, to better fit the CrossRef model of DOI, the HTTP URI would denote an NR (as listed at 'A') which would be resolved (through HTTP status code '303 See Other') to an associated description - a publisher response page.
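To make the two readings concrete, here is a small sketch of how a client might interpret the redirect status it receives from the DOI proxy. The function name `reading_of` and the wording of the returned descriptions are assumptions for illustration; the mapping itself simply restates the paragraph above.

```python
from http import HTTPStatus


def reading_of(status):
    """Map a redirect status from the DOI proxy to the reading described above.

    Returns a (resource_class, explanation) pair; resource_class is None
    for status codes this sketch does not cover.
    """
    if status == HTTPStatus.FOUND:      # 302
        return "IR", "Location is a default representation of the resource"
    if status == HTTPStatus.SEE_OTHER:  # 303
        return "NR", "Location is a description of the resource"
    return None, "no reading defined in this sketch"


for code in (302, 303):
    print(code, HTTPStatus(code).phrase, reading_of(code))
```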

There will be those self-appointed URI czars who will bemoan the fact of there being multiple URIs. But frankly there is nothing inherently wrong with that. Just as in the real world there are many languages so in the online world there are multiple contexts and histories. We can attempt to make some sense of this by making use of the well-known semantic properties owl:sameAs and ore:similarTo and declare (as also shown in the diagram) the following assertions:


info:doi/D owl:sameAs doi:D .

http://dx.doi.org/D ore:similarTo info:doi/D .

http://dx.doi.org/D ore:similarTo doi:D .


Note that ore:similarTo (stemming from the OAI-ORE work) is a weaker kind of relationship than owl:sameAs (which comes from OWL) and may be appropriate in this usage.
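For anyone wanting to generate these assertions mechanically, the sketch below emits them as N-Triples lines with the prefixes expanded to full IRIs. The function name `doi_assertions` is an assumption for this example, and the DOI is inserted verbatim: a DOI containing characters that are not legal in an IRI would need escaping first, which this sketch does not attempt.

```python
# Full IRIs behind the owl: and ore: prefixes used above.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
ORE_SIMILAR_TO = "http://www.openarchives.org/ore/terms/similarTo"


def doi_assertions(doi):
    """Return the three assertions above as N-Triples lines for one DOI."""
    info = f"info:doi/{doi}"
    bare = f"doi:{doi}"
    http = f"http://dx.doi.org/{doi}"
    triples = [
        (info, OWL_SAME_AS, bare),
        (http, ORE_SIMILAR_TO, info),
        (http, ORE_SIMILAR_TO, bare),
    ]
    return [f"<{s}> <{p}> <{o}> ." for s, p, o in triples]


for line in doi_assertions("10.5555/1234567"):
    print(line)
```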

In sum, scenario 'A' is what we have currently implemented, scenario 'B' is what might be commonly perceived as being implemented, and scenario 'C' may be a more correct semantic position.

Your comments (and not unkind comments, please;) are more than welcome.