
February 11, 2010

Does a CrossRef DOI identify a "work?"

Tony's recent thread on making DOIs play nicely in a linked data world has raised an issue I've meant to discuss here for some time: a lot of the thread is predicated on the idea that CrossRef DOIs are applied at the abstract "work" level. Indeed, that is what it currently says in our guidelines. Unfortunately, this is a case where theory, practice and documentation all diverge.

When the CrossRef linking system was developed it was focused primarily on facilitating persistent linking amongst journals and conference proceedings. The system was quickly adapted to handle books and more recently to handle working papers, technical reports, standards and “components”- a catchall term used to refer to everything from individual article images to database records.

In practice the content outside of the core journals and conference proceedings has accounted for relatively low volume. However, we expect that over the next few years this will change and that books and databases will increasingly drive the future growth in CrossRef’s citation linking services. Interestingly, these content types all share characteristics that make them substantially different from the journals and conference proceedings that we have hitherto focused on.

Both books and databases introduce new challenges to the technology and policies of our citation linking service. The challenges revolve around two areas:

  • Structure: Both books and databases can have complex structures and the publishers of this content are likely to require granular identification of these content substructures along with a mechanism for documenting the relationship between these substructures (e.g. this section is part of this chapter which is part of this monograph which is part of this series)
  • Versioning: Unlike typical journals and conference proceedings, books and database records sometimes change over time.
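
The "part of" relationships in the structure example can be sketched as a simple chain of links. This is only an illustration, not how CrossRef stores metadata: the `ContentItem` class, the `part_of` field and the DOIs are all hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ContentItem:
    """One identifiable unit of content; every DOI used below is made up."""
    doi: str
    kind: str
    part_of: Optional["ContentItem"] = None  # the enclosing structure, if any


def containment_chain(item: ContentItem) -> list[str]:
    """Walk the part-of links upward and return the kinds, innermost first."""
    chain = []
    current: Optional[ContentItem] = item
    while current is not None:
        chain.append(current.kind)
        current = current.part_of
    return chain


# Hypothetical identifiers, used only to show the nesting.
series = ContentItem("10.5555/series", "series")
monograph = ContentItem("10.5555/series.mono1", "monograph", part_of=series)
chapter = ContentItem("10.5555/series.mono1.ch3", "chapter", part_of=monograph)
section = ContentItem("10.5555/series.mono1.ch3.s2", "section", part_of=chapter)

print(" is part of ".join(containment_chain(section)))
```

Each substructure carries its own identifier and a single pointer to its container, which is enough to recover the whole chain from any starting point.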


When confronted with the issues of structure and versioning publishers are often tempted to take shortcuts and decide to simply assign DOIs at the highest level structure and to the “work” instead of a particular “manifestation” or version of that work. Indeed, section 5.5 of CrossRef's DOI Name Information and Guidelines recommends this. But this approach could have a negative impact on the integrity of the scholarly citation record that CrossRef is attempting to maintain.

Fundamentally, CrossRef DOIs are aimed at providing a persistent online citation infrastructure for scholarly and professional publishers. Consequently, decisions about where to apply CrossRef DOIs should be guided by common expectations about the way in which citations work. Citations are typically used to credit ideas or provide evidence. A reader follows a citation in order to obtain more detail or to verify that an author is accurately representing the item cited. A rule of thumb is that a reader has a reasonable expectation that when they follow a citation, they will be taken to what the author saw when creating the citation. Any divergent behavior could result in the reader concluding that the author was misrepresenting the item cited. A further implication of this is that any changes to content that are likely to affect the crediting or interpretation of the content should result in that changed content getting a new CrossRef DOI.

Typically, this means that CrossRef DOIs should probably be assigned at the expression level and different expressions should be assigned different CrossRef DOIs. This is because assigning a CrossRef DOI at the higher "work" level is generally not granular enough to guarantee that a reader following the citation will see what the author saw when creating the citation. For example, one translation of a work might be substantially different from another translation of the same work. Similarly, a draft version of a work might be substantially different from the final published version of the work. In each case, resolving a citation to a different expression of the work than the expression that was originally cited might result in the reader interpreting the content differently than the citing author.

In general, different "equivalent manifestations" of the same work can safely be assigned the same CrossRef DOI. So, for instance, the HTML formatted version of an article and the PDF formatted version of an article can almost always be assigned the same CrossRef DOI. Any differences between the two are unlikely to affect the crediting of, or reader's interpretation of, the work. But sometimes it is even possible that different manifestations of an expression will differ enough to merit different CrossRef DOIs. For instance, a semantically enhanced version of an article might require new crediting (e.g. the parties responsible for adding the semantic information) and the resulting semantic enhancement may conceivably alter the reader's interpretation of the article.
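
The rule of thumb running through these examples can be written down as a small sketch. It is not CrossRef policy or tooling: the `Change` class and `needs_new_doi` function are hypothetical, and the two flags are editorial judgments that a person has to make (the values below are my own reading of the examples above).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """A difference between two versions of cited content (illustrative only)."""
    description: str
    affects_crediting: bool       # a human judgment, not something computed
    affects_interpretation: bool  # likewise a human judgment


def needs_new_doi(change: Change) -> bool:
    """New DOI if the change could alter crediting or interpretation."""
    return change.affects_crediting or change.affects_interpretation


# Flag values are assumptions made for the sake of the example.
examples = [
    Change("HTML rendering vs. PDF rendering", False, False),
    Change("draft vs. final published version", False, True),
    Change("a different translation", True, True),
    Change("semantically enhanced version", True, True),
]

for change in examples:
    verdict = "new DOI" if needs_new_doi(change) else "same DOI"
    print(f"{change.description}: {verdict}")
```

The code is trivial on purpose: all of the difficulty lives in deciding how to set the two flags for a given piece of content.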

Unfortunately, there is no hard and fast rule about where and when to assign new CrossRef DOIs. Instead there is only a guideline, namely:

"Assign new CrossRef DOIs to content in a way that will ensure that a reader following the citation will see something as close to what the original author cited as is possible."

The implications of this to publishers are important, especially when they are assigning DOIs to protean content types. For instance, it may mean that:

  • Book publishers should be expected to keep old editions of books available for link resolution purposes.
  • Publishers of content that can change rapidly (e.g. by the second) should provide facilities for creating frozen, archived snapshots of content for citation purposes.
  • All publishers of protean content should issue guidelines instructing researchers on when it is appropriate to cite a work, manifestation or version.

CrossRef needs to actively consider these issues as publishers start assigning CrossRef DOIs to more dynamic types of content. Minimally, we should be able to provide publishers with recommendations on how to make dynamic content citable. We may even want to consider enshrining certain types of behavior in our terms and conditions so as to ensure the future integrity of the scholarly citation record.

In short, we need to update our guidelines.

August 14, 2009

Strategic Reading

Allen Renear and Carole Palmer have just published an article titled "Strategic Reading, Ontologies, and the Future of Scientific Publishing" in the current issue of Science (http://dx.doi.org/10.1126/science.1157784). I'm particularly happy to see this paper published because I actually got to witness the genesis of these ideas in my living room back in 2006. Since then, Allen and Carole's ideas have profoundly influenced my thinking on the application of technology to scholarly communication.

Those who have seen me speak at conferences recently will have heard me do an awful lot of ranting about how publishers and librarians need to help researchers practice the time-honored art of "reading avoidance" (or as Renear and Palmer politely put it, "strategic reading"). I even managed to squeeze this rant into a recent interview I did with Wiley-Blackwell.

The essence of my argument has been that our industries need not be bamboozled by the technical jargon and messianic hand-waving that typically accompany discussions of new technology trends like "web 2.0", "text-mining", "the semantic web", "micro-blogging", etc. This is because there is a fairly simple way for us to understand the relative import (or lack thereof) of new technologies to scholarly communication and that is to ask the following question:

"Can the application of this technology in the realm of scholarly communication help researchers to read less?"

If the answer is "yes", then you'd better pay very close attention to it.

In fact, I'd go so far as to say the history of scholarly publishing can be characterized by the successful adoption of conventions and tools that help researchers read strategically.

Now I have something to cite when I rant.

Anyway, congratulations to Allen & Carole.

March 26, 2008

Word Add-in for Scholarly Authoring and Publishing

Last week Pablo Fernicola sent me email announcing that Microsoft have finally released a beta of their Word plugin for marking-up manuscripts with the NLM DTD. I say "finally" because we've known this was on the way and have been pretty excited to see it. We once even hoped that MS might be able to show the plug-in at the ALPSP session on the NLM DTD, but we couldn't quite manage it.

The plugin is targeted at production/editorial staff, but, of course, it will be interesting to see if any of this work can be pushed back to the author. I won't hold my breath on the latter score, but it will be fun to watch.

One thing I would note is that the NLM DTD can also be used in the humanities and social sciences, so, frankly, I think they should market it more broadly.

Anyway- the plugin can be downloaded from the Microsoft site.

And Pablo has set up a blog where testers can discuss the add-in.

And there is also an entry for the project on the Microsoft Research site (an interesting place to peruse, if you have a moment).

Congratulations to Pablo and his team.

December 14, 2007

On Google Knol

The recently discussed (announced?) Google Knol project could make Google Scholar look like a tiny blip in the scholarly publishing landscape.

I love the comment on authority:

"Books have authors' names right on the cover, news articles have bylines, scientific articles always have authors -- but somehow the web evolved without a strong standard to keep authors names highlighted. We believe that knowing who wrote what will significantly help users make better use of web content."

And so I suppose this means they are assigning author identifiers....

July 02, 2007

Oh, shiny!

The other day Ed and I visited the OECD to talk about all things e-publishing. At the end of our meeting, Toby Green, the OECD's head of publishing, handed all 30+ meeting attendees a copy of their well-known OECD Factbook, on a USB stick.

Picture of the OECD Factbook USB stick

Before you dismiss this as a gimmick, note that organizations like the OECD get a lot of political and marketing mileage with "leave behinds": print copies of their key reports, conference proceedings and reference works. While researchers might prefer electronic versions of the publications for their day-to-day work, print versions of the same publications seem to continue to play a critical role as an "awareness tool." I know that, for this very reason, several NGO/IGOs that I've spoken to have despaired of ever ramping down their print operations.

I think that the OECD might have figured out a solution to this dilemma. It's difficult to describe how viscerally satisfying it was to receive one of these Factbook USB-sticks. From the way in which the other meeting attendees swarmed around Toby as he was handing them out, I think that they might have had the same reaction.

As we headed back to London on the Eurostar, I almost immediately popped the USB stick into my laptop and started browsing through the Factbook, much as I would have thumbed through a print version of the same (although -truth be told- I would have been tempted to conveniently "forget" the print version in order to not have to shlep it from Paris back to Oxford).

In short, I think the system works. Kudos to the OECD for a simple, inexpensive and creative experiment in e-publishing.

June 05, 2007

Resource Maps

nyc1.jpg

Last week we had a second face-to-face of the OAI-ORE (Open Archives Initiative – Object Reuse and Exchange) Technical Committee in New York, the meeting being hosted courtesy of Google. (Hence the snap here taken from the terrace of Google's canteen with its gorgeous view of midtown Manhattan. And the food's not too shabby either. ;~)

The main input to the meeting was this discussion document: Compound Information Objects: The OAI-ORE Perspective. This document we feel has now reached a level of maturity that we wanted to share with a wider audience. We invite feedback either directly at ore@openarchives.org or indirectly via yours truly.

The document attempts to describe the problem domain - that of describing a scholarly publication as an aggregation of resources on the Web - and to put that squarely into the Web architecture context. What the initiative is seeking to provide is machine descriptions of those resources and their relationships, something that we are inclined to call "resource maps", and as underpinning we are making use of the notion of "named graphs" from ongoing semantic web research. Essentially these resource maps are machine-readable descriptions of participating resources (in a scholarly object - both core resources and related resources) and the relationships between those resources, the whole set of assertions about those resources being named (i.e. having a URI as identifier) and having provenance information attached, e.g. publisher, date of publication, version information (still under discussion). It is envisaged that these compound object descriptions may be available in a variety of serializations from a published, object-specific URL (i.e. a good old-fashioned Web address) but some honest-to-goodness XML serialization is likely to be one of the candidates. No surprises here, then.
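
To make the idea a little more concrete, here is a rough sketch of a resource map as a named set of assertions with provenance attached. It is not taken from the discussion document: the `ResourceMap` class, the predicate names (`hasPart`, `relatedTo`), and every URI and provenance value below are placeholders invented for illustration, and no particular serialization is implied.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceMap:
    """A named set of (subject, predicate, object) assertions plus provenance."""
    uri: str  # the name of the graph itself
    provenance: dict[str, str] = field(default_factory=dict)
    triples: set[tuple[str, str, str]] = field(default_factory=set)

    def add(self, subject: str, predicate: str, obj: str) -> None:
        """Record one assertion about the participating resources."""
        self.triples.add((subject, predicate, obj))

    def resources(self) -> set[str]:
        """Every resource that appears as a subject or an object."""
        subjects = {s for s, _, _ in self.triples}
        objects = {o for _, _, o in self.triples}
        return subjects | objects


# All identifiers below are hypothetical examples.
article = "http://dx.doi.org/10.5555/example.article"
rmap = ResourceMap(
    uri="http://example.org/resource-maps/article-1",
    provenance={"publisher": "Example Press", "published": "2007-06-05"},
)
rmap.add(article, "hasPart", "http://example.org/article-1/fulltext.html")
rmap.add(article, "hasPart", "http://example.org/article-1/fulltext.pdf")
rmap.add(article, "relatedTo", "http://example.org/datasets/42")

for s, p, o in sorted(rmap.triples):
    print(s, p, o)
```

The point of naming the graph is that the set of assertions becomes a thing in its own right: it can be fetched from its own URL, and statements such as who published it and when attach to the description rather than to the article.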

Below is a schematic from the paper which shows the publication of a resource map (or named graph) corresponding to the compound object which logically represents a scholarly publication. For those objects of immediate interest to CrossRef these would likely be identified with DOIs, although there is no restriction in OAI-ORE on the identifier to be used, other than that it be a URI.

named_graph.png

Update: For a couple posts from some other members of the ORE TC see here (Peter Murray, OhioLINK) and here (Pete Johnston, Eduserv).

May 31, 2007

RSC's Project Prospect v1.1

We updated our Project Prospect articles today to release v1.1, with a pile of look & feel improvements to the HTML views and links. The most interesting technical addition is the launch of our enhanced RSS feeds, where we have updated our existing feeds for enhanced articles. These now include ontology terms and primary compounds both visually (as text terms and 2D images) and within the RDF - using the OBO in OWL representation and the info:inchi specification mentioned here by Tony only a few weeks ago.

The enhanced entries will soon become more common as we concentrate our enhancements on our Advance Articles, but the current example below from our Photochemical and Photobiological Sciences feed is lovely. RDF code after the jump - just as beautiful to the parents...

ProspectRSS.jpg

Continue reading "RSC's Project Prospect v1.1" »

March 23, 2007

Welcome to "Otmi-discuss"

Just a quick note to mention that we've now set up a new mailing list otmi-discuss@crossref.org for public discussion of OTMI - the Open Text Mining Interface proposed by Nature. See the list information page here for details on subscribing to the list and to access the mail archives.

And many thanks to the CrossRef folks for hosting this for us!

March 02, 2007

Open Content

In light of my earlier post on OTMI, the mail copied below from Sebastian Hammer at Index Data about open content may be of interest. They are looking to compile a listing of web sources of open content - see this page for further details.

(Via XML4lib and other lists.)

Continue reading "Open Content" »