NLM Blog Citation Guidelines
I've just returned from the Frankfurt Book Fair and noticed that there has been some recent popular interest in the recommendations on citing blogs in The NLM Style Guide for Authors, Editors and Publishers.
Which reminds me of an issue that has periodically been raised here at CrossRef: should we be doing something to provide a service for reliably citing more ephemeral content such as blogs, wikis, etc.?
Personally, I cringe when I see people include plain old URLs (POUs?) in citations. What's the point? They are almost guaranteed to fail to resolve after a few years. In citing them, you are hardly helping to preserve the scholarly record. You might as well just record the metadata associated with the content.
So why don't we simply allow individuals to assign DOIs to their content?
As Chuck Koscher says, "CrossRef DOIs are only as persistent as CrossRef staff." CrossRef depends on its ability to chase down and berate member publishers when they fail to update their DOI records. It's hard enough doing this with publishers, so just imagine what it would be like trying to chase down individuals. In short, it just wouldn't scale.
But what if we provided a different service for more informal content? Recently we have been talking with Gunther Eysenbach, the creator of the very cool WebCite service, about whether CrossRef could/should operate a citation caching service for ephemera.
As I said, I think WebCite is wonderful, but I do see a few problems with it in its current incarnation.
The first is that, the way it works now, it seems to effectively leech usage statistics away from the source of the content. If I have a blog entry that gets cited frequently, I certainly don't want all the links (and their associated Google-juice) redirected away from my blog. As long as my blog is working, I want traffic coming to my copy of the content, not some cached copy of the content (gee, the same problem publishers face, no?). I would also, ideally, like that traffic to continue to come to my blog if I move hosting providers, platforms (WordPress, Movable Type), blog conglomerates (Gawker, Weblogs, Inc.), etc.
The second issue I have with WebCite is simpler. I don't really fancy having to actually recreate and run a web-caching infrastructure when there is already a formidable one in existence.
So what if we ran a service for individuals that worked like this:
- For a fee, you can assign DOIs to your ephemeral, CC-licensed content.
- When you assign a DOI to a piece of content (or update an existing DOI), we will immediately archive said content with the Internet Archive (which, incidentally, charges for this service).
- We will direct those DOIs to your web site as long as you are both:
  a. Paying the fee
  b. Updating your URLs to point to the correct content
- If you fail in either "a" or "b", we will then redirect said DOIs to the cached version of the content on the Internet Archive (after having warned you repeatedly via automated e-mail).
(Note, as an aside, that we could in theory provide a similar dark-archive service for publishers with non-free content, using something like JSTOR as the archive.)
This approach would help to ensure that a blogger's version of content was always linked to as long as it was available. It would also preserve the "persistence" of CrossRef DOIs by making sure that we could always resolve the DOI even if we were not able to get the owner of said DOI to update it.
So back to the NLM guidelines... On the one hand, I'm delighted to see that the NLM has issued guidelines on citing blogs. It seems glaringly obvious that informal (and ephemeral) content such as blogs and wikis is increasingly becoming a vital part of the scholarly record. On the other hand, it also seems to me that recommending that somebody "cite" with a broken pointer (i.e. a URL) to content verges on tokenism. This isn't the NLM's fault: there just isn't a reliable mechanism for citing informal content in a manner that ensures you can then retrieve and look at said content in the future.
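The redirect rule proposed above can be sketched in a few lines of Python. This is only an illustration of the idea, not a description of any existing CrossRef system: the `DoiRecord` fields, the `resolve` function, the example DOI and the example URLs are all made up for the sketch, and the link check that would set `owner_url_ok` is assumed to happen elsewhere.

```python
from dataclasses import dataclass


@dataclass
class DoiRecord:
    """Hypothetical record for one DOI assigned to a piece of informal content."""
    doi: str
    owner_url: str      # where the blogger currently hosts the content
    archive_url: str    # archived copy deposited when the DOI was assigned
    fee_paid: bool      # condition "a": the owner is paying the fee
    owner_url_ok: bool  # condition "b": the owner's URL still points to the content


def resolve(record: DoiRecord) -> str:
    """Return the URL that a lookup of this DOI should redirect to."""
    # Send traffic to the owner's copy only while both conditions hold.
    if record.fee_paid and record.owner_url_ok:
        return record.owner_url
    # Otherwise fall back to the archived copy so the DOI still resolves.
    return record.archive_url


# Example with made-up values.
record = DoiRecord(
    doi="10.5555/example-post",
    owner_url="https://blog.example.org/my-post",
    archive_url="https://archive.example.org/snapshots/my-post",
    fee_paid=True,
    owner_url_ok=True,
)
print(resolve(record))
```

The point of the design is that the fallback is decided at resolution time, so the owner's copy gets the traffic for as long as it is maintained and the DOI never ends up pointing nowhere.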
And this is no longer a problem confined to the Scholarly/Professional publishing space. As Jon Udell has occasionally pointed out, citation is increasingly an important currency for *any* professional writer on the web. It seems to me that a system for reliably citing blogs and wikis would benefit many communities. I could easily see commercial hosted Blog services (Blogger, WordPress) offering a "Cached-DOI" feature as a premium service to their clients.
So what do you think? What am I missing? Is this something we should be looking at?

Comments
Well, Geoff, you did raise the question. :) So I've got to be devil's advocate and ask: What about PURL? (See http://purl.org/.) Current stats are listed as:
And as blogged earlier here
http://www.crossref.org/CrossTech/2007/07/purl_redux.html
there's a rewrite of the PURL software in the works.
Why would a blogger choose DOI over PURL? I must confess to not knowing as much about it as I should, but it seems that there's a lower barrier to entry: fees, metadata, rope.
So, why not?
Tony
Posted by: Tony Hammond | October 15, 2007 08:48 AM
You could use PURLs (or handles or XRIs or Numly Numbers or whatever new unique identifier somebody chooses to dream up in the next few years), but I'll just point out two things:
1) If they are going to be sustainable, they had better have a business model that is sustainable. I suspect this means fees, if for no other reason than they'd need to pay ArchiveIt for their instant archive service.
2) DOI is already heavily used in the scholarly/professional space. Of course, the same does not apply to commercial content, so it is harder to make the case for DOI there.
Posted by: Geoffrey Bilder | October 15, 2007 11:11 AM
Your first point is well taken and we are working to resolve this (we will be showing the original version first if it is identical with the cached version). Most publishers who are using WebCite are citing both the original URL and the WebCite URL, so that there is little diversion from the original content anyway.
I do not understand your second point, where you say "I don't really fancy having to actually recreate and run a web-caching infrastructure when there is already a formidable one in existence" (and you link to archive.org).
You do not have to create any infrastructure, as the WebCite infrastructure already exists. It should also be said that WebCite is actually older than archive.org (you can look this up, I mentioned it already in an article I published in 1998), and was created explicitly for citing / caching ephemeral (but scholarly important) material on the web.
Archive.org is crawler-based, while WebCite.org allows on-demand archiving for scholarly relevant work by the citing author.
To your last point: in principle we could also automatically assign a DOI to all content we cache, but have refrained from doing so (so far), as a DOI is supposed to be unique, and we don't know if the webpage/blog/webdocument we are archiving already has a DOI assigned. If somebody has any thoughts on this (is this a problem? should we assign DOIs to cached copies?), please email me.
G. Eysenbach
WebCite initiator (www.webcitation.org)
Posted by: Gunther Eysenbach | November 21, 2007 05:06 PM
First, I should make sure to emphasize that my above comments were strictly meant to elicit feedback on how the service might evolve (and tie into the DOI) if CrossRef were to run it.
But on to your points...
First- Cliff Lynch quite rightly set me straight on terminology and pointed out that, when talking about this, I should be using the word "archive" instead of "cache."
As for the archiving infrastructure, I suspect that publishers and (particularly) librarians would be happier if a service like WebCite used existing standards and infrastructure for archiving. Ideally, they'd like multiple, redundant infrastructures. So while I understand that WebCite can support multiple, library-run dark mirrors of the service, it would seem a natural extension to have it support things like IA, Portico or CLOCKSS. Which of these is preferable? I'm afraid I don't know. I have just been using them as examples.
While WebCite is older than IA, the latter has a user base that is orders of magnitude larger. Consequently their infrastructure is formidable (http://www.archive.org/web/hardware.php) and the number of parties who have an interest in making sure it continues to exist is large. Finally, IA also supports on-demand archiving (as opposed to crawler-based) through its commercial Archive-It service.
The integration of DOIs into the system is an interesting problem and probably requires much more detailed thought- but in the scenario you paint, there is no reason why a hash of the document can't be registered along with the DOI's metadata so that we could essentially detect when somebody is trying to assign a DOI to content that already has it. Clearly this would mean that even a slightly changed page might get a new DOI assigned to it (and get the content re-archived), but this is a versioning issue and might actually be desirable behavior in the case of reliably citing ever-changing ephemera.
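A minimal sketch of that hash check, in Python, might look like the following. It assumes SHA-256 as the hash and a simple in-memory lookup table; the `DoiRegistry` class, its `register` method and the example DOIs are hypothetical names used only for illustration.

```python
import hashlib


def content_hash(content: bytes) -> str:
    """Fingerprint the archived bytes (SHA-256 is an assumption of this sketch)."""
    return hashlib.sha256(content).hexdigest()


class DoiRegistry:
    """Hypothetical registry that stores a content hash alongside each DOI."""

    def __init__(self) -> None:
        self._doi_by_hash: dict[str, str] = {}

    def register(self, proposed_doi: str, content: bytes) -> str:
        """Return the DOI that identifies this content.

        If identical content was registered before, the existing DOI is
        returned and the proposed one is not used. Otherwise the proposed
        DOI is recorded against the content's hash.
        """
        digest = content_hash(content)
        return self._doi_by_hash.setdefault(digest, proposed_doi)


registry = DoiRegistry()
first = registry.register("10.5555/post-v1", b"<p>Original post</p>")
again = registry.register("10.5555/duplicate", b"<p>Original post</p>")
edited = registry.register("10.5555/post-v2", b"<p>Original post, edited</p>")
print(first, again, edited)
```

Because the comparison is on exact bytes, any change to the page, however small, produces a different hash and therefore a new DOI, which is the versioning behavior described above.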
Posted by: Geoffrey Bilder | November 22, 2007 05:38 AM
WebCite has no interest in competing with the Internet Archive. In fact, development of WebCite (created in 1998) stopped when Google and IA emerged, as everybody thought they would solve our problems in academic publishing. But they haven't. After a few studies came out discussing this problem (http://www.sciencemag.org/cgi/content/short/302/5646/787), referring to it as an issue "calling for an immediate response" by publishers and authors, development of the system was revitalized, as a system with specific features for the scholarly publishing world.
I see WebCite as complementary to generic archives such as IA. As I said elsewhere, WebCite will be depositing archived content in IA.
Thus, WebCite really is a publisher-run front-end to other archives, with the capacity to offer services specifically designed for scholarly authors and publishers who wish to cite unstable web material. DOIs, interfaces for publishers to upload NLM-XML tagged manuscripts for us to comb through the references, webcitation impact factors, etc. are all examples of this.
As to the statement that "IA has a user base that is orders of magnitude larger", it depends on which users you are looking at. How many scholarly references do you see where citing authors link to the archived version in the Internet Archive, or have even paid for the Archive-It service to create a real-time snapshot of the webpage/webdocument they cite, compared to scholarly users and publishers using WebCite? IA is still primarily a crawler-based general purpose archive, while WebCite is a specialized service for scholarly authors, which will always be free for the citing author, as opposed to the archive-on-demand offering of IA, which is really more intended for large organizations than for individual researchers who wish to preserve a cited resource.
We are currently using exactly this method to determine whether we have to store another physical copy of something that has been archived before, or whether we just point to an already existing archived copy. As I said, we could assign DOIs immediately. Still, things like dynamic advertisements, dynamic displays of date and time on webpages, etc. make this an extremely complex problem.
Posted by: Gunther Eysenbach | November 24, 2007 12:46 PM