Coming to you Live from Wikipedia

We’ve been collecting citation events from Wikipedia for some time. We’re now pleased to announce a live stream of citations, as they happen, when they happen. Project this on your wall and watch live DOI citations as people edit Wikipedia, round the world.

View live stream »

In the hours since this feature launched, there have been events from the Indonesian, Portuguese, Ukrainian, Serbian and English Wikipedias (in that order).

Live event stream

The usual weasel words apply. This is a labs project and so may not be 100% stable. If you experience any problems please email labs@crossref.org.

January 2015 DOI Outage: Followup Report

Background

On January 20th, 2015 the main DOI HTTP proxy at doi.org experienced a partial, rolling global outage. The system was never completely down, but for at least part of the subsequent 48 hours, up to 50% of DOI resolution traffic was effectively broken. This was true for almost all DOI registration agencies, including CrossRef, DataCite and mEDRA.

At the time we kept people updated on what we knew via Twitter, mailing lists and our technical blog at CrossTech. We also promised that, once we’d done a thorough investigation, we’d report back. Well, we haven’t finished investigating all implications of the outage. There are both substantial technical and governance issues to investigate. But last week we provided a preliminary report to the CrossRef board on the basic technical issues, and we thought we’d share that publicly now.

The Gory Details

First, the outage of January 20th was not caused by a software or hardware failure, but was instead due to an administrative error at the Corporation for National Research Initiatives (CNRI). The domain name “doi.org” is managed by CNRI on behalf of the International DOI Foundation (IDF). The domain name was not on “auto-renew” and CNRI staff simply forgot to manually renew the domain. Once the domain name was renewed, it took about 48 hours for the fix to propagate through the DNS system and for the DOI resolution service to return to normal. Working with CNRI we analysed traffic through the Handle HTTP proxy and here’s the graph:

Chart of Handle HTTP proxy traffic during outage

The above graph shows traffic over a 24 hour period on each day from January 12, 2015 through February 10th, 2015. The heavy blue line for January 20th and the heavy red line for January 21st show how referrals declined as the doi.org domain was first deleted, and then added back to DNS.

It could have been much worse. The domain registrar (GoDaddy) at least had a “renewal grace and registry redemption period” which meant that even though CNRI forgot to pay its bill to renew the domain, the domain was simply “parked” and could easily be renewed by them. This is the standard setting for GoDaddy. Cheaper domain registrars might not include this kind of protection by default. Had there been no grace period, then it would have been possible for somebody other than CNRI to quickly buy the domain name as soon as it expired. There are many automated processes which search for and register recently expired domain names. Had this happened, at the very least it would have been expensive for CNRI to buy the domain back. The interruption to DOI resolutions during this period would have also been almost complete.

So we got off relatively easy. The domain name is now on auto-renew. The outage was not as bad as it could have been. It was addressed quickly and we can be reasonably confident that the same administrative error will not happen again. CrossRef even managed to garner some public praise for the way in which we handled the outage. It is tempting to heave a sigh of relief and move on.

We also know that everybody involved at CNRI, the IDF and CrossRef has felt truly dreadful about what happened. So it is also tempting to not re-open old wounds.

But it would be a mistake if we did not examine a fundamental strategic issue that this partial outage has raised: How can CrossRef claim that its DOIs are ‘persistent’ if CrossRef does not control some of the key infrastructure on which it depends? What can we do to address these dependencies?

What do we mean by “persistent?”

@kaythaney tweets on definition of “persistent”

To start with, we should probably explore what we mean by ‘persistent’. We use the word “persistent” or “persistence” about 470 times on the CrossRef web site. The word “persistent” appears central to our image of ourselves and of the services that we provide. We describe our core, mandatory service as the “CrossRef Persistent Citation Infrastructure.”

The primary sense of the word “persistent” in the New Oxford American Dictionary is:

Continuing firmly or obstinately in a course of action in spite of difficulty or opposition.

We play on this sense of the word as a synonym for “stubborn” when we half-jokingly say that, “CrossRef DOIs are as persistent as CrossRef staff.” Underlying this joke is a truth, which is that persistence is primarily a social issue, not a technical issue.

Yet presumably we once chose to use the word “persistent” instead of “perpetual” or “permanent” for other reasons. “Persistence” implies longevity, without committing to “forever.” Scholarly publishers, perhaps more than most industries, understand the long term. After all, the scholarly record dates back to at least 1665 and we know that the scholarly community values even our oldest journal backfiles. By using the word “persistent” as opposed to the more emphatic “permanent” we are essentially acknowledging that we, as an industry, understand the complexity and expense of stewarding the content for even a few hundred years to say nothing of “forever.” Only the chronologically naïve would recklessly coin terms like “permalink” for standard HTTP links which have a documented half-life of well under a decade.

So “persistent” implies longevity, without committing to forever, but this still raises questions. What time span is long enough to qualify as “persistent”? And what, in particular, do we mean by “persistent” when we talk about CrossRef’s “Persistent Citation Infrastructure” or describe CrossRef DOIs as “persistent identifiers”?

What do we mean by “persistent identifiers?”

@violetailik tweets on outage and implication for term “persistent identifier”

First, we often make the mistake of talking about “persistent identifiers” as if there is some technical magic that makes them continue working when things like HTTP URIs break. The very term “persistent identifier” encourages this kind of magical thinking and, ideally, we would instead talk about “persist-able” identifiers. That is, those that have some form of indirection built into them. There are many technologies that do this- Handles, DOIs, PURLs, ARKs and every URL shortener in existence. Each of them simply introduces a pointer mapping between an identifier and the location where a resource or content resides. This mapping can be updated when the content moves, thus preserving the link. Of course, just because an identifier is persist-able doesn’t mean it is persistent. If PURLs or DOIs are not updated when content moves, then they are no more persistent than normal URLs.
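In case it helps to see just how little magic is involved, here is a toy sketch of that indirection in Python (purely illustrative, and not how any real identifier system is implemented):

# Purely illustrative: a persist-able identifier is just an updatable mapping.
# No real system (Handles, PURLs, ARKs) is implemented this way.
mapping = {
    "10.5555/12345678": "http://old-publisher.example.org/article/12345678",
}

def resolve(identifier):
    """Return the current location for an identifier, if we know it."""
    return mapping.get(identifier)

# The content moves: the identifier stays stable as long as someone updates the mapping.
mapping["10.5555/12345678"] = "http://new-publisher.example.org/articles/12345678"
print(resolve("10.5555/12345678"))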

Andrew Treloar points out that when we talk about “persistent identifiers,” we tend to conflate several things:

  1. The persistence of the identifier- that is the token or string itself.
  2. The persistence of the thing being pointed at by the identifier. For example, the content.
  3. The persistence of the mapping of the identifier to the thing being identified.
  4. The persistence of the resolver that allows one to follow the mapping of the identifier to the thing being identified.
  5. The persistence of a mechanism for updating the mapping of the identifier to the thing being identified.

If any of the above fails, then “persistence” fails. This is probably why we tend to conflate them in the first place.

Each of these aspects of “persistence” is worthy of much closer scrutiny. However, in the most recent case of the January outage of “doi.org,” the problem specifically occurred with the fourth item, the persistence of the resolver. When CNRI failed to renew the domain name for “doi.org” on time, the DOI resolver was rendered unavailable to a large percentage of people over a period of about 48 hours as global DNS servers first removed, and then added back, the “doi.org” domain.

Turtles all the way down*

The initial public reaction to the outage was almost unanimous in one respect: people assumed that the problem originated with CrossRef.

@iainh_z tweets to CrossRef enquiring about failed DOI resolution

@LibSkrat tweets at CrossRef about DOI outage

This is both surprising and unsurprising. It is surprising because we have fairly recent data indicating that lots of people recognise the DOI brand, but not the CrossRef brand. Chances are that this relatively superficial “brand” awareness does not correlate with an understanding of how the system works or how it relates to persistence. It is likely that plenty of people clicked on DOIs at the time of the outage and, when they didn’t work, simply shrugged or cursed under their breath. They were aware of the term “DOI” but not of the promise of “persistence”. Hence, they did not take to Twitter to complain about it, and if they had wanted to, they probably wouldn’t have known whom to complain to or even how to reach them (neither CNRI nor the IDF has a Twitter account).

But the focus on CrossRef is also unsurprising. CrossRef is by far the largest and most visible DOI Registration Agency. Many otherwise knowledgeable people in the industry simply don’t even know that there are other RAs.

They also generally didn’t know of the strategic dependencies that exist in the CrossRef system. By “strategic dependencies” we are not talking about the vendors, equipment and services that virtually every online enterprise depends on. These kinds of services are largely fungible. Their failures may be inconvenient and even dramatic, but they are rarely existential.

Instead we are talking about dependencies that underpin CrossRef’s ability to deliver on its mission. Dependencies that not only affect CrossRef’s operations, but also its ability to self-govern and meet the needs of its membership. In this case there are three major dependencies: two specific to CrossRef and the other DOI registration agencies, and one shared by virtually all online enterprises today. The organizations are the International DOI Foundation (IDF), the Corporation for National Research Initiatives (CNRI) and the Internet Corporation for Assigned Names and Numbers (ICANN).

Dependency of RAs on IDF, CNRI and ICANN. Turtles all the way down.

Each of these agencies has technology, governance and policy impacts on CrossRef and the other DOI registration agencies, but here we will focus on the technological dependencies.

At the top of the diagram are a subset of the various DOI Registration Agencies. Each RA uses the DOI for a particular constituency (e.g. scholarly publishers) and application (e.g. citation). Sometimes these constituencies/applications overlap (as with mEDRA, CrossRef and DataCite), but sometimes they are orthogonal to the other RAs, as is the case with EIDR. All, however, are members of the IDF.

The IDF sets technical policies and development agendas for the DOI infrastructure. This includes recommendations about how RAs should display and link DOIs. Of course all of these decisions have an impact on the RAs. However, the IDF provides little technical infrastructure of its own, as it has no full-time staff. Instead it outsources the operation of the system to CNRI; this includes the management of the doi.org domain, which the IDF owns.

The actual DOI infrastructure is hosted on a platform called the Handle System, which was developed and is currently run by CNRI. The Handle System is part of a quite complex and sophisticated platform for managing digital objects that was originally developed for DARPA. A subset of the Handle system is designated for use by DOIs and is identified by the “10” prefix (e.g. 10.5555/12345678). The Handle system itself is not based on HTTP (the web protocol). Indeed, one of the much-touted features of the Handle System is that it isn’t based on any specific resolution technology. This was seen as a great virtue in the late 1990s, when the DOI system was developed and the internet had just witnessed an explosion of seemingly transient, competing protocols (e.g. Gopher, WAIS, Archie, HyperWave/Hyper-G, HTTP, etc.). But what looked like a wild west of protocols quickly settled into an HTTP hegemony. In practice, virtually all DOI interactions with the Handle system are via HTTP and so, in order to interact with the web, the Handle System employs a “Handle proxy” which translates back and forth between HTTP and the native Handle system. This all may sound complicated, and the backend of the Handle system is really very sophisticated, but it turns out that the DOI really uses only a fraction of the Handle system’s features. In fact, the vast majority of DOI interactions merely use the Handle system as a giant lookup table which allows one to translate an identifier into a web location. For example, it will take a DOI Handle like this:

10.5555/12345678

and redirect it to (as of this writing) the following URL:

http://psychoceramics.labs.crossref.org/10.5555-12345678.html

This whole transformation is normally never seen by a user. It is handled transparently by the web browser, which does the lookup and redirection in the background using HTTP and talking to the Handle Proxy. In the late 1990s, even doing this simple translation quickly, at scale, with a robust distributed infrastructure was not easy. These days, however, we see dozens if not hundreds of “URL shorteners” doing exactly the same thing at far greater scale than the Handle System.
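You can watch this lookup happen yourself. Here is a minimal sketch (assuming Python and the requests library) that asks the Handle proxy where a DOI currently points, without following the redirect:

import requests

# Ask the Handle HTTP proxy to resolve a DOI, but don't follow the redirect,
# so we can see the target URL currently held in the lookup table.
doi = "10.5555/12345678"
resp = requests.get("http://dx.doi.org/" + doi, allow_redirects=False)
print(resp.status_code)              # typically a 302/303 redirect
print(resp.headers.get("Location"))  # the current landing-page URL for this DOI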

It may seem a shame that more of the Handle System’s features are not used, but the truth is that the much-touted platform independence of the Handle System rapidly became more of a liability and impediment to persistence than an aid. To be blunt, if in X years a new technology comes out that supersedes the web, what do we think the societal priority is going to be?

  • To provide a robust and transparent transition from the squillions of existing HTTP URI identifiers that the entire world depends on?
  • To provide a robust and transparent transition from the tiny subset of Handle-based identifiers that are used by about a hundred million specialist resources?

Quite simply, the more the Handle/DOI systems diverge from common web protocols and practice, then the more we will jeopardise the longevity of our so-called persistent identifiers.

So, in the end, DOI registration agencies really only use the Handle system for translating web addresses. All of the other services and features one might associate with DOIs (reference resolution, metadata lookup, content negotiation, OAI-PMH, REST APIs, CrossMark, CrossCheck, TDM Services, FundRef etc) are all provided at the RA level.

But this address resolution is still critical. And it is exactly what failed for many users on January 20th 2015. And to be clear, it wasn’t the robust and scalable Handle System that failed. It wasn’t the Handle Proxy that failed. And it certainly wasn’t any RA-controlled technology that failed. These systems were all up and running. What happened was that the standard handle proxy that the IDF recommends RAs use, “dx.doi.org”, was effectively rendered invisible to wide portions of the internet because the “doi.org” domain was not renewed. This underscores two important points.

The first is that it doesn’t much matter what precisely caused the outage. In this case it was an administrative error. But the effect would have been similar if the Handle proxies had failed or if the Handle system itself had somehow collapsed. In the end, CrossRef and all DOI registration agencies are existentially dependent on the Handle system running and being accessible.

The second is that the entire chain of dependencies from the RAs down through CNRI is also dependent on the DNS system which, in turn, is governed by ICANN. We should really not be making too much of the purported technology independence of the DOI and Handle systems. To be fair, this limitation is inherent to all persistent identifier schemes that aim to work with the web. It really is “turtles all the way down.”

What didn’t fail on January 19th/20th and why?

You may have noticed a lot of hedging in our description of the outage of January 19th/20th. For one thing, we use the term “rolling outage.” Access to the Handle Proxy via “dx.doi.org” was never completely unavailable during the period. As we’ve explained, this is because the error was discovered very quickly and the domain was renewed hours after it expired. The nature of DNS propagation meant that even as some DNS servers were deleting the “doi.org” entry, others were adding it back to their tables. In some ways this was really confusing because it meant it was difficult to predict where the system was working and where it wasn’t. Ultimately it all stabilised after the standard 48-hour DNS propagation cycle.
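If you are curious, one crude way to watch this sort of propagation is to ask several public resolvers the same question and compare the answers. A rough sketch (assuming the third-party dnspython package; the resolver addresses are just examples):

import dns.resolver  # third-party package: dnspython

# Ask a few public resolvers for doi.org's A record. During a propagation
# window, different resolvers can give different answers (or none at all).
public_resolvers = {"Google": "8.8.8.8", "OpenDNS": "208.67.222.222"}

for name, address in public_resolvers.items():
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [address]
    try:
        answer = resolver.resolve("doi.org", "A")  # older dnspython: resolver.query(...)
        print(name, [record.to_text() for record in answer])
    except Exception as error:
        print(name, "lookup failed:", error)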

But there were also some Handle-based services that simply were not affected at all by the outage. During the outage, a few people asked us if there was an alternative way to resolve DOIs. The answer was “yes,” there were several. It turns out that “doi.org” is not the only DNS name that points to the Handle Proxy. People could easily substitute “dx.doi.org” with “dx.crossref.org” or “dx.medra.org” or “hdl.handle.net” and “resolve” any DOI. Many of CrossRef’s internal services use these alternative names and so those services continued to work. This is partly why we only discovered that “doi.org” was down when people reported it on Twitter.
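For the record, here is a rough sketch of that workaround (again assuming Python and requests): try the usual hostname first, then fall back to the alternative names that front the same Handle Proxy.

import requests

# Hostnames that all front the same Handle Proxy.
PROXIES = ["http://dx.doi.org/", "http://dx.crossref.org/", "http://hdl.handle.net/"]

def resolve_doi(doi):
    """Return the landing-page URL for a DOI, trying each proxy hostname in turn."""
    for base in PROXIES:
        try:
            resp = requests.get(base + doi, allow_redirects=False, timeout=10)
            if resp.status_code in (301, 302, 303) and "Location" in resp.headers:
                return resp.headers["Location"]
        except requests.RequestException:
            continue  # e.g. a DNS failure for this hostname; try the next one
    return None

print(resolve_doi("10.5555/12345678"))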

And, of course, there were other services that were not affected by the outage. CrossMark, the REST API, and CrossRef Metadata Search all continued to work during the outage.

Protecting ourselves

So what can we do to reduce our dependencies and/or the risks intrinsic to those dependencies?

Obviously, the simplest way to have avoided the outage would have been to ensure that the “doi.org” domain was set to automatically renew. That’s been done. Is there anything else we should do? A few ideas have been floated that might allow us to provide even more resilience. They range greatly in complexity and involvement.

  1. Provide well-publicised public status dashboards that show what systems are up and which clearly map dependencies so that people could, for instance, see that the doi.org server was not visible to systems that depended on it. Of course, if such a dashboard had been hosted at doi.org, nobody would have been able to connect to it. Stoopid turtles.
  2. Encourage DOI RAs to have their members point to Handle proxies using domain names under the RA’s control. Simply put, if CrossRef members had been using “dx.crossref.org” instead of “dx.doi.org”, then CrossRef DOIs would have continued to work throughout the outage of “doi.org”. The same goes for mEDRA and the other RAs. This way each RA would have control over another critical piece of its infrastructure. It would also mean that if any single RA made a similar domain name renewal mistake, the impact would be isolated to a particular constituency. Finally, using RA-specific domains for resolving DOIs might also make it clear that different DOIs are managed by different RAs and might have different services associated with them. Perhaps CrossRef would spend less time supporting non-CrossRef DOIs?
  3. Provide a parallel, backup resolution technology that could be pointed to in the event of a catastrophic Handle System failure. For example we could run a parallel system based on PURLs, ARKs or another persist-able identifier infrastructure.
  4. Explore working with ICANN to get the handle resolvers moved under the special “.arpa” top level domain (TLD). This TLD (RFC 3172) is reserved for services that are considered to be “critical to the operation of the internet.” This is an option that was first discussed at a meeting of persistent identifier providers in 2011.

These are all tactical approaches to addressing the specific technical problem of the Handle System becoming unavailable, but they do not address deeper issues relating to our strategic dependence on several third parties. Even though the IDF and CNRI provide us with pretty simple and limited functionality, that functionality is critical to our operations and our claim to be providing persistent identifiers. Yet these technologies are not in our direct control. We had to scramble to get hold of people to fix the problem. For a while, we were not able to tell our users or members what was happening because we did not know ourselves.

The irony is that CrossRef was held to account, and we were in the firing line the entire time. Again, this was almost unavoidable. In addition to being the largest DOI RA, we are also the only RA that has any significant social media presence and support resources. Still, it meant that we were the public face of the outage while the IDF and CNRI remained in the background.

And this is partly why our board has encouraged us to investigate another option:

  5. Explore what it would take to remove CrossRef dependencies on the IDF and CNRI.

CrossRef is just part of a chain of dependencies that goes from our publisher members down through the IDF, CNRI and, ultimately, ICANN. Our claim to providing a persistent identifier infrastructure depends entirely on the IDF and CNRI. Here we have explored some of the technical dependencies. But there are also complex governance and policy implications of these dependencies. Each organization has membership rules, guidelines and governance structures which can impact CrossRef members. Indeed, the IDF and CNRI are themselves members of groups (ISO and DONA, respectively) which might ultimately have policy or governance impact for DOI registration agencies. We will need to understand the strategic implications of these non-technical dependencies as well.

Note that the CrossRef board has merely asked us to “explore” what it would take to remove dependencies. They have not asked us to actually take any action. CrossRef has been massively supportive of the IDF and CNRI, and they have been massively supportive of us. Still, over the years we have all grown and our respective circumstances have changed. It is important that occasionally we question what we might have once considered to be axioms. As we discussed above, we use the term “persistent” which, in turn, is a synonym for “stubborn.” At the very least we need to document the inter-dependencies that we have so that we can understand just how stubborn we can reasonably expect our identifiers to be.

The outage of January 20th was a humbling experience. But in a way we were lucky: Forgetting to renew the domain name was a silly and prosaic way to partially bring down a persistent identifier infrastructure, but it was also relatively easy to fix. Inevitably, there was a little snark and some pointed barbs directed at us during the outage, but we were truly overwhelmed by the support and constructive criticism we received as well. We have also been left with a clear message that, in order for this good-will to continue, we need to follow-up with a public, detailed and candid analysis of our infrastructure and its dependencies. Consider this to be the first section of a multi-part report.

@kevingashley tweets asking for followup analysis

@WilliamKilbride tweets asking for followup and lessons learned

Image Credits

Turtle image CC-BY “Unrecognised MJ” from the Noun Project

Real-time Stream of DOIs being cited in Wikipedia

TL;DR

Watch a real-time stream of DOIs being cited (and “un-cited”!) in Wikipedia articles across the world: http://goo.gl/0AknMJ

Background

For years we’ve known that the Wikipedia was a major referrer of CrossRef DOIs, and about a year ago we confirmed that, in fact, the Wikipedia is the 8th largest referrer of CrossRef DOIs. We know that people follow the DOIs, too. This despite only a fraction of Wikipedia citations to the scholarly literature actually using DOIs. So back in August we decided to create a Wikimedia Ambassador programme. The goal of the programme was to promote the use of persistent identifiers in citation and attribution in Wikipedia articles. We would do this through outreach and through the development of better citation-related tools.

Remember when we originally wrote about our experiments with the PLOS ALM code and how that has transitioned into the DOI Event Tracking Pilot? In those posts we mentioned that one of the hurdles in gathering information about DOI events is the actual process of polling third party APIs for activity related to millions of DOIs. Most parties simply wouldn’t be willing to handle the load of 100K API calls an hour. Besides, polling is a tremendously inefficient process: only a fraction of DOIs are ever going to generate events, but we’d have to poll for each of them, repeatedly, forever, to get an accurate picture of DOI activity. We needed a better way. We needed to see if we could reverse this process and convince some parties to instead “push” us information whenever they saw DOI-related events (e.g. citations, downloads, shares, etc). If only we could convince somebody to try this…

Wikipedia DOI Events

In December 2014 we took the opportunity of the 2014 PLOS/CrossRef ALM Workshop in San Francisco to meet with Max Klein and Anthony Di Franco, where we kicked off a very exciting project.

There’s always someone editing a Wikipedia somewhere in the world. In fact, you can see a dizzying live stream of edits. We thought that, given how many DOIs there are in Wikipedia, that live stream might contain some diamonds (DOIs are made of diamond, that’s how they can be persistent). Max and Anthony went away and came back with a demo that contains a surprising amount of DOI activity.

That demo is evolving into a concrete service, called Cocytus. It is running at Wikimedia Labs monitoring live edits as you read this.
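To give a flavour of what such a monitor does, here is a simplified sketch in Python. To be clear, this is not Cocytus’s actual code: it polls the MediaWiki recent-changes API rather than listening to the live stream, and the DOI pattern is deliberately crude.

import re
import requests

API = "https://en.wikipedia.org/w/api.php"
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")  # deliberately simplified

# Grab the most recent edits to article pages.
changes = requests.get(API, params={
    "action": "query", "list": "recentchanges", "rcnamespace": 0,
    "rcprop": "title|ids", "rclimit": 25, "format": "json",
}).json()["query"]["recentchanges"]

for change in changes:
    if not change.get("revid") or not change.get("old_revid"):
        continue  # e.g. a brand-new page, so there is nothing to diff against
    # Diff the old and new revisions and look for DOIs in the changed text.
    diff = requests.get(API, params={
        "action": "compare", "fromrev": change["old_revid"],
        "torev": change["revid"], "format": "json",
    }).json().get("compare", {}).get("*", "")
    for doi in DOI_PATTERN.findall(diff):
        print(change["title"], "->", doi)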

For now we’re feeding that data into the DOI Events Collection app (which is an off-shoot of the Chronograph project). We are in the process of modifying the Lagotto code so that we can instead push those events into the DOI Event Tracking Instance.

The first DOI event we noticed was delightfully prosaic: The DOI for “The polymath project” is cited by the Wikipedia page for “Polymath Project”. Prosaic perhaps, but the authors of that paper probably want to know. Maybe they can help edit the page.

Or how about this: someone wrote a paper about why people edit Wikipedia, and then it was cited by Wikipedia. And then the citation was removed. The plot thickens…

We’re interested in seeing how DOIs are used outside of the formal scholarly literature. What does that mean? We don’t fully know, that’s the point. We have retractions in scholarly literature (and our CrossMark metadata and service allow publishers to record that), but it’s a bit different on Wikipedia. Edit wars are fought over … well you can see for yourself.

Citations can slip in and out of articles. We saw the DOI 10.1001/archpediatrics.2011.832 deleted from “Bipolar disorder in children”. If we’d not been monitoring the live feed (we had considered analysing snapshots of the Wikipedia in bulk) we might never have seen that. This is part of what non-traditional citations means, and it wasn’t obvious until we’d seen it.

You can see this activity on the Chronograph’s stream. Or check your favourite DOI. Please be aware that we’re only collecting newly added citations as of today. We do intend to go back and back-fill, but that may take some time, as it *cough* requires polling again.

Some Technical Things

A few interesting things that happened as a result of all this:

Secure URLs

SSL and HTTPS were invented so you could do things like banking on the web without fear of interception or tampering. As the web becomes a more important part of life, many sites are upgrading from HTTP to HTTPS, the secure version. This is not only because your confidential details may be tampered with, but because certain governments might not like you reading certain materials.

Because of this, Wikipedia decided last year to embark on an upgrade to HTTPS, and they are a certain way along the path. The IDF, which is responsible for running the DOI system, upgraded to HTTPS this summer, although most DOIs are still referred to by HTTP.

We met with Dario Taraborelli at the ALM workshop and discussed the DOI referral data that is fed into the Chronograph. We put two and two together and realised that Wikipedia was linking to DOIs (which are mostly HTTP) from pages which might be served over HTTPS. New policies in HTML5 specify that referrer URL headers shouldn’t be sent from HTTPS to HTTP (in case there is something secret in them). The upshot of this is that if someone is browsing Wikipedia via HTTPS and clicks on a normal DOI, we won’t know that the user came from Wikipedia. Not a huge problem today, but as Wikipedia switches over to being entirely secure, we’re going to miss out on very useful information.
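To make the failure mode concrete: referral counting ultimately comes down to reading the Referer header when someone arrives at the resolver or landing page, and for an HTTPS-to-HTTP navigation that header is simply never sent. A toy sketch (assuming Flask; this is not how our real log analysis works):

from flask import Flask, request

app = Flask(__name__)

@app.route("/10.5555/<path:suffix>")
def landing_page(suffix):
    # For clicks from an HTTPS Wikipedia page to a plain-HTTP DOI, this header
    # is absent, so the visit cannot be attributed to Wikipedia at all.
    referrer = request.headers.get("Referer", "<no referrer sent>")
    print("DOI 10.5555/%s referred by %s" % (suffix, referrer))
    return "landing page placeholder"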

Fortunately, the HTML5 specification includes a way to fix this (without leaking sensitive information). We discussed this with Dario, who did some research and came up with a suggestion, which is now being discussed by the community. It’s fascinating to watch a democratic process like this take place and to take part in it.

We’re waiting to see how the discussion turns out, and hope that it all works out so we can continue to report on how amazing Wikipedia is at sending people to scholarly literature.

How shall I cite thee?

Another discussion grew out of that process, and we started talking to a Wikipedian called Nemo (note to Latin scholars: we weren’t just talking to ourselves). Nemo (real name Federico Leva) had a few suggestions of his own. Another way to solve the referrer problem is by using HTTPS URLs (HTML5 allows browsers to send the referrer domain when going from HTTPS to HTTPS).

This means going back to all the articles that use DOIs and changing them from HTTP to HTTPS. Not as simple as it sounds, and it doesn’t sound simple. So we started looking into how DOIs are cited on Wikipedia.

After some research we found that there are more ways to cite DOIs than we expected.

First, there’s the URL. You can see it in action in this article. URLs can take various forms.

  • http://dx.doi.org/10.5555/12345678
  • http://doi.org/10.5555/12345678
  • https://dx.doi.org/10.5555/12345678
  • https://doi.org/10.5555/12345678
  • http://doi.org/hvx
  • https://doi.org/hvx

Second there’s the official template tag, seen in action here:

<ref name="SCI-20140731">{{cite journal |title=Sustained miniaturization and anatomical innovation in the dinosaurian ancestors of birds |url=http://www.sciencemag.org/content/345/6196/562 |date=1 August 2014 |journal=[[Science (journal)|Science]] |volume=345 |issue=6196 |pages=562–566 |doi=10.1126/science.1252243 |accessdate=2 August 2014 |last1=Lee |first1=Michael S. Y. |first2=Andrea|last2=Cau |first3=Darren|last3=Naish|first4=Gareth J.|last4=Dyke}}</ref>

There’s a DOI in there somewhere. This is the best way to cite DOIs: firstly because it’s actually a proper traditional citation and there’s nothing magic about DOIs, and secondly because it’s a template tag and can be re-rendered to look slightly different if needed.

Third there’s the old official DOI template tag that’s now discouraged:

<ref name="Example2006">{{Cite doi|10.1146/annurev.earth.33.092203.122621}}</ref> 

And then there’s another one.

{{doi|10.5555/123456789}}

Knowing all this helps us find DOIs. But if we want to convert DOI links in Wikipedia to use HTTPS, it means that there are more template tags to modify and more pages to re-render.
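For illustration, here is a rough sketch of Python patterns that would catch the citation forms listed above (deliberately simplified; real wikitext parsing has many more edge cases):

import re

# Simplified patterns for the DOI citation forms listed above.
PATTERNS = [
    re.compile(r"https?://(?:dx\.)?doi\.org/(\S+)"),           # bare URLs, with or without dx.
    re.compile(r"\|\s*doi\s*=\s*([^|}\s]+)"),                   # the doi= field in {{cite journal}}
    re.compile(r"\{\{\s*[Cc]ite doi\s*\|\s*([^|}]+)\s*\}\}"),   # the discouraged {{Cite doi|...}} tag
    re.compile(r"\{\{\s*doi\s*\|\s*([^|}]+)\s*\}\}"),           # the bare {{doi|...}} tag
]

def find_dois(wikitext):
    """Return every DOI-ish string cited in a chunk of wikitext."""
    found = []
    for pattern in PATTERNS:
        found.extend(pattern.findall(wikitext))
    return found

sample = "{{cite journal |doi=10.1126/science.1252243 }} and {{doi|10.5555/123456789}}"
print(find_dois(sample))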

Nemo also put DOIs on the Interwiki Map which should make automatically changing some of the URLs a lot easier.

We’re very grateful to Nemo for his suggestions and work on this. We’ll report back!

The elephant in the room

Those of you who know how DOIs work will have spotted an unsecured elephant in the room. When you visit a DOI, you visit the URL, which hits the DOI resolver proxy server, which returns a message to your browser to redirect to the landing page on the publisher’s site.

Securely talking to the DOI resolver by using HTTPS instead of HTTP means that no-one can eavesdrop and see which DOI you are visiting, or tamper with the result and send you off to a different page. But the page you are sent to will be, in nearly all cases, still HTTP. Upgrading infrastructure isn’t trivial, and, with over 4000 members (mostly publishers), most CrossRef DOIs will still redirect to standard HTTP pages for the foreseeable future.

You can keep as secure as possible by using HTTPS Everywhere.

Fin

There’s lots going on, so watch this space to see developments. Thanks for reading this, and for following all the links. We’d love to know what you think.

Bootnote

Not long after this blog post was published we saw something very interesting.

Interesting DOI

That’s no DOI. We like interesting things, but they can panic us. This turned out to be a great example of why this kind of thing can be useful. A minute’s digging and we found the article edit:

Wikipedia typo

It turns out that this was a typo: someone put a title when they should have put in a DOI. And, as the event shows, this was removed from the Wikipedia article.

CrossRef’s DOI Event Tracker Pilot

TL;DR

CrossRef’s “DOI Event Tracker Pilot”- 11 million+ DOIs & 64 million+ events. You can play with it at: http://goo.gl/OxImJa

Tracking DOI Events

So have you been wondering what we’ve been doing since we posted about the experiments we were conducting using PLOS’s open source ALM code? A lot, it turns out. About a week after our post, we were contacted by a group of our members from OASPA who expressed an interest in working with the system. Apparently they were all about to conduct similar experiments using the ALM code, and they thought that it might be more efficient and interesting if they did so together using our installation. Yippee. Publishers working together. That’s what we’re all about.

So we convened the interested parties and had a meeting to discuss what problems they were trying to solve and how CrossRef might be able to help them. That early meeting came to a consensus on a number of issues:

  • The group was interested in exploring the role CrossRef could play in providing an open, common infrastructure to track activities around DOIs; they were not interested in having CrossRef play a role in the value-add services of reporting on and interpreting the meaning of said activities.
  • The working group needed representatives from multiple stakeholders in the industry. Not just open access publishers from OASPA, but subscription-based publishers, funders, researchers and third-party service providers as well.
  • That it was desirable to conduct a pilot to see if the proposed approach was both technically feasible and financially sustainable.

And so after that meeting, the “experiment” graduated to becoming a “pilot.” This CrossRef pilot is based on the premise that the infrastructure involved in tracking common information about “DOI events” can be usefully separated from the value-added services of analysing and presenting these events in the form of qualitative indicators. There are many forms of events and interactions which may be of interest. Service providers will wish to analyse, aggregate and present those in a range of different ways depending on the customer and their problem. The capture of the underlying events can be kept separate from those services.

In order to ensure that the CrossRef pilot is not mistaken for some sub rosa attempt to establish new metrics for evaluating scholarly output, we also decided to eschew any moniker that includes the word “metrics” or its synonyms. So the “ALM Experiment” is dead. Long live the “DOI Event Tracker” (DET) pilot. Similarly, PLOS’s open source “ALM software” has been resurrected under the name “Lagotto.”

The Technical Issues

CrossRef members are interested in knowing about “events” relating to the DOIs that identify their content. But our members face a now-classic problem. There are a large number of sources for scholarly publications (3k+ CrossRef members) and that list is still growing. Similarly, there are an unbounded number of potential sources for usage information. For example:

  • Supplemental and grey literature (e.g. data, software, working papers)
  • Orthogonal professional literature (e.g. patents, legal documents, governmental/NGO/IGO reports, consultation reports, professional trade literature).
  • Scholarly tools (e.g. citation management systems, text and data mining applications).
  • Secondary outlets for scholarly literature (institutional and disciplinary repositories, A&I services).
  • Mainstream media (e.g. BBC, New York Times).
  • Social media (e.g. Wikipedia, Twitter, Facebook, Blogs, Yo).

Finally, there is a broad and growing audience of stakeholders who are interested in seeing how the literature is being used. The audience includes publishers themselves as well as funders, researchers, institutions, policy makers and citizens.

Publishers (or other stakeholders) could conceivably each choose to run their own system to collect this information and redistribute it to interested parties. Or they can work with a vendor to do the same. But in either case, they would face the following problems:

  • The N sources will change. New ones will emerge. Old ones will vanish.
  • The N audiences will change. New ones will emerge. Old ones will vanish.
  • Each publisher/vendor will need to deal with N sources’ different APIs, rate limits, T&Cs, data licenses, etc. This is a logistical headache for both the publishers/vendors and for the sources.
  • Each audience will need to deal with N publisher/vendor APIs, rate limits, T&Cs, data licenses, etc. This is a logistical headache for both the audiences and for the publishers.
  • If publishers/vendors use different systems which in turn look at different sources, it will be difficult to compare or audit results across publishers/vendors.
  • If a journal moves from one publisher to another, then how are the metrics for that journal’s articles going to follow the journal?

And then there is the simple issue of scale. Most parties will be interested in comparing the data that they collect for their own content with data about their competitors. Hence, if they all run their own system, they will each be querying much more than their own data. If, for example, just the commercial third-party providers were interested in collecting data covering the formal scholarly literature, they would each find themselves querying the same sources for the same 80 million DOIs. To put this into perspective, refreshing the data for 10 million DOIs once a month would require sources to support ~14K API calls an hour (10 million calls spread over a month of roughly 720 hours); 60 million DOIs would require on the order of 100K API calls an hour. Current standard API caps for many of the sources that people are interested in querying hover around 2K per hour. We may see these sources lift that cap for exceptional cases, but they are unlikely to do so for many different clients all of whom are querying essentially the same thing.

These issues typify the “multiple bilateral relationships” problem that CrossRef was founded to try and ameliorate. When we have many organizations trying to access the exact same APIs to process the exact same data (albeit to different ends), then it seems likely that CrossRef could help make the process more efficient.

Piloting A Proposed Solution

The CrossRef DET pilot aims to show the feasibility of providing a hub for the collection, storage and propagation of DOI events from multiple sources to multiple audiences.

Data Collection

  • Pull: DET will collect DOI event data from sources that are of common interest to the membership, but which are unlikely to make special efforts to accommodate the scholarly communications industry. Examples of this class of source include large, broadly popular services like Facebook, Twitter, VK, Sina Weibo, etc.
  • Push: DET will allow sources to send DOI event data directly to CrossRef in one of three ways:
    • Standard Linkback: Using standards that are widely used on the web. This will automatically enable linkback-aware systems like WordPress, Movable Type, etc. to alert DET to DOI events.
    • Scholarly Linkback: A to-be-defined augmented linkback-style API which will be optimized to work with scholarly resources and which will allow for more sophisticated payloads including other identifiers (e.g. ORCIDs, FundRefs), metadata, provenance information and authorization information. This system could be used by tools designed for scholarly communications. So, for example, it could be used by publisher platforms to distribute events related to downloads or comments within their discussion forums. It could also be used by third-party scholarly apps like Zotero, Mendeley, Papers, Authorea, IRUS-UK, etc. in order to alert interested parties to events related to specific DOIs.
    • Redirect: DET will also be able to serve as a service discovery layer that will allow sources to push DOI event data directly to an appropriate publisher-controlled endpoint using the above scholarly linkback mechanism. This can be used by sources like repositories in order to send sensitive usage data directly to the relevant publishers.

Data Propagation

Parties may want to use the DET in order to propagate information about DOI events. The system will support two broad data propagation patterns:

  • one-to-many: DOI events that are commonly harvested (pulled) by the DET system from a single source will be distributed freely to anybody who queries the DET API. Similarly, sources that push DOI events via the standard or scholarly linkback mechanisms will also propagate their DOI events openly to anybody who queries the DET API. DOI events that are propagated in either of these cases will be kept and logged by the DET system along with appropriate provenance information. This will be the most common, default propagation model for the DET system.
  • one-to-one: Sources of DOI events can also report (push) DOI event data directly to the owner of the relevant DOI if the DOI owner provides and registers a suitable end-point with the DET system. In these cases, data sources seeking to report information relating to a DOI will be redirected (with a suitable 30X HTTP status and relevant headers) to the end-point specified by the DOI owner. The DET system will not keep the request or provenance information. The one-to-one propagation model is designed to handle use cases where the source of the DOI event has put restrictions on the data and will only share the DOI events with the owner (registrant) of the DOI. This use case may be used, for example, by aggregators or A&I services that want to report confidential data directly back to a publisher. The advantage of the redirect mechanism is that CrossRef is not put into the position of having to secure sensitive data, as said data will never reside on CrossRef systems.

Note that the two patterns can be combined. So, for example, a publisher might want to have public social media events reported to the DET and propagated accordingly, but also to have private third parties report confidential information directly to the publisher.
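To make the two patterns a little more concrete, here is an entirely hypothetical sketch (assuming Python and requests; the endpoint, payload and status codes are illustrative only, since the scholarly linkback API is still to be defined):

import requests

# Hypothetical endpoint and payload: the scholarly linkback API is still being
# defined, so treat this purely as an illustration of the flow.
DET_ENDPOINT = "https://det.example.org/events"

event = {
    "doi": "10.5555/12345678",
    "source": "example-repository",
    "relation": "downloaded",
    "occurred_at": "2015-02-26T12:00:00Z",
}

# one-to-many: a plain push is stored by DET and exposed to anyone querying its API.
# one-to-one: for DOIs whose owners registered their own end-point, DET instead
# answers with a 30X redirect and the source re-sends the event to the publisher.
resp = requests.post(DET_ENDPOINT, json=event, allow_redirects=False)
if 300 <= resp.status_code < 400 and "Location" in resp.headers:
    requests.post(resp.headers["Location"], json=event)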

So Where Are We?

So to start with, the DET Working Group has grown substantially since the early days and we have representatives from a wide variety of stakeholders. The group includes:

  • Cameron Neylon, PLOS
  • Chris Shillum, Elsevier
  • Dom Mitchell, Co-action Publishing
  • Euan Adie, Altmetric
  • Jennifer Lin, PLOS
  • Juan Pablo Alperin, PKP
  • Kevin Dolby, Wellcome Trust
  • Liz Ferguson, Wiley
  • Maciej Rymarz, Mendeley
  • Mark Patterson, eLife
  • Martin Fenner, PLOS
  • Mike Thelwall, U Wolverhampton
  • Rachel Craven, BMC
  • Richard O’Beirne, OUP
  • Ruth Ivimey-Cook, eLife
  • Victoria Rao, Elsevier

As well as the usual contingent of CrossRef cat-herders including: Geoffrey Bilder, Rachael Lammey & Joe Wass.

When we announced the then-DET experiment, we said that one of the biggest challenges would be to create something that scaled to industry levels. At launch, we only loaded in around 317,500 CrossRef DOIs representing publications from 2014, and we could see the system was going to struggle. Since then Martin Fenner and Jennifer Lin at PLOS have been focusing on making sure that the Lagotto code scales appropriately, and it is now humming along with just over 11.5 million DOIs for which we’ve gathered over 64 million “events.” We aren’t worried about scalability on that front any more.

We’ve also shown that third parties should be able to access the API to provide value-added reporting and metrics. As a demonstration of this, PLOS configured a copy of its reporting software “Parascope” to point at the CrossRef DET instance. The next step we’re taking is to start testing the “push” API mechanism and the “point-to-point redirect” API mechanism. For the push API, we should have a really exciting demo available to show within the next few days. And on the point-to-point redirect, we have a sub-group exploring how that mechanism could potentially be used for reporting COUNTER stats as a complement to the SUSHI initiative.

The other major outstanding task we have before us is to calculate the costs of running the DET system as a production service. In this case we expect to have some pretty accurate data to go on, as we will have had close to half a year of running the pilot with a non-trivial number of DOIs and sources. Note that the working group is concerned to ensure that the underlying data from the system remains open to all. Keeping this raw data open is seen as critical to establishing trust in the metrics and reporting systems that third parties build on the data. The group has also committed to leaving the creation of value-add services to third parties. As such we have been focusing on exploring business models based around service-level-agreement-backed versions of the API to complement the free version of the same API. The free API will come with no guarantees of uptime, performance characteristics or support. For those users that depend on the API in order to deliver their services, we will offer paid-for SLA-backed versions of the free APIs. We can then configure our systems so that we can independently scale these SLA-backed APIs in order to meet those agreements.

Our goal is to have these calculations complete in time for the working group to make a recommendation to the CrossRef board meeting in July 2015. Until then, we’ll use CrossTech as a venue for notifying people when we’ve hit new milestones or added new capabilities to the DET Pilot system.

Problems with dx.doi.org on January 20th 2015- what we know.

Hell’s teeth.

So today (January 20th, 2015) the DOI HTTP resolver at dx.doi.org started to fail intermittently around the world. The doi.org domain is managed by CNRI on behalf of the International DOI Foundation. This means that the problem affected all DOI registration agencies including CrossRef, DataCite, mEDRA etc. This also means that more popularly known end-user services like FigShare and Zenodo were affected. The problem has been fixed, but the fix will take some time to propagate throughout the DNS system. You can monitor the progress here:

https://www.whatsmydns.net/#A/doi.org

Now for the embarrassing stuff…

At first lots of people were speculating that the problem had to do with somebody forgetting to renew the dx.doi.org domain name. Our information from CNRI was that the problem had to do with a mistaken change to a DNS record and that the domain name wasn’t the issue. We corrected people who were reporting domain name renewal as the cause, but eventually we learned that it was actually true. We have had it confirmed that the problem originated with the domain name having to be renewed manually, and the renewal being missed until after the domain had expired. Ugh. CNRI will issue a statement soon. We’ll link to it as soon as they do. UPDATE (Jan 21st): CNRI has sent CrossRef a statement. They do not have it on their site yet, so we have included it below.

In the mean time, if you are having trouble resolving DOIs, a neat trick to know is that you can do so using the Handle system directly. For example:

http://hdl.handle.net/10.5555/12345678

CrossRef will, of course, also analyse what occurred, and issue a public report as well. Obviously, this report will include an analysis of how the outage affected DOI referrals to our members.

The amazingly cool thing is that everybody online has been very supportive and has helped us to diagnose the problem. Some have even said that the event underscores a point we often make about so-called “persistent-identifiers”- which is that they are not magic technology; the “persistence” is the result of a social contract. We like to say that CrossRef DOIs are as persistent as CrossRef staff. Well, to that phrase we have to add “and IDF staff” and “CNRI staff” and “ICANN staff”. It is turtles all the way down.

We don’t want to dismiss this event as an inevitable consequence of interdependent systems. And we don’t want to pass the buck. We need to learn something practical from this. How can we guard against this type of problem in the future? Again, people following this issue on Twitter have already been helping with suggestions and ideas. Can we crowd-source the monitoring of persistent identifier SLAs? Could we leverage Wikipedia, Wikidata or something similar to monitor critical identifiers and other infrastructure like PURLs, DOIs, handles, PMIDs, perma.cc, etc? Should we be looking at designating special exceptions to the normal rules governing DNS names? Do we need to distribute the risk more? Or is it enough *cough* to simply ensure that somebody, somewhere in the dependency chain has enabled DNS protection and auto-renewal for critical infrastructure DNS names?

Truly, we are humbled. For all the redundancy built into our systems (multiple servers, multiple hosting sites, RAID drives, redundant power), we were undone by a simple administrative task. CrossRef, IDF and CNRI- we all feel a bit crap. But we’ll get back up. We’ll fix things. And we’ll let you know how we do it.

We will update this space as we know more. We will also keep people updated on Twitter via @CrossRefNews. And we will report back in detail as soon as we can.


CNRI Statement

"The doi.org domain name was inadvertently allowed to expire for a brief period this morning (Jan 20). It was reinstated shortly after 9am this morning as soon as the relevant CNRI employee learned of it. A reminder email sent earlier this month to renew the registration was apparently missed. We sincerely apologize for any difficulties this may have caused. The domain name has since been placed on automatic renewal, which should prevent any repeat of this event."

Linking data and publications

Do you want to see if a CrossRef DOI (typically assigned to publications) refers to DataCite DOIs (typically assigned to data)? Here you go:

http://api.labs.crossref.org/graph/doi/10.4319/lo.1997.42.1.0001

Conversely, do you want to see if a DataCite DOI refers to CrossRef DOIs? Voilà:

http://api.labs.crossref.org/graph/doi/10.1594/pangaea.185321

Background

“How can we effectively integrate data into the scholarly record?” This is the question that has, for the past few years, generated an unprecedented amount of handwringing on the part of researchers, librarians, funders and publishers. Indeed, this week I am in Amsterdam to attend the 4th RDA plenary, in which this topic will no doubt again garner a lot of deserved attention.

We hope that the small example above will help push the RDA’s agenda a little further. Like the recent ODIN project, it illustrates how we can simply combine two existing scholarly infrastructure systems to build important new functionality for integrating research objects into the scholarly literature.

Does it solve all of the problems associated with citing and referring to data? Can the various workgroups at RDA just cancel their data citation sessions and spend the week riding bikes and gorging on croquettes? Of course not. But my guess is that by simply integrating DataCite and CrossRef in this way, we can make a giant push in the right direction.

There are certainly going to be differences between traditional citation and data citation. Some even claim that citing data isn’t “as simple as citing traditional literature.” But this is a caricature of traditional citation. If you believe this, go off and peruse the MLA, Chicago, Harvard, NLM and APA citation guides. Then read Anthony Grafton’s The Footnote. Are you back yet? Good, so let’s continue…

Citation of any sort is a complex issue- full of subtleties, edge-case exceptions, disciplinary variations and kludges. Historically, the way to deal with these edge cases has been social, not technical. For traditional literature we have simply evolved and documented citation practices which generally make contextually-appropriate use of the same technical infrastructure (footnotes, endnotes, metadata, etc.). I suspect the same will be true in citing data. The solutions will not be technical, they will mostly be social. Researchers and publishers will evolve new, contextually appropriate mechanisms that use existing infrastructure to deal with the peculiarities of data citation.

Does this mean that we will never have to develop new systems to handle data citation? Possibly. But I don’t think we’ll know what those systems are or how they should work until we’ve actually had researchers attempting to use and adapt the tools we have.

Technical background

About five years ago, CrossRef and DataCite explored the possibility of exposing linkages between DataCite and CrossRef DOIs. Accordingly, we spent some time trying to assemble an example corpus that would illustrate the power of interlinking these identifiers. We encountered a slight problem. We could hardly find any examples. At that time, virtually nobody cited data with DataCite DOIs and, if they did, the CrossRef system did not handle them properly. We had to sit back and wait a while.

And now the situation has changed.

This demonstrator harvests DataCite DOIs using their OAI-PMH API and links them in a graph database with CrossRef DOIs. We have exposed this functionality on the “labs” (i.e. experimental) version of our REST API as a graph resource. So…

You can get a list of CrossRef DOIs that refer to DataCite DOIs as follows:

http://api.labs.crossref.org/graph?rel=cites:*&filter=source:crossref,related-source:datacite

And the converse:

http://api.labs.crossref.org/graph?rel=cites:*&filter=source:datacite,related-source:crossref
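Here is a minimal sketch of calling the first of those queries from Python (assuming requests; being a labs API, the response format may well change):

import requests

# The labs graph endpoint shown above; this is experimental and may change.
url = ("http://api.labs.crossref.org/graph"
       "?rel=cites:*&filter=source:crossref,related-source:datacite")
resp = requests.get(url, timeout=30)

print(resp.status_code)
# The response format isn't documented here, so just show whatever comes back.
print(resp.text[:500])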

Caveats and Weasel Words

  • We have not finished indexing all the links.
  • The API is currently a very early labs project. It is about as reliable as a devolution promise from Westminster.
  • The API is run on a pair of raspberry-pi’s connected to the internet via bluetooth.
  • It is not fast.
  • The representation and the API are under active development. Things will change.

Watch the CrossRef Labs site for updates on this collaboration with DataCite.

Citation needed

Remember when I said that the Wikipedia was the 8th largest referrer of DOI links to published research? This despite only a fraction of eligible references in the free encyclopaedia using DOIs.

We aim to fix that. CrossRef and Wikimedia are launching a new initiative to better integrate scholarly literature in the world’s largest public knowledge space, Wikipedia.

This work will help promote standard links to scholarly references within Wikipedia that persist over time, by ensuring consistent use of DOIs and other citation identifiers in Wikipedia references. CrossRef will support the development and maintenance of Wikipedia’s citation tools. This work will include bug fixes and performance improvements for existing tools, extending the tools to enable Wikipedia contributors to more easily look up and insert DOIs, and providing a “linkback” mechanism that alerts relevant parties when a persistent identifier is used in a Wikipedia reference.

In addition, CrossRef is creating the role of Wikimedia Ambassador (modeled after Wikimedian-in-Residence) to act as liaison with the Wikimedia community, promote use of scholarly references on Wikipedia, and educate about DOIs and other scholarly identifiers (ORCIDs, PubMed IDs, DataCite DOIs, etc) across Wikimedia projects.

Starting today, CrossRef will be working with Daniel Mietchen to coordinate CrossRef’s Wikimedia-related activities. Daniel’s team will be composed of Max Klein and Matt Senate, who will work to enhance Wikimedia citation tools, and will share the role of Wikipedia ambassador with Dorothy Howard.

Since the beginnings of Wikipedia, Daniel Mietchen has worked to integrate scholarly content into Wikimedia projects. He is part of an impressive community of active Wikipedians and developers who have worked extensively on linking Wikipedia articles to the formal literature and other scholarly resources. We’ve been talking to him about this project for nearly a year, and are happy to finally get it off the ground.

–G

Matt, Max and Daniel at #wikimania2014. Photo by Dorothy.


Many Metrics. Such Data. Wow.

CrossRef Labs loves to be the last to jump on an internet trend, so what better than to combine the Doge meme with altmetrics? Want to know how many times a CrossRef DOI is cited by the Wikipedia? http://det.labs.crossref.org/works/doi/10.1371/journal.pone.0086859

Or how many times one has been mentioned in Europe PubMed Central?

http://det.labs.crossref.org/works/doi/10.1016/j.neuropsychologia.2013.10.021

Or DataCite?

http://det.labs.crossref.org/works/doi/10.1111/jeb.12289

Background

Back in 2011 PLOS released its awesome ALM system as open source software (OSS). At CrossRef Labs, we thought it might be interesting to see what would happen if we ran our own instance of the system and loaded it up with a few CrossRef DOIs. So we did. And the code fell over. Oops. Somehow it didn’t like dealing with 10 million DOIs. Funny that.

But the beauty of OSS is that we were able to work with PLOS to scale the code to handle our volume of data. CrossRef contracted with Cottage Labs and we both worked with PLOS to make changes to the system. These eventually got fed back into the main ALM source on GitHub. Now everybody benefits from our work. Yay for OSS.

So if you want to know technical details, skip to Details for Propellerheads. But if you want to know why we did this, and what we plan to do with it, read on.

Why?

There are (cough) some problems in our industry that we can best solve with shared infrastructure. When publishers first put scholarly content online, they used to make bilateral reference linking agreements. These agreements allowed them to link citations using each other’s proprietary reference linking APIs. But this system didn’t scale. It was too time-consuming to negotiate all the agreements needed to link to other publishers. And linking through many proprietary citation APIs was too complex and too fragile. So the industry founded CrossRef to create a common, cross-publisher citation linking API. CrossRef has since obviated the need for bilateral linking arrangements.

So-called altmetrics look like they might have similar characteristics. You have ~4000 CrossRef member publishers and N sources (e.g. Twitter, Mendeley, Facebook, CiteULike, etc.) where people use (e.g. discuss, bookmark, annotate, etc.) scholarly publications. Publishers could conceivably each choose to run their own system to collect this information. But if they did, they would face the following problems:

  • The N sources will be volatile. New ones will emerge. Old ones will vanish.
  • Each publisher will need to deal with each source’s different APIs, rate limits, T&Cs, data licenses, etc. This is a logistical headache for both the publishers and for the sources.
  • If publishers use different systems which in turn look at different sources, it will be difficult to compare results across publishers.
  • If a journal moves from one publisher to another, then how are the metrics for that journal’s articles going to follow the journal?

This isn’t a complete list, but it shows that there might be some virtue in publishers sharing an infrastructure for collecting this data. But what about commercial providers? Couldn’t they provide these ALM services? Of course – and some of them currently do. But normally they look on the actual collection of this data as a means to an end. The real value they provide is in the analysis, reporting and tools that they build on top of the data. CrossRef has no interest in building front-ends to this data. If there is a role for us to play here, it is simply in the collection and distribution of the data.

No, really, WHY?

Aren’t these altmetrics an ill-conceived and meretricious idea? By providing this kind of information, isn’t CrossRef just encouraging feckless, neoliberal university administrators to hasten academia’s slide into a Stakhanovite dystopia? Can’t these systems be gamed?

FOR THE LOVE OF FSM, WHY IS CROSSREF DABBLING IN SOMETHING OF SUCH QUESTIONABLE VALUE?

takes deep breath. wipes spittle from beard

These are all serious concerns. Goodhart’s Law and all that… If a university’s appointments and promotion committee is largely swayed by Impact Factor, it won’t improve a thing if they substitute or supplement Impact Factor with altmetrics. As Amy Brand has repeatedly pointed out, the best institutions simply don’t use metrics this way at all (PowerPoint presentation). They know better.

But yes, it is still likely that some powerful people will come to lazy conclusions based on altmetrics. And following that, other lazy, unscrupulous and opportunistic people will attempt to game said metrics. We may even see an industry emerge to exploit this mess and provide the scholarly equivalent of SEO. Feh. Now I’m depressed and I need a drink.

So again, why is CrossRef doing this? Though we have our doubts about how effective altmetrics will be in evaluating the quality of content, we do believe that they are a useful tool for understanding how scholarly content is used and interpreted. The most eloquent arguments against altmetrics for measuring quality inadvertently make the case for altmetrics as a tool for monitoring attention.

Critics of altmetrics point out that much of the attention that research receives outside of formal scholarly communications channels can be ascribed to:

  • Puffery. Researchers and/or university/publisher “PR wonks” over-promoting research results.
  • Innocent misinterpretation. A lay audience simply doesn’t understand the research results.
  • Deliberate misinterpretation. Ideologues misrepresent research results to support their agendas.
  • Salaciousness. The research appears to be about sex, drugs, crime, video games or other popular bogeymen.
  • Neurobollocks. A category unto itself these days.

In short, scholarly research might be misinterpreted. Shock horror. Ban all metrics. Whew. That won’t happen again.

Scholarly research has always been discussed outside of formal scholarly venues. Both by scholars themselves and by interested laity. Sometimes these discussions advance the scientific cause. Sometimes they undermine it. The University of Utah didn’t depend on widespread Internet access or social networks to promote yet-to-be peer-reviewed claims about cold fusion. That was just old-fashioned analogue puffery. And the Internet played no role in the Laetrile or DMSO crazes of the 1980s. You see, there were once these things called “newspapers.” And another thing called “television.” And a sophisticated meatspace-based social network called a “town square.”

But there are critical differences between then and now. As citizens get more access to the scholarly literature, it is far more likely that research is going to be discussed outside of formal scholarly venues. Now we can build tools to help researchers track these discussions. Now researchers can, if they need to, engage in the conversations as well. One would think that conscientious researchers would see it as their responsibility to remain engaged, to know how their research is being used. And especially to know when it is being misused.

That isn’t to say that we expect researchers will welcome this task. We are no Pollyannas. Researchers are already famously overstretched. They barely have time to keep up with the formally published literature. It seems cruel to expect them to keep up with the firehose of the Internet as well.

Which gets us back to the value of altmetrics tools. Our hope is that, as altmetrics tools evolve, they will provide publishers and researchers with an efficient mechanism for monitoring the use of their content in non-traditional venues. Just in the way that citations were used before they were distorted into proxies for credit and kudos.

We don’t think altmetrics are there yet. Partly because some parties are still tantalized by the prospect of swapping one metric for another. But mostly because the entire field is still nascent. People don’t yet know how the information can be combined and used effectively. So we still make naive assumptions such as “link=like” and “more=better.” Surely it will eventually occur to somebody that, instead, there may be a connection between repeated headline-grabbing research and academic fraud. A neuroscientist might be interested in a tool that alerts them if the MRI scans in their research paper are being misinterpreted on the web to promote neurobollocks. An immunologist may want to know if their research is being misused by the anti-vaccination movement. Perhaps the real value in gathering this data will be seen when somebody builds tools to help researchers DETECT puffery, social-citation cabals, and misinterpretation of research results?

But CrossRef won’t be building those tools. What we might be able to do is help others overcome another hurdle that blocks the development of more sophisticated tools: getting hold of the needed data in the first place. This is why we are dabbling in altmetrics.

Wikipedia is already the 8th largest referrer of CrossRef DOIs. Note that this doesn’t just mean that the Wikipedia cites lots of CrossRef DOIs, it means that people actually click on and follow those DOIs to the scholarly literature. As scholarly communication transcends traditional outlets and as the audience for scholarly research broadens, we think that it will be more important for publishers and researchers to be aware of how their research is being discussed and used. They may even need to engage more with non-scholarly audiences. In order to do this, they need to be aware of the conversations. CrossRef is providing this experimental data source in the hope that we can spur the development of more sophisticated tools for detecting and analyzing these conversations. Thankfully, this is an inexpensive experiment to conduct – largely because of the decision on the part of PLOS to open source its ALM code.

What Now?

CrossRef’s instance of PLOS’s ALM code is an experiment. We mentioned that we had encountered scalability problems and that we had resolved some of them. But there are still big scalability issues to address. For example, assuming a response time of 1 second, if we wanted to poll the English-language version of the Wikipedia to see what had cited each of the 65 million DOIs held in CrossRef, the process would take years to complete (65 million requests at one per second is a little over two years). But this is how the system is designed to work at the moment. It polls various source APIs to see if a particular DOI is “mentioned”. Parallelizing the queries might reduce the amount of time it takes to poll the Wikipedia, but it doesn’t reduce the work. Another obvious way in which we could improve the scalability of the system is to add a push mechanism to supplement the pull mechanism. Instead of going out and polling the Wikipedia 65 million times, we could establish a “scholarly linkback” mechanism that would allow third parties to alert us when DOIs and other scholarly identifiers are referenced (e.g. cited, bookmarked, shared). If the Wikipedia used this, then even in the extreme scenario where everything in Wikipedia cites at least one CrossRef DOI, we would only need to process ~4 million trackbacks.
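
For the curious, the back-of-the-envelope arithmetic behind those figures looks like this (one request per second, issued strictly sequentially, is the assumption):

    SECONDS_PER_DAY = 60 * 60 * 24

    # Pull model: poll a single source (e.g. Wikipedia) once per DOI,
    # at one request per second, one request at a time.
    dois = 65_000_000
    polling_days = dois / SECONDS_PER_DAY
    print(f"Sequential polling: ~{polling_days:.0f} days (~{polling_days / 365:.1f} years)")

    # Push model: even if every English Wikipedia article cited at least
    # one CrossRef DOI, we would only handle one linkback per article.
    wikipedia_articles = 4_000_000
    print(f"Linkbacks to process: ~{wikipedia_articles:,}")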

The other significant advantage of adding a push API is that it would take the burden off CrossRef to know which sources we want to poll. At the moment, if a new source comes online, we’d need to know about it and build a custom plugin to poll its data. This needlessly disadvantages new tools and services, as it means that their data will not be gathered until they are big enough for us to pay attention to. If the service in question addresses a niche of the scholarly ecosystem, it may never become big enough. But if we allow sources to push data to us using a common infrastructure, then new sources do not need to wait for us to take notice before they can participate in the system.

Supporting (potentially) many new sources will raise another technical issue: tracking and maintaining the provenance of the data that we gather. The current ALM system does a pretty good job of keeping this data, but if we ever want third parties to be able to rely on the system, we probably need to extend the provenance information so that the data is cheaply and easily auditable.

Perhaps the most important thing we want to learn from running this experimental ALM instance is: what would it take to run the system as a production service? What technical resources would it require? How could they be supported? And from this we hope to gain enough information to decide whether the service is worth running and, if so, by whom. CrossRef is just one of several organizations that could run such a service, but it is not clear whether it would be the best one. We hope that as we work with PLOS, our members and the rest of the scholarly community, we’ll get a better idea of how such a service should be governed and sustained.

Details for Propellerheads

Warning, Caveats and Weasel Words

The CrossRef ALM instance is a CrossRef Labs project. It is running on R&D equipment in a non-production environment administered by an orangutan on a diet of Red Bull and vodka.

So what is working?

The system has been initially loaded with 317,500+ CrossRef DOIs representing publications from 2014. We will load more DOIs in reverse chronological order until we get bored or until the system falls over again.

We have activated the following sources:

  • PubMed
  • DataCite
  • PubMedCentral Europe Citations and Usage

We have data from the following sources but will need some work to achieve stability:

  • Facebook
  • Wikipedia
  • CiteULike
  • Twitter
  • Reddit

Some of them are faster than others. Some are more temperamental than others. WordPress, for example, seems to go into a sulk and shut itself off after approximately 1,300 API calls.

In any case, we will be monitoring and tweaking the sources as we gather data. We will also add new sources as the API keys we’ve requested come through. We will probably even create one or two new sources ourselves. Watch this blog and we’ll update you as we add/tweak sources.

Dammit, shut up already and tell me how to query stuff.

You can log in to the CrossRef ALM instance simply by using a Mozilla Persona (yes, we’d eventually like to support ORCID too). Once logged in, your account page will list an API key. Using the API key, you can do things like:

http://det.labs.crossref.org/api/v5/articles?ids=10.1038/nature12990

And you will see that (as of this writing) said Nature article has been cited by the Wikipedia article here:

http://en.wikipedia.org/wiki/HE0107-5240
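
If you’d rather do that from code, here is a minimal sketch in Python. It assumes the usual ALM v5 conventions (an “api_key” query parameter and a JSON response containing a “data” list of works); those details are assumptions on our part, so check the PLOS instructions mentioned below for the authoritative picture.

    import requests

    API_KEY = "the-api-key-from-your-account-page"  # placeholder

    # Same query as the example above: look up one DOI in the labs ALM
    # instance. The api_key parameter and the response shape ("data" list
    # of works) follow common ALM v5 conventions and are assumed here.
    url = "http://det.labs.crossref.org/api/v5/articles"
    params = {"ids": "10.1038/nature12990", "api_key": API_KEY}

    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()

    for work in response.json().get("data", []):
        print(work.get("doi"), work.get("title"))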

PLOS has provided lovely, detailed instructions for using the API, so please play with it and see what you make of it. On our side we will be looking at how we can improve performance and expand coverage. We don’t promise much: the logistics here are formidable. As we said above, once you start working with millions of documents, the polling process starts to hit API walls quickly. But that is all part of the experiment. We appreciate your help and would like your feedback. We can be contacted at:

     

labs@crossref.org