DOIs and matching regular expressions

We regularly see developers using regular expressions to validate or scrape for DOIs. For modern CrossRef DOIs the regular expression is short:


For the 74.9M DOIs we have seen this matches 74.4M of them. If you need to use only one pattern then use this one.
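A minimal sketch of applying such a pattern in Python. The expression below is the commonly cited form for modern CrossRef DOIs (a “10.” prefix, a 4–9 digit registrant code, a slash, then a suffix); treat it as an assumption rather than the definitive recommendation:

```python
import re

# Commonly cited pattern for modern CrossRef DOIs (an assumption here):
# "10." + a 4-9 digit registrant code + "/" + suffix characters.
MODERN_DOI = re.compile(r'^10\.\d{4,9}/[-._;()/:A-Z0-9]+$', re.IGNORECASE)

for candidate in ["10.1000/xyz123", "10.1038/nphys1170", "not.a/doi"]:
    print(candidate, "matches" if MODERN_DOI.match(candidate) else "no match")
```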

The other 500K are mostly from CrossRef’s early days, when the battle between “human-readable” identifiers and “opaque” identifiers was still being fought, the web was still new, and it was expected that “doi” would become as well-supported a URI scheme name as “gopher”, “wais”, …. Ok, that didn’t go so well.

One of CrossRef’s early members was John Wiley & Sons. They faced the need to design DOIs without much prior work to lean on. Many of those early DOIs are not regular-expression friendly. Nevertheless, they are still valid and valuable permanent links to the work’s version of record. You can catch 300K more DOIs with


While the DOI caught is likely to be the DOI within the text, it may also contain trailing characters that, due to the lack of a space, are caught up with the DOI. Even the recommended expression catches DOIs ending with periods, colons, semicolons, hyphens, and underscores. Most DOIs found in the wild are presented within some visual design program. While pleasant to look at, the visual design can misdirect machines. Is the period at the end of the line part of the DOI or part of the design? Is that en dash actually a hyphen? These issues lead to DOI bycatch.
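One hedged mitigation is to post-process matches and strip trailing punctuation that is more likely page design than identifier. The function below is illustrative only, and the heuristic is lossy: a few valid DOIs genuinely end in these characters.

```python
import re

def trim_bycatch(candidate: str) -> str:
    """Illustrative heuristic: drop trailing periods, commas, colons,
    semicolons, hyphens, underscores, and en dashes that were likely
    picked up from the surrounding page design. Lossy by design."""
    return re.sub(r'[.,;:\-_\u2013]+$', '', candidate)

print(trim_bycatch("10.1000/xyz123."))  # sentence-final period removed
print(trim_bycatch("10.1000/xyz123"))   # already clean, unchanged
```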

Adding the following three expressions to the previous two leaves only 72K DOIs uncaught. To catch these 72K would require a dozen or more additional patterns. Each additional pattern, unfortunately, weakens the overall precision of the catch. More bycatch.




CrossRef is not the only DOI Registration Agency, and while our members account for 65-75% of all registered DOIs, that still leaves tens of millions of DOIs that we have not seen. Luckily, the newer RAs and their publishers can copy our successes and avoid our mistakes.


Rehashing PIDs without stabbing myself in the eyeball

Anybody who knows me or reads this blog is probably aware that I don’t exactly hold back when discussing problems with the DOI system. But just occasionally I find myself actually defending the thing…

About once a year somebody suggests that we could replace existing persistent citation identifiers (e.g. DOIs) with some new technology that would fix some of the weaknesses of the current systems. Usually said person is unhappy that current systems like DOI, Handle, ARK, etc. depend largely on a social element to update the pointers between the identifier and the current location of the resource being identified. It just seems manifestly old-fashioned and ridiculous that we should still depend on bags of meat to keep our digital linking infrastructure from falling apart.

In the past, I’ve threatened to stab myself in the eyeball if I was forced to have the discussion again. But the dirty little secret is that I play this game myself sometimes. After all, the best thing a mission-driven membership organisation could do for its members would be to fulfil its mission and put itself out of business. If we could come up with a technical fix that didn’t require the social component, it would save our members a lot of money and effort.

When one of these ideas is posed, there is a brief flurry of activity as another generation goes through the same thought processes and (so far) comes to the same conclusions.

The proposals I’ve seen generally fall into one of the following groups:

  • Replace persistent identifiers (PIDs) with hashes, checksums, etc.
  • Just use search (often, but not always, coupled with 1 above)
  • Automagically create PIDs out of metadata.
  • Automagically redirect broken citations to archived versions of the content identified
  • And more recently… use the blockchain

I thought it might help advance the discussion and avoid a bunch of dead ends if I summarised (rehashed?) some of the issues that should be considered when exploring these options.

Warning: Refers to FRBR terminology. Those of a sensitive disposition might want to turn away now.

  • DOIs, PMIDs, etc. and other persistent identifiers are primarily used by our community as “citation identifiers”. We generally cite at the “expression” level.
  • Consider the difference between how a “citation identifier”, a “work identifier”, and a “content verification identifier” might function.
  • How do you deal with “equivalent manifestations” of the same expression? For example, the ePub, PDF and HTML representations of the same article are intellectually equivalent and interchangeable when citing. The same applies to CSV and TSV representations of the same dataset. So, for example, how do hashes work here as a citation identifier?
  • Content can be changed in ways that typically don’t affect the interpretation or crediting of the work, for example by reformatting, correcting spelling, etc. In these cases the copies should share the same citation identifier, but the hashes will be different.
  • Content that is virtually identical (and shares the same hash) might be republished in different venues (e.g. a normal issue and a thematic issue). Context in citation is important. How do you point somebody at the copy in the correct context?
  • Some copies of an article or dataset are stewarded by publishers. That is, if there is an update, errata, corrigenda, retraction/withdrawal, they can reflect that on the stewarded copy, not on copies they don’t host or control. Location is, in fact, important here.
  • Some copies of content will be nearly identical, but will differ in ways that would affect the interpretation and/or crediting of the work. A corrected number in a table for example. How would you create a citation from a search that would differentiate the correct version from the incorrect version?
  • Some content might be restricted, private or under embargo. For example private patient data, sensitive data about archaeological finds or the migratory patterns of endangered animals.
  • Some content is behind paywalls (cue jeremiads)
  • Content is increasingly composed of static and dynamic elements. How do you identify the parts that can be hashed?
  • How do you create identifiers out of metadata and not have them look like this?
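The hash objections above are easy to demonstrate. In this minimal sketch (the text is invented), two copies of the “same” expression differ only in line-ending convention, yet their content hashes diverge completely, so a hash cannot serve as a shared citation identifier for equivalent manifestations:

```python
import hashlib

# Two byte-different copies of the same expression: one has been
# reformatted with Windows line endings. A citation identifier should
# treat them as one work; a content hash cannot.
copy_a = b"The quick brown fox jumps over the lazy dog.\n"
copy_b = b"The quick brown fox jumps over the lazy dog.\r\n"

print(hashlib.sha256(copy_a).hexdigest())
print(hashlib.sha256(copy_b).hexdigest())
```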

This list is a starting point that should allow people to avoid a lot of blind alleys.

In the meantime, good luck to those seeking alternatives to the current crop of persistent citation identifier systems. I’m not convinced it is possible to replace them, but if it is, I hope I beat you to it. :-) And I hope I can avoid stabbing myself in the eye.


Coming to you Live from Wikipedia

We’ve been collecting citation events from Wikipedia for some time. We’re now pleased to announce a live stream of citations, as they happen, when they happen. Project this on your wall and watch live DOI citations as people edit Wikipedia, round the world.

View live stream »

In the hours since this feature launched, there have been events from Indonesian, Portuguese, Ukrainian, Serbian and English Wikipedias (in that order).

Live event stream

The usual weasel words apply. This is a labs project and so may not be 100% stable. If you experience any problems please email .


January 2015 DOI Outage: Followup Report


On January 20th, 2015 the main DOI HTTP proxy at experienced a partial, rolling global outage. The system was never completely down, but for at least part of the subsequent 48 hours, up to 50% of DOI resolution traffic was effectively broken. This was true for almost all DOI registration agencies, including CrossRef, DataCite and mEDRA.

At the time we kept people updated on what we knew via Twitter, mailing lists and our technical blog at CrossTech. We also promised that, once we’d done a thorough investigation, we’d report back. Well, we haven’t finished investigating all implications of the outage. There are both substantial technical and governance issues to investigate. But last week we provided a preliminary report to the CrossRef board on the basic technical issues, and we thought we’d share that publicly now.

The Gory Details

First, the outage of January 20th was not caused by a software or hardware failure, but was instead due to an administrative error at the Corporation for National Research Initiatives (CNRI). The domain name “” is managed by CNRI on behalf of the International DOI Foundation (IDF). The domain name was not on “auto-renew” and CNRI staff simply forgot to manually renew the domain. Once the domain name was renewed, it took about 48 hours for the fix to propagate through the DNS system and for the DOI resolution service to return to normal. Working with CNRI we analysed traffic through the Handle HTTP proxy and here’s the graph:

Chart of Handle HTTP proxy traffic during outage

The above graph shows traffic over a 24 hour period on each day from January 12, 2015 through February 10th, 2015. The heavy blue line for January 20th and the heavy red line for January 21st show how referrals declined as the domain was first deleted, and then added back to DNS.

It could have been much worse. The domain registrar (GoDaddy) at least had a “renewal grace and registry redemption period” which meant that even though CNRI forgot to pay its bill to renew the domain, the domain was simply “parked” and could easily be renewed by them. This is the standard setting for GoDaddy. Cheaper domain registrars might not include this kind of protection by default. Had there been no grace period, then it would have been possible for somebody other than CNRI to quickly buy the domain name as soon as it expired. There are many automated processes which search for and register recently expired domain names. Had this happened, at the very least it would have been expensive for CNRI to buy the domain back. The interruption to DOI resolutions during this period would have also been almost complete.

So we got off relatively easy. The domain name is now on auto-renew. The outage was not as bad as it could have been. It was addressed quickly and we can be reasonably confident that the same administrative error will not happen again. CrossRef even managed to garner some public praise for the way in which we handled the outage. It is tempting to heave a sigh of relief and move on.

We also know that everybody involved at CNRI, the IDF and CrossRef has felt truly dreadful about what happened. So it is also tempting not to re-open old wounds.

But it would be a mistake if we did not examine a fundamental strategic issue that this partial outage has raised: How can CrossRef claim that its DOIs are ‘persistent’ if CrossRef does not control some of the key infrastructure on which it depends? What can we do to address these dependencies?

What do we mean by “persistent?”

@kaythaney tweets on definition of “persistent”

To start with, we should probably explore what we mean by ‘persistent’. We use the word “persistent” or “persistence” about 470 times on the CrossRef web site. The word “persistent” appears central to our image of ourselves and of the services that we provide. We describe our core, mandatory service as the “CrossRef Persistent Citation Infrastructure.”

The primary sense of the word “persistent” in the New Oxford American Dictionary is:

Continuing firmly or obstinately in a course of action in spite of difficulty or opposition.

We play on this sense of the word as a synonym for “stubborn” when we half-jokingly say that, “CrossRef DOIs are as persistent as CrossRef staff.” Underlying this joke is a truth, which is that persistence is primarily a social issue, not a technical issue.

Yet presumably we once chose to use the word “persistent” instead of “perpetual” or “permanent” for other reasons. “Persistence” implies longevity, without committing to “forever.” Scholarly publishers, perhaps more than most industries, understand the long term. After all, the scholarly record dates back to at least 1665 and we know that the scholarly community values even our oldest journal backfiles. By using the word “persistent” as opposed to the more emphatic “permanent” we are essentially acknowledging that we, as an industry, understand the complexity and expense of stewarding the content for even a few hundred years to say nothing of “forever.” Only the chronologically naïve would recklessly coin terms like “permalink” for standard HTTP links which have a documented half-life of well under a decade.

So “persistent” implies longevity without committing to forever, but this still leaves open questions. What time span is long enough to qualify as “persistent?” What, in particular, do we mean by “persistent” when we talk about CrossRef’s “Persistent Citation Infrastructure” or of CrossRef DOIs being “persistent identifiers?”

What do we mean by “persistent identifiers?”

@violetailik tweets on outage and implication for term “persistent identifier”

First, we often make the mistake of talking about “persistent identifiers” as if there is some technical magic that makes them continue working when things like HTTP URIs break. The very term “persistent identifier” encourages this kind of magical thinking and, ideally, we would instead talk about “persist-able” identifiers; that is, those that have some form of indirection built into them. There are many technologies that do this: Handles, DOIs, PURLs, ARKs, and every URL shortener in existence. Each of them simply introduces a pointer mapping between an identifier and the location where a resource resides. This mapping can be updated when the content moves, thus preserving the link. Of course, just because an identifier is persist-able doesn’t mean it is persistent. If PURLs or DOIs are not updated when content moves, then they are no more persistent than normal URLs.
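The indirection common to all of these systems can be sketched in a few lines. This is a toy model, not any real resolver: the URLs are invented, and the DOI is the example prefix used later in this post.

```python
# Toy model of a persist-able identifier: an opaque token mapped to a
# current location, plus a steward who can update the mapping.
registry = {"10.5555/12345678": "https://old-host.example/article"}

def resolve(identifier: str) -> str:
    """Follow the pointer from the identifier to its current location."""
    return registry[identifier]

# The content moves; a human (or process) updates the pointer, and the
# identifier keeps working. Skip this social step and the identifier is
# no more persistent than a plain URL.
registry["10.5555/12345678"] = "https://new-host.example/article"
print(resolve("10.5555/12345678"))
```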

Andrew Treloar points out that when we talk about “persistent identifiers,” we tend to conflate several things:

  1. The persistence of the identifier- that is the token or string itself.
  2. The persistence of the thing being pointed at by the identifier. For example, the content.
  3. The persistence of the mapping of the identifier to the thing being identified.
  4. The persistence of the resolver that allows one to follow the mapping of the identifier to the thing being identified.
  5. The persistence of a mechanism for updating the mapping of the identifier to the thing being identified.

If any of the above fails, then “persistence” fails. This is probably why we tend to conflate them in the first place.

Each of these aspects of “persistence” is worthy of much closer scrutiny. However, in the most recent case of the January outage of “,” the problem specifically occurred with item 4: the persistence of the resolver. When CNRI failed to renew the domain name for “” on time, the DOI resolver was rendered unavailable to a large percentage of people over a period of about 48 hours as global DNS servers first removed, and then added back, the “” domain.

Turtles all the way down*

The initial public reaction to the outage was almost unanimous in one respect: people assumed that the problem originated with CrossRef.

@iainh_z tweets to CrossRef enquiring about failed DOI resolution

@LibSkrat tweets at CrossRef about DOI outage

This is both surprising and unsurprising. It is surprising because we have fairly recent data indicating that lots of people recognise the DOI brand, but not the CrossRef brand. Chances are that this relatively superficial brand awareness does not correlate with understanding how the system works or how it relates to persistence. It is likely plenty of people clicked on DOIs at the time of the outage and, when they didn’t work, simply shrugged or cursed under their breath. They were aware of the term “DOI” but not of the promise of “persistence”. Hence, they did not take to Twitter to complain about it, and if they did, they probably wouldn’t have known whom to complain to or even how to complain to them (neither CNRI nor the IDF has a Twitter account).

But the focus on CrossRef is also unsurprising. CrossRef is by far the largest and most visible DOI Registration Agency. Many otherwise knowledgeable people in the industry simply don’t know that there are other RAs at all.

They also generally didn’t know of the strategic dependencies that exist in the CrossRef system. By “strategic dependencies” we are not talking about the vendors, equipment and services that virtually every online enterprise depends on. These kinds of services are largely fungible. Their failures may be inconvenient and even dramatic, but they are rarely existential.

Instead we are talking about dependencies that underpin CrossRef’s ability to deliver on its mission; dependencies that not only affect CrossRef’s operations, but also its ability to self-govern and meet the needs of its membership. In this case there are three major dependencies: two specific to CrossRef and the other DOI registration agencies, and one shared by virtually all online enterprises today. The organizations are: the International DOI Foundation (IDF), the Corporation for National Research Initiatives (CNRI) and the Internet Corporation for Assigned Names and Numbers (ICANN).

Dependency of RAs on IDF, CNRI and ICANN. Turtles all the way down.

Each of these agencies has technology, governance and policy impacts on CrossRef and the other DOI registration agencies, but here we will focus on the technological dependencies.

At the top of the diagram are a subset of the various DOI Registration Agencies. Each RA uses the DOI for a particular constituency (e.g. scholarly publishers) and application (e.g. citation). Sometimes these constituencies/applications overlap (as with mEDRA, CrossRef and DataCite), but sometimes they are orthogonal to the other RAs, as is the case with EIDR. All, however, are members of the IDF.

The IDF sets technical policies and development agendas for the DOI infrastructure. This includes recommendations about how RAs should display and link DOIs. Of course all of these decisions have an impact on the RAs. However, the IDF provides little technical infrastructure of its own, as it has no full-time staff. Instead it outsources the operation of the system to CNRI; this includes the management of the domain, which the IDF owns.

The actual DOI infrastructure is hosted on a platform called the Handle System, which was developed by, and is currently run by, CNRI. The Handle System is part of a quite complex and sophisticated platform for managing digital objects that was originally developed for DARPA. A subset of the Handle System is designated for use by DOIs and is identified by the “10” prefix (e.g. 10.5555/12345678).

The Handle System itself is not based on HTTP (the web protocol). Indeed, one of the much-touted features of the Handle System is that it isn’t based on any specific resolution technology. This was seen as a great virtue in the late 1990s, when the DOI system was developed and the internet had just witnessed an explosion of seemingly transient, competing protocols (e.g. Gopher, WAIS, Archie, HyperWave/Hyper-G, HTTP, etc.). But what looked like a wild west of protocols quickly settled into an HTTP hegemony. In practice, virtually all DOI interactions with the Handle System are via HTTP, and so, in order to interact with the web, the Handle System employs a “Handle proxy” which translates back and forth between HTTP and the native Handle protocol.

This all may sound complicated, and the backend of the Handle System is really very sophisticated, but it turns out that the DOI really uses only a fraction of the Handle System’s features. In fact, the vast majority of DOI interactions merely use the Handle System as a giant lookup table which allows one to translate an identifier into a web location. For example, it will take a DOI Handle like this:


and redirect it to (as of this writing) the following URL:

This whole transformation is normally never seen by a user. It is handled transparently by the web browser, which does the lookup and redirection in the background using HTTP, talking to the Handle Proxy. In the late 1990s, even doing this simple translation quickly, at scale, with a robust distributed infrastructure was not easy. These days, however, we see dozens if not hundreds of URL shorteners doing exactly the same thing at far greater scale than the Handle System.

It may seem a shame that more of the Handle System’s features are not used, but the truth is that the much-touted platform independence of the Handle System rapidly became more of a liability and impediment to persistence than an aid. To be blunt, if in X years a new technology comes out that supersedes the web, what do we think the societal priority is going to be?

  • To provide a robust and transparent transition from the squillions of existing HTTP URI identifiers that the entire world depends on?
  • To provide a robust and transparent transition from the tiny subset of Handle-based identifiers that are used by about a hundred million specialist resources?

Quite simply, the more the Handle/DOI systems diverge from common web protocols and practice, then the more we will jeopardise the longevity of our so-called persistent identifiers.

So, in the end, DOI registration agencies really only use the Handle system for translating web addresses. All of the other services and features one might associate with DOIs (reference resolution, metadata lookup, content negotiation, OAI-PMH, REST APIs, CrossMark, CrossCheck, TDM Services, FundRef etc) are all provided at the RA level.

But this address resolution is still critical. And it is exactly what failed for many users on January 20th 2015. And to be clear, it wasn’t the robust and scalable Handle System that failed. It wasn’t the Handle Proxy that failed. And it certainly wasn’t any RA-controlled technology that failed. These systems were all up and running. What happened was that the standard handle proxy that the IDF recommends RAs use, “”, was effectively rendered invisible to wide portions of the internet because the “” domain was not renewed. This underscores two important points.

The first is that it doesn’t much matter what precisely caused the outage. In this case it was an administrative error. But the effect would have been similar if the Handle proxies had failed or if the Handle system itself had somehow collapsed. In the end, CrossRef and all DOI registration agencies are existentially dependent on the Handle system running and being accessible.

The second is that the entire chain of dependencies, from the RAs down through CNRI, is also dependent on the DNS system which, in turn, is governed by ICANN. We should really not make too much of the purported technology independence of the DOI and Handle systems. To be fair, this limitation is inherent to all persistent identifier schemes that aim to work with the web. It really is “turtles all the way down.”

What didn’t fail on January 19th/20th and why?

You may have noticed a lot of hedging in our description of the outage of January 19th/20th. For one thing, we use the term “rolling outage.” Access to the Handle Proxy via “” was never completely unavailable during the period. As we’ve explained, this is because the error was discovered very quickly and the domain was renewed hours after it expired. The nature of DNS propagation meant that even as some DNS servers were deleting the “” entry, others were adding it back to their tables. In some ways this was really confusing because it meant it was difficult to predict where the system was working and where it wasn’t. Ultimately it all stabilised after the standard 48-hour DNS propagation cycle.

But there were also some Handle-based services that simply were not affected at all by the outage. During the outage, a few people asked us if there was an alternative way to resolve DOIs. The answer was “yes,” there were several. It turns out that “” is not the only DNS name that points to the Handle Proxy. People could easily substitute “” with “” or “” or “” and “resolve” any DOI. Many of CrossRef’s internal services use these internal names, and so those services continued to work. This is partly why we only discovered that “” was down when people reported it on Twitter.

And, of course, there were other services that were not affected by the outage. CrossMark, the REST API, and CrossRef Metadata Search all continued to work during the outage.

Protecting ourselves

So what can we do to reduce our dependencies and/or the risks intrinsic to those dependencies?

Obviously, the simplest way to have avoided the outage would have been to ensure that the “” domain was set to automatically renew. That’s been done. Is there anything else we should do? A few ideas have been floated that might allow us to provide even more resilience. They range greatly in complexity and involvement.

  1. Provide well-publicised public status dashboards that show what systems are up and which clearly map dependencies so that people could, for instance, see that the server was not visible to systems that depended on it. Of course, if such a dashboard had been hosted at, nobody would have been able to connect to it. Stoopid turtles.
  2. Encourage DOI RAs to have their members point to Handle proxies using domain names under the RA’s control. Simply put, if CrossRef members had been using “” instead of “”, then CrossRef DOIs would have continued to work throughout the outage of “”. The same goes for mEDRA and the other RAs. This way each RA would have control over another critical piece of its infrastructure. It would also mean that if any single RA made a similar domain name renewal mistake, the impact would be isolated to a particular constituency. Finally, using RA-specific domains for resolving DOIs might also make it clear that different DOIs are managed by different RAs and might have different services associated with them. Perhaps CrossRef would spend less time supporting non-CrossRef DOIs?
  3. Provide a parallel, backup resolution technology that could be pointed to in the event of a catastrophic Handle System failure. For example we could run a parallel system based on PURLs, ARKs or another persist-able identifier infrastructure.
  4. Explore working with ICANN to get the handle resolvers moved under the special “.arpa” top level domain (TLD). This TLD (RFC 3172) is reserved for services that are considered to be “critical to the operation of the internet.” This is an option that was first discussed at a meeting of persistent identifier providers in 2011.

These are all tactical approaches to addressing the specific technical problem of the Handle System becoming unavailable, but they do not address deeper issues relating to our strategic dependence on several third parties. Even though the IDF and CNRI provide us with pretty simple and limited functionality, that functionality is critical to our operations and our claim to be providing persistent identifiers. Yet these technologies are not in our direct control. We had to scramble to get hold of people to fix the problem. For a while, we were not able to tell our users or members what was happening because we did not know ourselves.

The irony is that CrossRef was held to account, and we were in the firing line the entire time. Again, this was almost unavoidable. In addition to being the largest DOI RA, we are also the only RA that has any significant social media presence and support resources. Still, it meant that we were the public face of the outage while the IDF and CNRI remained in the background.

And this is partly why our board has encouraged us to investigate another option:

  5. Explore what it would take to remove CrossRef dependencies on the IDF and CNRI.

CrossRef is just part of a chain of dependencies that goes from our publisher members down through the IDF, CNRI and, ultimately, ICANN. Our claim to providing a persistent identifier infrastructure depends entirely on the IDF and CNRI. Here we have explored some of the technical dependencies. But there are also complex governance and policy implications of these dependencies. Each organization has membership rules, guidelines and governance structures which can impact CrossRef members. Indeed, the IDF and CNRI are themselves members of groups (ISO and DONA, respectively) which might ultimately have policy or governance impact for DOI registration agencies. We will need to understand the strategic implications of these non-technical dependencies as well.

Note that the CrossRef board has merely asked us to “explore” what it would take to remove dependencies. They have not asked us to actually take any action. CrossRef has been massively supportive of the IDF and CNRI, and they have been massively supportive of us. Still, over the years we have all grown and our respective circumstances have changed. It is important that occasionally we question what we might have once considered to be axioms. As we discussed above, we use the term “persistent” which, in turn, is a synonym for “stubborn.” At the very least we need to document the inter-dependencies that we have so that we can understand just how stubborn we can reasonably expect our identifiers to be.

The outage of January 20th was a humbling experience. But in a way we were lucky: forgetting to renew the domain name was a silly and prosaic way to partially bring down a persistent identifier infrastructure, but it was also relatively easy to fix. Inevitably, there was a little snark and some pointed barbs directed at us during the outage, but we were truly overwhelmed by the support and constructive criticism we received as well. We have also been left with a clear message that, in order for this goodwill to continue, we need to follow up with a public, detailed and candid analysis of our infrastructure and its dependencies. Consider this to be the first section of a multi-part report.

@kevingashley tweets asking for followup analysis

@WilliamKilbride tweets asking for followup and lessons learned

Image Credits

Turtle image CC-BY “Unrecognised MJ” from the Noun Project


Real-time Stream of DOIs being cited in Wikipedia


Watch a real-time stream of DOIs being cited (and “un-cited!”) in Wikipedia articles across the world:


For years we’ve known that the Wikipedia was a major referrer of CrossRef DOIs, and about a year ago we confirmed that, in fact, the Wikipedia is the 8th largest referrer of CrossRef DOIs. We know that people follow the DOIs, too. This despite only a fraction of Wikipedia citations to the scholarly literature even using DOIs. So back in August we decided to create a Wikimedia Ambassador programme. The goal of the programme was to promote the use of persistent identifiers in citation and attribution in Wikipedia articles. We would do this through outreach and through the development of better citation-related tools.

Remember when we originally wrote about our experiments with the PLOS ALM code and how that has transitioned into the DOI Event Tracking Pilot? In those posts we mentioned that one of the hurdles in gathering information about DOI events is the actual process of polling third-party APIs for activity related to millions of DOIs. Most parties simply wouldn’t be willing to handle the load of 100K API calls an hour. Besides, polling is a tremendously inefficient process: only a fraction of DOIs are ever going to generate events, but we’d have to poll for each of them, repeatedly, forever, to get an accurate picture of DOI activity. We needed a better way. We needed to see if we could reverse this process and convince some parties to instead “push” us information whenever they saw DOI-related events (e.g. citations, downloads, shares, etc.). If only we could convince somebody to try this…
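The scale argument can be made concrete with back-of-envelope arithmetic. The DOI count below roughly matches the ~74.9M figure given earlier in this collection; the daily event count is a purely hypothetical stand-in:

```python
dois = 75_000_000        # order of magnitude of registered DOIs
polls_per_day = 1        # even a very lazy polling schedule
events_per_day = 10_000  # hypothetical number of actual DOI events per day

poll_requests = dois * polls_per_day
print(f"polling: {poll_requests:>12,} requests/day")
print(f"pushing: {events_per_day:>12,} messages/day")
print(f"waste factor: ~{poll_requests // events_per_day:,}x")
```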

Wikipedia DOI Events

In December 2014 we took the opportunity of the 2014 PLOS/CrossRef ALM Workshop in San Francisco to meet with Max Klein and Anthony Di Franco, where we kicked off a very exciting project.

There’s always someone editing a Wikipedia somewhere in the world. In fact, you can see a dizzying live stream of edits. We thought that given that there are so many DOIs in Wikipedia, that live stream may contain some diamonds (DOIs are made of diamond, that’s how they can be persistent). Max and Anthony went away and came back with a demo that contains a surprising amount of DOI activity.

That demo is evolving into a concrete service, called Cocytus. It is running at Wikimedia Labs monitoring live edits as you read this.

For now we’re feeding that data into the DOI Events Collection app (which is an off-shoot of the Chronograph project). We are in the process of modifying the Lagotto code so that we can instead push those events into the DOI Event Tracking Instance.

The first DOI event we noticed was delightfully prosaic: The DOI for “The polymath project” is cited by the Wikipedia page for “Polymath Project”. Prosaic perhaps, but the authors of that paper probably want to know. Maybe they can help edit the page.

Or how about this. Someone wrote a paper about why people edit Wikipedia and then it was cited by Wikipedia. And then the citation was removed. The plot thickens…

We’re interested in seeing how DOIs are used outside of the formal scholarly literature. What does that mean? We don’t fully know, that’s the point. We have retractions in scholarly literature (and our CrossMark metadata and service allow publishers to record that), but it’s a bit different on Wikipedia. Edit wars are fought over … well you can see for yourself.

Citations can slip in and out of articles. We saw the DOI 10.1001/archpediatrics.2011.832 deleted from “Bipolar disorder in children”. If we’d not been monitoring the live feed (we had considered analysing snapshots of the Wikipedia in bulk) we might never have seen that. This is part of what non-traditional citations means, and it wasn’t obvious until we’d seen it.

You can see this activity on the Chronograph’s stream. Or check your favourite DOI. Please be aware that we’re only collecting newly added citations as of today. We do intend to go back and back-fill, but that may take some time, as it (*cough*) requires polling again.

Some Technical Things

A few interesting things that happened as a result of all this:

Secure URLs

SSL and HTTPS were invented so you could do things like banking on the web without fear of interception or tampering. As the web becomes a more important part of life, many sites are upgrading from HTTP to HTTPS, the secure version. This is not only because your confidential details may be tampered with, but because certain governments might not like you reading certain materials.

Because of this, Wikipedia decided last year to embark on an upgrade to HTTPS, and they are a certain way along the path. The IDF, who are responsible for running the DOI system, upgraded to HTTPS this summer, although most DOIs are still referred to by HTTP.

We met with Dario Taraborelli at the ALM workshop and discussed the DOI referral data that is fed into the Chronograph. We put two and two together and realised that Wikipedia was linking to DOIs (which are mostly HTTP) from pages which might be served over HTTPS. New policies in HTML5 specify that referrer URL headers shouldn’t be sent from HTTPS to HTTP (in case there was something secret in them). The upshot of this is that if someone’s browsing Wikipedia via HTTPS and clicks on a normal DOI, we won’t know that the user came from Wikipedia. Not a huge problem today, but as Wikipedia switches over to entirely secure, we’re going to miss out on very useful information.
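One way a page can opt back in to sending some referrer information is HTML5’s meta referrer policy. As a sketch of the mechanism (not necessarily the exact policy Wikipedia will adopt), a policy of `origin` tells browsers to send just the site’s origin, not the full page URL, even when navigating from HTTPS to HTTP:

```html
<!-- Sketch: ask browsers to send only the origin (e.g. https://en.wikipedia.org/)
     as the Referer header, even when navigating from an HTTPS page to an HTTP link. -->
<meta name="referrer" content="origin">
```

Because only the origin is sent, nothing sensitive in the page URL leaks, but referral counts by site remain possible.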

Fortunately, the HTML5 specification includes a way to fix this (without leaking sensitive information). We discussed this with Dario; he did some research and came up with a suggestion, which got discussed. It’s fascinating to watch a democratic process like this take place, and to take part in it.

We’re waiting to see how the discussion turns out, and hope that it all works out so we can continue to report on how amazing Wikipedia is at sending people to scholarly literature.

How shall I cite thee?

Another discussion grew out of that process, and we started talking to a Wikipedian called Nemo (note to Latin scholars: we weren’t just talking to ourselves). Nemo (real name Federico Leva) had a few suggestions of his own. Another way to solve the referrer problem is by using HTTPS URLs (HTML5 allows browsers to send the referrer domain when going from HTTPS to HTTPS).

This means going back to all the articles that use DOIs and changing them from HTTP to HTTPS. Not as simple as it sounds, and it doesn’t sound simple. We started looking into how DOIs were cited on Wikipedia.

After some research we found that there are more ways than we expected to cite DOIs.

First, there’s the URL. You can see it in action in this article. URLs can take various forms.
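The common URL forms can be picked out with a single pattern, something like the sketch below. This echoes the regular expression advice above; the pattern is illustrative rather than exhaustive, and will miss some of the older, less expression-friendly DOIs:

```python
import re

# Matches the common DOI URL forms found in article citations:
# http://dx.doi.org/10.xxxx/yyyy and https://doi.org/10.xxxx/yyyy
# (illustrative only -- see the caveats about bycatch above).
DOI_URL = re.compile(r'https?://(?:dx\.)?doi\.org/(10\.\d{4,9}/\S+)')

for url in ("http://dx.doi.org/10.1126/science.1252243",
            "https://doi.org/10.1126/science.1252243"):
    m = DOI_URL.search(url)
    print(m.group(1))  # prints "10.1126/science.1252243" twice
```

As with the patterns discussed above, anything matched this way can still carry trailing punctuation bycatch, so post-processing is advisable.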


Second there’s the official template tag, seen in action here:

<ref name="SCI-20140731">{{cite journal |title=Sustained miniaturization and anatomical innovation in the dinosaurian ancestors of birds |url= |date=1 August 2014 |journal=[[Science (journal)|Science]] |volume=345 |issue=6196 |pages=562–566 |doi=10.1126/science.1252243 |accessdate=2 August 2014 |last1=Lee |first1=Michael S. Y. |first2=Andrea|last2=Cau |first3=Darren|last3=Naish|first4=Gareth J.|last4=Dyke}}</ref>

There’s a DOI in there somewhere. This is the best way to cite DOIs: firstly because it’s actually a proper traditional citation and there’s nothing magic about DOIs, and secondly because it’s a template tag and can be re-rendered to look slightly different if needed.

Third there’s the old official DOI template tag that’s now discouraged:

<ref name="Example2006">{{Cite doi|10.1146/}}</ref> 

And then there’s another one.


Knowing all this helps us find DOIs. But if we want to convert DOI links in Wikipedia to use HTTPS, it means that there are more template tags to modify and more pages to re-render.

Nemo also put DOIs on the Interwiki Map which should make automatically changing some of the URLs a lot easier.

We’re very grateful to Nemo for his suggestions and work on this. We’ll report back!

The elephant in the room

Those of you who know how DOIs work will have spotted an unsecured elephant in the room. When you visit a DOI, you visit the URL, which hits the DOI resolver proxy server, which returns a message to your browser to redirect to the landing page on the publisher’s site.

Securely talking to the DOI resolver by using HTTPS instead of HTTP means that no-one can eavesdrop and see which DOI you are visiting, or tamper with the result and send you off to a different page. But the page you are sent to will be, in nearly all cases, still HTTP. Upgrading infrastructure isn’t trivial, and, with over 4000 members (mostly publishers), most CrossRef DOIs will still redirect to standard HTTP pages for the foreseeable future.

You can keep as secure as possible by using HTTPS Everywhere.


There’s lots going on, watch this space to see developments. Thanks for reading this, and all the links. We’d love to know what you think.


Not long after this blog post was published we saw something very interesting.

Interesting DOI

That’s no DOI. We like interesting things, but they can panic us. This turned out to be a great example of why this kind of thing can be useful. A minute’s digging and we found the article edit:

Wikipedia typo

It turns out that this was a typo: someone put a title when they should have put in a DOI. And, as the event shows, this was removed from the Wikipedia article.


CrossRef’s DOI Event Tracker Pilot


CrossRef’s “DOI Event Tracker Pilot”- 11 million+ DOIs & 64 million+ events. You can play with it at:

Tracking DOI Events

So have you been wondering what we’ve been doing since we posted about the experiments we were conducting using PLOS’s open source ALM code? A lot, it turns out. About a week after our post, we were contacted by a group of our members from OASPA who expressed an interest in working with the system. Apparently they were all about to conduct similar experiments using the ALM code, and they thought that it might be more efficient and interesting if they did so together using our installation. Yippee. Publishers working together. That’s what we’re all about.

So we convened the interested parties and had a meeting to discuss what problems they were trying to solve and how CrossRef might be able to help them. That early meeting came to a consensus on a number of issues:

  • The group was interested in exploring the role CrossRef could play in providing an open, common infrastructure to track activities around DOIs; they were not interested in having CrossRef play a role in the value-add services of reporting on and interpreting the meaning of said activities.
  • The working group needed representatives from multiple stakeholders in the industry: not just open access publishers from OASPA, but subscription-based publishers, funders, researchers and third-party service providers as well.
  • That it was desirable to conduct a pilot to see if the proposed approach was both technically feasible and financially sustainable.

And so after that meeting, the “experiment” graduated to becoming a “pilot.” This CrossRef pilot is based on the premise that the infrastructure involved in tracking common information about “DOI events” can be usefully separated from the value-added services of analysing and presenting these events in the form of qualitative indicators. There are many forms of events and interactions which may be of interest. Service providers will wish to analyse, aggregate and present those in a range of different ways depending on the customer and their problem. The capture of the underlying events can be kept separate from those services.

In order to ensure that the CrossRef pilot is not mistaken for some sub rosa attempt to establish new metrics for evaluating scholarly output, we also decided to eschew any moniker that includes the word “metrics” or synonyms. So the “ALM Experiment” is dead. Long live the “DOI Event Tracker” (DET) pilot. Similarly PLOS’s open source “ALM software” has been resurrected under the name “Lagotto.”

The Technical Issues

CrossRef members are interested in knowing about “events” relating to the DOIs that identify their content. But our members face a now-classic problem. There are a large number of sources for scholarly publications (3k+ CrossRef members) and that list is still growing. Similarly, there are an unbounded number of potential sources for usage information. For example:

  • Supplemental and grey literature (e.g. data, software, working papers)
  • Orthogonal professional literature (e.g. patents, legal documents, governmental/NGO/IGO reports, consultation reports, professional trade literature).
  • Scholarly tools (e.g. citation management systems, text and data mining applications).
  • Secondary outlets for scholarly literature (institutional and disciplinary repositories, A&I services).
  • Mainstream media (e.g. BBC, New York Times).
  • Social media (e.g. Wikipedia, Twitter, Facebook, Blogs, Yo).

Finally, there is a broad and growing audience of stakeholders who are interested in seeing how the literature is being used. The audience includes publishers themselves as well as funders, researchers, institutions, policy makers and citizens.

Publishers (or other stakeholders) could conceivably each choose to run their own system to collect this information and redistribute it to interested parties. Or they could work with a vendor to do the same. But in either case, they would face the following problems:

  • The N sources will change. New ones will emerge. Old ones will vanish.
  • The N audiences will change. New ones will emerge. Old ones will vanish.
  • Each publisher/vendor will need to deal with N sources’ different APIs, rate limits, T&Cs, data licenses, etc. This is a logistical headache for both the publishers/vendors and for the sources.
  • Each audience will need to deal with N publisher/vendor APIs, rate limits, T&Cs, data licenses, etc. This is a logistical headache for both the audiences and for the publishers.
  • If publishers/vendors use different systems which in turn look at different sources, it will be difficult to compare or audit results across publishers/vendors.
  • If a journal moves from one publisher to another, then how are the metrics for that journal’s articles going to follow the journal?

And then there is the simple issue of scale. Most parties will be interested in comparing the data that they collect for their own content, with data about their competitors. Hence, if they all run their own system, they will each be querying much more than their own data. If, for example, just the commercial third-party providers were interested in collecting data covering the formal scholarly literature, they would each find themselves querying the same sources for the same 80 million DOIs. To put this into perspective, to refresh the data for 10 million DOIs once a month, would require sources to support ~ 14K API calls an hour. 60 million DOIs would require 100K API calls an hour. Current standard API caps for many of the sources that people are interested in querying hover around 2K per hour. We may see these sources lift that cap for exceptional cases, but they are unlikely to do so for many different clients all of whom are querying essentially the same thing.
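The polling arithmetic above is easy to check for yourself. A back-of-envelope sketch, assuming a 30-day month (the 60-million figure works out to roughly 83K calls an hour on that assumption, i.e. the order of magnitude quoted):

```python
# Back-of-envelope check of the polling load: refreshing n_dois
# once a month, spread evenly, assuming a 30-day month.
def calls_per_hour(n_dois, days=30):
    return n_dois / (days * 24)

print(round(calls_per_hour(10_000_000)))  # ~13,889 -- the "~14K/hour" figure
print(round(calls_per_hour(60_000_000)))  # ~83,333 -- the order of the 100K figure
```

Set against typical API caps of around 2K calls an hour, the mismatch is plain.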

These issues typify the “multiple bilateral relationships” problem that CrossRef was founded to try and ameliorate. When we have many organizations trying to access the exact same APIs to process the exact same data (albeit to different ends), then it seems likely that CrossRef could help make the process more efficient.

Piloting A Proposed Solution

The CrossRef DET pilot aims to show the feasibility of providing a hub for the collection, storage and propagation of DOI events from multiple sources to multiple audiences.

Data Collection

  • Pull: DET will collect DOI event data from sources that are of common interest to the membership, but which are unlikely to make special efforts to accommodate the scholarly communications industry. Examples of this class of source include large, broadly popular services like Facebook, Twitter, VK, Sina Weibo, etc.
  • Push: DET will allow sources to send DOI event data directly to CrossRef in one of three ways:
    • Standard Linkback: Using standards that are widely used on the web. This will automatically enable linkback-aware systems like WordPress, Moveable Type, etc. to alert DET to DOI events.
    • Scholarly Linkback: A to-be-defined augmented linkback-style API which will be optimized to work with scholarly resources and which will allow for more sophisticated payloads including other identifiers (e.g. ORCIDs, FundRefs), metadata, provenance information and authorization information. This system could be used by tools designed for scholarly communications. So, for example, it could be used by publisher platforms to distribute events related to downloads or comments within their discussion forums. It could also be used by third party scholarly apps like Zotero, Mendeley, Papers, Authorea, IRUS-UK, etc. in order to alert interested parties in events related to specific DOIs.
    • Redirect: DET will also be able to serve as a service discovery layer that will allow sources to push DOI event data directly to an appropriate publisher-controlled endpoint using the above scholarly linkback mechanism. This can be used by sources like repositories in order to send sensitive usage data directly to the relevant publishers.
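To make the push model concrete, here is a minimal sketch of what a collector might do with an incoming event. The payload field names here are purely hypothetical (the scholarly linkback API is, as noted above, still to be defined); the example DOI and the “uncite” action come from the Wikipedia deletion described earlier in this post:

```python
import json

# Hypothetical shape of a pushed DOI event. The field names are
# illustrative only -- the real push API was still being designed
# at the time of writing.
def parse_doi_event(raw):
    """Validate a pushed DOI event and return it as a dict."""
    event = json.loads(raw)
    for field in ("doi", "source", "action"):
        if field not in event:
            raise ValueError(f"missing required field: {field}")
    if event["action"] not in ("cite", "uncite"):
        raise ValueError(f"unknown action: {event['action']}")
    return event

example = '{"doi": "10.1001/archpediatrics.2011.832", "source": "wikipedia", "action": "uncite"}'
print(parse_doi_event(example)["action"])  # prints "uncite"
```

The point of the sketch is the inversion of control: the source decides when to send one small record, instead of the collector polling millions of DOIs on the off-chance one has changed.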

Data Propagation

Parties may want to use the DET in order to propagate information about DOI events. The system will support two broad data propagation patterns:

  • one-to-many: DOI events that are commonly harvested (pulled) by the DET system from a single source will be distributed freely to anybody who queries the DET API. Similarly, sources that push DOI events via the standard or scholarly linkback mechanisms, will also propagate their DOI events openly to anybody who queries the DET API. DOI events that are propagated in either of these cases will be kept and logged by the DET system along with appropriate provenance information. This will be the most common, default propagation model for the DET system.
  • one-to-one: Sources of DOI events can also report (push) DOI event data directly to the owner of the relevant DOI if the DOI owner provides and registers a suitable end-point with the DET system. In these cases, data sources seeking to report information relating to a DOI will be redirected (with a suitable 30X HTTP status and relevant headers) to the end-point specified by the DOI owner. The DET system will not keep the request or provenance information. The one-to-one propagation model is designed to handle use cases where the source of the DOI event has put restrictions on the data and will only share the DOI events with the owner (registrant) of the DOI. This use case may be used, for example, by aggregators or A&I services that want to report confidential data directly back to a publisher. The advantage of the redirect mechanism is that CrossRef is not put into the position of having to secure sensitive data, as said data will never reside on CrossRef systems.

Note that the two patterns can be combined. So, for example, a publisher might want to have public social media events reported to the DET and propagated accordingly, but also to have private third parties report confidential information directly to the publisher.

So Where Are We?

So to start with, the DET Working Group has grown substantially since the early days and we have representatives from a wide variety of stakeholders. The group includes:

  • Cameron Neylon, PLOS
  • Chris Shillum, Elsevier
  • Dom Mitchell, Co-action Publishing
  • Euan Adie, Altmetric
  • Jennifer Lin, PLOS
  • Juan Pablo Alperin, PKP
  • Kevin Dolby, Wellcome Trust
  • Liz Ferguson, Wiley
  • Maciej Rymarz, Mendeley
  • Mark Patterson, eLife
  • Martin Fenner, PLOS
  • Mike Thelwall, U Wolverhampton
  • Rachel Craven, BMC
  • Richard O’Beirne, OUP
  • Ruth Ivimey-Cook, eLife
  • Victoria Rao, Elsevier

As well as the usual contingent of CrossRef cat-herders including: Geoffrey Bilder, Rachael Lammey & Joe Wass.

When we announced the then-DET experiment, we said that one of the biggest challenges would be to create something that scaled to industry levels. At launch, we only loaded in about 317,500+ CrossRef DOIs representing publications from 2014 and we could see the system was going to struggle. Since then Martin Fenner and Jennifer Lin at PLOS have been focusing on making sure that the Lagotto code scales appropriately and now it is currently humming along with just over 11.5 million DOIs for which we’ve gathered over 64 million “events.” We aren’t worried about scalability on that front any more.

We’ve also shown that third parties should be able to access the API to provide value-added reporting and metrics. As a demonstration of this, PLOS configured a copy of its reporting software “Parascope” to point at the CrossRef DET instance. The next step we’re taking is to start testing the “push” API mechanism and the “point-to-point redirect” API mechanism. For the push API, we should have a really exciting demo available to show within the next few days. And on the point-to-point redirect, we have a sub-group exploring how the point-to-point redirect mechanism could potentially be used for reporting COUNTER stats as a complement to the SUSHI initiative.

The other major outstanding task we have before us is to calculate what the costs will be of running the DET system as a production service. In this case we expect to have some pretty accurate data to go on, as we will have had close to half a year of running the pilot with a non-trivial number of DOIs and sources. Note that the working group is concerned to ensure that the underlying data from the system remains open to all. Keeping this raw data open is seen as critical to establishing trust in the metrics and reporting systems that third parties build on the data. The group has also committed to leaving the creation of value-add services to third parties. As such we have been focusing on exploring business models based around service-level-agreement backed versions of the API to complement the free version of the same API. The free API will come with no guarantees of uptime, performance characteristics or support. For those users that depend on the API in order to deliver their services, we will offer paid-for SLA-backed versions of the free APIs. We can then configure our systems so that we can independently scale these SLA-backed APIs in order to meet SLA agreements.

Our goal is to have these calculations complete in time for the working group to make a recommendation to the CrossRef board meeting in July 2015.
Until then, we’ll use CrossTech as a venue for notifying people when we’ve hit new milestones or added new capabilities to the DET Pilot system.


Problems with doi.org on January 20th 2015: what we know.

Hell’s teeth.

So today (January 20th, 2015) the DOI HTTP resolver at doi.org started to fail intermittently around the world. The doi.org domain is managed by CNRI on behalf of the International DOI Foundation. This means that the problem affected all DOI registration agencies including CrossRef, DataCite, mEDRA etc. This also means that more popularly known end-user services like FigShare and Zenodo were affected. The problem has been fixed, but the fix will take some time to propagate throughout the DNS system. You can monitor the progress here:

Now for the embarrassing stuff…

At first lots of people were speculating that the problem had to do with somebody forgetting to renew the domain name. Our information from CNRI was that the problem had to do with a mistaken change to a DNS record and that the domain name wasn’t the issue. We corrected people who were reporting the domain name renewal as the cause, but eventually we learned that it was actually true. We have had it confirmed that the problem originated with CNRI manually renewing the domain name at the last minute. Ugh. CNRI will issue a statement soon. We’ll link to it as soon as they do. UPDATE (Jan 21st): CNRI has sent CrossRef a statement. They do not have it on their site yet, so we have included it below.

In the mean time, if you are having trouble resolving DOIs, a neat trick to know is that you can do so using the Handle system directly. For example:
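The trick works because a DOI is simply a handle with a “10.” prefix, so the same identifier can be resolved through the Handle system’s own proxy at hdl.handle.net. A minimal sketch (the example DOI is one cited earlier in this post):

```python
# A DOI is a handle, so when the doi.org resolver is unavailable the
# same identifier can be resolved via the Handle system's proxy.
def handle_url(doi):
    return "http://hdl.handle.net/" + doi

print(handle_url("10.1126/science.1252243"))
# prints "http://hdl.handle.net/10.1126/science.1252243"
```

Pasting the resulting URL into a browser should redirect to the same landing page the DOI resolver would have sent you to.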

CrossRef will, of course, also analyse what occurred, and issue a public report as well. Obviously, this report will include an analysis of how the outage affected DOI referrals to our members.

The amazingly cool thing is that everybody online has been very supportive and has helped us to diagnose the problem. Some have even said that the event underscores a point we often make about so-called “persistent identifiers”: they are not magic technology; the “persistence” is the result of a social contract. We like to say that CrossRef DOIs are as persistent as CrossRef staff. Well, to that phrase we have to add “and IDF staff” and “CNRI staff” and “ICANN staff”. It is turtles all the way down.

We don’t want to dismiss this event as an inevitable consequence of interdependent systems. And we don’t want to pass the buck. We need to learn something practical from this. How can we guard against this type of problem in the future? Again, people following this issue on Twitter have already been helping with suggestions and ideas. Can we crowd-source the monitoring of persistent identifier SLAs? Could we leverage Wikipedia, Wikidata or something similar to monitor critical identifiers and other infrastructure like PURLs, DOIs, handles, PMIDs, etc.? Should we be looking at designating special exceptions to the normal rules governing DNS names? Do we need to distribute the risk more? Or is it enough (*cough*) to simply ensure that somebody, somewhere in the dependency chain has enabled DNS protection and auto-renewal for critical infrastructure DNS names?

Truly, we are humbled. For all the redundancy built into our systems (multiple servers, multiple hosting sites, RAID drives, redundant power), we were undone by a simple administrative task. CrossRef, IDF and CNRI: we all feel a bit crap. But we’ll bounce back. We’ll fix things. And we’ll let you know how we do it.

We will update this space as we know more. We will also keep people updated on twitter on @CrossRefNews. And we will report back in detail as soon as we can.

CNRI Statement

"The domain name was inadvertently allowed to expire for a brief period this morning (Jan 20). It was reinstated shortly after 9am this morning as soon as the relevant CNRI employee learned of it. A reminder email sent earlier this month to renew the registration was apparently missed. We sincerely apologize for any difficulties this may have caused. The domain name has since been placed on automatic renewal, which should prevent any repeat of this event."


Linking data and publications

Do you want to see if a CrossRef DOI (typically assigned to publications) refers to DataCite DOIs (typically assigned to data)? Here you go:

Conversely, do you want to see if a DataCite DOI refers to CrossRef DOIs? Voilà:


“How can we effectively integrate data into the scholarly record?” This is the question that has, for the past few years, generated an unprecedented amount of handwringing on the part of researchers, librarians, funders and publishers. Indeed, this week I am in Amsterdam to attend the 4th RDA plenary in which this topic will no doubt again garner a lot of deserved attention.

We hope that the small example above will help push the RDA’s agenda a little further. Like the recent ODIN project, it illustrates how we can simply combine two existing scholarly infrastructure systems to build important new functionality for integrating research objects into the scholarly literature.

Does it solve all of the problems associated with citing and referring to data? Can the various workgroups at RDA just cancel their data citation sessions and spend the week riding bikes and gorging on croquettes? Of course not. But my guess is that by simply integrating DataCite and CrossRef in this way, we can make a giant push in the right direction.

There are certainly going to be differences between traditional citation and data citation. Some even claim that citing data isn’t “as simple as citing traditional literature.” But this is a caricature of traditional citation. If you believe this, go off and peruse the MLA, Chicago, Harvard, NLM and APA citation guides. Then read Anthony Grafton’s The Footnote. Are you back yet? Good, so let’s continue…

Citation of any sort is a complex issue, full of subtleties, edge-cases, exceptions, disciplinary variations and kludges. Historically, the way to deal with these edge-cases has been social, not technical. For traditional literature we have simply evolved and documented citation practices which generally make contextually-appropriate use of the same technical infrastructure (footnotes, endnotes, metadata, etc.). I suspect the same will be true in citing data. The solutions will not be technical, they will mostly be social. Researchers and publishers will evolve new, contextually appropriate mechanisms that use existing infrastructure to deal with the peculiarities of data citation.

Does this mean that we will never have to develop new systems to handle data citation? Possibly. But I don’t think we’ll know what those systems are or how they should work until we’ve actually had researchers attempting to use and adapt the tools we have.

Technical background

About five years ago, CrossRef and DataCite explored the possibility of exposing linkages between DataCite and CrossRef DOIs. Accordingly, we spent some time trying to assemble an example corpus that would illustrate the power of interlinking these identifiers. We encountered a slight problem. We could hardly find any examples. At that time, virtually nobody cited data with DataCite DOIs and, if they did, the CrossRef system did not handle them properly. We had to sit back and wait a while.

And now the situation has changed.

This demonstrator harvests DataCite DOIs using their OAI-PMH API and links them in a graph database with CrossRef DOIs. We have exposed this functionality on the “labs” (i.e. experimental) version of our REST API as a graph resource. So…

You can get a list of CrossRef DOIs that refer to DataCite DOIs as follows:*&filter=source:crossref,related-source:datacite

And the converse:*&filter=source:datacite,related-source:crossref

Caveats and Weasel Words

  • We have not finished indexing all the links.
  • The API is currently a very early labs project. It is about as reliable as a devolution promise from Westminster.
  • The API is run on a pair of Raspberry Pis connected to the internet via Bluetooth.
  • It is not fast.
  • The representation and the API is under active development.

Things will change. Watch the CrossRef Labs site for updates on this collaboration with DataCite.