DOI-like strings and fake DOIs
Crossref discourages our members from using DOI-like strings or fake DOIs.
<img class="alignnone wp-image-1850 size-thumbnail" src="/wp/blog/uploads/2016/06/prohibited-150x150.png" alt="discouraged" width="150" height="150" srcset="/wp/blog/uploads/2016/06/prohibited-150x150.png 150w, /wp/blog/uploads/2016/06/prohibited-300x300.png 300w, /wp/blog/uploads/2016/06/prohibited.png 729w" sizes="(max-width: 150px) 85vw, 150px" />
Recently we have seen quite a bit of debate around the use of so-called “fake DOIs.” We have also been quoted as saying that we discourage the use of “fake DOIs” or “DOI-like strings.” This post outlines some of the cases in which we’ve seen fake DOIs used and why we recommend against doing so.
Some of our members use DOI-like strings as internal identifiers for their manuscript tracking systems. These only get registered as real DOIs with Crossref once an article is published. This seems relatively harmless, except that, frequently, the unregistered DOI-like strings for unpublished content (e.g. manuscripts that are under review or were rejected) ‘escape’ into the public as well. People attempting to use these DOI-like strings get understandably confused and angry when they don’t resolve or otherwise work as DOIs. After years of experiencing the frustration that these DOI-like things cause, we have taken to recommending that our members not use DOI-like strings as their internal identifiers.
Using DOI-like strings in access control compliance applications
We’ve also had members use DOI-like strings as the basis for systems that they use to detect and block tools designed to bypass the member’s access control system and bulk-download content. The methods employed by our members have fallen into two broad categories:
- Spider (or robot) traps.
- Proxy bait.
<img class="alignnone wp-image-1849 size-thumbnail" src="/wp/blog/uploads/2016/06/web-150x150.png" alt="spider trap" width="150" height="150" srcset="/wp/blog/uploads/2016/06/web-150x150.png 150w, /wp/blog/uploads/2016/06/web-300x300.png 300w, /wp/blog/uploads/2016/06/web.png 729w" sizes="(max-width: 150px) 85vw, 150px" />
A “spider trap” is essentially a tripwire that allows a site owner to detect when a spider/robot is crawling their site to download content. The technique involves embedding a special trigger URL in a public page on a web site. The URL is embedded such that a normal user should not be able to see it or follow it, but an automated bot (aka “spider”) will detect it and follow it. The theory is that when one of these trap URLs is followed, the website owner can then conclude that the IP address from which it was followed harbours a bot and take action. Usually the action is to inform the organisation from which the bot is connecting and to ask them to block it. But sometimes triggering a spider trap has resulted in the IP address associated with it being instantly cut off. This, in turn, can affect an entire university’s access to said member’s content.
When a spider/bot trap includes a DOI-like string, we have seen some particularly pernicious problems, because such strings can trip up legitimate tools and activities as well. For example, a bibliographic management browser plugin might automatically extract DOIs and retrieve metadata on pages visited by a researcher. If the plugin were to pick up one of these spider-trap DOI-like strings, it might inadvertently trigger the researcher being blocked or, worse, the researcher’s entire university being blocked. In the past, this has even been a problem for Crossref itself. We periodically run tools to test DOI resolution and to ensure that our members are properly displaying DOIs, Crossmarks, and metadata as per their member obligations. We’ve occasionally been blocked when we ran across the spider traps as well.
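To make that failure mode concrete, here is a minimal sketch of how a naive extraction tool might behave. It is an illustration only: the page markup, the `/trap/` path, and both DOI-shaped strings are invented for this example, and the pattern is a rough approximation of DOI syntax rather than anything a particular plugin is known to use.

```python
import re
from html.parser import HTMLParser

# Rough DOI-shaped pattern (an assumption for illustration, not an official grammar).
DOI_LIKE = re.compile(r"10\.\d{4,9}/\S+")


class LinkCollector(HTMLParser):
    """Collects every href on a page, whether or not a human could see the link."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_doi_like(html):
    """Return DOI-shaped strings found in link targets, in page order."""
    collector = LinkCollector()
    collector.feed(html)
    found = []
    for href in collector.hrefs:
        found.extend(DOI_LIKE.findall(href))
    return found


# Hypothetical page: one visible article link and one hidden trap link.
# Both identifiers are made up and are not meant to be resolved.
PAGE = """
<a href="https://doi.org/10.99999/visible.article.1">Read the article</a>
<a href="/trap/10.99999/hidden.trap.2" style="display:none"></a>
"""

print(extract_doi_like(PAGE))
```

The extractor returns both strings. It works on the markup rather than on what the researcher can see, so the hidden trap looks no different from a real citation, and any follow-up request the tool makes for it would come from the researcher's own network.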
<img class="alignnone wp-image-1848 size-thumbnail" src="/wp/blog/uploads/2016/06/bait-150x150.png" alt="proxy bait" width="150" height="150" srcset="/wp/blog/uploads/2016/06/bait-150x150.png 150w, /wp/blog/uploads/2016/06/bait-300x300.png 300w, /wp/blog/uploads/2016/06/bait.png 729w" sizes="(max-width: 150px) 85vw, 150px" />
Using proxy bait is similar to using a spider trap, but it has an important difference. It does not involve embedding specially crafted DOI-like strings on the member’s website itself. The DOI-like strings are instead fed directly to tools designed to subvert the member’s access control systems. These tools, in turn, use proxies on a subscriber’s network to retrieve the “bait” DOI-like string. When the member sees one of these special DOI-like strings being requested from a particular institution, they then know that said institution’s network harbours a proxy. In theory this technique never exposes the DOI-like strings to the public and automated tools should not be able to stumble upon them. However, recently one of our members had some of these DOI-like strings “escape” into the public and at least one of them was indexed by Google. The problem was compounded because people clicking on these DOI-like strings sometimes ended up having their university’s IP address banned from the member’s web site. As you can imagine, there has been a lot of gnashing of teeth. We are convinced, in this case, that the member was doing their best to make sure the DOI-like strings never entered the public. But they did nonetheless. We think this just underscores how hard it is to ensure DOI-like strings remain private and why we recommend our members not use them.
Pedantry and terminology
Notice that we have not used the phrase “fake DOI” yet. This is because, internally, at least, we have distinguished between “DOI-like strings” and “fake DOIs.” The terminology might be daft, but it is what we’ve used in the past and some of our members at least will be familiar with it. We don’t expect anybody outside of Crossref to know this.
To us, the following is not a DOI:
It is simply a string of alphanumeric characters that copies the DOI syntax. We call such strings “DOI-like strings.” It is not registered with any DOI registration agency and one cannot look up metadata for it. If you try to “resolve” it, you will simply get an error. Here, you can try it. Don’t worry, clicking on it will not disable access for your university.
The following is what we have sometimes called a “fake DOI”:
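The gap between looking like a DOI and being one can be shown with a short sketch. The pattern below is an approximation of common DOI syntax chosen for illustration, and the sample strings are invented; neither comes from Crossref's own systems.

```python
import re

# Approximate DOI syntax: "10.", a 4-9 digit registrant code, "/", then a suffix.
# This is an assumption for illustration, not an official grammar.
DOI_SYNTAX = re.compile(r"^10\.\d{4,9}/[-._;()/:A-Za-z0-9]+$")


def looks_like_doi(candidate):
    """True if the string is DOI-shaped. This says nothing about registration."""
    return bool(DOI_SYNTAX.match(candidate))


# Made-up strings: the first is DOI-shaped, the second is not.
print(looks_like_doi("10.99999/internal-tracking-0042"))
print(looks_like_doi("manuscript-0042"))
```

A syntax check like this accepts any well-formed string, so a tool that relies on it alone cannot tell a registered DOI from a DOI-like string. Only a resolver or a registration agency can answer that question.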
It is registered with Crossref, resolves to a fake article in a fake journal called The Journal of Psychoceramics (the study of Cracked Pots) run by a fictitious author (Josiah Carberry) who has a fake ORCID (http://orcid.org/0000-0002-1825-0097) but who is affiliated with a real university (Brown University).
Again, you can try it.
And you can even look up metadata for it.
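One way to do such a lookup programmatically is through the Crossref REST API. The sketch below assumes the public `works` endpoint at `api.crossref.org` and its JSON envelope with a `message` field; the function names are our own, and the DOI you pass in is up to you.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: the public Crossref REST API route for a single work.
API_ROOT = "https://api.crossref.org/works/"


def metadata_url(doi):
    """Build the lookup URL for one DOI, percent-encoding unsafe characters."""
    return API_ROOT + urllib.parse.quote(doi)


def fetch_metadata(doi, timeout=10):
    """Fetch the metadata record for a registered DOI.

    For a string that is not registered with Crossref, urlopen raises
    urllib.error.HTTPError, since the API answers with an error status.
    """
    with urllib.request.urlopen(metadata_url(doi), timeout=timeout) as response:
        return json.load(response)["message"]
```

Calling `fetch_metadata` on a registered DOI returns its metadata record as a dictionary. Calling it on an unregistered DOI-like string raises an error instead, which is the programmatic version of the failed click described earlier.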
Our dirty little secret is that this “fake DOI” was registered and is controlled by Crossref.
Why does this exist? Aren’t we subverting the scholarly record? Isn’t this awful? Aren’t we at the very least hypocrites? And how does a real university feel about having this fake author and journal associated with them?
Well, the DOI is using a prefix that we use for testing. It follows a long tradition of test identifiers starting with “5”. Fake phone numbers in the US start with “555”. Many credit card companies reserve fake numbers starting with “5”. For example, Mastercard’s are “5555555555554444” and “5105105105105100.”
We have created this fake DOI, the fake journal and the fake ORCID so that we can test our systems and demonstrate interoperable features and tools. The fake author, Josiah Carberry, is a long-running joke at Brown University. He even has a Wikipedia entry. There are also a lot of other DOIs under the test prefix “5555.”
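Because test records share one prefix, a tool that consumes DOIs can set them aside with a simple check. This is a minimal sketch: it assumes the test prefix is written in full as `10.5555`, and the function name and sample strings are hypothetical.

```python
# Assumed full form of the test prefix named in the post ("5555").
TEST_PREFIX = "10.5555"


def is_test_doi(doi):
    """True if the DOI sits under the test prefix and has a non-empty suffix."""
    prefix, _, suffix = doi.partition("/")
    return prefix == TEST_PREFIX and suffix != ""


# Made-up suffixes, for illustration only.
print(is_test_doi("10.5555/some-test-record"))
print(is_test_doi("10.55550/some-other-record"))
```

Comparing the whole prefix, rather than checking whether the string merely starts with `10.5555`, keeps a longer prefix such as `10.55550` from being mistaken for the test one.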
We acknowledge that the term “fake DOI” might not be the best in this case, but it is a term we’ve used internally at least and it is worth distinguishing it from the case of DOI-like strings mentioned above.
But back to the important stuff….
As far as we know, none of our members has ever registered a “fake DOI” (as defined above) in order to detect and prevent the circumvention of their access control systems. If they had, we would consider it much more serious than the mere creation of DOI-like strings. The information associated with registered DOIs becomes part of the persistent scholarly citation record. Many, many third party systems and tools make use of our API and metadata, including bibliographic management tools, TDM tools, CRIS systems, altmetrics services, etc. It would be a very bad thing if people started to worry that the legitimate use of registered DOIs could inadvertently block them from accessing content. Crossref DOIs are designed to encourage discovery and access, not to block them.
And again, we have absolutely no evidence that any of our members has registered fake DOIs.
But just in case, we will continue to discourage our members from using DOI-like strings and/or registering fake DOIs.
This has been a public service announcement from the identifier dweebs at Crossref.
Unless otherwise noted, included images purchased from The Noun Project