Linking data and publications

Do you want to see if a CrossRef DOI (typically assigned to publications) refers to DataCite DOIs (typically assigned to data)? Here you go:

http://api.labs.crossref.org/graph/doi/10.4319/lo.1997.42.1.0001

Conversely, do you want to see if a DataCite DOI refers to CrossRef DOIs? Voilà:

http://api.labs.crossref.org/graph/doi/10.1594/pangaea.185321

Background

“How can we effectively integrate data into the scholarly record?” This is the question that has, for the past few years, generated an unprecedented amount of handwringing on the part of researchers, librarians, funders and publishers. Indeed, this week I am in Amsterdam to attend the 4th RDA plenary in which this topic will no doubt again garner a lot of deserved attention.

We hope that the small example above will help push the RDA’s agenda a little further. Like the recent ODIN project, it illustrates how we can simply combine two existing scholarly infrastructure systems to build important new functionality for integrating research objects into the scholarly literature.

Does it solve all of the problems associated with citing and referring to data? Can the various workgroups at RDA just cancel their data citation sessions and spend the week riding bikes and gorging on croquettes? Of course not. But my guess is that by simply integrating DataCite and CrossRef in this way, we can make a giant push in the right direction.

There are certainly going to be differences between traditional citation and data citation. Some even claim that citing data isn’t “as simple as citing traditional literature.” But this is a caricature of traditional citation. If you believe this, go off and peruse the MLA, Chicago, Harvard, NLM and APA citation guides. Then read Anthony Grafton’s The Footnote. Are you back yet? Good, so let’s continue…

Citation of any sort is a complex issue, full of subtleties, edge cases, exceptions, disciplinary variations and kludges. Historically, the way to deal with these edge cases has been social, not technical. For traditional literature we have simply evolved and documented citation practices which generally make contextually appropriate use of the same technical infrastructure (footnotes, endnotes, metadata, etc.). I suspect the same will be true in citing data. The solutions will not be technical; they will mostly be social. Researchers and publishers will evolve new, contextually appropriate mechanisms to use existing infrastructure to deal with the peculiarities of data citation.

Does this mean that we will never have to develop new systems to handle data citation? Possibly. But I don’t think we’ll know what those systems are or how they should work until we’ve actually had researchers attempting to use and adapt the tools we have.

Technical background

About five years ago, CrossRef and DataCite explored the possibility of exposing linkages between DataCite and CrossRef DOIs. Accordingly, we spent some time trying to assemble an example corpus that would illustrate the power of interlinking these identifiers. We encountered a slight problem. We could hardly find any examples. At that time, virtually nobody cited data with DataCite DOIs and, if they did, the CrossRef system did not handle them properly. We had to sit back and wait a while.

And now the situation has changed.

This demonstrator harvests DataCite DOIs using their OAI-PMH API and links them in a graph database with CrossRef DOIs. We have exposed this functionality on the “labs” (i.e. experimental) version of our REST API as a graph resource. So…

You can get a list of CrossRef DOIs that refer to DataCite DOIs as follows:

http://api.labs.crossref.org/graph?rel=cites:*&filter=source:crossref,related-source:datacite

And the converse:

http://api.labs.crossref.org/graph?rel=cites:*&filter=source:datacite,related-source:crossref
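The example URLs in this post follow two patterns: one resource per DOI, and one filtered list of citation links. A script might assemble them roughly as below. This is only a sketch: the helper names are hypothetical, the base address and parameter strings are copied verbatim from the URLs above, and nothing here assumes anything about what the labs service returns (or that it will still be there next week).

```python
from urllib.parse import quote

# Base of the experimental graph resource, as it appears in the URLs above.
GRAPH_BASE = "http://api.labs.crossref.org/graph"


def doi_graph_url(doi):
    """Graph resource for a single DOI (hypothetical helper name)."""
    # Leave "/" unescaped so the DOI reads exactly as in the examples.
    return f"{GRAPH_BASE}/doi/{quote(doi, safe='/')}"


def citation_list_url(source, related_source):
    """List of DOIs from `source` that cite DOIs from `related_source`."""
    # "rel" and "filter" are copied from the example queries, not from a spec.
    return (f"{GRAPH_BASE}?rel=cites:*"
            f"&filter=source:{source},related-source:{related_source}")


if __name__ == "__main__":
    print(doi_graph_url("10.4319/lo.1997.42.1.0001"))
    print(citation_list_url("crossref", "datacite"))
    print(citation_list_url("datacite", "crossref"))
```

Swapping the two arguments of `citation_list_url` gives the converse query.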

Caveats and Weasel Words

  • We have not finished indexing all the links.
  • The API is currently a very early labs project. It is about as reliable as a devolution promise from Westminster.
  • The API is run on a pair of Raspberry Pis connected to the internet via Bluetooth.
  • It is not fast.
  • The representation and the API are under active development.

    Things will change. Watch the CrossRef Labs site for updates on this collaboration with DataCite.

Citation needed

Remember when I said that the Wikipedia was the 8th largest referrer of DOI links to published research? This despite only a fraction of eligible references in the free encyclopaedia using DOIs.

We aim to fix that. CrossRef and Wikimedia are launching a new initiative to better integrate scholarly literature in the world’s largest public knowledge space, Wikipedia.

This work will help promote standard links to scholarly references within Wikipedia, links which persist over time because they consistently use DOIs and other citation identifiers. CrossRef will support the development and maintenance of Wikipedia’s citation tools. This work will include bug fixes and performance improvements for existing tools, extending the tools to enable Wikipedia contributors to more easily look up and insert DOIs, and providing a “linkback” mechanism that alerts relevant parties when a persistent identifier is used in a Wikipedia reference.

In addition, CrossRef is creating the role of Wikimedia Ambassador (modeled after Wikimedian-in-Residence) to act as liaison with the Wikimedia community, promote use of scholarly references on Wikipedia, and educate about DOIs and other scholarly identifiers (ORCIDs, PubMed IDs, DataCite DOIs, etc) across Wikimedia projects.

Starting today, CrossRef will be working with Daniel Mietchen to coordinate CrossRef’s Wikimedia-related activities. Daniel’s team will be composed of Max Klein and Matt Senate, who will work to enhance Wikimedia citation tools, and will share the role of Wikimedia Ambassador with Dorothy Howard.

Since the beginnings of Wikipedia, Daniel Mietchen has worked to integrate scholarly content into Wikimedia projects. He is part of an impressive community of active Wikipedians and developers who have worked extensively on linking Wikipedia articles to the formal literature and other scholarly resources. We’ve been talking to him about this project for nearly a year, and are happy to finally get it off the ground.

–G

Matt, Max and Daniel at #wikimania2014. Photo by Dorothy.

Many Metrics. Such Data. Wow.

CrossRef Labs loves to be the last to jump on an internet trend, so what better than to combine the Doge meme with altmetrics?

Want to know how many times a CrossRef DOI is cited by the Wikipedia?

http://alm.labs.crossref.org/articles/info:doi/10.1371/journal.pone.0086859

Or how many times it has been mentioned in Europe PubMed Central?

http://alm.labs.crossref.org/articles/info:doi/10.5860/choice.51-3037

Or DataCite?

http://alm.labs.crossref.org/articles/info:doi/10.1111/jeb.12289
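All three links share one shape: the instance address, then `info:doi/`, then the DOI. A minimal sketch of generating them, assuming nothing beyond the URLs shown above (the function name is made up for illustration):

```python
# Address of the experimental ALM instance, as it appears in the links above.
ALM_ARTICLES = "http://alm.labs.crossref.org/articles"


def alm_article_url(doi):
    """Article page on the labs ALM instance for one DOI (hypothetical helper)."""
    return f"{ALM_ARTICLES}/info:doi/{doi}"


if __name__ == "__main__":
    for doi in ("10.1371/journal.pone.0086859",
                "10.5860/choice.51-3037",
                "10.1111/jeb.12289"):
        print(alm_article_url(doi))
```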

Background

Back in 2011 PLOS released its awesome ALM system as open source software (OSS). At CrossRef Labs, we thought it might be interesting to see what would happen if we ran our own instance of the system and loaded it up with a few CrossRef DOIs. So we did. And the code fell over. Oops. Somehow it didn’t like dealing with 10 million DOIs. Funny that.

But the beauty of OSS is that we were able to work with PLOS to scale the code to handle our volume of data. CrossRef contracted with Cottage Labs and we both worked with PLOS to make changes to the system. These eventually got fed back into the main ALM source on GitHub. Now everybody benefits from our work. Yay for OSS.

So if you want to know technical details, skip to Details for Propellerheads. But if you want to know why we did this, and what we plan to do with it, read on.

Why?

There are (cough) some problems in our industry that we can best solve with shared infrastructure. When publishers first put scholarly content online, they used to make bilateral reference linking agreements. These agreements allowed them to link citations using each other’s proprietary reference linking APIs. But this system didn’t scale. It was too time-consuming to negotiate all the agreements needed to link to other publishers. And linking through many proprietary citation APIs was too complex and too fragile. So the industry founded CrossRef to create a common, cross-publisher citation linking API. CrossRef has since obviated the need for bilateral linking arrangements.

So-called altmetrics look like they might have similar characteristics. You have ~4000 CrossRef member publishers and N sources (e.g. Twitter, Mendeley, Facebook, CiteULike, etc.) where people use (e.g. discuss, bookmark, annotate, etc.) scholarly publications. Publishers could conceivably each choose to run their own system to collect this information. But if they did, they would face the following problems:

  • The N sources will be volatile. New ones will emerge. Old ones will vanish.
  • Each publisher will need to deal with each source’s different APIs, rate limits, T&Cs, data licenses, etc. This is a logistical headache for both the publishers and for the sources.
  • If publishers use different systems which in turn look at different sources, it will be difficult to compare results across publishers.
  • If a journal moves from one publisher to another, then how are the metrics for that journal’s articles going to follow the journal?

This isn’t a complete list, but it shows that there might be some virtue in publishers sharing an infrastructure for collecting this data. But what about commercial providers? Couldn’t they provide these ALM services? Of course – and some of them currently do. But normally they look on the actual collection of this data as a means to an end. The real value they provide is in the analysis, reporting and tools that they build on top of the data. CrossRef has no interest in building front-ends to this data. If there is a role for us to play here, it is simply in the collection and distribution of the data.

No, really, WHY?

Aren’t these altmetrics an ill-conceived and meretricious idea? By providing this kind of information, isn’t CrossRef just encouraging feckless, neoliberal university administrators to hasten academia’s slide into a Stakhanovite dystopia? Can’t these systems be gamed?

FOR THE LOVE OF FSM, WHY IS CROSSREF DABBLING IN SOMETHING OF SUCH QUESTIONABLE VALUE?

takes deep breath. wipes spittle from beard

These are all serious concerns. Goodhart’s Law and all that… If a university’s appointments and promotion committee is largely swayed by Impact Factor, it won’t improve a thing if they substitute or supplement Impact Factor with altmetrics. As Amy Brand has repeatedly pointed out, the best institutions simply don’t use metrics this way at all (PowerPoint presentation). They know better.

But yes, it is still likely that some powerful people will come to lazy conclusions based on altmetrics. And following that, other lazy, unscrupulous and opportunistic people will attempt to game said metrics. We may even see an industry emerge to exploit this mess and provide the scholarly equivalent of SEO. Feh. Now I’m depressed and I need a drink.

So again, why is CrossRef doing this? Though we have our doubts about how effective altmetrics will be in evaluating the quality of content, we do believe that they are a useful tool for understanding how scholarly content is used and interpreted. The most eloquent arguments against altmetrics for measuring quality inadvertently make the case for altmetrics as a tool for monitoring attention.

Critics of altmetrics point out that much of the attention that research receives outside of formal scholarly communications channels can be ascribed to:

  • Puffery. Researchers and/or university/publisher “PR wonks” over-promoting research results.
  • Innocent misinterpretation. A lay audience simply doesn’t understand the research results.
  • Deliberate misinterpretation. Ideologues misrepresent research results to support their agendas.
  • Salaciousness. The research appears to be about sex, drugs, crime, video games or other popular bogeymen.
  • Neurobollocks. A category unto itself these days.

In short, scholarly research might be misinterpreted. Shock horror. Ban all metrics. Whew. That won’t happen again.

Scholarly research has always been discussed outside of formal scholarly venues. Both by scholars themselves and by interested laity. Sometimes these discussions advance the scientific cause. Sometimes they undermine it. The University of Utah didn’t depend on widespread Internet access or social networks to promote yet-to-be peer-reviewed claims about cold fusion. That was just old-fashioned analogue puffery. And the Internet played no role in the Laetrile or DMSO crazes of the 1980s. You see, there were once these things called “newspapers.” And another thing called “television.” And a sophisticated meatspace-based social network called a “town square.”

But there are critical differences between then and now. As citizens get more access to the scholarly literature, it is far more likely that research is going to be discussed outside of formal scholarly venues. Now we can build tools to help researchers track these discussions. Now researchers can, if they need to, engage in the conversations as well. One would think that conscientious researchers would see it as their responsibility to remain engaged, to know how their research is being used. And especially to know when it is being misused.

That isn’t to say that we expect researchers will welcome this task. We are no Pollyannas. Researchers are already famously overstretched. They barely have time to keep up with the formally published literature. It seems cruel to expect them to keep up with the firehose of the Internet as well.

Which gets us back to the value of altmetrics tools. Our hope is that, as altmetrics tools evolve, they will provide publishers and researchers with an efficient mechanism for monitoring the use of their content in non-traditional venues. Just in the way that citations were used before they were distorted into proxies for credit and kudos.

We don’t think altmetrics are there yet. Partly because some parties are still tantalized by the prospect of usurping one metric for another. But mostly because the entire field is still nascent. People don’t yet know how the information can be combined and used effectively. So we still make naive assumptions such as “link=like” and “more=better.” Surely it will eventually occur to somebody that, instead, there may be a connection between repeated headline-grabbing research and academic fraud. A neuroscientist might be interested in a tool that alerts them if the MRI scans in their research paper are being misinterpreted on the web to promote neurobollocks. An immunologist may want to know if their research is being misused by the anti-vaccination movement. Perhaps the real value in gathering this data will be seen when somebody builds tools to help researchers DETECT puffery, social-citation cabals, and misinterpretation of research results?

But CrossRef won’t be building those tools. What we might be able to do is help others overcome another hurdle that blocks the development of more sophisticated tools; getting hold of the needed data in the first place. This is why we are dabbling in altmetrics.

Wikipedia is already the 8th largest referrer of CrossRef DOIs. Note that this doesn’t just mean that the Wikipedia cites lots of CrossRef DOIs, it means that people actually click on and follow those DOIs to the scholarly literature. As scholarly communication transcends traditional outlets and as the audience for scholarly research broadens, we think that it will be more important for publishers and researchers to be aware of how their research is being discussed and used. They may even need to engage more with non-scholarly audiences. In order to do this, they need to be aware of the conversations. CrossRef is providing this experimental data source in the hope that we can spur the development of more sophisticated tools for detecting and analyzing these conversations. Thankfully, this is an inexpensive experiment to conduct – largely thanks to the decision on the part of PLOS to open source its ALM code.

What Now?

CrossRef’s instance of PLOS’s ALM code is an experiment. We mentioned that we had encountered scalability problems and that we had resolved some of them. But there are still big scalability issues to address. For example, assuming a response time of 1 second, if we wanted to poll the English-language version of the Wikipedia to see what had cited each of the 65 million DOIs held in CrossRef, the process would take years to complete. But this is how the system is designed to work at the moment. It polls various source APIs to see if a particular DOI is “mentioned”. Parallelizing the queries might reduce the amount of time it takes to poll the Wikipedia, but it doesn’t reduce the work. Another obvious way in which we could improve the scalability of the system is to add a push mechanism to supplement the pull mechanism. Instead of going out and polling the Wikipedia 65 million times, we could establish a “scholarly linkback” mechanism that would allow third parties to alert us when DOIs and other scholarly identifiers are referenced (e.g. cited, bookmarked, shared). If the Wikipedia used this, then even in the most extreme scenario (i.e. every article in Wikipedia cites at least one CrossRef DOI), we would only need to process ~4 million trackbacks.
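For the arithmetically inclined, here is the back-of-the-envelope version of that estimate. The figures (65 million DOIs, 1 second per request, roughly 4 million linkbacks) come from the paragraph above; the function name and the choice of 10 parallel workers are purely illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def polling_days(requests, seconds_per_request=1.0, workers=1):
    """Wall-clock days needed to issue `requests` sequential-ish API calls."""
    return requests * seconds_per_request / workers / SECONDS_PER_DAY


DOIS = 65_000_000       # DOIs held in CrossRef, one poll each
LINKBACKS = 4_000_000   # upper bound on pushes from the English Wikipedia

if __name__ == "__main__":
    serial = polling_days(DOIS)
    print(f"one worker:  {serial:.1f} days ({serial / 365:.2f} years)")
    # Ten workers finish sooner but still make all 65 million requests.
    print(f"ten workers: {polling_days(DOIS, workers=10):.1f} days")
    print(f"pull vs push: {DOIS / LINKBACKS:.2f}x as many events to handle")
```

At one request per second a single pass over one source takes a little over two years, which is the point: parallelism shortens the wait, but only a push mechanism shrinks the number of events.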

The other significant advantage of adding a push API is that it would take the burden off of CrossRef to know what sources we want to poll. At the moment, if a new source comes online, we’d need to know about it and build a custom plugin to poll their data. This needlessly disadvantages new tools and services as it means that their data will not be gathered until they are big enough for us to pay attention to. If the service in question addresses a niche of the scholarly ecosystem, they may never become big enough. But if we allow sources to push data to us using a common infrastructure, then new sources do not need to wait for us to take notice before they can participate in the system.

Supporting (potentially) many new sources will raise another technical issue: tracking and maintaining the provenance of the data that we gather. The current ALM system does a pretty good job of keeping data, but if we ever want third parties to be able to rely on the system, we probably need to extend the provenance information so that the data is cheaply and easily auditable.

Perhaps the most important thing we want to learn from running this experimental ALM instance is what it would take to run the system as a production service. What technical resources would it require? How could they be supported? And from this we hope to gain enough information to decide whether the service is worth running and, if so, by whom. CrossRef is just one of several organizations that could run such a service, but it is not clear if it would be the best one. We hope that as we work with PLOS, our members and the rest of the scholarly community, we’ll get a better idea of how such a service should be governed and sustained.
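To make the push-plus-provenance idea a little more concrete, here is what a single “scholarly linkback” record might carry. To be clear, no such format exists yet: the class, its field names and the one-JSON-line-per-event audit log are all invented for this sketch and are not part of the ALM code.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def _utc_now():
    """Timestamp recorded when the notification arrives."""
    return datetime.now(timezone.utc).isoformat()


@dataclass(frozen=True)
class Linkback:
    """One pushed notification that an identifier was referenced (hypothetical)."""
    doi: str        # the identifier that was referenced
    referrer: str   # where the reference occurred
    relation: str   # e.g. "cited", "bookmarked", "shared"
    pushed_by: str  # which source told us: the provenance of the claim
    received_at: str = field(default_factory=_utc_now)


def audit_line(event):
    """Serialise an event as one JSON line so it can be appended to a log."""
    return json.dumps(asdict(event), sort_keys=True)


if __name__ == "__main__":
    event = Linkback(
        doi="10.1038/nature12990",
        referrer="http://en.wikipedia.org/wiki/HE0107-5240",
        relation="cited",
        pushed_by="wikipedia",
    )
    print(audit_line(event))
```

Keeping who pushed the record and when alongside the claim itself is the cheap part of auditability; deciding whom to believe is the hard part.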

Details for Propellerheads

Warning, Caveats and Weasel Words

The CrossRef ALM instance is a CrossRef Labs project. It is running on R&D equipment in a non-production environment administered by an orangutan on a diet of Red Bull and vodka.

So what is working?

The system has been initially loaded with 317,500+ CrossRef DOIs representing publications from 2014. We will load more DOIs in reverse chronological order until we get bored or until the system falls over again.

We have activated the following sources:

  • PubMed
  • DataCite
  • PubMedCentral Europe Citations and Usage

We have data from the following sources, but they will need some work to achieve stability:

  • Facebook
  • Wikipedia
  • CiteULike
  • Twitter
  • Reddit

Some of them are faster than others. Some are more temperamental than others. WordPress, for example, seems to go into a sulk and shut itself off after approximately 1,300 API calls.

In any case, we will be monitoring and tweaking the sources as we gather data. We will also add new sources as we get requested API keys. We will probably even create one or two new sources ourselves. Watch this blog and we’ll update you as we add/tweak sources.

Dammit, shut up already and tell me how to query stuff.

You can log in to the CrossRef ALM instance simply using a Mozilla Persona (yes, we’d eventually like to support ORCID too). Once logged in, your account page will list an API key. Using the API key, you can do things like:

http://alm.labs.crossref.org/api/v3/articles?api_key=API_KEY&ids=10.1038/nature12990

And you will see that (as of this writing), said Nature article has been cited by the Wikipedia article here:

http://en.wikipedia.org/wiki/HE0107-5240
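If you would rather script it than paste URLs into a browser, something like the following should do. It is a sketch under stated assumptions: the endpoint and the `api_key` and `ids` parameters are taken from the example URL above, the function names are ours, and the response is returned as raw text because we make no promises here about its structure.

```python
import urllib.request
from urllib.parse import urlencode

# Endpoint copied from the example query above.
API_ARTICLES = "http://alm.labs.crossref.org/api/v3/articles"


def alm_api_url(api_key, doi):
    """Build the query URL for one DOI (hypothetical helper)."""
    # safe="/" keeps the DOI's slash readable, as in the example URL.
    query = urlencode({"api_key": api_key, "ids": doi}, safe="/")
    return f"{API_ARTICLES}?{query}"


def fetch_raw(api_key, doi, timeout=30):
    """Fetch the response body as text; parsing is left to the caller."""
    with urllib.request.urlopen(alm_api_url(api_key, doi), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Replace API_KEY with the key listed on your account page.
    print(alm_api_url("API_KEY", "10.1038/nature12990"))
```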

PLOS has provided lovely detailed instructions for using the API.

So, please, play with the API and see what you make of it. On our side we will be looking at how we can improve performance and expand coverage. We don’t promise much; the logistics here are formidable. As we said above, once you start working with millions of documents, the polling process starts to hit API walls quickly. But that is all part of the experiment. We appreciate your helping us and would like your feedback. We can be contacted at:

 labs_email

 

DOIs unambiguously and persistently identify published, trustworthy, citable online scholarly literature. Right?

The South Park movie, “Bigger, Longer & Uncut,” has a DOI:

a) http://dx.doi.org/10.5240/B1FA-0EEC-C316-3316-3A73-L

So does the pornographic movie, “Young Sex Crazed Nurses”:

b) http://dx.doi.org/10.5240/4CF3-57AB-2481-651D-D53D-Q

And the following DOI points to a fake article on a “Google-Based Alien Detector”:

c) http://dx.doi.org/10.6084/m9.figshare.93964

And the following DOI refers to an infamous fake article on literary theory:

d) http://dx.doi.org/10.2307/466856

This scholarly article discusses the entirely fictitious Australian “Drop Bear”:

e) http://dx.doi.org/10.1080/00049182.2012.731307

The following two DOIs point to the same article: the first DOI points to the final author version, and the second DOI points to the final published version:

f) http://dx.doi.org/10.6084/m9.figshare.96546

g) http://dx.doi.org/10.1007/s10827-012-0416-6

The following two DOIs point to the same article, and there is no apparent difference between the two copies:

h) http://dx.doi.org/10.6084/m9.figshare.91541

i) http://dx.doi.org/10.1038/npre.2012.7151.1

Another example where two DOIs point to the same article and there is no apparent difference between the two copies:

j) http://dx.doi.org/10.1364/AO.39.005477

k) http://dx.doi.org/10.3929/ethz-a-005707391

These journals assigned DOIs, but not through CrossRef:

l) http://dx.doi.org/10.3233/BIR-2008-0496

m) http://dx.doi.org/10.6084/m9.figshare.95564

n) http://dx.doi.org/10.3205/cto000081

These two DOIs are assigned to two different data sets by two different RAs:

o) http://dx.doi.org/10.1107/S0108767312019034/eo5016sup1.xls

p) http://dx.doi.org/10.1594/PANGAEA.726855

This DOI appears to have been published, but was not registered until well after it was published. There were 254 unsuccessful attempts to resolve it in September 2012 alone:

q) http://dx.doi.org/10.4233/uuid:995dd18a-dc5d-4a9a-b9eb-a16a07bfcc6d

The owner of the prefix ‘10.4233’, who is responsible for the above DOI, had 378,790 attempted resolutions in September 2012, of which 377,001 were failures. The top 10 DOI failures for this prefix each garnered over 200 attempted resolutions. As of November 2012 the prefix had only registered 349 DOIs.

Of the above 17 example DOIs, 11 cannot be used for CrossCheck or CrossMark and 3 cannot be used with content negotiation. To search metadata for the above examples, you need to visit four sites:
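Put as a rate, those resolution figures are even starker. The two numbers below are the September 2012 counts quoted above; nothing else is assumed.

```python
attempted = 378_790  # attempted resolutions, September 2012
failed = 377_001     # of which failed

succeeded = attempted - failed
failure_rate = failed / attempted

if __name__ == "__main__":
    print(f"{succeeded:,} resolutions succeeded")
    print(f"{failure_rate:.1%} of attempts failed")
```

In other words, fewer than two thousand of nearly 380,000 attempts actually reached anything.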

http://search.crossref.org

https://ui.eidr.org/search

https://www.medra.org/en/search.htm

http://search.datacite.org/ui

The 17 examples come from just 4 of the 8 existing DOI registration agencies (RAs). It is virtually impossible for somebody without specialized knowledge to tell which DOIs are CrossRef DOIs and which ones are not.
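To see why, try the only thing a non-specialist can easily do with a DOI: read off its prefix. The sketch below groups examples (a) to (q) by prefix, using the links exactly as listed above; the helper names are ours. The output is a set of opaque numbers, none of which says which RA stands behind it.

```python
from collections import defaultdict
from urllib.parse import urlparse

# The example DOI links, copied from the list above.
EXAMPLES = [
    ("a", "http://dx.doi.org/10.5240/B1FA-0EEC-C316-3316-3A73-L"),
    ("b", "http://dx.doi.org/10.5240/4CF3-57AB-2481-651D-D53D-Q"),
    ("c", "http://dx.doi.org/10.6084/m9.figshare.93964"),
    ("d", "http://dx.doi.org/10.2307/466856"),
    ("e", "http://dx.doi.org/10.1080/00049182.2012.731307"),
    ("f", "http://dx.doi.org/10.6084/m9.figshare.96546"),
    ("g", "http://dx.doi.org/10.1007/s10827-012-0416-6"),
    ("h", "http://dx.doi.org/10.6084/m9.figshare.91541"),
    ("i", "http://dx.doi.org/10.1038/npre.2012.7151.1"),
    ("j", "http://dx.doi.org/10.1364/AO.39.005477"),
    ("k", "http://dx.doi.org/10.3929/ethz-a-005707391"),
    ("l", "http://dx.doi.org/10.3233/BIR-2008-0496"),
    ("m", "http://dx.doi.org/10.6084/m9.figshare.95564"),
    ("n", "http://dx.doi.org/10.3205/cto000081"),
    ("o", "http://dx.doi.org/10.1107/S0108767312019034/eo5016sup1.xls"),
    ("p", "http://dx.doi.org/10.1594/PANGAEA.726855"),
    ("q", "http://dx.doi.org/10.4233/uuid:995dd18a-dc5d-4a9a-b9eb-a16a07bfcc6d"),
]


def doi_prefix(url):
    """Return the DOI prefix: the part of the path before the first '/'."""
    return urlparse(url).path.lstrip("/").split("/", 1)[0]


def group_by_prefix(examples):
    """Map each prefix to the labels of the examples that use it."""
    groups = defaultdict(list)
    for label, url in examples:
        groups[doi_prefix(url)].append(label)
    return dict(groups)


if __name__ == "__main__":
    for prefix, labels in sorted(group_by_prefix(EXAMPLES).items()):
        print(prefix, ", ".join(labels))
```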

Background

So DOIs unambiguously and persistently identify published, trustworthy, citable online scholarly literature. Right? Wrong.

The examples above are useful because they help elucidate some misconceptions about the DOI itself, the nature of the DOI registration agencies and, in particular, issues being raised by new RAs and new DOI allocation models.

DOIs are just identifiers

CrossRef’s dominance as the primary DOI registration agency makes it easy to assume CrossRef’s *particular* application of the DOI as a scholarly citation identifier is somehow intrinsic to the DOI. The truth is, the DOI has nothing specifically to do with citation or scholarly publishing. It is simply an identifier that can be used for virtually any application. DOIs could be used as serial numbers on car parts, as supply-chain management identifiers for videos and music or as cataloguing numbers for museum artifacts. The first two identifiers listed in the examples (a & b) illustrate this. They both belong to MovieLabs and are part of the EIDR (Entertainment Identifier Registry) effort to create a unique identifier for television and movie assets. At the moment, the DOIs that MovieLabs is assigning are B2B-focused and users are unlikely to see them in the wild. But we should recall that CrossRef’s application of DOIs was also initially considered a B2B identifier, and it has since become widely recognized and depended on by researchers, librarians and third parties. The visibility of EIDR DOIs could change rapidly as they become more popular.

Multiple DOIs can be assigned to the same object

There is no International DOI Foundation (IDF) prohibition against assigning multiple DOIs to the same object. At most the IDF suggests that RAs might coordinate to avoid duplicate assignments, but it provides no guidelines on how such cross-RA checks would work.

CrossRef, in its particular application of the DOI, attempts to ensure that we don’t assign different DOIs to two copies of the same article, but that check is designed to keep publishers from mistakenly making duplicate submissions. Even then, there are subtle exceptions to this rule: the same article, if legitimately published in two different issues (e.g. a regular issue and a thematic issue), will be assigned different DOIs. This is because, though the actual article content might be identical, the *context* in which it is cited is also important to record and distinguish. Finally, of course, we assign multiple DOIs to the same “object” when we assign book-level and chapter-level DOIs. Or when we assign DOIs to components or reference work entries.

The likelihood of multiple DOIs being assigned to the same object increases as we have multiple RAs. In the future we might legitimately have a monograph that has different Bowker DOIs for different e-book platforms (Kindle, iPad, Kobo) yet all three might share the same CrossRef DOI for citation purposes.

Again, the examples show this already happening. The examples f & g are assigned by DataCite (via FigShare) and CrossRef respectively. The first identifies the author version and was presumably assigned by said author. The second identifies the publisher version and was assigned by the publisher.

Although CrossRef, as a publisher-focused RA, might have historically proscribed the assignment of CrossRef DOIs to archive or author versions, there has never been and could never be any such restriction on other DOI RAs. These are legitimate applications of two citation identifiers to two versions of the same article.

However, the next set of examples, h, i, j and k, show what appears to be a slightly different problem. In these cases articles that appear to be in all aspects *identical* have been assigned two separate DOIs by different RAs. In one respect this is a logistical or technical problem: although CrossRef can check for such potential duplicate assignments within its own system, there is no way for us to do this across different RAs. But this is also a marketing and education problem: how do RAs with similar constituencies (publishers, researchers, librarians) and application of the DOI (scholarly citation) educate and inform their members about best practice in applying DOIs in that particular RA’s context?

DOI registration agencies are not focused on content types, they are focused on constituencies and applications

The examples f through k also illustrate another area of fuzzy thinking about RAs: that they are somehow built around particular content types. We routinely hear people mistakenly explain that the difference between CrossRef and DataCite is that “CrossRef assigns DOIs to journal articles” and that “DataCite assigns DOIs to data.” Sometimes this is supplemented with “and Bowker assigns DOIs to books.” This is nonsense. CrossRef assigns DOIs to data (example o) as well as conference proceedings, programs, images, tables, books, chapters, reference entries, etc. And DataCite covers a similar breadth of content types including articles (examples c, f, h, l, m). The difference between CrossRef, DataCite and Bowker is their constituencies and applications, not the content types they apply DOIs to. CrossRef’s constituency is publishers. DataCite’s constituency is data repositories, archives and national libraries. But even though CrossRef and DataCite have different constituencies, they share a similar application of the DOI: the use of DOIs as citation identifiers. This is in contrast to MovieLabs, whose application of the DOI is supply chain management.

DOI registration agency constituencies and applications can overlap *or* be entirely separate

Although CrossRef’s constituency is “publishers”, we are catholic in our definition of “publisher” and have several members who run repositories that also “publish” content such as working papers and other grey literature (e.g. Woods Hole Oceanographic Institution, University of Michigan Library, University of Illinois Library). DataCite’s constituency is data repositories, archives and national libraries, but this doesn’t stop DataCite (through CDL/FigShare) from working with the publisher, PLOS, on their “Reproducibility Initiative” which requires the archiving of article-related datasets. PLOS has announced that they will host all supplemental data sets on FigShare but will assign DOIs to those items through CrossRef.

CrossRef’s constituency of publishers overlaps heavily with Airiti, JaLC, mEDRA, ISTIC and Bowker. In the case of all but Bowker we also overlap in our application of the DOI in the service of citation identification. Bowker, though it shares CrossRef’s constituency, uses DOIs for supply chain management applications.

Meanwhile, EIDR is an outlier: its constituency does not overlap with CrossRef’s *and* its application of the DOI is different as well.

The relationship between RA constituency overlap (e.g. scholarly publishers vs television/movie studios) and application overlap (e.g. citation identification vs. supply chain management) can be visualized as such:

RA Application/Constituency overlap

The differences (subtle or large) between the various RAs are not evident to anybody without a fairly sophisticated understanding of the identifier space and the constituencies represented by the various RAs. To the ordinary person these are all just DOIs, which in turn are described as simply being “persistent interoperable identifiers.”

Which of course begs the question, what do we mean by “persistent” and “interoperable?”

DOIs are only as persistent as the registration agency’s application warrants.

The word “persistent” does not mean “permanent.” Andrew Treloar is known to point out that the primary sense of the word “persistent” in the New Oxford American Dictionary is:

Continuing firmly or obstinately in a course of action in spite of difficulty or opposition

Yet presumably the IDF once chose to use the word “persistent” instead of “perpetual” or “permanent” for other reasons. “Persistence” implies longevity, without committing to “forever.”

It may sound prissy, but it seems reasonable to expect that the useful life-expectancy for the identifier used for managing inventory of the movie “Young Sex Crazed Nurses” might be different than the life expectancy for the identifier used to cite Henry Oldenburg’s “Epistle Dedicatory” in the first issue of the Philosophical Transactions. In other words, some RAs have a mandate to be more “obstinate” than others and so their definitions of “persistence” may vary. Different RAs have different service level agreements.

The problem is that ordinary users of the “persistent” DOI have no way of distinguishing between those DOIs that are expected to have a useful life of 5 years and those DOIs that are expected to have a useful lifespan of 300+ years. Unfortunately, if one of the more than 6 million non-CrossRef DOIs breaks today, it will likely be blamed on CrossRef.

Similarly, if a DOI doesn’t work with an existing CrossRef service, like OpenURL lookup, CrossCheck, CrossMark or CrossRef Metadata Search, it will also be laid at the foot of CrossRef. This scenario is likely to become even more complex as different RAs provide different specialized services for their constituencies.

Ironically, the converse doesn’t always apply. CrossRef oftentimes does not get credit for services that we instigated at the IDF level. For instance, FigShare has been widely praised for implementing content negotiation for DOIs even though this initiative had nothing to do with FigShare. Instead, it was implemented by DataCite with the prodding and active help of CrossRef (DataCite even used CrossRef’s code for a while). To be clear, we don’t begrudge praise for FigShare. We think FigShare is very cool- this just serves as an example of the confusion that is already occurring.

DOIs are only “interoperable” at a least common denominator level of functionality

There is no question that use of CrossRef DOIs has enabled the interoperability of citations across scholarly publisher sites. The extra level of indirection built into the DOI means that publishers do not have to worry about negotiating multiple bilateral linking agreements and proprietary APIs. Furthermore, at the mundane technical level of following HTTP links, publishers also don’t have to worry about whether the DOI was registered with mEDRA, DataCite or CrossRef as long as the DOI in question was applied with citation linking in mind.

However, what happens if somebody wants to use metadata to search for a particular DOI? What happens if they expect that DOI to work with content negotiation or to enable a CrossCheck analysis or show a CrossMark dialog or carry FundRef data? At this level, the purported interoperability of the DOI system falls apart. A publisher issuing DataCite DOIs cannot use CrossCheck. A user with a mEDRA DOI cannot use it with content negotiation. Somebody searching CrossRef Metadata Search or using CrossRef’s OpenURL API will not find DataCite records. Somebody depositing metadata in an RA other than CrossRef or DataCite will not be able to deposit ORCIDs.
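The “least common denominator” point can be made concrete with a small sketch. The table below only restates the examples given in the text as they stood at the time of writing; the service labels (`http_resolution`, `crosscheck`, and so on) are our own illustrative names, not official identifiers, and the real capabilities of each RA may well have changed since.

```python
# Capabilities per RA, taken only from the examples in the surrounding text.
# The labels are illustrative assumptions, not an official vocabulary.
RA_CAPABILITIES = {
    "CrossRef": {"http_resolution", "content_negotiation", "crosscheck", "orcid_deposit"},
    "DataCite": {"http_resolution", "content_negotiation", "orcid_deposit"},
    "mEDRA": {"http_resolution"},
}


def common_capabilities(agencies):
    """Return the services that every one of the named RAs supports."""
    sets = [RA_CAPABILITIES[name] for name in agencies]
    return set.intersection(*sets) if sets else set()


def missing_for(agency, wanted):
    """Return the wanted services that the named RA does not offer."""
    return set(wanted) - RA_CAPABILITIES[agency]
```

Under these assumptions, the only capability shared by all three agencies is plain HTTP resolution, which is roughly what “interoperable” means in practice once you step outside a single RA.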

There are no easy or cheap technical solutions to fix this level of incompatibility barring the creation of a superset of all RA functionality at the IDF level. But even if we had a technical solution to this problem- it isn’t clear that such a high-level of interoperability is warranted across all RAs. The degree of interoperability that is desirable between RAs is only in proportion to the degree that they serve overlapping constituencies (e.g. publishers) or use the DOI for overlapping applications (e.g. citation).

DOI Interoperability matters more for some registration agencies than others

This raises the question of what it even means to be “interoperable” between different RAs that share virtually no overlap in constituencies or applications. In what meaningful sense do you make a DOI used for inventory control “interoperable” with a DOI used for identifying citable scholarly works? Do we want to be able to check “Young Sex Crazed Nurses” for plagiarism? Or let somebody know when the South Park movie has been retracted or updated? Do we need to alert somebody when their inventory of citations falls below a certain threshold? Or let them know how many copies of a PDF are left in the warehouse?

The opposite, but equally vexing issue arises for RAs that actually share constituencies and/or applications. CrossRef, DataCite and mEDRA have *all* built separate metadata search capabilities, separate deposit APIs, separate OpenURL APIs, and separate stats packages- *all* geared at handling scholarly citation linking.

Finally, it seems a shame that a third party, like ORCID, who wants to enable researchers to add *any* DOI and its associated metadata to their ORCID profile, will end up having to interface with 4-5 different RAs.

Summary and closing thoughts

CrossRef was founded by publishers who were prescient in understanding that, as scholarly content moved online, there was the potential to add great value to publications by directly linking citations to the documents cited. However, publishers also realized that many of the architectural attributes that made the WWW so successful (decentralization, simple protocols for markup, linking and display, etc.), also made the web a fragile platform for persistent citation.

The CrossRef solution to this dilemma was to introduce the use of the DOI identifier as a level of citation indirection in order to layer a persist-able citation infrastructure onto the web. The success of this mechanism has been evident at a number of levels. A first-order effect of the system is that it has allowed publishers to create reliable and persistent links between copies of publisher content. Indeed uptake of the CrossRef system by scholarly and professional publishers has been rapid and almost all serious scholarly publishers are now CrossRef members.

The second order effects of the CrossRef system have also been remarkable. Firstly, just as researchers have long expected that any serious paper-based publication would include citations, now researchers expect that serious online scholarly publications will also support robust online citation linking. Secondly, some have adopted a cargo-cult practice of seeing the mere presence of a DOI on a publication as a putative sign of “citability” or “authority.” Thirdly, interest in use of the DOI as a linking mechanism has started to filter out to researchers themselves, thus potentially extending the use of CrossRef DOIs beyond being primarily a B2B citation convention.

The irony is that although the DOI system was almost single-handedly popularized and promoted by CrossRef, the DOI brand is better known than CrossRef itself. We now find that new RAs like EIDR, DataCite and new services like FigShare are building on the DOI brand and taking it in new directions. As such the first and second order benefits of CrossRef’s pioneering work with DOIs are likely to be affected by the increasing activity of the new DOI RAs as well as the introduction of new models for assigning and maintaining DOIs.

How can you trust that a DOI is persistent if different RAs have different conceptions of persistence? How can you expect the presence of a DOI to indicate “authority” or “scholarliness” if DOIs are being assigned to porn movies? How can you expect a DOI to point to the “published” version of an article when authors can upload and assign DOIs to their own copies of articles?

It is precisely because we think that some of the qualities traditionally (and wrongly) accorded to DOIs (e.g. scholarly, published, stewarded, citable, persistent) are going to be diluted in the long term that we have focused so much of our recent attention on new initiatives that have a more direct and unambiguous connection to assessing the trustworthiness of CrossRef members’ content. CrossCheck and the CrossCheck logos are designed to highlight the role that publishers play in detecting and preventing academic fraud. The CrossMark identification service will serve as a signal to researchers that publishers are committed to maintaining their scholarly content as well as giving scholars the information they need to verify that they are using the most recent and reliable versions of a document. FundRef is designed to make the funding sources for research and articles transparent and easily accessible. And finally we have been both adjusting CrossRef’s branding and display guidelines as well as working with the IDF to refine its branding and display guidelines so as to help clearly differentiate different DOI applications and constituencies.

Whilst it might be worrying to some that DOIs are being applied in ways that CrossRef has not expected and may not have historically endorsed, we should celebrate that the broader scholarly community is finally recognizing the importance of persist-able citation identifiers.

These developments also serve to reinforce a strong trend that we have encountered in several guises before. That is, the complete scholarly citation record is made up of more than citations to the formally published literature. Our work on ORCID underscored that researchers, funding agencies, institutions and publishers are interested in developing a more holistic view of the manifold contributions that are integral to research. The “C” in ORCID stands for “contributor” and ORCID profiles are designed to ultimately allow researchers to record “products” which include not only formal publications, but also data sets, patents, software, web pages and other research outputs. Similarly, CrossRef’s analysis of the CitedBy references revealed that one in fifteen references in the scholarly literature published in 2012 included a plain, ordinary HTTP URI- clear evidence that researchers need to be able to cite informally published content on the web. If the trend in CitedBy data continues, then in two to three years one in ten citations will be of informally published literature.

The developments that we are seeing are a response to the need that users have to persistently identify and cite the full gamut of content types that make up the scholarly literature. If we cannot persistently cite these content types, the scholarly citation record will grow increasingly porous and structurally unsound. We can either stand back and let these gaps be filled by other players under their terms and deal reactively with the confusion that is likely to ensue- or we can start working in these areas too and help to make sure that what gets developed interacts with the existing online scholarly citation record in a responsible way.

CrossRef Metadata Search++

We have just released a bunch of new functionality for CrossRef Metadata Search. The tool now supports the following features:

  • A completely new UI
  • Faceted searches
  • Copying of search results as formatted citations using CSL
  • COinS, so that you can easily import results into Zotero and other document management tools
  • An API, so that you can integrate CrossRef Metadata Search into your own applications, plugins, etc.
  • Basic OpenSearch support- so that you can integrate CrossRef Metadata Search into your browser’s search bar.
  • Searching for a particular CrossRef DOI
  • Searching for a particular CrossRef ShortDOI
  • Searching for articles in a particular journal via the journal’s ISSN
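
For readers who have not met COinS before: it is a convention for embedding OpenURL-style citation metadata in an empty HTML `span` so that tools like Zotero can pick it up. The following is only a minimal sketch of what such a span looks like for a journal article; it is not the code that CrossRef Metadata Search uses, the function name `coins_span` is our own, and the values in the usage note are placeholders.

```python
from html import escape
from urllib.parse import urlencode


def coins_span(title, journal, issn, year, doi):
    """Build a COinS <span> describing a journal article (illustrative helper)."""
    fields = [
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
        ("rft.genre", "article"),
        ("rft.atitle", title),
        ("rft.jtitle", journal),
        ("rft.issn", issn),
        ("rft.date", str(year)),
        ("rft_id", "info:doi/" + doi),
    ]
    # urlencode percent-encodes each value; escape() then makes the '&'
    # separators safe for use inside an HTML attribute.
    title_attr = escape(urlencode(fields), quote=True)
    return '<span class="Z3988" title="{}"></span>'.format(title_attr)
```

Calling `coins_span("An example article", "Journal of Examples", "1234-5679", 2013, "10.5555/12345678")` with these made-up values returns a single `span` element that could be dropped into a results page next to the human-readable citation.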

At the moment, CrossRef Metadata Search (CRMDS) is a CrossRef Labs project and, as such, should be used with some trepidation. Our goal is to release CRMDS as a production service ASAP, but we wanted to get public feedback on the service before making the move to a production system.

PatentCite

If you’ve ever thought that scholarly citation practice was antediluvian and perverse- you should check out patents some day.

Over the past year or so CrossRef has been working with Cambia and The Lens to explore how we can better link scholarly literature to and from the patent literature. The first object of our collaboration was to attempt to link patents hosted on the new, beta version of The Lens to the scholarly literature. To do this, CrossRef and Cambia have been enhancing CrossRef’s citation matching mechanisms in order to better resolve the wide variety of eclectic and terse patent citation styles to CrossRef DOIs.

You can see the results of these ongoing attempts on The Lens beta site, where all of The Lens’s 80 million+ patents and applications (obtained through subscriptions with WIPO, USPTO, EPO and IP Australia) are starting to be linked directly to the scholarly literature. See, for example:

http://beta.lens.org/lens/patent/US_RE42150_E1/citations

CrossRef has taken this matched data and has now released a CrossRef Labs *experimental* service, called PatentCite, that allows you to take any CrossRef DOI and see what patents in The Lens system cite it.

As with all CrossRef Labs services- this one is likely to be:

a) As stable as the global economy
c) As reliable as a UK train
ii) Out-of-date. It is based on a snapshot of CrossRef /Lens data.
1) As accurate as my list ordering

Howzat for an SLA?

As we get feedback from CrossRef’s membership and as we gain more experience linking patents to and from the scholarly literature, we will explore including this functionality in our production CitedBy service. But until then- please send us your feedback on this experimental service.

CrossRef and DataCite unify support for HTTP content negotiation

Last year CrossRef and DataCite announced support for HTTP content
negotiation for DOI names. Today, we are pleased to report further
collaboration on the topic. We think it is very important that the two
largest DOI Registration Agencies work together in order to provide
metadata services to DOI names.

The current implementation is documented in detail at
http://crosscite.org/cn. The documentation explains
HTTP content negotiation as implemented by both Registration Agencies
and provides a list of supported content types.

An example application of HTTP content negotiation is a citation
formatting service. You can try it at http://crosscite.org/citeproc.
This service will accept DOIs from both CrossRef and DataCite, unlike the previous formatting service which accepted
only CrossRef DOI names (http://citation.crrd.dyndns.org).
This is
possible because CrossRef and DataCite support a shared, common
metadata format. When you input a DOI into the formatting service, it
doesn’t know where the DOI was registered. The service will make an
HTTP content negotiation request to the global DOI resolver specifying which format of the metadata should be
returned in the HTTP Accept header. The global DOI resolver will
notice (Accept header!) that this is not a regular DOI resolution
request; it will turn to CrossRef or DataCite accordingly for the
relevant metadata instead of redirecting to a landing page. The format
of metadata is shared between both registration agencies so the
formatting service can interpret it without knowledge of the DOI origin.

In summary, HTTP content negotiation lets you process a DOI’s
metadata without knowledge of its origin or specifics of the
registration agency.
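
As a rough illustration of the request described above, here is a minimal sketch using only the Python standard library. It assumes that the global DOI resolver is reachable at `https://doi.org/` and that the CSL JSON media type `application/vnd.citationstyles.csl+json` is among the supported content types; please check the list at http://crosscite.org/cn before relying on either. The function names are our own.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

# Assumption: address of the global DOI resolver.
RESOLVER = "https://doi.org/"
# Assumption: a metadata format shared by both registration agencies.
CSL_JSON = "application/vnd.citationstyles.csl+json"


def build_cn_request(doi, accept=CSL_JSON):
    """Build (but do not send) a content negotiation request for a DOI."""
    url = RESOLVER + quote(doi, safe="/")
    # The Accept header is what tells the resolver to hand back metadata
    # instead of redirecting to the landing page.
    return Request(url, headers={"Accept": accept})


def fetch_metadata(doi, accept=CSL_JSON, timeout=10):
    """Send the request and return the response body as text."""
    with urlopen(build_cn_request(doi, accept), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Note that nothing in `fetch_metadata("10.1594/pangaea.185321")` says which registration agency the DOI belongs to. The same call should work for a CrossRef DOI, which is the whole point.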

If you have any problems, email us at tech@datacite.org or
labs@crossref.org. For general discussion please kindly leave a
comment below.