We missed an error that led to the resource resolution URLs of some 500,000+ records being incorrectly updated. We have reverted the incorrect resolution URLs affected by this problem. And we’re putting in place checks and changes in our processes to ensure this does not happen again.
How we got here
Our technical support team was contacted in late June by Wiley about updating resolution URLs for their content. It’s a common request of our technical support team, one meant to make the URL update process more efficient, but this was a particularly large request. Shortly thereafter, Atypon, acting on behalf of Wiley, provided us with nearly 1,200 separate files to update the resolution URLs of ~9 million records. We manually spot-checked over 50 of these files because, prior to this issue, our technical support team did not have a mechanism to check for errors automatically. That labor-intensive review did not turn up any problems. That is, none of those 50 samples had the header errors that were found later.
Among the files we didn’t check, some had headers in which the owning member’s DOI prefix (fromPrefix) differed from the acquiring member’s DOI prefix (toPrefix). In a URL update request, the two prefixes should always be the same.
And still other files included requests to update records with DOIs that had never even been registered. Here are some examples:
In the example above, these fictional DOIs are both under prefix 10.5555. Thus, the result of this request will ONLY be that the resolution URLs of DOI 10.5555/doi1 and 10.5555/doi2 are updated in the metadata.
In this second example, these fictional DOIs are both under prefix 10.5555, but because the toPrefix in the header differs from the fromPrefix, the result of this request will be that the resolution URLs of 10.5555/doi1 and 10.5555/doi2 are updated in the metadata AND the owning prefix of both records will be transferred from prefix 10.5555 to prefix 10.9876.
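The difference between the two cases comes down to comparing the two prefixes in the file header. Here is a minimal sketch of that comparison, assuming a header line of the form `H:email=...;fromPrefix=...;toPrefix=...`. That layout, the example email address, and the function names `parse_header` and `describe_request` are illustrative assumptions, not Crossref's actual tooling.

```python
def parse_header(line: str) -> dict[str, str]:
    """Parse an assumed 'H:key=value;key=value' header line into a dict."""
    if not line.startswith("H:"):
        raise ValueError("header line must start with 'H:'")
    fields = {}
    for part in line[2:].strip().split(";"):
        if not part:
            continue
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"malformed header field: {part!r}")
        fields[key.strip()] = value.strip()
    return fields


def describe_request(header_line: str) -> str:
    """Say what a file with this header would do, per the two cases above."""
    fields = parse_header(header_line)
    from_prefix = fields["fromPrefix"]
    to_prefix = fields["toPrefix"]
    if from_prefix == to_prefix:
        # Same prefix on both sides: only the resolution URLs change.
        return "url-update-only"
    # Different prefixes: the URLs change AND ownership is transferred.
    return f"url-update-and-transfer:{from_prefix}->{to_prefix}"
```

With the fictional prefixes used in the examples, a header carrying `fromPrefix=10.5555;toPrefix=10.5555` is classified as a plain URL update, while `fromPrefix=10.5555;toPrefix=10.9876` is flagged as an update plus an ownership transfer.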
We kicked off the URL update request on 30 June and all legitimate DOIs whose files were free of errors were updated by 7 July (yes, it takes about a week to update the resolution URLs for ~9 million records).
On 9 July, Peter Strickland of the International Union of Crystallography, one of 22 members affected by this mistake, contacted us to enquire how and why much of their content was resolving to incorrect URLs, and why our search interface showed Wiley as the owner of their content. Peter was rightly concerned. We were, too. Our technical support team quickly escalated this issue because, frankly, this is not the first time our finicky URL update process has caused unwanted metadata updates, albeit never at quite this volume.
How we investigated the problem
We rallied our internal team. Our initial investigation suggested that ~600,000 DOIs had been erroneously included and updated in the 1,200 files. To be as cautious as we could, we later extended that estimate to cover other conditions, raising it to over 1 million DOIs. In the end, we determined that the incorrect files attempted updates of 1,228,041 DOIs. Due to the errors in the files (i.e., erroneous headers and non-registered DOIs), we only actually updated and then reverted 520,512 DOIs. The other 700,000+ DOIs were never updated (because of errors in the original files provided to us) or simply had never been registered with us.
Prior to this mistake, Crossref had never reverted a member’s metadata update. To be clear, and as I said above, we have had other URL update mistakes like this one over the years; they were just smaller in scale. We knew there were holes in our process that needed to be plugged. And we knew we needed a better solution for members to manage these updates themselves without our manual intervention. So, while there were mistakes made in the files supplied to us, this was our error and we’re fixing it; more on that below.
For this situation, we quickly realized that reverting the metadata update was the best option, even though we did not have an existing process in place to execute that reversion. That’s because we only keep the current version of each metadata record. We couldn’t back out of the change; we couldn’t simply restore these records to the metadata registered with us as of late June, because we no longer had an easily accessible, central record of those previous resolution URLs. What we did have was a record of all the previous submissions made against each DOI, so our technical team focused their efforts there.
How we fixed all those records
We had two errors to correct: the ownership transfers (those records that had inadvertent and mismatched from/to prefixes) and the incorrect resolution URLs. We reverted all of the ownership transfers on 9 July and then double and triple checked that ownership during the week of 12 July to ensure we didn’t miss anything.
The resolution reversion was more complicated. We invested in creating a patch to identify the records that had been updated by our team, and then extract the last legitimate resolution URL registered with us by the owning member in order to revert the metadata for each record. In order to provide confidence that this mistake was contained, we also built a check into the patch to ensure that those DOIs that did have their ownership temporarily transferred were not updated during the few days that ownership was incorrect. That check helped us determine that none of the 520,512 DOIs were incorrectly updated beyond this mistaken URL update request.
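To make the idea concrete, here is a rough sketch of the two steps just described: picking the last legitimate URL from a DOI's submission history, and checking whether anything else touched the record while ownership was wrong. It is not the actual patch. The `Submission` record, its fields (including `batch_id`), and both function names are hypothetical, chosen only to illustrate the logic.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class Submission:
    """One historical submission against a DOI (hypothetical shape)."""
    doi: str
    url: str
    submitted_at: datetime
    submitter_prefix: str  # prefix of the member that made the submission
    batch_id: str          # hypothetical label for the submission batch


def last_legitimate_url(
    history: Iterable[Submission],
    owner_prefix: str,
    bad_batch_ids: set[str],
) -> Optional[str]:
    """Return the newest URL submitted by the owning member, skipping the
    mistaken batches; None if no such submission exists."""
    candidates = [
        s for s in history
        if s.submitter_prefix == owner_prefix and s.batch_id not in bad_batch_ids
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda s: s.submitted_at).url


def updated_during_window(
    history: Iterable[Submission],
    start: datetime,
    end: datetime,
    bad_batch_ids: set[str],
) -> bool:
    """True if any submission other than the mistaken ones landed while
    ownership was incorrect (start and end inclusive)."""
    return any(
        start <= s.submitted_at <= end and s.batch_id not in bad_batch_ids
        for s in history
    )
```

In this sketch, a DOI for which `updated_during_window` returns `True` would need manual review before being reverted, since a later submission might otherwise be overwritten.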
The technical team built and tested this patch. The tests turned up gaps in the patch, so we refined it during the week of 12 July. We kicked off the reversion of these records on Monday, 19 July at 20:05 UTC and the patch completed all reversions on Thursday, 22 July at 20:14 UTC.
In the end, we successfully reverted all of the resolution URLs for those 520,512 DOIs we identified, provided daily updates and apologies to the 22 affected members, worked some longer hours together, and persevered.
We don’t want this to ever happen again. Like, never. We clearly need to make changes to our internal processes to prevent this in the future.
Here’s what’s ahead:
We are building a checker that we can run URL update files through to automate our checks. This means we will be able to check every single file in a large batch, rather than relying on manual and labor-intensive spot-checking;
As said above, one compounding issue in this mistake was the mismatched from/to prefixes in the file headers. Our technical support team uses the same file headers to transfer ownership/stewardship of a record or set of records between members AND to update resolution URLs. These two tasks are almost never legitimately completed in the same file. That is, there is usually a lag between ownership transfers and resolution URL updates (most members will request an ownership transfer and then a month or two later update their URLs). Because of this, simply decoupling these two tasks (feel free to follow our work at this link) would help eliminate a glaring risk, so we’re working on that too;
Lastly, we’re researching ways we can streamline resource resolution URL updates. You can also monitor our progress on this one. No promises or specifics yet, but we’re eager to reduce toil on our technical support team, avoid problems like this one, and provide members safe and straightforward ways to better update your metadata.
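As an illustration of the kind of automated check described in the first item above, here is a minimal sketch that runs every file in a batch through the two tests that mattered in this incident: mismatched header prefixes and DOIs that were never registered. It assumes a header line of the form `H:...;fromPrefix=...;toPrefix=...` followed by tab-separated DOI and URL lines, and a pre-built set of registered DOIs. The file layout, `check_update_file`, `check_batch`, and the lookup set are all assumptions made for the example, not a description of the checker being built.

```python
def check_update_file(name, lines, registered_dois):
    """Return a list of problem descriptions for one URL update file."""
    if not lines or not lines[0].startswith("H:"):
        return [f"{name}: missing header line"]

    # Parse the assumed 'H:key=value;key=value' header.
    header = dict(
        part.split("=", 1)
        for part in lines[0][2:].strip().split(";")
        if "=" in part
    )
    from_prefix = header.get("fromPrefix")
    to_prefix = header.get("toPrefix")

    problems = []
    if from_prefix is None or to_prefix is None:
        problems.append(f"{name}: header lacks fromPrefix or toPrefix")
    elif from_prefix != to_prefix:
        problems.append(
            f"{name}: fromPrefix {from_prefix} differs from toPrefix {to_prefix}"
        )

    # Body lines are assumed to be 'DOI<TAB>URL'; line numbers are 1-based.
    for number, line in enumerate(lines[1:], start=2):
        if not line.strip():
            continue
        doi = line.split("\t", 1)[0].strip().lower()  # DOIs are case-insensitive
        if doi not in registered_dois:
            problems.append(f"{name}:{number}: {doi} is not registered")
    return problems


def check_batch(files, registered_dois):
    """Check every file; return {file name: problems} for files with problems."""
    known = {doi.lower() for doi in registered_dois}
    report = {}
    for name, lines in files.items():
        problems = check_update_file(name, lines, known)
        if problems:
            report[name] = problems
    return report
```

Because `check_batch` visits every file rather than a sample, a batch of 1,200 files would get the same scrutiny as a batch of one, and any file with a non-empty problem list could be held back before the update is run.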
Thanks for the support of the whole Crossref team and our community - and for reading this far! Never a dull moment…