We missed an error that led to the resource resolution URLs of more than 500,000 records being incorrectly updated. We have reverted the incorrect resolution URLs affected by this problem, and we’re putting checks and process changes in place to ensure this does not happen again.
How we got here
Our technical support team was contacted in late June by Wiley about updating resolution URLs for their content. It’s a common request of our technical support team, one meant to make the URL update process more efficient, but this was a particularly large request. Shortly thereafter, Atypon, acting on behalf of Wiley, provided us with nearly 1,200 separate files to update the resolution URLs of ~9 million records. We manually spot-checked over 50 of these files because, prior to this issue, our technical support team did not have a mechanism to automatically check for errors. That labor-intensive review did not turn up any problems. That is, those 50 samples had none of the header errors that were found later.
Among the files we didn’t check, some had headers in which the owning member’s DOI prefix (fromPrefix) differed from the acquiring member’s DOI prefix (toPrefix). In a URL update request, the two prefixes should always be the same.
Still other files included requests to update records with DOIs that had never even been registered. Two examples, using fictional DOIs, show why the header matters:
In the first example, a file lists DOIs 10.5555/doi1 and 10.5555/doi2, both under prefix 10.5555, and the header gives 10.5555 as both the fromPrefix and the toPrefix. The result of this request will ONLY be that the resolution URLs of 10.5555/doi1 and 10.5555/doi2 are updated in the metadata.
In the second example, the same fictional DOIs are both under prefix 10.5555, but the toPrefix in the header (10.9876) differs from the fromPrefix (10.5555). The result of this request will be that the resolution URLs of 10.5555/doi1 and 10.5555/doi2 are updated in the metadata AND ownership of both records is transferred from prefix 10.5555 to prefix 10.9876.
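To make the difference between those two cases concrete, here is a minimal sketch in Python. It assumes a header line laid out as `H:` followed by semicolon-separated `key=value` fields; only the field names fromPrefix and toPrefix come from this post, and the exact layout, the `email` field, and the function names are assumptions for illustration, not a description of our production system.

```python
def parse_header(line):
    """Split an assumed 'H:key=value;key=value' header into a dict."""
    if not line.startswith("H:"):
        raise ValueError("not a header line")
    fields = {}
    for part in line[2:].split(";"):
        key, _, value = part.partition("=")
        fields[key.strip()] = value.strip()
    return fields


def describe_effect(header_line):
    """Report what a request with this header would do to the listed DOIs."""
    fields = parse_header(header_line)
    if fields["fromPrefix"] == fields["toPrefix"]:
        # Matching prefixes: only the resolution URLs change.
        return "url_update_only"
    # Differing prefixes: URLs change AND ownership moves to toPrefix.
    return "url_update_and_ownership_transfer"


first = "H:email=someone@example.org;fromPrefix=10.5555;toPrefix=10.5555"
second = "H:email=someone@example.org;fromPrefix=10.5555;toPrefix=10.9876"
print(describe_effect(first))
print(describe_effect(second))
```

The point of the sketch is that a single field in the header silently changes what the whole file does, which is why a mismatch is so easy to miss in a manual review.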
We kicked off the URL update request on 30 June and all legitimate DOIs whose files were free of errors were updated by 7 July (yes, it takes about a week to update the resolution URLs for ~9 million records).
On 9 July, Peter Strickland of the International Union of Crystallography, one of 22 members affected by this mistake, contacted us to ask why much of their content was resolving to incorrect URLs and why our search interface showed Wiley as the owner of their content. Peter was rightly concerned. We were, too. Our technical support team quickly escalated this issue because, frankly, this is not the first time our finicky URL update process has caused unwanted metadata updates, albeit not quite at this volume.
How we investigated the problem
We rallied our internal team. Our initial investigation suggested that some 600,000 DOIs had been erroneously included and updated in the 1,200 files. To be as cautious as we could, we later widened that estimate to cover other conditions, bringing it to over 1 million DOIs. In the end, we determined that the incorrect files attempted updates of 1,228,041 DOIs. Due to the errors in the files (i.e., erroneous headers and non-registered DOIs), we only actually updated and then reverted 520,512 DOIs. The other 700,000+ DOIs were never updated (because of errors in the original files provided to us) or simply had never been registered with us.
Prior to this mistake, Crossref had never reverted a member’s metadata update. To be clear, and as I said above, we have had other URL update mistakes like this one over the years; they were just smaller in scale. We knew there were holes in our process that needed to be plugged. And we knew we needed a better solution for members to manage these updates themselves without our manual intervention. So, while there were mistakes made in the files supplied to us, this was our error and we’re fixing it; more on that below.
For this situation, we quickly realized that reverting the metadata update was the best option for us, although we did not have an existing process in place to execute that reversion. That’s because we only keep the current version of each metadata record. We couldn’t back out of the change; we couldn’t simply restore these records to the metadata registered with us as of late June, because we no longer had an easily accessible, central record of those previous resolution URLs. What we did have was a record of all the previous submissions made against each DOI, so our technical team focused their efforts there.
How we fixed all those records
We had two errors to correct: the ownership transfers (those records that had inadvertent and mismatched from/to prefixes) and the incorrect resolution URLs. We reverted all of the ownership transfers on 9 July and then double- and triple-checked that ownership during the week of 12 July to ensure we didn’t miss anything.
The resolution reversion was more complicated. We invested in creating a patch to identify the records that had been updated by our team, and then extract the last legitimate resolution URL registered with us by the owning member in order to revert the metadata for each record. In order to provide confidence that this mistake was contained, we also built a check into the patch to ensure that those DOIs that did have their ownership temporarily transferred were not updated during the few days that ownership was incorrect. That check helped us determine that none of the 520,512 DOIs were incorrectly updated beyond this mistaken URL update request.
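As a rough illustration of the two jobs that patch had to do, here is a sketch in Python. The record layout (dictionaries with `batch_id`, `prefix`, `submitted`, and `url` keys), the function names, and the sample data are all hypothetical; our actual submission history and patch are not structured this way, so treat this as the idea only.

```python
from datetime import datetime


def last_legitimate_url(submissions, owner_prefix, bad_batch_ids):
    """Most recent URL submitted under the owner's prefix, ignoring bad batches."""
    candidates = [
        s for s in submissions
        if s["prefix"] == owner_prefix and s["batch_id"] not in bad_batch_ids
    ]
    if not candidates:
        return None  # nothing to revert to; needs a human to look at it
    return max(candidates, key=lambda s: s["submitted"])["url"]


def touched_during_window(submissions, start, end, bad_batch_ids):
    """True if any submission other than the bad batches landed in the window."""
    return any(
        start <= s["submitted"] <= end and s["batch_id"] not in bad_batch_ids
        for s in submissions
    )


# Hypothetical submission history for one fictional DOI.
history = [
    {"batch_id": "a1", "prefix": "10.5555", "submitted": datetime(2020, 3, 1),
     "url": "https://example.org/old"},
    {"batch_id": "a2", "prefix": "10.5555", "submitted": datetime(2021, 1, 15),
     "url": "https://example.org/current"},
    {"batch_id": "bad", "prefix": "10.9876", "submitted": datetime(2021, 7, 2),
     "url": "https://example.com/wrong"},
]

print(last_legitimate_url(history, "10.5555", {"bad"}))
print(touched_during_window(history, datetime(2021, 6, 30),
                            datetime(2021, 7, 9), {"bad"}))
```

The second function mirrors the containment check: if it returned True for any DOI, that record would need a closer look before being reverted.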
The technical team built and tested this patch. The tests turned up gaps in the patch, so we refined it during the week of 12 July 2021. We kicked off the reversion of these records on Monday, 19 July at 20:05 UTC and the patch completed all reversions on Thursday, 22 July at 20:14 UTC.
In the end, we successfully reverted all of the resolution URLs for the 520,512 DOIs we identified, provided daily updates and apologies to the 22 affected members, worked some longer hours together, and persevered.
We don’t want this to ever happen again. Like, never. We clearly need to make changes to our internal processes to prevent this in the future.
Here’s what’s ahead:
We are building a checker that we can run URL update files through to automate our checks. This means we will be able to check every single file in a large batch, rather than relying on manual and labor-intensive spot-checking;
As said above, one compounding issue in this mistake was the mismatched from/to prefixes in the file headers. Our technical support team uses the same file headers to transfer ownership/stewardship of a record or set of records between members AND to update resolution URLs. These two tasks are almost never legitimately completed in the same file. That is, there is usually a lag between ownership transfers and resolution URL updates (most members will request an ownership transfer and then a month or two later update their URLs). Because of this, simply decoupling these two tasks (feel free to follow our work at this link) would help eliminate a glaring risk, so we’re working on that too;
Lastly, we’re researching ways we can streamline resource resolution URL updates. You can also monitor our progress on this one. No promises or specifics yet, but we’re eager to reduce toil on our technical support team, avoid problems like this one, and provide members with safe and straightforward ways to update their metadata.
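For a sense of what the checker in the first item might look for, here is a minimal sketch in Python. It is not the checker itself. It assumes a file with one `H:` header line of semicolon-separated `key=value` fields followed by one DOI and URL per line, separated by a tab, and it assumes the caller can supply a set of registered DOIs; the layout, the function name, and the lookup are all assumptions made for illustration.

```python
def check_url_update_file(text, registered_dois):
    """Return a list of problems found in one URL update file (empty if clean)."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("H:"):
        return ["missing header line"]

    header = {}
    for part in lines[0][2:].split(";"):
        key, _, value = part.partition("=")
        header[key.strip()] = value.strip()

    problems = []
    from_prefix = header.get("fromPrefix")
    to_prefix = header.get("toPrefix")
    if not from_prefix or not to_prefix:
        problems.append("header lacks fromPrefix or toPrefix")
    elif from_prefix != to_prefix:
        # A mismatch would turn a URL update into an ownership transfer.
        problems.append(
            f"fromPrefix {from_prefix} differs from toPrefix {to_prefix}"
        )

    for number, line in enumerate(lines[1:], start=2):
        if not line.strip():
            continue  # tolerate blank lines
        parts = line.split("\t")
        if len(parts) != 2:
            problems.append(f"line {number}: expected DOI, a tab, then a URL")
            continue
        doi = parts[0].strip()
        if doi not in registered_dois:
            problems.append(f"line {number}: {doi} has never been registered")
    return problems


sample = (
    "H:email=someone@example.org;fromPrefix=10.5555;toPrefix=10.9876\n"
    "10.5555/doi1\thttps://example.org/1\n"
    "10.5555/ghost\thttps://example.org/2\n"
)
for problem in check_url_update_file(sample, {"10.5555/doi1"}):
    print(problem)
```

Running something like this over every file in a batch, and refusing to proceed if any file returns problems, is the kind of gate that manual spot-checking of 50 files out of 1,200 cannot provide.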
Thanks for the support of the whole Crossref team and our community - and for reading this far! Never a dull moment…