In my previous blog post, “Matchmaker, matchmaker, make me a match”, I compared four approaches for reference matching. The comparison was done using a dataset composed of automatically-generated reference strings. Now it’s time for the matching algorithms to face the real enemy: the unstructured reference strings deposited with Crossref by some members. Are the matching algorithms ready for this challenge? Which algorithm will prove worthy of becoming the guardian of the mighty citation network? Buckle up and enjoy our second matching battle!
Matching (or resolving) bibliographic references to target records in the collection is a crucial task in the Crossref ecosystem. Automatic reference matching lets us discover citation relations in large document collections and calculate citation counts, H-indexes, impact factors, and so on. At Crossref, we currently use a matching approach based on reference string parsing. Some time ago we realized there is a much simpler approach. And now it is finally battle time: which of the two approaches is better?
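For a flavour of the simpler, search-based idea, here is a rough sketch that matches an unstructured reference string by searching the public Crossref REST API with its `query.bibliographic` parameter. It is an illustration only, not the production matcher: the score threshold is an arbitrary placeholder.

```python
# Sketch: match an unstructured reference string by querying the public
# Crossref REST API and keeping the top hit if its relevance score is high
# enough. The threshold below is an arbitrary placeholder, not a tuned value.
import requests

def match_reference(reference: str, threshold: float = 60.0):
    """Return the DOI of the best candidate for `reference`, or None."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": reference, "rows": 1},
        timeout=30,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    if not items:
        return None
    top = items[0]
    # `score` is the search engine's relevance score; a real matcher would
    # validate the candidate more carefully than a single absolute cut-off.
    return top["DOI"] if top.get("score", 0) >= threshold else None

print(match_reference(
    "Carberry J. Toward a Unified Theory of High-Energy Metaphysics. "
    "Journal of Psychoceramics, 2008."
))
```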
At Crossref Labs, we often come across interesting research questions and try to answer them by analyzing our data. Depending on the nature of the experiment, processing over 100M records might be time-consuming or even impossible. In those dark moments we turn to sampling and statistical tools. But what can we infer from only a sample of the data?
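As a toy illustration of the kind of inference sampling allows (the sample size and observed count below are invented), here is how one might put a confidence interval around a proportion estimated from a random sample of records:

```python
# Toy sketch: estimate what fraction of records have some property from a
# random sample, with a 95% confidence interval (normal approximation).
# The sample size and observed count are made-up illustration values.
import math

def proportion_ci(successes: int, n: int, z: float = 1.96):
    p = successes / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - margin), min(1.0, p + margin)

p, lo, hi = proportion_ci(successes=412, n=1000)
print(f"estimated share: {p:.1%} (95% CI {lo:.1%} to {hi:.1%})")
```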
As the linking hub for scholarly content, it’s our job to tame URLs and put something better in their place. Why? URLs can be created, deleted, or changed at any time, which leaves most of them prone to link rot. And that’s a problem if you’re trying to cite them.
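Part of what makes a DOI “something better” is that it resolves to the current landing page via an HTTP redirect through doi.org, so the identifier you cite never has to change. A minimal sketch, using the DOI Handbook’s DOI purely as an example:

```python
# Sketch: a DOI is resolved by following an HTTP redirect from doi.org to
# whatever the current landing page is, so the identifier you cite stays
# the same even if the URL behind it changes.
import requests

def resolve_doi(doi: str) -> str:
    """Return the URL that a DOI currently resolves to."""
    # Note: a few landing pages reject HEAD requests; GET would also work.
    resp = requests.head(f"https://doi.org/{doi}", allow_redirects=True, timeout=30)
    return resp.url

print(resolve_doi("10.1000/182"))  # the DOI Handbook, used here only as an example
```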
This is a joint blog post with Dario Taraborelli, coming from WikiCite 2016.
In 2014 we were taking our first steps along the path that would lead us to Crossref Event Data. Around that time I started looking into the DOI resolution logs to see if we could get any interesting information out of them. This project, which became Chronograph, showed which domains were driving traffic to Crossref DOIs.
You can read about the latest results from this analysis in the “Where do DOI Clicks Come From” blog post.
Having this data tells us, amongst other things:
Jennifer Lin – 2016 January 08
In: Crossref Labs, Data, Event Data, Funders, Identifiers, Linked Data, Metadata, ORCID, XML
At the 2015 Crossref Annual Meeting, I introduced a metaphor for the work that we do at Crossref. I re-present it here for broader discussion, as this narrative continues to play a guiding role in the development of products and services this year.
At Crossref, we make research outputs easy to find, cite, link, and assess through DOIs. Publishers register their publications and deposit metadata through a variety of channels (XML, CSV, PDF, manual entry), which we process and transform into Crossref XML for inclusion in our corpus. This data infrastructure, which makes scholarly communication possible regardless of publisher, subject area, or geography, is far more than a reference list, index, or directory.
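For a concrete sense of what that corpus looks like from the outside, here is a sketch that reads the registered metadata for a single DOI back from the public Crossref REST API; the DOI and the handful of fields printed are just examples.

```python
# Sketch: once a publisher's deposit has been processed, the registered
# metadata for a DOI can be read back from the public Crossref REST API.
import requests

def get_work(doi: str) -> dict:
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30)
    resp.raise_for_status()
    return resp.json()["message"]

work = get_work("10.5555/12345678")  # Crossref's test record, used as an example
print(work.get("title"), work.get("issued"), len(work.get("reference", [])))
```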
If you’re anything like us at Crossref Labs (and we know some of you are) you would have been very excited about the launch of the Raspberry Pi Zero a couple of days ago. In case you missed it, this is a new edition of the tiny, low-priced Raspberry Pi computer. Very tiny and very low-priced. At $5 we just had to have one, and ordered one before we knew exactly what we wanted to do with it. You would have done the same. Bad luck if it was out of stock.
Skimming the headlines on Hacker News yesterday morning, I noticed something exciting. A dump of all the submissions to Reddit since 2006. “How many of those are DOIs?”, I thought. Reddit is a very broad community, but has some very interesting parts, including some great science communication. How much are DOIs used in Reddit?
(There has since been a discussion about this blog post on Hacker News)
We have a whole strategy for DOI Event Tracking, but nothing beats a quick hack, and nothing is more irresistible than a data dump.
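In that spirit, a quick-hack sketch of the kind of scan involved: the file name and the submission fields are assumptions about the dump’s layout, and the DOI regex is deliberately loose, catching most modern DOIs along with some false positives.

```python
# Quick-hack sketch: scan a dump of Reddit submissions (assumed to be one
# JSON object per line) and count strings that look like DOIs. The file name
# and field names are assumptions; the regex is loose by design.
import json
import re
from collections import Counter

DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+', re.IGNORECASE)

counts = Counter()
with open("reddit_submissions.jsonl", encoding="utf-8") as dump:
    for line in dump:
        submission = json.loads(line)
        text = " ".join(
            str(submission.get(field, "")) for field in ("url", "title", "selftext")
        )
        for doi in DOI_PATTERN.findall(text):
            counts[doi.rstrip(").,")] += 1  # trim common trailing punctuation

print(f"{sum(counts.values())} DOI-like strings, {len(counts)} distinct")
print(counts.most_common(10))
```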
ROR announces the first Org ID prototype – 2019 February 10
Request for feedback on grant identifier metadata – 2019 February 07
Underreporting of matched references in Crossref metadata – 2019 February 05