Joe Wass – 2017 September 25
I’m here in Toronto and looking forward to a busy week. Maddy Watson and I are in town for the 4:AM Altmetrics Conference, as well as the altmetrics17 workshop and Hack-day. I’ll be speaking at each, and for those of you who aren’t able to make it, I’ve combined both presentations into a handy blog post, which follows on from my last one.
But first, nothing beats a good demo. Take a look at our live stream. This shows the Events passing through Crossref Event Data, live, as they happen. You may need to wait a few seconds before you see anything.
You may know about Crossref. If you don’t, we are a non-profit organisation that works with Publishers (getting on for nine thousand) to register scholarly publications, issue Persistent Identifiers (DOIs) and maintain the infrastructure required to keep them working. If you don’t know what a DOI is, it’s a link that looks like this:
When you click on that, you’ll be taken to the landing page for that article. If the landing page moves, the DOI can be updated so you’re taken to the right place. This is why Crossref was created in the first place: to register Persistent Identifiers to combat link rot and to allow Publishers to work together and cite each other’s content. A DOI is a single, canonical identifier that can be used to refer to scholarly content.
Not only that, we combine that with metadata and links. Links to authors via ORCIDs, references and citations via DOIs, funding bodies and grant numbers, clinical trials… the list goes on. All of this data is provided by our Publisher members and most of it is made available via our free API.
Because we are the central place that publishers register their content, and we’ve got approaching 100 million items of Registered Content, we thought that we could also curate and collect altmetrics-type data for our corpus of publications. After all, a reference from a Tweet to an article is a link, just like a citation between two articles is a link.
So, a few years back we thought we would try and track altmetrics for DOIs. This was done as a Crossref Labs experiment. We grabbed a copy of PLOS ALM (since renamed Lagotto), loaded a sample of DOIs into it and watched as it struggled to keep up.
It was a good experiment, as it showed that we weren’t asking exactly the right questions. There were a few things that didn’t quite fit. Firstly, it required every DOI to be loaded into it up-front, and, in some cases, for the article landing page for every DOI to be known. This doesn’t scale to tens of millions. Secondly, it had to scan over every DOI on a regular schedule and make an API query for each one. That doesn’t scale either. Thirdly, the kind of data it was requesting was usually in the form of a count. It asked the question:
“How many tweets are there for this article as of today?”
This fulfilled the original use case for PLOS ALM at PLOS. But when running it at Crossref, on behalf of every publisher out there, the results raised more questions than they answered. Which was good, because it was a Labs Experiment.
The whole journey to Crossref Event Data has been a process of working out how to ask the right question. There are a number of ways in which “How many tweets are there for this article as of today?” isn’t the right question. It doesn’t answer:
We took one step closer toward the right question. Instead of asking “how many tweets for this article are there as of today” we asked:
“What activity is happening on Twitter concerning this article?”
If we record each activity we can include information that answers all of the above questions. So instead of collecting data like this:
We’re collecting data like this:
Now we’re collecting individual links between tweets and DOIs, we’re closer to all the other kinds of links that we store. It’s like the “traditional” links that we already curate except:
This last point caused us to scratch our heads for a bit. We used to collect links within the ‘traditional’ scholarly literature. Generally, journal articles:
Now we’re collecting links between things that aren’t seen as ‘traditional’ scholarship and don’t play by the rules.
The first thing we found is that blog authors don’t reference the literature using DOIs. Instead they use article landing pages. This meant that we had to put in the work to collect links to article landing pages and turn them back into DOIs so that they can be referenced in a stable, link-rot-proof way.
When we looked at Wikipedia we noticed that, as pages are edited, references are added and removed all the time. If our data set reflected this, it would have to evolve over time, with items popping into existence and then vanishing again. This isn’t good.
Our position in the scholarly community is to provide data and infrastructure that others can use to create services, enrich and build things. An ever-changing data set, where things can disappear, is hard to work with and not something we want to curate.
We realised that a plain old link store (also known as an assertion store, triple store, etc.) wasn’t the right approach as it didn’t capture the nuance in the data with sufficient transparency. At least, it didn’t tell the whole story.
We settled on a new architecture, and Crossref Event Data as we now know it was born. Instead of a dataset that changes over time, we have a continual stream of Events, where each Event tells a new part of the story. An Event is true at the time it is published, but if we find new information we don’t edit Events, we add new ones.
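To make the "add, don’t edit" idea concrete, here is a minimal Python sketch of an append-only stream. It is not the Event Data implementation: the `Event` and `EventStream` names, the field names, the example URLs and the DOI (which uses a test prefix) are all assumptions made for illustration.

```python
from dataclasses import dataclass, FrozenInstanceError
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    """One observation of a link. Field names are illustrative only."""
    subject: str           # the thing doing the linking, e.g. a page version
    relation: str          # the kind of link, e.g. "references"
    obj: str               # the registered content being linked to
    observed_at: datetime  # when the link was observed


class EventStream:
    """Append-only: Events can be published but never edited or removed."""

    def __init__(self):
        self._events = []

    def publish(self, event):
        self._events.append(event)

    def events(self):
        # Return a tuple so callers cannot modify the stream's history.
        return tuple(self._events)


stream = EventStream()
doi = "https://doi.org/10.5555/12345678"  # placeholder DOI

# Two observations of the same link, made at different times.
stream.publish(Event("https://example.org/page?rev=1", "references", doi,
                     datetime(2017, 9, 1, tzinfo=timezone.utc)))
stream.publish(Event("https://example.org/page?rev=2", "references", doi,
                     datetime(2017, 9, 2, tzinfo=timezone.utc)))

first = stream.events()[0]
try:
    first.relation = "edited"  # frozen dataclasses reject assignment
except FrozenInstanceError:
    print("Events cannot be edited after publication")
```

The point of the sketch is only that new information arrives as a new Event, while everything already published stays exactly as it was.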
An Event is the way that we tell you that we observed a link. It includes the link, in “subject - relation type - object” format, but so much more. We realised that one question won’t do, so Events now answer the following questions:
I’ll come back to the “how do you know” a bit later.
So, an Event is a package that contains a link plus lots of extra information required to interpret and make sense of it. But how do we choose what comprises an Event?
An Event is created every time we notice an interaction between something we can observe out on the web and a piece of registered content. This simple description gives rise to some interesting quirks.
It means that every time we see a tweet that mentions an article, for example, we create an Event. If a tweet mentions two articles, there are two events. That means that “the number of Twitter events” is not the same as “the number of tweets”.
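A tiny Python illustration of that difference, using made-up tweet identifiers and placeholder DOIs rather than real Event Data output:

```python
# Each pair is one Event: (tweet identifier, DOI it mentions).
# All identifiers here are hypothetical placeholders.
events = [
    ("tweet-1", "10.5555/aaa"),
    ("tweet-1", "10.5555/bbb"),  # the same tweet mentions a second article
    ("tweet-2", "10.5555/aaa"),
]

event_count = len(events)
tweet_count = len({tweet for tweet, _ in events})

print(event_count, tweet_count)  # 3 2
```

Three Events, but only two tweets. Which of the two numbers you want depends on the question you are asking.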
It means that every time we see a link to a piece of registered content in a webpage, we create an Event. The Event Data system currently tries to visit each webpage once, but we reserve the right to visit a webpage more than once. So the number of Events for a particular webpage isn’t necessarily the number of references on that page.
We might go back and check a webpage in future to see if it still has the same links. If it does, we might generate a new set of Events to indicate that.
Because of the evolving nature of Wikipedia, we attempt to visit every page revision and document the links we find. This means that if a page has a very active edit history, we will see repeated Events for the same reference, one for every version of the page that contains it. So the number of Events from Wikipedia isn’t the same as the number of references.
An Event is created every time we notice an interaction. Each source (Reddit, Wikipedia, Twitter, blogs, the web at large) has different quirks, and you need to understand the underlying source in order to understand the Events.
If you want to create a metric based on counting things, you have a lot of decisions to make. Do you care about bots? Do you care about citation rings? Do you care about retweets? Do you care about whether people use DOIs or article landing pages? Do you care what text people included in their tweet? The answer to each of these questions means that you’ll have to look at each data point and decide to put a weighting or score on it.
If you wanted to measure how blogged about a particular article was, you would have to look at the blogs to work out if they all had unique content. For example, Google’s Blogger platform can publish the same blog post under multiple domain names.
A blog full of link spam is still a blog. You may be doing a study into reputable blogs, so you may want to whitelist a set of domain names to exclude less reputable blogs. Or you may be doing a study into blog spam, so lower quality blogs are precisely what you’re interested in.
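As a rough sketch of that kind of filtering in Python: the dictionary keys, the domain names and the DOIs below are assumptions for illustration, not the real Event format.

```python
from urllib.parse import urlparse


def domain_of(url):
    """Return the lower-cased host name of a URL, or '' if there is none."""
    return (urlparse(url).hostname or "").lower()


def split_by_whitelist(events, whitelist):
    """Split Events into (on the whitelist, everything else)."""
    kept, rest = [], []
    for event in events:
        if domain_of(event["subject"]) in whitelist:
            kept.append(event)
        else:
            rest.append(event)
    return kept, rest


# Hypothetical Events: a blog post URL linking to a DOI.
events = [
    {"subject": "https://blog.example.org/post/1", "object": "10.5555/aaa"},
    {"subject": "https://spam.example.net/buy-now", "object": "10.5555/aaa"},
]

reputable, others = split_by_whitelist(events, {"blog.example.org"})
print(len(reputable), len(others))  # 1 1
```

A study of reputable blogs would use the first list and a study of blog spam would use the second. The underlying Events are the same either way.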
If you wanted to measure how discussed an article was on Reddit, you might want to go to the conversation and see if people were actually talking about it, or whether it was an empty discussion. You might want to look at the author of the post to see if they were a regular poster, whether they were a bot or an active member of the community.
If you wanted to measure how referenced an article was in Wikipedia, you might want to look at the history of each reference to see if it was deleted immediately, or if it existed for, say, 50% of the time, and weight it accordingly.
We don’t do any scoring; we just record everything we observe. We know that everyone will have different needs, produce different outcomes and use different methodologies. So it’s important that we tell you everything we know.
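One possible weighting, sketched in Python under a simplifying assumption: it measures the share of observed page revisions that contain the reference, rather than elapsed time. The function name and the revision numbers are hypothetical, and this is only one of many ways you might choose to weight.

```python
def reference_weight(revisions_with_reference, all_revisions):
    """Fraction of observed revisions in which the reference appears."""
    observed = set(all_revisions)
    if not observed:
        return 0.0
    present = set(revisions_with_reference) & observed
    return len(present) / len(observed)


# A reference that appears in two of four observed revisions.
print(reference_weight([2, 3], [1, 2, 3, 4]))  # 0.5

# A reference added in one revision and removed in the next.
print(reference_weight([7], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 0.1
```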
So that’s an Event. It’s not just a link, it’s the observation of a link, coupled with extra information to help you understand it.
But what if the Event isn’t enough? To come back to the earlier question, “how do you know?”
Events don’t exist in isolation. Data must be collected and processed. Each Agent in Crossref Event Data monitors a particular data source and feeds data into the system, which goes and retrieves webpages so it can make observations. Things can go wrong.
Any one of these things might prevent an Event from being collected:
This is a fact of life, and we can only operate on a best-effort basis. If we don’t have an Event, it doesn’t mean it didn’t happen.
This doesn’t mean that we just give up. Our system generates copious logs. It details every API call it made, the response it got, every scan it made, every URL it looked at. This amounts to about a gigabyte of data per day. If you want to find out why there was no Wikipedia data at a given point in time, you can go back to the log data and see what happened. If you want to see why there was no Event for an article by publisher X, you can look at the logs and see, for example, that Publisher X prevented the bot from visiting.
Every Event that does exist has a link to an Evidence Record, which corresponds with the logs. The Evidence Record tells you:
Artifacts are versioned files that contain information that Agents use. For example, there’s a list of domain names, a list of DOI prefixes, a list of blog feed URLs, and so on. By indicating which versions of these Artifacts were used, we can explain why we visited a certain domain and not another.
All the code is open source. The Evidence Record says which version of each Agent was running so you can see precisely which algorithms were used to generate the data.
Between the Events, Evidence Records, Evidence Logs, Artifacts and Open Source software, we can pinpoint precisely how the system behaved and why. If you have any questions about how a given Event was (or wasn’t) generated, every byte of explanation is freely available.
This forms our “Transparency first” idea. We start the whole process with an open Artifact Registry. Open source software then produces open Evidence Records. The Evidence Record is then consulted and turned into Events. All the while, copious logs are being generated. We’ve designed the system to be transparent, and for each step to be open to inspection.
We’re currently in Beta. We have over thirty million Events in our API, and they’re just waiting for you to use them!
Head over to the User Guide and get stuck in!
If you are in Toronto, come and say hi to Maddy or me.