Hello from sunny Girona! I’m heading to PIDapalooza, the Persistent Identifier festival, as it returns for its second year. It’s all about to kick off.
One of the themes this year is “bridging worlds”: how to bring together different communities and the identifiers they use. Something I really enjoyed about PIDapalooza last year was the variety of people who came. We heard about some “traditional” identifier systems (at least, it seems that way to us): DOIs for publications, DOIs for datasets, ORCIDs for researchers. But, gathered in Reykjavik, under dark Icelandic skies, I met oceanographic surveyors assigning DOIs to drilling equipment, heard stories of identifiers in Chinese milk production and consoled librarians trying to navigate the identifier landscape.
In addition to the usual scholarly publishing and science communication crowd, it was encouraging to see a real diversity of people from different walks of life encounter the same problems and work on them collaboratively. The thing that brought everyone together was the understanding that if we’re going to reliably reference things – be they researchers, articles they write, or ships they sail – we need to give them identifiers. And those identifiers should be as good as possible: persistent, resolvable, interoperable.
Who cares about PIDs?
At the turn of the century, a handful of publishers came together to create Crossref (or CrossRef as it was in those days). It was becoming increasingly important to be able to store references in machine-readable format, but publishers were faced with a problem. If an author wants to cite an article, they’ll do so without worrying who published it. This means they needed an identifier system that worked across all publishers. Thus the Crossref DOI was born.
Today we’re heading toward 10,000 members, and the thing that they have in common is that they all produce scholarly content and care about how it’s referenced. As a trade association, we effectively act on behalf of all of our members, allowing them to register their content, share metadata and links, and assign an identifier.
But there’s a whole world out there. Publications have never been the be-all and end-all of scholarship, but they have been a backbone. More and more scholarship, especially science, is now done outside journal publishing. Sometimes it’s done on platforms that care about the scholarly record as much as publishers. And sometimes it isn’t.
Lots of people use Twitter to talk about science. Some are scientists, some aren’t. Scientific articles are linked from news reports and discussed on blogs. Gone are the days of scholarly articles being cited only by other scholarly articles. We see links coming in from all over the place. And, although not all of this can be counted as the “scholarly record”, some of it could be.
The barrier-to-entry for journal publishing means that science journals contain only science articles. The barrier-to-entry for Twitter means that anyone can, and does, publish there. My Twitter feed is finely balanced between bibliometrics research, marine biology and pictures of snow leopards with Japanese captions. I don’t understand all of it, but I like looking at the pictures.
Back in the days when the only references to scholarly publications were from other scholarly publications, it was easy to keep track of those references. When an article was published, its references went into a citation database. This happened because the publisher considered this important.
But Twitter, the publisher of tweets, doesn’t care. It is used for a huge variety of communications and although some people choose to use it to engage in scholarship, we’re just a blip on their radar. The same goes for Reddit, a platform that describes itself as “the front page of the Internet”. There are communities engaged in scientific discussions, but Reddit doesn’t feel the need to publish its bibliographic references.
Nor should it.
Bridging those who care with those who don’t
The barrier-to-entry for contributing to scientific discussions has lowered, meaning that non-specialist platforms now play a larger role in those discussions.
I imagine that there are other communities out there who have their own concerns about the web. Maybe there are model train enthusiasts who want to keep track of every reference to a particular model. Or political commentators who want to keep track of how certain politicians and policies are discussed. As the scholarly community embraces new platforms for communicating, we should recognise that we are part of a broader universe of people using those platforms for more diverse reasons.
Gone are the days when the only way to reply to an article was by writing a letter to the editor. But also gone are the days when you could guarantee that your letter wouldn’t appear next to cat pictures (assuming you weren’t writing to the Journal of Feline Medicine & Surgery). As a specialist community cohabiting online spaces with non-specialists, it falls to us to do whatever we need to adapt that space and make it our own. In our case, this means recording bibliographic references as and where they occur.
Something like this happened once before. As traditional publishers went online, they created Crossref to build and maintain the necessary infrastructure. We’re acting on behalf of the community again to collect links from non-traditional sources. Because we can’t go to platforms like Twitter and say “please deposit your references”, we’re doing the opposite. We identify a platform, then work out how to scrape its content and extract links.
Working at scale
So we’re broadening out the universe of references that we would like to track from “traditional scholarly publishing” to “the entire web”. There are four broad challenges inherent in this, and we think that Crossref infrastructure is the right way to meet them.
The first challenge is physically finding the links. Because social media platforms aren’t specialised for scholarly publishing, they don’t have the same mechanisms in place for capturing bibliographic references. This means that we have to do it ourselves by scraping webpages for references. As the standard-bearer for scholarly PIDs, we think we can do a good job of this.
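To make the idea of scraping a page for references concrete, here is a minimal sketch in Python. It is not how Event Data itself works: the `LinkCollector` class, the `extract_dois` function and the loose `DOI_PATTERN` regular expression are all illustrative assumptions, and a real system would need to handle many more cases.

```python
import re
from html.parser import HTMLParser

# Loose pattern for a DOI embedded in a link. This is an illustrative
# assumption, not a complete description of valid DOI syntax.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"'<>]+")


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_dois(html):
    """Return the DOI-like strings found in the links of an HTML page."""
    collector = LinkCollector()
    collector.feed(html)
    dois = []
    for href in collector.hrefs:
        match = DOI_PATTERN.search(href)
        if match:
            # Trim punctuation that often trails a pasted link.
            dois.append(match.group(0).rstrip(".,;)"))
    return dois


# A made-up page: one DOI link and one unrelated link.
sample_page = (
    '<p>See <a href="https://doi.org/10.5555/12345678">this article</a> '
    'and <a href="https://example.com/cats">these cats</a>.</p>'
)
found = extract_dois(sample_page)
```

Only the first link in the sample page carries a DOI, so only that one ends up in `found`. Links that point at a publisher's own article page rather than at a DOI need a different approach.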
The second challenge is doing this at the scale of the web. Because we might, in theory, find a link on any webpage, the number of publishing platforms is effectively unlimited, from big websites like BBC News down to tiny blogs run out of a bedroom. It would be impossible to partner with each of these individually. The way to solve this is to run a centralised service which goes out and contacts as many sources as possible. This role is a collaborative one. Our system is open to inspection, suggestions and contributions from the community.
The third challenge is the sheer number of publishers. Because they all register content with us, we are in a good position to track their DOIs. In addition to that, every member of Crossref publishes content on their own platform, and has their own set of websites to track. We monitor our members' websites and create a central list of domains that we look for. If this weren’t done centrally, each publisher would have to run its own web crawlers and perform the same work, only to filter out their own links.
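As a rough illustration of checking links against a central list of domains, the sketch below tests whether a URL's host belongs to a known publisher domain. The domain names and the `is_publisher_link` function are hypothetical, chosen only for the example; the real list and matching rules may look quite different.

```python
from urllib.parse import urlparse

# Hypothetical central list of publisher domains, for illustration only.
PUBLISHER_DOMAINS = {"journals.example.org", "publisher.example.com"}


def is_publisher_link(url, domains=PUBLISHER_DOMAINS):
    """Return True if the URL's host is a listed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in domains)


# Made-up links as they might be found on a blog or in a tweet.
candidate_links = [
    "https://www.publisher.example.com/articles/42",
    "https://journals.example.org/vol1/issue2",
    "https://example.net/snow-leopards",
]
matches = [url for url in candidate_links if is_publisher_link(url)]
```

Matching on the whole host name, rather than searching for the domain anywhere in the URL, avoids false positives from look-alike addresses.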
The fourth challenge is how to get all that data to the public. Even if every publisher were able to run their own infrastructure, it would make it very difficult to consume. Through Crossref metadata services, publishers have built a system where you can look up metadata and link to articles without worrying who published them. We think that the same approach should apply to this new link data.
For these reasons, we’re building Crossref Event Data: a system that monitors as many platforms as we can think of, brings the links it finds into one place, and serves the whole community.
If you’ve been following along you’ll know that my last metaphor was the process of refining crude oil. I like metaphors, and mixing them. After all, you can’t mix a good metaphor without breaking a few eggs into the mixing bowl. Today’s metaphors are bridges. And not just one.
Bridge 1: PIDs and URLs
In the world of Persistent Identifiers, we’re quite good at linking. Organizations like Crossref, DataCite and ORCID run separate systems but we work together to record and exchange links. But the web is different. There’s no single organization in control and there are many organizations working to catalogue it. Event Data is our offering: bridging the web with our identifiers.
Bridge 2: Scholarly link providers
Of course, some platforms and systems do care about persistence and Persistent Identifiers. Event Data is an open platform, and we’re collaborating with a few providers to publish links.
We’ve partnered with The Lens to include Patent to DOI references. We’re working with F1000 to include links between reviews and articles. Hopefully we’ll see more organizations use Event Data to publish their links.
Bridge 3: Crossref / DataCite
Event Data is a collaborative project between DataCite and Crossref. When Crossref Registered Content contains a reference to a DataCite DOI we put it into Event Data. DataCite do the same in reverse. This means that Event Data contains a huge number of article - dataset links.
Bridge 4: Traditional discussions vs new ones
At each moment, scholarly discussions are happening in the literature, on various social media platforms and on the web at large. They are all talking about the same thing, but are spread out. Event Data collects links wherever we find them and brings them into one place. By doing this we hope we can help bring those conversations together.
Bridge 5: Bridging bibliometricians and altmetricians to data sources
Capturing links from social media to published literature underpins the field of altmetrics. By collecting this data and making it available under open licenses, we bring it to altmetrics researchers. We don’t provide metrics, but we do provide the data points that can form the basis for research.
Without infrastructure for collecting data, researchers would have to perform the same work over and over again. Because the data is all open, we allow datasets to be republished, reworked and replicated.
Bridge 6: Bridging the Evidence Gap
Running Event Data involves collecting a lot of data - gigabytes per day - and boiling it down into hundreds of thousands of individual Events per day. People consuming the data may want to do further boiling down. At every point of the process we record the input data that we were working from, the internal thought process of the system, and the Events that were produced. A researcher can use the Evidence Logs to trace through the entire process that led to an Event.
We’re a bridge from websites and social media to data consumers. But we take the role very seriously, and there’s nothing hidden. A glass bridge, you could say.
It’s not all plain sailing. There are a few challenges along the way to collecting this data which anyone who wanted to collect this kind of information would face. By collecting it in a central place and running an open platform we can solve each problem once, and improve our process as a community.
One problem is choosing what to include. We include any link that we find from a non-publisher website. That means that inevitably some of the links are from spam. This problem isn’t new: we see low-quality articles being published in traditional journals from time to time. We try to include all of the data we can find and pass it on to consumers. They might want to whitelist certain sources, or they may want all of the data because they’re trying to study scholarly spam. We have decided to provide data as Events, which strike the balance between atomicity and usefulness.
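For a consumer who does want to whitelist sources, the filtering step could be as small as the sketch below. The Events here are simplified, made-up dictionaries, and the `source_id` and `obj_id` field names and the `filter_by_source` function are assumptions for illustration; check the User Guide for the actual shape of an Event.

```python
def filter_by_source(events, allowed_sources):
    """Keep only the Events whose source is on the consumer's whitelist."""
    allowed = set(allowed_sources)
    return [event for event in events if event.get("source_id") in allowed]


# Made-up Events, each linking some source to a DOI.
sample_events = [
    {"source_id": "twitter", "obj_id": "https://doi.org/10.5555/12345678"},
    {"source_id": "reddit", "obj_id": "https://doi.org/10.5555/12345678"},
    {"source_id": "newsfeed", "obj_id": "https://doi.org/10.5555/87654321"},
]

# This consumer trusts only two of the three sources.
trusted = filter_by_source(sample_events, ["twitter", "newsfeed"])
```

Someone studying scholarly spam would simply skip this step and work with every Event.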
Another, which I talked about at last year’s PIDapalooza, is how we track article landing pages. Read the blog post, the user guide or hop in a time machine if you’re interested.
The thing about bridges…
… is that they help people get where they’re going. With a few notable exceptions, they’re not the main attraction. We play a humble part in scholarly publishing, helping collect and distribute metadata. Most of what we do goes unseen, and helps people create tools, platforms and research. Event Data is an API, and whilst we hope people will build all kinds of things with it, including altmetrics tools, we’re not making another metric.
All of which brings me to my talk, which I’m giving on Wednesday: Bridging persistent and not-so-persistent identifiers. I would tell you about it, but there isn’t much more left to say.
If you want to find out more, we’re currently in Beta, and open for business. Head over to the User Guide to get started!