CrossRef Text and Data Mining

Join a free Introduction to CrossRef Text and Data Mining webinar.

Date: Thursday, October 22, 2015
Time: 8:00 am (San Francisco), 11:00 am (New York), 4:00 pm (London)
Moderator: Rachael Lammey

Or listen to a pre-recorded webinar.

Researchers are increasingly interested in text and data mining published scholarly content. This poses technical and logistical problems for scholarly researchers and publishers alike.

  • Researchers find it impractical to negotiate multiple bilateral agreements with subscription-based publishers in order to get authorisation to text and data mine subscribed content.
  • Subscription-based publishers find it impractical to negotiate multiple bilateral agreements with researchers and institutions in order to authorise text and data mining of subscribed content.
  • All parties would benefit from support of standard APIs and data representations in order to enable text and data mining across both open access and subscription-based publishers.

How Does CrossRef’s Text and Data Mining Service Work?

The CrossRef Text and Data Mining service addresses these problems by providing the CrossRef Metadata API, which researchers can use to access the full text of content identified by CrossRef DOIs across publisher sites, regardless of each publisher’s business model. The service is free for researchers and the public to use.

CrossRef Metadata API

Researchers who wish to text and data mine published literature (text and data mining users) have no common or simple way of accessing the full text of the content they wish to mine. This is true of both subscription-based and open access content. Consequently, users who want to mine the content currently do so in one of two ways:

  • They negotiate with relevant publishers to have the content delivered to them, either via physical media or bulk data transfer (e.g. FTP).
  • They “screen-scrape” the publisher’s website.

The problem with the first option is that it doesn’t scale well across multiple publishers and text and data mining users. It also presents synchronisation problems if the text and data mining users want an ongoing feed of refreshed content.

The problem with the second option is that “screen scraping” is an inefficient, fragile and error-prone mechanism for identifying and downloading full text. Screen scrapers put a large performance burden on web sites and, at the same time, any slight change to a web site can break the tool that is doing the screen scraping.

The CrossRef Metadata API has three basic subcomponents:

  • A common mechanism for providing automated text and data mining tools with direct links to full text on the publisher’s site.
  • An optional common mechanism for rate-limiting automated text and data mining tools using HTTP headers.
  • A common mechanism for recording license information in CrossRef metadata so that researchers can see whether the license the content is published under permits text and data mining.
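
As a rough illustration of how a mining tool might use these three mechanisms, the Python sketch below reads full-text links and license URLs from a metadata record and works out how long to pause between downloads. It is a sketch under assumptions, not a definitive client: the field names (link, license, URL, content-type) and the header names (CR-TDM-Rate-Limit-Remaining, CR-TDM-Rate-Limit-Reset) should be checked against the technical documentation, and the sample record, its DOI and the publisher URLs are invented for the example.

```python
import time

# Hypothetical metadata record, shaped like the JSON object the CrossRef
# Metadata API is assumed to return for one DOI. All values are made up.
SAMPLE_RECORD = {
    "DOI": "10.5555/12345678",
    "link": [
        {"URL": "https://publisher.example/fulltext/12345678.xml",
         "content-type": "application/xml"},
        {"URL": "https://publisher.example/fulltext/12345678.pdf",
         "content-type": "application/pdf"},
    ],
    "license": [
        {"URL": "https://creativecommons.org/licenses/by/4.0/"},
    ],
}


def full_text_links(record, content_type=None):
    """Return full-text URLs from a record, optionally filtered by MIME type."""
    return [
        item["URL"]
        for item in record.get("link", [])
        if content_type is None or item.get("content-type") == content_type
    ]


def license_urls(record):
    """Return the license URLs recorded for the content."""
    return [item["URL"] for item in record.get("license", [])]


def seconds_to_wait(headers, now=None):
    """Return how many seconds to pause before the next full-text download.

    Assumes `headers` is a plain dict of response headers and that the
    publisher sends the optional (assumed) rate-limit headers:
      CR-TDM-Rate-Limit-Remaining: downloads left in the current window
      CR-TDM-Rate-Limit-Reset:     window reset time, in epoch seconds
    """
    now = time.time() if now is None else now
    remaining = headers.get("CR-TDM-Rate-Limit-Remaining")
    reset = headers.get("CR-TDM-Rate-Limit-Reset")
    if remaining is None or reset is None:
        return 0.0  # no rate-limit headers were sent
    if int(remaining) > 0:
        return 0.0  # downloads are still available in this window
    return max(0.0, float(reset) - now)
```

A tool would typically check license_urls before downloading anything, then call seconds_to_wait on the headers of each full-text response and sleep for the returned time before the next request.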

The CrossRef Metadata API is free to use by researchers and the public. If publishers require researchers to agree to a supplementary license in order to text and data mine their content, this can also be accommodated through the system.

For technical details see:

Using CrossRef for text and data mining

This service will allow researchers to easily harvest content for text and data mining analysis using a standard API across all publishers. The system builds on well-defined web standards and best practices such as the DOI and content negotiation. The system also allows the text and data mining researcher to easily choose whether they want to make use of open access content, subscription-based content or both.
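
The sketch below shows, in Python, what a DOI content negotiation request and a simple open access filter might look like. It rests on several assumptions that should be checked against the technical documentation: the media type in CROSSREF_XML, the use of https://doi.org/ as the resolver address, and the shape of the license field. The rule that treats Creative Commons license URLs as "open" and everything else as "subscription" is a hypothetical simplification for illustration only, and the request is built but never sent.

```python
import urllib.parse
import urllib.request

# Assumed media type for asking for CrossRef metadata via DOI content
# negotiation; verify it before relying on it.
CROSSREF_XML = "application/vnd.crossref.unixsd+xml"

# Hypothetical rule: license URLs with these prefixes count as open access.
OPEN_PREFIXES = (
    "http://creativecommons.org/licenses/",
    "https://creativecommons.org/licenses/",
)


def build_metadata_request(doi):
    """Build (but do not send) a content-negotiated request for a DOI."""
    url = "https://doi.org/" + urllib.parse.quote(doi, safe="/")
    return urllib.request.Request(url, headers={"Accept": CROSSREF_XML})


def is_open(record):
    """True if any license URL on the record matches an open prefix."""
    return any(
        item.get("URL", "").startswith(OPEN_PREFIXES)
        for item in record.get("license", [])
    )


def select_records(records, mode="both"):
    """Filter records by mode: "open", "subscription" or "both".

    "subscription" here simply means "not recognised as open"; it also
    covers records that carry no license information at all.
    """
    if mode == "both":
        return list(records)
    if mode == "open":
        return [r for r in records if is_open(r)]
    if mode == "subscription":
        return [r for r in records if not is_open(r)]
    raise ValueError("mode must be 'open', 'subscription' or 'both'")
```

Passing the request to urllib.request.urlopen would perform the actual lookup; keeping request construction separate from sending makes the logic easy to check without network access.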

Interested in participating?

Please get in touch using our contact form. See a list of our current CrossRef TDM participants.


View the most Frequently Asked Questions.

FAQs for researchers.

FAQs for publishers.

News Release: CrossRef Text and Data Mining Services Simplify Researcher Access - 29 May, 2014

If you need additional information please contact

Updated June 13, 2015

Copyright 2015, PILA, Inc. All rights reserved.