Helping researchers identify content they can text mine

Geoffrey Bilder – 2020 April 16

TL;DR

Many organisations are doing what they can to aid in the response to the COVID-19 pandemic. Crossref members can make it easier for researchers to identify, locate, and access content for text mining. To do this, members must include elements in their metadata that:

Point to the full text of the content.
Indicate that the content is available under an open access license or that it is being made available for free (gratis).

How to do it.

If your content is open access

Make sure the Crossref metadata for all of your open access content includes:

The URL of the open access license the content is under.
A URL that points to the full text of the content on your site (PDF, XML or HTML).

Instructions for including license and full text URLs in your metadata.
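As a rough way to check the result, the sketch below inspects one work record as returned by Crossref's REST API and reports which of the two pieces is missing. It assumes the record exposes licenses under a `license` list and full text links under a `link` list, each entry carrying a `URL` key. The function names and the sample record are hypothetical, for illustration only.

```python
def license_and_full_text(work):
    """Return (license_urls, full_text_urls) found in one work record.

    Assumes the REST API layout: `license` and `link` are lists of
    dicts, each with a `URL` key.
    """
    license_urls = [entry["URL"] for entry in work.get("license", []) if "URL" in entry]
    full_text_urls = [entry["URL"] for entry in work.get("link", []) if "URL" in entry]
    return license_urls, full_text_urls


def missing_for_text_mining(work):
    """List which of the two required pieces of metadata are absent."""
    license_urls, full_text_urls = license_and_full_text(work)
    missing = []
    if not license_urls:
        missing.append("license URL")
    if not full_text_urls:
        missing.append("full text URL")
    return missing


# Hypothetical record, trimmed to the fields used here.
sample_work = {
    "DOI": "10.5555/example",
    "license": [{"URL": "https://creativecommons.org/licenses/by/4.0/"}],
    "link": [{"URL": "https://example.org/article.pdf",
              "content-type": "application/pdf"}],
}

print(missing_for_text_mining(sample_work))
```

An empty list means both the license URL and a full text URL are present in the record.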

If you are making subscription content available for text mining (temporarily or otherwise).

Make sure the Crossref metadata for the content you are making freely available for text mining includes:

The URL of the publisher license the content is under.
A URL that points to the full text of the content where it is being made freely available (PDF, XML or HTML). This might not be on your site.

Instructions for including license and full text URLs in your metadata.

In addition, you need to flag the content that you are making freely available.

A “free to read” element in the access indicators section of your metadata indicating that the content is being made available free-of-charge (gratis).
An assertion element indicating that the content being made available is available free-of-charge.

Instructions for flagging your content as “free”

Note that the assertion element is required in order for users to be able to find content marked as “gratis” in Crossref’s REST API.

If you later decide to revoke the free access, you will need to update the metadata to reflect that restrictions have been reimposed.
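To confirm that a record carries the flag (or that it has been removed after access is revoked), something like the following could be run against a work record from the REST API. It is a minimal sketch that assumes assertions appear in the record as an `assertion` list of dicts with a `name` key, and that the flag uses the name `free`, as the REST API filter `assertion:free` suggests. The sample records and the function name are hypothetical.

```python
def is_flagged_free(work):
    """Return True if the record carries an assertion named 'free'.

    Assumes the REST API layout: `assertion` is a list of dicts,
    each with a `name` key.
    """
    for assertion in work.get("assertion", []):
        if assertion.get("name") == "free":
            return True
    return False


# Hypothetical records: one flagged, one after the flag was removed.
flagged = {
    "DOI": "10.5555/example",
    "assertion": [{"name": "free", "value": "placeholder text"}],
}
revoked = {"DOI": "10.5555/example", "assertion": []}

print(is_flagged_free(flagged), is_flagged_free(revoked))
```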

Sounds great. Has anybody else actually done this?

Yes.

Over 43 million metadata records already have a license and a full text link. https://api.crossref.org/works?filter=has-license:true,has-full-text:true&rows=0

Millions of the above items have one of the Creative Commons licenses or a dedicated text and data mining license provided by the publisher.

And in the past three weeks (as of the writing of this blog post) over 23,000 articles have been flagged as “free” so they are available for text mining.

https://api.crossref.org/v1/works?filter=assertion:free,has-full-text:true
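For researchers who want to script these queries, here is a minimal sketch using only the Python standard library. It builds the same kind of filter URL, reads the match count, and lists the full text links on one page of results. It assumes the JSON response wraps results in `message`, with `total-results` and `items` inside it, and that each item lists full text links under `link`. The helper names are hypothetical, and `fetch` is shown for completeness but is not called here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_ROOT = "https://api.crossref.org/works"


def works_url(filters, rows=0):
    """Build a /works query URL from a dict of filter names and values."""
    filter_value = ",".join(f"{name}:{value}" for name, value in filters.items())
    # Keep ':' and ',' unescaped so the URL reads like the ones in the post.
    return f"{API_ROOT}?{urlencode({'filter': filter_value, 'rows': rows}, safe=':,')}"


def fetch(url):
    """Download and decode one JSON response (performs a network request)."""
    with urlopen(url) as response:
        return json.load(response)


def total_results(response_json):
    """Number of records matching the query."""
    return response_json["message"]["total-results"]


def text_mining_links(response_json):
    """Yield (DOI, URL) pairs for every full text link on one page of results."""
    for item in response_json["message"].get("items", []):
        for link in item.get("link", []):
            yield item.get("DOI"), link.get("URL")


count_url = works_url({"has-license": "true", "has-full-text": "true"})
free_url = works_url({"assertion": "free", "has-full-text": "true"}, rows=20)
print(count_url)
print(free_url)
```

Calling `total_results(fetch(count_url))` would return the current count, and `list(text_mining_links(fetch(free_url)))` the links for the first 20 flagged records.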
