Python and Ruby Libraries for accessing the Crossref API


Scott Chamberlain – 2016 March 04

I’m a co-founder with rOpenSci, a non-profit that focuses on making software to facilitate reproducible and open science. Back in 2013 we started to make an R client working with various Crossref web services. I was lucky enough to attend last year’s Crossref annual meeting in Boston, and gave one talk on details of the programmatic clients, and another higher level talk on text mining and use of metadata for research.

Crossref has a newish API encompassing works, journals, members, funders and more (check out the API docs), as well as a few other services. Essential to making the Crossref APIs easily accessible—and facilitating easy tool/app creation and exploration—are programmatic clients for popular languages. I’ve maintained an R client for a while now, and have been working on Python and Ruby clients for the past four months or so.

The R client falls squarely into the analytics/research use cases, while the Python and Ruby clients are well suited to general data access and use in web applications (as is the Javascript library listed below).

I’ve tried to make each client idiomatic for its language. As a result, data outputs generally do not correspond across clients. I have, however, tried to keep method names similar across Ruby and Python. The R client is quite a bit older, so its method names differ from the other clients, and I’m reluctant to change them because that would break current users’ projects. In addition, R users are likely to want a data.frame (i.e., a table) of results, so that is what we return, whereas in Python and Ruby we return dictionaries and hashes, respectively.

Crossref clients

Python:
- Source: https://github.com/sckott/habanero
- PyPI: https://pypi.python.org/pypi/habanero
Ruby:
- Source: https://github.com/sckott/serrano
- Rubygems: https://rubygems.org/gems/serrano
- serrano also comes with a command line tool of the same name that’s installed when you install serrano (examples below)
R:
- Source: https://github.com/ropensci/rcrossref
- CRAN: https://cran.rstudio.com/web/packages/rcrossref/
Javascript:
- Source: https://github.com/scienceai/crossref
- NPM: https://www.npmjs.com/package/crossref

I’ll cover the Python, Ruby, and R libraries below.

Installation

Python

on the command line

pip install habanero

Ruby

on the command line

gem install serrano

R

in an R session

install.packages("rcrossref")

Examples

Output is indicated by the syntax #> in all examples below.

Python

in a Python REPL (e.g., IPython)

Import the Crossref module from within habanero, and initialize a client

from habanero import Crossref
cr = Crossref()

Query for the phrase “ecology”

x = cr.works(query = "ecology", limit = 5)

Index to various parts of the output

x['message']['total-results']
#> 276188

Extract similar data items from each result. The records are in the “items” slot

[ z['DOI'] for z in x['message']['items'] ]
#> [u'10.1002/(issn)1939-9170',
#>  u'10.4996/fireecology',
#>  u'10.5402/ecology',
#>  u'10.1155/8641',
#>  u'10.1111/(issn)1439-0485']

For some methods, habanero requires you to instantiate a client. When you do, you can set a base URL and an API key. This is a forward-looking feature, as the Crossref API does not currently require an API key.

Note: I’ve tried to make sure habanero is Python 2 and 3 compatible. Hopefully you’ll find that’s true.
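When pulling one field out of each result, as in the list comprehension above, it can help to tolerate records that lack that field. The sketch below assumes only the response shape shown earlier (a `message` dictionary holding `total-results` and an `items` list). The `sample` dictionary is hand-built from the DOIs in the output above, and `pluck` is a hypothetical helper, not part of habanero.

```python
# Hand-built stand-in for a works response; only the shape
# (message -> items -> DOI) is taken from the example above.
sample = {
    "message": {
        "total-results": 276188,
        "items": [
            {"DOI": "10.1002/(issn)1939-9170"},
            {"DOI": "10.4996/fireecology"},
        ],
    }
}


def pluck(response, field, default=None):
    """Return `field` from every item, or `default` where it is absent."""
    return [item.get(field, default) for item in response["message"]["items"]]


dois = pluck(sample, "DOI")
issns = pluck(sample, "ISSN")  # absent in this sample, so defaults are returned
print(dois)
print(issns)
```

Using `dict.get` here means a missing key yields `None` instead of raising a `KeyError` partway through the results.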

Ruby

in a Ruby REPL (e.g., pry), load serrano

require 'serrano'

Query for “peerj” on the journals route

x = Serrano.journals(query: "peerj")

Collect just the ISSNs from each result

x['message']['items'].collect { |z| z['ISSN'] }
#> => [["2376-5992"], ["2167-8359"]]

Shell

The serrano command line tool is quite powerful if you are used to working on the command line.

Here, search for one article; summary data is shown.

serrano works 10.1371/journal.pone.0033693
#> DOI: 10.1371/journal.pone.0033693
#> type: journal-article
#> title: Methylphenidate Exposure Induces Dopamine Neuron Loss and Activation of Microglia in the Basal Ganglia of Mice

There’s also a --json flag to give back JSON data, which can be parsed with the command line tool jq.

serrano works --filter=has_full_text:true --json --limit=5 | jq '.message.items[].link[].URL'
#> "http://api.wiley.com/onlinelibrary/tdm/v1/articles/10.1002%2F9781119208082.ch9"
#> "http://api.wiley.com/onlinelibrary/tdm/v1/articles/10.1002%2F9781119208082.index"
#> "http://api.wiley.com/onlinelibrary/tdm/v1/articles/10.1002%2F9781119208082.ch11"
#> "http://api.wiley.com/onlinelibrary/tdm/v1/articles/10.1002%2F9781119208082.ch15"
#> "http://api.wiley.com/onlinelibrary/tdm/v1/articles/10.1002%2F9781119208082.ch4"
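For readers less familiar with jq, the filter `.message.items[].link[].URL` walks every item, then every entry in that item's `link` list, and prints its `URL`. A rough Python equivalent is sketched below. The `payload` string is a hand-made stand-in for the `--json` output (its URLs are placeholders, not real results), `full_text_urls` is a hypothetical name, and treating `link` as optional is an assumption.

```python
import json

# Placeholder standing in for `serrano works --json` output; URLs are made up.
payload = """
{"message": {"items": [
  {"link": [{"URL": "http://example.org/a.pdf"},
            {"URL": "http://example.org/a.xml"}]},
  {"link": [{"URL": "http://example.org/b.pdf"}]}
]}}
"""


def full_text_urls(raw_json):
    """Collect every link URL from every item, in document order."""
    data = json.loads(raw_json)
    return [
        link["URL"]
        for item in data["message"]["items"]
        for link in item.get("link", [])  # assumption: skip items without links
    ]


urls = full_text_urls(payload)
for url in urls:
    print(url)
```

In a pipeline the JSON would more likely be read from standard input with `json.load(sys.stdin)` than from a string literal.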

R

In an R session, load rcrossref

library("rcrossref")

Search the works route for the phrase “science”

res <- cr_works(query = "science", limit = 5)
#> $meta
#>   total_results search_terms start_index items_per_page
#> 1       4333827      science           0              5
#>
#> $data
#> Source: local data frame [5 x 23]
#>
#>   alternative.id container.title    created  deposited                        DOI funder    indexed
#>            (chr)           (chr)      (chr)      (chr)                      (chr)  (chr)      (chr)
#> 1                                2013-11-21 2013-11-21            10.1126/science <NULL> 2015-12-27
#> 2                  Science Askew 2004-11-26 2013-12-16 10.1887/0750307145/b426c18 <NULL> 2015-12-24
#> 3                                2006-04-10 2010-07-30    10.1002/(issn)1557-6833 <NULL> 2015-12-25
#> 4                                2013-08-27 2013-08-27    10.1002/(issn)1469-896x <NULL> 2015-12-27
#> 5                                2013-12-19 2013-12-19                10.5152/bs. <NULL> 2015-12-28
#> Variables not shown: ISBN (chr), ISSN (chr), issued (chr), link (chr), member (chr), prefix (chr), publisher
#>   (chr), reference.count (chr), score (chr), source (chr), subject (chr), title (chr), type (chr), URL
#>   (chr), assertion (chr), author (chr)
#>
#> $facets
#> NULL

Index through to get the DOIs

res$data$DOI
#> [1] "10.1126/science"            "10.1887/0750307145/b426c18" "10.1002/(issn)1557-6833"
#> [4] "10.1002/(issn)1469-896x"    "10.5152/bs."

rcrossref also has faster versions of most functions, named with a trailing underscore (e.g., cr_works_()), which only make the HTTP request and give back JSON.

Comparison of Crossref Client Methods

After installing and loading the libraries above, the following methods are available.

<table>
  <tr><th>API Route</th><th>Python</th><th>Ruby</th><th>R</th></tr>
  <tr><td>works</td><td><code>cr.works()</code></td><td><code>Serrano.works()</code></td><td><code>cr_works()</code></td></tr>
  <tr><td>members</td><td><code>cr.members()</code></td><td><code>Serrano.members()</code></td><td><code>cr_members()</code></td></tr>
  <tr><td>funders</td><td><code>cr.funders()</code></td><td><code>Serrano.funders()</code></td><td><code>cr_funders()</code></td></tr>
  <tr><td>types</td><td><code>cr.types()</code></td><td><code>Serrano.types()</code></td><td><code>cr_types()</code></td></tr>
  <tr><td>licenses</td><td><code>cr.licenses()</code></td><td><code>Serrano.licenses()</code></td><td><code>cr_licenses()</code></td></tr>
  <tr><td>journals</td><td><code>cr.journals()</code></td><td><code>Serrano.journals()</code></td><td><code>cr_journals()</code></td></tr>
  <tr><td>registration agency</td><td><code>cr.registration_agency()</code></td><td><code>Serrano.registration_agency()</code></td><td><code>cr_agency()</code></td></tr>
  <tr><td>random DOIs</td><td><code>cr.random_dois()</code></td><td><code>Serrano.random_dois()</code></td><td><code>cr_r()</code></td></tr>
</table>

Other Crossref Services

<table>
  <tr><th>Service</th><th>Python</th><th>Ruby</th><th>R</th></tr>
  <tr><td>content negotiation</td><td><code>cn.content_negotiation()</code><a href="#footnote-1">[1]</a></td><td><code>Serrano.content_negotiation()</code></td><td><code>cr_cn()</code></td></tr>
  <tr><td>CSL styles</td><td><code>cn.csl_styles()</code><a href="#footnote-1">[1]</a></td><td><code>Serrano.csl_styles()</code></td><td><code>get_styles()</code></td></tr>
  <tr><td>citation count</td><td><code>counts.citation_count()</code><a href="#footnote-2">[2]</a></td><td><code>Serrano.citation_count()</code></td><td><code>cr_citation_count()</code></td></tr>
</table>

[1] from habanero import cn

[2] from habanero import counts

Features

These are supported in all 3 libraries:

Filters (see below)
Deep paging (see below)
Pagination
Verbose curl output

Filters

Filters (see the API docs for details) are a powerful way to narrow a query to exactly what you want. In the Crossref API, filters are passed as a query parameter made up of comma-separated name:value pairs, like filter=has-orcid:true,is-update:true. In the client libraries, filters are passed in a form idiomatic to each language.

Python

from habanero import Crossref
cr = Crossref()
cr.works(filter = {'award_number': 'CBET-0756451', 'award_funder': '10.13039/100000001'})

Ruby

require 'serrano'
Serrano.works(filter: {award_number: 'CBET-0756451', award_funder: '10.13039/100000001'})

R

library("rcrossref")
cr_works(filter=c(award_number='CBET-0756451', award_funder='10.13039/100000001'))

Note how the syntax is quite similar among languages, though keys don’t have to be quoted in Ruby and R, and in R you pass a vector or list rather than a dictionary (Python) or hash (Ruby).
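To make the link between a language-level filter collection and the API's comma-separated `filter=` value concrete, here is a minimal sketch of the kind of translation a client has to perform. It is not habanero's implementation: `build_filter_param` is a hypothetical helper, and the underscore-to-hyphen mapping is an assumption chosen to reproduce the `has-orcid:true,is-update:true` example above. The real clients may map some names differently.

```python
def build_filter_param(filters):
    """Join a dict of filters into comma-separated `name:value` pairs."""
    parts = []
    for name, value in filters.items():
        # Assumption: underscores in Python-friendly names map to hyphens.
        api_name = name.replace("_", "-")
        # Booleans are written in lowercase, as in `has-orcid:true`.
        if isinstance(value, bool):
            value = str(value).lower()
        parts.append("{}:{}".format(api_name, value))
    return ",".join(parts)


param = build_filter_param({"has_orcid": True, "is_update": True})
print(param)  # has-orcid:true,is-update:true
```

The resulting string is what would be sent as the value of the `filter` query parameter.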

All 3 clients have helper functions to show you what filters are available and what the options are for each filter.

<table>
  <tr><th>Action</th><th>Python</th><th>Ruby</th><th>R</th></tr>
  <tr><td>Filter names</td><td><code>filters.filter_names</code><a href="#footnote-3">[3]</a></td><td><code>Serrano::Filters.names</code></td><td><code>filter_names()</code></td></tr>
  <tr><td>Filter details</td><td><code>filters.filter_details</code><a href="#footnote-3">[3]</a></td><td><code>Serrano::Filters.filters</code></td><td><code>filter_details()</code></td></tr>
</table>

[3] from habanero import filters

Deep paging

Sometimes you want a lot of data. The Crossref API has parameters for paging (see rows and offset), but large values of either can lead to long response times and potentially timeouts (i.e., request failure). For large data volumes, the API offers a deep paging feature, made possible by Solr’s cursor feature (e.g., blog post on it). Here’s a rundown of how to use it:

cursor: each method in each client library that allows deep paging has a cursor parameter; setting it to * tells the Crossref API that you want deep paging.
cursor_max: each response comes back with a cursor value that is used to make the next request, so the client needs to be told when to stop. The additional parameter cursor_max sets the number of results you want back.
limit: when not deep paging, this parameter sets the number of results to return. When deep paging, it sets the chunk size, i.e., the number of results per request. (Note that the maximum value for this parameter is 1000.)

For example, cursor = "*" states that you want deep paging, cursor_max sets the maximum number of results you want back, and limit determines how many results to fetch per request.

Python

from habanero import Crossref
cr = Crossref()
cr.works(query = "widget", cursor = "*", cursor_max = 500)

Ruby

require 'serrano'
Serrano.works(query: "widget", cursor: "*", cursor_max: 500)

R

library("rcrossref")
cr_works(query = "widget", cursor = "*", cursor_max = 500)
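To illustrate how cursor, cursor_max, and limit interact, here is a minimal offline sketch of a deep paging loop. It makes no network requests. `fetch_page` is a hypothetical stand-in for a single API call, `FAKE_RECORDS` is made-up data, and the cursor is modeled as a plain offset, whereas a real cursor is an opaque token returned by the server. Trimming to exactly cursor_max results is also an assumption about client behavior.

```python
# Made-up record store standing in for the server side.
FAKE_RECORDS = list(range(1, 1201))


def fetch_page(cursor, limit):
    """Hypothetical single request: return (items, next_cursor)."""
    start = 0 if cursor == "*" else int(cursor)
    items = FAKE_RECORDS[start:start + limit]
    return items, str(start + len(items))


def deep_page(cursor="*", cursor_max=500, limit=200):
    """Keep requesting chunks of `limit` until `cursor_max` results are held."""
    results = []
    while len(results) < cursor_max:
        # Each response supplies the cursor to use for the next request.
        items, cursor = fetch_page(cursor, limit)
        if not items:  # nothing left to fetch
            break
        results.extend(items)
    # Assumption: trim any overshoot from the final chunk.
    return results[:cursor_max]


records = deep_page(cursor="*", cursor_max=500, limit=200)
print(len(records))  # 500
```

With these values the loop makes three requests of 200 results each and trims the total to 500. Asking for more than the store holds simply returns everything available.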

Text mining clients

Just a quick note that I’ve begun a few text-mining clients for Python and Ruby, focused on using the low level clients discussed above.

Python: https://github.com/sckott/pyminer
Ruby: https://github.com/sckott/textminer

Do try them out!
