As part of our blog post series on the Crossref REST API, we talked to Silvio Peroni and David Shotton of OpenCitations (OC) about the work they’re doing, and how they’re using the Crossref REST API as part of their workflow.
OpenCitations employs Semantic Web technologies to create an open repository of the citation data that publishers have made available. This repository, called the OpenCitations Corpus (OCC), contains RDF-based scholarly citation data that are made freely available so that others may use and build upon them. All the resources published by OC – namely the data within the OCC, the ontologies describing the data, and the software developed to build the OCC – are available to the public with open licenses.
What problem is your service trying to solve?
OC was started to address the lack of RDF-based open citation data. To our knowledge, when the project formally started with Jisc funding in 2010, the prototype OCC was the first RDF-based dataset of open citation data.
We collect accurate scholarly citation data derived from bibliographic references harvested from the scholarly literature, so as to make them available under a Creative Commons public domain dedication (CC0) by means of Semantic Web technologies, thus making them findable, accessible, interoperable, and re-usable, as well as structured, separable, and open.
OCC citation data are described using standard and/or well-known vocabularies, including the SPAR Ontologies, PROV-O, the Data Catalog Vocabulary, and VoID. The use of these vocabularies is described in the OCC metadata document, and is implemented by means of the OpenCitations Ontology (OCO).
The OCC resources are made available and accessible in different ways, so as to facilitate their reuse in different contexts: as monthly dumps, via the SPARQL endpoint, and by accessing them directly by means of the HTTP URIs of the stored resources (via content negotiation).
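As an illustration of the content negotiation route, the sketch below builds (but does not send) an HTTP request that asks for a particular RDF serialization through the Accept header. The resource URI and the list of media types are assumptions made for the example, not a statement of what the OCC server supports.

```python
from urllib.request import Request

# Assumed resource URI, used only for illustration; substitute a real one.
RESOURCE_URI = "https://w3id.org/oc/corpus/br/1"

# Media types an RDF client commonly asks for (assumed; the server may
# support a different set).
RDF_MEDIA_TYPES = {
    "json-ld": "application/ld+json",
    "turtle": "text/turtle",
    "rdf-xml": "application/rdf+xml",
}


def build_request(uri, fmt):
    """Build (without sending) a GET request for one RDF serialization."""
    return Request(uri, headers={"Accept": RDF_MEDIA_TYPES[fmt]})


request = build_request(RESOURCE_URI, "turtle")
print(request.get_method(), request.full_url, request.get_header("Accept"))
```

Passing the request to urllib.request.urlopen would perform the actual fetch; that step is left out here so the sketch runs without network access.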
Can you tell us how you are using the Crossref Metadata API at OpenCitations?
At present, basic citation information is retrieved from PubMed Central, and the Crossref API is then used to retrieve additional metadata describing the citing and cited articles, and to disambiguate bibliographic resources and agents by means of the identifiers retrieved (e.g., DOI, ISSN, ISBN, URL, and Crossref member URL). In future, we will retrieve full citation data direct from Crossref.
What metadata values do you pull from the API?
We pull the titles, subtitles, identifiers (e.g. DOI, ISSN, ISBN, URL, and Crossref member URL), author list, publisher, container resources (issue, volume, journal, book, etc.), publication year and pages.
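To make that list concrete, here is a minimal sketch of how those fields might be picked out of a single Crossref work record in Python. It is not SPACIN's code. The function names are hypothetical, and the sample record is invented for the example. In a live call the record would typically be the "message" part of the JSON returned for a work.

```python
def first(values):
    """Return the first item of a list-valued field, or None if empty/missing."""
    return values[0] if values else None


def extract_metadata(work):
    """Pick out the fields listed above from one Crossref work record (a dict)."""
    date_parts = work.get("issued", {}).get("date-parts", [[None]])
    return {
        "title": first(work.get("title")),
        "subtitle": first(work.get("subtitle")),
        "doi": work.get("DOI"),
        "issn": work.get("ISSN", []),
        "isbn": work.get("ISBN", []),
        "url": work.get("URL"),
        "member": work.get("member"),
        # Join given and family names, skipping whichever part is absent.
        "authors": [
            " ".join(p for p in (a.get("given"), a.get("family")) if p)
            for a in work.get("author", [])
        ],
        "publisher": work.get("publisher"),
        "container": first(work.get("container-title")),
        "volume": work.get("volume"),
        "issue": work.get("issue"),
        "year": date_parts[0][0] if date_parts and date_parts[0] else None,
        "pages": work.get("page"),
    }


# Invented sample record, for illustration only.
sample_work = {
    "DOI": "10.1234/example.5678",
    "title": ["An Invented Article Title"],
    "subtitle": [],
    "author": [{"given": "Ada", "family": "Example"}, {"family": "Consortium"}],
    "publisher": "Example Press",
    "container-title": ["Journal of Examples"],
    "volume": "12",
    "issue": "3",
    "page": "45-67",
    "ISSN": ["1234-5678"],
    "URL": "http://dx.doi.org/10.1234/example.5678",
    "member": "http://id.crossref.org/member/0000",
    "issued": {"date-parts": [[2017, 5, 4]]},
}

record = extract_metadata(sample_work)
print(record["title"], record["authors"], record["year"])
```

Missing fields come back as None or an empty list rather than raising an error, which matters because not every record carries every field.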
Have you built your own interface to extract this data?
The SPAR Citation Indexer, a.k.a. SPACIN, is a script and a series of Python classes that allow one to process particular JSON files containing the bibliographic reference lists of papers, produced from the PubMed Central API by another script included in the OpenCitations GitHub repository.
SPACIN processes these JSON files and retrieves additional metadata about all the citing and cited articles by querying the Crossref API, among others. Once SPACIN has retrieved all these metadata, RDF resources are created (or reused, if they have already been added in the past) and stored in the file system in JSON-LD format. They are also uploaded to the OCC triplestore (via the SPARQL UPDATE protocol).
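The create-or-reuse step and the two output forms can be sketched as follows. This is a simplified illustration, not SPACIN itself: the class and function names, the example.org base URI, and the choice of dcterms:title for the title are all assumptions made for the example.

```python
import json

# Namespace prefixes used in the JSON-LD context.
CONTEXT = {
    "fabio": "http://purl.org/spar/fabio/",
    "dcterms": "http://purl.org/dc/terms/",
}


class ResourceIndex:
    """Remembers which DOI already has a resource, so it is reused."""

    def __init__(self, base_uri):
        self.base_uri = base_uri
        self.uri_by_doi = {}

    def get_or_create(self, doi):
        """Return (uri, created) for a DOI, minting a URI only if it is new."""
        key = doi.lower()  # DOIs are case-insensitive
        if key in self.uri_by_doi:
            return self.uri_by_doi[key], False
        uri = f"{self.base_uri}br/{len(self.uri_by_doi) + 1}"
        self.uri_by_doi[key] = uri
        return uri, True


def to_jsonld(uri, title):
    """Describe one bibliographic resource as a JSON-LD node."""
    return {
        "@context": CONTEXT,
        "@id": uri,
        "@type": "fabio:Expression",
        "dcterms:title": title,
    }


def escape_literal(text):
    """Escape backslashes, quotes and newlines for a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def to_sparql_update(uri, title):
    """Express the same statements as a SPARQL INSERT DATA request."""
    return (
        "PREFIX fabio: <http://purl.org/spar/fabio/>\n"
        "PREFIX dcterms: <http://purl.org/dc/terms/>\n"
        "INSERT DATA {\n"
        f"  <{uri}> a fabio:Expression ;\n"
        f'    dcterms:title "{escape_literal(title)}" .\n'
        "}"
    )


index = ResourceIndex("https://example.org/corpus/")  # hypothetical base URI
uri, created = index.get_or_create("10.1234/EXAMPLE.5678")
same_uri, created_again = index.get_or_create("10.1234/example.5678")

document = json.dumps(to_jsonld(uri, 'A "quoted" title'), indent=2)
update = to_sparql_update(uri, 'A "quoted" title')
print(document)
print(update)
```

The second lookup returns the URI minted by the first, which is the "reused" case. In a real pipeline the JSON-LD string would be written to disk and the update string sent to the triplestore's update endpoint.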
How often do you extract/query data?
The entire OpenCitations ingestion workflow is running continuously, processing about half a million citations per month.
What do you do with the metadata once it’s pulled from the API?
All the metadata relevant to bibliographic entities are stored using the OCC metadata model. The terms of this metadata model are collected within an ontology called the OpenCitations Ontology (OCO), which includes several terms from the SPAR Ontologies and other vocabularies. In particular, the following six bibliographic entity types occur in the datasets created by SPACIN:
bibliographic resources (br), class fabio:Expression – resources that either cite or are cited by other bibliographic resources (e.g. journal articles), or that contain such citing/cited resources (e.g. journals);
resource embodiments (re), class fabio:Manifestation – details of the physical or digital forms in which the bibliographic resources are made available by their publishers;
bibliographic entries (be), class biro:BibliographicReference – literal textual bibliographic entries occurring in the reference lists of bibliographic resources;
responsible agents (ra), class foaf:Agent – names of agents having certain roles with respect to the bibliographic resources (i.e. names of authors, editors, publishers, etc.);
agent roles (ar), class pro:RoleInTime – roles held by agents with respect to the bibliographic resources (e.g. author, editor, publisher);
identifiers (id), class datacite:Identifier – external identifiers (e.g. DOI, ORCID, PubMedID) associated with bibliographic resources and agents.
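The six entity types above can be summarised as a small lookup table. The sketch below is only an illustration: the abbreviations and class names come from the list, the namespace URIs are the ones commonly published for these vocabularies, and the helper names are hypothetical.

```python
# Namespace URIs for the prefixes used in the list above.
PREFIXES = {
    "fabio": "http://purl.org/spar/fabio/",
    "biro": "http://purl.org/spar/biro/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "pro": "http://purl.org/spar/pro/",
    "datacite": "http://purl.org/spar/datacite/",
}

# Abbreviation -> (human-readable label, class as prefix:name).
ENTITY_TYPES = {
    "br": ("bibliographic resource", "fabio:Expression"),
    "re": ("resource embodiment", "fabio:Manifestation"),
    "be": ("bibliographic entry", "biro:BibliographicReference"),
    "ra": ("responsible agent", "foaf:Agent"),
    "ar": ("agent role", "pro:RoleInTime"),
    "id": ("identifier", "datacite:Identifier"),
}


def expand(curie):
    """Turn a prefix:name pair into a full class IRI."""
    prefix, local_name = curie.split(":", 1)
    return PREFIXES[prefix] + local_name


def describe(abbreviation):
    """Return the label and full class IRI for one entity type."""
    label, curie = ENTITY_TYPES[abbreviation]
    return {"abbreviation": abbreviation, "label": label, "class": expand(curie)}


for abbreviation in ENTITY_TYPES:
    print(describe(abbreviation))
```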
Do you have plans to enhance your metadata input?
We already handle additional information, such as ORCIDs, which are retrieved by querying the ORCID API for the citing and cited articles included in the OCC. In addition, we are developing scripts to use all the new citation data Crossref now makes available as a consequence of the Initiative for Open Citations (I4OC).
What are the future plans for OpenCitations?
With funding received from the Alfred P. Sloan Foundation, we will shortly extend the current infrastructure and increase the rate of data ingest. Our immediate goal is to raise the ingestion of citation data from about half a million citations per month to about half a million citations per day, roughly a thirtyfold increase. In addition, we plan to analyse the OCC so as to understand the quality of its current data, and to develop new user interfaces, including graph visualizations of citation networks, that will expand the means whereby users can interact with the OpenCitations data.
What else would you like to see our REST API offer?
Categorising articles/journals/any bibliographic resources according to their main discipline (Computer Science, Biology, etc.) and, eventually, by means of subject terms and/or keywords. Additionally, provision of authors’ institutional affiliations and funder information would be extremely valuable.
Thank you Silvio and David!
If you are keen to share what you’re doing with our Metadata APIs, contact firstname.lastname@example.org and share your story.