Bulk metadata formats

CrossRef will distribute the entire metadata repository to organizations that specifically subscribe to these services. The data is made available using two transaction models. Each model produces a slightly different format of data.

FTP/HTTP of ZIP Archives

In this model CrossRef produces one XML file per title, containing all DOIs for that publication. The files conform to the namespace xmlns="http://www.crossref.org/output/3.0", whose schema can be found on the CrossRef web site.

Schema overview

CrossRef runs daily and weekly processes that create files containing the DOIs added or updated during each period. The files are then combined into a compressed archive (zipped Unix tar files). The archives are stored in a set of 7 daily folders, named day0 through day6, and 4 weekly folders, named week0 through week3. After 7 days the daily runs reuse the folders, overwriting the oldest daily data; likewise, after 4 weeks the weekly run overwrites the oldest weekly data. A control file indicates which folders the most recent runs wrote to. In addition, once a month a 'full' run produces a file for each title containing all of its DOIs, regardless of when they were deposited or last updated.
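The folder reuse described above amounts to a round-robin over a fixed number of slots. The following Python sketch illustrates only that idea; the function name and the notion of a numbered run index are assumptions made for illustration, and a real client should read the control file to learn which folder was written most recently rather than compute it.

```python
DAILY_SLOTS = 7   # day0 .. day6
WEEKLY_SLOTS = 4  # week0 .. week3


def folder_for_run(kind: str, run_index: int) -> str:
    """Return the folder a run would write to under simple round-robin reuse.

    `kind` is "day" or "week"; `run_index` is a hypothetical zero-based
    counter of runs of that kind (not something CrossRef publishes).
    """
    slots = {"day": DAILY_SLOTS, "week": WEEKLY_SLOTS}[kind]
    return f"{kind}{run_index % slots}"


# The eighth daily run (index 7) wraps around and overwrites day0.
print([folder_for_run("day", i) for i in range(9)])
```

Because the oldest data is overwritten, a recipient that skips more than 7 daily runs or 4 weekly runs would need the monthly full run to catch up.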

In the XML <publication> element, the attribute 'filedate' gives the date on which the data was created and the attribute 'mode' gives the production mode: daily, weekly or full. On each <doi_element> record, the attribute 'citationid' is an internal database key for the DOI and 'datestamp' is the date on which the DOI was deposited or last updated (useful in weekly or full mode).
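As a rough illustration of reading these attributes, the Python sketch below parses a small hand-made document with the standard library. The sample XML is hypothetical: only the namespace, the element names <publication> and <doi_element>, and the four attribute names come from the description above. The attribute values, the ISO-style date format, and the absence of child elements are assumptions; consult the schema for the real structure.

```python
import xml.etree.ElementTree as ET

NS_URI = "http://www.crossref.org/output/3.0"
PUBLICATION = f"{{{NS_URI}}}publication"
DOI_ELEMENT = f"{{{NS_URI}}}doi_element"

# Hypothetical sample; values and date format are invented for illustration.
SAMPLE = """<publication xmlns="http://www.crossref.org/output/3.0"
             filedate="2008-01-15" mode="weekly">
  <doi_element citationid="12345" datestamp="2008-01-10"/>
  <doi_element citationid="12346" datestamp="2008-01-14"/>
</publication>"""


def read_publication(xml_text, since=None):
    """Return (header, records) for the first <publication> in xml_text.

    If `since` is given, keep only records whose datestamp is >= since.
    The comparison is a plain string comparison, which is only valid if
    dates are ISO-formatted (an assumption, see above).
    """
    root = ET.fromstring(xml_text)
    pub = next(root.iter(PUBLICATION))  # iter() also yields the root itself
    header = {"filedate": pub.get("filedate"), "mode": pub.get("mode")}
    records = []
    for el in pub.iter(DOI_ELEMENT):
        stamp = el.get("datestamp")
        if since is None or (stamp is not None and stamp >= since):
            records.append({"citationid": el.get("citationid"),
                            "datestamp": stamp})
    return header, records


header, records = read_publication(SAMPLE, since="2008-01-12")
print(header, records)
```

Filtering on 'datestamp' is the reason the attribute is useful in weekly or full mode: a recipient can skip records it has already processed from an earlier daily file.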

Sample FTP/HTTP XML 

OAI-PMH

OAI-PMH is a protocol for metadata harvesting; a full description is available at www.openarchives.org. To summarize, the protocol defines a set of HTTP requests that allow a harvester to interact with a repository and pull either selected metadata or all available data. Normally, an OAI-PMH compatible repository provides free and open access to its data. CrossRef, however, has implemented an access control mechanism based on recipients and their IP addresses. Each recipient must be registered with CrossRef. CrossRef member publishers may choose to opt out for a given recipient. Publishers may also opt out or opt in on a title-by-title basis, making only a subset of their metadata available to a specific recipient.
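To make the request model concrete, the Python sketch below builds OAI-PMH ListRecords request URLs. The verb and the argument names (metadataPrefix, set, from, until, resumptionToken) are defined by the OAI-PMH specification. The base URL and the metadata prefix value are placeholders, not CrossRef's actual values, and the sketch only builds URLs; it does not perform the request, which would in any case be refused for unregistered IP addresses.

```python
from urllib.parse import urlencode

# Placeholders: neither value is CrossRef's real endpoint or prefix.
BASE_URL = "https://oai.example.org/OAIHandler"
METADATA_PREFIX = "example_unified"


def list_records_url(set_spec=None, from_date=None, until_date=None,
                     resumption_token=None):
    """Build a ListRecords request URL.

    In OAI-PMH, resumptionToken is an exclusive argument: when it is
    present, no argument other than the verb may accompany it.
    """
    if resumption_token is not None:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": METADATA_PREFIX}
        if set_spec:
            params["set"] = set_spec
        if from_date:
            params["from"] = from_date
        if until_date:
            params["until"] = until_date
    return f"{BASE_URL}?{urlencode(params)}"


print(list_records_url(from_date="2008-01-01"))
print(list_records_url(resumption_token="abc"))
```

A harvester would typically issue the first form once, then keep issuing the second form with each resumptionToken returned by the repository until no further token is supplied.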

By default all members are set as opt-in, including all titles. When CrossRef is made aware that a new recipient intends to subscribe to the PMH service, members are given a window of time in which to opt out. If they take no action, the default opt-in settings are applied.

Dublin Core (DC) is the standard format for metadata distribution by a PMH-compliant repository. However, DC is not expressive enough to represent the full metadata available in CrossRef. While DC is not currently supported by CrossRef's PMH interface, we will likely add support in the near future to achieve compatibility with the specification. Our PMH interface currently generates data formatted using our Unified XML schema.

Unified XML Schema - Overview

Each record retrieved through the PMH interface contains the complete metadata as deposited by the publisher for that DOI. CrossRef's repository supports a three-level hierarchical set organization. At the top level are the publisher sets, where the set identifier is the publisher's DOI prefix. The next level is the publication title, where the set identifier is an internal value assigned by CrossRef for the publication. The third level is the year of publication.
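The OAI-PMH specification expresses set hierarchies as a colon-separated setSpec. Assuming CrossRef's three levels follow that convention (publisher prefix, then title identifier, then year), a set specification could be assembled as in the Python sketch below. The function name and the example identifiers are hypothetical.

```python
def build_set_spec(prefix, title_id=None, year=None):
    """Join the set levels into a colon-separated OAI-PMH setSpec.

    Levels must be supplied top-down: a year is only meaningful
    beneath a title, and a title beneath a publisher prefix.
    """
    if year is not None and title_id is None:
        raise ValueError("a year requires a title identifier")
    parts = [str(prefix)]
    if title_id is not None:
        parts.append(str(title_id))
    if year is not None:
        parts.append(str(year))
    return ":".join(parts)


# Hypothetical identifiers, for illustration only.
print(build_set_spec("10.1234"))                # whole publisher
print(build_set_spec("10.1234", "5678"))        # one title
print(build_set_spec("10.1234", "5678", 2005))  # one year of one title
```

Harvesting at the narrowest level that meets the need keeps individual responses small, which matters when a publisher has deposited a large number of DOIs.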

All PMH identifiers are DOIs expressed using the IETF RFC 4452 "The 'info' URI Scheme for Information Assets with Identifiers in Public Namespaces" (see the info-uri registry).
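In the 'info' URI scheme a DOI is written with the prefix "info:doi/". The Python sketch below converts between the two forms. It is deliberately minimal: it does not attempt case normalization or percent-encoding of unusual characters, and the function names and the example DOI are illustrative only.

```python
INFO_DOI_PREFIX = "info:doi/"


def doi_to_info_uri(doi):
    """Wrap a bare DOI (e.g. 10.1234/abc) as an info URI."""
    return INFO_DOI_PREFIX + doi.strip()


def info_uri_to_doi(identifier):
    """Extract the bare DOI from an info:doi/ identifier."""
    if not identifier.startswith(INFO_DOI_PREFIX):
        raise ValueError(f"not an info:doi identifier: {identifier!r}")
    return identifier[len(INFO_DOI_PREFIX):]


uri = doi_to_info_uri("10.1234/example.5678")  # made-up DOI
print(uri)
print(info_uri_to_doi(uri))
```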

Sample PMH Record 


support@crossref.org