4 minute read.
Increasing Crossref Data Reusability With Format Experiments
Every year, Crossref releases a full public data file of all of our metadata. This is partly a commitment to POSI and partly just what we do. We want the community to re-use our metadata and to find interesting ends to which they can be put!
However, we have also recognized, for some time, that 170GB of compressed .tar.gz files, spread over 27,000 items, is not the easiest of formats with which to work. For instance, there’s no indexing capacity on these files, meaning that it is virtually impossible simply to pull out the record for a DOI. Decompressing the .tar.gz files takes a good three hours or more even on high-end hardware, without any additional processing.
To that end, the Crossref Labs team has been experimenting with different formats for trial release that might allow us to reach broader audiences, including those who have not previously worked with our metadata files. The two new formats, alongside the existing data file format, with which we have been experimenting, are JSON lines and SQLite.
The first format with which we’ve been experimenting is JSON-L (JSON lines). With one JSON entry per line, as opposed to one giant JSON file/block, JSON-L lends itself to better parallelisation in systems such as SPARC, because the data can easily be partitioned.
This data format also has the benefit of being appendable, one line at a time. Unlike conventional JSON, which requires the entire structure to be parsed in-memory before an append is possible, JSON-L can simply be written to and updated. It’s also possible to do multi-threaded write operations on the file, without each thread having to parse the entire JSON structure and then sync with other threads.
In our experiments, JSON-L came with substantial parallelisation benefits. Our routines to calculate citation counts can be completed in ~20-25 minutes. Calculating the number of resolutions per container title takes less than half an hour.
SQLite is a library written in C with client bindings for Python, Java, C#, and many other languages that produces an on-disk, portable, single-file SQL database. You can produce the SQLite file using our openly available Rust program, rustsqlitepacker. We also have a Python script that can produce the final SQLite file, for those happier working in this language.
The resultant SQLite file is approximately 900GB in size, so it requires quite a lot of free disk space to create in the first place (alongside storage of the data file that is needed to build it). However, queries are snappy when looking up by DOI and other indexes can be constructed (the indexing part of the procedure takes about 1.5 hours per field).
The database structure, at present, is the bare minimum that will work. It contains a list of fields for searching/indexing – DOI, URL, member, prefix, type, created, and deposited – and a metadata field that contains the JSON response that would be returned by the API for this value.
This allows for the processing and extraction of individual JSON elements using SQLite’s built-in json_extract method. For example, to get just the title of an item, you can use:
SELECT json_extract(metadata, ‘$.title’) from works WHERE doi=“10.1080/10436928.2020.1709713”;
The balance that we have had to strike here is between flattening the JSON so that more fields are indexable and searchable, as against the trade-off in time and processing that this takes to create the database in the first place. The first draft version of our experiment was wildly ambitious in flattening all the records and using an Object Relation Mapper (ORM) to present Python models of the database. Like painting the Forth Bridge, this initial attempt would not finish in any sane length of time. Indeed, by the time we’d created this year’s data file, we’d need to begin work on the next.
What are the anticipated use cases here? When people need to do an offline metadata search on an embedded device, for instance, the portability and indexed lookup of the SQLite database can be very appealing. One of our team has even got the database running on a Raspberry Pi 5. You can also load the database into Datasette if you want to explore it visually.
Where do we go from here with this? It would be good to flatten a few more fields, but we would welcome feedback on use cases that we haven’t anticipated for SQLite and we’d love to hear whether this is already too unwieldy (at 900GB).
As usual, we will be releasing the annual data file in the next few months. As an experiment this year, we will also be releasing the tools that can be used on that file to produce these alternative file formats. We will consider releasing the final data files for each of these formats, too.
What we would like to hear from the community is whether there are other data file formats that you might wish to use. Are there use cases that we haven’t anticipated? What would you ideally like in terms of file formats?
Related pages and blog posts