
Using AWS S3 as a large key-value store for Chronograph

One of the cool things about working in Crossref Labs is that interesting experiments come up from time to time. One experiment, entitled “what happens if you plot DOI referral domains on a chart?”, turned into the Chronograph project. In case you missed it, Chronograph analyses our DOI resolution logs and shows how many times each DOI link was resolved per month, and also how many times a given domain referred traffic to DOI links per day.

We’ve released a new version of Chronograph. This post explains how it was put together. One for the programmers out there.

Big enough to be annoying

Chronograph sits on the boundary between normal-sized data and large-enough-to-be-annoying-size data. It doesn’t store data for all DOIs (it includes only those that are used on average once a day), but it has information on up to 1 million DOIs per month over about 5 years, and about 500 million data points in total.

Storing 500 million data points is within the capabilities of a well-configured database. In the first iteration of Chronograph a MySQL database was used. But that kind of data starts to get tricky to back up, move around and index.

Every month or two new data comes in for processing, and it needs to be uploaded and merged into the database. Indexes need to be updated. Disk space needs to be monitored. This can be tedious.

Key values

Because the data for a DOI is all retrieved at once, it can be stored together. So instead of a table that looks like

10.5555/12345678    2010-01-01    5
10.5555/12345678    2010-02-01    7
10.5555/12345678    2010-03-01    3

Instead we can store

10.5555/12345678    {"2010-01-01": 5, "2010-02-01": 7, "2010-03-01": 3}

This is much lighter on the indexes and takes much less space to store. However, it means that adding new data is expensive. Every time there’s new data for a month, the structure must be parsed, merged with the new data, serialised and stored again millions of times over.
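A rough sketch of that read-merge-write cycle (in Python; the function and variable names are illustrative, not the actual Chronograph code):

import json

def merge_month(stored_json, new_counts):
    # Parse the DOI's stored blob, fold in the new month's date counts,
    # and serialise it again. This has to happen once per DOI with new
    # data, which is what makes monthly updates expensive.
    counts = json.loads(stored_json) if stored_json else {}
    for date, count in new_counts.items():
        counts[date] = counts.get(date, 0) + count
    return json.dumps(counts, sort_keys=True)

# Example: merging March's figure into the blob shown above.
blob = '{"2010-01-01": 5, "2010-02-01": 7}'
print(merge_month(blob, {"2010-03-01": 3}))
# {"2010-01-01": 5, "2010-02-01": 7, "2010-03-01": 3}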

After trials with MySQL, MongoDB and MapDB, this approach was taken with MySQL in the original Chronograph.

Keep it Simple Storage Service Stupid

In the original version of Chronograph the data was processed using Apache Spark. There are various solutions for storing this kind of data, including Cassandra, time-series databases and so on.

The flip side of being able to do interesting experiments is wanting them to stick around without having to bother a sysadmin. The data is important to us, but we’d rather not have to worry about running another server and database if possible.

Chronograph fits into the category of ‘interesting’ rather than ‘mission-critical’ projects, so we’d rather not maintain expensive infrastructure for it.

I decided to look into using Amazon Web Services Simple Storage Service (AWS S3) to store the data. S3 is itself a key-value store, so it seemed like a good fit. S3 is a great service because, as the name suggests, it’s a simple service for storing a large number of files. It’s cheap, and its capabilities and cost scale well.

However, storing and updating up to 80 million very small keys (one per DOI) isn’t very clever, and certainly isn’t practical. I looked at DynamoDB, but we would still face the overhead of making a large number of small updates.

Is it weird?

In these days of plentiful databases with cheap indexes (and by ‘these days’ I mean the 1970s onward) it seems somehow wrong to use plain old text files. However, the whole Hadoop “Big Data” movement was predicated on a return to batch processing files. Commoditisation of services like S3 and the shift to do more in the browser have precipitated a bit of a rethink. The movement to abandon LAMP stacks and use static site generators is picking up pace. The term ‘serverless architecture’ is hard to avoid if you read certain news sites.

Using Apache Spark (with its brilliant RDD concept) was useful for bootstrapping the data processing for Chronograph, but the new code has an entirely flat-file workflow. Not having to maintain a Hadoop HDFS instance unnecessarily seems to be the right choice in this case.

Repurposing the Wheel

The solution was to use S3 as a big hash table to store the final data that’s served to users.

The processing pipeline uses flat files all the way through, from input log files to projections to aggregations. At the penultimate stage of the pipeline, blocks of CSV are produced per DOI, representing date-value pairs:

10.5555/12345678    2010-01    2010-01-01,05
                               2010-01-02,02
                               2010-01-03,08
                               ...

10.5555/12345678    2010-02    2010-02-01,10
                               2010-02-02,7
                               2010-02-03,22
                               ...

At the last stage, these are combined into a single block of all dates for each DOI:

10.5555/12345678    2010-01-01,05
                    2010-01-02,02
                    2010-01-03,08
                    ...
                    2010-02-01,10
                    2010-02-02,7
                    2010-02-03,22
                    ...
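That combining step is essentially a concatenation per DOI. A minimal sketch, assuming a (doi, month, csv_block) input shape and names that are illustrative only:

from collections import defaultdict

def combine_months(per_month_blocks):
    # per_month_blocks is a list of (doi, month, csv_block) tuples, the shape
    # of the penultimate stage above; concatenate each DOI's blocks in month order.
    combined = defaultdict(list)
    for doi, _month, csv_block in sorted(per_month_blocks):
        combined[doi].append(csv_block)
    return {doi: "\n".join(blocks) for doi, blocks in combined.items()}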

The DOIs are then hashed into 16 bits and stored as chunks of CSV:

day-doi.csv-chunks_8841:

10.1038/ng.3020
2014-06-24,4
2014-06-25,4
2014-06-26,3
...

10.1007/978-94-007-2869-1_7
2012-06-01,12
2012-06-02,8
...

10.1371/journal.pone.0145509
2016-02-01,13
2016-02-02,75
2016-02-03,30
...

There are 65,536 (0x0000 to 0xFFFF) possible files, each with about a thousand DOIs’ worth of data.
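A minimal sketch of that bucketing step, assuming the bucket id is the first four hex characters of the DOI’s MD5 digest and reusing the file-name prefix from the example (the real pipeline may mask different bits; names are illustrative):

import hashlib
from collections import defaultdict

def bucket_for(doi):
    # 16-bit mask of MD5: this sketch uses the first four hex characters
    # of the digest as the bucket id.
    digest = hashlib.md5(doi.encode("utf-8")).hexdigest()
    return f"day-doi.csv-chunks_{digest[:4]}"

def write_chunks(doi_blocks, out_dir="."):
    # doi_blocks maps each DOI to its combined CSV block. Group the blocks
    # under their bucket file and write them out separated by blank lines,
    # as in the example above.
    chunks = defaultdict(list)
    for doi, csv_block in doi_blocks.items():
        chunks[bucket_for(doi)].append(f"{doi}\n{csv_block}")
    for filename, blocks in chunks.items():
        with open(f"{out_dir}/{filename}", "w") as f:
            f.write("\n\n".join(blocks) + "\n")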

When the browser requests data for a DOI, the DOI is hashed and the appropriate file is requested from S3. The browser then performs a linear scan of the file to find the DOI it is looking for.

This is the simplest possible form of hash table: fixed buckets with separate chaining, where each chain is scanned linearly. The hash function is a 16-bit mask of MD5, chosen because MD5 implementations are readily available in the browser. It does a great job of evenly distributing the DOIs over all 65,536 possible files.
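In the browser this is done in JavaScript; the equivalent lookup logic, sketched in Python against a hypothetical bucket URL and the same assumed hashing as above:

import hashlib
from urllib.request import urlopen

# Hypothetical URL; the real S3 bucket name and path will differ.
BASE_URL = "https://chronograph-data.example.s3.amazonaws.com"

def lookup(doi):
    # Hash the DOI to its bucket, fetch that one file from S3, then scan it
    # line by line until the DOI's block is found.
    digest = hashlib.md5(doi.encode("utf-8")).hexdigest()
    url = f"{BASE_URL}/day-doi.csv-chunks_{digest[:4]}"
    text = urlopen(url).read().decode("utf-8")

    rows = []
    in_block = False
    for line in text.splitlines():
        if not line.strip():
            in_block = False          # a blank line ends the current DOI's block
        elif line.strip() == doi:
            in_block = True           # found the header line for our DOI
        elif in_block:
            date, count = line.split(",")
            rows.append((date, int(count)))
    return rows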

Striking the balance

In any data structure implementation, there are balances to be struck. Traditionally these concern memory layout, the shape of the data, practicalities of disk access and CPU cost.

In this instance, the factors in play included the number of buckets that need to be uploaded and the cost of the browser downloading an over-large bucket. The size of the bucket doesn’t matter much for CPU (as far as the user is concerned, it takes about the same time to scan 10 entries as it does 10,000), but it does make a difference whether you ask a user to download a 10 KB bucket or a 10 MB one.

I struck the balance at 65,536 buckets, resulting in files of around 100 KB, which is about the size of a medium-sized image.
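As a rough sanity check on that file size, using the 500 million data points mentioned earlier and an assumed average row size (the bytes-per-row figure is an estimate, not from the post):

# Back-of-envelope only: assumes ~13 bytes per "YYYY-MM-DD,n" CSV row.
data_points = 500_000_000      # "about 500 million data points in total"
bytes_per_row = 13             # rough estimate of one CSV line
buckets = 2 ** 16              # 65,536 chunk files
print(f"{data_points * bytes_per_row / buckets / 1024:.0f} KB per file")   # ~97 KB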

It works

The result is a simple system that allows people to look up data for millions of DOIs, without having to look after another server. It’s also portable to any other file storage service.

The approach isn’t groundbreaking, but it works.


Page owner: Joe Wass   |   Last updated 2016-August-02