April 17, 2012

PDF-Extract

CrossRef Labs is happy to announce the first public release of "pdf-extract", an open source set of tools and libraries for extracting citation references (and, eventually, other semantic metadata) from PDFs. We first demonstrated this tool to CrossRef members at our annual meeting last year. See the pdf-extract labs page for a detailed introduction to this new set of tools.

If you are unable to download and install the tool, you can play with an experimental web interface called "Extracto." Be warned, Extracto is running on a very feeble server using an erratic and slow internet connection. The only guarantee that we can make about using it is that it will repeatedly fall over and annoy you. The weasel has spoken.

November 28, 2011

Turning DOIs into formatted citations

Today two new content types were added to dx.doi.org resolution for CrossRef DOIs. These allow anyone to retrieve DOI bibliographic metadata as formatted bibliographic entries. To perform the formatting we're using the Citation Style Language (CSL) processor citeproc-js, which supports a shed load of citation styles and locales. In fact, all the styles and locales found in the CSL repositories are supported, including many common styles such as bibtex, apa, ieee, harvard, vancouver and chicago.

First off, if you'd like to try citation formatting without using content negotiation, there's a simple web UI that allows input of a DOI, style and locale selection.

If you're more into accessing the web via your favorite programming language, have a look at these content negotiation curl examples. To make a request for the new "text/bibliography" content type:

$ curl -LH "Accept: text/bibliography; style=bibtex" http://dx.doi.org/10.1038/nrd842

@article{Atkins_Gershell_2002, title={From the analyst's couch: Selective anticancer drugs}, volume={1}, DOI={10.1038/nrd842}, number={7}, journal={Nature Reviews Drug Discovery}, author={Atkins, Joshua H. and Gershell, Leland J.}, year={2002}, month={Jul}, pages={491-492}}

A locale can be specified with the "locale" content type parameter, like this:

$ curl -LH "Accept: text/bibliography; style=mla; locale=fr-FR" http://dx.doi.org/10.1038/nrd842

Atkins, Joshua H., et Leland J. Gershell. « From the analyst's couch: Selective anticancer drugs ». Nature Reviews Drug Discovery 1.7 (2002): 491-492.
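If curl isn't your thing, the same requests can be made from most languages. Here is a rough Python sketch of the two calls above, using only the standard library. The helper names `build_accept_header` and `fetch_citation` are ours, made up for illustration, and we assume the response body is UTF-8 text.

```python
import urllib.request


def build_accept_header(style=None, locale=None):
    """Build a "text/bibliography" Accept header with optional parameters."""
    parts = ["text/bibliography"]
    if style:
        parts.append(f"style={style}")
    if locale:
        parts.append(f"locale={locale}")
    return "; ".join(parts)


def fetch_citation(doi, style=None, locale=None):
    """Request a formatted citation for a DOI (performs a network request).

    urlopen follows redirects by default, much like curl's -L flag.
    """
    request = urllib.request.Request(
        f"http://dx.doi.org/{doi}",
        headers={"Accept": build_accept_header(style, locale)},
    )
    with urllib.request.urlopen(request) as response:
        # Assumption: the body is UTF-8 encoded text.
        return response.read().decode("utf-8")


# Example call, mirroring the second curl command above:
# print(fetch_citation("10.1038/nrd842", style="mla", locale="fr-FR"))
```

Keeping the header construction in its own function makes it easy to check the exact string being sent before any request goes out.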

You may want to process metadata through CSL yourself. For this use case, there's another new content type, "application/citeproc+json", that returns metadata in a citeproc-friendly JSON form:

$ curl -LH "Accept: application/citeproc+json" http://dx.doi.org/10.1038/nrd842

{"volume":"1","issue":"7","DOI":"10.1038/nrd842","title":"From the analyst's couch: Selective anticancer drugs","container-title":"Nature Reviews Drug Discovery","issued":{"date-parts":[[2002,7]]},"author":[{"family":"Atkins","given":"Joshua H."},{"family":"Gershell","given":"Leland J."}],"page":"491-492","type":"article-journal"}
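To give a feel for working with that JSON, here is a small Python sketch that parses the response shown above and pulls out a one-line summary. The `summarize` function and its output format are our own invention for illustration, not a CSL style; real formatting is the job of a CSL processor.

```python
import json

# Sample response copied from the curl example above.
SAMPLE = """{"volume":"1","issue":"7","DOI":"10.1038/nrd842","title":"From the analyst's couch: Selective anticancer drugs","container-title":"Nature Reviews Drug Discovery","issued":{"date-parts":[[2002,7]]},"author":[{"family":"Atkins","given":"Joshua H."},{"family":"Gershell","given":"Leland J."}],"page":"491-492","type":"article-journal"}"""


def summarize(item):
    """Return an ad hoc one-line summary; missing fields get placeholders."""
    families = [author.get("family", "?") for author in item.get("author", [])]
    names = ", ".join(families) or "[no author]"
    # "issued" holds a list of date-parts lists, e.g. [[2002, 7]].
    date_parts = item.get("issued", {}).get("date-parts", [[]])
    year = date_parts[0][0] if date_parts and date_parts[0] else "n.d."
    title = item.get("title", "[no title]")
    return f"{names} ({year}). {title}."


print(summarize(json.loads(SAMPLE)))
```

Every lookup goes through `.get` with a fallback, so a record with gaps still produces a summary instead of raising a `KeyError`.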

Finally, to retrieve lists of supported styles and locales, either hit these URLs:

or check out the CSL style and locale repositories.

There's one big caveat to all this. The CSL processor will do its best with CrossRef metadata, which can unfortunately be quite patchy at times. There may be pieces of metadata missing, inaccurate metadata or even metadata items stored under the wrong field, all of which result in odd-looking formatted citations. Most of the time, though, it works.
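If you fetch the citeproc JSON yourself, one way to spot patchy records before formatting them is a quick check for absent or empty fields. The sketch below is only that: the `EXPECTED_FIELDS` list is our guess at what a journal-article citation usually needs, not something the service defines, and it cannot catch inaccurate or misplaced values.

```python
# Hypothetical list of fields a journal-article citation usually needs.
EXPECTED_FIELDS = ("title", "author", "container-title", "issued", "page")


def missing_fields(item, expected=EXPECTED_FIELDS):
    """Return the expected fields that are absent or empty in `item`."""
    return [name for name in expected if not item.get(name)]


# A deliberately patchy record: no authors, journal title, or pages.
patchy = {"title": "Some article", "author": [], "issued": {"date-parts": [[2002, 7]]}}
print(missing_fields(patchy))
```

A non-empty result is a hint that the formatted citation may look odd and is worth eyeballing.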

August 16, 2010

Add linked images to PDFs

While working on an internal project, we developed "pdfstamp", a command-line tool that makes it easy to apply linked images to PDFs. We thought some in our community might find it useful and have released it on GitHub.

Some more PDF-related tools will follow soon.