Warnings, Caveats and Weasel Words

Most of the experiments linked to here are running on R&D equipment in a non-production environment. They may disappear without warning and/or perform erratically. If one of them isn’t working for some reason, come back later and try again.

# pdfmark

## Overview

“pdfmark” is an experimental open source tool that allows you to add Crossref metadata to a PDF. You can add metadata by passing the tool a pre-generated XMP file, or you can apply Crossref bibliographic metadata by passing the command a Crossref DOI as an argument. If you pass it a Crossref DOI, it will automatically look up the metadata for that DOI using the Crossref OpenURL API, generate XMP from that metadata and apply it to the PDF.
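To make the "generate XMP from metadata" step more concrete, here is a minimal Python sketch of what building an XMP packet from a few bibliographic fields might look like. It is not pdfmark's own code: the function name `build_xmp`, the choice of Dublin Core fields and the `doi:` prefix on `dc:identifier` are all assumptions made for illustration, and the XMP that pdfmark actually emits may use different properties.

```python
import xml.etree.ElementTree as ET

# Namespaces used by XMP packets that carry Dublin Core properties.
NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

for _prefix, _uri in NS.items():
    ET.register_namespace(_prefix, _uri)


def _q(prefix, name):
    """Return the namespace-qualified tag name ElementTree expects."""
    return "{%s}%s" % (NS[prefix], name)


def build_xmp(title, creators, doi):
    """Build a small XMP packet (hypothetical field selection)."""
    xmpmeta = ET.Element(_q("x", "xmpmeta"))
    rdf = ET.SubElement(xmpmeta, _q("rdf", "RDF"))
    desc = ET.SubElement(rdf, _q("rdf", "Description"), {_q("rdf", "about"): ""})

    # dc:title is a language alternative, so it is wrapped in rdf:Alt.
    alt = ET.SubElement(ET.SubElement(desc, _q("dc", "title")), _q("rdf", "Alt"))
    ET.SubElement(alt, _q("rdf", "li"), {XML_LANG: "x-default"}).text = title

    # dc:creator is an ordered list, so it is wrapped in rdf:Seq.
    seq = ET.SubElement(ET.SubElement(desc, _q("dc", "creator")), _q("rdf", "Seq"))
    for name in creators:
        ET.SubElement(seq, _q("rdf", "li")).text = name

    # Assumption: the DOI is stored as a plain dc:identifier string.
    ET.SubElement(desc, _q("dc", "identifier")).text = "doi:" + doi

    body = ET.tostring(xmpmeta, encoding="unicode")
    header = '<?xpacket begin="\ufeff" id="W5M0MpCehiHzreSzNTczkc9d"?>'
    return header + "\n" + body + "\n" + '<?xpacket end="w"?>'


print(build_xmp(
    "Strategic Reading, Ontologies, and the Future of Scientific Publishing",
    ["Allen Renear", "Carole Palmer"],
    "10.1126/science.1157784",
))
```

A file produced this way could, in principle, be the kind of pre-generated XMP file mentioned above, though you should check it against what your downstream tools expect before relying on it.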

Note that pdfmark is non-destructive. It will always generate a new PDF with the XMP added to it. Having said this, pdfmark does not re-linearize the resulting file. To re-linearize the PDF, you can use Ghostscript’s pdfopt command or any similar tool (e.g. Acrobat Pro).

“pdfmark” is open source. We have released it in order to encourage publishers and other content producers to start adding embedded bibliographic metadata to their PDFs.

## Why Should I Care?

PDF is widely regarded as a pretty “dumb” file format. That is, it sacrifices semantics in favour of display fidelity. But this doesn’t have to be the case. PDF has evolved over the past years and now supports embedding semantic metadata in the file. This metadata can, in turn, be used by bibliographic management tools, search engines, text mining tools, etc. And, of course, the big advantage of embedding bibliographic metadata in the PDF is that the content and metadata are never separated. The PDF can be copied, emailed and archived and it will always have its metadata with it.

Finally, Tony Hammond has written extensively on XMP in the CrossTech blog. Reading his various posts will give you a very solid background on the pros and cons of XMP.

## How do I use it?

We are assuming that you are at least technical enough to know whether you have a recent version of Java installed on your system and that you are comfortable with the command line. If this doesn’t describe you, then you had better stop here and get your resident geek to help you with this.

So, assuming you are of the geeky persuasion…

If you had a PDF of Allen Renear and Carole Palmer’s Science article, “Strategic Reading, Ontologies, and the Future of Scientific Publishing” and said PDF file was named “renear_palmer.pdf”, simply invoking the following would add the relevant metadata to the PDF:

java -jar pdfmark.jar -d 10.1126/science.1157784 renear_palmer.pdf

If the PDF is “linearized”, then the command will exit with a warning that the resulting PDF will be de-linearized. If you want to force it to generate the new PDF anyway, pass the command the “-f” option like this:

java -jar pdfmark.jar -f -d 10.1126/science.1157784 renear_palmer.pdf
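If you would like to know in advance whether a PDF is linearized (and therefore whether you will need “-f”), a rough check is possible without any PDF library. The Python sketch below relies on the fact that a linearized PDF carries a `/Linearized` entry in a dictionary near the very start of the file. The function names are hypothetical, and this is only a heuristic, not the test pdfmark itself performs.

```python
def looks_linearized(head):
    """Heuristic: True if the leading bytes of a PDF mention /Linearized.

    `head` should be the first bytes of the file; only the first 1024
    are inspected, since that is where the linearization dictionary
    is expected to sit.
    """
    return b"/Linearized" in head[:1024]


def file_looks_linearized(path):
    """Apply the heuristic to a file on disk (reads at most 1024 bytes)."""
    with open(path, "rb") as fh:
        return looks_linearized(fh.read(1024))


# Example with in-memory data standing in for the start of two PDFs.
linearized_head = b"%PDF-1.4\n1 0 obj\n<< /Linearized 1 /L 1234 >>\nendobj\n"
plain_head = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
print(looks_linearized(linearized_head), looks_linearized(plain_head))
```

Treat a positive result as a hint only: a damaged or unusual file could fool a byte search like this one.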

pdfmark will automatically try to fill in the dc:copyright element with the name of the publisher of the PDF. To override this behaviour, use the “--no-copyright” flag.
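If you need to mark up many PDFs, it can help to assemble the command line in a script. The following Python sketch builds the argument list from the options described above. The helper names are hypothetical, the flag is assumed to be spelled with two ASCII hyphens (`--no-copyright`), and placing it before `-d` is likewise an assumption rather than something the tool is known to require.

```python
import subprocess


def build_pdfmark_command(doi, pdf_path, force=False, no_copyright=False,
                          jar="pdfmark.jar"):
    """Return the argument list for one pdfmark invocation."""
    cmd = ["java", "-jar", jar]
    if force:
        # Generate the new PDF even if the input is linearized.
        cmd.append("-f")
    if no_copyright:
        # Assumed spelling of the flag that skips filling in dc:copyright.
        cmd.append("--no-copyright")
    cmd += ["-d", doi, pdf_path]
    return cmd


def run_pdfmark(doi, pdf_path, **options):
    """Run pdfmark and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_pdfmark_command(doi, pdf_path, **options), check=True)


# Show the command without running it.
print(" ".join(build_pdfmark_command(
    "10.1126/science.1157784", "renear_palmer.pdf", force=True)))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with file names that contain spaces.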

Naturally, we are hoping that people will give us feedback or, better yet, patch, debug and build on the source we have released. Send comments, etc. to: