Metadata in PDF: 1. Strategies


thammond – 2007 August 01

In Metadata

Emboldened by my own researches, by the recent handle plugin announcement from CNRI (on which, more in a follow-on post), and by Alexander Griekspoor’s comment to my earlier post, I thought I’d write a more extensive piece about embedding metadata in PDF with a view to the following:

  • Discover what other publishers are currently doing

  • Stimulate discussions between content providers and/or consumers

  • Lay groundwork for a set of Crossref best practice guidelines

Why should Crossref be interested? Well, at minimum to embed the DOI along with the digital asset would seem to be inherently “a good thing”. (And, in fact, this is precisely the approach that CNRI have taken for their plugin demos. I’ll look later at what they actually did and consider whether that is a model that Crossref publishers might usefully follow.)

Why include the DOI as an explicit piece of metadata rather than have it included by virtue of its appearance in a content section? The main reason is that it is then unambiguously accessible. Content sections in PDFs are typically filtered (and sometimes encrypted), whereas metadata is usually plain text and moreover is marked up as to field type.
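To make the “plain text” point concrete, here is a minimal Python sketch. It assumes the PDF carries an unfiltered XMP packet (the usual case, though not guaranteed) and simply scans the raw bytes for the `<?xpacket … ?>` wrappers, then looks for anything DOI-shaped inside. The function names, the loose DOI pattern, the `dc:identifier` field and the sample DOI are all illustrative assumptions, not anything taken from a real workflow.

```python
import re

# XMP packets are wrapped in <?xpacket begin=... ?> and <?xpacket end=... ?>
PACKET_RE = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>", re.DOTALL
)
# Loose, illustrative DOI pattern (an assumption, not an official grammar).
DOI_RE = re.compile(rb"10\.\d{4,9}/[^\s<>\"']+")


def find_xmp_packets(data: bytes) -> list[bytes]:
    """Return the body of every XMP packet found in the raw bytes."""
    return [m.group(1) for m in PACKET_RE.finditer(data)]


def find_dois(data: bytes) -> list[str]:
    """Return DOI-like strings found inside XMP packets only."""
    dois = []
    for packet in find_xmp_packets(data):
        for m in DOI_RE.finditer(packet):
            dois.append(m.group(0).decode("ascii", "replace"))
    return dois


# Hypothetical file contents: an opaque (filtered) content stream followed
# by a plain-text XMP packet carrying a made-up DOI.
SAMPLE = (
    b"%PDF-1.4\n"
    b"stream\nx\x9c\x03\x00\x00\x00\x00\x01\nendstream\n"
    b'<?xpacket begin="\xef\xbb\xbf" id="W5M0MpCehiHzreSzNTczkc9d"?>'
    b"<x:xmpmeta xmlns:x='adobe:ns:meta/'>"
    b"<dc:identifier>doi:10.1234/example.5678</dc:identifier>"
    b"</x:xmpmeta>"
    b'<?xpacket end="w"?>\n'
)

if __name__ == "__main__":
    print(find_dois(SAMPLE))
```

Nothing here needs to understand PDF structure or decode a single content stream, which is rather the point: the identifier is sitting there in the clear and is labelled by field.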

Another question concerns whether to add in the identifier alone, or to embed a full metadata set. Why not just embed the identifier and visit Crossref for the metadata? This is feasible in some cases although it does involve an extra network trip, requires an application to service the identifier and is obviously not workable in offline contexts. Seems like a “no-brainer” to include a fuller description from the outset. Note that publishers frequently make some of this information available anyway in other metadata delivery channels, e.g. RSS feeds.

Metadata in PDF: 2. Use Cases


thammond – 2007 August 01

In Metadata

Well, this is likely to be a fairly brief post as I’m not aware of many use cases of metadata in PDFs from scholarly publishers. Certainly, I can say for Nature that we haven’t done much in this direction yet, although we are now beginning to look into this.

I’ll discuss a couple cases found in the wild but invite comment as to others’ practices. Let me start though with the CNRI handle plugin demo for Acrobat which I blogged here.

Handle Acrobat Reader Plugin


thammond – 2007 July 31

In Metadata

Just announced on the handle-info list is a new plugin from CNRI for Acrobat Reader - see here. The announcement says: “It is intended to demonstrate the utility of embedding a identifying handle in a PDF document.   …   A set of demonstration documents, each with an embedded identifying handle, is packaged with the plug-in to show potential uses. To make productive use of this technology, a given industry or community of

URI Template Republished


thammond – 2007 July 28

In Identifiers

Well, it all went very quiet for a while but glad to see that the URI Template Internet-Draft has just been republished: “A New Internet-Draft is available from the on-line Internet-Drafts directories. Title : URI Template Author(s) : J. Gregorio, et al. Filename : draft-gregorio-uritemplate-01.txt Pages : 9 Date : 2007-7-23 URI Templates are strings that can be transformed into URIs after embedded variables are substituted. This document defines the

XMP: First Hacks


thammond – 2007 July 27

In Metadata

(Update - 2007.07.28: I meant to reference in this entry Pierre Lindenbaum’s post back in May Is there any XMP in scientific pdf ? (No), which btw also references Roderic Page’s post on XMP but forgot to add in the links in my haste to scoot off. Well, truth is we still can’t answer Pierre in the affirmative but at least we can take the first steps towards rectifying this.)

I’ve been revisiting Adobe’s XMP just recently. (I blogged here about the new XMP Toolkit 4.1 back in March.)

I wanted to share some of my early experiences. First off, after a couple of previous attempts which got pushed aside due to other projects, I managed to compile the libraries and the sample apps that ship with the C++ SDK under Xcode on the Mac. I also needed to compile Expat first which doesn’t ship with the distribution.

OK, so far, so good. What this basically leaves one with is a couple of XMP dump utilities (DumpMainXMP and DumpScannedXMP) and two others (XMPCoreCoverage and XMPFilesCoverage) which is a good start anyways for exploring. And turns out that our PDFs already have some workflow metadata in them. This is encouraging because the SDK allows apps to read and update existing XMP packets from files, though not to write new packets into files (as far as I understand).

I thought I would take this opportunity anyway to:

  1. See what XMP metadata terms we might consider adding
  2. Try and add these to existing XMP packets

Ugly details are presented below, but by updating the XMP packet metadata in one of our PDFs (Nature 445, 37 (2007), C.J. Hogan) we can teach Acrobat Reader to read - see the “before” (PDF here) and “after” (PDF here) screenshots in the figure.

Of course, this is really about much more than getting Adobe apps to read/write metadata. It’s about using XMP as a standard platform for embedding metadata in digital assets for third-party apps to read/write. If we can put ID3 tags into our podcasts then why not XMP packets into other media?
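For anyone wondering what “updating an existing XMP packet” can look like at the byte level, here is a rough Python sketch. It is not how the Adobe SDK does it, just one simple approach: XMP packets are normally written with trailing whitespace padding and an `end="w"` trailer so that they can be rewritten in place, and keeping the packet the same length means the byte offsets recorded elsewhere in the PDF stay valid. The function name and the sample bytes are made up for illustration.

```python
import re

HEADER_RE = re.compile(rb"<\?xpacket begin=.*?\?>", re.DOTALL)
TRAILER_RE = re.compile(rb"<\?xpacket end=.*?\?>", re.DOTALL)


def replace_packet_body(data: bytes, new_xml: bytes) -> bytes:
    """Overwrite the first XMP packet body in place, preserving file length.

    Assumes an unfiltered, writable packet with enough padding; raises
    ValueError otherwise rather than growing the file.
    """
    head = HEADER_RE.search(data)
    if head is None:
        raise ValueError("no XMP packet header found")
    tail = TRAILER_RE.search(data, head.end())
    if tail is None:
        raise ValueError("no XMP packet trailer found")
    # end="w" marks a packet that may be modified in place; end="r" is read-only.
    if b'end="w"' not in tail.group(0) and b"end='w'" not in tail.group(0):
        raise ValueError("XMP packet is marked read-only")
    room = tail.start() - head.end()
    if len(new_xml) > room:
        raise ValueError("new XMP does not fit in the existing packet")
    # Pad with spaces so every byte offset after the packet is unchanged.
    padded = new_xml + b" " * (room - len(new_xml))
    return data[: head.end()] + padded + data[tail.start():]


if __name__ == "__main__":
    # Hypothetical document: old packet body plus 64 bytes of padding.
    old = b"<x:xmpmeta><dc:title>old</dc:title></x:xmpmeta>"
    doc = (
        b"%PDF-1.4\n"
        b'<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>'
        + old + b" " * 64
        + b'<?xpacket end="w"?>\nxref\n'
    )
    new = b"<x:xmpmeta><dc:title>new and longer title</dc:title></x:xmpmeta>"
    out = replace_packet_body(doc, new)
    print(len(out) == len(doc), new in out)
```

The catch, as noted above, is that this only works where a packet already exists and has room to spare; writing a brand new packet into a file that has none means touching the PDF structure proper.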

Publishing Linked Data


thammond – 2007 July 19

In Web

With these words: “There was quite some interest in Linked Data at this year’s World Wide Web Conference (WWW2007). Therefore, Richard Cyganiak, Tom Heath and I decided to write a tutorial about how to publish Linked Data on the Web, so that interested people can find all relevant information, best practices and references in a single place.” Chris Bizer announces this draft How to Publish Linked Data on the Web.

PURL Redux


thammond – 2007 July 12

In Identifiers

Seems that there’s life in the old dog yet. :~) See this post about PURL from Thom Hickey, OCLC. This extract: OCLC has contracted with Zepheira to reimplement the PURL code which has become a bit out of date over the years. The new code will be written in Java and released under the Apache 2.0 license.

BioNLP 2007


thammond – 2007 July 10

In Meetings

Just posted on Nascent a brief account of a presentation I gave recently on OTMI at BioNLP 2007. The post lists some of the feedback I received. We are very interested to get further comments so do feel free to contribute comments either directly to the post, privately to, or publicly to And then there’s always the OTMI wiki available for comment at It is important to note that OTMI is not a universal panacea but rather an attempt at bridging the gap between publisher and researcher.

IBM Article on PRISM


thammond – 2007 July 10

In Metadata

Nice entry article on PRISM here by Uche Ogbuji, Fourthought Inc. on IBM’s DeveloperWorks.

Oh, shiny!


admin – 2007 July 02

In Publishing

The other day Ed and I visited the OECD to talk about all things e-publishing. At the end of our meeting, Toby Green, the OECD’s head of publishing, handed all 30+ meeting attendees a copy of their well-known OECD Factbook, on a USB stick. Before you dismiss this as a gimmick, note that organizations like the OECD get a lot of political and marketing mileage with “leave behinds”: print copies of their key reports, conference proceedings and reference works.