thammond – 2007 August 22
So, following up on my recent posts here on Metadata in PDFs (Strategies, Use Cases, Deployment), I finally came across PDF/A and PDF/X, two ISO standardized subsets of PDF: the former (ISO 19005-1:2005) for archiving and the latter (ISO 15929:2002, ISO 15930-1:2001, etc.) for prepress digital data exchange.
Both formats share some common ground, such as minimizing surprises between producer and consumer and keeping things open and predictable. But my interest here is specifically in metadata, and in seeing what guidance these standards might provide us. Not surprisingly, metadata is a key issue for PDF/A, less so for PDF/X. I’ll discuss PDF/X briefly but the bulk of this post is focussed on PDF/A. See below.
thammond – 2007 August 08
The first thing to note is that this demo (the Acrobat plugin) is an application. And that comes with its own baggage, i.e. this is a Windows-only plugin and is targeted at Acrobat Reader 8. Viewed more widely, the application merely bridges an identifier embedded in the media file and the handle record filed against that identifier, and delivers some relevant functionality. The data (or metadata) declared in the PDF and in the associated handle, if rich enough and structured openly, can also be used by other applications. I think this is a key point worth bearing in mind: the demo, besides showing off new functionalities, is also demonstrating how data (or metadata) can be embedded at the respective endpoints (PDF, handle).
Some initial observations follow below.
So, assuming we know the form of the metadata we wish to add to our PDFs (or else to comply with, if there is already a set of guidelines or some industry initiative in effect), how can we realize this? And, on the flip side, how can we make it easier for consumers to extract metadata we have embedded in our PDFs?
Below are some considerations on deploying metadata in PDFs and consumer access.
Emboldened by my own researches, by the recent handle plugin announcement from CNRI (on which, more in a follow-on post), and by Alexander Griekspoor’s comment to my earlier post, I thought I’d write a more extensive piece about embedding metadata in PDF with a view to the following:
Discover what other publishers are currently doing
Stimulate discussions between content providers and/or consumers
Lay groundwork for Crossref best practice guidelines
Why include the DOI as an explicit piece of metadata rather than have it included by virtue of its appearance in a content section? The main reason is that it is then unambiguously accessible. Content sections in PDFs are typically filtered and sometimes encrypted, whereas metadata is usually plain text and moreover is marked up as to field type.
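To make the "plain text" point concrete, here is a minimal sketch in Python of how a consumer might pull an identifier out of a PDF without decoding any content streams. It relies only on the fact that an XMP packet is wrapped in `<?xpacket begin=...?>` and `<?xpacket end=...?>` markers. The choice of `dc:identifier` as the field carrying the DOI, the sample bytes, and the DOI value itself are assumptions made for illustration; they are not taken from any publisher's files. If a producer has compressed the metadata stream, this scan will simply find nothing.

```python
import re
import xml.etree.ElementTree as ET
from typing import List, Optional

# XMP packets are delimited by <?xpacket begin=...?> and <?xpacket end=...?>.
PACKET_RE = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>", re.DOTALL
)

# Assumption: the DOI is carried in dc:identifier; other profiles may differ.
DC_IDENTIFIER = "{http://purl.org/dc/elements/1.1/}identifier"


def extract_xmp_packets(data: bytes) -> List[bytes]:
    """Return the XML body of every uncompressed XMP packet found in data."""
    return [m.group(1).strip() for m in PACKET_RE.finditer(data)]


def find_identifier(packet: bytes) -> Optional[str]:
    """Return the first non-empty dc:identifier value in a packet, if any."""
    root = ET.fromstring(packet)
    for elem in root.iter(DC_IDENTIFIER):
        if elem.text and elem.text.strip():
            return elem.text.strip()
    return None


# Hypothetical stand-in for the bytes of a PDF with an embedded packet.
SAMPLE = (
    b"%PDF-1.4\n1 0 obj\n<< /Type /Metadata /Subtype /XML >>\nstream\n"
    b'<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>\n'
    b'<x:xmpmeta xmlns:x="adobe:ns:meta/">'
    b'<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">'
    b'<rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">'
    b"<dc:identifier>doi:10.1234/example</dc:identifier>"
    b"</rdf:Description></rdf:RDF></x:xmpmeta>\n"
    b'<?xpacket end="w"?>\nendstream\nendobj\n'
)

if __name__ == "__main__":
    for packet in extract_xmp_packets(SAMPLE):
        print(find_identifier(packet))
```

Getting the same DOI out of a content section would mean first undoing whatever filters were applied to the page streams, and then guessing which run of text is the identifier.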
Another question concerns whether to add in the identifier alone, or to embed a full metadata set. Why not just embed the identifier and visit Crossref for the metadata? This is feasible in some cases although it does involve an extra network trip, requires an application to service the identifier and is obviously not workable in offline contexts. Seems like a “no-brainer” to include a fuller description from the outset. Note that publishers frequently make some of this information available anyway in other metadata delivery channels, e.g. RSS feeds.
Well, this is likely to be a fairly brief post as I’m not aware of many use cases of metadata in PDFs from scholarly publishers. Certainly, I can say for Nature that we haven’t done much in this direction yet, although we are now beginning to look into this.
I’ll discuss a couple of cases found in the wild but invite comment as to others’ practices. Let me start though with the CNRI handle plugin demo for Acrobat which I blogged here.
thammond – 2007 July 31
thammond – 2007 July 28
thammond – 2007 July 27
(Update - 2007.07.28: I meant to reference in this entry Pierre Lindenbaum’s post back in May, Is there any XMP in scientific pdf ? (No), which btw also references Roderic Page’s post on XMP, but I forgot to add in the links in my haste to scoot off. Well, truth is we still can’t answer Pierre in the affirmative but at least we can take the first steps towards rectifying this.)
I wanted to share some of my early experiences. First off, after a couple of previous attempts which got pushed aside due to other projects, I managed to compile the libraries and the sample apps that ship with the C++ SDK under Xcode on the Mac. I also needed to compile Expat first, which doesn’t ship with the distribution.
OK, so far, so good. What this basically leaves one with is a couple of XMP dump utilities (DumpMainXMP and DumpScannedXMP) and two others (XMPCoreCoverage and XMPFilesCoverage) which is a good start anyways for exploring. And turns out that our PDFs already have some workflow metadata in them. This is encouraging because the SDK allows apps to read and update existing XMP packets from files, though not to write new packets into files (as far as I understand).
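One reason updating an existing packet is easier than inserting a new one: writable XMP packets normally carry trailing whitespace padding, so the XML can be rewritten in place without changing the length of the file (and so without disturbing any byte offsets recorded elsewhere in the PDF). Below is a minimal sketch of that idea in Python. It is not the SDK's API; the function name and the sample bytes are made up for illustration, and it assumes an uncompressed packet and does not check whether the packet is marked read-only.

```python
import re

# Capture the packet header, the body (XML plus padding), and the trailer.
PACKET_RE = re.compile(
    rb"(<\?xpacket begin=.*?\?>)(.*?)(<\?xpacket end=.*?\?>)", re.DOTALL
)


def update_packet_in_place(data: bytes, new_body: bytes) -> bytes:
    """Replace the first XMP packet body, keeping the total length unchanged."""
    match = PACKET_RE.search(data)
    if match is None:
        raise ValueError("no XMP packet found")
    room = len(match.group(2))  # existing XML plus its whitespace padding
    if len(new_body) > room:
        raise ValueError("new metadata does not fit in the existing packet")
    # Pad with spaces so every byte after the packet keeps its offset.
    padded = new_body + b" " * (room - len(new_body))
    return data[: match.start(2)] + padded + data[match.end(2):]


if __name__ == "__main__":
    # Hypothetical file contents: a packet with 20 bytes of padding.
    original = (
        b"HEAD"
        b'<?xpacket begin="" id="x"?>'
        b"<a>old</a>" + b" " * 20 +
        b'<?xpacket end="w"?>'
        b"TAIL"
    )
    updated = update_packet_in_place(original, b"<a>newer value</a>")
    print(len(original) == len(updated))
```

When the new metadata is larger than the space the packet offers, in-place rewriting no longer works and the file itself has to be restructured, which is a different job altogether.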
I thought I would take this opportunity anyway to:
Try and add these to existing XMP packets
Ugly details are presented below, but by updating the XMP packet metadata in one of our PDFs (Nature 445, 37 (2007), C.J. Hogan) we can teach Acrobat Reader to read - see the “before” (PDF here) and “after” (PDF here) screenshots in the figure.
Of course, this is really about much more than getting Adobe apps to read/write metadata. It’s about using XMP as a standard platform for embedding metadata in digital assets for third-party apps to read/write. If we can put ID3 tags into our podcasts then why not XMP packets into other media?