Blog

Stop Press

thammond – 2007 August 28

In Metadata

Boy, was I ever so wrong! Contrary to what I said in yesterday’s post, the new PRISM 2.0 spec does support XMP value type mappings for its terms. See the table below, which lists the PRISM basic vocabulary terms and their XMP value types.

Many thanks to Dianne Kennedy and the rest of the PRISM Working Group for having added this support to PRISM 2.0.

ExifTool

thammond – 2007 August 27

In Metadata

(Update - 2007.08.28: I inadvertently missed out the term names in the last example of XMP as RDF/N3 with QNames and have now added these in. Also - a biggie - I said that PRISM had no XMP schema defined. This is actually wrong and as I blogged here today, the new PRISM 2.0 spec does indeed have a mapping of PRISM terms to XMP value types. Should actually have read the spec instead of just blogging about it earlier here. :~)

Having previously stooped to an extremely crass hack for pulling out a document information dictionary from PDFs (for which no apologies are sufficient, but it does often work), I feel I should make some kind of amends and mention the wonderful ExifTool by Phil Harvey for reading and writing metadata in media files. This is both a Perl library and a command-line application (so it’s cross-platform; a Windows .exe and a Mac OS .dmg are also provided). Besides handling EXIF tags in image files, this veritable Swiss Army knife of metadata inspectors can also read the information dictionary and the document XMP packet from PDFs. Moreover, and intriguingly, it can dump the raw (document) XMP packet.

I’m still experimenting with it. There are quite a number of features to explore. But some preliminary findings are listed below.
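As a rough illustration of what "dumping the raw XMP packet" involves, here is a minimal Python sketch that scans a file's bytes for the `<?xpacket begin ...?>` and `<?xpacket end ...?>` wrapper that delimits an XMP packet. This is not how ExifTool works internally. It assumes the packet is stored uncompressed in a byte-oriented encoding such as UTF-8 (common for the document-level metadata stream in a PDF, but not guaranteed), and the function names are made up for this example.

```python
import re
from typing import List

# An XMP packet is wrapped in a pair of processing instructions:
#   <?xpacket begin="..." id="..."?>  ...packet body...  <?xpacket end="w"?>
# Assumption: the packet is uncompressed and UTF-8 (or similar), so the
# wrapper can be found by a plain byte scan.
_XPACKET = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>",
    re.DOTALL,
)


def find_xmp_packets(data: bytes) -> List[bytes]:
    """Return the body of every XMP packet found in the raw bytes."""
    return [match.group(1).strip() for match in _XPACKET.finditer(data)]


def read_xmp_packets(path: str) -> List[bytes]:
    """Convenience wrapper (hypothetical name): scan a file on disk."""
    with open(path, "rb") as handle:
        return find_xmp_packets(handle.read())
```

With ExifTool itself, its documentation gives `exiftool -xmp -b file.pdf` as the way to write out the raw XMP block; the sketch above only shows why a plain-text packet is so easy to locate.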

pdfa.org

thammond – 2007 August 23

In Pdf

Following on from yesterday’s post, I just came across this very useful source of information on PDF/A: the PDF/A Conformance Center. This provides links to resources such as this whitepaper, PDF/A - A new Standard for Long-Term Archiving, and a number of technical notes, especially Metadata and PDF/A-1 (also available as a PDF). (This latter corrects some errors in the ISO standard, which are to be redressed in a forthcoming Technical Corrigendum later this year.)

Weird Scenes Inside the Gold Mine

thammond – 2007 August 22

In Metadata

So, following up on my recent posts here on Metadata in PDFs (Strategies, Use Cases, Deployment), I finally came across PDF/A and PDF/X, two ISO standardized subsets of PDF: the former (ISO 19005-1:2005) for archiving and the latter (ISO 15929:2002, ISO 15930-1:2001, etc.) for prepress digital data exchange.

Both formats share some common ground, such as minimizing surprises between producer and consumer and keeping things open and predictable. But my interest here is specifically in metadata, and in seeing what guidance these standards might provide us. Not surprisingly, metadata is a key issue for PDF/A, less so for PDF/X. I’ll discuss PDF/X briefly, but the bulk of this post is focussed on PDF/A. See below.

New SRU (1.2) Website

thammond – 2007 August 08

In Search

From Ray Denenberg’s post to the SRU Listserv yesterday: “The new SRU web site is now up: http://www.loc.gov/sru/ It is completely reorganized and reflects the version 1.2 specifications. (It also includes version 1.1 specifications, but is oriented to version 1.2.) … There is an official 1.1 archive under the new site, https://web.archive.org/web/20080724063403/http://www.loc.gov/sru/sru1-1archive/. And note also, that the new spec incorporates both version 1.1 and 1.2 (anything specific to version 1.

Handle Plugin: Some Notes

thammond – 2007 August 02

In Linking

The first thing to note is that this demo (the Acrobat plugin) is an application. And that comes with its own baggage, i.e. this is a Windows-only plugin and is targeted at Acrobat Reader 8. On a wider view, the application merely bridges an identifier embedded in the media file and the handle record filed against that identifier, and delivers some relevant functionality. The data (or metadata) declared in the PDF and in the associated handle, if rich enough and structured openly, can also be used by other applications. I think this is a key point worth bearing in mind: the demo, besides showing off new functionalities, is also demonstrating how data (or metadata) can be embedded at the respective endpoints (PDF, handle).

Some initial observations follow below.

Metadata in PDF: 3. Deployment

thammond – 2007 August 02

In Metadata

So, assuming we know the form of the metadata we wish to add to our PDFs (or the form we need to comply with, if there is already a set of guidelines or some industry initiative in effect), how can we realize this? And, on the flip side, how can we make it easier for consumers to extract the metadata we have embedded in our PDFs?

Below are some considerations on deploying metadata in PDFs and consumer access.

PRISM 2.0

thammond – 2007 August 02

In Metadata

Only just caught up with this, but the PRISM 2.0 draft is now available (since July 12) for public comment. See this, posted by Dianne Kennedy: “Just a note to let you know that PRISM 2.0 has just been posted at www.prismstandard.org <http://www.prismstandard.org/>. This is the first major revision to PRISM. We have incorporated new elements to support online content and have expanded and revised our controlled vocabularies. In addition we have added a profile to support PRISM in an XMP environment.

Metadata in PDF: 1. Strategies

thammond – 2007 August 01

In Metadata

Emboldened by my own researches, by the recent handle plugin announcement from CNRI (on which, more in a follow-on post), and by Alexander Griekspoor’s comment to my earlier post, I thought I’d write a more extensive piece about embedding metadata in PDF with a view to the following:

  • Discover what other publishers are currently doing

  • Stimulate discussions between content providers and/or consumers

  • Lay the groundwork for Crossref best practice guidelines

Why should Crossref be interested? Well, at minimum, to embed the DOI along with the digital asset would seem to be inherently “a good thing”. (And, in fact, this is precisely the approach that CNRI have taken for their plugin demos. I’ll look later at what they actually did and consider whether that is a model that Crossref publishers might usefully follow.)

Why include the DOI as an explicit piece of metadata rather than have it included by virtue of its appearance in a content section? The main reason is that it is then unambiguously accessible. Content sections in PDFs are typically filtered (and sometimes encrypted), whereas metadata is usually plain text and moreover is marked up as to field type.

Another question concerns whether to add in the identifier alone, or to embed a full metadata set. Why not just embed the identifier and visit Crossref for the metadata? This is feasible in some cases, although it does involve an extra network trip, requires an application to service the identifier, and is obviously not workable in offline contexts. Seems like a “no-brainer” to include a fuller description from the outset. Note that publishers frequently make some of this information available anyway in other metadata delivery channels, e.g. RSS feeds.
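To illustrate why a DOI carried as explicit, field-typed metadata is easy to get at, here is a minimal Python sketch that reads it out of an XMP packet (RDF/XML) using only the standard library. It is a sketch under assumptions, not a recommended practice: it assumes the publisher has placed the DOI in `dc:identifier` with a `doi:` prefix, which is just one plausible convention, and the function name and sample value are illustrative only.

```python
import xml.etree.ElementTree as ET
from typing import Optional

DC = "http://purl.org/dc/elements/1.1/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Illustrative packet only; the field choice (dc:identifier) and the
# "doi:" prefix are assumptions, not something the post prescribes.
SAMPLE = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
   <dc:identifier>doi:10.1000/182</dc:identifier>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>"""


def doi_from_xmp(packet: str) -> Optional[str]:
    """Return the DOI found in dc:identifier, or None if there is none."""
    root = ET.fromstring(packet)
    for desc in root.iter(f"{{{RDF}}}Description"):
        # A simple XMP property may be serialized either as a child
        # element or as an attribute on rdf:Description; check both.
        child = desc.find(f"{{{DC}}}identifier")
        value = child.text if child is not None else desc.get(f"{{{DC}}}identifier")
        if value and value.strip().lower().startswith("doi:"):
            return value.strip()[4:]
    return None


if __name__ == "__main__":
    print(doi_from_xmp(SAMPLE))
```

Because the value is plain text under a named field, no content-stream decoding or text heuristics are needed, and the lookup works offline.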

Metadata in PDF: 2. Use Cases

thammond – 2007 August 01

In Metadata

Well, this is likely to be a fairly brief post as I’m not aware of many use cases of metadata in PDFs from scholarly publishers. Certainly, I can say for Nature that we haven’t done much in this direction yet, although we are now beginning to look into this.

I’ll discuss a couple of cases found in the wild, but invite comment as to others’ practices. Let me start though with the CNRI handle plugin demo for Acrobat, which I blogged about here.
