(Update – 2007.07.28: I meant to reference in this entry Pierre Lindenbaum’s post back in May, “Is there any XMP in scientific pdf ? (No)”, which btw also references Roderic Page’s post on XMP, but I forgot to add the links in my haste to scoot off. Well, truth is we still can’t answer Pierre in the affirmative, but at least we can take the first steps towards rectifying this.)
I’ve been revisiting Adobe’s XMP just recently. (I blogged here about the new XMP Toolkit 4.1 back in March.)
I wanted to share some of my early experiences. First off, after a couple of previous attempts which got pushed aside due to other projects, I managed to compile the libraries and the sample apps that ship with the C++ SDK under Xcode on the Mac. I also needed to compile Expat first which doesn’t ship with the distribution.
OK, so far, so good. What this basically leaves one with is a couple of XMP dump utilities (DumpMainXMP and DumpScannedXMP) and two others (XMPCoreCoverage and XMPFilesCoverage), which is a good start anyway for exploring. And it turns out that our PDFs already have some workflow metadata in them. This is encouraging because the SDK allows apps to read and update existing XMP packets in files, though not to write new packets into files (as far as I understand).
I thought I would take this opportunity anyway to:
- See what XMP metadata terms we might consider adding
- Try and add these to existing XMP packets
Ugly details are presented below, but by updating the XMP packet metadata in one of our PDFs (Nature 445, 37 (2007), C.J. Hogan) we can teach Acrobat Reader to read – see the “before” (PDF here) and “after” (PDF here) screenshots in the figure.
Of course, this is really about much more than getting Adobe apps to read/write metadata. It’s about using XMP as a standard platform for embedding metadata in digital assets for third-party apps to read/write. If we can put ID3 tags into our podcasts then why not XMP packets into other media?
First a brief digression on XMP packets, which look essentially like this:
    <?xpacket begin="..." id="..."?>
    <x:xmpmeta xmlns:x="adobe:ns:meta/">
      <rdf:RDF xmlns:rdf="..." xmlns:...>
        ...
      </rdf:RDF>
    </x:xmpmeta>
    ... XML whitespace as padding ...
    <?xpacket end="w"?>

The metadata itself is carried in an "<rdf:RDF>" element which is optionally wrapped by an "<x:xmpmeta>" element. This XML fragment with trailing XML whitespace is topped and tailed by "<?xpacket>" processing instructions with "begin" and "end" attributes, respectively. The RDF supported is a simple profile of RDF with only certain constructs recognized: scalars, arrays, structures. It is not a means to embed arbitrary RDF/XML structures. But I'll pass on that for now. At first blush it's at least suitable to get a simple dictionary of key/value terms written in, and more besides. The XMP metadata from the PDF file listed above looks as follows in RDF/N3 (which is a more chipper serialization of RDF than is RDF/XML):
    <uuid:...>
        dc:creator "x" ;
        dc:format "application/pdf" ;
        dc:title "19.7 N&V.indd NEW.indd"@x-default ;
        pdf:GTS_PDFXConformance "PDF/X-1a:2001" ;
        pdf:GTS_PDFXVersion "PDF/X-1:2001" ;
        pdf:Producer "Acrobat Distiller 6.0.1 for Macintosh" ;
        pdf:Trapped "False" ;
        pdfx:GTS_PDFXConformance "PDF/X-1a:2001" ;
        pdfx:GTS_PDFXVersion "PDF/X-1:2001" ;
        xap:CreateDate "2007-07-16T09:25:20+01:00" ;
        xap:CreatorTool "InDesign: pictwpstops filter 1.0" ;
        xap:MetadataDate "2007-07-16T11:40:21+01:00" ;
        xap:ModifyDate "2007-07-16T11:40:21+01:00" ;
        xapMM:DocumentID "uuid:be3a9be5-4e3a-4b66-a50b-26f0a0bfc89d" ;
        xapMM:InstanceID "uuid:73dcd021-d40a-4cb7-a99b-44f8e90624f4" .
(Note: I've omitted namespaces here and dropped some of the structuring info that was present on the "dc:creator" and "dc:title" elements, thus leaving all values as simple strings. Back to that in a bit.)
What this says is simply that all these properties, expressed as key/value pairs, apply to the current document denoted by the resource identifier "<uuid:...>", and that the terms are taken from the schemas indicated by the prefixes. So, for example, the term "creator" from the schema referenced by the placeholder "dc" (there is a namespace URI for this but I haven't shown it here) has the value "x" for this document, and so on.
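To make the key/value reading concrete, here is a minimal Python sketch (not part of the Adobe SDK) that pulls a packet out of a byte string by scanning for the "<?xpacket>" processing instructions and flattens the simple string-valued properties into a dictionary. The function names, the SAMPLE packet and its "sample-id" value, and the two-entry prefix table are all made up for illustration. It assumes the packet is stored uncompressed and unencrypted, and it deliberately ignores arrays and structures.

```python
import re
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Prefix table for this demo only; a real packet may use other schemas too.
PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "pdf": "http://ns.adobe.com/pdf/1.3/",
}

BEGIN_PI = re.compile(rb"<\?xpacket\s+begin=.*?\?>", re.DOTALL)
END_PI = re.compile(rb"<\?xpacket\s+end=.*?\?>", re.DOTALL)


def find_xmp_packet(data):
    """Return the bytes between the xpacket begin/end PIs, or None."""
    begin = BEGIN_PI.search(data)
    if begin is None:
        return None
    end = END_PI.search(data, begin.end())
    if end is None:
        return None
    return data[begin.end():end.start()]


def _qname(tag, inverse):
    """Map '{uri}local' to 'prefix:local'; skip RDF and unknown namespaces."""
    if not tag.startswith("{"):
        return None
    uri, local = tag[1:].split("}", 1)
    if uri == RDF_NS or uri not in inverse:
        return None
    return inverse[uri] + ":" + local


def simple_properties(packet_body, prefixes=PREFIXES):
    """Flatten simple string-valued properties to {'prefix:name': value}."""
    inverse = {uri: prefix for prefix, uri in prefixes.items()}
    root = ET.fromstring(packet_body.strip())
    result = {}
    for desc in root.iter("{%s}Description" % RDF_NS):
        # Simple properties may be written as attributes ...
        for key, value in desc.attrib.items():
            name = _qname(key, inverse)
            if name is not None:
                result[name] = value
        # ... or as child elements with plain text content.
        for child in desc:
            name = _qname(child.tag, inverse)
            text = (child.text or "").strip()
            if name is not None and len(child) == 0 and text:
                result[name] = text
    return result


# Hypothetical packet, cut down from the kind of metadata listed above.
SAMPLE = b"""<?xpacket begin="" id="sample-id"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
          xmlns:dc="http://purl.org/dc/elements/1.1/"
          xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
  <rdf:Description rdf:about="" pdf:Trapped="False">
   <dc:format>application/pdf</dc:format>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>

<?xpacket end="w"?>"""

if __name__ == "__main__":
    print(simple_properties(find_xmp_packet(SAMPLE)))
```

Anything structured (such as the array and language wrappers I dropped from "dc:creator" and "dc:title") is simply skipped by this sketch, which is roughly the simplification made in the N3 listing.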
So, setting aside the media- and XMP-specific metadata, we are left with the following work metadata in our main XMP packet:
    <uuid:...>
        dc:creator "x" ;
        dc:format "application/pdf" ;
        dc:title "19.7 N&V.indd NEW.indd"@x-default .
Not wildly impressive, I must admit. Ideally we would like to pump this up with a fuller descriptive and rights metadata set such as we routinely syndicate with our web feeds. This would make use of both DC and PRISM vocabularies. In RDF/N3 we might expect to see something like:
    <uuid:...>
        dc:creator "Craig J. Hogan" ;
        dc:title "Cosmology: Ripples of early starlight" ;
        dc:identifier "doi:10.1038/445037a" ;
        dc:description "doi:10.1038/445037a" ;
        dc:source "Nature 445, 37 (2007)" ;
        dc:date "2007-01-04" ;
        dc:format "application/pdf" ;
        dc:publisher "Nature Publishing Group" ;
        dc:language "en" ;
        dc:rights "© 2007 Nature Publishing Group" ;
        prism:publicationName "Nature" ;
        prism:issn "0028-0836" ;
        prism:eIssn "1476-4679" ;
        prism:publicationDate "2007-01-04" ;
        prism:copyright "© 2007 Nature Publishing Group" ;
        prism:rightsAgent "email@example.com" ;
        prism:volume "445" ;
        prism:number "7123" ;
        prism:startingPage "37" ;
        prism:endingPage "37" ;
        prism:section "News and Views" .
So, taking this RDF and doing a quick and dirty substitution of it for the existing DC description in the PDF XMP packet (i.e. more or less "lobotomizing" the PDF) we then get an updated XMP packet which can be dumped with the DumpMainXMP utility as (with some schemas removed):
    // ----------------------------------
    // Dumping main XMP for 445037a.pdf :
    File info : format = " ", handler flags = 00000260
    Packet info : offset = 267225, length = 3651
    Initial XMP from 445037a.pdf
    Dumping XMPMeta object "" (0x0)
    ...
    http://purl.org/dc/elements/1.1/ dc: (0x80000000 : schema)
        dc:rights (0x1E00 : isLangAlt isAlt isOrdered isArray)
            = "© 2007 Nature Publishing Group" (0x50 : hasLang hasQual)
                ? xml:lang = "x-default" (0x20 : isQual)
        dc:language (0x200 : isArray)
            = "en"
        dc:publisher (0x200 : isArray)
            = "Nature Publishing Group"
        dc:format = "application/pdf"
        dc:date (0x600 : isOrdered isArray)
            = "2007-01-04"
        dc:source = "Nature 445, 37 (2007)"
        dc:description (0x1E00 : isLangAlt isAlt isOrdered isArray)
            = "doi:10.1038/445037a" (0x50 : hasLang hasQual)
                ? xml:lang = "x-default" (0x20 : isQual)
        dc:identifier = "doi:10.1038/445037a"
        dc:title (0x1E00 : isLangAlt isAlt isOrdered isArray)
            = "Cosmology: Ripples of early starlight" (0x50 : hasLang hasQual)
                ? xml:lang = "x-default" (0x20 : isQual)
        dc:creator (0x600 : isOrdered isArray)
            = "Craig J. Hogan"
    http://prismstandard.org/namespaces/1.2/basic/ prism: (0x80000000 : schema)
        prism:section = "News and Views"
        prism:endingPage = "37"
        prism:startingPage = "37"
        prism:number = "7123"
        prism:volume = "445"
        prism:rightsAgent = "firstname.lastname@example.org"
        prism:copyright = "© 2007 Nature Publishing Group"
        prism:publicationDate = "2007-01-04"
        prism:eIssn = "1476-4679"
        prism:issn = "0028-0836"
        prism:publicationName = "Nature"
Full dumps of the "before" and "after" PDFs are available here:
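For anyone wanting to try the same quick and dirty substitution, the following Python sketch shows one way it could be done. It is not what the SDK does and not necessarily how I hacked the file above. The idea is to overwrite the packet body in place and then re-pad with whitespace so that the file length, and hence every byte offset after the packet, stays the same. That matters for PDF, where the cross-reference table and stream lengths would otherwise be off. The function name and the DEMO bytes are invented for illustration, and the sketch assumes the packet is uncompressed and that the new XML fits in the space available.

```python
import re

BEGIN_PI = re.compile(rb"<\?xpacket\s+begin=.*?\?>", re.DOTALL)
END_PI = re.compile(rb"<\?xpacket\s+end=.*?\?>", re.DOTALL)


def replace_packet_body(data, new_xml):
    """Swap the XML between the xpacket PIs, keeping the byte length fixed."""
    begin = BEGIN_PI.search(data)
    if begin is None:
        raise ValueError("no XMP packet found")
    end = END_PI.search(data, begin.end())
    if end is None:
        raise ValueError("XMP packet has no end marker")

    room = end.start() - begin.end()      # bytes available for XML + padding
    body = b"\n" + new_xml.strip() + b"\n"
    if len(body) > room:
        raise ValueError("new XMP does not fit in the existing packet")

    # Use up the remaining room with spaces so nothing after the packet moves.
    padded = body + b" " * (room - len(body))
    return data[:begin.end()] + padded + data[end.start():]


# Made-up stand-in for a file with an embedded packet and 40 bytes of padding.
DEMO = (
    b"%PDF-stub "
    b'<?xpacket begin="" id="demo"?>'
    b'<x:xmpmeta xmlns:x="adobe:ns:meta/"></x:xmpmeta>'
    + b" " * 40
    + b'<?xpacket end="w"?>'
    b" trailer-stub"
)

if __name__ == "__main__":
    edited = replace_packet_body(
        DEMO,
        b'<x:xmpmeta xmlns:x="adobe:ns:meta/"><!-- edited --></x:xmpmeta>',
    )
    print(len(DEMO) == len(edited))
```

The whitespace padding in the packet is what makes this kind of in-place edit possible at all: the more padding the original writer left, the more metadata can be squeezed in without touching the rest of the file.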
Note also that in the dump above some of the DC terms are interpreted by the XMP toolkit to have structured formats, i.e. they are recognized as array members, and have language and ordering attributes. This seems to be an artefact of the toolkit, as the RDF did not specify these structurings. The PRISM values, by contrast, were not similarly interpreted, as the PRISM schema is not registered with the toolkit.
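If one wanted to write those structurings explicitly rather than leave the toolkit to infer them, a sketch along the following lines might do. It uses Python's ElementTree to wrap each DC value in the RDF container suggested by the flags in the dump above (ordered array as "rdf:Seq", plain array as "rdf:Bag", language alternative as "rdf:Alt" with an "x-default" item). The FORMS table is simply my reading of that dump, and the function name is made up.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
XML = "http://www.w3.org/XML/1998/namespace"

ET.register_namespace("rdf", RDF)
ET.register_namespace("dc", DC)

# Container per DC term, read off the flags shown in the dump:
# isOrdered isArray -> Seq, isArray -> Bag, isLangAlt -> Alt.
FORMS = {
    "creator": "Seq",
    "date": "Seq",
    "language": "Bag",
    "publisher": "Bag",
    "title": "Alt",
    "description": "Alt",
    "rights": "Alt",
}


def dc_description(values):
    """Build an rdf:Description with each DC value in its container form."""
    desc = ET.Element("{%s}Description" % RDF, {"{%s}about" % RDF: ""})
    for name, value in values.items():
        prop = ET.SubElement(desc, "{%s}%s" % (DC, name))
        form = FORMS.get(name)
        if form is None:
            # Terms not in the table (format, source, ...) stay simple strings.
            prop.text = value
            continue
        container = ET.SubElement(prop, "{%s}%s" % (RDF, form))
        item = ET.SubElement(container, "{%s}li" % RDF)
        item.text = value
        if form == "Alt":
            # Language alternatives carry an xml:lang qualifier on each item.
            item.set("{%s}lang" % XML, "x-default")
    return desc


if __name__ == "__main__":
    element = dc_description({
        "creator": "Craig J. Hogan",
        "title": "Cosmology: Ripples of early starlight",
        "format": "application/pdf",
    })
    print(ET.tostring(element, encoding="unicode"))
```

This only handles a single value per term. A real "dc:creator" list would add one "rdf:li" per author, in order.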
Obviously, there's much more to be learned yet. I'll post an update to this later, but in the meantime it would be very interesting to get feedback from others on experiences they may have had with XMP, or any opinions they may want to share. I think it all looks very promising, although the tools are somewhat restricted.