(Update – 2007.07.28: I meant to reference in this entry Pierre Lindenbaum’s post from back in May, “Is there any XMP in scientific pdf ? (No)”, which btw also references Roderic Page’s post on XMP, but I forgot to add the links in my haste to scoot off. Well, truth is we still can’t answer Pierre in the affirmative but at least we can take the first steps towards rectifying this.)
I’ve been revisiting Adobe’s XMP just recently. (I blogged here about the new XMP Toolkit 4.1 back in March.)
I wanted to share some of my early experiences. First off, after a couple of previous attempts which got pushed aside due to other projects, I managed to compile the libraries and the sample apps that ship with the C++ SDK under Xcode on the Mac. I also needed to compile Expat first which doesn’t ship with the distribution.
OK, so far, so good. What this basically leaves one with is a couple of XMP dump utilities (DumpMainXMP and DumpScannedXMP) and two others (XMPCoreCoverage and XMPFilesCoverage), which is a good start anyway for exploring. And it turns out that our PDFs already have some workflow metadata in them. This is encouraging because the SDK allows apps to read and update existing XMP packets in files, though not to write new packets into files (as far as I understand).
I thought I would take this opportunity anyway to:
- See what XMP metadata terms we might consider adding
- Try and add these to existing XMP packets
Ugly details are presented below, but by updating the XMP packet metadata in one of our PDFs (Nature 445, 37 (2007), C.J. Hogan) we can teach Acrobat Reader to read – see the “before” (PDF here) and “after” (PDF here) screenshots in the figure.

Of course, this is really about much more than getting Adobe apps to read/write metadata. It’s about using XMP as a standard platform for embedding metadata in digital assets for third-party apps to read/write. If we can put ID3 tags into our podcasts then why not XMP packets into other media?
First a brief digression on XMP packets, which look essentially like this:
    <?xpacket begin="..." id="..."?>
    <x:xmpmeta xmlns:x="adobe:ns:meta/">
      <rdf:RDF xmlns:rdf="..." xmlns:...>
        ...
      </rdf:RDF>
    </x:xmpmeta>
    ... XML whitespace as padding ...
    <?xpacket end="w"?>

That is, the packet body is an "<rdf:RDF>" element which is optionally wrapped by an "<x:xmpmeta>" element. This XML fragment with trailing XML whitespace is topped and tailed by "<?xpacket>" processing instructions with "begin" and "end" attributes, respectively. The RDF supported is a simple profile of RDF with only certain constructs recognized: scalars, arrays, structures. It is not a means to embed arbitrary RDF/XML structures. But I'll pass on that for now. At first blush it's at least suitable to get a simple dictionary of key/value terms written in, and more besides.

The XMP metadata from the PDF file listed above looks as follows in RDF/N3 (which is a more chipper serialization of RDF than is RDF/XML):
    <uuid:...>
        dc:creator "x" ;
        dc:format "application/pdf" ;
        dc:title "19.7 N&V.indd NEW.indd"@x-default ;
        pdf:GTS_PDFXConformance "PDF/X-1a:2001" ;
        pdf:GTS_PDFXVersion "PDF/X-1:2001" ;
        pdf:Producer "Acrobat Distiller 6.0.1 for Macintosh" ;
        pdf:Trapped "False" ;
        pdfx:GTS_PDFXConformance "PDF/X-1a:2001" ;
        pdfx:GTS_PDFXVersion "PDF/X-1:2001" ;
        xap:CreateDate "2007-07-16T09:25:20+01:00" ;
        xap:CreatorTool "InDesign: pictwpstops filter 1.0" ;
        xap:MetadataDate "2007-07-16T11:40:21+01:00" ;
        xap:ModifyDate "2007-07-16T11:40:21+01:00" ;
        xapMM:DocumentID "uuid:be3a9be5-4e3a-4b66-a50b-26f0a0bfc89d" ;
        xapMM:InstanceID "uuid:73dcd021-d40a-4cb7-a99b-44f8e90624f4" .
(Note: I've omitted namespaces here and dropped some of the structuring info that was present on the "dc:creator" and "dc:title" elements thus leaving all values as simple strings. Back to that in a bit. )
What this says is simply that all these properties, expressed as key/value pairs, apply to the current document denoted by the resource identifier "<uuid:...>", and that the terms are taken from the schemas indicated by the prefixes. So, for example, the term "creator" from the schema referenced by the placeholder "dc" (there is a namespace URI for this but I haven't shown it here) has the value "x" for this document, and so on.
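Going back to the packet wrapper for a moment: because the packet is delimited by "<?xpacket>" processing instructions, it can be located in a file without understanding the file format at all. The following is a minimal sketch of that idea in Python, not anything that ships with the SDK. It assumes the packet is stored uncompressed in an 8-bit encoding, and the function name `find_xmp_packets` is made up for this illustration.

```python
import re

# Matches one packet: header PI, body (XMP plus padding), trailer PI.
# The begin/id attribute values are deliberately not inspected.
PACKET_RE = re.compile(
    rb"<\?xpacket\s+begin=.*?\?>(.*?)<\?xpacket\s+end=[\"']([rw])[\"']\s*\?>",
    re.DOTALL,
)


def find_xmp_packets(data: bytes):
    """Return a list of (offset, length, writable, body) tuples.

    offset and length cover the whole packet including both processing
    instructions; writable is True when the trailer says end="w".
    """
    packets = []
    for m in PACKET_RE.finditer(data):
        packets.append(
            (m.start(), m.end() - m.start(), m.group(2) == b"w", m.group(1))
        )
    return packets


if __name__ == "__main__":
    # A stand-in for a real file: some bytes, one packet, some more bytes.
    sample = (
        b"%PDF ...binary...\n"
        b'<?xpacket begin="..." id="..."?>'
        b'<x:xmpmeta xmlns:x="adobe:ns:meta/"></x:xmpmeta>'
        b"      \n      "
        b'<?xpacket end="w"?>'
        b"\n...more binary..."
    )
    for offset, length, writable, body in find_xmp_packets(sample):
        print("offset =", offset, "length =", length, "writable =", writable)
```

A scan like this will miss packets that sit inside compressed streams or that use a 16-bit or 32-bit encoding, so it is only a starting point for exploring.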
So, salting away the media- and XMP-specific metadata, we are left with the following work metadata in our main XMP packet.
    <uuid:...>
        dc:creator "x" ;
        dc:format "application/pdf" ;
        dc:title "19.7 N&V.indd NEW.indd"@x-default .
Not wildly impressive, I must admit. Ideally we would like to pump this up with a fuller descriptive and rights metadata set such as we routinely syndicate with our web feeds. This would make use of both DC and PRISM vocabularies. In RDF/N3 we might expect to see something like:
    <uuid:...>
        dc:creator "Craig J. Hogan" ;
        dc:title "Cosmology: Ripples of early starlight" ;
        dc:identifier "doi:10.1038/445037a" ;
        dc:description "doi:10.1038/445037a" ;
        dc:source "Nature 445, 37 (2007)" ;
        dc:date "2007-01-04" ;
        dc:format "application/pdf" ;
        dc:publisher "Nature Publishing Group" ;
        dc:language "en" ;
        dc:rights "© 2007 Nature Publishing Group" ;
        prism:publicationName "Nature" ;
        prism:issn "0028-0836" ;
        prism:eIssn "1476-4679" ;
        prism:publicationDate "2007-01-04" ;
        prism:copyright "© 2007 Nature Publishing Group" ;
        prism:rightsAgent "permissions@nature.com" ;
        prism:volume "445" ;
        prism:number "7123" ;
        prism:startingPage "37" ;
        prism:endingPage "37" ;
        prism:section "News and Views" .
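Since the packet itself carries RDF/XML rather than N3, a set of terms like this has to be serialized as XML before it can go into a PDF. Here is a rough sketch of one way to do that with Python's standard library. The helper `build_rdf` is hypothetical, every value is written as a plain string (matching the simplified N3 above, not the array and language structures XMP defines for some DC terms), and the empty `rdf:about` value is an assumption.

```python
import xml.etree.ElementTree as ET

# Namespace URIs for the prefixes used in the term names.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "prism": "http://prismstandard.org/namespaces/1.2/basic/",
}


def build_rdf(about: str, terms: dict) -> str:
    """Serialize prefixed key/value terms as one rdf:Description."""
    for prefix, uri in NS.items():
        ET.register_namespace(prefix, uri)  # keep readable prefixes in output
    rdf_ns = NS["rdf"]
    rdf = ET.Element(f"{{{rdf_ns}}}RDF")
    desc = ET.SubElement(
        rdf, f"{{{rdf_ns}}}Description", {f"{{{rdf_ns}}}about": about}
    )
    for qname, value in terms.items():
        prefix, local = qname.split(":", 1)
        # Each term becomes a child element holding a simple string value.
        ET.SubElement(desc, f"{{{NS[prefix]}}}{local}").text = value
    return ET.tostring(rdf, encoding="unicode")


if __name__ == "__main__":
    terms = {
        "dc:creator": "Craig J. Hogan",
        "dc:title": "Cosmology: Ripples of early starlight",
        "dc:identifier": "doi:10.1038/445037a",
        "prism:publicationName": "Nature",
        "prism:volume": "445",
        "prism:startingPage": "37",
    }
    print(build_rdf("", terms))
```

The result is only the "<rdf:RDF>" element; the "<x:xmpmeta>" wrapper and the "<?xpacket>" processing instructions still have to go around it.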
So, taking this RDF and doing a quick and dirty substitution of it for the existing DC description in the PDF XMP packet (i.e. more or less "lobotomizing" the PDF) we then get an updated XMP packet which can be dumped with the DumpMainXMP utility as (with some schemas removed):
    // ----------------------------------
    // Dumping main XMP for 445037a.pdf :

    File info : format = " ", handler flags = 00000260
    Packet info : offset = 267225, length = 3651

    Initial XMP from 445037a.pdf
    Dumping XMPMeta object "" (0x0)

    ...

       http://purl.org/dc/elements/1.1/ dc: (0x80000000 : schema)
          dc:rights (0x1E00 : isLangAlt isAlt isOrdered isArray)
             [1] = "2007 Nature Publishing Group" (0x50 : hasLang hasQual)
                   ? xml:lang = "x-default" (0x20 : isQual)
          dc:language (0x200 : isArray)
             [1] = "en"
          dc:publisher (0x200 : isArray)
             [1] = "Nature Publishing Group"
          dc:format = "application/pdf"
          dc:date (0x600 : isOrdered isArray)
             [1] = "2007-01-04"
          dc:source = "Nature 445, 37 (2007)"
          dc:description (0x1E00 : isLangAlt isAlt isOrdered isArray)
             [1] = "doi:10.1038/445037a" (0x50 : hasLang hasQual)
                   ? xml:lang = "x-default" (0x20 : isQual)
          dc:identifier = "doi:10.1038/445037a"
          dc:title (0x1E00 : isLangAlt isAlt isOrdered isArray)
             [1] = "Cosmology: Ripples of early starlight" (0x50 : hasLang hasQual)
                   ? xml:lang = "x-default" (0x20 : isQual)
          dc:creator (0x600 : isOrdered isArray)
             [1] = "Craig J. Hogan"

       http://prismstandard.org/namespaces/1.2/basic/ prism: (0x80000000 : schema)
          prism:section = "News and Views"
          prism:endingPage = "37"
          prism:startingPage = "37"
          prism:number = "7123"
          prism:volume = "445"
          prism:rightsAgent = "permissions@nature.com"
          prism:copyright = " 2007 Nature Publishing Group"
          prism:publicationDate = "2007-01-04"
          prism:eIssn = "1476-4679"
          prism:issn = "0028-0836"
          prism:publicationName = "Nature"
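For anyone wanting to try the same kind of quick and dirty substitution, one plausible approach is to overwrite the packet body in place and keep the packet exactly the same length by adjusting the whitespace padding, so that nothing else in the file moves. The sketch below illustrates that idea only; it is not necessarily how the "after" PDF was produced, the function `replace_packet_body` is invented for this example, and it does not check whether the trailer marks the packet as writable.

```python
import re

HEADER_RE = re.compile(rb"<\?xpacket\s+begin=.*?\?>", re.DOTALL)
TRAILER_RE = re.compile(rb"<\?xpacket\s+end=[\"'][rw][\"']\s*\?>")


def replace_packet_body(data: bytes, new_body: bytes) -> bytes:
    """Swap the XMP between the xpacket PIs, preserving total length."""
    header = HEADER_RE.search(data)
    if header is None:
        raise ValueError("no XMP packet header found")
    trailer = TRAILER_RE.search(data, header.end())
    if trailer is None:
        raise ValueError("no XMP packet trailer found")

    # Space available between the two processing instructions.
    room = trailer.start() - header.end()
    if len(new_body) > room:
        raise ValueError(
            f"new body needs {len(new_body)} bytes, only {room} available"
        )

    # Pad with spaces so every byte offset after the packet is unchanged.
    padded = new_body + b" " * (room - len(new_body))
    return data[: header.end()] + padded + data[trailer.start():]


if __name__ == "__main__":
    original = (
        b"...file bytes..."
        b'<?xpacket begin="..." id="..."?>'
        b"<x:xmpmeta>old metadata</x:xmpmeta>" + b" " * 40 +
        b'<?xpacket end="w"?>'
        b"...more file bytes..."
    )
    updated = replace_packet_body(original, b"<x:xmpmeta>new metadata</x:xmpmeta>")
    print(len(original) == len(updated))
```

The length check is the important part: if the new metadata does not fit in the space the old packet and its padding occupied, an in-place edit is not possible and the file would have to be rewritten by a tool that understands the format.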
Full dumps of the "before" and "after" PDFs are available here:
Note also that in the dump above some of the DC terms are interpreted by the XMP toolkit to have structured formats, i.e. are recognized as array members, and have language and ordering attributes. This seems to be an artefact of the toolkit as the RDF did not specify these structurings. Note also that the PRISM values were not similarly interpreted as the PRISM schema is not registered with the toolkit.
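The hex numbers in the dump are bit masks, and the names printed beside them are the individual bits that are set. A small decoder makes the structuring easier to see. This is a sketch: the flag names are the ones printed in the dump, while the exact bit assigned to each name is my assumption based on the SDK's option constants (the dump alone does not say which of 0x1000 and 0x0800 is isLangAlt, for instance).

```python
# (bit, name) pairs, highest bit first, to match the order used in the dump.
FLAGS = [
    (0x80000000, "schema"),
    (0x1000, "isLangAlt"),
    (0x0800, "isAlt"),
    (0x0400, "isOrdered"),
    (0x0200, "isArray"),
    (0x0040, "hasLang"),
    (0x0020, "isQual"),
    (0x0010, "hasQual"),
]


def decode_options(value: int) -> list:
    """Return the names of all flag bits set in value."""
    return [name for bit, name in FLAGS if value & bit]


if __name__ == "__main__":
    for value in (0x80000000, 0x1E00, 0x600, 0x200, 0x50, 0x20):
        print(hex(value), ":", " ".join(decode_options(value)))
```

Read this way, a language alternative such as dc:title is an array that is also ordered, an alternative, and keyed by language, and each of its items carries an xml:lang qualifier.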
Obviously, there's much more to be learned yet. I'll post an update to this later, but meantime it would be very interesting to get feedback from others on experiences they may have with XMP or any opinions they may want to share. I think it all looks very promising although tools are somewhat restricted.

Hi Tony,
I think it would be great if all publishers would finally start to add decent metadata to their PDF files.
Of course it would be even better if they would use the full potential of XMP, but I personally would already be really happy if they could start with the normal metadata fields that have been there since the beginning of PDF. Only Elsevier deserves credit here as they do already put the DOI in the metadata. Many other publishers don’t add anything, or even worse, fill in the name of the registered Acrobat user in the author field!
It seems the response of most publishers is that they think no one uses that stuff. Wrong!
Of course, the users of our program Papers would be served a great deal, but the publishers forget that right now ALL Mac users running OS X Tiger (which is on the order of 80%) would benefit directly from it the moment they added it. Spotlight already knows how to read it, it’s just not there. Add the metadata and your PDFs become instantly more accessible for a large number of users.
Back to XMP, it would be great to see this added by the publishers. It would be important that everyone does it in the same way (hooray for Dublin Core), and that in a more limited form the data is also added in the classical metadata fields (including the DOI).
The toolset for accessing XMP data in PDFs is still quite problematic for everyone who’s not a C++ junkie, like this Cocoa guy. Hopefully this will change soon as well.
It’s great to finally see some initiatives to get the (complete lack of) PDF metadata on the agenda!
Hi Tony,
I’m trying to build XMP Library with Xcode but all I get is a bunch of errors.
I must have missed something about the building procedure (which is not documented as far as I know).
Could you please tell me if you used some tricks to do it?
Thanks a lot!
Hi Florent:
No tricks. (Dear God, I’m no C++ programmer. Just a harmless hacker.)
The problem I ran into – and it may be the same that you’re seeing – is that the XMP Toolkit sample programmes cannot be compiled out of the box since Adobe do not ship “Expat” – James Clark’s XML parser – but just leave a placeholder for that. You’ll need to look at the ReadMe.txt file included in the distro. This from the xmp_sdk_overview.pdf:
“/third-party/expat Contains a placeholder for the Expat XML parser used by XMPCore.
Read the ReadMe.txt for information about obtaining Expat.”
You’ll find that file here in the distro:
./third-party/expat/ReadMe.txt
Hope that helps. Let me know if you still have problems.
Tony
Hi Tony,
My compile errors were caused by an Xcode bug:
Xcode does not accept space characters in the project path. So when that happens, you get a lot of errors coming from nowhere…
Thanks for your help,
keep on hacking
Hi, I am a newbie to Cocoa.
Could anyone please help me with how to read XMP metadata using Objective-C? Thank you, your suggestions will help me a lot.