August 28, 2007

Stop Press

Boy, was I ever so wrong! Contrary to what I said in yesterday's post, the new PRISM 2.0 spec does support XMP value type mappings for its terms. See the table below which lists the PRISM basic vocabulary terms and the XMP value types.

Many thanks to Dianne Kennedy and the rest of the PRISM Working Group for having added this support to PRISM 2.0.

Section  PRISM Term                 XMP Value Type
4.2.1    prism:alternateTitle       bag Text
4.2.2    prism:byteCount            Integer
4.2.3    prism:channel              Text
4.2.4    prism:complianceProfile    Choice: "one", "two", "three"
4.2.5    prism:copyright            Text
4.2.6    prism:corporateEntity      bag Text
4.2.7    prism:coverDate            Date
4.2.8    prism:coverDisplayDate     Text
4.2.9    prism:creationDate         Date
4.2.10   prism:distributor          Text
4.2.11   prism:edition              Text
4.2.12   prism:eIssn                Text
4.2.13   prism:embargoDate          bag Date
4.2.14   prism:endingPage           Text
4.2.15   prism:event                bag Text
4.2.16   prism:expirationDate       bag Date
4.2.17   prism:hasAlternative       bag Text
4.2.18   prism:hasCorrection        Text
4.2.19   prism:hasTranslation       bag Text
4.2.20   prism:industry             bag Text
4.2.21   prism:isCorrectionOf       bag Text
4.2.22   prism:issn                 Text
4.2.23   prism:issueIdentifier      Text
4.2.24   prism:issueName            Text
4.2.25   prism:isTranslationOf      Text
4.2.26   prism:killDate             Date
4.2.27   prism:location             bag Text
4.2.28   prism:modificationDate     Date
4.2.29   prism:number               Text
4.2.30   prism:object               bag Text
4.2.31   prism:origin               Choice: "email", "mobile", "broadcast", "web", "print", "recordableMedia", "other"
4.2.32   prism:organization         bag Text
4.2.33   prism:pageRange            Text
4.2.34   prism:person               bag Text
4.2.35   prism:postDate             Date
4.2.36   prism:publicationDate      Date
4.2.37   prism:publicationName      Text
4.2.38   prism:receptionDate        Date
4.2.39   prism:rightsAgent          Text
4.2.40   prism:section              bag Text
4.2.41   prism:startingPage         Text
4.2.42   prism:subsection1          bag Text
4.2.43   prism:subsection2          bag Text
4.2.44   prism:subsection3          bag Text
4.2.45   prism:subsection4          bag Text
4.2.46   prism:teaser               Text
4.2.47   prism:versionIdentifier    Text
4.2.48   prism:volume               Text
4.2.49   prism:wordCount            Integer
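
To give a feel for how a mapping like this might be used in code, here is a minimal Python sketch. The dictionary holds only a hand-copied subset of the rows above, and the names PRISM_XMP_TYPES and xmp_shape are made up for illustration; nothing here comes from the PRISM spec itself.

```python
# A hand-copied subset of the table above; not the full PRISM 2.0 vocabulary.
PRISM_XMP_TYPES = {
    "prism:alternateTitle": "bag Text",
    "prism:byteCount": "Integer",
    "prism:coverDate": "Date",
    "prism:embargoDate": "bag Date",
    "prism:issn": "Text",
    "prism:section": "bag Text",
    "prism:wordCount": "Integer",
}


def xmp_shape(term):
    """Split a value type into (container, base type).

    The container is None for simple (non-array) values.
    """
    value_type = PRISM_XMP_TYPES[term]
    container, _, base = value_type.rpartition(" ")
    return (container or None, base)


print(xmp_shape("prism:section"))
print(xmp_shape("prism:wordCount"))
```

Knowing whether a term is a bag or a simple value is what decides whether it gets written as an rdf:Bag of rdf:li items or as plain element text.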

August 27, 2007

ExifTool

(Update - 2007.08.28: I inadvertently missed out the term names in the last example of XMP as RDF/N3 with QNames and have now added these in. Also - a biggie - I said that PRISM had no XMP schema defined. This is actually wrong and as I blogged here today, the new PRISM 2.0 spec does indeed have a mapping of PRISM terms to XMP value types. Should actually have read the spec instead of just blogging about it earlier here. :~)

Having previously stooped to an extremely crass hack for pulling out a document information dictionary from PDFs (for which no apologies are sufficient but it does often work) I feel I should make some kind of amends and mention the wonderful ExifTool by Phil Harvey for reading and writing metadata in media files. This is both a Perl library and a command-line application (so it's cross-platform - a Windows .exe and Mac OS .dmg are also provided). Besides handling EXIF tags in image files, this veritable Swiss Army knife of metadata inspectors can also read PDFs for the information dictionary and the document XMP packet. Moreover, and intriguingly, it can dump the raw (document) XMP packet.

I'm still experimenting with it. There's quite a number of features to explore. But some preliminary finds are listed below.

Taking one of our standard (metadata poor) PDFs we get this dump:

% exiftool nature05428.pdf
ExifTool Version Number         : 6.95
File Name                       : nature05428.pdf
Directory                       : .
File Size                       : 367 kB
File Modification Date/Time     : 2007:07:26 14:01:23
File Type                       : PDF
MIME Type                       : application/pdf
Page Count                      : 3
Producer                        : Acrobat Distiller 6.0.1 (Windows)
Mod Date                        : 2006:12:19 15:03:23+08:00
Creation Date                   : 2006:12:18 16:57:58+08:00
Creator                         : 3B2 Total Publishing System 7.51n/W
Creator Tool                    : 3B2 Total Publishing System 7.51n/W
Modify Date                     : 2006:12:19 15:03:23+08:00
Create Date                     : 2006:12:18 16:57:58+08:00
Metadata Date                   : 2006:12:19 15:03:23+08:00
Document ID                     : uuid:f598740b-ad11-41c5-a49e-7caffea783f0
Format                          : application/pdf
Title                           : untitled

By way of comparison, if we take a demo (metadata rich) PDF with added descriptive DC and PRISM metadata terms, we then get this dump:

% exiftool 445037a.pdf
ExifTool Version Number         : 6.95
File Name                       : 445037a.pdf
Directory                       : .
File Size                       : 265 kB
File Modification Date/Time     : 2007:07:26 16:18:17
File Type                       : PDF
MIME Type                       : application/pdf
Page Count                      : 1
Creator Tool                    : InDesign: pictwpstops filter 1.0
Metadata Date                   : 2006:12:22 12:10:07Z
Document ID                     : uuid:4cd39128-2c8e-41c0-9cad-eea2a1fdb64f
Identifier                      : doi:10.1038/445037a
Description                     : doi:10.1038/445037a
Source                          : Nature 445, 37 (2007)
Date                            : 2007:01:04
Format                          : application/pdf
Publisher                       : Nature Publishing Group
Language                        : en
Rights                          : © 2007 Nature Publishing Group
Publication Name                : Nature
Issn                            : 0028-0836
E Issn                          : 1476-4679
Publication Date                : 2007-01-04
Copyright                       : © 2007 Nature Publishing Group
Rights Agent                    : permissions@nature.com
Volume                          : 445
Number                          : 7123
Starting Page                   : 37
Ending Page                     : 37
Section                         : News and Views
Modify Date                     : 2006:12:22 12:10:07Z
Create Date                     : 2006:12:22 11:46:18Z
Title                           : 4.1 N&V NS NEW.indd
Trapped                         : False
Creator                         : InDesign: pictwpstops filter 1.0
GTS PDFX Version                : PDF/X-1:2001
GTS PDFX Conformance            : PDF/X-1a:2001
Author                          : x
Producer                        : Acrobat Distiller 6.0.1 for Macintosh

Note that the DC and PRISM terms are encoded as in my earlier examples and do not take account of a) how DC is defined as an XMP schema (i.e. the XMP value types for the separate terms), or b) how PRISM might (because it isn't yet) be defined as an XMP schema. Nor are identifier considerations fully taken into account. Nonetheless this gives more than an idea of what things could look like.

Now, with ExifTool it is also possible to list out the terms by group, e.g.

% exiftool -g1 445037a.pdf
---- ExifTool ----
ExifTool Version Number         : 6.95
---- File ----
File Name                       : 445037a.pdf
Directory                       : .
File Size                       : 265 kB
File Modification Date/Time     : 2007:07:26 16:18:17
File Type                       : PDF
MIME Type                       : application/pdf
---- PDF ----
Page Count                      : 1
Modify Date                     : 2006:12:22 12:10:07Z
Create Date                     : 2006:12:22 11:46:18Z
Title                           : 4.1 N&V NS NEW.indd
Trapped                         : False
Creator                         : InDesign: pictwpstops filter 1.0
GTS PDFX Version                : PDF/X-1:2001
GTS PDFX Conformance            : PDF/X-1a:2001
Author                          : x
Producer                        : Acrobat Distiller 6.0.1 for Macintosh
---- XMP-xmp ----
Creator Tool                    : InDesign: pictwpstops filter 1.0
Metadata Date                   : 2006:12:22 12:10:07Z
---- XMP-xmpMM ----
Document ID                     : uuid:4cd39128-2c8e-41c0-9cad-eea2a1fdb64f
---- XMP-dc ----
Identifier                      : doi:10.1038/445037a
Description                     : doi:10.1038/445037a
Source                          : Nature 445, 37 (2007)
Date                            : 2007:01:04
Format                          : application/pdf
Publisher                       : Nature Publishing Group
Language                        : en
Rights                          : © 2007 Nature Publishing Group
---- XMP-prism ----
Publication Name                : Nature
Issn                            : 0028-0836
E Issn                          : 1476-4679
Publication Date                : 2007-01-04
Copyright                       : © 2007 Nature Publishing Group
Rights Agent                    : permissions@nature.com
Volume                          : 445
Number                          : 7123
Starting Page                   : 37
Ending Page                     : 37
Section                         : News and Views
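
If one wanted to work with a grouped dump like this programmatically, a small parser is enough. The following is only a sketch in Python which assumes the output looks exactly like the listing above ("---- Group ----" header lines and "Tag : value" lines); the function name parse_grouped_dump is my own invention and not part of ExifTool.

```python
def parse_grouped_dump(text):
    """Parse `exiftool -g1` style output into {group: {tag: value}}."""
    groups = {}
    current = None
    for line in text.splitlines():
        line = line.rstrip()
        if line.startswith("---- ") and line.endswith(" ----"):
            # A group header such as "---- XMP-dc ----".
            current = line[5:-5]
            groups[current] = {}
        elif " : " in line and current is not None:
            # Split on the first " : " only, since values may contain colons.
            tag, value = line.split(" : ", 1)
            groups[current][tag.strip()] = value.strip()
    return groups


sample = """\
---- XMP-dc ----
Identifier                      : doi:10.1038/445037a
Language                        : en
---- XMP-prism ----
Publication Name                : Nature
Volume                          : 445
"""

metadata = parse_grouped_dump(sample)
print(metadata["XMP-dc"]["Identifier"])
```

Scraping the human-readable output is fragile, of course, so this is more a way of poking at the data than something to build on.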

Going back to the first example we can extract the (document) XMP packet as:

% exiftool -xmp -b nature05428.pdf
<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d' bytes='1753'?>

<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
xmlns:iX='http://ns.adobe.com/iX/1.0/'>

<rdf:Description about='uuid:3d686cee-18e6-483c-b1c9-e128e9f0d009'
xmlns='http://ns.adobe.com/pdf/1.3/'
xmlns:pdf='http://ns.adobe.com/pdf/1.3/'>
<pdf:Producer>Acrobat Distiller 6.0.1 (Windows)</pdf:Producer>
<pdf:ModDate>2006-12-19T15:03:23+08:00</pdf:ModDate>
<pdf:CreationDate>2006-12-18T16:57:58+08:00</pdf:CreationDate>
<pdf:Title>untitled</pdf:Title>
<pdf:Creator>3B2 Total Publishing System 7.51n/W</pdf:Creator>
</rdf:Description>

<rdf:Description about='uuid:3d686cee-18e6-483c-b1c9-e128e9f0d009'
xmlns='http://ns.adobe.com/xap/1.0/'
xmlns:xap='http://ns.adobe.com/xap/1.0/'>
<xap:CreatorTool>3B2 Total Publishing System 7.51n/W</xap:CreatorTool>
<xap:ModifyDate>2006-12-19T15:03:23+08:00</xap:ModifyDate>
<xap:CreateDate>2006-12-18T16:57:58+08:00</xap:CreateDate>
<xap:Format>application/pdf</xap:Format>
<xap:Title>
<rdf:Alt>
<rdf:li xml:lang='x-default'>untitled</rdf:li>
</rdf:Alt>
</xap:Title>
<xap:MetadataDate>2006-12-19T15:03:23+08:00</xap:MetadataDate>
</rdf:Description>

<rdf:Description about='uuid:3d686cee-18e6-483c-b1c9-e128e9f0d009'
xmlns='http://ns.adobe.com/xap/1.0/mm/'
xmlns:xapMM='http://ns.adobe.com/xap/1.0/mm/'>
<xapMM:DocumentID>uuid:f598740b-ad11-41c5-a49e-7caffea783f0</xapMM:DocumentID>
</rdf:Description>

<rdf:Description about='uuid:3d686cee-18e6-483c-b1c9-e128e9f0d009'
xmlns='http://purl.org/dc/elements/1.1/'
xmlns:dc='http://purl.org/dc/elements/1.1/'>
<dc:format>application/pdf</dc:format>
<dc:title>untitled</dc:title>
</rdf:Description>

</rdf:RDF>
<?xpacket end='r'?>%

Note that this PDF also included XMP packets for illustrations but the tool extracted the main, or document, XMP packet.
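
Once the packet is out of the PDF it is just RDF/XML, and even a plain XML parser can get at the simple properties. Here is a rough Python sketch using the standard library; the sample is a cut-down copy of the packet above, the function name simple_properties is hypothetical, and structured values (rdf:Alt and friends) are deliberately skipped.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# A cut-down copy of the packet shown above (xpacket wrappers omitted).
sample = """\
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
<rdf:Description about='uuid:3d686cee-18e6-483c-b1c9-e128e9f0d009'
xmlns:pdf='http://ns.adobe.com/pdf/1.3/'>
<pdf:Producer>Acrobat Distiller 6.0.1 (Windows)</pdf:Producer>
<pdf:Title>untitled</pdf:Title>
</rdf:Description>
<rdf:Description about='uuid:3d686cee-18e6-483c-b1c9-e128e9f0d009'
xmlns:xapMM='http://ns.adobe.com/xap/1.0/mm/'>
<xapMM:DocumentID>uuid:f598740b-ad11-41c5-a49e-7caffea783f0</xapMM:DocumentID>
</rdf:Description>
</rdf:RDF>
"""


def simple_properties(rdf_xml):
    """Return {"{namespace}name": text} for properties with plain text values."""
    root = ET.fromstring(rdf_xml)
    props = {}
    for description in root.iter("{%s}Description" % RDF_NS):
        for prop in description:
            # Skip structured values (rdf:Alt, rdf:Bag, rdf:Seq) in this sketch.
            if len(prop) == 0 and prop.text:
                props[prop.tag] = prop.text.strip()
    return props


props = simple_properties(sample)
print(props["{http://ns.adobe.com/pdf/1.3/}Producer"])
```

This treats the packet as XML rather than as RDF, so it will miss properties written in other equivalent RDF/XML forms (as attributes, say). A proper RDF tool avoids that problem.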

And now that it's easier to extract the metadata one can look to do something more interesting. For example, if one has cwm installed (Tim BL's Closed World Machine for semweb dabblings - a Python application, so again cross-platform) one can pipe the XMP packet into cwm as RDF/XML, verify it as valid RDF and read it out in another format, e.g. RDF/N3. For the above example we can do this as follows.

But let me first define a pipeline which extracts the XMP, passes it through a couple of filters to strip out processing instructions (these include the opening and closing <?xpacket> XMP PIs as well as an undocumented - legacy? - <?adobe> Adobe PI) and any xmpmeta wrapper lines, and then feeds the result into cwm as RDF/XML to be read out as RDF/N3. (Note that instead of ExifTool to extract the XMP another tool could have been used, e.g. something based on the sample apps shipped with the Adobe XMP SDK, or something bespoke.)

% alias get_n3
exiftool -xmp -b !$ | grep -v "<?" | grep -v xmpmeta | cwm --rdf --n3

We can then simply request to get the metadata from this PDF in RDF/N3 format:

% get_n3 nature05428.pdf
#Processed by Id: cwm.py,v 1.164 2004/10/28 17:41:59 timbl Exp 
        #    using base file:/Users/tony/Xcode/xmp/dev/
        
#  Notation3 generation by
#       notation3.py,v 1.166 2004/10/28 17:41:59 timbl Exp

# Base was: file:/Users/tony/Xcode/xmp/dev/
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

<uuid:3d686cee-18e6-483c-b1c9-e128e9f0d009> <http://ns.adobe.com/pdf/1.3/CreationDate> "2006-12-18T16:57:58+08:00";
<http://ns.adobe.com/pdf/1.3/Creator> "3B2 Total Publishing System 7.51n/W";
<http://ns.adobe.com/pdf/1.3/ModDate> "2006-12-19T15:03:23+08:00";
<http://ns.adobe.com/pdf/1.3/Producer> "Acrobat Distiller 6.0.1 (Windows)";
<http://ns.adobe.com/pdf/1.3/Title> "untitled";
<http://ns.adobe.com/xap/1.0/CreateDate> "2006-12-18T16:57:58+08:00";
<http://ns.adobe.com/xap/1.0/CreatorTool> "3B2 Total Publishing System 7.51n/W";
<http://ns.adobe.com/xap/1.0/Format> "application/pdf";
<http://ns.adobe.com/xap/1.0/MetadataDate> "2006-12-19T15:03:23+08:00";
<http://ns.adobe.com/xap/1.0/ModifyDate> "2006-12-19T15:03:23+08:00";
<http://ns.adobe.com/xap/1.0/Title> [
a rdf:Alt;
rdf:_1 "untitled"@x-default ];
<http://ns.adobe.com/xap/1.0/mm/DocumentID> "uuid:f598740b-ad11-41c5-a49e-7caffea783f0";
<http://purl.org/dc/elements/1.1/format> "application/pdf";
<http://purl.org/dc/elements/1.1/title> "untitled" .

#ENDS

Or writing that out again with QNames for readability (and dropping the UUID as RDF subject as recommended by the latest XMP spec) we have:

#Processed by Id: cwm.py,v 1.164 2004/10/28 17:41:59 timbl Exp 
        #    using base file:/Users/tony/Xcode/xmp/dev/
        
#  Notation3 generation by
#       notation3.py,v 1.166 2004/10/28 17:41:59 timbl Exp

# Base was: file:/Users/tony/Xcode/xmp/dev/

@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix pdf: <http://ns.adobe.com/pdf/1.3/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xmp: <http://ns.adobe.com/xap/1.0/> .
@prefix xmpMM: <http://ns.adobe.com/xap/1.0/mm/> .

<> pdf:CreationDate "2006-12-18T16:57:58+08:00";
pdf:Creator "3B2 Total Publishing System 7.51n/W";
pdf:ModDate "2006-12-19T15:03:23+08:00";
pdf:Producer "Acrobat Distiller 6.0.1 (Windows)";
pdf:Title "untitled";
xmp:CreateDate "2006-12-18T16:57:58+08:00";
xmp:CreatorTool "3B2 Total Publishing System 7.51n/W";
xmp:Format "application/pdf";
xmp:MetadataDate "2006-12-19T15:03:23+08:00";
xmp:ModifyDate "2006-12-19T15:03:23+08:00";
xmp:Title [
a rdf:Alt;
rdf:_1 "untitled"@x-default ];
xmpMM:DocumentID "uuid:f598740b-ad11-41c5-a49e-7caffea783f0";
dc:format "application/pdf";
dc:title "untitled" .

#ENDS
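
The QName rewriting above was done by hand, but it is mechanical enough to sketch in Python. The prefix table is copied from the listing; the function name to_qname is made up for this example. The one subtlety is that the xmp and xmpMM namespaces share a common stem, so the longest matching namespace has to win.

```python
PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "pdf": "http://ns.adobe.com/pdf/1.3/",
    "xmp": "http://ns.adobe.com/xap/1.0/",
    "xmpMM": "http://ns.adobe.com/xap/1.0/mm/",
}


def to_qname(uri, prefixes=PREFIXES):
    """Abbreviate a full property URI using the longest matching namespace."""
    best = None
    for prefix, namespace in prefixes.items():
        if uri.startswith(namespace):
            if best is None or len(namespace) > len(best[1]):
                best = (prefix, namespace)
    if best is None:
        return "<%s>" % uri  # no known namespace: leave as a full URI
    return "%s:%s" % (best[0], uri[len(best[1]):])


print(to_qname("http://ns.adobe.com/xap/1.0/mm/DocumentID"))
print(to_qname("http://purl.org/dc/elements/1.1/title"))
```

Without the longest-match rule the DocumentID property would come out as the bogus "xmp:mm/DocumentID".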

Now just imagine that there were something a little more interesting in there. Like a DOI. Like descriptive metadata, perhaps. :)

August 23, 2007

pdfa.org

Following on from yesterday's post I just came across this very useful source of information on PDF/A: the PDF/A Conformance Center. This provides links to resources such as this whitepaper PDF/A - A new Standard for Long-Term Archiving, and a number of technical notes, especially Metadata and PDF/A-1 (also available as a PDF). (This latter corrects some errors in the ISO standard which are to be redressed in a forthcoming Technical Corrigendum later this year.)

The site also links to the standard, to a FAQ, to PDF/A products and to news and events. There's also an RSS feed and a discussion forum.

Still difficult to find examples of PDF/A though (the discussion forum doesn't throw up too much on that score) although at least the Technical Note linked to above is a PDF/A-1 document as can be seen from this XMP description:

      <rdf:Description rdf:about=""
            xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">
         <pdfaid:part>1</pdfaid:part>
         <pdfaid:conformance>A</pdfaid:conformance>
      </rdf:Description>
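
Reading that claim back out of a packet is straightforward. Below is a small Python sketch which wraps the description above in an rdf:RDF element and reports the claimed level; the function name pdfa_claim is hypothetical, and it only handles the element form shown here (not properties written as attributes). Note that it reports what the file claims, which is not the same as validating conformance.

```python
import xml.etree.ElementTree as ET

PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/"

sample = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
        xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">
     <pdfaid:part>1</pdfaid:part>
     <pdfaid:conformance>A</pdfaid:conformance>
  </rdf:Description>
</rdf:RDF>
"""


def pdfa_claim(rdf_xml):
    """Return e.g. "PDF/A-1a" if the packet carries pdfaid properties, else None."""
    root = ET.fromstring(rdf_xml)
    part = root.find(".//{%s}part" % PDFAID_NS)
    conformance = root.find(".//{%s}conformance" % PDFAID_NS)
    if part is None or conformance is None:
        return None
    return "PDF/A-%s%s" % (part.text.strip(), conformance.text.strip().lower())


print(pdfa_claim(sample))
```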
 

As noted before, PDF/A may be more (and less) than CrossRef publishers require at this time, but nonetheless it is certainly a useful yardstick as regards embedding metadata within a PDF and is anyway a technology worth tracking in its own right.

August 22, 2007

Weird Scenes Inside the Gold Mine

So, following up on my recent posts here on Metadata in PDFs (Strategies, Use Cases, Deployment), I finally came across PDF/A and PDF/X, two ISO standardized subsets of PDF: the former (ISO 19005-1:2005) for archiving and the latter (ISO 15929:2002, ISO 15930-1:2001, etc.) for prepress digital data exchange.

Both formats share some common ground such as minimizing surprises between producer and consumer and keeping things open and predictable. But my interest here is specifically in metadata and to see what guidance these standards might provide us. Not surprisingly, metadata is a key issue for PDF/A, less so for PDF/X. I'll discuss PDF/X briefly but the bulk of this post is focussed on PDF/A. See below.

PDF/X

The main reference I am using here is the "Application Notes for PDF/X Standards" cited below [PDF/X 2]. There are two key sections which deal with metadata in PDF/X: "2.3 Identification and conformance", and "2.20 Document identification and metadata".

Section 2.3 states that a conforming PDF/X file has the key "/GTS_PDFXVersion" in the document information dictionary, and (depending on version) may or may not have the key "/GTS_PDFXConformance".

Section 2.20 then talks about inclusion of a document ID within the document trailer to ensure correct identification of the file. It then goes on specifically to say:

"Additionally, the use of the PDF version 1.4 Metadata key is allowed. Note that although information placed using this mechanism may be beneficial to production processes, any reader that is not PDF version 1.4 compliant may ignore this information."

That is, PDF/X requires the use of a document information dictionary with the key "/GTS_PDFXVersion" (and as version demands also the key "/GTS_PDFXConformance") to signal conformance. It is lukewarm, though, with regard to the inclusion of XMP metadata (as would be indicated by the "/Metadata" key in the document catalog).

PDF/A

The main reference I'm using here is the "ISO DIS 19005-1:2005" draft cited below [PDF/A 1].

In complete contrast to PDF/X, PDF/A puts all its attention on the XMP metadata, while at the same time acknowledging that the document information dictionary may be used. Note 1 in Section 6.7.3 notes that:

"Since a document information dictionary is allowed within a conforming file, it is possible for a single file to be both PDF/A-1 and PDF/X [12, 13] conformant."

The non-normative Annex B also has this to say:

"Use of non-XMP metadata at the file level is strongly discouraged as there is no assurance that such metadata can be preserved in accordance with this specification. In cases where non-XMP metadata is present, the preference is to convert it to XMP, embed it in the file, and describe the conversion in the xmpMM:History property."

It's not fully clear here whether "file level" is intended to be the same as "document level". But note that this anyway is from a non-normative section and does not reflect the actual normative wording used in the standard (Section 6.7.3) which allows the use of the document information dictionary.

The key section for our purposes in the standard is "6.7 Metadata".

Section "6.7.2 Properties" says:

"The document catalog dictionary of a conforming file shall contain the Metadata key. The metadata stream that forms the value of that key shall conform to XMP Specification. All metadata properties pertaining to a file that are embedded in that file, except for document information dictionary entries that have no analogue in predefined XMP schemas as defined in 6.7.3, shall be in the form of one or more XMP packets as defined by XMP Specification, 3. Metadata properties shall be specified in predefined XMP schemas or in one or more extension schemas that comply with XMP requirements. Metadata object stream dictionaries shall not contain the Filter key."

This is quite something. Not only is PDF/A fully supportive of XMP (even if Adobe sometimes appear to be less than enthusiastic) it actually requires it. Further it says that the XMP packets shall be human readable (well, apart from the small matter of XML, that is :).

Section "6.7.3 Document information dictionary" then goes on to say:

"A document information dictionary may appear within a conforming file. If it does appear, then all of its entries that have analogous properties in predefined XMP schemas, as defined by Table 1, shall also be embedded in the file in XMP form with equivalent values. Any document information dictionary entry not listed in Table 1 shall not be embedded using a predefined XMP schema property."

This says that the primary source of metadata will be the XMP packet and that, as far as possible, metadata properties in the document information dictionary will be mapped directly to the XMP packet as specified and will not cause any conflict.

I'm not quite sure how to read the last sentence. Does that mean that if one were to use an "/Identifier" key in the document information dictionary then one couldn't map it as "dc:identifier", say, in the XMP? I think that would be OK. My read is that it precludes the use of a predefined term within the information dictionary, so one couldn't have something like "dc:identifier" in the information dictionary.

Note also that the one quirky mapping in Table 1 which arises from the need to sync the information dictionary entries with the XMP properties is this:

"If the dc:creator property is present in XMP metadata then it shall be represented by an ordered Text array of length one whose single entry shall consist of one or more names. The value of dc:creator and the document information dictionary Author entry shall be equivalent."

This means that:

"The document information dictionary entry:  
/Author (Peter, Paul, and Mary)
  is equivalent to the XMP property:
<dc:creator> 
  <rdf:Seq> 
    <rdf:li>Peter, Paul, and Mary</rdf:li> 
  </rdf:Seq> 
</dc:creator> 
"

Weird, or what? Well, of course, I see the rationale, but ...
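
To make the quirk concrete, here is a Python sketch of the mapping as I read it: the whole /Author string goes into a single rdf:li, with no attempt to split out the individual names. The function name creator_from_author is my own; this is just string building under that reading, not anything lifted from the standard.

```python
from xml.sax.saxutils import escape


def creator_from_author(author):
    """Build the single-entry dc:creator sequence for an /Author string."""
    # The string is escaped for XML but deliberately not split on commas.
    return (
        "<dc:creator>\n"
        "  <rdf:Seq>\n"
        "    <rdf:li>%s</rdf:li>\n"
        "  </rdf:Seq>\n"
        "</dc:creator>" % escape(author)
    )


print(creator_from_author("Peter, Paul, and Mary"))
```

So three authors end up as one entry in an ordered array of length one, which keeps the two metadata stores trivially comparable at the cost of throwing away the structure that rdf:Seq was meant to carry.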

The remaining sections of interest here are "6.7.6 File identifiers" which says that:

"A conforming file should have one or more metadata properties to characterize, categorize, and otherwise identify the file. This part of ISO 19005 does not mandate any specific identification scheme. Identifiers may be externally based, such as an International Standard Book Number (ISBN) or a Digital Object Identifier (DOI), or internally based, such as a Globally Unique Identifier/Universally Unique Identifier (GUID/UUID) or another designation assigned during workflow operations."

Hmm, not that DOI is a file identifier necessarily. And certainly not in the CrossRef usage where it denotes a work rather than a manifestation.

Section "6.7.8 Extension schemas" talks about the need to rigorously declare any extension (undefined) schema with the following PDF/A extension schema description schema properties:

  • pdfaSchema:schema
  • pdfaSchema:namespaceURI
  • pdfaSchema:prefix
  • pdfaSchema:property
  • pdfaSchema:valueType

I think this means that were PRISM terms to be used the extension schema terms would need to be defined.

And finally, the section "6.7.11 Version and conformance level identification" says that:

"The PDF/A version and conformance level of a file shall be specified using the PDF/A Identification extension schema defined in this clause."

This uses the PDF/A identification schema properties:

  • pdfaid:part
  • pdfaid:amd
  • pdfaid:conformance

Summary

What does this all mean? Main lessons are to be learned from PDF/A which endorses (well, actually mandates) the use of XMP. Moreover, it requires that the document information dictionary and the XMP packet be in sync. Why it signals conformance through the XMP packet rather than through the information dictionary (as does PDF/X), or at least does not also specify a means to signal conformance through the information dictionary, is a mystery. The latter is readily get-at-able. A very crude hack to extract a PDF information dictionary can be as simple as

% strings <filename.pdf> | grep "/Producer"

or some other likely key. That will usually pull a line containing the full dictionary. The XMP packet is much harder to extract and then you're still left with XML to parse.
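
For completeness, the same crude hack can be sketched in Python for those without strings and grep to hand. The function name grep_strings is made up, the sample bytes are fabricated for the example, and like the shell version this can only work when the dictionary is sitting in the file in the clear.

```python
import re


def grep_strings(data, needle, min_len=4):
    """Rough equivalent of `strings | grep`: printable ASCII runs containing needle."""
    runs = re.findall(rb"[\x20-\x7e]{%d,}" % min_len, data)
    return [run.decode("ascii") for run in runs if needle in run]


# Fabricated bytes standing in for the contents of a PDF file.
data = (
    b"%PDF-1.4\n\x00\x01binary\x02\n"
    b"<</Producer (Acrobat Distiller 6.0.1) /Title (untitled)>>\n"
    b"endobj\n"
)

hits = grep_strings(data, b"/Producer")
print(hits)
```

With a real file one would pass in open(filename, "rb").read() instead of the sample bytes.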

My gut feeling is that both mechanisms should be required (and sync'ed). And it's hard not to see the DOI being required in both sections. Leads to considerations on which schemas/terms to use and how to render the DOI. I am biased and would prefer to see it rendered in URI form, i.e. in an inclusive rather than an exclusive representation. DOI is special - but not that special. Other identifiers are also useful.

As per my earlier post, I could imagine that both DC and PRISM terms could be added to an XMP packet. I'm not sure whether there is any real interest at this time to follow the PDF/A specification or rather to be informed by it. There seems to be a lot of overhead and I'm still looking to meet up with some examples (either in the wild or fabricated) to see what it might look like in practice.

Interested as always in others' views.

References

So, note that these are ISO documents and as such are available for purchase from the ISO Store. (The citations above are linked to the relevant ISO Store pages.)

See also this recent post (August 1, 2007) by Rick Jelliffe on XML.com: Where to get ISO Standards on the Internet free.

There appear to be three main sources of information for these technologies: the ISO standards, application notes and FAQs. NPES (The Association for Suppliers of Printing, Publishing and Converting Technologies) hosts pages with relevant links - see here.

Below are listed specific links to freely available documentation that may be useful. Note that I have not purchased the ISO standards but have made use of an ISO DIS (draft international standard) for PDF/A and Application Notes for PDF/X by CGATS. (As yet there are no links to Application Notes for PDF/A.)

PDF/X

  1. (No Draft International Standard found.)
  2. Application Notes for PDF/X Standards Version 3, September 2002, CGATS
  3. Application Notes for PDF/X Standards Version 4 (PDF/X-1a:2003, PDF/X-2:2003 & PDF/X-3:2003), September 2006, CGATS
  4. Frequently Asked Questions, November 2005, Martin Bailey, Chair, ISO/TC130/WG2/TF2 (PDF/X)

PDF/A

  1. Draft International Standard ISO/DIS 19005-1, ISO/TC171/SC2, Document management— Electronic document file format for long-term preservation — Part 1: Use of PDF 1.4 (PDF/A-1)
  2. (No Application Notes for PDF/A available yet.)
  3. Frequently Asked Questions (FAQs), ISO 19005-1:2005, PDF/A-1, July 2006, PDF/A Joint Working Group

August 08, 2007

New SRU (1.2) Website

From Ray Denenberg's post to the SRU Listserv yesterday:

"The new SRU web site is now up: http://www.loc.gov/sru/

It is completely reorganized and reflects the version 1.2 specifications.
(It also includes version 1.1 specifications, but is oriented to version
1.2.)

...

There is an official 1.1 archive under the new site,
http://www.loc.gov/sru/sru1-1archive/. And note also, that the new spec incorporates both version 1.1 and 1.2 (anything specific to version 1.1 is annotated as such)."


Interested to learn if any CrossRef publishers are currently implementing SRU.

August 02, 2007

PRISM 2.0

Only just caught up with this but the PRISM 2.0 draft is now available (since July 12) for public comment. See this post by Dianne Kennedy:

"Just a note to let you know that PRISM 2.0 has just been posted at www.prismstandard.org <http://www.prismstandard.org/> . This is the first major revision to PRISM. We have incorporated new elements to support online content and have expanded and revised our controlled vocabularies. In addition we have added a profile to support PRISM in an XMP environment.

We invite you to review the new specification (in 6 documents organized by namespace) and provide your comments before September 15. Please just email comments and questions to me, dkennedy@idealliance.org. "

Handle Plugin: Some Notes

The first thing to note is that this demo (the Acrobat plugin) is an application. And that comes with its own baggage, i.e. this is a Windows only plugin and is targeted at Acrobat Reader 8. On a wider purview the application merely bridges an identifier embedded in the media file and the handle record filed against that identifier and delivers some relevant functionality. The data (or metadata) declared in the PDF and in the associated handle, if rich enough and structured openly, can also be used by other applications. I think this is a key point worth bearing in mind, that the demo besides showing off new functionalities is also demonstrating how data (or metadata) can be embedded at the respective endpoints (PDF, handle).

Some initial observations follow below.

Install problems

As noted in my previous post I had to haul out the old HP laptop and engage in a dialog with our IT folks to get both Acrobat Reader 8 and the plugin installed as I did not have admin privileges on my own machine. Wasn't pretty but they were kind.

Useability

I don't know what's happening here but from our network it seems as if the first attempts to contact the handle server are timing out and the handle client in the plugin is failing over to an alternate route (HTTP?). So, the plugin doesn't work as expected since the user has to wait an untenably long time (somewhere between 60s and 90s). Of course, if a certain network access policy is required that would need to be specified and implemented by institutions for their users.

I used both Firefox and Internet Explorer browsers and ran into occasional Acrobat plugin crashes which would lock up the browser. Due to the severe network access problems noted above I wasn't able to rigorously test this further apart from to note that it was "buggy".

Functionality

I tested most of the demo cases, but was hampered by the useability restrictions noted above. I didn't see the "Related Links" or get the "Collections" to work but did see all the other cases and tried the buttons provided.

One thing of note is that the CrossRef metadata record was spoofed and returned from a stored data file rather than an active query to CrossRef. A real query would have been interesting to gauge the impact of network latency, although the lookup point is made by hardwiring a response.

PDF Metadata

OK, so the document DOI is embedded in the PDF both in the document information dictionary and in the (document) metadata stream within an XMP packet. This is great although I do have some specific comments about how the DOI is actually disclosed. See my Metadata in PDF: 2. Use Cases post for details.

Handle Data

Handle types are generally a matter for the handle administrators to oversee, although the unregulated use of new types is not going to help foster interoperability between handle applications. In passing I note that the handles used in this demo

	10.5555/pdftest-collection
	10.5555/pdftest-collection-item1
	10.5555/pdftest-collection-item2
	10.5555/pdftest-collection-item3
	10.5555/pdftest-crossref
	10.5555/pdftest-kernelmetadata
	10.5555/pdftest-multires
	10.5555/pdftest-rights
	10.5555/pdftest-version

make use of the following handle types (periods and underscores used as below)

	COLLECTION
	COLLECTION_ITEM
	HS_ADMIN
	HS_MODIFIED
	HDL_MD
	HDL.RIGHTS
	HDL.XREF
	URL

There is some degree of variability here which presumably will be managed better with a central handle type registry.

DOI/Handle

And lastly, this demo raises questions again about DOI and handle boundaries. From a handle viewpoint a DOI is nothing more than a branded handle, whereas from a DOI viewpoint a DOI is a specific handle profile with governance and policies, and its own service portfolio. The two terms should not be used interchangeably which I fear is where some of the demo details would lead us. As a very crude analogy (and with apologies to Bob Kahn), I would see the relationship between DOI and handle as not being dissimilar from that between TCP and IP.

Metadata in PDF: 3. Deployment

So, assuming we know the form of the metadata we wish to add to our PDFs (or else to comply with if there is already a set of guidelines, or some industry initiative in effect), how can we realize this? And, on the flip side, how can we make it easier for consumers to extract metadata we have embedded in our PDFs?

Below are some considerations on deploying metadata in PDFs and consumer access.

Write New

Obviously the best option would be to speak to one's suppliers and to get metadata added to the PDF at create time. This leads to questions such as:

  • What metadata do we have available in the workflow process? Do we have the full set we wish to write, or just a subset?
  • Do we include metadata in the document information dictionary, or in the document metadata stream, or both?
  • OK, so we've decided to (also) include an XMP packet. So, now do we make that XMP packet read-only or writable? That is, do we allow the possibility of further edits by adding in trailing whitespace and marking it as "write"?

Write Update

What possibilities exist for updating legacy PDF archives?

The cleanest means of updating a PDF is in-place edits. This maintains the number of PDF objects together with their lengths and byte offsets. Specifically we are interested in metadata objects. There isn't too much one can do with the document information dictionary apart from overwriting a field value or substituting a field. This is something that may be possible on a "one off" basis only. On the other hand, XMP packets are ripe for updating if they are set in "write" mode and have trailing whitespace. This can be used to supplement the metadata already contained in the packet.

There is some "wiggle" room, however, even in read-only XMP packets which have no trailing whitespace. Some XMP packets may include unused default namespace declarations and/or empty elements. These could be safely stripped and used for more positive purposes. This may not be enough to write in a full metadata set, but could be enough to squeeze in the DOI.
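
A minimal Python sketch of the padding-based in-place update might look like the following. It assumes the packet bytes have already been located in the file, that the metadata stream is unfiltered, and that the packet trailer is marked end='w'. The function name insert_in_place is hypothetical and not part of any XMP toolkit. The fragment is spliced in just before the closing rdf:RDF tag and the padding is shortened by the same number of bytes, so the packet length (and hence the object lengths and byte offsets) stays the same.

```python
import re

# Matches the packet trailer, e.g. <?xpacket end='w'?>, at the end of the packet.
TRAILER = re.compile(rb"<\?xpacket end=(['\"])(.)\1\?>\s*$")


def insert_in_place(packet: bytes, fragment: bytes) -> bytes:
    """Splice `fragment` in before </rdf:RDF>, eating trailing padding.

    The returned packet has exactly the same length as the input.
    Raises ValueError if the packet is read-only or the padding is too small.
    """
    trailer = TRAILER.search(packet)
    if trailer is None:
        raise ValueError("no xpacket trailer found")
    if trailer.group(2) != b"w":
        raise ValueError("packet is marked read-only")

    body = packet[:trailer.start()]
    content = body.rstrip(b" \t\r\n")
    padding = len(body) - len(content)
    if len(fragment) > padding:
        raise ValueError("not enough trailing whitespace for the fragment")

    close = content.rfind(b"</rdf:RDF>")
    if close < 0:
        raise ValueError("no </rdf:RDF> close tag found")

    # Insert the fragment, then pad back up to the original body length.
    new_body = content[:close] + fragment + content[close:]
    new_body += b" " * (padding - len(fragment))
    return new_body + packet[trailer.start():]
```

A caller would pass in something like a complete rdf:Description element carrying a dc:identifier, and write the returned bytes back over the original packet at the same offset.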

The usual way to update a PDF file is to append new objects. This means that a replacement document information dictionary and (document) metadata stream can be provided without worrying about shoe-horning the data into any leftover space in the original objects.

And this would be just fine, but for the small matter of Linearized PDFs. These are widely deployed as web friendly PDFs ready for byte serving and are written out in a strictly determined ordering. (See Appendix F, "Linearized PDF" in the PDF Reference Manual.) The manual does, however, say (Section F.4.6, "Accessing an Updated File") this about updating a Linearized PDF:

"As stated earlier, if a Linearized PDF file subsequently has an incremental update appended to it, the linearization and hints are no longer valid. Actually, this is not necessarily true, but the viewer application must do some additional work to validate the information.

...

For a PDF file that has received only a small update, this approach may be worthwhile. Accessing the file this way is quicker than accessing it without hints or retrieving the entire file before displaying any of it."

This may warrant some further investigation.
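
As a starting point for that investigation, an update tool could at least flag Linearized files before appending anything. The sketch below is an assumption-laden heuristic rather than a validator: it relies on the Appendix F requirement that the linearization parameter dictionary sit within the first 1024 bytes of the file, and it simply looks for the /Linearized key there. The function name is hypothetical.

```python
def is_linearized(pdf_bytes: bytes) -> bool:
    """Heuristic check: does a /Linearized key appear in the first 1024 bytes?

    This does not validate the hint tables, and it cannot tell whether an
    incremental update has already been appended to the file.
    """
    return b"/Linearized" in pdf_bytes[:1024]
```

A file flagged this way could then be routed to an in-place edit, or appended to in the knowledge that viewers may need to do the extra validation work the manual describes.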

Read

Now for consumers, how can publishers help users to read the metadata embedded in a file? The document information dictionary is reasonably accessible and is in the clear. It probably would not provide much in terms of metadata but should, hopefully, contain the DOI anyway.
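
To illustrate how accessible the information dictionary is, here is a rough Python sketch that pulls key/value pairs out of the raw bytes of such an object. It assumes the values are simple literal strings in parentheses with backslash-escaped parentheses. It does not handle hex strings, unescaped nested parentheses, octal escapes or indirect references, all of which a real PDF parser would need to cover. The names are hypothetical.

```python
import re

# /Key (literal string), allowing backslash escapes inside the string.
ENTRY = re.compile(rb"/([A-Za-z]+)\s*\(((?:\\.|[^\\()])*)\)", re.DOTALL)
# The escapes \( \) and \\ inside a literal string.
ESCAPE = re.compile(rb"\\([()\\])")


def read_info_entries(obj_bytes: bytes) -> dict:
    """Return the literal-string entries of an information dictionary."""
    entries = {}
    for key, raw in ENTRY.findall(obj_bytes):
        value = ESCAPE.sub(rb"\1", raw)
        entries[key.decode("ascii")] = value.decode("latin-1")
    return entries
```

An application could then look for an identifier under whichever key a publisher has chosen, for example "HDL", "Identifier" or "Subject".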

The XMP SDK is still far too unwieldy for wide use. Things would be much improved if there were even some SWIG wrappers for more popular languages such as Perl, Python, Ruby, etc. around the C++ code. The other thing to bear in mind is that the XMP SDK is dealing with generalities such as constructing and parsing XMP objects for reading and updating in a range of binary files. A consumer metadata app would only be interested in extracting the RDF/XML from the PDF. This can then be dealt with as appropriate to the application. Another problem concerns multiple XMP packets occurring in the same PDF, only one of them being the main (or document) XMP packet. This may be a non-problem in that all the RDF/XML could be extracted and the main XMP packet would be identifiable through the metadata it provided.
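
Extracting the RDF/XML need not involve the SDK at all when the metadata stream is unfiltered. The following Python sketch scans the raw file bytes for xpacket wrappers and returns whatever lies between them. It assumes the packets are stored in the clear in a single-byte-compatible encoding such as UTF-8, and the function names are hypothetical.

```python
import re
from typing import List

# Everything between an xpacket header and the next xpacket trailer.
XPACKET = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>", re.DOTALL
)


def extract_xmp_packets(pdf_bytes: bytes) -> List[bytes]:
    """Return the payload of every XMP packet found in the file, in order."""
    return [m.group(1).strip() for m in XPACKET.finditer(pdf_bytes)]


def packets_mentioning(packets: List[bytes], marker: bytes) -> List[bytes]:
    """Keep only the packets containing `marker`, e.g. b"dc:identifier"."""
    return [p for p in packets if marker in p]
```

Filtering on a term such as dc:identifier is one crude way of picking out the main (document) packet from those attached to embedded images and the like.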

I suggest the best way to really help consumers is to go ahead and embed metadata in the first place, then there would be a clear impetus for extracting it. Even if a fuller metadata set is not being considered at this time, then at least the DOI should be considered for embedding in the PDF as a "hook" for further services. The handle plugin is a really good example of just such a downstream application.

August 01, 2007

Metadata in PDF: 2. Use Cases

Well, this is likely to be a fairly brief post as I'm not aware of many use cases of metadata in PDFs from scholarly publishers. Certainly, I can say for Nature that we haven't done much in this direction yet although we are now beginning to look into this.

I'll discuss a couple cases found in the wild but invite comment as to others' practices. Let me start though with the CNRI handle plugin demo for Acrobat which I blogged here.

Handle Plugin

First off, the handle plugin PDF samples do include an embedded (test) DOI in both the document information dictionary

	5 0 obj
	<<
	/CreationDate (D:20070614140125-04'00')
	/Author (Simon)
	/Creator (PScript5.dll Version 5.2.2)
	/Producer (Acrobat Distiller 8.1.0 \(Windows\))
	/ModDate (D:20070614140240-04'00')
	/HDL (10.5555/pdftest-crossref)
	/Title (Microsoft Word - crossref-rev.doc)
	>>
	endobj

and in the (document) metadata stream

	<rdf:Description rdf:about="" xmlns:pdfx="http://ns.adobe.com/pdfx/1.3/">
	    <pdfx:HDL>10.5555/pdftest-crossref</pdfx:HDL>
	</rdf:Description>

Barring any fuller disclosure of metadata terms at large (and one of the demo cases makes use of the DOI to retrieve metadata from CrossRef) this is excellent. I would, however, quibble with the use of "HDL" as a foreign key for the information dictionary. I realize this is just a test but the term "HDL" (or "DOI", for that's what it really is) is somewhat specific and a more general term such as "Identifier" would probably have more mileage, e.g.

	5 0 obj
	<<
	...
	/Identifier (doi:10.5555/pdftest-crossref)
	...
	>>
	endobj

In the second example from the metadata dictionary I don't think the term "HDL" from the PDF extension schema "pdfx" is very helpful. (Is that namespace actually defined anywhere?) From a descriptive metadata viewpoint a more usual schema such as DC would have wider coverage. So again the second example would be better rendered as

	<rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
	    <dc:identifier>doi:10.5555/pdftest-crossref</dc:identifier>
	</rdf:Description>

or, alternately,

	<rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
	    <dc:identifier>info:hdl/10.5555/pdftest-crossref</dc:identifier>
	</rdf:Description>

Elsevier

Well, we have Alexander Griekspoor's comment earlier that Elsevier are including the DOI in their PDFs. I don't know how consistently this is being done but I've checked a couple of sample articles and it would seem that they have embedded the DOI (here from Cancer Cell, doi:10.1016/j.ccr.2007.06.004) in the title element which shows up in the information dictionary as

	361 0 obj
	<<
	/Producer (Adobe LiveCycle PDFG 7.2)
	/Creator (Elsevier)
	/Author ()
	/Keywords ()
	/Title (doi:10.1016/j.ccr.2007.06.004)
	/ModDate (D:20070630031637+05'30')
	/Subject ()
	/CreationDate (D:00000101000000Z)
	>>
	endobj

and in the (document) metadata dictionary as

	365 0 obj
	<<
	/Type /Metadata
	/Subtype /XML
	/Length 1526 
	>>
	stream
	<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d' bytes='1526'?>
         
	<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
	    xmlns:iX='http://ns.adobe.com/iX/1.0/'>
          
	 <rdf:Description about=''
  	     xmlns='http://ns.adobe.com/pdf/1.3/'
 	     xmlns:pdf='http://ns.adobe.com/pdf/1.3/'>
 	   <pdf:Producer>Adobe LiveCycle PDFG 7.2</pdf:Producer>
 	   <pdf:ModDate>2007-06-30T03:16:37+05:30</pdf:ModDate>
	   <pdf:Title>doi:10.1016/j.ccr.2007.06.004</pdf:Title>
	   <pdf:Creator>Elsevier</pdf:Creator>
 	   <pdf:Author></pdf:Author>
 	   <pdf:Keywords></pdf:Keywords>
 	   <pdf:Subject></pdf:Subject>
 	   <pdf:CreationDate>0-01-01T00:00:00Z</pdf:CreationDate>
	</rdf:Description>
         
	<rdf:Description about=''
 	    xmlns='http://ns.adobe.com/xap/1.0/'
 	    xmlns:xap='http://ns.adobe.com/xap/1.0/'>
 	  <xap:CreatorTool>Elsevier</xap:CreatorTool>
 	  <xap:ModifyDate>2007-06-30T03:16:37+05:30</xap:ModifyDate>
 	  <xap:Title>
  	    <rdf:Alt>
 	      <rdf:li xml:lang='x-default'>doi:10.1016/j.ccr.2007.06.004</rdf:li>
 	    </rdf:Alt>
 	  </xap:Title>
 	  <xap:Author></xap:Author>
 	  <xap:Description>
 	    <rdf:Alt>
 	      <rdf:li xml:lang='x-default'/>
 	    </rdf:Alt>
 	  </xap:Description>
 	  <xap:CreateDate>0-01-01T00:00:00Z</xap:CreateDate>
 	  <xap:MetadataDate>2007-06-30T03:16:37+05:30</xap:MetadataDate>
 	</rdf:Description>
         
	<rdf:Description about=''
 	    xmlns='http://purl.org/dc/elements/1.1/'
 	    xmlns:dc='http://purl.org/dc/elements/1.1/'>
 	  <dc:title>doi:10.1016/j.ccr.2007.06.004</dc:title>
 	  <dc:creator/>
 	  <dc:description/>
	</rdf:Description>
         
	</rdf:RDF>
	<?xpacket end='r'?>
	endstream
	endobj

Kudos anyway to Elsevier for embedding this piece of information in their PDFs (if indeed it is a general practice). This has the merit of being picked up by Adobe apps and displayed in e.g. Reader. Also third party apps can pull this and use it to retrieve the metadata record from CrossRef.

The only downside is that technically this seems to be a kludge to satisfy Adobe apps and is not the correct field for filing this information. I would have thought that some other information dictionary field (e.g. "Subject") would be a better kludge, and then reserve the "Title" and "Author" fields for their proper purposes. The RDF/XML title fields would appear to be inherited from the "Title" field in the information dictionary. It's a bit of a shame really because the DOI is embedded - it's just in the wrong place(s). (OK, so that's still way better, maybe, than not providing this information at all.)

Hopefully, with more examples to mull over and experiences to learn from we can arrive at a much better and more systematic way of including the DOI, and other key metadata fields, within a PDF so that this information can be gleaned easily and unambiguously by third party apps.

Metadata in PDF: 1. Strategies

Emboldened by my own researches, by the recent handle plugin announcement from CNRI (on which, more in a follow-on post), and by Alexander Griekspoor's comment to my earlier post, I thought I'd write a more extensive piece about embedding metadata in PDF with a view to the following:

  • Discover what other publishers are currently doing
  • Stimulate discussions between content providers and/or consumers
  • Lay groundwork for CrossRef best practice guidelines

Why should CrossRef be interested? Well, at minimum to embed the DOI along with the digital asset would seem to be inherently "a good thing". (And, in fact, this is precisely the approach that CNRI have taken for their plugin demos. I'll look later at what they actually did and consider whether that is a model that CrossRef publishers might usefully follow.)

Why include the DOI as an explicit piece of metadata rather than have it included by virtue of its appearance in a content section? The main reason is that it is then unambiguously accessible. Content sections in PDFs are typically filtered (and sometimes encrypted), whereas metadata is usually plain text and moreover is marked up as to field type.

Another question concerns whether to add in the identifier alone, or to embed a full metadata set. Why not just embed the identifier and visit CrossRef for the metadata? This is feasible in some cases although it does involve an extra network trip, requires an application to service the identifier and is obviously not workable in offline contexts. Seems like a "no-brainer" to include a fuller description from the outset. Note that publishers frequently make some of this information available anyway in other metadata delivery channels, e.g. RSS feeds.

There are two (complementary) approaches to embedding document-level metadata in a PDF:


  • A - Document Information Dictionary  This is an optional object (a dictionary) referenced from the PDF trailer dictionary. Example:

    	1 0 obj
    	<<
    	/Title ( PostScript Language Reference, Third Edition )
    	/Author ( Adobe Systems Incorporated )
    	/Creator ( Adobe FrameMaker 5.5.3 for Power Macintosh® )
    	/Producer ( Acrobat Distiller 3.01 for Power Macintosh )
    	/CreationDate ( D:19970915110347-08'00' )
    	/ModDate ( D:19990209153925-08'00' )
    	>>
    	endobj
    

  • B - (Document) Metadata Stream  This is an optional object (a stream) referenced from the document catalog, itself referenced from the PDF trailer dictionary. Example:

    	2 0 obj
    	<<
    	/Type /Metadata
    	/Subtype /XML
    	/Length 1706
    	>>
    	stream
    	<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>
    	<!-- RDF/XML goes here -->
    	<?xpacket end='w'?>
    	endstream
    	endobj
    

    Both approaches usually make the embedded metadata in the PDF available in the clear, whereas content is frequently filtered and sometimes encrypted. (Note that the information dictionary is always in the clear, while the metadata stream can be filtered and rendered unreadable although in practice this tends not to be filtered.)

    Below I examine both approaches and see how they can be used to encode the kind of metadata that scholarly publishers are accustomed to.

    A - Document Information Dictionary

    Note that keys in the document information dictionary divide about equally between the logical document description (non-asterisked keys) and the physical asset description (asterisked keys):

    	Title
    	Author
    	Subject
    	Keywords
             
    	* Creator
    	* Producer
    	* CreationDate
    	* ModDate
    	* Trapped
    

    This is the complete listing of keys in the PDF specification, although foreign keys are allowed (and ignored).

    What is missing here is any document identifier and/or any other descriptive metadata. From a CrossRef point of view the identifier (the DOI) is a "hook" into the metadata record and so at minimum this could usefully be added. The question then is how? Either the identifier can be squeezed into one of the existing fields ("Title", "Author", "Subject", "Keywords" ?) or else a new foreign key could be created.

    IMO if an existing keyword is used then I would opt for "Subject" or "Keywords", and probably the former. If, on the other hand, a new foreign key were to be created I would choose something generic and (in keeping with the other terms) use something like "Identifier" (rather than, say, "DOI").

    Of preference, I think I would go for the latter ("Identifier") but if one wanted to make this more robust one could think of also adding in a known term (e.g. "Subject" or "Keywords"). So, to include metadata for the news article "Cosmology: Ripples of early starlight" printed in Nature (Nature 445, 37 (2007), doi:10.1038/445037a), we might include the following terms in the document information dictionary:

    	1 0 obj
    	<<
    	/Title ( Cosmology: Ripples of early starlight )
    	/Author ( Craig J. Hogan )
    	/Subject ( doi:10.1038/445037a )
    	/Keywords ( cosmology infrared protogalaxy starlight )
    	/Identifier ( doi:10.1038/445037a )
    	/Creator ( ... )
    	/Producer ( ... )
    	/CreationDate ( ... )
    	/ModDate ( ... )
    	>>
    	endobj
    

    where the "Identifier" entry represents a foreign key/value pair.

    Note: This (including the DOI in the "Subject" field) is a fix intended to get the DOI listed by Adobe apps which would not otherwise recognize the foreign key "Identifier".

    Since it is not really feasible to include separate enumerated fields within the information dictionary (although it could be done), one might also consider including a descriptive citation field as a foreign key, e.g. something like:

    	/Source (Nature 445, 37 \(2007\))
    

    Alternatively that might better be presented as the "Subject" along with the DOI, which would then limit the number of foreign keys to one ("Identifier").
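
    Note the backslash-escaped parentheses in the "Source" value above. A small Python sketch of writing out such entries is given here, under the assumption that all values are plain literal strings. The function names are hypothetical. Balanced parentheses do not strictly need escaping in a PDF literal string, but escaping them all is always safe.

```python
def pdf_literal(text: str) -> str:
    """Wrap text as a PDF literal string, escaping backslashes and parentheses."""
    escaped = (
        text.replace("\\", "\\\\")  # backslashes first, so later escapes survive
        .replace("(", "\\(")
        .replace(")", "\\)")
    )
    return "(" + escaped + ")"


def info_dictionary(entries: dict) -> str:
    """Serialize key/value pairs as the body of an information dictionary."""
    lines = ["<<"]
    for key, value in entries.items():
        lines.append("/" + key + " " + pdf_literal(value))
    lines.append(">>")
    return "\n".join(lines)
```

    Calling info_dictionary with "Title", "Subject", "Identifier" and "Source" entries would produce a dictionary body of the same shape as the examples in this section.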

    B - (Document) Metadata Stream

    The metadata stream with its use of XMP packets (wrapping RDF/XML instances) is a much more flexible approach to embedding metadata and allows multiple schemas to be used. As noted in my previous post here on XMP, PDFs with XMP packets mostly use media-specific terms and schemas, although there is also a token showing of DC. From a descriptive metadata point of view we would more likely make use of DC and PRISM for our schemas.

    Reprising the example from the previous post (and again using the citation example listed above) this would mean we may be inclined to include the following terms for a scholarly work (here in RDF/N3 for readability):

    	dc:creator "Craig J. Hogan" ;
    	dc:title "Cosmology: Ripples of early starlight" ;
    	dc:identifier "doi:10.1038/445037a" ;
    	dc:source "Nature 445, 37 (2007)" ;
    	dc:date "2007-01-04" ;
    	dc:format "application/pdf" ;
    	dc:publisher "Nature Publishing Group" ;
    	dc:language "en" ;
    	dc:rights "© 2007 Nature Publishing Group"  ;
         
    	prism:publicationName "Nature" ;
    	prism:issn "0028-0836" ;
    	prism:eIssn "1476-4679" ;
    	prism:publicationDate "2007-01-04" ;
    	prism:copyright "© 2007 Nature Publishing Group" ;
    	prism:rightsAgent "permissions@nature.com" ;
    	prism:volume "445" ; 
    	prism:number "7123" ;
    	prism:startingPage "37" ;
    	prism:endingPage "37" ;
    	prism:section "News and Views" ; 
    

    This would look something like the following as an XMP packet within a PDF metadata stream (the RDF now being serialized as RDF/XML):

    	2 0 obj
    	<<
    	/Type /Metadata
    	/Subtype /XML
    	/Length 1706
    	>>
    	stream
    	<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>
    	<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    	  <rdf:Description rdf:about=""
    	        xmlns:dc="http://purl.org/dc/elements/1.1/">
                 <dc:creator>Craig J. Hogan</dc:creator>
                 <dc:title>Cosmology: Ripples of early starlight</dc:title>
                 <dc:identifier>doi:10.1038/445037a</dc:identifier>
                 <dc:source>Nature 445, 37 (2007)</dc:source>
                 <dc:date>2007-01-04</dc:date>
                 <dc:format>application/pdf</dc:format>
                 <dc:publisher>Nature Publishing Group</dc:publisher>
                 <dc:language>en</dc:language>
                 <dc:rights>© 2007 Nature Publishing Group</dc:rights>
              </rdf:Description>
     
    	  <rdf:Description rdf:about=""
    	        xmlns:prism="http://prismstandard.org/namespaces/1.2/basic/">
                 <prism:publicationName>Nature</prism:publicationName>
                 <prism:issn>0028-0836</prism:issn>
                 <prism:eIssn>1476-4679</prism:eIssn>
                 <prism:publicationDate>2007-01-04</prism:publicationDate>
                 <prism:copyright>© 2007 Nature Publishing Group</prism:copyright>
                 <prism:rightsAgent>permissions@nature.com</prism:rightsAgent>
                 <prism:volume>445</prism:volume> 
                 <prism:number>7123</prism:number>
                 <prism:startingPage>37</prism:startingPage>
                 <prism:endingPage>37</prism:endingPage>
                 <prism:section>News and Views</prism:section>
               </rdf:Description>
    	</rdf:RDF>
    	<?xpacket end='w'?>
    	endstream
    	endobj
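
    For what it's worth, RDF/XML of this shape can be generated rather than hand-written, which avoids unbalanced tags. The Python sketch below uses only the standard library and mirrors the flat, simplified layout of the example above. It is an illustration under that assumption, not an XMP-conformant writer (a stricter writer would, for instance, wrap some DC values in RDF containers). The function name build_rdf is hypothetical.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
PRISM = "http://prismstandard.org/namespaces/1.2/basic/"

# Ask ElementTree to use the conventional prefixes when serializing.
ET.register_namespace("rdf", RDF)
ET.register_namespace("dc", DC)
ET.register_namespace("prism", PRISM)


def build_rdf(dc_terms: dict, prism_terms: dict) -> str:
    """Build an rdf:RDF element with one rdf:Description per schema."""
    root = ET.Element(f"{{{RDF}}}RDF")
    for namespace, terms in ((DC, dc_terms), (PRISM, prism_terms)):
        description = ET.SubElement(
            root, f"{{{RDF}}}Description", {f"{{{RDF}}}about": ""}
        )
        for name, value in terms.items():
            ET.SubElement(description, f"{{{namespace}}}{name}").text = value
    # encoding="unicode" returns a str with no XML declaration.
    return ET.tostring(root, encoding="unicode")
```

    The returned string would still need to be encoded and wrapped in the xpacket header and trailer, with the stream's /Length entry set to match.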
    

    References

    Some useful references are:

    1. Adobe® Portable Document Format, Version 1.7, November 2006 (see http://www.adobe.com/devnet/pdf/pdf_reference.html).
    2. Adobe® XMP Specification, September 2005 (see http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf).
    3. Embedding XMP Metadata in Application Files, September 2001 (see http://xml.coverpages.org/XMP-Embedding.pdf).

    Note a): See Section 10.2, "Metadata" in Ref. [1].

    Note b): Ref. [3] is a fairly brief draft which covers both the Information Dictionary and Metadata Dictionary (XMP) approaches. There is an Adobe-hosted update to this document from June 2002 but that only discusses the XMP approach.