Weird Scenes Inside the Gold Mine

thammond – 2007 August 22

In Metadata

So, following up on my recent posts here on Metadata in PDFs (Strategies, Use Cases, Deployment), I finally came across PDF/A and PDF/X, two ISO-standardized subsets of PDF: the former (ISO 19005-1:2005) for archiving and the latter (ISO 15929:2002, ISO 15930-1:2001, etc.) for prepress digital data exchange.

Both formats share some common ground, such as minimizing surprises between producer and consumer and keeping things open and predictable. But my interest here is specifically in metadata and in seeing what guidance these standards might provide us. Not surprisingly, metadata is a key issue for PDF/A, less so for PDF/X. I’ll discuss PDF/X briefly but the bulk of this post is focussed on PDF/A. See below.

PDF/X

The main reference I am using here is the “Application Notes for PDF/X Standards” cited below [PDF/X 2]. There are two key sections which deal with metadata in PDF/X: “2.3 Identification and conformance”, and “2.20 Document identification and metadata”.

Section 2.3 states that a conforming PDF/X file has the key “/GTS_PDFXVersion” in the document information dictionary, and (depending on version) may or may not have the key “/GTS_PDFXConformance”.

Section 2.20 then talks about inclusion of a document ID within the document trailer to ensure correct identification of the file. It then goes on specifically to say:

“Additionally, the use of the PDF version 1.4 Metadata key is allowed. Note that although information placed using this mechanism may be beneficial to production processes, any reader that is not PDF version 1.4 compliant may ignore this information.”

That is, PDF/X requires the use of a document information dictionary with the key “/GTS_PDFXVersion” (and, as the version demands, also the key “/GTS_PDFXConformance”) to signal conformance. It is lukewarm, though, with regard to the inclusion of XMP metadata (as would be indicated by the “/Metadata” key in the document catalog).
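As a rough illustration of how a consumer might look for that conformance key, here is a minimal Python sketch. It is an assumption-laden hack rather than a real PDF parser: it assumes the information dictionary is stored uncompressed and that the value is a simple literal string with no nested or escaped parentheses. The function name `find_info_key` and the sample bytes are made up for this example.

```python
import re

def find_info_key(pdf_bytes: bytes, key: str):
    """Return the first literal-string value found under /key, or None.

    Assumption: the dictionary is not inside a compressed stream and the
    value looks like "(some text)" with no nested or escaped parentheses.
    """
    pattern = rb"/" + re.escape(key.encode("ascii")) + rb"\s*\(([^()]*)\)"
    match = re.search(pattern, pdf_bytes)
    return match.group(1).decode("latin-1") if match else None

# Illustrative fragment of an information dictionary (not from a real file).
sample_info = (
    b"<< /Title (Test) /GTS_PDFXVersion (PDF/X-1a:2003) "
    b"/GTS_PDFXConformance (PDF/X-1a:2003) >>"
)
print(find_info_key(sample_info, "GTS_PDFXVersion"))
```

A real tool would use a proper PDF library to walk the trailer to the information dictionary; the byte scan only shows how little is needed when the key lives there.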

PDF/A

The main reference I’m using here is the “ISO DIS 19005-1:2005” draft cited below [PDF/A, 1].

In complete contrast to PDF/X, PDF/A puts all its attention on the XMP metadata, while at the same time acknowledging that the document information dictionary may be used. Note 1 in Section 6.7.3 observes that:

“Since a document information dictionary is allowed within a conforming file, it is possible for a single file to be both PDF/A-1 and PDF/X [12, 13] conformant.”

The non-normative Annex B also has this to say:

“Use of non-XMP metadata at the file level is strongly discouraged as there is no assurance that such metadata can be preserved in accordance with this specification. In cases where non-XMP metadata is present, the preference is to convert it to XMP, embed it in the file, and describe the conversion in the xmpMM:History property.”

It’s not fully clear here whether “file level” is intended to be the same as “document level”. But note that this is, in any case, from a non-normative section and does not reflect the actual normative wording used in the standard (Section 6.7.3), which allows the use of the document information dictionary.

The key section for our purposes in the standard is “6.7 Metadata”.

Section “6.7.2 Properties” says:

“The document catalog dictionary of a conforming file shall contain the Metadata key. The metadata stream that forms the value of that key shall conform to XMP Specification. All metadata properties pertaining to a file that are embedded in that file, except for document information dictionary entries that have no analogue in predefined XMP schemas as defined in 6.7.3, shall be in the form of one or more XMP packets as defined by XMP Specification, 3. Metadata properties shall be specified in predefined XMP schemas or in one or more extension schemas that comply with XMP requirements. Metadata object stream dictionaries shall not contain the Filter key.”

This is quite something. Not only is PDF/A fully supportive of XMP (even if Adobe sometimes appear to be less than enthusiastic), it actually requires it. Further, by forbidding the Filter key on the metadata stream, it ensures that the XMP packets are stored unencoded and so are human readable (well, apart from the small matter of XML, that is :).

Section “6.7.3 Document information dictionary” then goes on to say:

“A document information dictionary may appear within a conforming file. If it does appear, then all of its entries that have analogous properties in predefined XMP schemas, as defined by Table 1, shall also be embedded in the file in XMP form with equivalent values. Any document information dictionary entry not listed in Table 1 shall not be embedded using a predefined XMP schema property.”

This says that the primary source of metadata will be the XMP packet and that, as far as possible, metadata properties in the document information dictionary will be mapped directly to the XMP packet as specified and will not cause any conflict.

I’m not quite sure how to read the last sentence. Does that mean that if one were to use an “/Identifier” key in the document information dictionary then one couldn’t map it as “dc:identifier”, say, in the XMP? I think that would be OK. My read is that it precludes the use of a predefined term within the information dictionary, so one couldn’t have something like “dc:identifier” in the information dictionary.
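The split that 6.7.3 draws between entries with an XMP analogue and entries without one can be sketched in a few lines of Python. Note the assumptions: the `INFO_TO_XMP` mapping below is recalled from Table 1 rather than quoted in this post, so it should be checked against the standard itself; the function name and the sample values (including the DOI) are purely illustrative.

```python
# Assumed contents of Table 1 (information dictionary key -> XMP property).
# Recalled from memory, not quoted above: verify against ISO 19005-1.
INFO_TO_XMP = {
    "Title": "dc:title",
    "Author": "dc:creator",
    "Subject": "dc:description",
    "Keywords": "pdf:Keywords",
    "Creator": "xmp:CreatorTool",
    "Producer": "pdf:Producer",
    "CreationDate": "xmp:CreateDate",
    "ModDate": "xmp:ModifyDate",
}

def split_info_entries(info: dict):
    """Separate entries that must be mirrored in XMP from the rest."""
    mirrored = {INFO_TO_XMP[k]: v for k, v in info.items() if k in INFO_TO_XMP}
    unmapped = {k: v for k, v in info.items() if k not in INFO_TO_XMP}
    return mirrored, unmapped

# Illustrative information dictionary with one key outside the table.
mirrored, unmapped = split_info_entries(
    {"Author": "Peter, Paul, and Mary", "Identifier": "doi:10.5555/12345678"}
)
print(mirrored)
print(unmapped)
```

The sketch deliberately does nothing with the unmapped entries: how an “/Identifier” key may or may not be represented in the XMP is exactly the open question raised above.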

Note also that the one quirky mapping in Table 1 which arises from the need to sync the information dictionary entries with the XMP properties is this:

“If the dc:creator property is present in XMP metadata then it shall be represented by an ordered Text array of length one whose single entry shall consist of one or more names. The value of dc:creator and the document information dictionary Author entry shall be equivalent.”

This means that:

The document information dictionary entry:

/Author (Peter, Paul, and Mary)

is equivalent to the XMP property:

<dc:creator>
<rdf:Seq>
<rdf:li>Peter, Paul, and Mary</rdf:li>
</rdf:Seq>
</dc:creator>

Weird, or what? Well, of course, I see the rationale, but …
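To make the Author and dc:creator equivalence concrete, here is a small Python sketch using the standard library’s `xml.etree.ElementTree`. The function names are hypothetical, and “equivalent” is taken to mean plain string equality, which is an assumption on my part (the standard may intend something looser).

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Use the conventional prefixes when serializing.
ET.register_namespace("dc", DC)
ET.register_namespace("rdf", RDF)

def author_to_dc_creator(author: str) -> ET.Element:
    """Build a dc:creator element holding an rdf:Seq of length one."""
    creator = ET.Element(f"{{{DC}}}creator")
    seq = ET.SubElement(creator, f"{{{RDF}}}Seq")
    ET.SubElement(seq, f"{{{RDF}}}li").text = author
    return creator

def creator_matches_author(creator: ET.Element, author: str) -> bool:
    """True if dc:creator is a one-entry Seq whose text equals Author.

    Assumption: equivalence is treated as exact string equality.
    """
    items = creator.findall(f"{{{RDF}}}Seq/{{{RDF}}}li")
    return len(items) == 1 and items[0].text == author

element = author_to_dc_creator("Peter, Paul, and Mary")
print(ET.tostring(element, encoding="unicode"))
```

The whole Author string goes into a single list item; splitting it on commas into three entries would break the length-one requirement quoted above.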

The remaining sections of interest here begin with “6.7.6 File identifiers”, which says that:

“A conforming file should have one or more metadata properties to characterize, categorize, and otherwise identify the file. This part of ISO 19005 does not mandate any specific identification scheme. Identifiers may be externally based, such as an International Standard Book Number (ISBN) or a Digital Object Identifier (DOI), or internally based, such as a Globally Unique Identifier/Universally Unique Identifier (GUID/UUID) or another designation assigned during workflow operations.”

Hmm, not that DOI is a file identifier necessarily. And certainly not in the Crossref usage where it denotes a work rather than a manifestation.

Section “6.7.8 Extension schemas” talks about the need to rigorously declare any extension (i.e. not predefined) schema using the following PDF/A extension schema description schema properties:

  • pdfaSchema:schema
  • pdfaSchema:namespaceURI
  • pdfaSchema:prefix
  • pdfaSchema:property
  • pdfaSchema:valueType

I think this means that, were PRISM terms to be used, the extension schema terms would need to be defined.

And finally, the section “6.7.11 Version and conformance level identification” says that:

“The PDF/A version and conformance level of a file shall be specified using the PDF/A Identification extension schema defined in this clause.”

This uses the PDF/A identification schema properties:

  • pdfaid:part
  • pdfaid:amd
  • pdfaid:conformance

Summary

What does this all mean? The main lessons are to be learned from PDF/A, which endorses (well, actually mandates) the use of XMP. Moreover, it requires that the document information dictionary and the XMP packet be in sync. Why it signals conformance through the XMP packet rather than through the information dictionary (as does PDF/X), or at least does not also specify a means to signal conformance through the information dictionary, is a mystery. The latter is readily get-at-able. A very crude hack to extract a PDF information dictionary can be as simple as

% strings <filename.pdf> | grep "/Producer"

or some other likely key. That will usually pull a line containing the full dictionary. The XMP packet is much harder to extract and then you’re still left with XML to parse.
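For comparison, here is a rough Python sketch of what pulling the PDF/A identification out of the XMP packet might involve. It leans on several assumptions: the metadata stream is unfiltered (as 6.7.2 requires for PDF/A), the first packet in the file is the document-level one, and the pdfaid namespace URI and the property names (part, amd, conformance) are as I understand the published schema. The function names and the sample bytes are invented for the example.

```python
import re
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PDFAID = "http://www.aiim.org/pdfa/ns/id/"  # assumed pdfaid namespace URI

# Everything between the xpacket header and trailer processing instructions.
_PACKET = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>", re.DOTALL
)

def extract_xmp(pdf_bytes: bytes):
    """Return the body of the first XMP packet found, or None."""
    match = _PACKET.search(pdf_bytes)
    return match.group(1).strip() if match else None

def pdfa_identification(xmp: bytes) -> dict:
    """Collect pdfaid properties, whether written as elements or attributes."""
    found = {}
    for desc in ET.fromstring(xmp).iter(f"{{{RDF}}}Description"):
        for name in ("part", "amd", "conformance"):
            value = desc.get(f"{{{PDFAID}}}{name}")
            child = desc.find(f"{{{PDFAID}}}{name}")
            if value is None and child is not None:
                value = child.text
            if value is not None:
                found[name] = value
    return found

# Illustrative bytes standing in for a PDF file (not a real document).
sample_pdf = (
    b"%PDF-1.4\nstream\n"
    b"<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>\n"
    b"<x:xmpmeta xmlns:x='adobe:ns:meta/'>"
    b"<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>"
    b"<rdf:Description rdf:about='' "
    b"xmlns:pdfaid='http://www.aiim.org/pdfa/ns/id/'>"
    b"<pdfaid:part>1</pdfaid:part>"
    b"<pdfaid:conformance>B</pdfaid:conformance>"
    b"</rdf:Description></rdf:RDF></x:xmpmeta>\n"
    b"<?xpacket end='w'?>\nendstream\n"
)
print(pdfa_identification(extract_xmp(sample_pdf)))
```

Even in this simplified form it takes a regular expression plus a namespace-aware XML parse to get what a one-line grep retrieves from the information dictionary.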

My gut feeling is that both mechanisms should be required (and sync’ed). And it’s hard not to see the DOI being required in both sections. That leads to questions about which schemas and terms to use and how to render the DOI. I am biased and would prefer to see it rendered in URI form, i.e. in an inclusive rather than an exclusive representation. DOI is special - but not that special. Other identifiers are also useful.

As per my earlier post, I could imagine that both DC and PRISM terms could be added to an XMP packet. I’m not sure whether there is any real interest at this time in following the PDF/A specification or merely in being informed by it. There seems to be a lot of overhead and I’m still looking to meet up with some examples (either in the wild or fabricated) to see what it might look like in practice.

Interested as always in others’ views.

References

So, note that these are ISO documents and as such are available for purchase from the ISO Store. (The citations above are linked to the relevant ISO Store pages.)

See also this recent post (August 1, 2007) by Rick Jelliffe on XML.com: Where to get ISO Standards on the Internet free.

There appear to be three main sources of information for these technologies: the ISO standards, application notes and FAQs. NPES (The Association for Suppliers of Printing, Publishing and Converting Technologies) hosts pages with relevant links - see here.

Below are listed specific links to freely available documentation that may be useful. Note that I have not purchased the ISO standards but have made use of an ISO DIS (draft international standard) for PDF/A and Application Notes for PDF/X by CGATS. (As yet there are no links to Application Notes for PDF/A.)

PDF/X

  1. (No Draft International Standard found.)
  2. Application Notes for PDF/X Standards Version 3, September 2002, CGATS; Application Notes for PDF/X Standards Version 4 (PDF/X-1a:2003, PDF/X-2:2003 & PDF/X-3:2003), September 2006, CGATS
  3. Frequently Asked Questions, November 2005, Martin Bailey, Chair, ISO/TC130/WG2/TF2 (PDF/X)

PDF/A

  1. Draft International Standard ISO/DIS 19005-1, ISO/TC171/SC2, Document management — Electronic document file format for long-term preservation — Part 1: Use of PDF 1.4 (PDF/A-1)
  2. (No Application Notes for PDF/A available yet.)
  3. Frequently Asked Questions (FAQs), ISO 19005-1:2005, PDF/A-1, July 2006, PDF/A Joint Working Group


Last Updated: 2018 July 7 by thammond