CrossTech

Metadata in PDF: 3. Deployment

So, assuming we know the form of the metadata we wish to add to our PDFs (or else must comply with, if there is already a set of guidelines or some industry initiative in effect), how can we realize this? And, on the flip side, how can we make it easier for consumers to extract the metadata we have embedded in our PDFs?
Below are some considerations on deploying metadata in PDFs and consumer access.


Write New
Obviously the best option would be to speak to one’s suppliers and to get metadata added to the PDF at creation time. This leads to questions such as:

  • What metadata do we have available in the workflow process? Do we have the full set we wish to write, or just a subset?

  • Do we include metadata in the document information dictionary, or in the document metadata stream, or both?
  • OK, so we’ve decided to (also) include an XMP packet. Do we now mark that XMP packet as read-only or as writable? That is, do we allow for further in-place edits by adding trailing whitespace and marking the packet as “write” (end="w" in the closing xpacket processing instruction)?
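
To make the read-only versus “write” choice concrete, here is a minimal Python sketch of wrapping an RDF/XML payload in an XMP packet. The function name wrap_xmp_packet, its parameters and the 2048-byte default padding are illustrative assumptions, not part of any toolkit; only the xpacket header and trailer syntax come from the XMP specification.

```python
# Fixed id string that the XMP specification requires in every packet header.
XPACKET_ID = "W5M0MpCehiHzreSzNTczkc9d"


def wrap_xmp_packet(rdf_xml: str, writable: bool = True, padding: int = 2048) -> bytes:
    """Wrap an RDF/XML payload in an XMP packet (hypothetical helper).

    A writable packet is marked end="w" and carries trailing whitespace
    that a later in-place edit can consume; a read-only packet is marked
    end="r" and gets no padding.
    """
    header = f'<?xpacket begin="\ufeff" id="{XPACKET_ID}"?>\n'
    pad = ""
    if writable and padding > 0:
        # Padding is plain whitespace, written as 100-byte lines.
        pad = (" " * 99 + "\n") * (padding // 100)
    mode = "w" if writable else "r"
    trailer = f'<?xpacket end="{mode}"?>'
    return (header + rdf_xml.strip() + "\n" + pad + trailer).encode("utf-8")
```

For example, wrap_xmp_packet(rdf, writable=False) would give a compact packet with no room for later edits, while the default leaves roughly 2 KB of space for additions.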

Write Update
What possibilities exist for updating legacy PDF archives?
The cleanest means of updating a PDF is in-place edits. This maintains the number of PDF objects together with their lengths and byte offsets. Specifically, we are interested in metadata objects. There isn’t too much one can do with the document information dictionary apart from overwriting a field value or substituting a field. This is something that may be possible on a “one off” basis only. On the other hand, XMP packets are ripe for updating if they are set in “write” mode and have trailing whitespace. This whitespace can be used to supplement the metadata already contained in the packet.
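
As a rough illustration of such an in-place edit, the Python sketch below overwrites the payload of the first XMP packet found in a file and re-pads it with spaces so that every byte offset is preserved. The function name update_packet_in_place is hypothetical, and the sketch assumes a UTF-8 packet stored in an uncompressed, unencrypted stream.

```python
import re

# Header, payload (including any padding), trailer, and the r/w flag.
PACKET_RE = re.compile(
    rb"(<\?xpacket begin=.*?\?>)(.*?)(<\?xpacket end=[\"']([rw])[\"']\?>)",
    re.DOTALL,
)


def update_packet_in_place(pdf: bytes, new_payload: bytes) -> bytes:
    """Replace the first packet's payload without changing the file length."""
    m = PACKET_RE.search(pdf)
    if m is None:
        raise ValueError("no XMP packet found")
    if m.group(4) != b"w":
        raise ValueError("packet is marked read-only")
    room = len(m.group(2))  # bytes available between header and trailer
    payload = b"\n" + new_payload.strip() + b"\n"
    if len(payload) > room:
        raise ValueError("new payload does not fit in the existing packet")
    # Fill the leftover room with spaces so nothing after the packet moves.
    padded = payload + b" " * (room - len(payload))
    out = pdf[: m.start(2)] + padded + pdf[m.end(2):]
    assert len(out) == len(pdf)
    return out
```

The caller is expected to pass the complete new payload (the existing RDF/XML plus whatever is being added), since the old payload is overwritten rather than merged.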
There is some “wiggle” room, however, even in read-only XMP packets which have no trailing whitespace. Some XMP packets may include unused default namespace declarations and/or empty elements. These could be safely stripped and used for more positive purposes. This may not be enough to write in a full metadata set, but could be enough to squeeze in the DOI.
The usual way to update a PDF file is to append new objects. This means that a replacement document information dictionary and (document) metadata stream can be provided without worrying about shoe-horning the data into any leftover space in the original objects.
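
The append approach can be sketched in Python as follows. This is a simplified illustration, not a general PDF writer: it assumes a non-linearized, unencrypted file with a classic cross-reference table (no cross-reference streams), ASCII-only values, and it carries over only the /Root and /ID trailer entries. The names append_info_update and pdf_literal are hypothetical.

```python
import re


def pdf_literal(text: str) -> bytes:
    """Encode an ASCII string as a PDF literal string with escapes."""
    raw = text.encode("ascii")
    for ch in (b"\\", b"(", b")"):  # backslash first, then parentheses
        raw = raw.replace(ch, b"\\" + ch)
    return b"(" + raw + b")"


def append_info_update(pdf: bytes, info: dict) -> bytes:
    """Append a replacement information dictionary as an incremental update."""
    matches = list(re.finditer(rb"startxref\s+(\d+)\s+%%EOF", pdf))
    trailer_pos = pdf.rfind(b"trailer")
    if not matches or trailer_pos < 0:
        raise ValueError("no classic trailer/startxref found")
    last = matches[-1]
    prev_xref = int(last.group(1))  # offset of the previous xref section
    trailer = pdf[trailer_pos:last.start()]
    size_m = re.search(rb"/Size\s+(\d+)", trailer)
    root_m = re.search(rb"/Root\s+(\d+\s+\d+\s+R)", trailer)
    if not size_m or not root_m:
        raise ValueError("trailer lacks /Size or /Root")
    new_num = int(size_m.group(1))  # first unused object number
    id_m = re.search(rb"/ID\s*\[[^\]]*\]", trailer)
    extra = b" " + id_m.group(0) if id_m else b""

    out = bytearray(pdf)
    if not out.endswith(b"\n"):
        out += b"\n"
    obj_offset = len(out)
    entries = b" ".join(
        b"/" + key.encode("ascii") + b" " + pdf_literal(value)
        for key, value in info.items()
    )
    out += b"%d 0 obj\n<< %s >>\nendobj\n" % (new_num, entries)
    xref_offset = len(out)
    # One xref subsection with a single 20-byte entry for the new object.
    out += b"xref\n%d 1\n%010d 00000 n \n" % (new_num, obj_offset)
    out += b"trailer\n<< /Size %d /Root %s /Info %d 0 R%s /Prev %d >>\n" % (
        new_num + 1, root_m.group(1), new_num, extra, prev_xref)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)
```

The original bytes are left untouched; the new trailer’s /Prev entry chains back to the old cross-reference table, which is what lets a reader find the unchanged objects. The same pattern would extend to appending a replacement metadata stream.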
And this would be just fine, but for the small matter of Linearized PDFs. These are widely deployed as web-friendly PDFs ready for byte serving and are written out in a strictly determined ordering. (See Appendix F, “Linearized PDF” in the PDF Reference Manual.) The manual does, however, say (Section F.4.6, “Accessing an Updated File”) this about updating a Linearized PDF:

“As stated earlier, if a Linearized PDF file subsequently has an incremental update appended to it, the linearization and hints are no longer valid. Actually, this is not necessarily true, but the viewer application must do some additional work to validate the information.

…

For a PDF file that has received only a small update, this approach may be worthwhile. Accessing the file this way is quicker than accessing it without hints or retrieving the entire file before displaying any of it.”

This may warrant some further investigation.
Read
Now for consumers, how can publishers help users to read the metadata embedded in a file? The document information dictionary is reasonably accessible and is in the clear. It probably would not provide much in terms of metadata, but should hopefully contain the DOI anyway.
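
Because the information dictionary is in the clear, a very small script can often read it. The Python sketch below is a naive scan rather than a real PDF parser: it assumes an unencrypted file whose information dictionary is a plain (uncompressed) object with simple literal-string values, and it leaves escape sequences undecoded. The function name read_info_dict and the “doi” key mentioned afterwards are assumptions for illustration.

```python
import re


def read_info_dict(pdf: bytes) -> dict[str, str]:
    """Return the literal-string entries of the information dictionary."""
    refs = re.findall(rb"/Info\s+(\d+)\s+(\d+)\s+R", pdf)
    if not refs:
        return {}
    num, gen = refs[-1]  # the last trailer wins after incremental updates
    pattern = rb"(?<!\d)" + num + rb"\s+" + gen + rb"\s+obj\b(.*?)endobj"
    objs = list(re.finditer(pattern, pdf, re.DOTALL))
    if not objs:
        return {}
    body = objs[-1].group(1)  # the most recently written copy of the object
    # Match "/Key (value)" pairs; values may contain backslash escapes.
    entries = re.findall(rb"/(\w+)\s*\(((?:[^()\\]|\\.)*)\)", body, re.DOTALL)
    return {k.decode("ascii"): v.decode("latin-1") for k, v in entries}
```

If a publisher had stored the DOI under a key such as “doi”, a consumer could then fetch it with read_info_dict(data).get("doi").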
The XMP SDK is still far too unwieldy for wide use. Things would be much improved if there were even some SWIG wrappers for more popular languages such as Perl, Python, Ruby, etc. around the C++ code. The other thing to bear in mind is that the XMP SDK is dealing with generalities such as constructing and parsing XMP objects for reading and updating in a range of binary files. A consumer metadata app would only be interested in extracting the RDF/XML from the PDF. This can then be dealt with as appropriate to the application. Another problem concerns multiple XMP packets occurring in the same PDF, only one of them being the main (or document) XMP packet. This may be a non-problem in that all the RDF/XML could be extracted and the main XMP packet would be identifiable through the metadata it provided.
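
Extracting the RDF/XML without the SDK can be approximated by scanning the file for packet wrappers. The Python sketch below assumes UTF-8 packets stored in uncompressed streams; the function names and the idea of picking the document packet by a caller-supplied marker string are illustrative assumptions, not an established convention.

```python
import re

# Capture everything between an xpacket header and its matching trailer.
XPACKET_RE = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>", re.DOTALL
)


def extract_xmp_packets(pdf: bytes) -> list[str]:
    """Return the payload of every XMP packet found in the raw bytes."""
    return [
        m.group(1).decode("utf-8", errors="replace").strip()
        for m in XPACKET_RE.finditer(pdf)
    ]


def pick_document_packet(packets: list[str], marker: str):
    """Return the first packet containing `marker` (e.g. a DOI), else None."""
    for packet in packets:
        if marker in packet:
            return packet
    return None
```

Each returned string can then be handed to whatever RDF/XML or XML parser suits the application, and packets belonging to embedded images can simply be ignored if they lack the metadata of interest.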
I suggest the best way to really help consumers is to go ahead and embed metadata in the first place; there would then be a clear impetus for extracting it. Even if a fuller metadata set is not being considered at this time, at least the DOI should be considered for embedding in the PDF as a “hook” for further services. The handle plugin is a really good example of just such a downstream application.

This entry was posted in Metadata on August 2, 2007 by thammond.
