So, assuming we know the form of the metadata we wish to add to our PDFs (or else the form to comply with, if there is already a set of guidelines or some industry initiative in effect), how can we realize this? And, on the flip side, how can we make it easier for consumers to extract the metadata we have embedded in our PDFs?
Below are some considerations on deploying metadata in PDFs and consumer access.
Obviously the best option would be to speak to one’s suppliers and get metadata added to the PDF at creation time. This leads to questions such as:
- What metadata do we have available in the workflow process? Do we have the full set we wish to write, or just a subset?
- Do we include metadata in the document information dictionary, or in the document metadata stream, or both?
- OK, so we’ve decided to (also) include an XMP packet. Do we now make that XMP packet read-only or writable? That is, do we allow the possibility of further edits by adding trailing whitespace and marking the packet as “write”?
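As a concrete illustration of the “write” option, here is a minimal Python sketch of what a writable XMP packet might look like. The `dc:identifier` property, the padding size, and the DOI value are all illustrative assumptions; only the `xpacket` wrapper and its magic id are fixed by the XMP specification.

```python
# Sketch: build a minimal XMP packet with writable trailing padding.
# The dc:identifier property and the 2 KB padding are illustrative choices.

XMP_MAGIC = "W5M0MpCehiHzreSzNTczkc9d"  # standard xpacket id string

def build_xmp_packet(doi: str, padding: int = 2048) -> bytes:
    rdf = (
        '<x:xmpmeta xmlns:x="adobe:ns:meta/">'
        '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">'
        '<rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">'
        f'<dc:identifier>doi:{doi}</dc:identifier>'
        '</rdf:Description></rdf:RDF></x:xmpmeta>'
    )
    header = f'<?xpacket begin="\ufeff" id="{XMP_MAGIC}"?>'
    trailer = '<?xpacket end="w"?>'        # "w": packet may be edited in place
    pad = ("\n" + " " * 99) * (padding // 100)  # whitespace reserved for future edits
    return (header + rdf + pad + trailer).encode("utf-8")

packet = build_xmp_packet("10.1000/example123")  # hypothetical DOI
```

Marking the packet `end="r"` instead would signal read-only, and the padding could then be omitted.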
What possibilities exist for updating legacy PDF archives?
The cleanest means of updating a PDF is in-place edits. This maintains the number of PDF objects together with their lengths and byte offsets. Specifically, we are interested in metadata objects. There isn’t too much one can do with the document information dictionary apart from overwriting a field value or substituting a field; this may be possible on a “one off” basis only. On the other hand, XMP packets are ripe for updating if they are set in “write” mode and have trailing whitespace. This can be used to supplement the metadata already contained in the packet.
There is some “wiggle” room, however, even in read-only XMP packets which have no trailing whitespace. Some XMP packets may include unused default namespace declarations and/or empty elements. These could be safely stripped and the reclaimed space put to better use. This may not be enough to write in a full metadata set, but could be enough to squeeze in the DOI.
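The kind of length-preserving, in-place edit described above could be sketched roughly as follows. The packet shape and the property being injected are illustrative assumptions; the key point is that exactly as many padding bytes are consumed as property bytes are added, so no byte offsets elsewhere in the file shift.

```python
import re

def inject_into_padding(pdf: bytes, new_prop: bytes) -> bytes:
    """Insert new_prop just before </rdf:Description>, consuming an
    equal number of trailing-whitespace bytes so that object lengths
    and byte offsets elsewhere in the file are unchanged."""
    close = pdf.find(b"</rdf:Description>")
    pad = re.search(rb'(\s+)<\?xpacket end="w"\?>', pdf)
    if close < 0 or pad is None:
        raise ValueError("no writable XMP packet found")
    room = pad.end(1) - pad.start(1)
    if len(new_prop) > room:
        raise ValueError("not enough padding for the new property")
    # Drop len(new_prop) whitespace bytes from the padding run...
    out = pdf[:pad.start(1)] + pdf[pad.start(1) + len(new_prop):]
    # ...then splice the property in, restoring the original length.
    return out[:close] + new_prop + out[close:]
```

A real tool would of course work on the decoded metadata stream and respect PDF string/stream encodings; this only shows the length-preserving bookkeeping.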
The usual way to update a PDF file is to append new objects. This means that a replacement document information dictionary and (document) metadata stream can be provided without worrying about shoe-horning the data into any leftover space in the original objects.
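A bare-bones sketch of such an incremental update might look like this. It assumes generation-0 objects and a classic cross-reference table (not a cross-reference stream); a real implementation would also carry /Size and /Root over from the previous trailer, and only ASCII field values are handled here.

```python
def append_info_update(pdf: bytes, info_obj: int, fields: dict) -> bytes:
    """Sketch: append a replacement document information dictionary
    as an incremental update. Assumes generation 0 and a classic xref
    table; a complete trailer also needs /Size and /Root copied from
    the previous trailer. Note this invalidates any linearization."""
    # Chain to the previous cross-reference section via its startxref.
    prev = int(pdf.rstrip().rsplit(b"startxref", 1)[1].split(b"%%EOF")[0])
    obj_offset = len(pdf) + 1                  # +1 for the leading newline
    entries = "".join(f"/{k} ({v}) " for k, v in fields.items())
    body = f"\n{info_obj} 0 obj\n<< {entries}>>\nendobj\n"
    xref_offset = len(pdf) + len(body)
    update = body + (
        f"xref\n{info_obj} 1\n{obj_offset:010d} 00000 n \n"
        f"trailer\n<< /Prev {prev} /Info {info_obj} 0 R >>\n"
        f"startxref\n{xref_offset}\n%%EOF\n"
    )
    return pdf + update.encode("latin-1")
```

Because the original bytes are untouched, the update is reversible: truncating the file at its old length recovers the original document.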
And this would be just fine, but for the small matter of Linearized PDFs. These are widely deployed as web-friendly PDFs ready for byte serving and are written out in a strictly determined ordering. (See Appendix F, “Linearized PDF” in the PDF Reference Manual.) The manual does, however, say (Section F.4.6, “Accessing an Updated File”) this about updating a Linearized PDF:
“As stated earlier, if a Linearized PDF file subsequently has an incremental update appended to it, the linearization and hints are no longer valid. Actually, this is not necessarily true, but the viewer application must do some additional work to validate […]
For a PDF file that has received only a small update, this approach may be worthwhile. Accessing the file this way is quicker than accessing it without hints or retrieving the entire file before displaying any of it.”
This may warrant some further investigation.
Now, for consumers: how can publishers help users to read the metadata embedded in a file? The document information dictionary is reasonably accessible and is in the clear. It probably would not provide much in the way of metadata but should at least, hopefully, contain the DOI.
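Because the document information dictionary sits in the clear, even a naive scan of the raw bytes can often recover the DOI. The `/doi` key below is not a standard Info key — this sketch assumes the producer chose to store the DOI as a custom field, and it ignores escape sequences, hex strings, and compressed object streams for brevity.

```python
import re

def docinfo_doi(pdf: bytes):
    """Naive scan for a hypothetical /doi entry in an uncompressed
    document information dictionary. Returns the DOI string, or None
    if no such entry is found."""
    m = re.search(rb'/doi\s*\(([^)]*)\)', pdf, re.IGNORECASE)
    return m.group(1).decode("latin-1") if m else None
```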
The XMP SDK is still far too unwieldy for wide use. Things would be much improved if there were even some SWIG wrappers around the C++ code for the more popular languages such as Perl, Python, Ruby, etc. The other thing to bear in mind is that the XMP SDK deals in generalities, such as constructing and parsing XMP objects for reading and updating across a range of binary file formats. A consumer metadata app would only be interested in extracting the RDF/XML from the PDF; this can then be dealt with as appropriate to the application. Another problem concerns multiple XMP packets occurring in the same PDF, only one of them being the main (or document) XMP packet. This may be a non-problem in that all the RDF/XML could be extracted, and the main XMP packet would then be identifiable through the metadata it provides.
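To illustrate, extracting the raw RDF/XML without the XMP SDK can be as little as one regular expression over the file’s bytes — at least for uncompressed packets; packets inside compressed streams would first need inflating. A fuller tool would identify the main packet by following the /Metadata entry in the document catalog rather than guessing from content.

```python
import re

def extract_xmp_packets(pdf: bytes) -> list:
    """Pull every uncompressed XMP packet out of a PDF's raw bytes.
    The returned RDF/XML fragments can then be handed to whatever
    XML/RDF parser the consuming application already uses."""
    pattern = re.compile(
        rb'<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>', re.DOTALL)
    return [m.group(1).strip() for m in pattern.finditer(pdf)]
```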
I suggest the best way to really help consumers is to go ahead and embed metadata in the first place; then there would be a clear impetus for extracting it. Even if a fuller metadata set is not being considered at this time, at least the DOI should be considered for embedding in the PDF as a “hook” for further services. The handle plugin is a really good example of just such a downstream application.