Now, assuming XMP is a good idea (and I think on balance it is, as blogged here earlier), why are we not seeing any metadata published in scholarly media files? The only drawbacks that occur to me are:
- Hard to write: it’s too damn difficult, there’s no tool support, etc.
- Hard to model: the rigid, “simple” XMP data model both complicates and constrains the RDF data model
Well, I don’t really believe that the first is too difficult to overcome. A little focus and ingenuity should do the trick. I do, however, think the second is just a crazy straitjacket that Adobe is forcing us all to wear, but if we have to live with that then so be it. Better in Bedlam than without. (RSS 1.0 wasn’t so much better but allowed us to do some useful things. And that came from the RDF community itself.) We could argue this till the cows come home but I don’t see any chance of any change any time soon.
So, putting the RDF issue aside for the moment (as if RDF didn’t have problems of its own – XML, URI, etc.) let’s just look at the options for writing the stuff. (Btw, I’m not referencing any tools or toolkits. This is just in the round.) There are various means of publishing metadata in XMP:
- XMP can be produced as standalone files: see the XMP Specification (Sept. ’05), p. 36. (These are called “sidecar” files if the file has the same name as the main document and is in the same directory.) The only things needed to produce these files are a text editor and a good grasp of the XMP serialization. A template will do for that. The main problem with a standalone file is that it does not travel with the media file and so risks being left behind.
Worth a note here. It is not standalone as such, but the Mars format (the draft XML formalization for PDF) discloses its metadata in an independent XMP file “metadata.xml” under the “META-INF/” directory. For distribution the whole directory structure is packaged up as a zip file, and so the XMP is embedded in a “.mars” file. Accessed directly from the zip file or from the unpackaged directory, though, the XMP can be manipulated just like any other XML document.
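To make the Mars point concrete, here is a minimal Python sketch of pulling that XMP out of the zip and parsing it as ordinary XML. It assumes only what is said above (a zip package with “META-INF/metadata.xml” inside); the function name `read_mars_xmp` is made up for illustration and I have not checked it against real “.mars” files.

```python
import zipfile
import xml.etree.ElementTree as ET


def read_mars_xmp(mars_path, member="META-INF/metadata.xml"):
    """Return the parsed XMP root element from a .mars (zip) package.

    `member` is the metadata path described in the text; treat it as an
    assumption if your package is laid out differently.
    """
    with zipfile.ZipFile(mars_path) as archive:
        # archive.read() returns the raw bytes of the named zip member.
        return ET.fromstring(archive.read(member))
```

Once parsed, the tree can be queried or edited like any other XML document, which is really the whole attraction here.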
- Embedding the XMP within the media file is the normal means of distributing it. Some graphics formats are essentially linear (JPEG, PNG, GIF) and it is relatively straightforward to add in an XMP packet. Other formats (PDF, TIFF) have internal cross-referencing and are more difficult to deal with.
- Embedded + sidecar. One possible method for dealing with the difficulty of writing XMP is to note that some media (especially PDFs) already have embedded XMP packets. As noted earlier, much if not all of the metadata in these XMP packets will be workflow-related and thus dispensable for final-form products where authority work-related metadata is desired. These packets may or may not be writeable; the writeable ones include additional padding whitespace. Even in read-only packets there is much (if not all) that can be discarded, along with some unnecessary bulk (e.g. default namespace declarations which are never used). The bottom line is that a legacy XMP packet may typically be 2-3K in size and, just as in transplanting a cell nucleus, the XMP packet innards can be deftly substituted with minimal XMP packet content, say 1K in size, which would be guaranteed to fit with suitable padding. A packet that size would be sufficient to provide at minimum for a DOI and for a reference to additional metadata, e.g. a more complete standalone XMP packet. The two forms can coexist.
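The “nucleus transplant” can be sketched in a few lines of Python. This is a rough illustration only: it assumes a UTF-8 packet stored uncompressed and unencrypted in the file, it pads with plain spaces, and the names `transplant_packet` and `MINIMAL_BODY` as well as the DOI are hypothetical. The point it shows is that the replacement is padded to exactly the old length, so nothing else in the file moves.

```python
import re

# Header PI, packet body, trailer PI of a UTF-8 XMP packet.
PACKET_RE = re.compile(
    rb'(<\?xpacket begin=.*?\?>)(.*?)(<\?xpacket end=["\'][rw]["\']\?>)',
    re.DOTALL,
)

# Hypothetical minimal content: just an identifier (made-up DOI).
MINIMAL_BODY = b"""
<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
    xmlns:dc="http://purl.org/dc/elements/1.1/">
   <dc:identifier>doi:10.9999/example</dc:identifier>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>
"""


def transplant_packet(data: bytes, new_body: bytes) -> bytes:
    """Swap the body of the first XMP packet, keeping the byte length fixed."""
    match = PACKET_RE.search(data)
    if match is None:
        raise ValueError("no XMP packet found")
    slack = len(match.group(2)) - len(new_body)
    if slack < 0:
        raise ValueError("new packet body does not fit in the old one")
    # Pad with whitespace so every byte offset after the packet is unchanged.
    padded = new_body + b" " * slack
    return data[:match.start(2)] + padded + data[match.end(2):]
```

Because the output is the same length as the input, any internal cross-referencing that depends on byte offsets should be left undisturbed, which is what makes this attractive for the “difficult” formats.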
The third option here allows embedding a minimal XMP packet into “difficult” packaging structures while pointing to a fully-formed XMP packet. The “simple” packaging structures can include a fully-formed XMP packet and may also reference extended metadata sources as per my previous post here.
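For the standalone side of this, the “text editor and a template” approach is easy to automate. The Python sketch below fills a small template and writes it next to the media file under the same base name. The template fields, the `write_sidecar` name and the “.xmp” extension are my assumptions for illustration, and I have left out the xpacket wrapper on the assumption that a standalone file can do without it.

```python
from pathlib import Path
from xml.sax.saxutils import escape

# Illustrative template: one identifier and one title in Dublin Core.
SIDECAR_TEMPLATE = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
    xmlns:dc="http://purl.org/dc/elements/1.1/">
   <dc:identifier>{identifier}</dc:identifier>
   <dc:title>
    <rdf:Alt><rdf:li xml:lang="x-default">{title}</rdf:li></rdf:Alt>
   </dc:title>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>
"""


def write_sidecar(media_path, identifier, title):
    """Write <basename>.xmp beside the media file and return its path."""
    sidecar = Path(media_path).with_suffix(".xmp")
    xml_text = SIDECAR_TEMPLATE.format(
        identifier=escape(identifier),  # escape &, <, > for XML
        title=escape(title),
    )
    sidecar.write_text(xml_text, encoding="utf-8")
    return sidecar
```

Calling `write_sidecar("paper.pdf", "doi:10.9999/example", "Some title")` (a made-up DOI) would produce “paper.xmp” in the same directory, which is the sidecar arrangement described earlier.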