Metadata - For the Record

Interesting post here from Gunar Penikis of Adobe entitled "Permanent Metadata" (Oct. '04). He talks about the issues of embedding metadata in media and comes up with this:

"It may be the case that metadata in the file evolves to become a "cache of convenience" with the authoritative information living on a web service. The web service model is designed to provide the authentication and permissions needed. The link between the two provided by unique IDs. In fact, unique IDs are already created by Adobe applications and stored in the XMP - that is what the XMP Media Management properties are all about."

An intriguing idea. Of course, Gunar's (and Adobe's) preoccupations with metadata revolve mainly around document workflow whereas, at least as things stand currently, scholarly publishers are concerned mainly with the dissemination of media in final form. Hence some differences in thinking:

Subject
As just noted, Adobe are more interested in workflow than in work. Scholarly articles are rich in descriptive metadata about the work itself and have a well-developed citation model. Academic interest is in the intellectual content rather than the vehicle used to carry and preserve that content - the file format.

Unique IDs
Workflow IDs are UUIDs which identify specific instances and expressions, but do not identify the abstract work. UUIDs provide a unique identifier but there is no central registry for such identifiers, hence they cannot be "looked up". CrossRef publishers should be concerned to associate closely the DOI for the underlying work with a given media file. That's the identifier that this community is actively promoting.

Read/Write
Because of the focus on workflow, the XMP specification recommends that XMP packets be "writeable", that is, that they be marked as "writeable" and that they include padding whitespace which can accommodate updates without changing packet size. Publishers distributing final form documents are more likely to want to distribute "read-only" metadata which is authoritative and which describes the work, rather than the document format and workflow. Of course, this should not preclude additional sources of metadata which may be added "by reference" rather than "by value". That is, a pointer to a web page (or service) may be sufficient to relate additional publisher terms and user annotations instead of embedding them directly in the file, for various reasons: a) file integrity, b) limiting growth of file size, c) term authority, d) dynamic production (in forward time), and e) multiple sources.
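
To make the writeable versus read-only distinction concrete, here is a minimal sketch of the XMP packet wrapper in Python. The helper name wrap_xmp_packet and the 2048-character padding default are my own assumptions for illustration; only the xpacket header and trailer shape come from XMP itself.

```python
# Sketch only: the helper name and the padding default are illustrative assumptions.
PACKET_ID = "W5M0MpCehiHzreSzNTczkc9d"  # fixed id string carried by XMP packet headers


def wrap_xmp_packet(rdf_xml: str, writeable: bool, padding: int = 2048) -> str:
    """Wrap serialized RDF/XML in an XMP packet header and trailer."""
    header = f'<?xpacket begin="\ufeff" id="{PACKET_ID}"?>'
    # A writeable packet carries trailing whitespace so that an application can
    # grow the metadata in place without changing the packet size.
    pad = " " * padding + "\n" if writeable else ""
    mode = "w" if writeable else "r"
    trailer = f'<?xpacket end="{mode}"?>'
    return f"{header}\n{rdf_xml}\n{pad}{trailer}"


body = "<x:xmpmeta xmlns:x='adobe:ns:meta/'></x:xmpmeta>"
workflow_packet = wrap_xmp_packet(body, writeable=True)
published_packet = wrap_xmp_packet(body, writeable=False)
print(published_packet)
```

A publisher shipping final form files would presumably use the second form: marked end="r" and with no padding, since nobody downstream is expected to rewrite it.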

Comments

He's right about the importance of unique IDs, but I don't think Adobe has fully resolved the problem. I'd say they've avoided it.

For example, they effectively strip out the core building block of RDF from XMP: the URI. So their GUID is in fact encoded as a literal, and the document effectively has no stable identity (an empty rdf:about attribute typically).

If you give the document a stable URI, then it becomes easy to describe relations between documents: x is a version of y, and so forth. Some of this could even be automatically captured.

Now, giving a document a stable URI becomes a little tricky in a desktop environment where you move documents around, and so cannot rely on, for example, a file path URI.

But, it seems to me, you could give a document something like a urn:uuid or a similar non-dereferenceable URI.

BTW, see my latest on ODF and RDF.

Oh, BTW, if one gave a document (like an ODF document) a non-dereferenceable URI, I'd store the file path information and such along with that.
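
A rough sketch of that idea in Python, using only the standard library: mint a urn:uuid for each document, relate one to the other, and record the current file path as ordinary data rather than as the identity. The choice of dcterms:isVersionOf for "x is a version of y", the example.org filePath property, and the path itself are assumptions for illustration, not anything XMP or ODF prescribes.

```python
import uuid

DCTERMS = "http://purl.org/dc/terms/"
# Hypothetical property for recording where the file currently lives.
FILE_PATH = "http://example.org/terms/filePath"


def new_document_uri() -> str:
    """Mint a location-independent identifier of the form urn:uuid:..."""
    return uuid.uuid4().urn


def resource_triple(subject: str, predicate: str, obj: str) -> str:
    """Format a statement whose object is a resource, as an N-Triples line."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def literal_triple(subject: str, predicate: str, value: str) -> str:
    """Format a statement whose object is a plain literal, as an N-Triples line."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{subject}> <{predicate}> "{escaped}" .'


original = new_document_uri()
revision = new_document_uri()

statements = [
    resource_triple(revision, DCTERMS + "isVersionOf", original),
    # The path is hypothetical; it can change without touching the identifier.
    literal_triple(revision, FILE_PATH, "/home/bruce/drafts/paper-v2.odt"),
]
print("\n".join(statements))
```

Moving the file then means updating one literal, while every statement that points at the urn:uuid stays valid.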

Hi Bruce:

Not sure I follow you entirely. This is a typical XMP packet (here from the XMP spec):


@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix pdf: <http://ns.adobe.com/pdf/1.3/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xmp: <http://ns.adobe.com/xap/1.0/> .
@prefix xmpMM: <http://ns.adobe.com/xap/1.0/mm/> .

<uuid:f37aabb4-4adc-11d8-8575-000a9575d11c> pdf:Keywords "XMP metadata schema XML RDF";
    pdf:Producer "Acrobat Distiller 5.0.5 for Macintosh";
    xmp:CreateDate "2004-01-14T11:50:46Z";
    xmp:CreatorTool "FrameMaker 7.0";
    xmp:MetadataDate "2004-01-19T16:09:44-08:00";
    xmp:ModifyDate "2004-01-19T16:09:44-08:00";
    xmpMM:DocumentID "uuid:ddbb0948-4adc-11d8-8575-000a9575d11c";
    dc:creator [
        a rdf:Seq;
        rdf:_1 "Adobe Developer Technologies" ];
    dc:description [
        a rdf:Alt;
        rdf:_1 "XMP Metadata"@x-default ];
    dc:format "application/pdf";
    dc:title [
        a rdf:Alt;
        rdf:_1 "XMP - Extensible Metadata Platform"@x-default ] .

So, they are using UUIDs as URIs. But, since the spec appeared contemporaneously with the URN UUID RFC, they were not able to make use of that. You are right about XMP wanting to force the objects of predicates to be mostly (not always) literals rather than resources. This really is unforgivable. But what to do? I'd rather have a resource as a literal than nothing at all, because consenting applications could read it for what it is. That doesn't make it right, though.

I had admittedly not looked at XMP lately. Are you saying they're now serializing as N3?? It's a little hard to read the example, on my screen at least. It looks like you're saying the subject URI is in fact <uuid:f37aabb4-4adc-11d8-8575-000a9575d11c>. If yes, then mea culpa; seems they've changed things, and that's good.

But yes, my point is that the notion of reference is really central not just to RDF, but to users. Think of GUI select lists.

Are you kidding me? That "X" is there for a reason. ;)

Nopes, XMP is pretty much as it always was. Just that I find it so much easier to read N3 than XML, wouldn't you agree? Blogged about that first about halfway through this post:

http://www.crossref.org/CrossTech/2007/08/exiftool.html

I set this alias in my login to grep the XMP packet and turn it into something readable:

alias xmp2n3 'exiftool -xmp -b \!$ | grep -v "<?" | grep -v xmpmeta | cwm --rdf --n3=d'
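
For anyone not on csh, the two grep -v steps in that alias can be sketched in Python as below. This covers only the filtering; the exiftool extraction and the cwm conversion are left out, and the function name strip_packet_wrapper and the sample packet are made up for illustration.

```python
def strip_packet_wrapper(xmp_packet: str) -> str:
    """Drop xpacket processing instructions and x:xmpmeta wrapper lines.

    Line-based like the greps, so it assumes the wrapper elements sit on
    their own lines.
    """
    kept = [
        line
        for line in xmp_packet.splitlines()
        if "<?" not in line and "xmpmeta" not in line
    ]
    return "\n".join(kept)


# Made-up minimal packet, just to exercise the filter.
sample = """<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
</rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>"""

print(strip_packet_wrapper(sample))
```

What is left is the bare rdf:RDF element, which is what an RDF parser wants to see.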

As for the UUIDs: this example, like some XMP, uses a UUID as subject. Generally the subject is empty and the reference is to be taken as a relative URI, that is, the current document.
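
A small sketch of what that means in practice, using Python's urljoin as a stand-in for the RDF/XML resolution rule (an approximation, not the RDF parser's own code). The helper name resolve_about and both file locations are hypothetical.

```python
from urllib.parse import urljoin


def resolve_about(about: str, base_uri: str) -> str:
    """Resolve an rdf:about value against the in-scope base URI."""
    return urljoin(base_uri, about)


# Hypothetical locations of the same file before and after it is moved.
before = "file:///Users/tony/papers/xmp-spec.pdf"
after = "file:///Volumes/backup/xmp-spec.pdf"

# An empty rdf:about takes on whatever the base happens to be.
print(resolve_about("", before))
print(resolve_about("", after))

# An absolute subject, like the UUID in the example packet, is unaffected.
subject = "uuid:f37aabb4-4adc-11d8-8575-000a9575d11c"
print(resolve_about(subject, before))
```

So with an empty subject the identity travels with the file location, which is Bruce's worry above; with a UUID subject it does not.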

(The rule comes from section 5.3 of the RDF/XML syntax spec, http://www.w3.org/TR/rdf-syntax-grammar/

"The empty string is transformed into an RDF URI reference by substituting the in-scope base URI.")

As for resources as literals: while this is really bad practice, I guess I'm inclined to be a little more lenient, coming from a Perl/Ruby background and used to riding on the coattails of dynamic typing. If I were a diehard C/C++ or Java programmer I would be truly aghast.
