
October 17, 2007

Hybrid

So, back on the old XMP tack. The simple vision from the XMP spec is that XMP packets are embedded in media files and transported along with them - and as such are relatively self-contained units, see Fig 1.

Hybrid - A.jpg
Fig. 1 - Media files with fully encapsulated descriptions.

But this is too simple. Some preliminary considerations lead us to see why we might want to reference additional (i.e. external) sources of metadata from the original packet:

PDFs
PDFs are tightly structured and as such it can be difficult to write a new packet, or to update an existing packet. One solution proposed earlier is to embed a minimal packet which could then reference a more complete description in a standalone packet. (And in turn this standalone packet could reference additional sources of metadata.)

Images
While it is considerably simpler to write into web-delivery image formats (e.g. JPEG, GIF, PNG), it is likely that only metadata pertinent to the image itself will be embedded. Also of interest is the work from which the image is derived, which is most likely to be described externally to the image in a standalone document. (And in turn this standalone packet could reference additional sources of metadata.)

Thus the two cases - PDF documents and images - are not dissimilar. Fig. 2 shows a "wall-to-wall" XMP architecture whereby the standalone metadata documents for the work and for additional sources are expressed in XMP.

Hybrid - B.jpg
Fig. 2 - XMP "wall-to-wall" architecture.

Fig. 3 presents a variant on this theme whereby additional sources are presented as generic RDF/XML. (In the most general case only RDF need be assumed, the serialization being a matter of choice.)

Hybrid - C.jpg
Fig. 3 - XMP authority metadata with references to generic RDF/XML

And finally, Fig. 4 shows the most extreme case whereby XMP is used merely to "bootstrap" RDF descriptions for media objects. The XMP is used to embed a minimal description into the media file with references to a fuller work description and to additional sources which are presented as generic RDF/XML. That is, the metadata descriptions use generic RDF/XML exclusively and only resort to the idiomatic RDF/XML employed by XMP for embedding descriptions into binary structures.

Hybrid - D.jpg
Fig. 4 - XMP "bootstrap" only - metadata descriptions proper are generic RDF/XML.

If I were to choose I might opt for the scenario presented in Fig. 3, but the scenarios in both Figs. 2 and 4 leave room for thought. Such a hybrid solution may be a means to bridge two different concerns:

  • Generic RDF/XML for unconstrained descriptions.
  • Idiomatic RDF/XML (aka XMP) for embedding the head of a metadata trail.

I'm not sure that I see the XMP spec loosening up any time soon to accommodate generic RDF/XML. Nor, likewise, is XMP likely to be provided (or even tolerated) down the metadata trail. And the metadata is not going to be fully encapsulated within a media file. The media file will merely encapsulate the head of the metadata trail.

DCMI Identifiers Community

Another DCMI invitation. And a list. Lovely.

See this message (copied below) from Douglas Campbell, National Library of New Zealand, to the dc-general mailing list.

"Hi all,

I would like to alert members of this list to the new DCMI Identifiers Community established at the recent Dublin Core Metadata Initiative (DCMI) Advisory Board meeting in Singapore. It is moderated by Douglas Campbell (National Library of New Zealand).

The community is a forum for individuals and organisations with an interest in the design and use of identifiers in metadata. It also serves as a liaison channel for those involved in identifier efforts in other domains.

There was a lot of interest in identifiers at the recent DCMI conference. Identifiers are fundamental to the Web and for managing digital content, but most of us don't know where to begin in designing and assigning them. The level of confusion can be seen in the number of meetings and workshops held just about identifiers. DCMI is in a unique position to bring together the thinking (and doing) around identifiers from multiple domains.

I would like to encourage you to share your identifier efforts and thinking amongst the DCMI community on our Identifiers wiki at:
http://dublincore.org/identifierswiki

You can join the community by signing up to our JISCMAIL list, linked from our community homepage at:
http://www.dublincore.org/groups/identifiers/
or by going direct to jiscmail:
http://www.jiscmail.ac.uk/cgi-bin/webadmin?SUBED1=dc-identifiers&A=1

Thanx,
Douglas"

October 15, 2007

NLM Blog Citation Guidelines

I've just returned from the Frankfurt Book Fair and noticed that there has been some recent popular interest in the NLM Style Guide for Authors, Editors and Publishers recommendations concerning citing blogs.

Which reminds me of an issue that has periodically been raised here at CrossRef: should we be doing something to provide a service for reliably citing more ephemeral content such as blogs, wikis, etc.?

Personally, I cringe when I see people include plain old URLs (POUs?) in citations. What's the point? They are almost guaranteed to fail to resolve after a few years. In citing them, you are hardly helping to preserve the scholarly record. You might as well just record the metadata associated with the content.

So why don't we simply allow individuals to assign DOIs to their content?

As Chuck Koscher says, "CrossRef DOIs are only as persistent as CrossRef staff." CrossRef depends on its ability to chase down and berate member publishers when they fail to update their DOI records. It's hard enough doing this with publishers, so just imagine what it would be like trying to chase down individuals. In short, it just wouldn't scale.

But what if we provided a different service for more informal content? Recently we have been talking with Gunther Eysenbach, the creator of the very cool WebCite service, about whether CrossRef could/should operate a citation caching service for ephemera.

As I said, I think WebCite is wonderful, but I do see a few problems with it in its current incarnation.

The first is that, the way it works now, it seems to effectively leech usage statistics away from the source of the content. If I have a blog entry that gets cited frequently, I certainly don't want all the links (and their associated Google-juice) redirected away from my blog. As long as my blog is working, I want traffic coming to my copy of the content, not some cached copy of the content (gee - the same problem publishers face, no?). I would also, ideally, like that traffic to continue to come to my blog if I move hosting providers, platforms (WordPress, Movable Type), blog conglomerates (Gawker, Weblogs, Inc.), etc.

The second issue I have with WebCite is simpler. I don't really fancy having to actually recreate and run a web-caching infrastructure when there is already a formidable one in existence.

So what if we ran a service for individuals that worked like this:

  1. For a fee, you can assign DOIs to your ephemeral, CC-licensed content.
  2. When you assign a DOI to a piece of content (or update an existing DOI), we will immediately archive said content with the Internet Archive (who, incidentally, charges for this service)
  3. We will direct those DOIs to your web site as long as you are both:
    a. Paying the fee
    b. Updating your URLs to point to the correct content
  4. If you fail in either "a" or "b", we will then redirect said DOIs to the cached version of the content on the Internet Archive (after having warned you repeatedly via automated e-mail).
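
The redirect rule in the list above can be sketched in a few lines. This is only an illustration of the proposal, not a description of any existing CrossRef service: the `DoiRecord` class, its field names, and the example DOI and URLs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DoiRecord:
    """Hypothetical record for a DOI assigned to ephemeral content."""
    doi: str
    owner_url: str     # where the owner says the content currently lives
    archive_url: str   # cached copy made when the DOI was assigned or updated
    fee_paid: bool     # condition "a": the owner is paying the fee
    url_current: bool  # condition "b": the owner keeps the URL up to date


def resolve(record: DoiRecord) -> str:
    """Return the URL the DOI should redirect to under the proposed rules."""
    if record.fee_paid and record.url_current:
        return record.owner_url
    # Either condition failed: fall back to the archived copy.
    return record.archive_url


record = DoiRecord(
    doi="10.5555/example",                           # made-up DOI
    owner_url="https://blog.example.org/post",       # made-up URLs
    archive_url="https://archive.example.org/post",
    fee_paid=True,
    url_current=True,
)
print(resolve(record))
```

The point of the design is that the owner's copy always wins while the owner keeps up both obligations, and the DOI still resolves somewhere when they do not.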

(Note, as an aside, that we could in theory provide a similar dark-archive service for publishers with non-free content using something like JSTOR as the archive.)

This approach would help to ensure that a blogger's version of content was always linked to as long as it was available. It would also preserve the "persistence" of CrossRef DOIs by making sure that we could always resolve the DOI even if we were not able to get the owner of said DOI to update it.

So back to the NLM guidelines... On the one hand, I'm delighted to see that the NLM has issued guidelines on citing blogs. It seems glaringly obvious that informal (and ephemeral) content such as blogs and wikis is increasingly becoming a vital part of the scholarly record. On the other hand, it also seems to me that recommending that somebody "cite" with a broken pointer (i.e. a URL) to content verges on tokenism. This isn't the NLM's fault - there just isn't a reliable mechanism for citing informal content in a manner that ensures you can then retrieve and look at said content in the future.

And this is no longer a problem confined to the Scholarly/Professional publishing space. As Jon Udell has occasionally pointed out, citation is increasingly an important currency for *any* professional writer on the web. It seems to me that a system for reliably citing blogs and wikis would benefit many communities. I could easily see commercial hosted Blog services (Blogger, WordPress) offering a "Cached-DOI" feature as a premium service to their clients.

So what do you think? What am I missing? Is this something we should be looking at?


October 14, 2007

OpenDocument Adds RDF

Bruce D'Arcus left a comment here in which he linked to a post of his: "OpenDocument's New Metadata System". Not everybody reads comments so I'm repeating it here. His post is worth reading on two counts:

  1. He talks about the new metadata functionality for OpenDocument 1.2 which uses generic RDF. As he says:
    "Unlike Microsoft’s custom schema support, we provide this through the standard model of RDF. What this means is that implementors can provide a generic metadata API in their applications, based on an open standard, most likely just using off-the-shelf code libraries."
    This is great. It means that description is left up to the user rather than being restricted by any vendor limitation. (Ideally we would like to see the same for XMP. But Adobe is unlikely to budge because of the legacy code base and documents. It's a wonder that Adobe still wants XMP to breathe.)
  2. He cites a wonderful passage from Rob Weir of IBM (something which I had been meaning to blog about, but too late now) about the changing shape of documents. Can only say, go read Bruce's post and then Rob's post. But anyway a spoiler here:
    "The concept of a document as being a single storage of data that lives in a single place, entire, self-contained and complete is nearing an end. A document is a stream, a thread in space and time, connected to other documents, containing other documents, contained in other documents, in multiple layers of meaning and in multiple dimensions."
I think the ODF initiative is fantastic and wish that Adobe could follow suit. However, I do still hold out something for XMP. After all, nobody else AFAICT is doing anything remotely similar for multimedia. Where's the W3C and co. when you really need them? (Oh yeah, faffing about the new Semantic Web logo. ;)

October 13, 2007

I Want My XMP

Now, assuming XMP is a good idea - and I think on balance it is (as blogged here earlier), why are we not seeing any metadata published in scholarly media files? The only drawbacks that occur to me are:

  1. Hard to write - it's too damn difficult, no tools support, etc.
  2. Hard to model - the rigid, "simple" XMP data model both complicates and constrains the RDF data model

Well, I don't really believe that 1) is too difficult to overcome. A little focus and ingenuity should do the trick. I do, however, think 2) is just a crazy straitjacket that Adobe is forcing us all to wear but if we have to live with that then so be it. Better in Bedlam than without. (RSS 1.0 wasn't so much better but allowed us to do some useful things. And that came from the RDF community itself.) We could argue this till the cows come home but I don't see any chance of any change any time soon.

So, putting the RDF issue aside for the moment (as if RDF didn't have problems of its own - XML, URI, etc.) let's just look at the options for writing the stuff. (Btw, I'm not referencing any tools or toolkits. This is just in the round.) There are various means of publishing metadata in XMP:

Sidecar
XMP can be produced as standalone files - see XMP Specification, (Sept. '05), p. 36. (These are called "sidecar" files if the file has the same name as the main document and is in the same directory.) The only things needed to produce these files are a text editor and a good grasp of the XMP serialization. A template will do for that. The main problem with a standalone file is that it does not travel with the media file and so risks being left behind.

Worth a note here. Not standalone as such but the Mars format (the draft XML formalization for PDF) discloses its metadata in an independent XMP file "metadata.xml" under the "META-INF/" directory. For distribution the whole directory structure is packaged up as a zip file and so the XMP is embedded in a ".mars" file, but accessed directly from the zip file or from the unpackaged directory the XMP can be manipulated just like any other XML document.

Embedded
This is the normal means of distributing XMP - embedded within the media file. Some graphics formats are essentially linear (JPEG, PNG, GIF) and it is relatively straightforward to add in an XMP packet. Other formats (PDF, TIFF) have internal cross-referencing and are more difficult to deal with.

Embedded + Sidecar
One possible method for dealing with the difficulty of writing XMP is to note that some media (especially PDFs) already have embedded XMP packets. As noted earlier, much if not all of the metadata in these XMP packets will be workflow-related and thus dispensable for final-form products where authority work-related metadata is desired. These packets may, or may not, be writeable and thus include additional padding whitespace. Even for read-only packets there is much (if not all) that can be discarded and also sometimes unnecessary bulk (e.g. default namespace declarations which are never used). The bottom line is that any legacy XMP packet may typically be 2-3K in size and, just as in transplanting a cell nucleus, the XMP packet innards can be deftly substituted with a minimal XMP packet content, say 1K in size, which would be guaranteed to fit with suitable padding. A packet that size would be sufficient to provide at minimum for a DOI and for a reference to additional metadata, e.g. a more complete standalone XMP packet. The two forms can coexist.
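
The "nucleus transplant" described above might look something like the following sketch. It rests on several assumptions: the packet is stored uncompressed and unencrypted in the file, it is UTF-8, and it is delimited by the usual `<?xpacket begin=... ?>` and `<?xpacket end=... ?>` processing instructions. The function name `substitute_packet` is made up for illustration.

```python
import re

# Matches the first XMP packet and captures everything between the header
# and trailer processing instructions (the packet "innards" plus padding).
PACKET_RE = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>", re.DOTALL
)


def substitute_packet(data: bytes, new_body: bytes) -> bytes:
    """Swap the body of the first XMP packet for new_body, keeping the size.

    The new body is padded with spaces up to the size of the old one, so
    every byte offset outside the packet body is left unchanged.
    """
    match = PACKET_RE.search(data)
    if match is None:
        raise ValueError("no XMP packet found")
    start, end = match.span(1)
    room = end - start
    if len(new_body) > room:
        raise ValueError("new packet body does not fit in the old packet")
    padded = new_body + b" " * (room - len(new_body))
    return data[:start] + padded + data[end:]
```

Because the overall file length and all offsets outside the packet stay the same, formats with internal cross-referencing should not need their tables rewritten, which is the whole attraction of the approach.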


The third option here allows embedding a minimal XMP packet into "difficult" packaging structures while pointing out to a fully-formed XMP packet. The "simple" packaging structures may both include a fully-formed XMP packet while also possibly referencing extended metadata sources as per my previous post here.

Metadata - For the Record

Interesting post here from Gunar Penikis of Adobe entitled "Permanent Metadata" (Oct. '04). He talks about the issues of embedding metadata in media and comes up with this:

"It may be the case that metadata in the file evolves to become a "cache of convenience" with the authoritative information living on a web service. The web service model is designed to provide the authentication and permissions needed. The link between the two provided by unique IDs. In fact, unique IDs are already created by Adobe applications and stored in the XMP - that is what the XMP Media Management properties are all about."

An intriguing idea. Of course, Gunar's (and Adobe's) preoccupations with metadata revolve mainly around document workflow whereas, at least as things stand currently, scholarly publisher concerns are mainly with the dissemination of media in final form. Hence some differences in thinking:

Subject
As just noted Adobe are more interested in workflow than in work. Scholarly articles are rich in descriptive metadata about the work itself and have a well-developed citation model. Academic interest is in the intellectual content rather than the vehicle used to carry and preserve that content - the file format.

Unique IDs
Workflow IDs are UUIDs which identify specific instances and expressions, but do not identify the abstract work. UUIDs provide a unique identifier but there is no central registry for such identifiers, hence they cannot be "looked up". CrossRef publishers should be concerned to associate closely the DOI for the underlying work with a given media file. That's the identifier that this community is actively promoting.

Read/Write
Because of the focus on workflow, the XMP specification recommends that XMP packets be "writeable", that is that they be marked as "writeable" and that they include padding whitespace which can accommodate updates without changing packet size. Publishers distributing final form documents are more likely to want to distribute "read-only" metadata which is authoritative and which describes the work, rather than the document format and workflow. Of course, this should not preclude additional sources of metadata which may be added "by reference" rather than "by value". That is, a pointer to a web page (or service) may be sufficient to relate additional publisher terms and user annotations instead of embedding them directly in the file for various reasons: a) file integrity, b) limiting growth of file size, c) term authority, d) dynamic production (in forward time), and e) multiple sources.

October 12, 2007

DataNet

Last week, my colleague Ian Mulvany posted on Nascent an entry about NSF's recent call for proposals on DataNet (aka "A Sustainable Digital Data Preservation and Access Network"). Peter Brantley, of DLF, has set up a public group DataNet on Nature Network where all are welcome to join in the discussion on what NSF effectively are viewing as the challenge of dealing with "big data". As Ian notes in a mail to me:

"It seems that for a fully integrated flow of data then publisher involvement is going to be required, and it is clear from the proposal that the NSF are also interested in rights management or at negotiating that issue."

October 09, 2007

OTMI Applied - Means More Search Hits

otmi-twease-window-alpha.png

Following up on previous posts here on OTMI (the proposal from NPG for scholarly publishers to syndicate their full text to drive text-mining applications), Fabien Campagne from Cornell, a long-time OTMI supporter, has created an OTMI-driven search engine (based on his Twease work). This may be the first publicly accessible OTMI-based service. It currently only contains NPG content from the OTMI archive online - some 2 years' worth of Nature and four other titles. (When will we begin to see other publishers on board?)

What's happening here? Well, Twease is a web-based front-end to searching Medline abstracts. As such, a search will retrieve a set of results labeled by PMID and list all lines in the abstract where a match occurs. By contrast, with Twease-OTMI a search is run over the article full text and will retrieve all text "snippets" (for Nature we use sentences, although other units of text are possible) which match. See the figure above where the top three results are all labeled by the same DOI and show text matches from various points within the document.

This shows that a far superior search match rate is possible using the article full text (as distributed in OTMI format) where text integrity as a publishable asset is not compromised.

October 08, 2007

Mars Bar

Just noticed that there is now (as of last month) a blog for Mars ("Mars: Comments on PDF, Acrobat, XML, and the Mars file format"). See this from the initial post:

"The Mars Project at Adobe is aimed at creating an XML representation for PDF documents. We use a component-based model for representing different aspects of the document and we use the Universal Container Format (a Zip-based packaging format) to hold the pieces. Mars uses XML to represent the individual components where that makes sense, but otherwise uses industry standard formats to represent other components. Examples of these include Fonts (we use OpenType), Images (PNG, GIF, JPEG, JPEG2000), Color (ICC Color Profiles), etc.. We use SVG to represent page content, which fits as both an XML format and an industry standard."

October 05, 2007

The Names Project

Was reminded to blog about this after reading Lorcan's post on the Names Project being run by JISC. From the blurb:

"The project is going to scope the requirements of UK institutional and subject repositories for a service that will reliably and uniquely identify names of individuals and institutions.
 
It will then go on to develop a prototype service which will test the various processes involved. This will include determining the data format, setting up an appropriate database, mapping data from different sources, populating the database with records and testing the use of the data."
One immediate project tangible is the landscape report ('A review of the current landscape in relation to a proposed Name Authority Service for UK repositories of research outputs') which summarizes some current initiatives in author identification from a UK perspective, including inter alia Elsevier's Scopus Author Identifier.

Scholarly DC

This announcement was just sent out to the DC-GENERAL mailing list about the new DCMI Community for Scholarly Communications. As Julie Allinson says:

"The aim of the group is to provide a central place for individuals and organisations to exchange information, knowledge and general discussion on issues relating to using Dublin Core for describing items of 'scholarly communications', be they research papers, conference presentations, images, data objects. With digital repositories of scholarly materials increasingly being established across the world, this group would like to offer a home for exploring the metadata issues faced."
There's also a DC-SCHOLAR mailing list (subscribe here). Not too much there yet, but it may be useful to track - or even to participate. :)

October 02, 2007

InChIKey

The InChI (International Chemical Identifier from IUPAC) has been blogged earlier here. RSC have especially taken this on board in their Project Prospect and now routinely syndicate InChI identifiers in their RSS feeds as blogged here.

As reported variously last month (see here for one such review) IUPAC have now released a new (1.02beta) version of their software which allows hashed versions (fixed-length, 25-character) of the InChI, so-called InChIKeys, to be generated which are much more search engine friendly. Compare a regular InChI identifier:

InChI=1/C49H70N14O11/c1-26(2)39(61-42(67)33(12-8-18-55
-49(52)53)57-41(66)32(50)23-38(51)65)45(70)58-34(20-29-1
4-16-31(64)17-15-29)43(68)62-40(27(3)4)46(71)59-35(22-30
-24-54-25-56-30)47(72)63-19-9-13-37(63)44(69)60-36(48(7
3)74)21-28-10-6-5-7-11-28/h5-7,10-11,14-17,24-27,32-3
7,39-40,64H,8-9,12-13,18-23,50H2,1-4H3,(H2,51,65)(H,54,56
)(H,57,66)(H,58,70)(H,59,71)(H,60,69)(H,61,67)(H,62,68)(H,73,74)
(H4,52,53,55)/f/h56-62,73H,51-53H2

with its InChIKey counterpart:

InChIKey=JYPVVOOBQVVUQV-UHFFFAOYAR

That's some saving.
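
A fixed-length key is also trivial to recognize mechanically, which is part of what makes it friendlier to search engines and other tooling. The sketch below checks a string against the shape of the one example above (14 uppercase letters, a hyphen, 10 uppercase letters). That shape is inferred from the example only and is an assumption, not the IUPAC definition; the function name is made up.

```python
import re

# Shape inferred from the single example key above: 14 letters, a hyphen,
# then 10 letters, 25 characters in all. An assumption, not a formal spec.
INCHIKEY_RE = re.compile(r"[A-Z]{14}-[A-Z]{10}")


def looks_like_inchikey(value: str) -> bool:
    """Return True if value has the 25-character shape of the example key."""
    prefix = "InChIKey="
    if value.startswith(prefix):
        value = value[len(prefix):]
    return INCHIKEY_RE.fullmatch(value) is not None


print(looks_like_inchikey("InChIKey=JYPVVOOBQVVUQV-UHFFFAOYAR"))
```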

Oh No, Not You Again!

Oh dear. Yesterday's post "Using ISO URNs" was way off the mark. I don't know. I thought that walk after lunch had cleared my mind. But apparently not. I guess I was fixing on eyeballing the result in RDF/N3 rather than the logic to arrive at that result.

There are three namespace cases (and I was only wrong in two out of the three, I think):

1. "pdf:"

I was originally going to suggest the use of "data:" for the PDF information dictionary terms here but then lunged at using an HTTP URI (the URI of the page for the PDF Reference manual on the Adobe site) for regular orthodox conformancy and good churchgoing:


@prefix pdf: <http://www.adobe.com/devnet/pdf/pdf_reference.html> .

This was wrong on two counts:

a) Afaik no such use for this URI as a namespace has ever been made by Adobe. And it is in the gift of the DNS tenant (elsewhere called "owner") to mint URIs under that namespace and to ascribe meanings to those URIs.

b) Also the URI is not best suited to a role as namespace URI since RDF namespaces typically end in "/" or "#" to make the division between namespace and term clearer. (In XML it doesn't make a blind bit of difference as XML namespaces are just a scoping mechanism.) So to have a property URI as


http://www.adobe.com/devnet/pdf/pdf_reference.htmlAuthor

does the job but looks pretty rough and more importantly precludes (at least, complicates) the possibility of dereferencing the URI to return a page with human or machine readable semantics. Better in RDF terms is one of the following:

a) http://www.adobe.com/devnet/pdf/pdf_reference/Author
b) http://www.adobe.com/devnet/pdf/pdf_reference#Author
c) http://www.adobe.com/devnet/pdf/pdf_reference.html#Author

In the absence of any published namespace from Adobe for these terms, I think it would have been more prudent to fall back on "data:" URIs. So

@prefix pdf: <data:,> .

leading to

data:,Author
data:,CreationDate
data:,Creator
etc.

This is correct (afaict) and merely provides a URI representation for bare strings.

Had we wanted to relate those terms to the PDF Reference we might have tried something like:


data:,PDF%20Reference:Author
data:,PDF%20Reference:CreationDate
data:,PDF%20Reference:Creator
etc.

And if we had wanted to make those truly secondary RDF resources related to a primary RDF resource for the "namespace" we could have attempted something like:

data:,PDF%20Reference#Author
data:,PDF%20Reference#CreationDate
data:,PDF%20Reference#Creator
etc.

Note though that the "data:" specification is not clear about the implications of using "#". (Is it allowed, or isn't it?) We must suspect that it is not allowed, but see this mail from Chris Lilley (W3C) which is most insightful.
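
For what it's worth, URI-ifying bare strings in the ways listed above is easy to mechanize. Here is a minimal sketch; the function name `data_uri` and its parameters are made up, and whether "#" is legitimate inside a "data:" URI remains the open question just mentioned.

```python
from urllib.parse import quote


def data_uri(term: str, label: str = "", sep: str = ":") -> str:
    """Build a "data:" URI for a bare key, optionally qualified by a label."""
    body = f"{label}{sep}{term}" if label else term
    # Percent-encode spaces and other reserved characters, but keep the
    # ":" and "#" separators used in the examples above readable.
    return "data:," + quote(body, safe=":#")


print(data_uri("Author"))
print(data_uri("Author", "PDF Reference"))
print(data_uri("Author", "PDF Reference", sep="#"))
```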

2. "pdfx:"

The example was just for demo purposes, but (as per 1a above) it is incumbent on the namespace authority (here ISO) to publish a URI for the term to be used. Anyhow, the namespace URI I cited


@prefix pdfx: <urn:iso:std:iso-iec:15930:-1:2001> .

would not have been correct and would have led to these mangled URIs:

urn:iso:std:iso-iec:15930:-1:2001GTS_PDFXVersion
urn:iso:std:iso-iec:15930:-1:2001GTS_PDFXConformance

It should have been something closer to

@prefix pdfx: <urn:iso:std:iso-iec:15930:-1:2001:> .

leading to

urn:iso:std:iso-iec:15930:-1:2001:GTS_PDFXVersion
urn:iso:std:iso-iec:15930:-1:2001:GTS_PDFXConformance
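
The mangling comes from the fact that a prefixed name is expanded by plain string concatenation of the namespace URI and the local name. A small sketch (the `expand` helper is made up for illustration) shows the difference the trailing ":" makes:

```python
def expand(qname: str, prefixes: dict) -> str:
    """Expand prefix:local by concatenating the namespace URI and local name."""
    prefix, local = qname.split(":", 1)
    return prefixes[prefix] + local


without_separator = {"pdfx": "urn:iso:std:iso-iec:15930:-1:2001"}
with_separator = {"pdfx": "urn:iso:std:iso-iec:15930:-1:2001:"}

print(expand("pdfx:GTS_PDFXVersion", without_separator))
print(expand("pdfx:GTS_PDFXVersion", with_separator))
```

The same reasoning explains why RDF namespace URIs conventionally end in "/" or "#".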

3. "_usr:"

This was the one correct call in yesterday's post.


@prefix _usr: <data:,> .

The only problem here would be to differentiate these terms from the terms listed in the PDF Reference manual, although the PDF information dictionary makes no such distinction itself.

To sum up, perhaps the best way of rendering the PDF information dictionary keys in RDF would be to use "data:" URIs for all (i.e. a methodology for URI-ifying strings) and to bear in mind that at some point ISO might publish URNs for the PDF/X mandated keys: 'GTS_PDFXVersion' and 'GTS_PDFXConformance'. So,

 
# document infodict (object 58: 476983):

@prefix pdfx: <data:,> .
@prefix pdf: <data:,> .
@prefix _usr: <data:,> .

<> _usr:Apag_PDFX_Checkup "1.3";
pdf:Author "Scott B. Tully";
pdf:CreationDate "D:20020320135641Z";
pdf:Creator "Unknown";
pdfx:GTS_PDFXConformance "PDF/X-1a:2001";
pdfx:GTS_PDFXVersion "PDF/X-1:2001";
pdf:Keywords "PDF/X-1";
pdf:ModDate "D:20041014121049+10'00'";
pdf:Producer "Acrobat Distiller 4.05 for Macintosh";
pdf:Subject "A document from our PDF archive. ";
pdf:Title "Tully Talk November 2001";
pdf:Trapped "False" .


October 01, 2007

Using ISO URNs

(Update - 2007.10.02: Just realized that there were some serious flaws in the post below regarding publication and form of namespace URIs which I've now addressed in a subsequent post here.)

By way of experimenting with a use case for ISO URNs, below is a listing of the document metadata for an arbitrary PDF. (You can judge for yourselves whether the metadata disclosed here is sufficient to describe the document.) Here, the metadata is taken from the information dictionary and from the document metadata stream (XMP packet).

The metadata is expressed in RDF/N3. That may not be a surprise for the XMP packet which is serialized in RDF/XML, as it's just a hop, skip and a jump to render it as RDF/N3 with properties taken from schemas whose namespaces are identified by URI. What may be more unusual is to see the document information dictionary metadata (the "normal" metadata in a PDF) rendered as RDF/N3 since the information dictionary is not modelled on RDF, not expressed in XML, and not namespaced. Here, in addition to the trusty HTTP URI scheme, I've made use of two particular URI schemes: "iso:" URN namespaces, and "data:" URIs.

As far as I am aware, there is no formal identifier for entries in the document information dictionary as specified by the PDF Reference from Adobe Systems, so it may be appropriate to use the HTTP URI for the Adobe homepage for the PDF Reference manual, from which specific editions are available.

For the PDF/X keys which are specified in the ISO standard ISO 15930-1:2001, I have used an ISO URN. (I don't expect this to be correct in all details but it should give some idea of how it might be used. It may be that the URI should express the term itself, rather than the document from which the term was defined.) And finally, for the one additional user-supplied key here I have made use of a "data:" URI with no body (i.e. I'm speechless). One could have provided some text within the body of the "data:" URI if one wanted to differentiate between alternate user keys or to otherwise annotate these keys.

Note that the prefixes used in the information dictionary and in the metadata stream are unrelated, as are the mappings of property elements to schemas.

Well, that's all really just for fun but it may show two things: 1) how a general description might be described with RDF and how general properties can be mapped to URIs (with possibly limited machine utility), and 2) how an ISO URN might be used.

 
# document infodict (object 58: 476983):

@prefix pdfx: <urn:iso:std:iso-iec:15930:-1:2001> .
@prefix pdf: <http://www.adobe.com/devnet/pdf/pdf_reference.html> .
@prefix _usr: <data:,> .

<> _usr:Apag_PDFX_Checkup "1.3";
pdf:Author "Scott B. Tully";
pdf:CreationDate "D:20020320135641Z";
pdf:Creator "Unknown";
pdfx:GTS_PDFXConformance "PDF/X-1a:2001";
pdfx:GTS_PDFXVersion "PDF/X-1:2001";
pdf:Keywords "PDF/X-1";
pdf:ModDate "D:20041014121049+10'00'";
pdf:Producer "Acrobat Distiller 4.05 for Macintosh";
pdf:Subject "A document from our PDF archive. ";
pdf:Title "Tully Talk November 2001";
pdf:Trapped "False" .

# document metadata stream (object 41: 472418):

@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix pdf: <http://ns.adobe.com/pdf/1.3/> .
@prefix pdfx: <http://ns.adobe.com/pdfx/1.3/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xmp: <http://ns.adobe.com/xap/1.0/> .
@prefix xmpMM: <http://ns.adobe.com/xap/1.0/mm/> .

<> pdf:Keywords "PDF/X-1";
pdf:Producer "Acrobat Distiller 4.05 for Macintosh";
pdfx:Apag_PDFX_Checkup "1.3";
pdfx:GTS_PDFXConformance "PDF/X-1a:2001";
pdfx:GTS_PDFXVersion "PDF/X-1:2001";
xmp:CreateDate "2002-03-20T13:56:41Z";
xmp:CreatorTool "Unknown";
xmp:MetadataDate "2004-10-14T12:10:49+10:00";
xmp:ModifyDate "2004-10-14T12:10:49+10:00";
xmpMM:DocumentID "uuid:bd7ae9a1-1110-43c0-8e84-632f2dbb55ab";
dc:creator [
a rdf:Seq;
rdf:_1 "Scott B. Tully" ];
dc:description [
a rdf:Alt;
rdf:_1 "A document from our PDF archive. "@x-default ];
dc:format "application/pdf";
dc:title [
a rdf:Alt;
rdf:_1 "Tully Talk November 2001"@x-default ] .
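
One detail worth noticing when comparing the two listings: the information dictionary dates use the PDF date string form (e.g. "D:20041014121049+10'00'") while the metadata stream carries the same instants in ISO 8601 form. A minimal sketch of that mapping follows. It handles only the full-length forms seen in the listing, and the function name is made up for illustration.

```python
import re

# PDF date strings as they appear in the listing: D:YYYYMMDDHHmmSS followed
# by "Z" or an offset such as +10'00'. Shorter forms are not handled here.
PDF_DATE_RE = re.compile(
    r"D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})(Z|[+-]\d{2}'\d{2}'?)"
)


def pdf_date_to_iso(value: str) -> str:
    """Convert a full-length PDF date string to its ISO 8601 equivalent."""
    match = PDF_DATE_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"unsupported PDF date string: {value!r}")
    year, month, day, hour, minute, second, zone = match.groups()
    if zone != "Z":
        # Turn an offset like +10'00' into +10:00.
        zone = zone[:3] + ":" + zone[4:6]
    return f"{year}-{month}-{day}T{hour}:{minute}:{second}{zone}"


print(pdf_date_to_iso("D:20020320135641Z"))
print(pdf_date_to_iso("D:20041014121049+10'00'"))
```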

Whole Lotta ID

ISO has registered with the IANA a URN namespace identifier ("iso:") for ISO persistent resources. From the Internet-Draft:

"This URN NID is intended for use for the identification of persistent resources published by the ISO standards body (including documents, document metadata, extracted resources such as standard schemata and standard value sets, and other resources)."

The toplevel grammar rules (ABNF) give some indication of scope:

NSS     = std-nss
std-nss = "std:" docidentifier *supplement *docelement [addition]

Just wanted to quote here one of the funkier examples cited in the document:

urn:iso:std:iso:9999:-1:ed-1:v1-amd1.v1:en,fr:amd:2:v2:en:clause:3.1,a.2-b.9
 
"refers to (sub)clauses 3.1 and A.2 to B.9 in the corrected version of Amendment 2, in English, which amends the document comprising the 1st version of edition 1 of ISO 9999-1 incorporating the 1st version of Amendment 1, in English/French (bilingual document)"
Wow! That's some ID. That's something else.
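
To get a feel for how much is packed in there, the identifier can at least be pulled apart on its colons. The sketch below does no more than that: it does not validate the fields against the ABNF or interpret them, and the function name is made up.

```python
def split_iso_urn(urn: str) -> list:
    """Split an "iso" URN into the colon-separated fields after "std:"."""
    scheme, nid, nss = urn.split(":", 2)
    if scheme.lower() != "urn" or nid.lower() != "iso":
        raise ValueError("not an ISO URN")
    if not nss.startswith("std:"):
        raise ValueError('only the "std:" form is handled')
    return nss[len("std:"):].split(":")


example = (
    "urn:iso:std:iso:9999:-1:ed-1:v1-amd1.v1:en,fr:"
    "amd:2:v2:en:clause:3.1,a.2-b.9"
)
for field in split_iso_urn(example):
    print(field)
```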

As far as DOI is concerned there is nothing obvious to be learned. It is interesting to see such a level of granularity supported though. And since all these documents issue from a central publisher they can be prescriptive about the identifier syntax. Something which cannot be mandated for the many CrossRef publishers with their own commercial arrangements. Hence DOI is generally agnostic about suffix strings.

Seems to be a little confusion about the registration though. The NID was approved Jan. 15, '07 by the IESG and the IANA Registry of URN Namespaces (last updated Aug. 22, '07) lists the namespace "iso" with the provisional (unnumbered) RFC labelled "RFC-goodwin-iso-urn-01.txt" (being the -01 draft). However, the IETF I-D Tracker reports this status for draft-goodwin-iso-urn, which shows that a new I-D (an -02 draft) was submitted on Sept. 7, '07:

"A Uniform Resource Name (URN) Namespace for the International Organization for Standardization (ISO), draft-goodwin-iso-urn-02.txt"