December 22, 2008

And the DOI is ...

Once structured metadata is added to a file then retrieving a given metadata element is usually a doddle. For example, for PDFs with embedded XMP one can use Phil Harvey's excellent Exiftool utility.

Exiftool is a Perl library and application, which I've blogged about here earlier, and is available as a '.zip' file for Windows (no Perl required) or a '.dmg' for MacOS. Note that Phil maintains this actively and has done so over the last five years. (And when I say actively I mean just that. I once made the mistake of printing out the change file.)

If Perl's not your thing, then there's a Ruby wrapper gem (MiniExiftool) to access the Exiftool command in trouper OO fashion. Here's an example Ruby one-liner to get the DOI from a PDF (broken here to meet column width restriction):

% ruby -rubygems -e 'require "mini_exiftool";
    puts MiniExiftool.new("test.pdf")["doi"]'
10.1038/nphoton.2008.200
Of course, that could also have been run against an image, audio or video file with an XMP packet.

(Makes one wonder vaguely about the feasibility of having a Swiss Army knife type of utility that could read any file to get the DOI using the embedded XMP, RDFa, RDF, HTML headers, COinS, etc. Possibly even, as a last resort, falling back to scanning the raw text, if any.)
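By way of illustration, here is a minimal Ruby sketch of just that last-resort fallback: scanning raw text for something DOI-shaped. The pattern is a commonly used heuristic rather than an official DOI grammar, and the names DOI_PATTERN and find_doi are invented for this example.

```ruby
# Hypothetical last-resort fallback: scan raw text for something DOI-shaped.
# The pattern is a heuristic (prefix "10.", a registrant code, a slash, a suffix),
# not an official DOI grammar.
DOI_PATTERN = /\b10\.\d{4,9}\/[^\s"<>]+/

def find_doi(text)
  match = text[DOI_PATTERN]
  # Trim trailing punctuation that usually belongs to the sentence, not the DOI.
  match && match.sub(/[.,;)]+\z/, "")
end

puts find_doi("See doi:10.1038/nphoton.2008.200, published in 2008.")
```

A real utility would presumably try the structured sources (XMP, RDFa, HTML headers and so on) first and only then resort to this kind of guesswork.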

December 19, 2008

Xmas XMP

Well, as I blogged on our web publishing blog Nascent we just went live with XMP labelling on Nature in yesterday's double issue. We will be adding XMP to all new issues of Nature as well as rolling out across all our other titles in the next few weeks and months.

The screenshots below from Acrobat (File > Properties, CMD-D / CTL-D) show what the user might see both with (bottom-left) and without (top-right) semantic markup.

pdf_props.png

As to the actual contents of the metadata record, see this sample I posted to the semantic web list.

December 06, 2008

ORE/POWDER: Remarks on Ratings

I wanted to make some remarks about the "Ease of use" and "Learn curve" ratings which I gave in the ORE/POWDER comparison table that I blogged about here the other day. It may seem that I came out a little harsh on ORE and a little easy on POWDER. I just wanted to explain my reasoning for calling it that way. (By the way, the revised comparison table includes a qualification to those ratings.)

My primary interest was from the perspective of a data provider rather than a data consumer. What does it take to get a resource description document ("resource map", "description resource" or "sitemap") ready for publication?

To look at POWDER first, it defines two sets of semantics: an "operational semantics" which is embodied in the simple XML that is intended as the primary publication vehicle, and a "formal semantics" embodied in the RDF/OWL document that would typically be generated by a POWDER processor.

The operational semantics (XML) document requires minimal RDF understanding (and arguably none at all): it only requires that URI resources be organized into <iriset> groups by pattern matching, and that metadata be attached to those groups using <descriptorset> groups.

URI patterns are specified using any of the following XML elements for inclusive patterns:

<includeschemes>, <includehosts>, <includeexactpaths>, <includepathcontains>, <includepathstartswith>, <includepathendswith>, <includeports>, <includequerycontains>, <includeiripattern>, <includeregex>, <includeresources>
and their exclusive counterparts
<excludeschemes>, ...
These are turned into corresponding regular expressions by a POWDER processor which then emits RDF/OWL classes using those expressions as property restrictions on set membership. But a publisher is not required to understand this transformation nor the formal semantics generated from the simple XML document that was authored.
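To give a feel for the kind of transformation involved, here is a rough Ruby sketch that builds a single regular expression from a few POWDER-style constraints. It is illustrative only: the actual transformation is defined by the POWDER specifications, and the helper name iriset_regex, its parameters and the subdomain handling are assumptions made for this example.

```ruby
# Illustrative only: build one regular expression from a few POWDER-style
# constraints (schemes, hosts, path prefix). Not the algorithm from the spec.
def iriset_regex(schemes:, hosts:, path_starts_with: "")
  scheme_part = Regexp.union(schemes)
  host_part   = Regexp.union(hosts)
  # Assumption of this sketch: subdomains of a listed host also match,
  # and an optional port may follow the host.
  %r{\A(?:#{scheme_part})://(?:[^/:]+\.)?(?:#{host_part})(?::\d+)?#{Regexp.escape(path_starts_with)}}
end

pattern = iriset_regex(schemes: %w[http https],
                       hosts: %w[example.org],
                       path_starts_with: "/articles/")

puts pattern.match?("http://www.example.org/articles/1")
puts pattern.match?("http://example.org/about")
```

The first URI falls inside the group and the second does not, which is all the publisher needs to reason about when authoring the XML.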

Now, as to metadata. Resource group descriptors are either free text (tags) or properties from a published namespace. For example, the property name from a namespace ex: would be added in one of two ways, depending on whether its value is a simple literal string ("value", say) or a resource URI ("http://example.org/value", say):

  • <ex:name>value</ex:name>
  • <ex:name rdf:resource="http://example.org/value"/>
While technically this is RDF/XML, it hardly qualifies, I think, as requiring any great knowledge of RDF. A knowledge of XML namespaces alone would be sufficient.

And that's about it – all that is required for publication of a POWDER "description resource" document. (The guidelines for discovery mechanisms of a POWDER document might also need to be consulted.)

So, on that basis I would judge POWDER to be at most "medium" on the "Learn curve". However, as soon as the mapping to the formal semantics (POWDER-S) using RDF/OWL is considered, then that learn curve rating would automatically swing to "high".

Now, ORE on the other hand is a straightforward RDF application. What does make ORE a bit of a challenge are the following two aspects:

  1. concept of named aggregation
  2. abstract data model - no fixed bindings
Well, the first aspect is what ORE is all about – its USP – and what it gives us beyond the simpler POWDER approach of merely describing resource bundles. Still, it's a concept that needs to be grokked. All too easy to take it for granted.

It is the second aspect that may make ORE appear to be "difficult". It does not prescribe a single binding or set of bindings but provides an abstract data model. That means that a prospective user must endeavour to understand something of the model before deploying.

But enough of that. Because who really reads instruction manuals anyway? So to deploy there are user guides available for one standalone document format (RDF/XML), and two carrier document formats (Atom, RDFa). That means right there that the publisher must either embrace RDF/XML or learn how to weave it into an existing document markup. (By the way, it should be remarked that there is an excellent primer available - as there is also for POWDER - and user guides for each of the formats.)

So that I think warrants the "high" rating for ORE on the learn curve, and the corresponding "low" ease of use. But that is not to say that the two initiatives are in any competition and that one should be favoured over the other. They serve different purposes. And yet they may also have compatibilities as the previous mapping of ORE in POWDER attempts to show. I'll leave that task for other commentators.


December 05, 2008

Resource Maps Encoded in POWDER

Following right on from yesterday's post on ORE and POWDER, I've attempted to map the worked examples in the ORE User Guide for RDF/XML (specifically Sect. 3) to POWDER to show that POWDER can be used to model ORE, see

Resource Maps Encoded in POWDER
(A full explanation for each example is given in the RDF/XML Guide, Sect. 3 which should be consulted.)

This could just all be sheer doolally or might possibly turn out to have a modicum of instructional value – I don't know. (It would be real good to get some feedback here.) There are, however, a few points to note in mapping ORE to POWDER:

  1. The POWDER form is actually more long-winded because it splits the RDF triples into subject and predicate/object divisions, with the first listed within an <iriset> and the second within a <descriptorset>. The net effect, however, may be somewhat cleaner since POWDER uses a simple XML format rather than RDF/XML.
  2. POWDER only supports simple object types (literals or resources) so the blank nodes in the RDF/XML examples for <dcterms:creator> cannot be mapped as such. I have chosen here to use either <foaf:name> or <foaf:page> value.
  3. Likewise, and as far as I am aware, POWDER does not support datatyping but I could be wrong on this. I have thus dropped the datatypes on <dcterms:created> and <dcterms:modified>.
This is a fairly naïve mapping. POWDER's real strength comes in defining groups of resources with its powerful pattern matching capabilities, whereas here I am using a named single resource in each <iriset> through the <includeresources> element. I think, though, this does show how the abstract ORE data model can be serialized in yet another format.


December 04, 2008

Describing Resource Sets: ORE vs POWDER

I've been reading up on POWDER recently (the W3C Protocol for Web Description Resources) which is currently in last call status (with comments due in tomorrow). This is an effort to describe groups of Web resources and as such has clear similarities to the Open Archives Initiative ORE data model, which has been blogged about here before.

In an attempt to better understand the similarities (and differences) between the two data models, I've put up the table

A Comparison of Description Mechanisms for URI Collections

ore-powder-fragment-30.jpg
which directly compares the two heavyweight contenders OAI-ORE and POWDER and also (unfairly) places them alongside the featherweight Sitemaps Protocol for reference.

This is very much a draft document and I will aim to update the table based on my own further reading and on any feedback that I may get (contributions gratefully received). I'm all too aware that my understanding of the respective data models is painfully limited and I, for one, hope to profit through this exercise. There will certainly be errors which I will aim to fix as soon as I get wind of them. :)

By the way, the ORE work especially is of interest to CrossRef members and has obvious synergies with the multiple resolution potential that DOI has long promised but not quite delivered on.

December 03, 2008

Ubiquity commands for CrossRef services

So the other day Noel O'Boyle made me feel guilty when he pinged me and asked about the possibility of using one of the CrossRef APIs for creating a Ubiquity extension. You see, I had played with the idea myself and had not gotten around to doing much about it. This seemed inexcusable, particularly given how easy it is to build such extensions using the API we developed for the WordPress and Movable Type plugins that we announced earlier in the year. So I dug up my half-finished code, cleaned it up a bit and have posted the results.

Note that the back-end that supports the plugins has been moved to more stable machines and the index is now being automatically updated with journal and conference proceeding deposits (sorry, no books yet).

Also note that we are hoping that others will look at the code for the WordPress, Movable Type and Ubiquity plugins and create more such extensions. If you do, please let us know about them at citation-plugin@crossref.org.

CURIEs - A Cure for URIs

A quick straw poll of a few folks at London Online yesterday revealed that they had not heard of CURIEs. And there was I thinking that most everybody must have heard of them by now. :) So anyway here's something brief by way of explanation.

CURIE stands for Compact URI and does the signal job of rendering long and difficult to read URI strings into something more manageable. (URIs do have the particular gift of being "human transcribable" but in practice their length and the actual characters used in the URI strings tend to muddy things for the reader.) So given that the Web is built upon a bedrock of URIs, anything that then makes URIs easier to handle is going to be an important contributor to our overall ease of interaction with the Web.

Ten years back (in February 1998) when XML was first introduced it presented a flat naming system for document markup. For purposes of modularity and markup reuse the XML Namespaces specification released the following year allowed for element and attribute names to be replaced by expanded names in which the hitherto simple names would be replaced by name pairs consisting of a namespace name and a local name. The use of URIs for the namespace name thus opened the doors to assigning globally unique names for XML element/attribute names. As a practical point (both to keep the names short and to deal with URI characters), the notion of a qualified name (or QName) was introduced, whereby the local name would be qualified by a prefix which stood in for the namespace name.

This was such a successful device that over time it was applied to URIs in general as a mechanism for abbreviation. Especially in RDF/XML, schema elements were referenced by QName. And the practice has spilled over into non-XML syntaxes (e.g. the N3 and Turtle RDF grammars which use a "@prefix" directive). But there were problems: since the device was grounded in XML, the local names were constrained by the allowable characters for XML elements and attributes (e.g. names cannot start with a digit character), and there was no specification for applying this same device to non-XML grammars.

CURIE is an initiative to generalize this notion of qualified names for URIs beyond the immediate XML context for naming elements and attributes (which would also allow their use in attribute values), to a more general use in applications beyond XML. The development of CURIE is based upon work done in the definition of XHTML2, and upon work done by the RDF-in-HTML Task Force, a joint task force of the Semantic Web Best Practices and Deployment Working Group and XHTML 2 Working Group. The Editor's draft CURIE Syntax 1.0 is currently a W3C Candidate Recommendation which is receiving comments through Jan 15, 2009, at which time it is intended to put it forward as a W3C Proposed Recommendation. Meantime, though, the new W3C Recommendation RDFa Syntax in XHTML (published Oct 14, 2008) has a normative section on CURIEs (see Sect. 7).

So, what do CURIEs look like? Taking a simple RDFa example for DOI we might have a fragment such as:

<div xmlns:doi="http://dx.doi.org/" xmlns:dcterms="http://purl.org/dc/terms/">
  <div about="doi:10.1038/nature07184">
    <span property="dcterms:hasPart" resource="[doi:10.1038/nature07184]"/>
  </div>
</div>

This would be processed by an RDFa processor to yield the RDF triple (in N3/Turtle):
<doi:10.1038/nature07184> dcterms:hasPart <http://dx.doi.org/10.1038/nature07184> .

This triple (or fact) says that the resource identified by <doi:10.1038/nature07184> has as a component part (cf. DCTERMS vocabulary) the resource identified by <http://dx.doi.org/10.1038/nature07184>. (The abstract work identified by the DOI has as a component part the splash page identified by the proxy URL.)

OK, so what's going on? The "property" attribute takes a CURIE as value where the prefix "dcterms" is standing in for the XML namespace URI. The "about" and "resource" attributes both take a URI or CURIE as value, but to avoid any potential confusion between the two a (so-called) "Safe CURIE" must be used, which is a CURIE wrapped in brackets. The above example does not use brackets for the "about" attribute and therefore an RDFa processor would read this as being a full URI, i.e. <doi:10.1038/nature07184>, whereas it does use brackets for the "resource" attribute and therefore this would be read as being a (Safe) CURIE, i.e. <http://dx.doi.org/10.1038/nature07184>.
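The expansion rule just described can be sketched in a few lines of Ruby. This is only a toy, not an RDFa processor: the PREFIXES table mirrors the xmlns declarations in the example, and the helper names expand_curie and resolve_uri_or_safe_curie are made up for illustration.

```ruby
# Prefix mappings, mirroring the xmlns declarations in the RDFa fragment.
PREFIXES = {
  "doi"     => "http://dx.doi.org/",
  "dcterms" => "http://purl.org/dc/terms/"
}.freeze

# Expand "prefix:reference" by concatenating the mapped URI and the reference.
def expand_curie(curie, prefixes = PREFIXES)
  prefix, reference = curie.split(":", 2)
  base = prefixes.fetch(prefix) { raise ArgumentError, "unknown prefix: #{prefix}" }
  base + reference
end

# For attributes such as "about" and "resource": only a bracketed value
# (a Safe CURIE) is expanded; anything else is taken as a URI as written.
def resolve_uri_or_safe_curie(value, prefixes = PREFIXES)
  if value.start_with?("[") && value.end_with?("]")
    expand_curie(value[1..-2], prefixes)
  else
    value
  end
end

puts resolve_uri_or_safe_curie("doi:10.1038/nature07184")    # taken as a URI
puts resolve_uri_or_safe_curie("[doi:10.1038/nature07184]")  # Safe CURIE, expanded
puts expand_curie("dcterms:hasPart")                         # "property" takes a plain CURIE
```

The brackets are the only thing distinguishing the two readings of the same string, which is why they matter in attributes that accept both URIs and CURIEs.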

We can turn this around as follows:

<div xmlns:doi="http://dx.doi.org/" xmlns:dcterms="http://purl.org/dc/terms/">
  <div about="[doi:10.1038/nature07184]">
    <span property="dcterms:isPartOf" resource="doi:10.1038/nature07184"/>
  </div>
</div>

This would be processed by an RDFa processor to yield the RDF triple (in N3/Turtle):
<http://dx.doi.org/10.1038/nature07184> dcterms:isPartOf <doi:10.1038/nature07184> .

This triple (or fact) says that the resource identified by <http://dx.doi.org/10.1038/nature07184> is a component part (cf. DCTERMS vocabulary) of the resource identified by <doi:10.1038/nature07184>. (The splash page identified by the proxy URL is a component part of the abstract work identified by the DOI.)

So what do CURIEs give us? Nothing more than a generic means to be able to make human-friendly statements such as

<doi:10.1038/nature07184> dcterms:hasPart doi:10.1038/nature07184 .

instead of having to spell it out in full triples form using long-winded URIs:
<doi:10.1038/nature07184>
  <http://purl.org/dc/terms/hasPart>
    <http://dx.doi.org/10.1038/nature07184> .