
July 31, 2007

Handle Acrobat Reader Plugin

Just announced on the handle-info list is a new plugin from CNRI for Acrobat Reader - see here. The announcement says:

"It is intended to demonstrate the utility of embedding a identifying
handle in a PDF document.
 
...
 
A set of demonstration documents, each with an embedded identifying
handle, is packaged with the plug-in to show potential uses. To make
productive use of this technology, a given industry or community of
users would have to agree on one or more specific applications and
populate the relevant handle records accordingly."

Two immediate comments:

  • This is a Windows-only plugin (realized that right after hitting the download button and seeing the '.exe' file) and also needs admin rights to install. (So I solved the first hurdle and am trying to clear the second hurdle. Lockdown is not an uncommon practice for enterprise or institutional computers.)
    (Update: Actually, I think I got this wrong. I need admin privileges to install Adobe Acrobat 8. Still scuppered, though. Can't even see the sample PDF files.)
  • The plugin seems to be aimed at the user rather than at the user agent and thus is necessarily limited in scope, i.e. it needs a human driver. (Ideally content providers would embed metadata within media files using structured markup techniques which would be readily accessible to any downstream app which could leverage this data transparently to provide enhanced user services.)

Anyway, I'll add something more when I can get it installed. I think this tool could be a useful addition to publishing toolkits but also that content providers could do much more for consumers by disclosing metadata for their digital assets in a neutral, structured form.

July 28, 2007

URI Template Republished

Well, it all went very quiet for a while but glad to see that the URI Template Internet-Draft has just been republished:

"A New Internet-Draft is available from the on-line Internet-Drafts
directories.

Title : URI Template
Author(s) : J. Gregorio, et al.
Filename : draft-gregorio-uritemplate-01.txt
Pages : 9
Date : 2007-7-23

URI Templates are strings that can be transformed into URIs after
embedded variables are substituted. This document defines the
syntax and processing of URI Templates.

A URL for this Internet-Draft is:

http://www.ietf.org/internet-drafts/draft-gregorio-uritemplate-01.txt"

URI templates should be a very useful publishing tool. Templates are already used by technologies such as OpenSearch - see here.
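
To make the idea concrete, here is a minimal sketch of the substitution step in Python. It handles only plain `{name}` placeholders and is not an implementation of the draft's full syntax; the `expand` helper, the example.org template, and the choice to expand undefined variables to the empty string are all assumptions made for illustration.

```python
import re
from urllib.parse import quote

# Matches a plain {name} placeholder; nothing beyond that form is handled here.
_VAR = re.compile(r"\{([A-Za-z0-9_.]+)\}")

def expand(template, variables):
    """Substitute {name} placeholders with percent-encoded values.

    Undefined variables expand to the empty string (an assumption of this sketch).
    """
    def replace(match):
        value = variables.get(match.group(1), "")
        return quote(str(value), safe="")
    return _VAR.sub(replace, template)

# Hypothetical template, for illustration only.
template = "http://example.org/search?q={searchTerms}&page={startPage}"
print(expand(template, {"searchTerms": "uri template", "startPage": 2}))
```

Percent-encoding each value before substitution keeps reserved characters in the data from changing the structure of the resulting URI.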

July 27, 2007

XMP: First Hacks

(Update - 2007.07.28: I meant to reference in this entry Pierre Lindenbaum's post back in May Is there any XMP in scientific pdf ? (No), which btw also references Roderic Page's post on XMP but forgot to add in the links in my haste to scoot off. Well, truth is we still can't answer Pierre in the affirmative but at least we can take the first steps towards rectifying this.)

I've been revisiting Adobe's XMP just recently. (I blogged here about the new XMP Toolkit 4.1 back in March.)

I wanted to share some of my early experiences. First off, after a couple of previous attempts which got pushed aside due to other projects, I managed to compile the libraries and the sample apps that ship with the C++ SDK under Xcode on the Mac. I also needed to compile Expat first which doesn't ship with the distribution.

OK, so far, so good. What this basically leaves one with is a couple of XMP dump utilities (DumpMainXMP and DumpScannedXMP) and two others (XMPCoreCoverage and XMPFilesCoverage) which is a good start anyways for exploring. And turns out that our PDFs already have some workflow metadata in them. This is encouraging because the SDK allows apps to read and update existing XMP packets from files, though not to write new packets into files (as far as I understand).

I thought I would take this opportunity anyway to:

  1. See what XMP metadata terms we might consider adding
  2. Try and add these to existing XMP packets

Ugly details are presented below, but by updating the XMP packet metadata in one of our PDFs (Nature 445, 37 (2007), C.J. Hogan) we can teach Acrobat Reader to read - see the "before" (PDF here) and "after" (PDF here) screenshots in the figure.

[Figure: acrobats.png, "before" and "after" screenshots from Acrobat Reader]

Of course, this is really about much more than getting Adobe apps to read/write metadata. It's about using XMP as a standard platform for embedding metadata in digital assets for third-party apps to read/write. If we can put ID3 tags into our podcasts then why not XMP packets into other media?

First a brief digression on XMP packets, which look essentially like this:

<?xpacket begin="..." id="..."?>
<x:xmpmeta xmlns:x="adobe:ns:meta/"> 
  <rdf:RDF xmlns:rdf="..." xmlns:...>
    ... 
  </rdf:RDF>
</x:xmpmeta>
  ... XML whitespace as padding ... 
<?xpacket end="w"?> 

XMP effectively embeds RDF/XML into arbitrary application files - binary and text. The RDF is wrapped within an "<rdf:RDF>" element which is optionally wrapped by an "<x:xmpmeta>" element. This XML fragment with trailing XML whitespace is topped and tailed by "<?xpacket>" processing instructions with "begin" and "end" attributes, respectively.
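
Because the packet is delimited by those two processing instructions, an app that knows nothing about the host file format can still hunt for it by scanning the raw bytes. The following Python sketch shows the idea. It assumes the packet is stored uncompressed and in a single-byte-compatible encoding such as UTF-8; the `find_xmp_packets` helper and the sample bytes are made up for illustration and are not part of the SDK.

```python
import re

# The xpacket header and trailer processing instructions that delimit a packet.
_PACKET = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=[\"']([rw])[\"']\?>",
    re.DOTALL,
)

def find_xmp_packets(data):
    """Return (offset, length, writable, body) for each packet found in data."""
    packets = []
    for m in _PACKET.finditer(data):
        length = m.end() - m.start()
        packets.append((m.start(), length, m.group(2) == b"w", m.group(1)))
    return packets

# Stand-in bytes for illustration; a real file would be read with open(path, "rb").
sample = (
    b"%PDF-1.4 ...binary...\n"
    b'<?xpacket begin="..." id="..."?>\n'
    b'<x:xmpmeta xmlns:x="adobe:ns:meta/"></x:xmpmeta>\n'
    b'   \n<?xpacket end="w"?>\n'
    b"...more binary..."
)
for offset, length, writable, body in find_xmp_packets(sample):
    print(offset, length, writable)
```

The trailing whitespace padding is what allows a packet marked end="w" to be rewritten in place, provided the new content fits within the original length.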

The RDF supported is a simple profile of RDF with only certain constructs recognized: scalars, arrays, structures. It is not a means to embed arbitrary RDF/XML structures. But I'll pass on that for now. At first blush it's at least suitable to get a simple dictionary of key/value terms written in, and more besides.

The XMP metadata from the PDF file listed above looks as follows in RDF/N3 (which is a more chipper serialization of RDF than is RDF/XML):

   <uuid:...>
   dc:creator "x" ;
   dc:format "application/pdf" ;
   dc:title "19.7 N&V.indd NEW.indd"@x-default ;
     
   pdf:GTS_PDFXConformance "PDF/X-1a:2001" ;
   pdf:GTS_PDFXVersion "PDF/X-1:2001" ;
   pdf:Producer "Acrobat Distiller 6.0.1 for Macintosh" ;
   pdf:Trapped "False" ;
      
   pdfx:GTS_PDFXConformance "PDF/X-1a:2001" ;
   pdfx:GTS_PDFXVersion "PDF/X-1:2001" ;
     
   xap:CreateDate "2007-07-16T09:25:20+01:00" ;
   xap:CreatorTool "InDesign: pictwpstops filter 1.0" ;
   xap:MetadataDate "2007-07-16T11:40:21+01:00" ;
   xap:ModifyDate "2007-07-16T11:40:21+01:00" ;
     
   xapMM:DocumentID "uuid:be3a9be5-4e3a-4b66-a50b-26f0a0bfc89d" ;
   xapMM:InstanceID "uuid:73dcd021-d40a-4cb7-a99b-44f8e90624f4" .

(Note: I've omitted namespaces here and dropped some of the structuring info that was present on the "dc:creator" and "dc:title" elements thus leaving all values as simple strings. Back to that in a bit.)

What this says is simply that all these properties expressed in key/value pairs apply to the current document denoted by the resource identifier "<uuid:...>", and terms are taken from the schemas indicated by the prefixes. So, for example, the term "creator" from the schema referenced by the placeholder "dc" (there is a namespace URI for this but I haven't shown it here) has the value "x" for this document, and so on.

So, salting away the media- and XMP-specific metadata, we are left with the following work metadata in our main XMP packet.

   <uuid:...>
   dc:creator "x" ;
   dc:format "application/pdf" ;
   dc:title "19.7 N&V.indd NEW.indd"@x-default .

Not wildly impressive, I must admit. Ideally we would like to pump this up with a fuller descriptive and rights metadata set such as we routinely syndicate with our web feeds. This would make use of both DC and PRISM vocabularies. In RDF/N3 we might expect to see something like:

   <uuid:...>
   dc:creator "Craig J. Hogan" ;
   dc:title "Cosmology: Ripples of early starlight" ;
   dc:identifier "doi:10.1038/445037a" ;
   dc:description "doi:10.1038/445037a" ;
   dc:source "Nature 445, 37 (2007)" ;
   dc:date "2007-01-04" ;
   dc:format "application/pdf" ;
   dc:publisher "Nature Publishing Group" ;
   dc:language "en" ;
   dc:rights "© 2007 Nature Publishing Group"  ;
    
   prism:publicationName "Nature" ;
   prism:issn "0028-0836" ;
   prism:eIssn "1476-4679" ;
   prism:publicationDate "2007-01-04" ;
   prism:copyright "© 2007 Nature Publishing Group" ;
   prism:rightsAgent "permissions@nature.com" ;
   prism:volume "445" ; 
   prism:number "7123" ;
   prism:startingPage "37" ;
   prism:endingPage "37" ;
   prism:section "News and Views" .
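
For anyone wanting to generate this kind of description rather than hand-edit it, here is a rough Python sketch that serializes a few of the simple-valued terms above as an "<rdf:Description>" element. The `build_rdf` helper is invented for illustration. The DC and PRISM namespace URIs are the ones the toolkit reports in the dump further down, and the sketch deliberately sticks to plain string values, so it does not show the array or language-alternative forms.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
PRISM = "http://prismstandard.org/namespaces/1.2/basic/"

# Register prefixes so the output uses rdf:, dc: and prism: rather than ns0: etc.
for prefix, uri in (("rdf", RDF), ("dc", DC), ("prism", PRISM)):
    ET.register_namespace(prefix, uri)

def build_rdf(about, properties):
    """Build an rdf:RDF element holding one rdf:Description of simple strings."""
    root = ET.Element(f"{{{RDF}}}RDF")
    desc = ET.SubElement(root, f"{{{RDF}}}Description", {f"{{{RDF}}}about": about})
    for (namespace, name), value in properties:
        ET.SubElement(desc, f"{{{namespace}}}{name}").text = value
    return root

properties = [
    ((DC, "identifier"), "doi:10.1038/445037a"),
    ((DC, "source"), "Nature 445, 37 (2007)"),
    ((PRISM, "publicationName"), "Nature"),
    ((PRISM, "volume"), "445"),
    ((PRISM, "startingPage"), "37"),
]
# An empty rdf:about refers to the containing document itself.
rdf = build_rdf("", properties)
print(ET.tostring(rdf, encoding="unicode"))
```

The resulting fragment would still need to be placed inside the xpacket wrapper and fit within the existing packet length before it could replace the original description.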

So, taking this RDF and doing a quick and dirty substitution of it for the existing DC description in the PDF XMP packet (i.e. more or less "lobotomizing" the PDF) we then get an updated XMP packet which can be dumped with the DumpMainXMP utility as (with some schemas removed):

 // ----------------------------------
// Dumping main XMP for 445037a.pdf :
 
File info : format = "    ", handler flags = 00000260
Packet info : offset = 267225, length = 3651
 
Initial XMP from 445037a.pdf
Dumping XMPMeta object ""  (0x0)
 
  ...
 
  http://purl.org/dc/elements/1.1/  dc:  (0x80000000 : schema)
      dc:rights  (0x1E00 : isLangAlt isAlt isOrdered isArray)
         [1] = " 2007 Nature Publishing Group"  (0x50 : hasLang hasQual)
               ? xml:lang = "x-default"  (0x20 : isQual)
      dc:language  (0x200 : isArray)
         [1] = "en"
      dc:publisher  (0x200 : isArray)
         [1] = "Nature Publishing Group"
      dc:format = "application/pdf"
      dc:date  (0x600 : isOrdered isArray)
         [1] = "2007-01-04"
      dc:source = "Nature 445, 37 (2007)"
      dc:description  (0x1E00 : isLangAlt isAlt isOrdered isArray)
         [1] = "doi:10.1038/445037a"  (0x50 : hasLang hasQual)
               ? xml:lang = "x-default"  (0x20 : isQual)
      dc:identifier = "doi:10.1038/445037a"
      dc:title  (0x1E00 : isLangAlt isAlt isOrdered isArray)
         [1] = "Cosmology: Ripples of early starlight"  (0x50 : hasLang hasQual)
               ? xml:lang = "x-default"  (0x20 : isQual)
      dc:creator  (0x600 : isOrdered isArray)
         [1] = "Craig J. Hogan"
 
   http://prismstandard.org/namespaces/1.2/basic/  prism:  (0x80000000 : schema)
      prism:section = "News and Views"
      prism:endingPage = "37"
      prism:startingPage = "37"
      prism:number = "7123"
      prism:volume = "445"
      prism:rightsAgent = "permissions@nature.com"
      prism:copyright = " 2007 Nature Publishing Group"
      prism:publicationDate = "2007-01-04"
      prism:eIssn = "1476-4679"
      prism:issn = "0028-0836"
      prism:publicationName = "Nature"

Full dumps of the "before" and "after" PDFs are available here:

Note also that in the dump above some of the DC terms are interpreted by the XMP toolkit to have structured formats, i.e. are recognized as array members, and have language and ordering attributes. This seems to be an artefact of the toolkit as the RDF did not specify these structurings. Note also that the PRISM values were not similarly interpreted as the PRISM schema is not registered with the toolkit.
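
The hex masks in the dump can be unpicked with a few lines of Python. The bit values below are inferred from the masks and flag names printed in the dump itself (for example 0x1E00 alongside four names) and should be treated as an assumption; the toolkit's own headers are the authority. The `decode` helper is hypothetical.

```python
# Flag bits inferred from the dump above (e.g. 0x1E00, 0x600, 0x50).
FLAGS = [
    (0x80000000, "schema"),
    (0x1000, "isLangAlt"),
    (0x800, "isAlt"),
    (0x400, "isOrdered"),
    (0x200, "isArray"),
    (0x40, "hasLang"),
    (0x20, "isQual"),
    (0x10, "hasQual"),
]

def decode(options):
    """Return the names of the flag bits set in an option mask, high bit first."""
    return [name for bit, name in FLAGS if options & bit]

for mask in (0x1E00, 0x600, 0x200, 0x50, 0x20):
    print(hex(mask), " ".join(decode(mask)))
```

Read this way, dc:title, dc:description and dc:rights are language alternatives, dc:creator and dc:date are ordered arrays, and dc:language and dc:publisher are unordered arrays, while the PRISM terms carry no flags at all.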

Obviously, there's much more to be learned yet. I'll post an update to this later, but meantime it would be very interesting to get feedback from others on experiences they may have with XMP or any opinions they may want to share. I think it all looks very promising although tools are somewhat restricted.

July 19, 2007

Publishing Linked Data

With these words:

"There was quite some interest in Linked Data at this year's World Wide
Web Conference (WWW2007). Therefore, Richard Cyganiak, Tom Heath and I
decided to write a tutorial about how to publish Linked Data on the
Web, so that interested people can find all relevant information, best
practices and references in a single place."

Chris Bizer announces this draft How to Publish Linked Data on the Web. It's a bright and breezy tutorial and useful (to me, anyway) for disclosing a couple of links:

The tutorial is unsurprisingly orthodox in its advocacy for all things HTTP and goes on to say:

"In the context of Linked Data, we restrict ourselves to using HTTP URIs only and avoid other URI schemes such as URNs and DOIs."

But this only relates back to Berners-Lee's piece on Linked Data referenced above in which he says:

"The second rule, to use HTTP URIs, is also widely understood. The only deviation has been, since the web started, a constant tendency for people to invent new URI schemes (and sub-schemes within the urn: scheme) such as LSIDs and handles and XRIs and DOIs and so on, for various reasons. Typically, these involve not wanting to commit to the established Domain Name System (DNS) for delegation of authority but to construct something under separate control. Sometimes it has to do with not understanding that HTTP URIs are names (not addresses) and that HTTP name lookup is a complex, powerful and evolving set of standards. This issue discussed at length elsewhere, and time does not allow us to delve into it here."

Hmm. Does make one wonder where the concept of URI ever arose. Surely the nascent WWW application should have mandated the exclusive use of HTTP identifiers? Seems that this concept snuck up on us somehow and we now have to put it back into the box. Pandora, indeed!

Back to the tutorial, there are some unorthodox terms, or at least terms that were new to me. Contrasted with the defined term information resources (from AWWW) is the undefined term "non-information resources". Further on, there's a distinction made between two types of RDF triple: "literal triples" and "RDF links". I hadn't heard of either of these terms before although they are presented as if they were in common usage. The tutorial then goes on to deprecate the use of certain RDF features because it makes it "easier for clients". So, I guess that the full expressivity of RDF is either not required or the world of "linked data" is not quite so large as it would like to be.

And later on, there's this puzzling injunction:

"You should only define terms that are not already defined within well-known vocabularies. In particular this means not defining completely new vocabularies from scratch, but instead extending existing vocabularies to represent your data as required."

Am I wrong, or is there something of a Catch 22 there? To extend an arbitrary vocabulary I would need to be the namespace authority - to be the "URI owner" in W3C speak. But I can't be the authority for all namespaces/vocabularies because by the intent of the above they would likely be just the one (true?) vocabulary which I may or may not be the authority for. I thought the intent of the RDF model and XML namespaces was that terms could be applied from disparate vocabularies to the description at hand.

Anyways, I am not trying to knock the draft. It's something of a curate's egg, that's true, but I am genuinely looking forward to reading it through and would encourage others to have a look at it too.

July 12, 2007

PURL Redux

Seems that there's life in the old dog yet. :~) See this post about PURL from Thom Hickey, OCLC. This extract:

OCLC has contracted with Zepheira to reimplement the PURL code which has become a bit out of date over the years. The new code will be in written in Java and released under the Apache 2.0 license.

July 10, 2007

BioNLP 2007

Just posted on Nascent a brief account of a presentation I gave recently on OTMI at BioNLP 2007. The post lists some of the feedback I received. We are very interested to get further comments so do feel free to contribute comments either directly to the post, privately to otmi@nature.com, or publicly to otmi-discuss@crossref.org. And then there's always the OTMI wiki available for comment at http://opentextmining.org/.

It is important to note that OTMI is not a universal panacea but rather an attempt at bridging the gap between publisher and researcher. We are attempting to provide a framework to enable scholarly publishers to disclose full text for machine processing purposes without compromising their normal publishing obligations.

(Note: Peter Corbett of the Unilever Centre for Molecular Informatics, Department of Chemistry, University of Cambridge has posted an account of the BioNLP 2007 workshop here.)

IBM Article on PRISM

Nice entry article on PRISM here by Uche Ogbuji, Fourthought Inc. on IBM's DeveloperWorks.

July 02, 2007

Oh, shiny!

The other day Ed and I visited the OECD to talk about all things e-publishing. At the end of our meeting, Toby Green, the OECD's head of publishing, handed all 30+ meeting attendees a copy of their well-known OECD Factbook, on a USB stick.

[Picture of the OECD Factbook USB stick]

Before you dismiss this as a gimmick, note that organizations like the OECD get a lot of political and marketing mileage from "leave behinds": print copies of their key reports, conference proceedings and reference works. While researchers might prefer electronic versions of the publications for their day-to-day work, print versions of the same publications seem to continue to play a critical role as an "awareness tool." I know that, for this very reason, several NGO/IGOs that I've spoken to have despaired of ever ramping down their print operations.

I think that the OECD might have figured out a solution to this dilemma. It's difficult to describe how viscerally satisfying it was to receive one of these Factbook USB-sticks. From the way in which the other meeting attendees swarmed around Toby as he was handing them out, I think that they might have had the same reaction.

As we headed back to London on the Eurostar, I almost immediately popped the USB stick into my laptop and started browsing through the Factbook, much as I would have thumbed through a print version of the same (although, truth be told, I would have been tempted to conveniently "forget" the print version in order to not have to shlep it from Paris back to Oxford).

In short, I think the system works. Kudos to the OECD for a simple, inexpensive and creative experiment in e-publishing.