August 3, 2010

XMP in RSC PDFs

Just a quick heads-up to say that we've had a go at incorporating InChIs and ontology terms into our PDFs with XMP. There isn't a lot of room in an XMP packet so we've had to be a bit particular about what we include.

  • InChIs: the bigger the molecule the longer the InChI, so we've standardized on the fixed-length InChIKey. This doesn't mean anything on its own, so we've gone the Semantic Web route of including an InChI resolver HTTP URI. Alternatively you can extract the InChIKeys with a regular expression.
  • Ontology terms: we're using HTTP URIs again and pointing to either Open Biomedical Ontology URIs (biology, biomedicine; slashy) or RSC ontology terms (chemistry; hashy). Often the OBO URIs resolve to a specific web page, but for the moment the RSC URIs just point to a large OWL file. Slashy URIs are quite a bit more involved so we'll have to see what the demand is like.
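
As a rough illustration of the regular-expression route mentioned in the first bullet, here is a minimal Python sketch. It assumes the standard InChIKey layout (14 uppercase letters, a hyphen, 10 uppercase letters, a hyphen, one uppercase letter). The resolver base URI and the sample packet fragment are hypothetical placeholders, not the URIs actually used in the RSC files.

```python
import re

# Standard InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
# The lookarounds stop the pattern matching inside a longer run of capitals.
INCHIKEY_RE = re.compile(r"(?<![A-Z])[A-Z]{14}-[A-Z]{10}-[A-Z](?![A-Z])")


def extract_inchikeys(text):
    """Return the distinct InChIKeys found in text, in order of first appearance."""
    seen = []
    for key in INCHIKEY_RE.findall(text):
        if key not in seen:
            seen.append(key)
    return seen


# Hypothetical packet fragment; resolver.example.org is a placeholder host.
sample = (
    '<rdf:li rdf:resource="http://resolver.example.org/inchikey/'
    'LFQSCWFLJHTTHZ-UHFFFAOYSA-N"/>\n'
    '<rdf:li rdf:resource="http://resolver.example.org/inchikey/'
    'XLYOFNOQVPJJNP-UHFFFAOYSA-N"/>\n'
    '<rdf:li rdf:resource="http://resolver.example.org/inchikey/'
    'LFQSCWFLJHTTHZ-UHFFFAOYSA-N"/>\n'
)

if __name__ == "__main__":
    for key in extract_inchikeys(sample):
        print(key)
```

Matching on the key pattern rather than on the resolver URI means the extraction keeps working if the resolver host changes.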

There's only about 4K to play with, so it's only ever going to be a best-of. More detailed article metadata has to go in either a sidecar file, as Tony has pointed out before, or ideally on the article landing page. The example files are here and I've posted something with a different slant on the RSC technical blog.

December 9, 2009

Add CrossRef metadata to PDFs using XMP

In order to encourage publishers and other content producers to embed metadata into their PDFs, we have released an experimental tool called "pdfmark". This open source tool allows you to add XMP metadata to a PDF. What's really cool is that if you give the tool a CrossRef DOI, it will look up the metadata in CrossRef and then apply that metadata to the PDF. More detail can be found on the pdfmark page on the CrossRef Labs site. The usual weasel words and excuses about "experiments" apply.

June 10, 2009

XMP Primer

xmp-primer.jpg

There's a new XMP Primer (PDF) by Ron Roszkiewicz (ed. Dianne Kennedy) available from XMP-Open. It is copyrighted 2008 but I only just saw it now. This 43-page document provides a very gentle introduction to metadata and the labelling of media, then introduces XMP into the content lifecycle and makes the business case for using XMP. The primer covers the following areas:

  • Introduction to Metadata
  • Introduction to XMP
  • XMP and the Content Lifecycle
  • XMP in Action; Use Cases
  • Additional XMP Resources

One small gripe is that this seems to have been prepared for US letter-sized pages. It is printable on A4, but there is the slightest of clipping on the right-hand margin; no real information is lost, but it does confer a sense of "incompleteness". Really there can be little excuse these days for this parochialism. Also, for a document talking up the benefits of using XMP, it's decidedly odd that it doesn't make use of XMP itself; or rather, there is a default XMP packet in the PDF but with no really useful properties such as title, author, or date. It could have been a nice little object lesson in using XMP.

January 16, 2009

XMP Library for Flash

Update about new XMP Library from Adobe Labs:

"The new Adobe XMP Library for ActionScript is now available for download on Adobe Labs. Adobe Extensible Metadata Platform (XMP) is a labeling technology that allows you to embed data about a file, known as metadata, into the file itself. XMP is an open technology based on RDF and RDF/XML. With this new library you can read existing XMP metadata from Flash based file formats via the Adobe Flash Player."
Any volunteers?

October 20, 2008

XMP Marches On

For those who may be interested in the progress of XMP, Adobe's Gunar Penikis has just announced two new releases of XMP SDKs: XMP Toolkit 4.4 (with support for new file formats), and FileInfo SDK (for customizing CS4 UIs). More important, though, may be the new edition of the XMP spec (see here), which has grown from a modest 112-page document to a 3-parter at 199 pages.

Looks to be quite a thorough spec bar one telling particular: there is no version number and no date! The previous version was likewise unnumbered, though at least dated as "September 2005". Btw, I'm not sure if there is any archive of XMP specs being maintained by Adobe. At least, I'm not aware of any page with that information. Perhaps we can refer to our earlier call to have XMP turned over to a standards organization to formalize a public spec.

October 17, 2007

Hybrid

So, back on the old XMP tack. The simple vision from the XMP spec is that XMP packets are embedded in media files and transported along with them - and as such are relatively self-contained units, see Fig 1.

Hybrid - A.jpg
Fig. 1 - Media files with fully encapsulated descriptions.

But this is too simple. Some preliminary considerations lead us to see why we might want to reference additional (i.e. external) sources of metadata from the original packet:

PDFs
PDFs are tightly structured and as such it can be difficult to write a new packet, or to update an existing packet. One solution proposed earlier is to embed a minimal packet which could then reference a more complete description in a standalone packet. (And in turn this standalone packet could reference additional sources of metadata.)

Images
While it is considerably simpler to write into web-delivery image formats (e.g. JPEG, GIF, PNG), it is likely that only metadata pertinent to the image itself will be embedded. Also of interest is the work from which the image is derived, which is most likely to be presented externally to the image as a standalone document. (And in turn this standalone packet could reference additional sources of metadata.)

(Continues)

Continue reading "Hybrid" »

October 13, 2007

I Want My XMP

Now, assuming XMP is a good idea - and I think on balance it is (as blogged here earlier), why are we not seeing any metadata published in scholarly media files? The only drawbacks that occur to me are:

  1. Hard to write - it's too damn difficult, no tools support, etc.
  2. Hard to model - the rigid, "simple" XMP data model both complicates and constrains the RDF data model

Well, I don't really believe that 1) is too difficult to overcome. A little focus and ingenuity should do the trick. I do, however, think 2) is just a crazy straitjacket that Adobe is forcing us all to wear but if we have to live with that then so be it. Better in Bedlam than without. (RSS 1.0 wasn't so much better but allowed us to do some useful things. And that came from the RDF community itself.) We could argue this till the cows come home but I don't see any chance of any change any time soon.

(Continues)

Continue reading "I Want My XMP" »

Metadata - For the Record

Interesting post here from Gunar Penikis of Adobe entitled "Permanent Metadata" (Oct. '04). He talks about the issues of embedding metadata in media and comes up with this:

"It may be the case that metadata in the file evolves to become a "cache of convenience" with the authoritative information living on a web service. The web service model is designed to provide the authentication and permissions needed. The link between the two provided by unique IDs. In fact, unique IDs are already created by Adobe applications and stored in the XMP - that is what the XMP Media Management properties are all about."

An intriguing idea. Of course, Gunar's (and Adobe's) preoccupations with metadata revolve mainly around document workflow whereas, at least as things stand currently, scholarly publisher concerns are mainly with the dissemination of media in final form. Hence some differences in thinking:

Subject
As just noted, Adobe are more interested in workflow than in work. Scholarly articles are rich in descriptive metadata about the work itself and have a well-developed citation model. Academic interest is in the intellectual content rather than the vehicle used to carry and preserve that content - the file format.

Unique IDs
Workflow IDs are UUIDs which identify specific instances and expressions, but do not identify the abstract work. UUIDs provide a unique identifier but there is no central registry for such identifiers, hence they cannot be "looked up". CrossRef publishers should be concerned to associate closely the DOI for the underlying work with a given media file. That's the identifier that this community is actively promoting.

Read/Write
Because of the focus on workflow, the XMP specification recommends that XMP packets be "writeable", that is that they be marked as "writeable" and that they include padding whitespace which can accommodate updates without changing packet size. Publishers distributing final form documents are more likely to want to distribute "read-only" metadata which is authoritative and which describes the work, rather than the document format and workflow. Of course, this should not preclude additional sources of metadata which may be added "by reference" rather than "by value". That is, a pointer to a web page (or service) may be sufficient to relate additional publisher terms and user annotations instead of embedding them directly in the file for various reasons: a) file integrity, b) limiting growth of file size, c) term authority, d) dynamic production (in forward time), and e) multiple sources.
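
To make the writeable versus read-only distinction concrete, here is a minimal Python sketch of a packet wrapper. It relies on the packet trailer described in the XMP specification, where end='w' marks a packet that may be updated in place and end='r' marks it read-only. The function name, the default padding size, and the 100-character padding lines are illustrative assumptions only.

```python
PACKET_ID = "W5M0MpCehiHzreSzNTczkc9d"  # fixed id string from the XMP specification


def wrap_packet(rdf_xml, writable=False, padding_bytes=2048):
    """Wrap serialized RDF/XML in an XMP packet header and trailer.

    writable=True appends whitespace padding and marks the trailer end='w',
    so a later update can grow the metadata without changing the packet size.
    writable=False emits no padding and marks the trailer end='r'.
    """
    header = "<?xpacket begin='' id='%s'?>" % PACKET_ID
    if writable:
        # Padding as lines of 99 spaces plus a newline (assumed layout).
        line = " " * 99 + "\n"
        full, rest = divmod(padding_bytes, len(line))
        padding = line * full + " " * rest
        end_flag = "w"
    else:
        padding = ""
        end_flag = "r"
    trailer = "<?xpacket end='%s'?>" % end_flag
    return header + "\n" + rdf_xml + "\n" + padding + trailer


if __name__ == "__main__":
    body = "<x:xmpmeta xmlns:x='adobe:ns:meta/'/>"
    print(wrap_packet(body, writable=False))
```

A publisher distributing final form documents would presumably use the read-only form, leaving additional metadata to be linked "by reference" as described above.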

September 25, 2007

XMP-Ville

Been so busy looking into the technical details of XMP that I almost forgot to check out the current landscape. Luckily I chanced on these articles by Ron Roszkiewicz for The Seybold Report (and apologies for lifting the title of this post from his last). The articles about XMP are well worth reading and chart the painful progress made to date:

From the earlier characterization of XMP as "underachieving teenager" Roszkiewicz is cautiously optimistic that IDEAlliance's XMP Open initiative (an initiative to advance XMP as an open industry specification) will help outreach and foster adoption of this fledgling technology.

(Continues.)

Continue reading "XMP-Ville" »

September 20, 2007

The Name's The Thing

I'm always curious about names and where they come from and what they mean. Hence, my interest was aroused by the constant references to "XAP" in XMP. As the XMP Specification (Sept. 2005) says:

"NOTE: The string “XAP” or “xap” appears in some namespaces, keywords, and related names in this document and in stored XMP data. It reflects an early internal code name for XMP; the names have been preserved for compatibility purposes."

Actually, it occurs in most of the core namespaces: XAP, rather than XMP.

(Continues.)

Continue reading "The Name's The Thing" »

September 11, 2007

Marking up DOI

(Update - 2007.09.15: Clean forgot to add in the rdf: namespace to the examples for xmp:Identifier in this post. I've now added in that namespace to the markup fragments listed. Also added in a comment here which shows the example in RDF/XML for those who may prefer that over RDF/N3.)

So, as a preliminary to reviewing how a fuller metadata description of a CrossRef resource may best be fitted into an XMP packet for embedding into a PDF, let's just consider how a DOI can be embedded into XMP. And since it's so much clearer to read let's just conduct this analysis using RDF/N3. (Life is too short to be spent reading RDF/XML or C++ code. :~)

(And further to Chris Shillum's comment here on my earlier post Metadata in PDF: 2. Use Cases where he notes that Elsevier are looking to upgrade their markup of DOI in PDF to use XMP, I'm really hoping that Elsevier may have something to bring to the party and share with us. A consensus rendering of DOI within XMP is going to be of benefit to all.)

(Continues.)

Continue reading "Marking up DOI" »

September 10, 2007

XMP - Some Other Gripes

Following on from the missing XMP Specification version number discussed in the previous post here, below are listed some miscellaneous gripes I've got with XMP (on what is otherwise a very promising technology). I would be more than happy to be proved wrong on any of these points.

Continue reading "XMP - Some Other Gripes" »

W5M0MpCehiHzreSzNTczkc9d

What on earth can this string mean: 'W5M0MpCehiHzreSzNTczkc9d'? This occurs in the XMP packet header:

<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>

Well, from the XMP Specification (September 2005), which is available here, there is this text:

"The required id attribute must follow begin. For all packets defined by this version of the syntax, the value of id is the following string: W5M0MpCehiHzreSzNTczkc9d"


(See: 3 XMP Storage Model / XMP Packet Wrapper / Header / Attribute: id)

OK, so it's no big deal to cut and paste that string, it's just mighty curious why this cryptic key is needed in an open specification, especially since (contrary to what might be implied by the text) it doesn't seem to vary with version. (Or hasn't yet, at any rate - more below.)
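
One practical consequence of a constant id is that a packet can be located in a file by plain byte scanning, without parsing the host format. The Python sketch below is one way that might look. It assumes an 8-bit (UTF-8) packet and a trailer of the form end='r' or end='w' as given in the XMP specification; the function name and the sample bytes are hypothetical.

```python
import re

# Header with the fixed id, then the packet body, then the trailer.
# Either quote style is accepted; other header attributes are skipped.
PACKET_RE = re.compile(
    rb"<\?xpacket\s+begin=(['\"]).*?\1\s+id=(['\"])W5M0MpCehiHzreSzNTczkc9d\2.*?\?>"
    rb"(.*?)"
    rb"<\?xpacket\s+end=(['\"])([rw])\4\s*\?>",
    re.DOTALL,
)


def find_xmp_packets(data):
    """Return a list of (body_bytes, end_flag) pairs for each packet in data."""
    return [
        (match.group(3), match.group(5).decode("ascii"))
        for match in PACKET_RE.finditer(data)
    ]


if __name__ == "__main__":
    # Hypothetical file contents: arbitrary bytes around a single packet.
    blob = (
        b"%PDF-1.4 ...binary..."
        b"<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>\n"
        b"<x:xmpmeta xmlns:x='adobe:ns:meta/'/>\n"
        b"<?xpacket end='w'?>"
        b"...more binary..."
    )
    for body, flag in find_xmp_packets(blob):
        print(flag, body.strip().decode("utf-8"))
```

This will not find packets stored in UTF-16 or UTF-32, nor packets that the host format has compressed or split, so it is a convenience rather than a substitute for a format-aware reader.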

Continue reading "W5M0MpCehiHzreSzNTczkc9d" »