
November 28, 2011

Turning DOIs into formatted citations

Today two new content types were added to dx.doi.org resolution for CrossRef DOIs. These allow anyone to retrieve DOI bibliographic metadata as formatted bibliographic entries. To perform the formatting we're using the Citation Style Language (CSL) processor citeproc-js, which supports a shed load of citation styles and locales. In fact, all the styles and locales found in the CSL repositories are supported, including many common styles such as bibtex, apa, ieee, harvard, vancouver and chicago.

First off, if you'd like to try citation formatting without using content negotiation, there's a simple web UI that allows input of a DOI, style and locale selection.

If you're more into accessing the web via your favorite programming language, have a look at these content negotiation curl examples. To make a request for the new "text/bibliography" content type:

$ curl -LH "Accept: text/bibliography; style=bibtex" http://dx.doi.org/10.1038/nrd842

@article{Atkins_Gershell_2002, title={From the analyst's couch: Selective anticancer drugs}, volume={1}, DOI={10.1038/nrd842}, number={7}, journal={Nature Reviews Drug Discovery}, author={Atkins, Joshua H. and Gershell, Leland J.}, year={2002}, month={Jul}, pages={491-492}}

A locale can be specified with the "locale" content type parameter, like this:

$ curl -LH "Accept: text/bibliography; style=mla; locale=fr-FR" http://dx.doi.org/10.1038/nrd842

Atkins, Joshua H., et Leland J. Gershell. « From the analyst's couch: Selective anticancer drugs ». Nature Reviews Drug Discovery 1.7 (2002): 491-492.
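If curl isn't your thing, the same negotiation can be sketched in Python using only the standard library. This is a rough illustration rather than an official client: the function names are made up for this example, and the only things taken from the post are the content type, its "style" and "locale" parameters, and the dx.doi.org URL.

```python
from urllib.request import Request, urlopen


def bibliography_accept(style, locale=None):
    """Build the Accept header value for the text/bibliography content type."""
    value = f"text/bibliography; style={style}"
    if locale:
        value += f"; locale={locale}"
    return value


def bibliography_request(doi, style, locale=None):
    """Prepare (but do not send) a content-negotiated request for a DOI."""
    return Request(
        f"http://dx.doi.org/{doi}",
        headers={"Accept": bibliography_accept(style, locale)},
    )


def fetch_citation(doi, style, locale=None):
    """Send the request; urlopen follows redirects, much like curl -L."""
    with urlopen(bibliography_request(doi, style, locale)) as response:
        return response.read().decode("utf-8")
```

Calling fetch_citation("10.1038/nrd842", "mla", "fr-FR") should correspond to the second curl example, assuming the resolver is reachable from wherever you run it.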

You may want to process metadata through CSL yourself. For this use case, there's another new content type, "application/citeproc+json" that returns metadata in a citeproc-friendly JSON form:

$ curl -LH "Accept: application/citeproc+json" http://dx.doi.org/10.1038/nrd842

{"volume":"1","issue":"7","DOI":"10.1038/nrd842","title":"From the analyst's couch: Selective anticancer drugs","container-title":"Nature Reviews Drug Discovery","issued":{"date-parts":[[2002,7]]},"author":[{"family":"Atkins","given":"Joshua H."},{"family":"Gershell","given":"Leland J."}],"page":"491-492","type":"article-journal"}
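As a rough idea of what processing that JSON yourself might look like, here is a small Python sketch that turns the record above into a one-line summary. It is not CSL and not citeproc: the summarize function and its output layout are invented for illustration, and it falls back to empty strings because, as noted below, fields can be missing.

```python
import json

# Sample record copied from the response shown above.
SAMPLE = '''{"volume":"1","issue":"7","DOI":"10.1038/nrd842","title":"From the analyst's couch: Selective anticancer drugs","container-title":"Nature Reviews Drug Discovery","issued":{"date-parts":[[2002,7]]},"author":[{"family":"Atkins","given":"Joshua H."},{"family":"Gershell","given":"Leland J."}],"page":"491-492","type":"article-journal"}'''


def summarize(record):
    """Format a citeproc JSON record as a single line (hypothetical layout)."""
    authors = ", ".join(
        f"{a.get('given', '')} {a.get('family', '')}".strip()
        for a in record.get("author", [])
    )
    # "issued" holds nested date parts: [[year, month, day]]
    date_parts = record.get("issued", {}).get("date-parts", [[]])
    year = date_parts[0][0] if date_parts and date_parts[0] else "n.d."
    return (
        f"{authors} ({year}). {record.get('title', '')}. "
        f"{record.get('container-title', '')} "
        f"{record.get('volume', '')}({record.get('issue', '')}): "
        f"{record.get('page', '')}."
    )


if __name__ == "__main__":
    print(summarize(json.loads(SAMPLE)))
```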

Finally, to retrieve lists of supported styles and locales, either hit these URLs:

or check out the CSL style and locale repositories.

There's one big caveat to all this. The CSL processor will do its best with CrossRef metadata which can unfortunately be quite patchy at times. There may be pieces of metadata missing, inaccurate metadata or even metadata items stored under the wrong field, all resulting in odd-looking formatted citations. Most of the time, though, it works.

April 19, 2011

Content Negotiation for CrossRef DOIs

So does anybody remember the posting DOIs and Linked Data: Some Concrete Proposals?

Well, we went with option "D."

From now on, DOIs, expressed as HTTP URIs, can be used with content-negotiation.

Let's get straight to the point. If you have curl installed, you can start playing with content-negotiation and CrossRef DOIs right away:

curl -D - -L -H "Accept: application/rdf+xml" "http://dx.doi.org/10.1126/science.1157784"

curl -D - -L -H "Accept: text/turtle" "http://dx.doi.org/10.1126/science.1157784"

curl -D - -L -H "Accept: application/atom+xml" "http://dx.doi.org/10.1126/science.1157784"

Or if you are already using CrossRef's "unixref" format:

curl -D - -L -H "Accept: application/unixref+xml" "http://dx.doi.org/10.1126/science.1157784" 

This will work with over 46 million CrossRef DOIs as of today, but the beauty of the setup is that from now on, any DOI registration agency can enable content negotiation for their constituencies as well. DataCite, we're looking at you ;-)

It also means that, as registration agency members (CrossRef publishers, for instance) start providing more complete and richer representations of their content, we can simply redirect content-negotiated requests directly to them.

We expect that this development will round out CrossRef's efforts to support standard APIs including OpenURL and OAI-PMH and we look forward to seeing DOIs increasingly used in linked data applications.

Finally, CrossRef would just like to thank the IDF and CNRI for their hard work on this as well as Tony Hammond and Leigh Dodds for their valuable advice and persistent goading.

December 9, 2009

Add CrossRef metadata to PDFs using XMP

In order to encourage publishers and other content producers to embed metadata into their PDFs, we have released an experimental tool called "pdfmark". This open source tool allows you to add XMP metadata to a PDF. What's really cool is that if you give the tool a CrossRef DOI, it will look up the metadata in CrossRef and then apply said metadata to the PDF. More detail can be found on the pdfmark page on the CrossRef Labs site. The usual weasel words and excuses about "experiments" apply.

October 19, 2009

Recommendations on RSS Feeds for Scholarly Publishers

We're pleased to announce that a CrossRef working group has released a set of best practice recommendations for scholarly publishers producing RSS feeds.

Variations in practice amongst publisher feeds can be irritating for end-users, but they can be insurmountable for automated processes. RSS feeds are increasingly being consumed by knowledge discovery and data mining services. In these cases, variations in date formats, the practice of lumping all authors together in one element, or generating invalid XML can render the RSS feed useless to the service accessing it.

The recommendations are intended to facilitate good practice in the production and provision of TOC RSS feeds. The guidelines include general recommendations for good practice, specific recommendations on the use of RSS Modules and an example RSS TOC feed. Ultimately, we expect that industry-wide adoption of these best practices will help drive more traffic to publisher web sites. Note that most of these recommendations can also be applied to non-TOC RSS feeds such as thematic feeds, automated search result feeds, etc.

March 20, 2009

Citation Typing Ontology

I was happy to read David Shotton's recent Learned Publishing article, Semantic Publishing: The Coming Revolution in scientific journal publishing, and see that he and his team have drafted a Citation Typing Ontology.*

Anybody who has seen me speak at conferences knows that I often like to proselytize about the concept of the "typed link", a notion that hypertext pioneer Randy Trigg discussed extensively in his 1983 Ph.D. thesis. Basically, Trigg points out something that should be fairly obvious: a citation (i.e. "a link") is not always a "vote" in favor of the thing being cited.

In fact, there are all sorts of reasons that an author might want to cite something. They might be elaborating on the item cited, they might be critiquing the item cited, they might even be trying to refute the item cited. (For an exhaustive and entertaining survey of the use and abuse of citations in the humanities, Anthony Grafton's The Footnote: A Curious History is a rich source of examples.)

Unfortunately, the naive assumption that a citation is tantamount to a vote of confidence has become enshrined in everything from the way in which we measure scholarly reputation, to the way in which we fund universities and the way in which search engines rank their results. The distorting effect of this assumption is profound. If nothing else, it leads to a perverse situation in which people will often discuss books, articles, and blog postings that they disagree with without actually citing the relevant content, just so that they can avoid inadvertently conferring "whuffie" on the item being discussed. This can't be right.

Having said that, there has been a half-hearted attempt to introduce a gross level of link typology with the introduction of the "nofollow" link attribute, an initiative started by Google in order to try to address the increasing problem of "Spamdexing". But this is a pretty ham-fisted form of link typing, particularly in the way it is implemented by the Wikipedia, where CrossRef DOI links to formally published scholarly literature have a "nofollow" attribute attached to them but, inexplicably, items with a PMID are not so hobbled (view the HTML source of this page, for example). Essentially, this means that the Wikipedia is a black hole of reputation. That is, it absorbs reputation (through links to the Wikipedia), but it doesn't let reputation back out again. Hell, I feel dirty for even linking to it here ;-).

Anyway, scholarly publishers should certainly read Shotton's article because it is full of good and practical ideas about what can be done with today's technology in order to help us move beyond the "digital incunabula" that the industry is currently churning out. The sample semantic article that Shotton's team created is inspirational and I particularly encourage people to look at the source file for the ontology-enhanced bibliography, which reveals just how much more useful metadata can be associated with the humble citation.

And now I wonder whether CiteULike, Connotea, 2Collab or Zotero will consider adding support for the Citation Typing Ontology into their respective services?


* Disclosure:

a) I am on the editorial board of Learned Publishing
b) CrossRef has consulted with David Shotton on the subject of semantically enhancing journal articles

January 6, 2009

Poorboy Metadata Hack

I was playing around recently and ran across this little metadata hack. At first, I thought somebody was doing something new. But no, nothing so forward apparently. (Heh! :)

I was attempting to grab the response headers from an HTTP request on an article page and was using by default the Perl LWP library. For some reason I was getting metadata elements being spewed out as response headers - at least from some of the sites I tested. With some further investigation I tracked this back to LWP itself which parses HTML headers and generates HTTP pseudo-headers using an X-Meta- style header. (This can be viewed either as a feature of LWP or a bug as this article bemoans.)

What this means anyway is that I can issue a simple call like this to get the HTML metadata - shown here for doi:10.1087/095315108X288947:

% lwp-request -ed 'http://dx.doi.org/10.1087/095315108X288947' | grep -i x-meta
X-Meta-DC.Creator: Rapple, Charlie
X-Meta-DC.Identifier: info:doi/10.1087/095315108X288947
X-Meta-DC.Publisher: Association of Learned and Professional Society Publishers
X-Meta-DC.Title: Knowledge bases: improving the information supply chain
X-Meta-DC.Type: Text
X-Meta-DCTERMS.BibliographicCitation: Learned Publishing, 21, 2, 110-115(6)
X-Meta-DCTERMS.IsPartOf: urn:ISSN:0953-1513
X-Meta-DCTERMS.Issued: April 2008
X-Meta-IC.Identifier: alpsp/lp/2008/00000021/00000002/art00005
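For anyone without LWP to hand, the trick is easy enough to imitate. The following Python sketch parses <meta> tags out of an HTML document and prints them in the same X-Meta- style. It is only an imitation: I haven't tried to reproduce LWP's exact rules, the class and function names are invented here, and the sample markup is a guess reconstructed from two of the headers above rather than the real page source.

```python
from html.parser import HTMLParser


class MetaHeaderParser(HTMLParser):
    """Collect <meta name=... content=...> pairs as pseudo-headers."""

    def __init__(self):
        super().__init__()
        self.headers = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        # Skip tags such as <meta charset=...> that carry no name/content pair.
        if name and content is not None:
            self.headers.append((f"X-Meta-{name}", content))


def meta_pseudo_headers(html):
    """Return a list of (header, value) tuples for the document's meta tags."""
    parser = MetaHeaderParser()
    parser.feed(html)
    return parser.headers


# Hypothetical markup, reconstructed from the header listing above.
SAMPLE = """<html><head>
<meta name="DC.Creator" content="Rapple, Charlie">
<meta name="DC.Type" content="Text" />
</head><body></body></html>"""

if __name__ == "__main__":
    for header, value in meta_pseudo_headers(SAMPLE):
        print(f"{header}: {value}")
```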

This shows a simple (read lazy) means of accessing metadata added as <meta> tags in HTML headers, such as those we added for Nature. (Of course, machine readable metadata is best added using RDFa as noted earlier, but that does not preclude also adding <meta> tags, which are usable with HTML as well as XHTML.)

(Btw, wouldn't it be fun if CrossRef had a random DOI facility? That would be real handy for testing as well as giving users a feel for what real-life DOIs look like and what lies at the other end of them.)

December 22, 2008

And the DOI is ...

Once structured metadata is added to a file then retrieving a given metadata element is usually a doddle. For example, for PDFs with embedded XMP one can use Phil Harvey's excellent Exiftool utility.

Exiftool is a Perl library and application, which I've blogged about here earlier, and is available as a '.zip' file for Windows (no Perl required) or a '.dmg' for MacOS. Note that Phil maintains this actively and has done so over the last five years. (And when I say actively I mean just that. I once made the mistake of printing out the change file.)

If Perl's not your thing, then there's a Ruby wrapper gem (MiniExiftool) to access the Exiftool command in trouper OO fashion. Here's an example Ruby one-liner to get the DOI from a PDF (broken here to meet column width restriction):

% ruby -rubygems -e 'require "mini_exiftool";
    puts MiniExiftool.new("test.pdf")["doi"]'
10.1038/nphoton.2008.200

Of course, that could also have been run against an image, audio or video file with an XMP packet.

(Makes one wonder vaguely about the feasibility of having a Swiss Army knife type of utility that could read any file to get the DOI using the embedded XMP, RDFa, RDF, HTML headers, COiNS, etc. Possibly even as last resort fall back to scanning the raw text - if any.)
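The last-resort fallback is the easiest piece to picture. Here's a rough Python sketch that scans raw text for things that look like DOIs. The regular expression is a heuristic of my own choosing, not anything defined by the DOI syntax (which allows almost any printable character in the suffix), so expect both false positives and misses.

```python
import re

# Heuristic only: "10.", a 4-9 digit registrant code, "/", then a suffix
# running up to whitespace or an obvious delimiter.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')


def find_dois(text):
    """Return DOI-like strings found in text, in order, without duplicates."""
    found = []
    for match in DOI_PATTERN.finditer(text):
        # Trim punctuation that usually belongs to the surrounding sentence.
        doi = match.group(0).rstrip(".,;:)")
        if doi not in found:
            found.append(doi)
    return found


if __name__ == "__main__":
    print(find_dois("See doi:10.1038/nphoton.2008.200 and 10.1038/nature05634."))
```

Trimming trailing punctuation is a trade-off: it rescues DOIs that end a sentence, at the cost of mangling the rare DOI that genuinely ends in a full stop or bracket.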

November 19, 2008

Machine Readable: Are We There Yet?

The guidelines for CrossRef publishers ("DOI Name Information and Guidelines" - PDF, 210K) have this to say in "Sect. 6.3 The response page" regarding the response page for a DOI:

"A minimal response page must contain a full bibliographic citation displayed to the user. A response page without bibliographic information should never be presented to a user."
which would seem to be all fine and dandy. But if that user is a machine (or an agent acting for a user) they'll likely be out of luck as the metadata in the bibliographic citation is generally targeted at human users.

So here's a quick and dirty implementation of what a machine readable page could look like using RDFa. (The demo uses Jeni Tennison's wonderful rdfQuery plugin which I blogged about earlier.)

Clicking the DOI link below will bring up in a sub-window a bibliographic citation which might be found in a typical DOI response page. If you now click the "Read Me" link you should see an alert message which presents the bibliographic metadata as a complete RDF document (in a simple N3 – or Notation3 – format). This document is assembled on the fly by rdfQuery using the RDFa markup embedded in the page.

doi:10.1038/nature05634 (Click for demo)

See the "View Source" link to list the actual XHTML markup and the RDFa properties which have been added. And note also that some of the properties are partially "hidden" to the human reader, e.g. a publication date is given in year form only whereas the machine record has the date in full, and some of the properties are fully "hidden": print and electronic ISSNs, issue number, ending page, etc.

(Continues below.)

Continue reading "Machine Readable: Are We There Yet?" »

November 17, 2008

rdfQuery

Whaddya know? I was just on the point of blogging about the real nice demo given by Jeni Tennison at last week's SWIG UK meeting at HP Labs in Bristol of rdfQuery (an RDF plugin for jQuery - the zip file is here). And there today on her blog I see that she has a full writeup on rdfQuery, so I'll defer to the expert. :~)

All I can really add to that is that rdfQuery is a pretty darn cool way to add and manipulate RDFa using jQuery. Does it get any better?

And now that RDFa is a W3C Rec since last month (see Primer and Syntax) it will be interesting to see how CrossRef members might begin to deploy it on their pages - especially on DOI landing pages.

October 24, 2008

PRISM 2.1

Yesterday a new PRISM spec (v2.1) was released for public comment - zip file here. (Comment period lasts up to Dec. 3, '08.)

Changes are listed in pages 8 and 9 of the Introduction document. Some highlights:

  • New PRISM Usage Rights namespace
  • Accordingly usage of prism:copyright, prism:embargoDate, and prism:expirationDate no longer recommended
  • New element prism:isbn introduced for book serials

An updated mod_prism RSS 1.0 module is available which lists all versions of PRISM specs including the forthcoming v2.1 spec. I will see about getting this added now to a more permanent location. Current version of PRISM remains at v2.0. Versions 2.0 and 2.1 are especially of interest to users of CrossRef because of their support for prism:doi and prism:url and users should consider upgrading their applications, e.g. RSS feeds.

July 21, 2008

Metadata Matters

Andy Powell has published on Slideshare this talk about metadata - see his eFoundations post for notes. It's 130 slides long and aims

"to cover a broad sweep of history from library cataloguing, thru the Dublin Core, Web search engines, IEEE LOM, the Semantic Web, arXiv, institutional repositories and more."
Don't be fooled by the length though. This is a flip through and is a readily accessible overview on the importance of metadata. Slides 86-91 might be of interest here. ;)


July 9, 2008

PRISM Press Release

The PRISM metadata standards group issued a press release yesterday which covered three points:

PRISM Cookbook
The Cookbook provides "a set of practical implementation steps for a chosen set of use cases and provides insights into more sophisticated PRISM capabilities. While PRISM has 3 profiles, the cookbook only addresses the most commonly used profile #1, the well-formed XML profile. All recipes begin with a basic description of the business purpose it fulfills, followed by ingredients (typically a set of PRISM metadata fields or elements), and, closes with a step-by-step implementation method with sample XMLs and illustrative images."
PRISM 2.0 Errata
The Errata "addresses a range of issues, from editorial to technical, that have been reported by the PRISM user community."
PRISM 2.1
The next version of the PRISM Specification, PRISM 2.1, is slated for release in late 2008. "This release will address complex rights for multi-platform and global distribution channels."

July 1, 2008

Exposing Public Data: Options

This is a follow-on to an earlier post which set out the lie of the land as regards DOI services and data for DOIs registered with CrossRef. That post differentiated between a native DOI resolution through a public DOI service which acts upon the "associated values held in the DOI resolution record" (per ISO CD 26324) and other related DOI protected and/or private services which merely use the DOI as a key into a non-public database offering.

Following the service architecture outlined in that post, options for exposing public data appear as follows:

  1. Private Service
    1. Publisher hosted – Publisher private service
  2. Protected Service
    1. CrossRef hosted – Industry protected service
    2. CrossRef routed – Publisher private service
  3. Public Service
    1. Handle System (DOI handle) – Global public service (native DOI service)
    2. Handle System (DOI ‘buddy’ handle) – Publisher public service

(Continues below.)

Continue reading "Exposing Public Data: Options" »

May 20, 2008

Metadata Reuse Policies

Following on from yesterday's post about making metadata available on our Web pages, I wanted to ask here about "metadata reuse policies". Does anybody have a clue as to what might constitute a best practice in this area? I'm specifically interested in license terms, rather than how those terms would be encoded or carried. Increasingly we are finding more channels to distribute metadata (RSS, HTML, OAI-PMH, etc.) but don't yet have any clear statement for our customers as to how they might reuse that data.

Time to put the caveats aside and focus on the actuals.

May 19, 2008

Nature's Metadata for Web Pages

Well, we may not be the first but wanted anyway to report that Nature has now embedded metadata (HTML meta tags) into all its newly published pages including full text, abstracts and landing pages (all bar four titles which are currently being worked on). Metadata coverage extends back through the Nature archives (and depth of coverage varies depending on title). This conforms to the W3C's Guideline 13.2 in the Web Content Accessibility Guidelines 1.0 which exhorts content publishers to "provide metadata to add semantic information to pages and sites".

Metadata is provided in both DC and PRISM formats as well as in Google’s own bespoke metadata format. This generally follows the DCMI recommendation "Expressing Dublin Core metadata using HTML/XHTML meta and link elements", and the earlier RFC 2731 "Encoding Dublin Core Metadata in HTML". (Note that schema name is normalized to lowercase.) Some notes:

  • The DOI is included in the "dc.identifier" term in URI form, which is the CrossRef recommendation for citing DOIs.
  • We could consider adding also "prism.doi" for disclosing the native DOI form. This requires the PRISM namespace declaration to be bumped to v2.0. We might consider synchronizing this change with our RSS feeds which are currently pegged at v1.2, although note that the RSS module mod_prism currently applies only to PRISM v1.2.
  • We could then also add in a "prism.url" term to link back (through the DOI proxy server) to the content site. The namespace issue listed above still holds.
  • The "citation_" terms are not anchored in any published namespace which does make this term set problematic in application reuse. It would be useful to be able to reference a namespace (e.g. "rel="schema.gs" href="..."") for these terms and to cite them as e.g. "gs.citation_title".

The HTML metadata sets from an example landing page are presented below.

Continue reading "Nature's Metadata for Web Pages" »

March 26, 2008

Word Add-in for Scholarly Authoring and Publishing

Last week Pablo Fernicola sent me email announcing that Microsoft have finally released a beta of their Word plugin for marking-up manuscripts with the NLM DTD. I say "finally" because we've known this was on the way and have been pretty excited to see it. We once even hoped that MS might be able to show the plug-in at the ALPSP session on the NLM DTD, but we couldn't quite manage it.

The plugin is targeted at production/editorial staff, but, of course, it will be interesting to see if any of this work can be pushed back to the author. I won't hold my breath on the latter score, but it will be fun to watch.

One thing I would note is that the NLM DTD can also be used in the humanities and social sciences, so, frankly, I think they should market it more broadly.

Anyway, the plugin can be downloaded from the Microsoft site.

And Pablo has set up a blog where testers can discuss the add-in.

And there is also an entry for the project on the Microsoft Research site (an interesting place to peruse, if you have a moment).

Congratulations to Pablo and his team.

February 22, 2008

prism:doi

The new PRISM spec (v. 2.0) was published this week, see the press release. (Downloads are available here.)

This is a significant development as there is support for XMP profiles, to complement the existing XML and RDF/XML profiles. And, as PRISM is one of the major vocabularies being used by publishers, I would urge you all to go take a look at it and to consider upgrading your applications to using it.

One caveat. There's a new element prism:doi (PRISM Namespace, 4.2.13) which sits alongside another new element prism:url (PRISM Namespace, 4.2.55). Unfortunately the prism:doi element is shown to take a DOI proxy URL as its value - and not the DOI string itself, e.g.

  • Model #1
    <prism:doi rdf:resource="http://dx.doi.org/10.1030/03054"/>
  • Model #2
    <prism:doi>http://dx.doi.org/10.1030/03054</prism:doi>

This seems to me to be just plain wrong. The DOI in itself is not a URL (or URI) - although it can, and should, be represented in URI form when used in Web contexts (i.e. pretty much most of the time). As a literal it should be used in its native form as specified in ANSI/NISO Z39.84 - 2005 Syntax for the Digital Object Identifier. This would only satisfy Model #2 above.

To satisfy Model #1 above a URI form for DOI would be required. And this is not the service URI denoted by the proxy. It would either have to be:

  • Model #1 - Registered URI Form
    <prism:doi rdf:resource="info:doi/10.1030/03054"/>
  • Model #1 - Unregistered URI Form
    <prism:doi rdf:resource="doi:10.1030/03054"/>
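To make the distinction between these forms concrete, here is a small Python sketch that converts between the native DOI string, the two URI forms and the proxy URL. The function names and dictionary keys are invented for this illustration, and the sketch deliberately ignores percent-encoding of awkward characters, which a real implementation would need to think about.

```python
# Prefixes for the non-native forms discussed above.
PROXY_PREFIX = "http://dx.doi.org/"
INFO_PREFIX = "info:doi/"
SCHEME_PREFIX = "doi:"


def native_doi(value):
    """Strip a known prefix and return the native DOI string."""
    for prefix in (PROXY_PREFIX, INFO_PREFIX, SCHEME_PREFIX):
        if value.startswith(prefix):
            return value[len(prefix):]
    return value


def doi_forms(value):
    """Return the native DOI alongside its URI and proxy URL forms."""
    doi = native_doi(value)
    return {
        "native": doi,                        # literal, per ANSI/NISO Z39.84
        "info_uri": INFO_PREFIX + doi,        # registered URI form
        "doi_uri": SCHEME_PREFIX + doi,       # unregistered URI form
        "proxy_url": PROXY_PREFIX + doi,      # service URL, not an identifier
    }


if __name__ == "__main__":
    for name, form in doi_forms("http://dx.doi.org/10.1030/03054").items():
        print(f"{name}: {form}")
```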

Any comments? Some guidelines from CrossRef would be useful - although maybe further discussion is required. It is, of course, a constant bugbear that "doi:" remains an unregistered URI scheme.

February 9, 2008

CrossRef Citation Plugin (for WordPress)

OK, after a number of delays due to everything from indexing slowness to router problems, I'm happy to say that the first public beta of our WordPress citation plugin is available for download via SourceForge. A Movable Type version is in the works.

And congratulations to Trey at OpenHelix who became laudably impatient, found the SourceForge entry for the plugin back on February 8th and seems to have been testing it since. He has a nice description of how it works (along with screenshots), so I won't repeat the effort here.

Having said that, I do include the text of the README after the jump. Please have a look at it before you install, because it might save you some mystification.

Continue reading "CrossRef Citation Plugin (for WordPress)" »

November 6, 2007

DC in (X)HTML Meta/Links

This message posted out yesterday on the dc-general list (with following extract) may be of interest:

"Public Comment on encoding specifications for Dublin Core metadata in HTML and XHTML


2007-11-05, Public Comment is being held from 5 November through 3 December 2007 on the DCMI Proposed Recommendation, "Expressing Dublin Core metadata using HTML/XHTML meta and link elements" <http://dublincore.org/documents/2007/11/05/dc-html/> by Pete Johnston and Andy Powell. Interested members of the public are invited to post comments to the DC-ARCHITECTURE mailing list <http://www.jiscmail.ac.uk/lists/dc-architecture.html> , including "[DC-HTML Public Comment]" in the subject line. Depending on comments received, the specification may be finalized after the comment period as a DCMI Recommendation."

October 14, 2007

OpenDocument Adds RDF

Bruce D'Arcus left a comment here in which he linked to a post of his: "OpenDocument's New Metadata System". Not everybody reads comments so I'm repeating it here. His post is worth reading on two counts:

  1. He talks about the new metadata functionality for OpenDocument 1.2 which uses generic RDF. As he says:
    "Unlike Microsoft’s custom schema support, we provide this through the standard model of RDF. What this means is that implementors can provide a generic metadata API in their applications, based on an open standard, most likely just using off-the-shelf code libraries."
    This is great. It means that description is left up to the user rather than being restricted by any vendor limitation. (Ideally we would like to see the same for XMP. But Adobe is unlikely to budge because of the legacy code base and documents. It's a wonder that Adobe still wants XMP to breathe.)
  2. He cites a wonderful passage from Rob Weir of IBM (something which I had been considering to blog but too late now) about the changing shape of documents. Can only say, go read Bruce's post and then Rob's post. But anyway a spoiler here:
    "The concept of a document as being a single storage of data that lives in a single place, entire, self-contained and complete is nearing an end. A document is a stream, a thread in space and time, connected to other documents, containing other documents, contained in other documents, in multiple layers of meaning and in multiple dimensions."
I think the ODF initiative is fantastic and wish that Adobe could follow suit. However, I do still hold out something for XMP. After all, nobody else AFAICT is doing anything remotely similar for multimedia. Where's the W3C and co. when you really need them? (Oh yeah, faffing about the new Semantic Web logo. ;)

October 5, 2007

Scholarly DC

This announcement was just sent out to the DC-GENERAL mailing list about the new DCMI Community for Scholarly Communications. As Julie Allinson says:

"The aim of the group is to provide a central place for individuals and organisations to exchange information, knowledge and general discussion on issues relating to using Dublin Core for describing items of 'scholarly communications', be they research papers, conference presentations, images, data objects. With digital repositories of scholarly materials increasingly being established across the world, this group would like to offer a home for exploring the metadata issues faced."
There's also a DC-SCHOLAR mailing list (subscribe here). Not too much there yet, but it may be useful to track - or even to participate. :)

September 15, 2007

Custom Panel for CC

Creative Commons now have a custom panel for adding CC licenses using Adobe apps - see here.

Interesting on two counts:

  • Machine readable licenses
  • XMP metadata

But I still think that batch solutions for adding XMP metadata are really required for publishing workflows. And ideally there should be support for adding arbitrary XMP packets if we're going to have truly rich metadata. I rather fear the constraints that custom panels place upon the publisher.

September 13, 2007

Last Orders Please!

Public comment period on the PRISM 2.0 draft ends Saturday (Sept. 15) ahead of next week's WG meeting to review feedback and finalize the spec.

(I put in some comments about XMP already. Hope they got that.)

September 11, 2007

The Second Wave

You might have been wondering why I've been banging on about XMP here. Why the emphasis on one vendor technology on a blog focussed on an industry linking solution? Well, this post is an attempt to answer that.

Four years ago we at Nature Publishing Group, along with a select few early adopters, started up our RSS news feeds. We chose to use RSS 1.0 as the platform of choice which allowed us to embed a rich metadata term set using multiple schemas - especially Dublin Core and PRISM. We evangelized this much at the time and published documents on XML.com (Jul. '03) and in D-Lib Magazine (Dec. '04) as well as speaking about this at various meetings and blogging about it. Since that time many more publishers have come on board and now provide RSS routinely, many of them choosing to enrich their feeds with metadata.

Well, RSS can be seen in hindsight as being the First Wave of projecting a web presence beyond the content platform using standard markup formats. With this embedded metadata a publisher can expand their web footprint and allow users to link back to their content server.

Now, XMP with its potential for embedding metadata in rich media can be seen as a Second Wave. Media assets distributed over the network can now carry along their own metadata and identity which can be leveraged by third-party applications to provide interesting new functionalities and link-back capability. Again a projection of web presence.

(Continues.)

Continue reading "The Second Wave" »

August 28, 2007

Stop Press

Boy, was I ever so wrong! Contrary to what I said in yesterday's post, the new PRISM 2.0 spec does support XMP value type mappings for its terms. See the table below which lists the PRISM basic vocabulary terms and the XMP value types.

Many thanks to Dianne Kennedy and the rest of the PRISM Working Group for having added this support to PRISM 2.0.

Continue reading "Stop Press" »

August 27, 2007

ExifTool

(Update - 2007.08.28: I inadvertently missed out the term names in the last example of XMP as RDF/N3 with QNames and have now added these in. Also - a biggie - I said that PRISM had no XMP schema defined. This is actually wrong and as I blogged here today, the new PRISM 2.0 spec does indeed have a mapping of PRISM terms to XMP value types. Should actually have read the spec instead of just blogging about it earlier here. :~)

Having previously stooped to an extremely crass hack for pulling out a document information dictionary from PDFs (for which no apologies are sufficient but it does often work) I feel I should make some kind of amends and mention the wonderful ExifTool by Phil Harvey for reading and writing metadata to media files. This is both a Perl library and command-line application (so it's cross-platform - a Windows .exe and Mac OS .dmg are also provided). Besides handling EXIF tags in image files this veritable Swiss Army knife of metadata inspectors can also read PDFs for the information dictionary and the document XMP packet. And moreover, intriguingly, it can dump the raw (document) XMP packet.

I'm still experimenting with it. There's quite a number of features to explore. But some preliminary finds are listed below.

Continue reading "ExifTool" »

August 22, 2007

Weird Scenes Inside the Gold Mine

So, following up on my recent posts here on Metadata in PDFs (Strategies, Use Cases, Deployment), I finally came across PDF/A and PDF/X, two ISO-standardized subsets of PDF: the former (ISO 19005-1:2005) for archiving and the latter (ISO 15929:2002, ISO 15930-1:2001, etc.) for prepress digital data exchange.

Both formats share some common ground such as minimizing surprises between producer and consumer and keeping things open and predictable. But my interest here is specifically in metadata and to see what guidance these standards might provide us. Not surprisingly, metadata is a key issue for PDF/A, less so for PDF/X. I'll discuss PDF/X briefly but the bulk of this post is focussed on PDF/A. See below.

Continue reading "Weird Scenes Inside the Gold Mine" »

August 2, 2007

PRISM 2.0

Only just caught up with this but the PRISM 2.0 draft is now available (since July 12) for public comment. See this note posted by Dianne Kennedy:

"Just a note to let you know that PRISM 2.0 has just been posted at www.prismstandard.org <http://www.prismstandard.org/> . This is the first major revision to PRISM. We have incorporated new elements to support online content and have expanded and revised our controlled vocabularies. In addition we have added a profile to support PRISM in an XMP environment.

We invite you to review the new specification (in 6 documents organized by namespace) and provide your comments before September 15. Please just email comments and questions to me, dkennedy@idealliance.org. "

Metadata in PDF: 3. Deployment

So, assuming we know the form of the metadata we wish to add to our PDFs (or else to comply with if there is already a set of guidelines, or some industry initiative in effect) how can we realize this? And, on the flip side, how can we make it easier for consumers to extract metadata we have embedded in our PDFs?

Below are some considerations on deploying metadata in PDFs and consumer access.

Continue reading "Metadata in PDF: 3. Deployment" »

August 1, 2007

Metadata in PDF: 2. Use Cases

Well, this is likely to be a fairly brief post as I'm not aware of many use cases of metadata in PDFs from scholarly publishers. Certainly, I can say for Nature that we haven't done much in this direction yet, although we are now beginning to look into this.

I'll discuss a couple cases found in the wild but invite comment as to others' practices. Let me start though with the CNRI handle plugin demo for Acrobat which I blogged here.

Continue reading "Metadata in PDF: 2. Use Cases" »

Metadata in PDF: 1. Strategies

Emboldened by my own researches, by the recent handle plugin announcement from CNRI (on which, more in a follow-on post), and by Alexander Griekspoor's comment to my earlier post, I thought I'd write a more extensive piece about embedding metadata in PDF with a view to the following:

  • Discover what other publishers are currently doing
  • Stimulate discussions between content providers and/or consumers
  • Lay the groundwork for CrossRef best practice guidelines

Why should CrossRef be interested? Well, at minimum to embed the DOI along with the digital asset would seem to be inherently "a good thing". (And, in fact, this is precisely the approach that CNRI have taken for their plugin demos. I'll look later at what they actually did and consider whether that is a model that CrossRef publishers might usefully follow.)

Why include the DOI as an explicit piece of metadata rather than have it included by virtue of its appearance in a content section? The main reason is that it is then unambiguously accessible. Content sections in PDFs are typically filtered (and sometimes encrypted), whereas metadata is usually plain text and moreover is marked up as to field type.
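To make the "unambiguously accessible" point concrete, here is a minimal Python sketch of what a consumer could do when the DOI sits in an unfiltered XMP packet: scan the raw bytes for the packet wrapper and pick the identifier out of it, with no PDF parsing at all. The function names, the loose DOI pattern and the sample bytes are assumptions made up for illustration (the DOI is just a sample value), and a real PDF may of course store its metadata stream filtered, in which case this simple scan would find nothing.

```python
import re
from typing import Optional

# Processing instructions that wrap an XMP packet.
XPACKET_BEGIN = b"<?xpacket begin="
XPACKET_END = b"<?xpacket end="

# Loose DOI pattern: an assumption for illustration, not an official grammar.
DOI_PATTERN = re.compile(rb"10\.\d{4,9}/[^\s\"'<>]+")


def find_xmp_packet(data: bytes) -> Optional[bytes]:
    """Return the first XMP packet found in raw file bytes, or None."""
    start = data.find(XPACKET_BEGIN)
    if start == -1:
        return None
    end = data.find(XPACKET_END, start)
    if end == -1:
        return None
    close = data.find(b"?>", end)
    if close == -1:
        return None
    return data[start:close + 2]


def find_doi(data: bytes) -> Optional[str]:
    """Return the first DOI-like string inside the XMP packet, or None."""
    packet = find_xmp_packet(data)
    if packet is None:
        return None
    match = DOI_PATTERN.search(packet)
    return match.group(0).decode("ascii") if match else None


# Hypothetical stand-in for a PDF file: opaque content plus a plain-text packet.
sample = (
    b"%PDF-1.4\n...filtered content streams...\n"
    b'<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>\n'
    b'<x:xmpmeta xmlns:x="adobe:ns:meta/">'
    b'<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">'
    b'<rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">'
    b"<dc:identifier>doi:10.1038/nrd842</dc:identifier>"
    b"</rdf:Description></rdf:RDF></x:xmpmeta>\n"
    b'<?xpacket end="w"?>\n%%EOF\n'
)

print(find_doi(sample))
```

The same scan over a content section would generally fail, since the DOI printed on the page is buried in a compressed stream, which is the argument for carrying it as explicit metadata.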

Another question concerns whether to add in the identifier alone, or to embed a full metadata set. Why not just embed the identifier and visit CrossRef for the metadata? This is feasible in some cases although it does involve an extra network trip, requires an application to service the identifier and is obviously not workable in offline contexts. Seems like a "no-brainer" to include a fuller description from the outset. Note that publishers frequently make some of this information available anyway in other metadata delivery channels, e.g. RSS feeds.

Continue reading "Metadata in PDF: 1. Strategies" »

July 31, 2007

Handle Acrobat Reader Plugin

Just announced on the handle-info list is a new plugin from CNRI for Acrobat Reader - see here. The announcement says:

"It is intended to demonstrate the utility of embedding a identifying
handle in a PDF document.
 
...
 
A set of demonstration documents, each with an embedded identifying
handle, is packaged with the plug-in to show potential uses. To make
productive use of this technology, a given industry or community of
users would have to agree on one or more specific applications and
populate the relevant handle records accordingly."

Two immediate comments:

  • This is a Windows-only plugin (realized that right after hitting the download button and seeing the '.exe' file) and also needs admin rights to install. (So I solved the first hurdle and am trying to clear the second hurdle. Lockdown is not an uncommon practice for enterprise or institutional computers.)
    (Update: Actually, I think I got this wrong. I need admin privileges to install Adobe Acrobat 8. Still scuppered, though. Can't even see the sample PDF files.)
  • The plugin seems to be aimed at the user rather than at the user agent and thus is necessarily limited in scope, i.e. it needs a human driver. (Ideally content providers would embed metadata within media files using structured markup techniques which would be readily accessible to any downstream app which could leverage this data transparently to provide enhanced user services.)

Anyway, I'll add something more when I can get it installed. I think this tool could be a useful addition to publishing toolkits but also that content providers could do much more for consumers by disclosing metadata for their digital assets in a neutral, structured form.

July 27, 2007

XMP: First Hacks

(Update - 2007.07.28: I meant to reference in this entry Pierre Lindenbaum's post back in May Is there any XMP in scientific pdf ? (No), which btw also references Roderic Page's post on XMP but forgot to add in the links in my haste to scoot off. Well, truth is we still can't answer Pierre in the affirmative but at least we can take the first steps towards rectifying this.)

I've been revisiting Adobe's XMP just recently. (I blogged here about the new XMP Toolkit 4.1 back in March.)

I wanted to share some of my early experiences. First off, after a couple of previous attempts which got pushed aside due to other projects, I managed to compile the libraries and the sample apps that ship with the C++ SDK under Xcode on the Mac. I also needed to compile Expat first which doesn't ship with the distribution.

OK, so far, so good. What this basically leaves one with is a couple of XMP dump utilities (DumpMainXMP and DumpScannedXMP) and two others (XMPCoreCoverage and XMPFilesCoverage) which is a good start anyways for exploring. And turns out that our PDFs already have some workflow metadata in them. This is encouraging because the SDK allows apps to read and update existing XMP packets from files, though not to write new packets into files (as far as I understand).

I thought I would take this opportunity anyway to:

  1. See what XMP metadata terms we might consider adding
  2. Try and add these to existing XMP packets

Ugly details are presented below, but by updating the XMP packet metadata in one of our PDFs (Nature 445, 37 (2007), C.J. Hogan) we can teach Acrobat Reader to read - see the "before" (PDF here) and "after" (PDF here) screenshots in the figure.

acrobats.png

Of course, this is really about much more than getting Adobe apps to read/write metadata. It's about using XMP as a standard platform for embedding metadata in digital assets for third-party apps to read/write. If we can put ID3 tags into our podcasts then why not XMP packets into other media?

Continue reading "XMP: First Hacks" »

July 10, 2007

IBM Article on PRISM

Nice introductory article on PRISM here by Uche Ogbuji (Fourthought Inc.) on IBM's developerWorks.

May 31, 2007

RSC's Project Prospect v1.1

We updated our Project Prospect articles today to release v1.1, with a pile of look & feel improvements to the HTML views and links. The most interesting technical addition is the launch of our enhanced RSS feeds, where we have updated our existing feeds for enhanced articles. These now include ontology terms and primary compounds both visually (as text terms and 2D images) and within the RDF - using the OBO in OWL representation and the info:inchi specification mentioned here by Tony only a few weeks ago.

The enhanced entries will soon become more common as we concentrate our enhancements on our Advance Articles, but the current example below from our Photochemical and Photobiological Sciences feed is lovely. RDF code after the jump - just as beautiful to the parents...

ProspectRSS.jpg

Continue reading "RSC's Project Prospect v1.1" »

March 22, 2007

XMP Capabilities Extended

This post on Adobe's Creative Solutions PR blog may be worth a gander:

"This new update, the Adobe XMP 4.1, provides new libraries for developers to read, write and update XMP in popular image, document and video file formats including: JPEG, PSD, TIFF, AVI, WAV, MPEG, MP3, MOV, INDD, PS, EPS and PNG. In addition, the rewritten XMP 4.1 libraries have been optimized into two major components, the XMP Core and the XMP Files.

The XMP Core enables the parsing, manipulating and serializing of XMP data, and the XMP Files enables the reading, rewriting, and injecting serialized XMP into the multiple file formats. The XMP Files can be thought of as a "file I/O" component for reading and writing the metadata that is manipulated by the XMP Core component.

Supported development environments for Adobe’s XMP 4.1 are: XCode 2.3 for Macintosh universal binaries, Visual Studio 2005 (VC8) for Windows, and Eclipse 3.x on any available platform. The XMP Core is available as C++ and Java sources with project files for the Macintosh, Windows and Linux platform. A Java version of XMP Files is under consideration for a future update."

And now I just read that last sentence again: "A Java version of XMP Files is under consideration for a future update." So, how hard do they really want to make uptake of XMP? I'm surprised that full Java support is still only under consideration, and that they're not offering anything at all in the way of support for glue languages such as Perl, Python, or Ruby.

Which leads to the question: Is anybody here using XMP, and do you have any successes to relate or lessons for the rest of us?

January 23, 2007

Use of PRISM in RSS

Was rooting around for some information and stumbled across this page which may be of interest:

http://googlereader.blogspot.com/2006/08/namespaced-extensions-in-feeds.html
Namespaced Extensions in Feeds
Thursday, August 03, 2006
posted by Mihai Parparita

“I wrote a small MapReduce program to go over our BigTable and get the top 50 namespaces based on the number of feeds that use them.”

% of Feeds   Namespace     URI
29.36%       Dublin Core   http://purl.org/dc/elements/1.1/
0.21%        PRISM         http://prismstandard.org/namespaces/1.2/basic/

Seems quite an impressive percentage for PRISM.
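For anyone wanting to run the same kind of tally over their own handful of feeds, here is a rough Python sketch of the idea. It is not the Google Reader MapReduce job, just a small stand-in under some assumptions: the two sample feeds and the function names are invented for illustration, and a namespace only counts if at least one element in the feed actually uses it (namespaced attributes and unused declarations are ignored).

```python
import xml.etree.ElementTree as ET
from collections import Counter

DC = "http://purl.org/dc/elements/1.1/"
PRISM = "http://prismstandard.org/namespaces/1.2/basic/"

# Hypothetical RSS 1.0 feed using both Dublin Core and PRISM elements.
FEED_WITH_PRISM = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://purl.org/rss/1.0/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:prism="http://prismstandard.org/namespaces/1.2/basic/">
  <item rdf:about="http://example.org/article/1">
    <title>Sample article</title>
    <dc:creator>A. Author</dc:creator>
    <prism:publicationName>Sample Journal</prism:publicationName>
    <prism:volume>1</prism:volume>
  </item>
</rdf:RDF>"""

# Hypothetical RSS 2.0 feed using Dublin Core only.
FEED_DC_ONLY = """<rss version="2.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Sample blog</title>
    <item>
      <title>Sample post</title>
      <dc:creator>B. Blogger</dc:creator>
    </item>
  </channel>
</rss>"""


def namespaces_in_feed(feed_xml):
    """Return the set of namespace URIs used by elements in one feed."""
    uris = set()
    for element in ET.fromstring(feed_xml).iter():
        # ElementTree spells qualified names as "{namespace-uri}localname".
        if element.tag.startswith("{"):
            uris.add(element.tag[1:].split("}", 1)[0])
    return uris


def feed_share(feeds):
    """Map each namespace URI to the percentage of feeds that use it."""
    tally = Counter()
    for feed_xml in feeds:
        tally.update(namespaces_in_feed(feed_xml))
    return {uri: 100.0 * count / len(feeds) for uri, count in tally.items()}


shares = feed_share([FEED_WITH_PRISM, FEED_DC_ONLY])
for uri in (DC, PRISM):
    print(f"{shares[uri]:.2f}%  {uri}")
```

Each feed contributes at most once per namespace, so the figures are shares of feeds rather than element counts, which matches the "% of Feeds" column in the table above.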

October 3, 2006

AdsML

A new version of the AdsML Framework 2.0, Release 8 from the AdsML Consortium is now available for download from http://www.cnet.se/adsml.

Below is an extract from the "Vision" document which outlines the broad goals of AdsML.

Continue reading "AdsML" »