
May 30, 2009

Search Web Service

search-web-service.png

While the OASIS Search Web Services TC is currently working towards reconciling SRU and OpenSearch, I thought it would be useful to share here a simple graphic outlining how a search web service for structured search might be architected.

Basically there are two views of this search web service, an SRU view and an OpenSearch view. Each is described in a separate XML description file and is discoverable through autodiscovery links added to HTML pages.

One can see at a glance that there's more happening down in the SRU layer. The SRU layer implements a heavyweight, robust service which provides a detailed listing of search indexes and index relations in the description document ('SRU Explain'), is searchable using a standard query grammar - CQL ('Contextual Query Language'), responds with result sets inside a standard XML wrapper and expressed as an XML record set (e.g. PAM) that is validatable using W3C XML Schema, and makes available a full roster of diagnostics.

By contrast, the OpenSearch layer provides a lightweight view onto the search web service in which a simple opaque query string is sent to the server and a simple XML result set is returned (usually RSS or Atom). Again a description document is made available ('OpenSearch Description'), but this is much more coarse-grained than the SRU description - e.g. it does not specify query components such as indexes or relations.

In practice, both views can be provided for by the same search web service. While OpenSearch does not specify any structured query it can make use of a CQL packaged query. That is, a single parameter value for the OpenSearch 'query' parameter can be unpacked by a CQL parser to yield a complex search query. The search query does not need to be splattered all over the URL querystring which is already using its parameter set to provide control information for the search (e.g. pagination, encoding and the like).
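
As a minimal sketch of this packaging (not taken from any particular implementation), the Python below places a whole CQL query in the single 'query' parameter and keeps pagination in separate control parameters. The endpoint URL and the control parameter names (startPage, count) are assumptions made for illustration; a real service would define its own in its description document.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint, for illustration only.
BASE_URL = "http://example.org/opensearch"


def build_search_url(cql_query, start_page=1, count=10):
    """Pack a complete CQL query into the single 'query' parameter."""
    params = {
        "query": cql_query,       # the structured search, as one opaque value
        "startPage": start_page,  # control information (assumed name)
        "count": count,           # control information (assumed name)
    }
    return BASE_URL + "?" + urlencode(params)


def unpack_query(url):
    """Server side: recover the CQL string, ready to hand to a CQL parser."""
    return parse_qs(urlsplit(url).query)["query"][0]


cql = 'dc.creator = "Jones-Smith" and prism.publicationName = "Nature"'
url = build_search_url(cql)
print(url)
print(unpack_query(url))
```

The structured query travels as one value, so the rest of the querystring stays free for control information.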

And how would this relate to existing platform-hosted search services? Well, such services are usually bound to the host platform and are not intended to support remote applications. A search web service, on the other hand, would be ideally suited to offering direct support for running structured searches on platform-hosted content using off-platform apps.

Structured Search Using PRISM Elements

We just registered in the SRU (Search and Retrieve by URL) search registry the following components:

Context Sets
Schemas
This means that an SRU search engine that supported one of the PRISM context sets registered above could accept CQL queries such as the following:
  1. prism.doi = "10.1038/nature05398"
  2. prism.publicationName = "Nature" and prism.volume = "444" and prism.number = "7119" and prism.startingPage = "E9"
  3. dc.identifier = "doi:10.1038/nature05398"
  4. dc.creator = "Jones-Smith" and prism.publicationName = "Nature" and prism.publicationDate > "2006-01-01"
  5. dc.title any "fractal pollock" and prism.publicationName = "Nature" sortBy prism.publicationDate/sort.descending
  6. "fractal analysis" and prism.publicationDate within "2005-01-01 2008-12-31" sortBy dc.creator/sort.ascending
(Note that the quotes are required above for the DOI strings, which contain a "/" character, and for terms that contain whitespace, such as "fractal pollock" and the date range in #6. Otherwise they are optional in the above examples.)

Any query such as one of the above (here #1) could be sent to the server on a querystring like so:

?version=1.1&operation=searchRetrieve&query=prism.doi=%2210.1038/nature05398%22
and if the server were also equipped to respond with PAM (PRISM Aggregator Message) format for result records, a response might look like this:

fractal-analysis-pam.jpg
PAM was discussed here earlier.


Such a structured response would provide the metadata elements for applications to build various interfaces into the original article:
fractal-analysis.jpg
We think that these PRISM components (context sets and schemas) will be useful for structured search of scholarly publications.
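
For readers who want to try this, here is a minimal Python sketch that assembles the searchRetrieve querystring shown earlier for query #1. The function name is made up for illustration. The choice to leave "/" and "=" unescaped is only there to reproduce the querystring exactly as printed in this post; percent-encoding them would be equally valid.

```python
from urllib.parse import urlencode, quote


def sru_querystring(cql_query, version="1.1"):
    """Build an SRU searchRetrieve querystring for a CQL query."""
    params = {
        "version": version,
        "operation": "searchRetrieve",
        "query": cql_query,
    }
    # safe="/=" leaves the slash and equals sign as-is, as in the example
    # in this post; the double quotes are still percent-encoded as %22.
    return "?" + urlencode(params, quote_via=quote, safe="/=")


print(sru_querystring('prism.doi="10.1038/nature05398"'))
```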

May 26, 2009

OAI-ORE: Workshop Slides

This is a very slick presentation by Herbert Van de Sompel on OAI-ORE which he's due to give today for a workshop at the INFORUM 2009 15th Conference on Professional Information Resources in Prague. It's on the long side at 167 slides, but even if you just flip through or sample it selectively you'll be bound to come away with something.

Describing aggregations of resources is a subject that really has to be of interest to CrossRef publishers.

May 8, 2009

PRISM Aggregator Message

The new OAI-PMH interface to Nature.com sports one particular novelty which may well be of interest here: it makes use of the PRISM Aggregator Message. (For an announcement of this service see the post on our web publishing blog Nascent.)

OAI-PMH is a protocol for the harvesting of metadata records within a digital repository, and its records may be expressed in a variety of different metadata formats. For reasons of interoperability a base metadata format ('Dublin Core') is mandated for all OAI-PMH implementations. The expectation is that this base format would be augmented by community-specific vocabularies.

Our natural inclination was to mirror the article descriptions which we already circulate in our RSS feeds and within our HTML pages (as META tags) and PDF files (as XMP packets). In these cases we have used open data models (e.g. RDF) with simple properties cherry-picked from the DC and PRISM namespaces. But OAI-PMH has a special 'gotcha' in this regard: any metadata format must allow for W3C XML Schema validation. That is, the properties need to be constrained by an XSD data model. Enter PRISM Aggregator Message (PAM).


For the longest time I must confess I did not 'get' what PAM was about. PRISM was clearly a metadata vocabulary, and yet with PAM there was all this wrangling with content. As an academic publisher we frankly had no interest in that, as we already had our own journal article DTD and for interop we were beginning to look at the NLM DTD. And then it dawned on me (albeit slowly) that the PAM DTD is the equivalent of the NLM DTD but for trade magazine publishing, where there might not be such a strong practice of XML. And since the release of PRISM 2.0 (February 2008) there has also been a W3C XML Schema defined for PAM. (Note that the latest revision, PRISM 2.1, is about to be published, although the changes there do not have any bearing on this implementation.)

So, PAM defines PRISM elements to be used with XML content markup. A closer look reveals that within a PAM message there are one or more articles, each with metadata packaged into a head section and content (if present) in a body section.

pam-message.png

Section 4.3 in the PAM 2.0 specification lists the allowable head elements by logical grouping, 11 in all: key elements, title, creative origin, publication, publication date, additional article ID, positional, topic, length, related content, rights & usage. Note that not all PRISM elements are supported; in fact only 43 of the 57 PRISM 2.0 elements are supported. Among the missing are 'prism:endingPage'. Also only 7 of the 15 DC elements are supported. Nevertheless we found that the bulk of the article descriptions could easily be accommodated within the PAM format. And because this is W3C XML Schema constrained there is an element ordering prescribed, and hence there is an interleaving of DC and PRISM elements.

The Nature.com OAI-PMH service has two access points:

User interface:
http://www.nature.com/oai

Service endpoint:
http://www.nature.com/oai/request

So, to work an example, if we want to get the record for doi:10.1038/nature01234 (which has an OAI-PMH identifier of oai:nature.com:10.1038/nature01234) we could use this call to get the description in PAM format:

http://www.nature.com/oai/request?verb=GetRecord&identifier=10.1038/nature01234&metadataPrefix=pam

(Note that as a convenience for the user we also allow a DOI to be used directly in place of the full OAI-PMH identifier, as there is a one-to-one correspondence between the two within our repository. This simplifies cut-and-paste operations.)
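
A minimal Python sketch of the call above, for anyone scripting against the service. The helper names are made up for illustration, and the DOI-to-identifier mapping simply assumes the oai:nature.com: prefix shown in the example. The code only builds the request URL; it does not contact the server.

```python
from urllib.parse import urlencode

ENDPOINT = "http://www.nature.com/oai/request"  # service endpoint given above
OAI_PREFIX = "oai:nature.com:"                  # prefix seen in the example


def oai_identifier(doi):
    """Map a DOI (with or without a 'doi:' prefix) to its OAI-PMH identifier."""
    if doi.startswith("doi:"):
        doi = doi[len("doi:"):]
    return OAI_PREFIX + doi


def get_record_url(identifier, metadata_prefix="pam"):
    """Build a GetRecord request URL for a bare DOI or a full OAI identifier."""
    params = {
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    }
    # Keep "/" and ":" readable, as in the example call in this post.
    return ENDPOINT + "?" + urlencode(params, safe="/:")


print(get_record_url("10.1038/nature01234"))
print(get_record_url(oai_identifier("doi:10.1038/nature01234")))
```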

This returns the following properties (shown in document order and by PAM logical grouping):

pam-elements.jpg

With PAM we are thus able to replicate in OAI-PMH the same journal article descriptions that we are currently disseminating through other service/content channels.

May 6, 2009

CrossRef's OpenURL query interface

Over the past two weeks we've focused on our OpenURL query interface with the goal of improving its reliability. I'd like to mention some things we've done.

1) We now require an OpenURL account to use this interface (see the registration page). This account is still free, there are no fixed usage limits, and the terms of use have been greatly simplified.

2) Resources have been rearranged to dedicate more horsepower to the OpenURL function.

3) The OpenURL function is now covered by our advanced monitoring, which means some lucky staff member will be getting phone calls at 3AM (me included!).

I should note that #1 has already reduced inappropriate usage. This is also not the end of the planned changes. CrossRef has undertaken a major rewrite of parts of our system and this will include the OpenURL interface.

Chuck

May 1, 2009

OCLC defines requirements for a "Cooperative Identities Hub"

OCLC has published a report (PDF) identifying some requirements for what they call a "Cooperative Identities Hub". A quick glance through it seems to show that the use cases focus on what we are calling the "Knowledge Discovery" use cases. As I mentioned in my interview with Martin Fenner, there is also a category of "authentication" use cases that I think needs to be addressed by a contributor identifier system. Still, this is a good report that highlights many of the complexities that an identifier system needs to address.