
(Click image for full size graphic.)
I thought I could take this opportunity to demonstrate one evolution path from traditional record-based search to a more contemporary triple-based search. The aim is to show that these two modes of search do not have to be alternative approaches but can co-exist within a single workflow.
Let me first mention a couple of terms I’m using here: ‘graphs’ and ‘properties’. I’m using ‘property’ loosely to refer to the individual RDF statement (or triple) containing a property, i.e. a triple is a ‘(subject, property, value)’ assertion. And a ‘graph’ is just a collection of ‘properties’ (or, more properly, triples). Oh, and I’ll also use the term ‘records’ when considering ‘graphs’ as pre-fabricated objects returned within a result set.
So, what do we have here? We have on the left a traditional means of disseminating search results which is typically record based. A new set of records may be generated by querying using the API provided, whether proprietary or public such as Lucene or SRU/CQL. We can thus consider this search service as a ‘record store’ – even though records tend to be generated anew rather than retrieved. The individual records in the result set are collections or groupings of ‘properties’ about the subjects of the query. Note that this is somewhat similar to the way music is packaged for physical distribution with many tracks (‘properties’) combined onto a single album (‘record’ or ‘graph’) which has a thematic coherence – either the same artist or a compilation around a given topic.
Digital music distribution, on the other hand, allows for albums to be atomized so that individual tracks may be cherry-picked at will. This is not dissimilar from what happens in a ‘triple store’ where the basic properties (‘tracks’) that in a regular search engine were together combined in a ‘record’ (‘album’) to present a search result can now be plucked apart and recombined into newer bespoke ensembles. Note that this querying and recombination can be applied across the full triple store or even across this triple store and remote triple stores since the same data model is applied. Certainly, at the data model level federated searching thus becomes a non-issue.
Suppose now that our search server (or record store) is an OpenSearch-type service, i.e. the result sets are distributed as some list-based format, typically RSS, and that the list-based format either provides an RDF graph or can be transformed to such a graph, we could then use that as a basis for feeding an RDF triple store.
So now, at right, we have a triple store, which is a large database of triples (or properties) compiled from all the records in the record store. And since this is a triple store we can query it using SPARQL. For example, this trivial SPARQL query:
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX prism: <http://prismstandard.org/namespaces/basic/2.0/>
SELECT ?doi ?title
WHERE {
?s prism:doi ?doi .
?s dc:title ?title .
FILTER regex(?title, "boson", "i" )
}
LIMIT 5
returns the first five articles (referenced by DOI) with title containing the word ‘boson’:
--------------------------------------------------------------------------------------------------
| doi                   | title                                                                  |
==================================================================================================
| "10.1038/nature05513" | "Comparison of the Hanbury Brown–Twiss effect for bosons and fermions" |
| "10.1038/221999a0"    | "Physics: The Intermediate Boson"                                      |
| "10.1038/313506b0"    | "The nuts and bolts of bosons"                                         |
| "10.1038/301287a0"    | "The search for bosons: A golden year for the weak force"              |
| "10.1038/424003a"     | "Below-par performance hampers Fermilab quest for Higgs boson"         |
--------------------------------------------------------------------------------------------------
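To make the join-and-filter logic of that query concrete, here is a minimal sketch in plain Python, with no RDF library involved. It treats the triple store as a list of (subject, property, value) tuples. The function names, the use of dx.doi.org URIs as subjects, and the last (non-matching) article are my assumptions for illustration only; the other DOIs and titles are taken from the result table above.

```python
import re

DC_TITLE = "http://purl.org/dc/elements/1.1/title"
PRISM_DOI = "http://prismstandard.org/namespaces/basic/2.0/doi"

# Three rows from the result table, plus one hypothetical non-matching article.
ARTICLES = [
    ("10.1038/nature05513",
     "Comparison of the Hanbury Brown–Twiss effect for bosons and fermions"),
    ("10.1038/221999a0", "Physics: The Intermediate Boson"),
    ("10.1038/313506b0", "The nuts and bolts of bosons"),
    ("10.1000/hypothetical", "An article about something else"),  # invented
]


def build_triples(articles):
    """Turn (doi, title) pairs into (subject, property, value) triples."""
    triples = []
    for doi, title in articles:
        subject = "http://dx.doi.org/" + doi  # assumed subject URI scheme
        triples.append((subject, PRISM_DOI, doi))
        triples.append((subject, DC_TITLE, title))
    return triples


def select_doi_title(triples, pattern, limit=5):
    """Mimic the query: join prism:doi and dc:title on ?s, filter, limit."""
    # Simplification: assumes at most one DOI per subject.
    dois = {s: v for s, p, v in triples if p == PRISM_DOI}
    rows = []
    for s, p, v in triples:
        if p == DC_TITLE and s in dois and re.search(pattern, v, re.IGNORECASE):
            rows.append((dois[s], v))
    return rows[:limit]


if __name__ == "__main__":
    for doi, title in select_doi_title(build_triples(ARTICLES), "boson"):
        print(doi, "|", title)
```

A real triple store would of course index the triples rather than scan a list, but the shape of the operation (match two patterns on a shared subject, filter, truncate) is the same.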
Now let’s contrast this with a conventional record-based search, such as shown at left. Finding the first five articles (referenced by DOI) with a title containing the word ‘boson’ would use a query (here SRU/CQL, with the CQL carried in the ‘query’ parameter) such as:
?query=dc.title="boson"&maximumRecords=5&httpAccept=application/rss+xml
and would receive a set of result records (here RSS) like so:
...
<item rdf:about="http://dx.doi.org/10.1038/nature05513">
  <title>Comparison of the Hanbury Brown–Twiss effect for bosons and fermions</title>
  <link>http://dx.doi.org/10.1038/nature05513</link>
  <dc:identifier>doi:10.1038/nature05513</dc:identifier>
  <dc:title>Comparison of the Hanbury Brown–Twiss effect for bosons and fermions</dc:title>
  ...
</item>
<item rdf:about="http://dx.doi.org/10.1038/221999a0">
  <title>Physics: The Intermediate Boson</title>
  <link>http://dx.doi.org/10.1038/221999a0</link>
  <dc:identifier>doi:10.1038/221999a0</dc:identifier>
  <dc:title>Physics: The Intermediate Boson</dc:title>
  ...
</item>
...
Note also that there is an interesting halfway house as shown in the diagram, whereby a set of result records presenting a single RDF graph can be queried as its own (very) restricted triple store.
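As a rough illustration of that halfway house, the sketch below reads an RSS 1.0 result set and flattens each item into (subject, property, value) triples that can then be filtered like a tiny triple store. It uses only the Python standard library. The enclosing rdf:RDF wrapper and its namespace declarations are my assumption about what surrounds the items shown above, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS_NS = "http://purl.org/rss/1.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Assumed wrapper around the two items shown in the result set above.
SAMPLE_FEED = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <item rdf:about="http://dx.doi.org/10.1038/nature05513">
    <title>Comparison of the Hanbury Brown–Twiss effect for bosons and fermions</title>
    <link>http://dx.doi.org/10.1038/nature05513</link>
    <dc:identifier>doi:10.1038/nature05513</dc:identifier>
    <dc:title>Comparison of the Hanbury Brown–Twiss effect for bosons and fermions</dc:title>
  </item>
  <item rdf:about="http://dx.doi.org/10.1038/221999a0">
    <title>Physics: The Intermediate Boson</title>
    <link>http://dx.doi.org/10.1038/221999a0</link>
    <dc:identifier>doi:10.1038/221999a0</dc:identifier>
    <dc:title>Physics: The Intermediate Boson</dc:title>
  </item>
</rdf:RDF>"""


def feed_to_triples(feed_xml):
    """Flatten each RSS item into (subject, property URI, literal) triples."""
    root = ET.fromstring(feed_xml)
    triples = []
    for item in root.iter("{%s}item" % RSS_NS):
        subject = item.get("{%s}about" % RDF_NS)
        for child in item:
            # ElementTree tags look like "{namespace}local"; join the two parts.
            prop = child.tag[1:].replace("}", "")
            triples.append((subject, prop, (child.text or "").strip()))
    return triples


if __name__ == "__main__":
    triples = feed_to_triples(SAMPLE_FEED)
    # Query the result set as its own restricted triple store.
    for s, p, v in triples:
        if p == DC_NS + "title" and "boson" in v.lower():
            print(s, "|", v)
```

The same list of triples could equally be loaded into a full triple store, which is the feeding step described earlier in this post.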
In general, because a triple store holds such primitive units, and because it can be queried alongside other triple stores, queries can be highly complex and can draw on arbitrary data. The result of such a query differs from a traditional ‘record’, where a fixed property set is bound together in a presentation: it is user-determined, as opposed to the server-determined nature of traditional result ‘records’.
I hope that this post has been able to show in some degree that although there are some obvious differences there is nevertheless a synergy between these two modes of searching: prêt-à-porter and tailored.
[See this link if you're short on time: facets search client. Only tested on Firefox at this point. Caveat: At time of writing the CrossRef Metadata Search was being very slow but was still functional. Previously it was just slow.]
Following on from Geoff's announcement last month of a prototype CrossRef Metadata OpenSearch on labs.crossref.org, I wanted to show what typical OpenSearch responses might look like in a more mature implementation.
I have taken the liberty of modelling these on the response formats that we are already providing in our nature.com OpenSearch service which in turn are based on the draft syndication formats that I blogged here earlier.
I am therefore returning ATOM, JSON, JSONP and RSS responses from these four OpenSearch URL templates:
An example query ('apple') returning an ATOM feed from a CrossRef Metadata OpenSearch would be the following:
And the same query returning a JSON version of that ATOM feed would look as follows:
By the way, this is just for demonstration purposes and there are still issues to be resolved including character encoding.
This interface uses the existing CrossRef OpenSearch response format and parses the COinS objects embedded in that response to provide a more standard OpenSearch syndication result set format. The prototype implementation also has some bugs which I needed to work around. (I will forward on details of these.) And there is also a more fundamental issue of response time from the experimental search server.
But still this should give some idea of what a CrossRef Metadata OpenSearch service could look like.
To show this all in action I've worked up one of my demo OpenSearch clients for nature.com OpenSearch which displays a faceted search response for a CrossRef search. For good measure this also includes an OpenSearch interface for PubMed, and the search client allows for simple selection between three journal databases: nature.com, CrossRef and PubMed.
Of course, with a reasonably uniform set of search result formats such as presented here it then becomes a simple exercise to reuse these search responses in additional search clients.
As can be anticipated it would be very straightforward to carry this over into a single metasearch service which could run across these multiple databases.

Following on from my recent post about our shiny new nature.com OpenSearch service we just put up a cheatsheet for users. I'm posting about this here as this may also be of interest especially to those exploring how SRU and OpenSearch intersect.
The cheatsheet can be downloaded from our nature.com OpenSearch test page and is available in two forms:
Naturally, all comments welcome.

(Click panels in figure to read related posts.)
Following up on my earlier posts here about the structured search technologies OpenSearch and SRU, I wanted to reference three recent posts on our web publishing blog Nascent which discuss our new nature.com OpenSearch service:
In an earlier post I talked about using the PAM (PRISM Aggregator Message) schema for an SRU result set. I have also noted in another post that a Search Web Service could support both SRU and OpenSearch interfaces. This does then beg the question of what a corresponding OpenSearch result set might look like for such a record.
Based on the OpenSearch spec and also on a new Atom extension for SRU, I have contrived to show how a PAM record might be returned in a common OpenSearch format. Below I offer some mocked-up examples for each of the following formats for review purposes:
cql.keywords adj "solar eclipse"
In this example we imagine that two records have been requested. (The example formats also include navigational links as per the OpenSearch spec examples.)
Note that the JSON example closely follows the ATOM schema with a couple of main deviations:
It would be interesting to hear what readers think of these examples - especially the JSON format.
| RSS 1.0 | ATOM | JSON |
|---|---|---|
As posted here on the SRU Implementors list, the OASIS Search Web Services Technical Committee has announced the release of drafts of SRU and CQL version 2.0:
The Committee is soliciting feedback on these two documents. Comments should be posted to the SRU list by August 13.

[Update - 2009.06.07: As pointed out by Todd Carpenter of NISO (see comments below) the phrase "SRU by contrast is an initiative to update Z39.50 for the Web" is inaccurate. I should have said "By contrast SRU is an initiative recognized by ZING (Z39.50 International Next Generation) to bring Z39.50 functionality into the mainstream Web".]
[Update - 2009.06.08: Bizarrely I find in mentioning query languages below that I omitted to mention SQL. I don't know what that means. Probably just that there's no Web-based API. And that again it's tied to a particular technology - RDBMS.]
There are two well-known public search APIs for generic Web-based search: OpenSearch and SRU. (Note that the key term here is "generic", so neither Solr/Lucene nor XQuery really qualify for that slot. Also, I am concentrating here on "classic" query languages rather than on semantic query languages such as SPARQL.)
OpenSearch was created by Amazon's A9.com and is a cheap and cheerful means to interface to a search service by declaring a template URL and returning a structured XML format. It therefore allows for structured result sets while placing no constraints on the query string. As outlined in my earlier post Search Web Service, there is support for search operation control parameters (pagination, encoding, etc.), but no inroads are made into the query string itself which is regarded as opaque.
SRU by contrast is an initiative to update Z39.50 for the Web and is firmly focussed on structured queries and responses. Specifically a query can be expressed in the high-level query language CQL which is independent of any underlying implementation. Result records are returned using any declared W3C XML Schema format and are transported within a defined XML wrapper format for SRU. (Note that the SRU 2.0 draft provides support for arbitrary result formats based on media type.)
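For readers who have not seen an SRU request, here is a minimal sketch of how a CQL query might be packaged into a searchRetrieve URL. The parameter names (version, operation, query, maximumRecords) are the SRU 1.1 ones; the base URL is a placeholder and the function name is hypothetical.

```python
from urllib.parse import urlencode


def sru_search_url(base_url, cql, maximum_records=5, version="1.1"):
    """Build an SRU searchRetrieve URL carrying a CQL query string."""
    params = {
        "version": version,
        "operation": "searchRetrieve",
        "query": cql,  # the CQL expression, percent-encoded by urlencode
        "maximumRecords": str(maximum_records),
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # "http://example.org/sru" is a placeholder, not a real endpoint.
    print(sru_search_url("http://example.org/sru", 'dc.title="boson"'))
```

Note that the whole CQL expression travels as the value of a single parameter, which is what leaves the rest of the querystring free for control information.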
One can summarize the respective OpenSearch and SRU functionalities as in this table:
| Structure | OpenSearch | SRU |
|---|---|---|
| query | no | yes |
| results | yes | yes |
| control | yes | yes |
| diagnostics | no | yes |
What I wanted to discuss here was the OpenSearch and SRU interfaces to a Search Web Service such as outlined in my previous post. The diagram at top of this post shows query forms for OpenSearch and SRU and associated result types. The Search Web Service is taken to be exposing an SRU interface. It might be simplest to walk through each of the cases.
(Continues below.)

(Click image to enlarge graphic.)
While the OASIS Search Web Services TC is currently working towards reconciling SRU and OpenSearch, I thought it would be useful to share here a simple graphic outlining how a search web service for structured search might be architected.
Basically there are two views of this search web service (described in separate XML description files and discoverable through autodiscovery links added to HTML pages):
One can see at a glance that there's more happening down in the SRU layer. The SRU layer implements a heavyweight, robust service which provides a detailed listing of search indexes and index relations in the description document ('SRU Explain'), is searchable using a standard query grammar - CQL ('Contextual Query Language'), responds with result sets inside a standard XML wrapper and expressed as an XML record set (e.g. PAM) that is validatable using W3C XML Schema, and makes available a full roster of diagnostics.
By contrast the OpenSearch layer provides a lightweight view onto the search web service in which a simple opaque query string is sent to the server and a simple XML result set returned (usually RSS or Atom). Again a description document is made available ('OpenSearch Description') but this is much more coarse grained than the SRU description - e.g. it does not specify query components such as indexes or relations.
In practice, both views can be provided for by the same search web service. While OpenSearch does not specify any structured query it can make use of a CQL packaged query. That is, a single parameter value for the OpenSearch 'query' parameter can be unpacked by a CQL parser to yield a complex search query. The search query does not need to be splattered all over the URL querystring which is already using its parameter set to provide control information for the search (e.g. pagination, encoding and the like).
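To sketch what a CQL packaged query means in practice, the snippet below fills an OpenSearch URL template with a CQL expression as the single search-terms value. The {searchTerms}, {count} and {startPage} placeholders and the trailing '?' for optional parameters come from the OpenSearch URL template syntax; the template URL itself and the function name are hypothetical.

```python
import re
from urllib.parse import quote


def fill_template(template, values):
    """Substitute {name} and optional {name?} placeholders in a URL template."""
    def substitute(match):
        name, optional = match.group(1), match.group(2)
        if name in values:
            # Percent-encode everything so the CQL stays inside one parameter.
            return quote(str(values[name]), safe="")
        if optional:
            return ""  # optional parameters may be left empty
        raise KeyError("missing required template parameter: " + name)

    return re.sub(r"\{([A-Za-z:]+)(\??)\}", substitute, template)


if __name__ == "__main__":
    # Placeholder template, not a real service.
    template = ("http://example.org/opensearch"
                "?query={searchTerms}&count={count?}&startPage={startPage?}")
    print(fill_template(template, {
        "searchTerms": 'cql.keywords adj "solar eclipse"',
        "count": 5,
    }))
```

On the server side the value of 'query' would be handed to a CQL parser, while count and startPage remain ordinary control parameters.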
And how would this relate to existing platform-hosted search services? Well, such services are usually bound to the host platform and are not intended to support remote applications. A search web service, on the other hand, would be ideally suited to offering direct support for running structured searches on platform-hosted content using off-platform apps.
We just registered in the SRU (Search and Retrieve by URL) search registry the following components:
?version=1.1&operation=searchRetrieve&query=prism.doi=%2210.1038/nature05398%22
and if the server were also equipped to respond with PAM (PRISM Aggregator Message) format for result records, a response might look like this:


As posted here on the SRU Implementors list, the OASIS Search Web Services Technical Committee has announced the release of five Committee Drafts, informally known as:
The next phase of work for the TC will be the development of SRU/CQL 2.0, and the Description Language.
The recently discussed (announced?) Google Knol project could make Google Scholar look like a tiny blip in the scholarly publishing landscape.
I love the comment on authority:
"Books have authors' names right on the cover, news articles have bylines, scientific articles always have authors -- but somehow the web evolved without a strong standard to keep authors names highlighted. We believe that knowing who wrote what will significantly help users make better use of web content."
And so I suppose this means they are assigning author identifiers....
The OASIS Search Web Services TC has just put out the following document for public review (Nov 7- Dec 7, 2007):
Search Web Services v1.0 Discussion Document
From the OASIS announcement:
"This document: "Search Web Services Version 1.0 - Discussion Document - 2 November 2007", was prepared by the OASIS Search Web Services TC as a strawman proposal, for public review, intended to generate discussion and interest. It has no official status; it is not a Committee Draft. The specification is based on the SRU (Search Retrieve via URL) specification which can be found at http://www.loc.gov/standards/sru/. It is expected that this standard, when published, will deviate from SRU. How much it will deviate cannot be predicted at this time. The fact that the SRU spec is used as a starting point for development should not be cause for concern that this might be an effort to rubberstamp or fasttrack SRU. The committee hopes to preserve the useful features of SRU, eliminate those that are not considered useful, and add features that are not in SRU but are considered useful. "
ACAP has released some documents outlining the use cases they will be testing and some proposed changes to the Robots Exclusion Protocol (REP) - both robots.txt and META tags. There are some very practical proposals here to improve search engine indexing. However, the only search engine publicly participating in the project is http://www.exalead.com/ (which according to Alexa attracted 0.0043% of global internet visits over the last three months). The main docs are "ACAP pilot Summary use cases being tested", "ACAP Technical Framework - Robots Exclusion Protocol - strawman proposals Part 1", "ACAP Technical Framework - Robots Exclusion Protocol - strawman proposals Part 2", "ACAP Technical Framework - Usage Definitions - draft for pilot testing".
What would cause other search engines to recognize the ACAP protocols rather than ignore them? A lot of publishers implementing this and requiring search engines to recognize it to index content could put pressure on the engines. Maybe.
From Ray Denenberg's post to the SRU Listserv yesterday:
"The new SRU web site is now up: http://www.loc.gov/sru/
It is completely reorganized and reflects the version 1.2 specifications.
(It also includes version 1.1 specifications, but is oriented to version 1.2.)...
There is an official 1.1 archive under the new site,
http://www.loc.gov/sru/sru1-1archive/. And note also, that the new spec incorporates both version 1.1 and 1.2 (anything specific to version 1.1 is annotated as such)."
OASIS has just announced a technical committee for standardising search services. This from the Call for Participation:
b. Purpose
To define Search and Retrieval Web Services, combining various current and
ongoing web service activities.
Within recent years there has been a growth in activity in the development of
web service definitions for search and retrieval applications. These include
SRU, a web service based in part on the NISO/ISO Search and Retrieval standards;
the Amazon OpenSearch, which defines a means of describing and automating search
web forms; as well as many proprietary definitions (e.g. the Google and MSN
Search APIs). There are also a number of activities for defining abstract search
APIs that can be mapped onto multiple implementations either within native code
or onto remote procedural calls and web services, such as ZOOM (Z39.50 Object
Oriented Model); SQI (Simple Query Interface), an IEEE standard developed for
searching and retrieval in the IMS (Instructional Management Systems) space; and
OSIDs (Open Service Interface Definitions) from the Open Knowledge Initiative.
While abstract APIs would be out of scope, these would inform the work to
increase interoperability and compatibility.
Update: All apologies to Google. Apparently this was a problem at our end which our IT folks are currently investigating. (And I thought it was just me. :)
Just managed to get this page:
"Google Error
We're sorry...
... but your query looks similar to automated requests from a computer virus or spyware application. To protect our users, we can't process your request right now.
We'll restore your access as quickly as possible, so try again soon. In the meantime, if you suspect that your computer or network has been infected, you might want to run a virus checker or spyware remover to make sure that your systems are free of viruses and other spurious software.
We apologize for the inconvenience, and hope we'll see you again on Google.
To continue searching, please type the characters you see below:"
And my search request?
ark
(Actual query is here as argument to the continue parameter.)
Was hoping to find results related to The ARK Persistent Identifier Scheme. Maybe I missed something but I'm not impressed.
Nelson Minar has a short post on Google's Search History 'feature' and how it can be used to enhance your search experience. I guess that should be SearchULike.