July 28, 2008

Does Size Matter?

Interesting post from Google, in which they say:

"Recently, even our search engineers stopped in awe about just how big the web is these days -- when our systems that process links on the web to find new content hit a milestone: 1 trillion (as in 1,000,000,000,000) unique URLs on the web at once!"
Puts CrossRef's 32,639,020 unique DOIs into some kind of perspective: 0.0033%. But nonetheless that trace percentage still seems to me to be reasonably large, especially in view of it forming a persistent and curated set.
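
As a quick check on that percentage, here is a minimal Python sketch using only the two figures quoted above (the variable and function names are just illustrative):

```python
# Figures quoted in the post: CrossRef's DOI count and Google's URL milestone.
CROSSREF_DOIS = 32_639_020
GOOGLE_URLS = 1_000_000_000_000


def share_of_web(dois: int, urls: int) -> float:
    """Return DOIs as a fraction of all known URLs."""
    return dois / urls


share = share_of_web(CROSSREF_DOIS, GOOGLE_URLS)
# Format as a percentage with four decimal places.
print(f"{share:.4%}")  # prints 0.0033%
```

That is roughly one DOI for every 30,000 or so URLs.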

Update: Talking of Google numbers, Pingdom has a post, "Map of all Google data center locations", with maps of US, Europe and World locations.

July 19, 2007

Publishing Linked Data

With these words:

"There was quite some interest in Linked Data at this year's World Wide
Web Conference (WWW2007). Therefore, Richard Cyganiak, Tom Heath and I
decided to write a tutorial about how to publish Linked Data on the
Web, so that interested people can find all relevant information, best
practices and references in a single place."

Chris Bizer announces this draft How to Publish Linked Data on the Web. It's a bright and breezy tutorial and useful (to me, anyway) for disclosing a couple of links:

The tutorial is unsurprisingly orthodox in its advocacy for all things HTTP and goes on to say:

"In the context of Linked Data, we restrict ourselves to using HTTP URIs only and avoid other URI schemes such as URNs and DOIs."

But this only relates back to Berners-Lee's piece on Linked Data referenced above in which he says:
"The second rule, to use HTTP URIs, is also widely understood. The only deviation has been, since the web started, a constant tendency for people to invent new URI schemes (and sub-schemes within the urn: scheme) such as LSIDs and handles and XRIs and DOIs and so on, for various reasons. Typically, these involve not wanting to commit to the established Domain Name System (DNS) for delegation of authority but to construct something under separate control. Sometimes it has to do with not understanding that HTTP URIs are names (not addresses) and that HTTP name lookup is a complex, powerful and evolving set of standards. This issue discussed at length elsewhere, and time does not allow us to delve into it here."

Hmm. Does make one wonder where the concept of URI ever arose. Surely the nascent WWW application should have mandated the exclusive use of HTTP identifiers? Seems that this concept snuck up on us somehow and we now have to put it back into the box. Pandora, indeed!

Back to the tutorial: there are some unorthodox terms, or at least terms I had not heard of before. Contrasted with the defined term "information resources" (from AWWW) is the undefined term "non-information resources". Further on, there's a distinction made between two types of RDF triple: "literal triples" and "RDF links". Both are presented as if they were in common usage. The tutorial then goes on to deprecate the use of certain RDF features because doing so makes it "easier for clients". So, I guess that either the full expressivity of RDF is not required or the world of "linked data" is not quite so large as it would like to be.

And later on, there's this puzzling injunction:

"You should only define terms that are not already defined within well-known vocabularies. In particular this means not defining completely new vocabularies from scratch, but instead extending existing vocabularies to represent your data as required."

Am I wrong, or is there something of a Catch-22 there? To extend an arbitrary vocabulary I would need to be the namespace authority - to be the "URI owner" in W3C speak. But I can't be the authority for all namespaces/vocabularies because, by the intent of the above, there would likely be just the one (true?) vocabulary, which I may or may not be the authority for. I thought the intent of the RDF model and XML namespaces was that terms could be applied from disparate vocabularies to the description at hand.

Anyways, I am not trying to knock the draft. It's something of a curate's egg, that's true, but I am genuinely looking forward to reading it through and would encourage others to have a look at it too.

March 02, 2007

Sir TimBL's Testimony

Just in case anybody may not have seen this, here's the testimony of Sir Tim Berners-Lee yesterday before a House of Representatives Subcommittee on Telecommunications and the Internet. Required reading.

(Via this post yesterday in the Save the Internet blog.)

eprintweb.org

IOP has created an instance of the arXiv repository called eprintweb.org at http://www.eprintweb.org/. What's the difference from arXiv? From the eprintweb.org site - "We have focused on your experience as a user, and have addressed issues of navigation, searching, personalization and presentation, in order to enhance that experience. We have also introduced reference linking across the entire content, and enhanced searching on all key fields, including institutional address."

The site looks very good and it's interesting to see a publisher developing a service directly engaging with a repository.

February 19, 2007

At Last! URIs for InChI

The info registry has now added the InChI namespace (see registry entry here), which means that chemical compounds identified by InChIs (IUPAC's International Chemical Identifiers) are expressible in URI form and thus amenable to the many Web-based description technologies that use URIs as the means to identify objects, e.g. XLink, RDF, etc. As an example, the InChI identifier for naphthalene is

InChI=1/C10H8/c1-2-6-10-8-4-3-7-9(10)5-1/h1-8H

and can now be legitimately expressed in URI form as

info:inchi/InChI=1/C10H8/c1-2-6-10-8-4-3-7-9(10)5-1/h1-8H

The info URI scheme exists to help legacy namespaces get a leg up onto the Web. Registered namespaces include PubMed identifiers, DOIs, handles, ADS bibcodes, etc. Increasingly we'll be expecting to see identifiers (both new and old) represented in a common form - URI.
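
The mapping in the naphthalene example is simple enough to sketch in a few lines of Python. This is only an illustration based on that one example: the function names are made up here, and any normalization or escaping rules that the registry entry may specify are not handled.

```python
# Prefix taken from the naphthalene example in the post.
INFO_INCHI_PREFIX = "info:inchi/"


def inchi_to_info_uri(inchi: str) -> str:
    """Wrap an InChI string in the info:inchi/ namespace (hypothetical helper)."""
    if not inchi.startswith("InChI="):
        raise ValueError("expected a string beginning with 'InChI='")
    return INFO_INCHI_PREFIX + inchi


def info_uri_to_inchi(uri: str) -> str:
    """Recover the InChI string from an info:inchi/ URI (hypothetical helper)."""
    if not uri.startswith(INFO_INCHI_PREFIX):
        raise ValueError("not an info:inchi/ URI")
    return uri[len(INFO_INCHI_PREFIX):]


naphthalene = "InChI=1/C10H8/c1-2-6-10-8-4-3-7-9(10)5-1/h1-8H"
uri = inchi_to_info_uri(naphthalene)
print(uri)
```

Run as is, this prints the same info URI as given above for naphthalene, and info_uri_to_inchi(uri) returns the original InChI string.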

February 01, 2007

RSC launches semantic enrichment of journal articles

The RSC has gone live today with the results of Project Prospect, introducing semantic enrichment of journal articles across all our titles. I'm pretty sure we're the first primary research publisher to do anything of this scope.

We’re identifying chemical compounds and providing synonyms, InChIs (IUPAC's Chemical Identifier), downloadable CML (Chemical Markup Language), SMILES strings and 2D images for these compounds. In terms of subject area we're marking up terms from the IUPAC Gold Book, and also Open Biomedical Ontology terms from the Gene, Cell, and Sequence Ontologies. All this stuff is currently available from an enhanced HTML view, with the additional information and links to related articles accessed via highlights in the article and popups.

The mark-up tools have been developed together with UK academics based at the Unilever Centre of Molecular Informatics and the Computing Laboratory at Cambridge University.

At launch we have about 100 articles from our 2007 publications, with the enhanced views currently free-to-air. Feel free to take a look.

January 08, 2007

What's in a URI?

First off, a Happy New Year to all!

A post of mine to the OpenURL list may possibly be of interest. Following up the recent W3C TAG (Technical Architecture Group) Finding on "The Use of Metadata in URIs", I pointed out that the TAG do not seem to be aware of OpenURL, which is both a standard prescription for including metadata in URI strings and a US information standard to boot.
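
For anyone who hasn't seen metadata carried in a URI this way, here is a rough Python sketch of an OpenURL in the key/encoded-value (KEV) style of Z39.88-2004. The resolver address and the article metadata are invented for illustration, and the build_openurl helper is hypothetical; only the url_ver, rft_val_fmt and rft.* key names are meant to follow the standard's journal format.

```python
from urllib.parse import urlencode

# Hypothetical resolver base URL; a real one would be institution-specific.
RESOLVER_BASE = "http://resolver.example.org/openurl"


def build_openurl(metadata: dict) -> str:
    """Build a KEV-style OpenURL describing a journal article (illustrative only)."""
    params = {
        "url_ver": "Z39.88-2004",                       # OpenURL version tag
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",  # referent metadata format
    }
    # Referent (rft) metadata keys carry the "rft." prefix.
    params.update({f"rft.{key}": value for key, value in metadata.items()})
    return f"{RESOLVER_BASE}?{urlencode(params)}"


# Made-up article metadata, purely for illustration.
example = build_openurl({
    "genre": "article",
    "jtitle": "Example Journal",
    "volume": "1",
    "spage": "1",
    "date": "2007",
})
print(example)
```

The point is simply that the descriptive metadata travels in the query string of the URI itself, which is exactly the practice the TAG Finding discusses.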