Thursday, September 16, 2010

Tim BL's talk on Linked Data

I learned a lot from Tim Berners-Lee's TED talk from February 2009 about Linked Data. He talks a bit about his motivation for inventing the Web: the data he encountered at CERN was in all different formats and on all different computer architectures, and he spent a huge fraction of his time writing code to translate one format to another. He goes on to describe how much of the world's data is still locked up in information silos -- a million disconnected little islands -- and how many of the world's most urgent problems require that data be made available across the boundaries between corporations, organizations, laboratories, universities, and nations. He has laid out two sets of guidelines for Linked Data. The first is for the technical crowd (a short code sketch of these rules in action follows the list):

  1. Use URIs to identify things.
  2. Use HTTP URIs so that these things can be referred to and looked up ("dereferenced") by people and user agents.
  3. Provide useful information about the thing when its URI is dereferenced, using standard formats such as RDF/XML.
  4. Include links to other, related URIs in the exposed data to improve discovery of other related information on the Web.
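
To make these rules concrete, here is a minimal sketch in Python of rules 2 and 3 in action: dereferencing the HTTP URI that names a thing and asking, via content negotiation, for RDF/XML rather than HTML. The DBpedia URI is real; the rest is just one way to do the lookup with the standard library, not the definitive recipe.

    # Dereference the HTTP URI that names a thing (rule 2) and request
    # machine-readable RDF/XML about it (rule 3). DBpedia answers the
    # "thing" URI with a 303 redirect to a document describing it.
    import urllib.request

    uri = "http://dbpedia.org/resource/Tim_Berners-Lee"
    req = urllib.request.Request(uri, headers={"Accept": "application/rdf+xml"})

    with urllib.request.urlopen(req) as resp:  # urlopen follows the 303
        print(resp.geturl())   # the document URI we were redirected to
        print(resp.read(400))  # the first bytes of the RDF description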

The second set is for a less technical crowd:

  1. All kinds of conceptual things now have names that start with HTTP.
  2. If I take one of those HTTP names and look it up, I get important information back -- data in a standard format, the kind of useful data somebody might want to know about that thing, about that event.
  3. The information I get back isn't just somebody's height and weight and when they were born; it has relationships. And whenever it expresses a relationship, the thing it's related to is given one of those names that starts with HTTP (the sketch after this list shows one such relationship as an RDF triple).
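
To illustrate that third point, here is a small sketch, using Python's rdflib library, of a single RDF statement in which the related thing is itself an HTTP name that a user agent could look up in turn. The subject, predicate, and object are real DBpedia URIs; rdflib is just one convenient way to write the triple down.

    # One RDF triple: subject, predicate (the "relationship"), object.
    # The object is not the string "London" but an HTTP URI, so anyone
    # who dereferences it can learn more about London itself.
    from rdflib import Graph, URIRef

    g = Graph()
    g.add((URIRef("http://dbpedia.org/resource/Tim_Berners-Lee"),  # subject
           URIRef("http://dbpedia.org/ontology/birthPlace"),       # predicate
           URIRef("http://dbpedia.org/resource/London")))          # object

    print(g.serialize(format="turtle"))  # the same triple in Turtle syntax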

It's a very eloquent talk, reminding me in places of David Gelernter's prophetic book Mirror Worlds.



[Diagram: a cloud of interlinked data sets, including the sites listed below]

What's remarkable about the Linked Data idea is that, as much as people tend to dismiss the whole semantic web vision, it really is making steady progress. The diagram above shows several interlinked websites with large and mutually compatible data sets (a sample query against one of them, DBpedia, follows the list).
  • DBPedia aims to extract linked data from Wikipedia and make it publicly available.
  • YAGO is a huge semantic knowledge base that currently knows more than 2 million entities (persons, organizations, cities, and so on) and 20 million facts about them.
  • Lexvo.org brings information about languages, words, characters, and other human language-related entities to the Linked Data Web and Semantic Web.
  • The Calais web service is an API that accepts unstructured text (news articles, blog postings, and the like), processes it with natural language processing and machine learning algorithms, and returns RDF-formatted entities, facts, and events. A call takes about 0.5 to 1.0 seconds, depending on how big a document you send and the size of your pipe.
  • Freebase is an open repository of structured data covering more than 12 million entities, where an entity is a single person, place, or thing. Freebase connects entities together as a graph.
  • LinkedCT is a website full of linked data about past and present clinical trials.
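
As promised above, here is a small sketch of querying one of these data sets, DBpedia, through its public SPARQL endpoint at http://dbpedia.org/sparql. The endpoint URL and the rdfs:label property are real; treat the 'format' parameter and the result handling as an illustration of the idea rather than the one right way to do it.

    # Ask DBpedia's SPARQL endpoint for the English label of the
    # entity named by the Tim_Berners-Lee URI.
    import json
    import urllib.parse
    import urllib.request

    query = """
    SELECT ?label WHERE {
      <http://dbpedia.org/resource/Tim_Berners-Lee>
          <http://www.w3.org/2000/01/rdf-schema#label> ?label .
      FILTER (lang(?label) = "en")
    }
    """
    url = "http://dbpedia.org/sparql?" + urllib.parse.urlencode(
        {"query": query, "format": "application/sparql-results+json"})

    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)

    for row in data["results"]["bindings"]:
        print(row["label"]["value"])  # e.g. "Tim Berners-Lee"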
