Thursday, September 16, 2010

Tim BL's talk on Linked Data

I learned a lot from Tim Berners-Lee's TED talk from February 2009 about Linked Data. He talks a bit about his motivation for inventing the Web, which was that the data he encountered at CERN was in all different formats and on all different computer architectures and he spent a huge fraction of his time writing code to translate one format to another. He talks about how much of the world's data is still locked up in information silos -- a million disconnected little islands -- and how many of the world's most urgent problems require that data be made available across the boundaries between corporations, organizations, laboratories, universities, and nations. He has laid out two sets of guidelines for linked data. The first is for the technical crowd:

  1. Use URIs to identify things.
  2. Use HTTP URIs so that these things can be referred to and looked up ("dereferenced") by people and user agents.
  3. Provide useful information about the thing when its URI is dereferenced, using standard formats such as RDF/XML.
  4. Include links to other, related URIs in the exposed data to improve discovery of other related information on the Web.

The second set is for a less technical crowd:

  1. All kinds of conceptual things, they have names now that start with HTTP.
  2. I get important information back. I will get back some data in a standard format which is kind of useful data that somebody might like to know about that thing, about that event.
  3. I get back that information it's not just got somebody's height and weight and when they were born, it's got relationships. And when it has relationships, whenever it expresses a relationship then the other thing that it's related to is given one of those names that starts with HTTP.

It's a very eloquent talk, reminding me in places of David Gelernter's prophetic book Mirror Worlds.

What's remarkable about the Linked Data idea is that, as much as people tend to dismiss the whole semantic web vision, it really is making remarkable progress. The diagram above shows several interlinked websites with large and mutually compatible data sets.
  • DBPedia aims to extract linked data from Wikipedia and make it publicly available.
  • YAGO is a huge semantic knowledge base. Currently, YAGO knows more than 2 million entities (like persons, organizations, cities, etc.). It knows 20 million facts about these entities.
  • brings information about languages, words, characters, and other human language-related entities to the Linked Data Web and Semantic Web.
  • The Calais web service is an API that accepts unstructured text (like news articles, blog postings, etc.), processes them using natural language processing and machine learning algorithms, and returns RDF-formatted entities, facts and events. It takes about 0.5 to 1.0 second depending on how big a document you send and the size of your pipe.
  • Freebase is an open repository of structured data of more than 12 million entities. An entity is a single person, place, or thing. Freebase connects entities together as a graph.
  • LinkedCT is a website full of linked data about past and present clinical trials.
Berners-Lee has recommended a very small set of Linked Data principles.
  • Use URIs as names for things.
  • Use HTTP URIs so that people can look up those names.
  • When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL)
  • Include links to other URIs so that they can discover more things.

Monday, September 13, 2010

Set theory, OWL, and the Semantic Web

Despite my interest in semantic web technology, there is one area I've had a little mental block about, which is OWL. If you just sit down and try to read the available technical information about OWL, it's clear as mud. Imagine my surprise when clarity dawned in the form of the book Semantic Web for Dummies by Jeffrey Pollock, who explains in Chapter 8 that OWL amounts to set theory. The book is surprisingly good, I recommend it.

I attended elementary school in the 1960s, when the U.S. was trying a stupid educational experiment called New Math. The basic premise was that little kids needed to know axiomatic set theory, in order for the U.S. to raise a generation of uber math geeks who could outperform the Soviet engineers who put Sputnik into orbit. If only I'd taken more seriously all this nonsense about unions and intersections and empty sets, I might have avoided all that trouble with schoolyard bullies. Oh wait.... Anyway, in order to fulfill this obviously pointless requirement, our teacher would spend the first three weeks of every school year drilling us on exercises in set theory and then move on to whatever math we actually really needed to learn for that year. The take-home lesson was that intersection was preferable to union, because writing the result of a union operation meant I had to do more writing and it made my hand hurt. In retrospect it's amazing that I retained any interest in mathematics.

Set theory came into vogue as guys like David Hilbert and Bertrand Russell were fishing around for a formal bedrock on which to place the edifice of mathematics. The hope was to establish a mathematics that was essentially automatable, in the belief that as a result it would be infallible. So they went around formalizing the definitions of various mathematical objects by injecting bits of set theory. One of the more successful examples was to use Dedekind cuts to define the real numbers in terms of the rational numbers.

Hopes of the infallibility of mathematics' new foundation were dashed by Kurt Godel's brilliant incompleteness theorem, described as “the most signal paper in logic in two thousand years.” It was possible to define mathematical ideas in set theoretic terms, and to formalize the axioms, and to automate the proof process, but at a cost. Godel proved the existence of mathematical truths that were formally undecidable -- they could neither be proved nor disproved. Hilbert had hoped that once mathematics was formalized, no stone would be left unturned, and all true mathematical statements would be provable. The story of Godel's theorem (not the history, just an outline of the proof itself) is a wonderful story, well told in Hofstatder's book Godel, Escher, Bach.

But getting back to semantic web stuff. Here are some basic ideas of OWL.
  • Everything is an instance of owl:Thing. Think of it as a base class like java.lang.Object.
  • Within an ontology, you have "instances", "classes", and "properties".
  • "Classes" are essentially sets. "Individuals" are elements of sets.
  • A "property" expresses some relationship between two individuals.
  • OWL includes representations for:
    • unions and intersections of classes (sets)
    • the idea that a set is a subset of another
    • the idea that two sets are disjoint
    • the idea that two sets are the same set
    • the idea that two instances are the same instance
  • Properties can by symmetric (like "sibling") or transitive (like "equals")
  • A property can be "functional", or a function in a mathematical sense. If p is functional, and you assert that p(x)=y and p(x)=z, then the reasoning engine will conclude that y=z.
  • One property can be declared to be the inverse of another.
  • One can declare a property to have specific classes (sets) as its domain and range.
It would be really nice if, at this point, I had some brilliantly illustrative examples of OWL hacking ready to include here. Hopefully those will be forthcoming.