Wednesday, January 27, 2010

Bayesian nets in RDF, and how to update them

I've banged my head on this for a couple of days and feel close to a solution. The graph looks like this.



Each random boolean variable gets a node, the causal relationship between them gets a node, and each variable gets a probability.
The math for updating probabilities is a little tricky, but in a fun and interesting way, so I enjoyed banging on that. At some point I'll tackle more involved cases with more than two random boolean variables, but this is the logistically simple case that exposes most of the concepts involved. Kinda like the Drosophila of Bayesian inference.

"""Let A and B be random boolean variables, with A being an
unobservable cause and B being an observable effect, and Pr(B|A) and
Pr(B|~A) are given. This might be the case if the situation has some
structure that dictates the conditional probabilities, like a 6.432
problem, or it just might be empirical.

From these we can compute probabilities for the four cases.
P11 = Pr(B^A) = Pr(A) Pr(B|A)
P10 = Pr(~B^A) = Pr(A) (1-Pr(B|A))
P01 = Pr(B^~A) = (1-Pr(A)) Pr(B|~A)
P00 = Pr(~B^~A) = (1-Pr(A)) (1-Pr(B|~A))

Treat (Pr(B|A), Pr(B|~A)) as one piece of information, and Pr(A) as a
separate independent piece of information. The first piece reflects
your beliefs about the machinery connecting A to B, and the second
reflects your beliefs about the likelihood of A, so these are
legitimately separate concerns.

If we observe that B=1, then we want to replace Pr(A) with our
previous estimate for Pr(A|B), which is given by our old numbers as
P11/Pr(B) = P11/(P11+P01), and this becomes our posterior probability
for A. Likewise, if we observe B=0, the posterior is
Pr(A|~B) = P10/Pr(~B) = P10/(P10+P00)."""

class Link1:
    def __init__(self, PrBgivenA, PrBgivenNotA):
        self.PrBgivenA, self.PrBgivenNotA = PrBgivenA, PrBgivenNotA
    def updatePrA(self, PrA, bvalue):
        p11 = PrA * self.PrBgivenA
        p10 = PrA * (1. - self.PrBgivenA)
        p01 = (1. - PrA) * self.PrBgivenNotA
        p00 = (1. - PrA) * (1. - self.PrBgivenNotA)
        # print(p11, p01, p10, p00)
        assert 0.999 < p11 + p01 + p10 + p00 < 1.001
        assert PrA - 0.001 < p10 + p11 < PrA + 0.001
        if bvalue:
            # Pr(A|B)
            newPrA = p11 / (p11 + p01)
        else:
            # Pr(A|~B)
            newPrA = p10 / (p10 + p00)
        assert 0.0 <= newPrA <= 1.0
        return newPrA

link = Link1(0.6, 0.27)

PrA = 0.5
print(PrA)

for i in range(5):
    PrA = link.updatePrA(PrA, 0)
    print(PrA)

for i in range(10):
    PrA = link.updatePrA(PrA, 1)
    print(PrA)

Monday, January 18, 2010

Inference engines and automated reasoning

An inference engine is a computer program that reasons, using some form of knowledge representation.

This can be done with propositional logic or first-order logic, assuming each proposition is completely unambiguous and is either 100% true or 100% false. These simplistic engines are fun little exercises in programming but in real-world situations, reasoning usually needs to consider ambiguities and uncertainties. Instead of simply being true or false, propositions may be likely or unlikely, or their likelihood may be something to be tested or determined. Some elements of some propositions may be poorly defined.

In the unambiguous binary case, it's typical to express rules for generating new propositions as if-then rules with variables in them. We call these production rules because they are used to produce new propositions.
If X is a man, then X is mortal.
Given the statement "Socrates is a man", we
  • match the statement to the rule's IF clause
  • take note of all variable assignments: X=Socrates
  • plug assignments into the THEN clause: "Socrates is mortal"
Obviously this isn't rocket science, but even without handling uncertainty, it will still be useful if scaled to very large numbers of propositions, as in the semantic web.
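The match-and-substitute steps above can be sketched in a few lines of Python. The tuple-based pattern representation and the function names here are invented for illustration, not any particular engine's API:

```python
# A tiny production-rule sketch: rules are (if_pattern, then_pattern) pairs,
# where patterns are tuples and strings starting with "?" are variables.

def match(pattern, statement):
    """Try to match a pattern against a ground statement.
    Returns a dict of variable bindings, or None on failure."""
    bindings = {}
    for p, s in zip(pattern, statement):
        if p.startswith("?"):          # variable: bind it, or check prior binding
            if bindings.get(p, s) != s:
                return None
            bindings[p] = s
        elif p != s:                   # constant: must match exactly
            return None
    return bindings if len(pattern) == len(statement) else None

def substitute(pattern, bindings):
    """Plug variable bindings into a THEN clause."""
    return tuple(bindings.get(p, p) for p in pattern)

# If X is a man, then X is mortal.
rule = (("?X", "is-a", "man"), ("?X", "is", "mortal"))

statement = ("Socrates", "is-a", "man")
bindings = match(rule[0], statement)    # {'?X': 'Socrates'}
print(substitute(rule[1], bindings))    # ('Socrates', 'is', 'mortal')
```

A real production system would apply every rule against every statement repeatedly, adding the produced propositions back into the pool until nothing new appears (forward chaining).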

How to handle uncertainty? One way is to represent knowledge as a Bayesian network, a directed acyclic graph whose edges represent the influences and dependencies between random variables. There is a good tutorial about these online. Here's an example from the Wikipedia article, where the probability of rain is an independent variable, the sprinkler system is usually off if it's raining, and the grass can get wet from either rain or the sprinkler.

There are at least two open-source inference engines that work with Bayesian networks. One is SMILE, another is the OpenBayes library for the Python language. OpenBayes allows you to update the state of your knowledge with a new observation.
Suppose now that you know that the sprinkler is on and that it is not cloudy, and you wonder what's the probability of the grass being wet : Pr(w|s=1,c=0). This is called evidence...
ie.SetObs({'s':1,'c':0})
and then perform inference in the same way... The grass is much more likely to be wet because the sprinkler is on!
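OpenBayes handles the inference for us, but a network this small can be checked by brute-force enumeration over the joint distribution. Here is a minimal sketch, assuming the standard textbook conditional probability tables for the cloudy/sprinkler/rain/wet-grass network (the numbers are my assumption, not taken from the OpenBayes tutorial itself):

```python
# Brute-force inference on the cloudy/sprinkler/rain/wet-grass network.
p_c = 0.5                                        # Pr(cloudy)
p_s = {True: 0.1, False: 0.5}                    # Pr(sprinkler | cloudy)
p_r = {True: 0.8, False: 0.2}                    # Pr(rain | cloudy)
p_w = {(True, True): 0.99, (True, False): 0.9,   # Pr(wet | sprinkler, rain)
       (False, True): 0.9, (False, False): 0.0}

def joint(c, s, r, w):
    """Pr(C=c, S=s, R=r, W=w) via the chain rule on the network."""
    p = p_c if c else 1 - p_c
    p *= p_s[c] if s else 1 - p_s[c]
    p *= p_r[c] if r else 1 - p_r[c]
    p *= p_w[s, r] if w else 1 - p_w[s, r]
    return p

def query(w, s, c):
    """Pr(W=w | S=s, C=c), summing out the hidden rain variable."""
    num = sum(joint(c, s, r, w) for r in (True, False))
    den = sum(joint(c, s, r, wv) for r in (True, False) for wv in (True, False))
    return num / den

print(query(True, True, False))   # Pr(w | s=1, c=0) -> 0.918
```

Enumeration is exponential in the number of variables, which is exactly why real engines use cleverer algorithms; but it makes a handy sanity check.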
Here is a list of many more Bayesian network libraries, and another list. There is also a nice tutorial on Learning Bayesian Networks from Data, the process of taking a bunch of data and automatically discovering the Bayesian network that might have produced it. Another Bayesian reasoning system is BLOG.
Bayesian logic (BLOG) is a first-order probabilistic modeling language under development at MIT and UC Berkeley. It is designed for making inferences about real-world objects that underlie some observed data: for instance, tracking multiple people in a video sequence, or identifying repeated mentions of people and organizations in a set of text documents. BLOG makes it (relatively) easy to represent uncertainty about the number of underlying objects and the mapping between objects and observations.
Are production rule systems and Bayesian network systems mutually compatible? I don't yet know. Do Bayesian networks adequately represent all important forms of uncertainty or vagueness that one might encounter in working with real-world data? I don't know that either. Are there other paradigms I should be checking out? Probably.

Sunday, January 17, 2010

How hard is generating scientific hypotheses?

In the late 1500s, the Danish astronomer Tycho Brahe used precision naked-eye instruments (the telescope had not yet been invented) to collect an enormous amount of numerical data describing the motion of the planets. Brahe's assistant Johannes Kepler studied that data and arrived at some interesting conclusions which we now know as Kepler's laws of planetary motion:
  1. The orbit of every planet is an ellipse with the Sun at a focus.
  2. A line joining a planet and the Sun sweeps out equal areas during equal intervals of time.
  3. The square of the orbital period of a planet is directly proportional to the cube of the semi-major axis of its orbit.
Kepler's laws were the starting point from which Isaac Newton formulated his law of gravitation, the inverse-square law that we all know and love.

We have here a three-step process: collect data, find mathematical patterns in the data, and create a theory that explains those patterns. Collecting data is simple in principle, and looking for mathematical patterns is also simple. Kepler's arithmetic was done by hand, but now we have computer programs (like Eureqa) which use genetic programming to find parsimonious mathematical formulas that fit sets of data. You can find Java applets on the web that demonstrate this idea.
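As a toy version of that pattern-finding step, an ordinary least-squares fit on log-log planetary data recovers the exponent in Kepler's third law. The orbital elements below are rounded textbook values:

```python
import math

# Semi-major axis (AU) and orbital period (years), rounded textbook values.
planets = {
    "Mercury": (0.387, 0.241),
    "Venus":   (0.723, 0.615),
    "Earth":   (1.000, 1.000),
    "Mars":    (1.524, 1.881),
    "Jupiter": (5.203, 11.862),
    "Saturn":  (9.537, 29.457),
}

# If T = k * a^n, then log T = n log a + log k: fit a line by least squares.
xs = [math.log(a) for a, T in planets.values()]
ys = [math.log(T) for a, T in planets.values()]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))

print(round(slope, 3))   # close to 1.5, i.e. T^2 proportional to a^3
```

A genetic-programming system like Eureqa also searches over the structure of the formula, not just its coefficients, but the fitting idea is the same.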

So the first two steps aren't too hard. We can arrive rather easily at mathematical formulas that describe various experimentally measurable aspects of reality. That's a good thing. The hard job is the next step: finding theories or "likely stories" that explain why those formulas take whatever form they do. Sometimes the form of the math suggests a mechanism, because you've learned to associate closed elliptical orbits with a central inverse-square force law. (Hundreds of years after Newton, that is now a no-brainer.) But generally the problem is non-trivial and so far, as far as I'm aware, requires human insight.

Foresight Institute conference, Jan 16 and 17, 2010

The Foresight conference is just winding down. The talks were live-blogged over at NextBigFuture by Brian Wang who did a good job of concisely capturing the essentials. My own favorite talk was by Hod Lipson, who talked about a number of things, including something I find fascinating, the automation of science, about which I plan to blog more frequently.

I blogged too briefly in the past about the Adam project, but it deserves more. Adam, reported in April 2009 by Ross King's group at Aberystwyth University, used lab automation to perform experiments and data mining to find patterns in the resulting data. It developed novel genomics hypotheses about the yeast S. cerevisiae and tested them. Adam's conclusions were manually confirmed by human experimenters and found to be correct. This is arguably the first instance in human history of a machine discovering new scientific knowledge without direct human oversight.

Here is what I want to see computers doing in the coming years.
  • Look for patterns in data -- data mining
  • Propose falsifiable hypotheses
  • Design experiments to test those hypotheses
  • Perform the experiments and collect data
  • Confirm or deny hypotheses
  • Mine new data for new patterns, repeat the process
In the longer term, I want to see machine theoreticians and experimentalists collaborate with their human counterparts, both working in a scientific literature that is readable and comprehensible to both. This will require the development of a machine-parseable ontology (ideally a widely recognized standard) for sharing elements of the scientific reasoning process: data sets, hypotheses, predictions, deduction, induction, statistical inference, and the design of experiments.

So why do I want all this stuff? For one thing, it's interesting. For another, I am approaching the end of my life and I want to see scientific progress (and particularly medical progress) accelerate considerably in my remaining years. Finally, this looks to me like something where I can make some modestly valuable contribution to humanity with the time and energy I have left.