
Friday, October 25, 2013

Bar Camp Boston 2013 talk on automation of science

This is an outline for a talk I gave at Bar Camp Boston 8 on the automation of science. It's a topic I've blogged and spoken about before. The shortened URL for this post is http://goo.gl/rv3Xik.

In 2004, a robot named Adam became the first machine in history to discover new scientific knowledge independently of its human creators. Without human guidance, Adam can create hypotheses to explain observations, design experiments to test those hypotheses, run the experiments using laboratory robotics, interpret the experimental results, and repeat the cycle to generate new knowledge. The principal investigator on the Adam project was Ross King, now at Manchester University, who published a paper on the automation of science (PDF) in 2009. Some of his other publications: 1, 2, 3.

Adam works in a very limited domain, in nearly complete isolation. There is plenty of laboratory automation but (apart from Adam) we don't yet have meaningful computer participation in the theoretical aspect of scientific work. A worldwide scientific collaboration of human and computer theoreticians working with human and computer experimentalists could advance science and medicine and solve human problems faster.

The first step is to formulate a linked language of science that machines can understand. Publish papers in formats like RDF/Turtle, JSON, JSON-LD, or YAML. Link the scientific literature to existing semantic networks (DBpedia, Freebase, Google Knowledge Graph, LinkedData.org, Schema.org, etc.). Create schemas for scientific domains and for the scientific method itself (hypotheses, predictions, experiments, data). Provide tutorials, tools, and incentives to encourage researchers to publish machine-tractable papers. Create a distributed graph or database of these papers, serving the role of scientific journals, accessible to people and machines everywhere. Maybe use Stack Overflow as a model for peer review.
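As a concrete sketch of what a machine-tractable paper might look like, here is a minimal JSON record for one experiment. The @context URL, field names, and content are all invented for illustration; a real effort would standardize these as shared schemas.

```python
import json

# A hypothetical, minimal JSON-LD-style record for one experiment.
# Every name and value here is invented for illustration.
paper = {
    "@context": "https://example.org/science-schema",
    "@type": "Experiment",
    "hypothesis": "Compound X inhibits enzyme Y",
    "prediction": "Reaction rate drops when X is added to the assay",
    "method": "automated kinetic assay",
    "observation": {"rate_change_percent": -42.0},
    "supports_hypothesis": True,
}

# Both people and machines can consume the same record.
print(json.dumps(paper, indent=2))
```

A record like this could be linked to other papers, to the entities it mentions, and to the raw data behind the observation.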

Begin with very limited scientific domains (high school physics, high school chemistry) to avoid the full complexity and political wrangling of the professional scientific community in the initial stages. As these tools approach readiness for professional work, deploy them first in computer science and other domains where they can hope to avoid overwhelming resistance.

Machine learning algorithms (clustering, classification, regression) can find patterns in data and help to identify useful abstractions. Supervised learning algorithms can provide tools of collaboration between people and computers.
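As a toy illustration of the clustering idea, here is a minimal one-dimensional k-means sketch in pure Python; real work would use an established library, and the data points here are invented.

```python
import random

def kmeans_1d(points, k, iterations=20, seed=0):
    """Minimal 1-D k-means: alternately assign each point to its
    nearest center, then move each center to the mean of its points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Keep a center in place if its cluster came up empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# Two obvious groups of measurements; the centers should land near 1 and 10.
data = [0.9, 1.0, 1.1, 1.2, 9.8, 10.0, 10.1, 10.3]
print(kmeans_1d(data, 2))
```

The algorithm discovers the two groups without being told where they are, which is the sense in which clustering "finds patterns in data."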

The computational chemistry folks have a cool little program called Babel which translates between a large number of different file formats for representing molecular structures. It does this with a rich internal representation of structures, and pluggable read and write modules for each file format. At some point, something like this for different file formats of scientific literature might become useful, and might help to build consensus among different approaches.
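The Babel pattern (one rich internal representation, with a pluggable reader and writer per format) can be sketched in a few lines of Python. The formats and field names below are invented for illustration; adding one new plugin connects a format to every other registered format.

```python
# Sketch of a Babel-style converter: a shared internal representation
# plus registries of per-format readers and writers.
readers, writers = {}, {}

def reader(fmt):
    def register(fn):
        readers[fmt] = fn
        return fn
    return register

def writer(fmt):
    def register(fn):
        writers[fmt] = fn
        return fn
    return register

@reader("csvish")
def read_csvish(text):
    # Internal representation: a list of (element, count) pairs.
    return [(e, int(n)) for e, n in
            (line.split(",") for line in text.splitlines())]

@writer("formula")
def write_formula(atoms):
    return "".join(f"{e}{n if n > 1 else ''}" for e, n in atoms)

def convert(text, src, dst):
    return writers[dst](readers[src](text))

print(convert("H,2\nO,1", "csvish", "formula"))  # H2O
```

With N formats, the pluggable design needs N readers and N writers instead of N*(N-1) pairwise translators, which is why a shared representation helps build consensus among different approaches.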


A treasure trove would be available in linked patient data. In the United States this is problematic because of the privacy restrictions associated with HIPAA regulation. Countries like Iceland and Norway, with universal health care and more centralized health records, operate under different privacy regimes, and might be better places to initiate a Linked Patient Data project.

Thursday, October 17, 2013

The first neon sign I've ever wanted to own

This sign appears in the Cambridge UK office of Autonomy Corporation. I want one. I need to talk to the people who make neon signs. There are a few online threads (1, 2) where people express curiosity about this sign.

This equation is Bayes' Law: Pr(A|B) = Pr(B|A) Pr(A) / Pr(B). Thomas Bayes (1701-1761) proposed it as a way to update one's beliefs based on new information. I saw this picture in a blog post by Allen Downey, author of Think Bayes, whom I recently had the pleasure of meeting briefly at a Boston Python meetup. Very interesting guy, also well versed in digital signal processing, another interest we share. Before the other night, I probably hadn't heard the word "cepstrum" in almost twenty years.

Allen's blog is a cornucopia of delicious problems involving Bayes' Law and other statistical delights that I learned to appreciate while taking 6.432, an MIT course on detection and estimation that I'm afraid may have been retired. The online course materials they once posted for it have been taken down.

But imagine my satisfaction upon looking over Think Bayes and realizing that it is the missing textbook for that course! I haven't checked to see that it covers every little thing that was in 6.432, but it definitely covers the most important ideas. At a quick glance, I don't see much about vectors as random variables, but I think he's rightly more concerned with getting the ideas out there without the intimidation of extra mathematical complexity.

Wednesday, January 27, 2010

Bayesian nets in RDF, and how to update them

I've banged my head on this for a couple of days and feel close to a solution. In the graph, each random boolean variable gets a node, the causal relationship between them gets a node, and each variable gets a probability.
The math for updating probabilities is a little tricky, but in a fun and interesting way, so I enjoyed banging on that. At some point I'll tackle more involved cases where there aren't simply two random boolean variables, but that's the logistically simple case that exposes most of the concepts involved. Kinda like the Drosophila of Bayesian inference.

"""Let A and B be random boolean variables, with A being an
unobservable cause and B being an observable effect, and Pr(B|A) and
Pr(B|~A) are given. This might be the case if the situation has some
structure that dictates the conditional probabilities, like a 6.432
problem, or it just might be empirical.

From these we can compute probabilities for the four cases.
P11 = Pr(B^A) = Pr(A) Pr(B|A)
P10 = Pr(~B^A) = Pr(A) (1-Pr(B|A))
P01 = Pr(B^~A) = (1-Pr(A)) Pr(B|~A)
P00 = Pr(~B^~A) = (1-Pr(A)) (1-Pr(B|~A))

Treat (Pr(B|A), Pr(B|~A)) as one piece of information, and Pr(A) as a
separate independent piece of information. The first piece reflects
your beliefs about the machinery connecting A to B, and the second
reflects your beliefs about the likelihood of A, so these are
legitimately separate concerns.

If we observe that B=1, then we want to replace Pr(A) with our
previous estimate for Pr(A|B), which is given by our old numbers as
P11/Pr(A) = P11/(P11+P01), and this becomes our posterior probability
for A."""

class Link1:
    """Link between an unobservable cause A and an observable effect B,
    parameterized by Pr(B|A) and Pr(B|~A)."""
    def __init__(self, PrBgivenA, PrBgivenNotA):
        self.PrBgivenA, self.PrBgivenNotA = PrBgivenA, PrBgivenNotA

    def updatePrA(self, PrA, bvalue):
        # Joint probabilities of the four (B, A) cases.
        p11 = PrA * self.PrBgivenA
        p10 = PrA * (1. - self.PrBgivenA)
        p01 = (1. - PrA) * self.PrBgivenNotA
        p00 = (1. - PrA) * (1. - self.PrBgivenNotA)
        assert 0.999 < p11 + p01 + p10 + p00 < 1.001
        assert PrA - 0.001 < p10 + p11 < PrA + 0.001
        if bvalue:
            # Pr(A|B)
            newPrA = p11 / (p11 + p01)
        else:
            # Pr(A|~B)
            newPrA = p10 / (p10 + p00)
        assert 0.0 <= newPrA <= 1.0
        return newPrA

link = Link1(0.6, 0.27)

PrA = 0.5
print(PrA)

for i in range(5):
    PrA = link.updatePrA(PrA, 0)
    print(PrA)

for i in range(10):
    PrA = link.updatePrA(PrA, 1)
    print(PrA)

Monday, January 18, 2010

Inference engines and automated reasoning

An inference engine is a computer program that reasons, using some form of knowledge representation.

This can be done with propositional logic or first-order logic, assuming each proposition is completely unambiguous and either 100% true or 100% false. These simplistic engines are fun little programming exercises, but in real-world situations reasoning usually needs to handle ambiguity and uncertainty. Instead of simply being true or false, propositions may be likely or unlikely, or their likelihood may be something to be tested or determined. Some elements of some propositions may be poorly defined.

In the unambiguous binary case, it's typical to express rules for generating new propositions as if-then rules with variables in them. We call these production rules because they are used to produce new propositions.
If X is a man, then X is mortal.
Given the statement "Socrates is a man", we
  • match the statement to the rule's IF clause
  • take note of all variable assignments: X=Socrates
  • plug assignments into the THEN clause: "Socrates is mortal"
Obviously this isn't rocket science, but even without handling uncertainty, it will still be useful if scaled to very large numbers of propositions, as in the semantic web.
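The match-bind-substitute cycle above can be sketched as a tiny forward-chaining production system; the fact and rule representations here are invented for illustration.

```python
# Minimal forward-chaining production system.  Facts are tuples; a rule
# pairs an IF pattern with a THEN pattern, and strings starting with
# "?" act as variables.

def match(pattern, fact):
    """Return variable bindings if fact matches pattern, else None."""
    if len(pattern) != len(fact):
        return None
    bindings = {}
    for p, f in zip(pattern, fact):
        if p.startswith("?"):
            if bindings.get(p, f) != f:
                return None
            bindings[p] = f
        elif p != f:
            return None
    return bindings

def substitute(pattern, bindings):
    return tuple(bindings.get(p, p) for p in pattern)

# If X is a man, then X is mortal.
rules = [(("?x", "is_a", "man"), ("?x", "is", "mortal"))]
facts = {("Socrates", "is_a", "man")}

# Fire every rule against every fact until no new facts appear.
changed = True
while changed:
    changed = False
    for if_part, then_part in rules:
        for fact in list(facts):
            b = match(if_part, fact)
            if b is not None:
                new = substitute(then_part, b)
                if new not in facts:
                    facts.add(new)
                    changed = True

print(("Socrates", "is", "mortal") in facts)  # True
```

Scaling this up is mostly a matter of indexing facts so that rule matching doesn't have to scan everything, which is what serious rule engines spend their effort on.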

How to handle uncertainty? This can be done by representing knowledge as a Bayesian network, a directed graph where the edges represent the influences and dependencies between random variables. There is a good tutorial about these online. Here's an example from the Wikipedia article where the probability of rain is an independent variable, and the sprinkler system is usually off if it's raining, and the grass can get wet from either rain or the sprinkler.

There are at least two open-source inference engines that work with Bayesian networks. One is SMILE, another is the OpenBayes library for the Python language. OpenBayes allows you to update the state of your knowledge with a new observation.
Suppose now that you know that the sprinkler is on and that it is not cloudy, and you wonder what's the probability of the grass being wet : Pr(w|s=1,c=0). This is called evidence...
ie.SetObs({'s':1,'c':0})
and then perform inference in the same way... The grass is much more likely to be wet because the sprinkler is on!
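To see what such a library computes under the hood, the same query can be answered by brute-force enumeration of the joint distribution. The sketch below uses the conditional probability values from the standard textbook version of the cloudy/sprinkler/rain/wet-grass network; they may differ from the numbers in the OpenBayes tutorial.

```python
from itertools import product

# Conditional probability tables for the classic four-node network:
# Cloudy -> Sprinkler, Cloudy -> Rain, (Sprinkler, Rain) -> WetGrass.
def p_c(c):
    return 0.5  # Pr(cloudy) = Pr(not cloudy) = 0.5
def p_s(s, c):
    p = 0.1 if c else 0.5  # sprinkler usually off when cloudy
    return p if s else 1 - p
def p_r(r, c):
    p = 0.8 if c else 0.2  # rain likely when cloudy
    return p if r else 1 - p
def p_w(w, s, r):
    p = {(1, 1): 0.99, (1, 0): 0.90, (0, 1): 0.90, (0, 0): 0.0}[(s, r)]
    return p if w else 1 - p

def joint(c, s, r, w):
    return p_c(c) * p_s(s, c) * p_r(r, c) * p_w(w, s, r)

# Pr(w=1 | s=1, c=0): sum the joint over the unobserved variable r,
# then normalize by the probability of the evidence.
num = sum(joint(0, 1, r, 1) for r in (0, 1))
den = sum(joint(0, 1, r, w) for r, w in product((0, 1), repeat=2))
print(round(num / den, 3))  # 0.918
```

Enumeration is exponential in the number of variables, which is why real libraries use smarter algorithms, but on a four-node network it makes the semantics completely transparent: the grass is much more likely to be wet because the sprinkler is on.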
Here is a list of many more Bayesian network libraries, and another list. There is also a nice tutorial on Learning Bayesian Networks from Data, the process of taking a bunch of data and automatically discovering the Bayesian network that might have produced it. Another Bayesian reasoning system is BLOG.
Bayesian logic (BLOG) is a first-order probabilistic modeling language under development at MIT and UC Berkeley. It is designed for making inferences about real-world objects that underlie some observed data: for instance, tracking multiple people in a video sequence, or identifying repeated mentions of people and organizations in a set of text documents. BLOG makes it (relatively) easy to represent uncertainty about the number of underlying objects and the mapping between objects and observations.
Are production rule systems and Bayesian network systems mutually compatible? I don't yet know. Do Bayesian networks adequately represent all important forms of uncertainty or vagueness that one might encounter in working with real-world data? I don't know that either. Are there other paradigms I should be checking out? Probably.