Sunday, April 18, 2010

Autosci talk at Bar Camp Boston

Yesterday I had fun giving a talk on the automation of science at Bar Camp Boston. I was very fortunate to (A) have very little to say myself, so that I quickly got out of the way for others to discuss, and (B) have some very smart people in the room who got the idea immediately, some of them able to give the scientist's-eye view of this idea.

Discussion centered around a few topics. One was how comprehensive a role would computers play in the entire scientific process. There seemed to be consensus that computers could easily identify statistical patterns in data, could perform symbolic regression in cases of limited complexity and not too many variables, but that in the creation of scientific theories and hypotheses, there are necessary intuitive leaps that a machine can't make. Personally I believe that's true but I imagine that computers might demonstrate an ability to make leaps we can't make as humans, and I have no idea what those leaps would look like because they would be the product of an alien intelligence. If no such leaps occur, at least the collection of tools available to human scientists will hopefully have grown in a useful direction.

Another topic was the willingness of scientists to provide semantic markup for research literature. Only those expert in the field are qualified to provide such markup since it requires an in-depth understanding of the field as a whole, and the paper's reasoning process in particular. It's also likely to be a lot of work, at least initially, and there is as yet no incentive to offer scientists in exchange for such work. The notion of posting papers on some kind of wiki and hoping that semantic markup could be crowd-sourced was quickly dismissed. Crowd-sourcing doesn't work when there is a very precise correct answer and the number of people with that answer is very small.

There has been a lot of Twitter traffic around Bar Camp Boston, and I was able to find a few comments on my talk afterward. It looks like people enjoyed it and found it stimulating and engaging, so that's very cool. It turned out to be a good limbering-up for an immediately following talk on Wolfram Alpha. I found one particularly evocative tweet:
Has anyone approached a CS journal to have their content semantically marked up? #BCBos @BarCampBoston
Thinking about that question, I realized that computer science is the right branch of science to begin this stuff, and that the way to make it most palatable to scientists is to publish papers that demonstrate how to do semantic markup as easily as possible at time of publication (not as a later retrofit), how a scientist can benefit himself or herself by doing that work, and how to do interesting stuff with the markup of papers that have already been published. My quick guess is that some sort of literate programming approach (wiki) is appropriate. So lots to think about.

If you attended my talk, thanks very much for being there. I had a lot of fun, and hope you did too.

Monday, April 12, 2010

A Formal System For Euclid's Elements

I came across this tidbit on the Lambda the Ultimate website. It's a pointer to a juicy paper by some Carnegie Mellon folks.
Abstract. We present a formal system, E, which provides a faithful model of the proofs in Euclid’s Elements, including the use of diagrammatic reasoning.
"Diagrammatic reasoning" is the interesting part. People have recognized the Elements as an exemplar of rigorous reasoning for many centuries, but it took some time for the question to emerge, "are the diagrams a necessary component of the logical argument?" Liebniz believed they were not:
...it is not the figures which furnish the proof with geometers, though the style of the exposition may make you think so. The force of the demonstration is independent of the figure drawn, which is drawn only to facilitate the knowledge of our meaning, and to fix the attention; it is the universal propositions, i.e. the definitions, axioms, and theorems already demonstrated, which make the reasoning, and which would sustain it though the figure were not there.
The authors note that "there is no [historical] chain linking our contemporary diagrams with the ones that Euclid actually drew; it is likely that, over the years, diagrams were often reconstructed from the text". Their abstract seems to say that the design of E recognizes some essential role for the diagrams, so I assume one must exist. I haven't finished reading the paper yet. But the whole thing is very interesting.

Saturday, April 10, 2010

Learning to live with software specifications

We software developers have a knee-jerk hatred of specifications. Rather than write a document describing work we plan to do, we would rather throw together a quick prototype and grow it into the final system. We sometimes feel like specs are for liberal-arts sissies and pointy-haired bosses. Our prehistoric brains want us to dismiss specifications as a waste of time or even an intentional misdirection of energy.

The truth of it is that specs build consensus between developers, testers, tech writers, managers, and customers. They make sure everybody agrees about what to build, how to test it, how to write a user manual for it, and what the priorities are.

The Agile guys talk about the exponentially increasing cost of fixing a bug. The later in the process you find that bug, the more troublesome and expensive it is to fix it. Fixing bugs in code is hard, even prototype code, and fixing text is easy.

Let's learn to trick our brains to work around our reluctance. The Head-First books always start with a great little explanation about how our prehistoric brain circuitry divvies up our attention, classifying things as interesting or boring, and determines what sticks in our memories. Sesame Street learned how to make stuff sticky by
  • repetition
  • lighting up more brain circuitry
  • infusing the topic with emotional content
  • relating it to things that were already sticky
One way to infuse your spec with emotional content would be to make it a turf war. That hooks into all our brain circuitry for tribes and feuds. But turf wars are traumatic and damaging to people and projects, so let's not do this.

To light up more brain circuitry, sketch out pieces of the spec on a big whiteboard. Draw a lot of pictures and diagrams. Use different colored markers. Get a few people together and generate consensus (not a turf war), and ask them to help identify issues that you forgot. That meeting is called a design review, like a code review for specs.

Who should write and own the spec?  Part three of Joel Spolsky's great four-part (1, 2, 3, 4) article answers this question, drawing on his experience at Microsoft. One person should write and own the spec, and the programmers should not report to that person. At Microsoft, that person is a program manager.

It's important to differentiate between
  • a functional spec (what the user sees and experiences, what the customer wants) dealing with features, screens, dialog boxes, UI and UX, work flow
  • and a technical spec (the stuff under the hood) dealing with system components, data structures and algorithms, communication protocols, database schemas, tools, languages, test methodologies, and external dependencies which may have hard-to-predict schedule impacts
Write the functional spec first, then the technical spec, then the code. If you love test-driven development then write the specs, then the tests, then the code.

Joel's article includes some great points on keeping the spec readable.
  • Use humor. It helps people stay awake.
  • Write simply, clearly, and briefly. Don't pontificate.
  • Re-read your own spec, many times. Eat your own literary dogfood. If you can't stay awake, nobody else will either.
  • Avoid working to a template unless politically necessary.
How do you know when the spec is done?
  • The functional spec is done when the system can be designed, built, tested, and deployed without asking more questions about the user interface or user experience.
  • The technical spec is done when each component of the system can be designed, built, tested, and deployed without asking more questions about the rest of the system.
This doesn't mean that these documents can never be updated or renegotiated. But the goal is to aim for as little subsequent change as possible.

I am still sorely tempted by the idea of a quick prototype, an "executable spec" that exposes bugs in design or logical consistency. Maybe it's OK to co-develop this with the spec, or tinker with it on one's own time, or consider it as a first phase of the coding. I'm still sorting this out. The basic rationale of a spec, that fixing bugs in text is easier and cheaper than fixing bugs in code, still needs to be observed.

Sunday, April 04, 2010

Don't covet Apple's new iPad

Back in the days of its founding, Apple championed hobbyists and experimenters, even including circuit board schematics with the Apple ][+ to help people who wanted to tinker with the electronics. Not so now. Cory Doctorow (brilliant guy, read his Disneyland sci-fi novel) recently blogged about how Apple has switched its loyalty to the DRM-and-eternal-copyright crowd, and like the iPhone, the iPad reflects this. Consequently, the common temptation to covet an iPad is an evil one.

I like my Android phone (a Motorola Droid from Verizon) except for the PHONE part, the one thing it does poorly. Every other function, I adore. Also I'd like a bigger keyboard and screen, maybe Kindle size. So: Android tablet with bigger keyboard and screen, and no phone (therefore no messy dependency on mobile carriers).

I wouldn't want to try to build a tablet from scratch, but the Touch Book from AlwaysInnovating looks good. The tablet piece (sans keyboard, which makes it a netbook) is $300, loaded with their custom Linux OS. The OS can be replaced with Ubuntu, Android, Chrome, etc. An SD card makes it easy to get apps and files onto and off the tablet. There's a wiki to help developers get up to speed.

In another video, the inventor shows how to enable route tracking on Google Maps by popping off the back cover and plugging a GPS receiver into an internal USB connector. I am currently between jobs, but this is going on my shopping list for later.

All web app frameworks lead to Rome

Earlier I blogged about how it seemed like web app development had just zoomed past me. Since then, I've buckled down and actually started to study this stuff. My earlier posting only talked about the presentation layer, HTML, Javascript, and CSS. I still have more to learn about those, but the really interesting stuff happens on the server.

In December I went to a two-day session on Hibernate and Spring, and it was full of mysterious jargon that made me sleepy: dependency injection, inversion of control, aspects, object-relational mapping, convention over configuration, blah blah blah. I kept at it, though, looking at Rails and later Django. I'm now waist-deep in building a MySQL-backed Django site. What I learned is that (A) all these web app frameworks are remarkably similar to one another, and (B) those jargon terms are a lot simpler than they seem.

Inversion of control means that the framework makes calls into your app code, rather than you calling the framework from a main() function. Dependency injection is a set of tricks to minimize dependencies between different Java source files. Aspects are Java tricks that you can do by wrapping your methods in other methods with the same signatures, a lot like decorators in Python. Object-relational mapping is creating classes to represent your DB tables: each instance represents a row, each column is represented by a setter and getter. The MVC pattern gives the lay of the land for all these frameworks, and all the presentation stuff I talked about before is limited to the "view" piece.

As I find my footing in the basics, I start to notice where the interesting bits of more advanced topics pop up. If I put a Django app and a Mediawiki on the same server, can I do a single sign-on for both of them? I think I can, by writing an AuthPlugin extension to make the Mediawiki accept Django's authentication cookie.

Don't ask Django to serve a PHP page because it doesn't include a PHP interpreter (what mod_php does for Apache). Your Apache config file must deal with PHP files before routing to Django.
    AliasMatch /([^/]*\.php) ..../phpdir/$1
    WSGIScriptAlias / ..../djangodir/django.wsgi

One thing I haven't quite understood is why the Django community seems to love Prototype and hate jQuery. Is that just because Prototype is included in the standard Django package? Is it purely historical, with jQuery the abandoned but superior Betamax to Prototype's VHS?

Friday, March 12, 2010

A new FPGA board to play with

Digilent, a partner of Xilinx, makes eval boards for Xilinx FPGAs. I bought one and plan to hack some Verilog with it. My past experiments involved a board of my own design with a FPGA and a USB-enabled microcontroller. I successfully programmed the microcontroller over the USB cable to wiggle GPIO pins, which should have allowed me to program the FPGA via JTAG. But for some reason, JTAG programming of the FPGA didn't work. This time the JTAG programming pins will be wired directly to parallel port pins and there is a Linux library for programming them, so I should have better luck this time. Fewer unknowns and variables, more easily probed.

Attention (hey, shiny!) deficit break: I stumbled across a couple of very affordable logic analyzers. Amazing stuff, just the thing for debugging errant JTAG signals.

Some nice folks have released a PCI soft core under the LGPL. I'm not ready to tackle that yet, but hope to get there before too long. Speaking of PCI,
here is a nice FPGA board for a PCI bus slot from Enterpoint in the UK. They also have a PCI soft core but the licensing is a bit pricey for a hobbyist. I wonder if the LGPLed PCI core would work on the Enterpoint board.

Thursday, March 11, 2010

TA-65 safety claims

Earlier I posted about TA-65, a telomerase activator, which some hope could reverse some of the effects of aging. Amiya Sarkar is a doctor in Calcutta who writes a fascinating blog on physiology and physics. He and I have emailed back and forth for a couple years now, starting with a very cool idea he had for an inexpensive open-source electrocardiogram. (One of these days we really need to get that project back on track.)

Amiya expressed the concern that any telomerase activator could be viewed as a potential cancer risk. Cancerous cells use telomerase to support the unlimited replication that characterizes cancer. The folks at Sierra Sciences openly recognize this concern, and give reasons why they believe it's a red herring, on this webpage:
In most cases (85–95%), cancers accomplish this indefinite cell division by turning on telomerase. For this reason, forcing telomerase to turn off throughout the body has been suggested as a cure for cancer, and there are several telomerase inhibitor drugs presently being tested in clinical trials.

So, anti-aging scientists must be out of their minds to want to turn the telomerase gene on, right?

No! Although telomerase is necessary for cancers to extend their lifespan, telomerase does not cause cancer. This has been repeatedly demonstrated: at least seven assays for cancer have been performed on telomerase-positive human cells: the soft agar assay, the contact inhibition assay, the mouse xenograft assay, the karyotype assay, the serum inhibition assay, the gene expression assay, and the checkpoint analysis assay. All reported negative results...

Paradoxically, even though cells require telomerase to become dangerous cancers, turning on telomerase may actually prevent cancer. This is not just because the risk of chromosome rearrangements is reduced, but also because telomerase can extend the lifespan of our immune cells, improving their ability to seek out and destroy cancer cells.
In support of this, they list several papers.
  • Jiang, X.-R. et al. Telomerase expression in human somatic cells does not induce changes associated with a transformed phenotype. Nature Genet., 21, 111–114 (1999)
  • Morales, C.P., et. al. Absence of cancer-associated changes in human fibroblasts immortalized with telomerase. Nature Genet., 21, 115–118 (1999)
  • Harley, C. B. Telomerase is not an oncogene. Oncogene 21(4): 494-502 (2002).
From other writings on their website, and from their postings to Twitter and Facebook, it's clear that the Sierra Sciences folks are 100% confident that telomerase activators pose zero cancer risk. They are in a much better position to know about this than I. But if I started taking TA-65 and they were somehow mistaken, they wouldn't be the ones at risk for cancer. I hope to find out about those seven assays and try to read those three papers in my abundant spare time, and maybe discuss the matter with my doctor. (My present circumstances do not permit me to afford TA-65 even if I decide I want it.) Wouldn't it be cool if the Sierra Sciences people turn out to be correct...

Saturday, February 27, 2010

Learning Ruby on Rails

I'm learning Ruby on Rails to help a friend with his website and to be able to put it on my resume, and keeping notes as I go.

I've gotten the thing to do typical CGI script stuff, and now I'm figuring out how database access works. One big surprise is that as Rails advanced to version 2.0, one of the basic commands for setting up database access changed. Google "rails 2.0 scaffolding" for details.

Wednesday, February 24, 2010

Jena: Node versus RDFNode

The Jena code has two representations for nodes in an RDF graph. One is the class Node, which has several subclasses: Node_Variable, Node_Literal, Node_URI, etc. The other is the interface RDFNode, which has many subinterfaces: Literal, Resource, Property, etc.

These two node representations have very different roles and very different idiomatic usages, and this doesn't appear to be spelled out in the Jena documentation anywhere. RDFNode is in the com.hp.hpl.jena.rdf.model package, where Node is in the com.hp.hpl.jena.graph package, but I don't think the packaging by itself is a big enough hint.

The Jena tutorials mostly talk only about the RDFNode variants, usually instantiating them by calling a "create" method on the Model. The poorly documented distinction between RDFNode and Node extends to the distinction between Model and Graph, and between Statement and Triple.

Since this information didn't appear in the documentation, we need to look at the Jena mailing list to find it.
A key difference between Resource and Node is that Resources know which model they are in, and Nodes are general. That's what makes resource.getProperty() work. Now in a query that is not a concept that has any meaning in the general case and patterns can span graphs.

We have found that Model/Statement/RDFNode (the API) works as an application interface but it's not the right thing for storage abstractions and the Graph/Triple/Node (the SPI) works better where the regularity is more valuable. That is, we have split the application-facing design from the sub-system-facing design.
So an instance of RDFNode is associated with a specific Model, where an instance of Node is free-floating, and is used to build Rules, which are also model-independent. The two representations can be connected by URIs. If you have a Node and a Model, and you want the corresponding RDFNode, do this (or use createProperty or createLiteral as needed):
Resource r = model.createResource(uri1);
and if you have an RDFNode, you can do this to get a Node:
Node uriNode = Node.createURI(
        ((Resource)rdfNode).getURI());
So I can understand that there are two very different appropriate interfaces for writing Jena apps and for interfacing to a storage system. What I don't get is why I would ever see the latter while writing an application. If I define a Rule, I need to deal in Nodes. Presumably this is because I've been constructing Rules programmatically rather than just reading them in from a file. Maybe I should stick with the latter.

Fixing a versioning problem with CUDA 2.3

In an earlier posting, I observed that CUDA 2.3 wants to use GCC 4.3, which is a problem for Fedora 11 and Ubuntu 9.10. I've been itching to upgrade my distribution on my NVIDIA Linux box, and particularly itching to move to Ubuntu. I found some help on Thomas Moelhave's blog. Thanks, Thomas!

In addition to his instructions, I needed to install some stuff.
sudo aptitude install freeglut3 \
   freeglut3-dev libXmu-dev libXi-dev
Once I did that and completed his instructions, everything worked great. The rest of my Ubuntu 9.10 installation is completely intact and happy.

Wednesday, February 17, 2010

Telomeres and aging

Recently I became aware of Sierra Sciences, a startup founded by William Andrews, previously of Geron. Andrews had done a lot of research on telomeres and telomerase.

Your cells have nuclei in them where your DNA is wadded up into packets called chromosomes. On the ends of the DNA strands there's a thing called a telomere. It protects the DNA from unravelling, like the little plastic tube on the end of your shoelace. Our telomeres shorten as we get older, and longer telomeres are strongly correlated with youth and vigor and health. There are many contributors to ageing but telomere length is currently regarded as one of the most urgent and one of the best understood.

Our reproductive cells do not suffer this effect. If we passed on shorter telomeres to our kids, they wouldn't live long, and they probably couldn't have kids of their own. To accomplish this, our reproductive cells produce stuff called telomerase which protects the telomeres from shortening. Here's the cool part: the gene for producing telomerase is present in ALL our cells, but it's only switched on in the reproductive cells. So there's a research push to find a telomerase activator that switches on the gene in all our cells. Sierra Sciences is one of the companies involved in this research.

You can buy a telomerase activator today, called TA-65. It's expensive, about $1500 per month, I think. But I haven't yet found any compelling evidence that it's a scam or a significant health risk. So I'm toying with the idea of trying it for a few months and see if I feel any different.

There is also a clinical test to measure the length of your telomeres. I know it exists but I don't know much more about it at the moment.

Friday, February 05, 2010

Playing with the Jena semantic web framework

I've begun tinkering with Jena, a semantic web framework written in Java. It embodies a lot of ideas and technologies that were once considered AI or expert systems, and which I neglected at the time (1980s).
The semantic web frames a body of knowledge as a collection of three-word sentences called triples. These can be diagrammed as a directed graph such as the one below.


The corresponding three-word sentences appear below, written in N3, a human-readable formal language used by the semantic web community.
@prefix :  <#> .
:Cat :has :Fur .
:Bear :has :Fur .
:Cat :is-a :Mammal .
:Bear :is-a :Mammal .
:Mammal :has :Vertebrae .
:Whale :is-a :Mammal .
:Whale :lives-in :Water .
:Mammal :is-a :Animal .
:Fish :is-a :Animal .
:Fish :lives-in :Water .
In Jena, a Model is one of these things. Having constructed it (or having loaded it from either a file on the internet or a file on your computer), you can apply rules that allow you to draw conclusions. So let's step through that proecess. First we need to read in the file.
private static final String baseUri =
    "file:///home/wware/wware-autosci/" +
    "semweb/java/simpleNet.n3#";
private static void modelReadFile(String filename,
                                  Model model) {
    try {
        File f = new File(filename);
        FileReader fr = new FileReader(f);
        model.read(fr, baseUri);
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }
}
To print the contents of a model, we can use the SPARQL query language, which looks a lot like SQL.
private static void printModel(Model model) {
    String queryString = 
        "SELECT ?x ?y ?z " +
        "WHERE {" +
        "    ?x ?y ?z . " +
        "}";
    Query query = QueryFactory.create(queryString);
    QueryExecution qe =
      QueryExecutionFactory.create(query, model);
    ResultSet results = qe.execSelect();
    ResultSetFormatter.out(System.out, results, query);
    qe.close();
}
and we'll call that method from our main method. I personally find it appalling that the graph above fails to recognize that fish have vertebrae, so we'll add a triple for that.
public static void main(String[] args) {
    Model model = ModelFactory.createDefaultModel();
    modelReadFile("simpleNet.rdf", model);
    model.createResource(baseUri + "Fish")
         .addProperty(model.createProperty(
                         baseUri + "has"),
                      model.createResource(
                         baseUri + "Vertebrae"));
    printModel(model);
}
and the RDF file that imports the model was translated from the N3 above, using CWM.

<rdf:RDF
xmlns="file:///home/wware/wware-autosci/semweb/java/simpleNet.n3#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="#Bear">
        <has rdf:resource="#Fur"/>
        <is-a rdf:resource="#Mammal"/>
    </rdf:Description>
    <rdf:Description rdf:about="#Cat">
        <has rdf:resource="#Fur"/>
        <is-a rdf:resource="#Mammal"/>
    </rdf:Description>
    <rdf:Description rdf:about="#Fish">
        <is-a rdf:resource="#Animal"/>
        <lives-in rdf:resource="#Water"/>
    </rdf:Description>
    <rdf:Description rdf:about="#Mammal">
        <has rdf:resource="#Vertebrae"/>
        <is-a rdf:resource="#Animal"/>
    </rdf:Description>
    <rdf:Description rdf:about="#Whale">
        <is-a rdf:resource="#Mammal"/>
        <lives-in rdf:resource="#Water"/>
    </rdf:Description>
</rdf:RDF>

There is a Model.write(OutputStream) method, so we can write a model directly to a file instead of stepping through the triples explicitly.

So how about some actual reasoning? We should be able to conclude that a cat is an animal, and has vertebrae. This will require that we define two rules of inference, rule1 ("is-a" is transitive) and rule2 (a member of a species has the things the species has).
String rules =
        "[ rule1: (?x " + baseUri+"is-a ?y) " +
                 "(?y " + baseUri+"is-a ?z) -> " +
                 "(?x " + baseUri+"is-a ?z) ] " +
        "[ rule2: (?x " + baseUri+"is-a ?y) " +
                 "(?y " + baseUri+"has ?z) -> " +
                 "(?x " + baseUri+"has ?z) ]";
Reasoner reasoner = new
    GenericRuleReasoner(Rule.parseRules(rules));
reasoner.setDerivationLogging(true);
InfModel inf =
  ModelFactory.createInfModel(reasoner, model);
printModel(inf);
Simply creating the InfModel is enough to draw all the relevant inferences. The Reasoner's setDerivationLogging method tells the model to remember the derivations that led to any new conclusions, and these derivations can be examined for debugging purposes.
PrintWriter out = new PrintWriter(System.out);
for (StmtIterator i = inf.listStatements();
             i.hasNext(); ) {
    Statement s = i.nextStatement();
    for (Iterator id = inf.getDerivation(s);
             id.hasNext(); ) {
        Derivation deriv = id.next();
        deriv.printTrace(out, true);
    }
}
out.flush();

Wednesday, January 27, 2010

Bayesian nets in RDF, and how to update them

I've banged my head on this for a couple of days and feel close to a solution. The graph looks like this.



Each random boolean variable gets a node, the causal relationship between them gets a node, and each variable gets a probability.
The math for updating probabilities is a little tricky, but in a fun and interesting way, so I enjoyed banging on that. At some point I'll tackle more involved cases where there aren't simply two random boolean variables, but that's the logistically simple case that exposes most of the concepts involved. Kinda like the Drosophila of Bayesian inference.

"""Let A and B be random boolean variables, with A being an
unobservable cause and B being an observable effect, and Pr(B|A) and
Pr(B|~A) are given. This might be the case if the situation has some
structure that dictates the conditional probabilities, like a 6.432
problem, or it just might be empirical.

From these we can compute probabilities for the four cases.
P11 = Pr(B^A) = Pr(A) Pr(B|A)
P10 = Pr(~B^A) = Pr(A) (1-Pr(B|A))
P01 = Pr(B^~A) = (1-Pr(A)) Pr(B|~A)
P00 = Pr(~B^~A) = (1-Pr(A)) (1-Pr(B|~A))

Treat (Pr(B|A), Pr(B|~A)) as one piece of information, and Pr(A) as a
separate independent piece of information. The first piece reflects
your beliefs about the machinery connecting A to B, and the second
reflects your beliefs about the likelihood of A, so these are
legitimately separate concerns.

If we observe that B=1, then we want to replace Pr(A) with our
previous estimate for Pr(A|B), which is given by our old numbers as
P11/Pr(A) = P11/(P11+P01), and this becomes our posterior probability
for A."""

class Link1:
    def __init__(self, PrBgivenA, PrBgivenNotA):
        self.PrBgivenA, self.PrBgivenNotA = PrBgivenA, PrBgivenNotA
    def updatePrA(self, PrA, bvalue):
        p11 = PrA * self.PrBgivenA
        p10 = PrA * (1. - self.PrBgivenA)
        p01 = (1. - PrA) * self.PrBgivenNotA
        p00 = (1. - PrA) * (1. - self.PrBgivenNotA)
        # print p11, p01, p10, p00
        assert 0.999 < p11 + p01 + p10 + p00 < 1.001
        assert PrA - 0.001 < p10 + p11 < PrA + 0.001
        if bvalue:
            # Pr(A|B)
            newPrA = p11 / (p11 + p01)
        else:
            # Pr(A|~B)
            newPrA = p10 / (p10 + p00)
        assert 0.0 <= newPrA <= 1.0
        return newPrA

link = Link1(0.6, 0.27)

PrA = 0.5
print PrA

for i in range(5):
    PrA = link.updatePrA(PrA, 0)
    print PrA

for i in range(10):
    PrA = link.updatePrA(PrA, 1)
    print PrA

Monday, January 18, 2010

Inference engines and automated reasoning

An inference engine is a computer program that reasons, using some form of knowledge representation.

This can be done with propositional logic or first-order logic, assuming each proposition is completely unambiguous and is either 100% true or 100% false. These simplistic engines are fun little exercises in programming but in real-world situations, reasoning usually needs to consider ambiguities and uncertainties. Instead of simply being true or false, propositions may be likely or unlikely, or their likelihood may be something to be tested or determined. Some elements of some propositions may be poorly defined.

In the unambiguous binary case, it's typical to express rules for generating new propositions as if-then rules with variables in them. We call these production rules because they are used to produce new propositions.
If X is a man, then X is mortal.
Given the statement "Socrates is a man", we
  • match the statement to the rule's IF clause
  • take note of all variable assignments: X=Socrates
  • plug assignments into the THEN clause: "Socrates is mortal"
Obviously this isn't rocket science, but even without handling uncertainty, it will still be useful if scaled to very large numbers of propositions, as in the semantic web.

How to handle uncertainty? This can be done by representing knowledge as a Bayesian network, a directed graph where the edges represent the influences and dependencies between random variables. There is a good tutorial about these online. Here's an example from the Wikipedia article where the probability of rain is an independent variable, and the sprinkler system is usually off if it's raining, and the grass can get wet from either rain or the sprinkler.

There are at least two open-source inference engines that work with Bayesian networks. One is SMILE, another is the OpenBayes library for the Python language. OpenBayes allows you to update the state of your knowledge with a new observation.
Suppose now that you know that the sprinkler is on and that it is not cloudy, and you wonder what's the probability of the grass being wet : Pr(w|s=1,c=0). This is called evidence...
ie.SetObs({'s':1,'c':0})
and then perform inference in the same way... The grass is much more likely to be wet because the sprinkler is on!
Here is a list of many more Bayesian network libraries, and another list. There is also a nice tutorial on Learning Bayesian Networks from Data, the process of taking a bunch of data and automatically discovering the Bayesian network that might have produced it. Another Bayesian reasoning system is BLOG.
Bayesian logic (BLOG) is a first-order probabilistic modeling language under development at MIT and UC Berkeley. It is designed for making inferences about real-world objects that underlie some observed data: for instance, tracking multiple people in a video sequence, or identifying repeated mentions of people and organizations in a set of text documents. BLOG makes it (relatively) easy to represent uncertainty about the number of underlying objects and the mapping between objects and observations.
Are production rule systems and Bayesian network systems mutually compatible? I don't yet know. Do Bayesian networks adequately represent all important forms of uncertainty or vagueness that one might encounter in working with real-world data? I don't know that either. Are there other paradigms I should be checking out? Probably.

Sunday, January 17, 2010

How hard is generating scientific hypotheses?

In the 1500s, a Danish astronomer named Tycho Brahe used Galileo's invention of the telescope to collect an enormous amount of numerical data describing the motion of the planets. Brahe's assistant Johannes Kepler studied that data and arrived at some interesting conclusions which we now know as Kepler's laws of planetary motion:
  1. The orbit of every planet is an ellipse with the Sun at a focus.
  2. A line joining a planet and the Sun sweeps out equal areas during equal intervals of time.
  3. The square of the orbital period of a planet is directly proportional to the cube of the semi-major axis of its orbit.
Kepler's laws were the starting point from which Isaac Newton formulated his law of gravitation, the inverse-square law that we all know and love.

We have here a three-step process: collect data, find mathematical patterns in the data, and create a theory that explains those patterns. Collecting data is simple in principle, and looking for mathematical patterns is also simple. Kepler's arithmetic was done by hand, but now we have computer programs (like Eureqa) which use genetic programming to find parsimonious mathematical formulas that fit sets of data. You can find Java applets on the web that demonstrate this idea.

So the first two steps aren't too hard. We can arrive rather easily at mathematical formulas that describe various experimentally measurable aspects of reality. That's a good thing. The hard job is the next step: finding theories or "likely stories" that explain why those formulas take whatever form they do. Sometimes the form of the math suggests a mechanism, because you've learned to associate elliptical orbits with conservative force fields which necessarily have an inverse-square law. (Hundreds of years after Newton, that is now a no-brainer.) But generally the problem is non-trivial and so far, as far as I'm aware, requires human insight.

Foresight Institute conference, Jan 16 and 17, 2010

The Foresight conference is just winding down. The talks were live-blogged over at NextBigFuture by Brian Wang who did a good job of concisely capturing the essentials. My own favorite talk was by Hod Lipson, who talked about a number of things, including something I find fascinating, the automation of science, about which I plan to blog more frequently.

I blogged too briefly in the past about the Adam project, but it deserves more. Reported in April 2009 by Ross King at Aberystwyth University. It used lab automation to perform experiments, and data mining to find patterns in the resulting data. Adam developed novel genomics hypotheses about S. cerevisiae yeast and tested them. Adam's conclusions were manually confirmed by human experimenters, and found to be correct. This was the first instance in human history where a machine discovered new scientific knowledge without human oversight.

Here is what I want to see computers doing in the coming years.
  • Look for patterns in data -- data mining
  • Propose falsifiable hypotheses
  • Design experiments to test those hypotheses
  • Perform the experiments and collect data
  • Confirm or deny hypotheses
  • Mine new data for new patterns, repeat the process
In the longer term, I want to see machine theoreticians and experimetalists collaborate with their human counterparts, both working in a scientific literature that is readable and comprehensible for both. This will require the development of a machine-parseable ontology (ideally a widely recognized standard) for sharing elements of the scientific reasoning process: data sets, hypotheses, predictions, deduction, induction, statistical inference, and the design of experiments.

So why do I want all this stuff? For one thing, it's interesting. For another, I am approaching the end of my life and I want to see scientific progress (and particularly medical progress) accelerate considerably in my remaining years. Finally, this looks to me like something where I can make some modestly valuable contribution to humanity with the time and energy I have left.

Sunday, December 20, 2009

Snow and more snow, December 2009

We got a dusting on December 6th.

In all this cold, I had occasion to heat some water for tea, watching the pot the whole while, and contrary to common belief, it boiled anyway.

We had a bunch of snow last night and this morning. It's just stopped in the last hour.
dscn0530 dscn0531

Wednesday, December 02, 2009

Honorable mention for BetterExplained.com website and its author, Kalid Azad

Kalid Azad's BetterExplained.com website has a lot of elegantly straightforward articles on interesting topics, many of them mathematical. He's doing some really interesting stuff, including a brilliant online calculator that you can use to embed calculations in web pages.

I'm not a Microsoft fanboy by any means, but I admire the video he made, using Windows 7 for humanitarian purposes.

In an article on happiness, Kalid includes a video of Steve Jobs's commencement address at Stanford. I'm really grateful that he included this.

Remember this is a guy who had a diagnosis of terminal cancer a year earlier.
Your work is going to fill a large part of your life, and the only way to be truly satisfied is to do what you believe is great work. And the only way to do great work is to love what you do. If you haven't found it yet, keep looking. Don't settle... All external expectations, all pride, all fear of embarrassment or failure -- these things just fall away in the face of death, leaving only what is truly important. Remembering that you are going to die is the best way I know to avoid the trap of thinking you have something to lose. You are already naked. There is no reason not to follow your heart.
I recently left a job where I wasn't following my heart. The money was good and the rest of the economy was bad, so I spent a lot of energy and effort trying to make it work, but there was no passion or fun or excitement. So this talk resonates for me now.

Tuesday, November 24, 2009

A few last comments on the Hackathon

This is a random collection of notes about things I learned at the Hackathon or shortly after.

The Hackathon was a lot of fun. Pamela Fox, one of the developers of Wave, gave a couple of presentations (1, 2).

There are some article on debugging Wave extensions for Robots and Gadgets. More handy debug tips here.

If you're writing a Robot in Java, a huge amount of support is available in Google's Eclipse plugins. Be sure to have Java 1.6 installed. If you already have an earlier version, you can install 1.6 on top of it, and just switch the preference within Eclipse. I think it was Windows -> Preferences -> Java -> something...

There is also an Eclipse plugin for Python called PyDev. I don't have a lot of experience with it.

One guy at the Hackathon wrote a robot in Groovy, a dynamic language that runs on the JVM. From what I could tell, it was working fine.

At the present time, Robots must be deployed to Google App Engine. This restriction will probably be relaxed, especially as non-Google Wave servers come on line.

App Engine doesn't run PHP (I had wanted to add Mediawiki to my web app) but if you really want PHP you can try Quercus.

App Engine has a great logging facility, accessible via https://appengine.google.com. You can put logging statements in Python code (import logging) or Java code (import java.util.logging) and both will dump info statements into the GAE log.

Google likes extensions to adhere to some design principles. These are designed to maximize the broad appeal of your Wave extension and to make it run faster. Google figures that slow ugly extensions will drive users away from their ads.

There was talk at the Hackathon (and I wasn't paying close enough attention) about people setting up Wave servers. If I set up my own server, can I issue my own invitations to people so that I'm not depending on Google providing invitations? That would rock.

More useful links:
Something that would be very valuable for robot development would be a lightweight Wave server simulator running in Python, probably using the classes in the Python Wave API. You'd want a way to simulate an on-going Wave conversation, and the simulator would send events to the robot.

Monday, November 16, 2009

Tinkering with Google Wave Gadgets and Robots

Google is hosting a Hackathon on Saturday focusing on Wave. For Wave, you can create Gadgets or Robots. I wanted to be able to monitor and control my house's non-existent burglar alarm and X10 appliances from inside a Wave. A Gadget is the best thing for that, which you'll see (entitled "Home Controller") on the right side of this blog.

Assuming my machine at home is running the client code (which doesn't really control anything, it just fakes it), you can use the command "feed the cat" with password "abcd" to change the cat's state from hungry to full, or "wait a minute" to change the cat from full to hungry. Be patient, it can take up to fifteen seconds before your command produces a visible result. The Gadget and the client code are both polling a teeny web service on App Engine which passes information between them, and the polling rate is not hasty. The Gadget's source is hosted on Google Code. A not-too-detailed description is here.

Robots require working directly with the Wave API, which Gadgets don't. Here is a Robot that will help you look stuff up in the Linux kernel source code, using the LXR website. To use the Robot in a wave, first add it to your Gmail contacts. Use the name "Lxreffy" and the email address "wware-lxreffy@appspot.com". Now go into Wave and search for the contact "Lxreffy", and you'll be able to invite it into any Wave. Then type "LXR: foobar" in a blip, and hit "Done", and it will append a series of web links to the blip showing where "foobar" appears in the kernel. Example screenshot below.



Another Robot uses SymPy to enable symbolic algebra to be done within a Wave. The "SymPy { ... }" piece is typed by the human user, and the Robot responds by adding the response from SymPy.


Rosie คือโปรแกรมคอมพิวเตอร์สำหรับการแปลภาษามนุษย์เขียนโดย Google. ลักษณะที่ฉันชอบไม่ได้งานดีมาก. ฉันจะใส่ข้อความภาษาไทยนี้บนบล็อกของฉัน.

Monday, October 12, 2009

Hacking CUDA and OpenCL on Fedora 10

I discovered Fedora 11 is not compatible with NVIDIA's CUDA toolkit (now on version 2.3; see note about driver version below) because the latter requires GCC 4.3 where Fedora 11 provides GCC 4.4. So I'll have to back down to Fedora 10. Here are some handy notes for setting up Fedora 10. I installed a number of RPMs to get CUDA to build.
sudo yum install eclipse-jdt eclipse-cdt \
freeglut freeglut-devel kernel-devel \
mesa-libGLU-devel libXmu-devel libXi-devel

The Eclipse stuff wasn't all necessary for CUDA but I wanted it.

In a comment to an earlier posting, Jesper told me about OpenCL, a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors. NVIDIA supports this and has an OpenCL implementation which required updating my NVIDIA drivers to version 190.29, more recent than the version 190.18 drivers on NVIDIA's CUDA 2.3 page. When I installed 190.29, it warned me that it was uninstalling the 190.18 drivers.

Python enthusiasts will be interested in PyOpenCL.

NVIDIA provides a lot of resources and literature for getting started with OpenCL.

Friday, October 09, 2009

Sampled profiling in Linux

A few years back I was doing some cross-platform work on a mostly-Python package with a bunch of math libraries. I happened to be jumping back and forth between Linux and Mac, and I discovered that the Mac had this magical performance measurement tool that, as far as I knew, didn't exist for Linux. How could that be? How could such a great idea, obviously amenable to open source, NOT have made it into the Linux world? I was flabbergasted.

That tool was a sampling profiler. It works like this. At regular intervals, you stop the processor and say, what are you doing? what process are you in? what thread are you in? what does your call stack look like? and you collect statistics on all this stuff, figure out the percentages of time spent in different places, and present it in an easily understandable visual tree format.

Thinking about it more, I saw that to make this work, you need to have built your code with debug enabled, so that all the symbols are present in the compiled binary. Most Linux software is built without debug symbols (or at least that was the common practice a few years ago) which is why this idea hadn't gotten traction in the Linux world.

So today I stumbled across a cool review of Fedora 11 Linux and near the bottom, it talks about some of the development work that Red Hat has been doing on the Eclipse IDE, including a sampling profiler called OProfile. There's a nice little video of a Red Hat engineer demonstrating OProfile. Very cool stuff. One of the most impressive things is that in addition to simply sampling at fixed time intervals, you can also choose to sample on events happening in the computer hardware, like when the processor drives data onto the front-side bus. Wow, yikes. I don't have a clear idea what that would help with, but I guess some performance issues can correlate to things happening in the hardware.

Here are some of my old notes I found on this topic. These notes are old and I didn't know about OProfile at the time, though OProfile originated in the same group at HP that did the QProf work described below. The old notes will be in italics to avoid confusion with more current information.

Finally, something the Mac has that Linux doesn't have

The Mac has this cool little utility called sample. Here's how it works. Suppose some program is running as process ID 1234. You type "sample 1234 60" in another terminal, and it interrupts that process 100 times per second for 60 seconds, getting a stack trace each time. You end up with a histogram showing the different threads and C call stacks, and it becomes very easy to figure out where the program is spending most of its time. It's very cool. It doesn't even significantly slow down the process under study.

I started looking to see if there was anything similar for Linux. There ought to be. This isn't so different from what GDB does. The closest thing I could find was something called pstack, which does this once, but doesn't seem so good at extracting the names of C routines. I've never seen pstack get a stack that was more than three calls deep. I also found a variant called lsstack.

I think the Mac can do this because Apple can dictate that every compile must have the -g switch turned on. The pstack utility came originally from the BSD world.

If there's ever a working sample utility for Linux, it will be a brilliant thing.

First you need a timer process that interrupts the target process. On each interrupt, you collect a stack trace. You do that by having an interrupt handler that gets the stack above it and throws it in some kind of data structure. In Mac OS, you can do all this without recompiling any of the target process code.

You run the timer for however long and collect all this stuff. You end up with a data structure with all these stacks in it. Now you can create a tree with histogram numbers for each node in the tree, where each node in the tree represents a program counter. Next you use map files to convert the program counters to line numbers in source files, and if you want brownie points, you even include the source line for each of them.

This should be feasible in Linux. Check out "man request_irq".

The only issue with making this work as well in Linux as it does on the Mac is that code that isn't built with "-g" will not have the debug information for pulling out function names and line numbers. Somebody (probably David Mosberger-Tang) has already done this project: