Wednesday, January 27, 2010

Bayesian nets in RDF, and how to update them

I've banged my head on this for a couple of days and feel close to a solution. The graph looks like this.



Each random boolean variable gets a node, the causal relationship between them gets a node, and each variable gets a probability.
The math for updating probabilities is a little tricky, but in a fun and interesting way, so I enjoyed banging on that. At some point I'll tackle more involved cases where there aren't simply two random boolean variables, but that's the logistically simple case that exposes most of the concepts involved. Kinda like the Drosophila of Bayesian inference.

"""Let A and B be random boolean variables, with A being an
unobservable cause and B being an observable effect, and Pr(B|A) and
Pr(B|~A) are given. This might be the case if the situation has some
structure that dictates the conditional probabilities, like a 6.432
problem, or it just might be empirical.

From these we can compute probabilities for the four cases.
P11 = Pr(B^A) = Pr(A) Pr(B|A)
P10 = Pr(~B^A) = Pr(A) (1-Pr(B|A))
P01 = Pr(B^~A) = (1-Pr(A)) Pr(B|~A)
P00 = Pr(~B^~A) = (1-Pr(A)) (1-Pr(B|~A))

Treat (Pr(B|A), Pr(B|~A)) as one piece of information, and Pr(A) as a
separate independent piece of information. The first piece reflects
your beliefs about the machinery connecting A to B, and the second
reflects your beliefs about the likelihood of A, so these are
legitimately separate concerns.

If we observe that B=1, then we want to replace Pr(A) with our
previous estimate for Pr(A|B), which is given by our old numbers as
P11/Pr(B) = P11/(P11+P01), and this becomes our posterior probability
for A."""

class Link1:
    def __init__(self, PrBgivenA, PrBgivenNotA):
        self.PrBgivenA, self.PrBgivenNotA = PrBgivenA, PrBgivenNotA
    def updatePrA(self, PrA, bvalue):
        p11 = PrA * self.PrBgivenA
        p10 = PrA * (1. - self.PrBgivenA)
        p01 = (1. - PrA) * self.PrBgivenNotA
        p00 = (1. - PrA) * (1. - self.PrBgivenNotA)
        # print(p11, p01, p10, p00)
        assert 0.999 < p11 + p01 + p10 + p00 < 1.001
        assert PrA - 0.001 < p10 + p11 < PrA + 0.001
        if bvalue:
            # Pr(A|B)
            newPrA = p11 / (p11 + p01)
        else:
            # Pr(A|~B)
            newPrA = p10 / (p10 + p00)
        assert 0.0 <= newPrA <= 1.0
        return newPrA

link = Link1(0.6, 0.27)

PrA = 0.5
print(PrA)

for i in range(5):
    PrA = link.updatePrA(PrA, 0)
    print(PrA)

for i in range(10):
    PrA = link.updatePrA(PrA, 1)
    print(PrA)

Monday, January 18, 2010

Inference engines and automated reasoning

An inference engine is a computer program that reasons, using some form of knowledge representation.

This can be done with propositional logic or first-order logic, assuming each proposition is completely unambiguous and is either 100% true or 100% false. These simplistic engines are fun little exercises in programming but in real-world situations, reasoning usually needs to consider ambiguities and uncertainties. Instead of simply being true or false, propositions may be likely or unlikely, or their likelihood may be something to be tested or determined. Some elements of some propositions may be poorly defined.

In the unambiguous binary case, it's typical to express rules for generating new propositions as if-then rules with variables in them. We call these production rules because they are used to produce new propositions.
If X is a man, then X is mortal.
Given the statement "Socrates is a man", we
  • match the statement to the rule's IF clause
  • take note of all variable assignments: X=Socrates
  • plug assignments into the THEN clause: "Socrates is mortal"
Obviously this isn't rocket science, but even without handling uncertainty, it will still be useful if scaled to very large numbers of propositions, as in the semantic web.
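The match-bind-substitute cycle above is easy to sketch in a few lines of Python. The triple representation here ('?'-prefixed tokens as variables) is my own illustration, not any particular engine's syntax.

```python
# A minimal production-rule step: match a fact against a rule's IF clause,
# collect variable bindings, and plug them into the THEN clause.

def match(pattern, statement, bindings=None):
    """Match a pattern like ('?X', 'is-a', 'man') against a statement.
    Returns the dict of variable bindings, or None on failure."""
    if len(pattern) != len(statement):
        return None
    bindings = dict(bindings or {})
    for p, s in zip(pattern, statement):
        if p.startswith('?'):
            if bindings.get(p, s) != s:   # a variable must bind consistently
                return None
            bindings[p] = s
        elif p != s:                      # literal tokens must match exactly
            return None
    return bindings

def substitute(template, bindings):
    """Plug variable bindings into a THEN clause."""
    return tuple(bindings.get(t, t) for t in template)

rule = {'if': ('?X', 'is-a', 'man'), 'then': ('?X', 'is', 'mortal')}
fact = ('Socrates', 'is-a', 'man')

bindings = match(rule['if'], fact)             # {'?X': 'Socrates'}
new_fact = substitute(rule['then'], bindings)
print(new_fact)                                # ('Socrates', 'is', 'mortal')
```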

How to handle uncertainty? This can be done by representing knowledge as a Bayesian network, a directed graph where the edges represent the influences and dependencies between random variables. There is a good tutorial about these online. Here's an example from the Wikipedia article where the probability of rain is an independent variable, and the sprinkler system is usually off if it's raining, and the grass can get wet from either rain or the sprinkler.

There are at least two open-source inference engines that work with Bayesian networks. One is SMILE, another is the OpenBayes library for the Python language. OpenBayes allows you to update the state of your knowledge with a new observation.
Suppose now that you know that the sprinkler is on and that it is not cloudy, and you wonder what's the probability of the grass being wet : Pr(w|s=1,c=0). This is called evidence...
ie.SetObs({'s':1,'c':0})
and then perform inference in the same way... The grass is much more likely to be wet because the sprinkler is on!
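For concreteness, the same query can be answered by brute-force enumeration in plain Python, with no library at all. The conditional probability tables below are the standard textbook numbers for the cloudy/sprinkler/rain/wet-grass network; I'm assuming they match the figure the post refers to.

```python
# Brute-force inference: sum the joint distribution over the unobserved
# variable (rain), with evidence s=1, c=0 fixed.
P_c = {1: 0.5, 0: 0.5}                       # Pr(cloudy)
P_s_given_c = {1: 0.1, 0: 0.5}               # Pr(sprinkler=1 | cloudy)
P_r_given_c = {1: 0.8, 0: 0.2}               # Pr(rain=1 | cloudy)
P_w_given_sr = {(1, 1): 0.99, (1, 0): 0.90,  # Pr(wet=1 | sprinkler, rain)
                (0, 1): 0.90, (0, 0): 0.0}

def joint(c, s, r, w):
    p = P_c[c]
    p *= P_s_given_c[c] if s else 1 - P_s_given_c[c]
    p *= P_r_given_c[c] if r else 1 - P_r_given_c[c]
    p *= P_w_given_sr[s, r] if w else 1 - P_w_given_sr[s, r]
    return p

# Pr(w=1 | s=1, c=0)
num = sum(joint(0, 1, r, 1) for r in (0, 1))
den = sum(joint(0, 1, r, w) for r in (0, 1) for w in (0, 1))
posterior = num / den
print(posterior)   # about 0.918 -- wet grass is very likely
```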
Here is a list of many more Bayesian network libraries, and another list. There is also a nice tutorial on Learning Bayesian Networks from Data, the process of taking a bunch of data and automatically discovering the Bayesian network that might have produced it. Another Bayesian reasoning system is BLOG.
Bayesian logic (BLOG) is a first-order probabilistic modeling language under development at MIT and UC Berkeley. It is designed for making inferences about real-world objects that underlie some observed data: for instance, tracking multiple people in a video sequence, or identifying repeated mentions of people and organizations in a set of text documents. BLOG makes it (relatively) easy to represent uncertainty about the number of underlying objects and the mapping between objects and observations.
Are production rule systems and Bayesian network systems mutually compatible? I don't yet know. Do Bayesian networks adequately represent all important forms of uncertainty or vagueness that one might encounter in working with real-world data? I don't know that either. Are there other paradigms I should be checking out? Probably.

Sunday, January 17, 2010

How hard is generating scientific hypotheses?

In the late 1500s, the Danish astronomer Tycho Brahe made meticulous naked-eye observations (the telescope had not yet been invented) and collected an enormous amount of numerical data describing the motion of the planets. Brahe's assistant Johannes Kepler studied that data and arrived at some interesting conclusions which we now know as Kepler's laws of planetary motion:
  1. The orbit of every planet is an ellipse with the Sun at a focus.
  2. A line joining a planet and the Sun sweeps out equal areas during equal intervals of time.
  3. The square of the orbital period of a planet is directly proportional to the cube of the semi-major axis of its orbit.
Kepler's laws were the starting point from which Isaac Newton formulated his law of gravitation, the inverse-square law that we all know and love.
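Out of curiosity, the third law is easy to verify numerically with modern values. Semi-major axes are in AU and periods in years, both from standard tables; in those units the constant comes out to 1.0.

```python
# Check Kepler's third law: T^2 / a^3 is the same constant for every planet.
planets = {
    "Mercury": (0.387, 0.241),    # (semi-major axis in AU, period in years)
    "Earth":   (1.000, 1.000),
    "Mars":    (1.524, 1.881),
    "Jupiter": (5.203, 11.862),
}
for name, (a, T) in planets.items():
    print(name, round(T**2 / a**3, 3))   # all close to 1.0
```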

We have here a three-step process: collect data, find mathematical patterns in the data, and create a theory that explains those patterns. Collecting data is simple in principle, and looking for mathematical patterns is also simple. Kepler's arithmetic was done by hand, but now we have computer programs (like Eureqa) which use genetic programming to find parsimonious mathematical formulas that fit sets of data. You can find Java applets on the web that demonstrate this idea.

So the first two steps aren't too hard. We can arrive rather easily at mathematical formulas that describe various experimentally measurable aspects of reality. That's a good thing. The hard job is the next step: finding theories or "likely stories" that explain why those formulas take whatever form they do. Sometimes the form of the math suggests a mechanism, because you've learned to associate elliptical orbits with conservative force fields which necessarily have an inverse-square law. (Hundreds of years after Newton, that is now a no-brainer.) But generally the problem is non-trivial and so far, as far as I'm aware, requires human insight.

Foresight Institute conference, Jan 16 and 17, 2010

The Foresight conference is just winding down. The talks were live-blogged over at NextBigFuture by Brian Wang who did a good job of concisely capturing the essentials. My own favorite talk was by Hod Lipson, who talked about a number of things, including something I find fascinating, the automation of science, about which I plan to blog more frequently.

I blogged too briefly in the past about the Adam project, but it deserves more. It was reported in April 2009 by Ross King's group at Aberystwyth University. It used lab automation to perform experiments, and data mining to find patterns in the resulting data. Adam developed novel genomics hypotheses about S. cerevisiae yeast and tested them. Adam's conclusions were manually confirmed by human experimenters and found to be correct. This was the first instance in human history where a machine discovered new scientific knowledge without human oversight.

Here is what I want to see computers doing in the coming years.
  • Look for patterns in data -- data mining
  • Propose falsifiable hypotheses
  • Design experiments to test those hypotheses
  • Perform the experiments and collect data
  • Confirm or deny hypotheses
  • Mine new data for new patterns, repeat the process
In the longer term, I want to see machine theoreticians and experimentalists collaborate with their human counterparts, both working in a scientific literature that is readable and comprehensible for both. This will require the development of a machine-parseable ontology (ideally a widely recognized standard) for sharing elements of the scientific reasoning process: data sets, hypotheses, predictions, deduction, induction, statistical inference, and the design of experiments.

So why do I want all this stuff? For one thing, it's interesting. For another, I am approaching the end of my life and I want to see scientific progress (and particularly medical progress) accelerate considerably in my remaining years. Finally, this looks to me like something where I can make some modestly valuable contribution to humanity with the time and energy I have left.

Sunday, December 20, 2009

Snow and more snow, December 2009

We got a dusting on December 6th.

In all this cold, I had occasion to heat some water for tea, watching the pot the whole while, and contrary to common belief, it boiled anyway.

We had a bunch of snow last night and this morning. It's just stopped in the last hour.

Wednesday, December 02, 2009

Honorable mention for BetterExplained.com website and its author, Kalid Azad

Kalid Azad's BetterExplained.com website has a lot of elegantly straightforward articles on interesting topics, many of them mathematical. He's doing some really interesting stuff, including a brilliant online calculator that you can use to embed calculations in web pages.

I'm not a Microsoft fanboy by any means, but I admire the video he made, using Windows 7 for humanitarian purposes.

In an article on happiness, Kalid includes a video of Steve Jobs's commencement address at Stanford. I'm really grateful that he included this.

Remember this is a guy who had a diagnosis of terminal cancer a year earlier.
Your work is going to fill a large part of your life, and the only way to be truly satisfied is to do what you believe is great work. And the only way to do great work is to love what you do. If you haven't found it yet, keep looking. Don't settle... All external expectations, all pride, all fear of embarrassment or failure -- these things just fall away in the face of death, leaving only what is truly important. Remembering that you are going to die is the best way I know to avoid the trap of thinking you have something to lose. You are already naked. There is no reason not to follow your heart.
I recently left a job where I wasn't following my heart. The money was good and the rest of the economy was bad, so I spent a lot of energy and effort trying to make it work, but there was no passion or fun or excitement. So this talk resonates for me now.

Tuesday, November 24, 2009

A few last comments on the Hackathon

This is a random collection of notes about things I learned at the Hackathon or shortly after.

The Hackathon was a lot of fun. Pamela Fox, one of the developers of Wave, gave a couple of presentations (1, 2).

There are some articles on debugging Wave extensions for Robots and Gadgets. More handy debug tips here.

If you're writing a Robot in Java, a huge amount of support is available in Google's Eclipse plugins. Be sure to have Java 1.6 installed. If you already have an earlier version, you can install 1.6 on top of it, and just switch the preference within Eclipse. I think it was Windows -> Preferences -> Java -> something...

There is also an Eclipse plugin for Python called PyDev. I don't have a lot of experience with it.

One guy at the Hackathon wrote a robot in Groovy, a dynamic language that runs on the JVM. From what I could tell, it was working fine.

At the present time, Robots must be deployed to Google App Engine. This restriction will probably be relaxed, especially as non-Google Wave servers come on line.

App Engine doesn't run PHP (I had wanted to add Mediawiki to my web app) but if you really want PHP you can try Quercus.

App Engine has a great logging facility, accessible via https://appengine.google.com. You can put logging statements in Python code (import logging) or Java code (import java.util.logging) and both will dump info statements into the GAE log.
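For the Python side, a minimal sketch of what that looks like. The handler name and its behavior are invented for illustration; only the logging calls themselves are the point, since anything at INFO level or above lands in the GAE log console.

```python
import logging

# With the default root logger, info/warning messages show up in the log.
logging.basicConfig(level=logging.INFO)

def feed_the_cat(password):
    # Hypothetical request handler, loosely modeled on the Home Controller
    # gadget described below -- not the actual app's code.
    if password != "abcd":
        logging.warning("rejected command, bad password: %r", password)
        return False
    logging.info("cat state changed: hungry -> full")
    return True
```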

Google likes extensions to adhere to some design principles. These are designed to maximize the broad appeal of your Wave extension and to make it run faster. Google figures that slow ugly extensions will drive users away from their ads.

There was talk at the Hackathon (and I wasn't paying close enough attention) about people setting up Wave servers. If I set up my own server, can I issue my own invitations to people so that I'm not depending on Google providing invitations? That would rock.

More useful links:
Something that would be very valuable for robot development would be a lightweight Wave server simulator running in Python, probably using the classes in the Python Wave API. You'd want a way to simulate an on-going Wave conversation, and the simulator would send events to the robot.

Monday, November 16, 2009

Tinkering with Google Wave Gadgets and Robots

Google is hosting a Hackathon on Saturday focusing on Wave. For Wave, you can create Gadgets or Robots. I wanted to be able to monitor and control my house's non-existent burglar alarm and X10 appliances from inside a Wave. A Gadget is the best fit for that; you'll see it (entitled "Home Controller") on the right side of this blog.

Assuming my machine at home is running the client code (which doesn't really control anything, it just fakes it), you can use the command "feed the cat" with password "abcd" to change the cat's state from hungry to full, or "wait a minute" to change the cat from full to hungry. Be patient, it can take up to fifteen seconds before your command produces a visible result. The Gadget and the client code are both polling a teeny web service on App Engine which passes information between them, and the polling rate is not hasty. The Gadget's source is hosted on Google Code. A not-too-detailed description is here.

Robots require working directly with the Wave API, which Gadgets don't. Here is a Robot that will help you look stuff up in the Linux kernel source code, using the LXR website. To use the Robot in a wave, first add it to your Gmail contacts. Use the name "Lxreffy" and the email address "wware-lxreffy@appspot.com". Now go into Wave and search for the contact "Lxreffy", and you'll be able to invite it into any Wave. Then type "LXR: foobar" in a blip, and hit "Done", and it will append a series of web links to the blip showing where "foobar" appears in the kernel. Example screenshot below.



Another Robot uses SymPy to enable symbolic algebra to be done within a Wave. The "SymPy { ... }" piece is typed by the human user, and the Robot responds by adding the response from SymPy.


Rosie is a computer program, written by Google, for translating human languages. It doesn't do a very good job. I'm putting this Thai text on my blog.

Monday, October 12, 2009

Hacking CUDA and OpenCL on Fedora 10

I discovered that Fedora 11 is not compatible with NVIDIA's CUDA toolkit (now at version 2.3; see the note about driver versions below) because the toolkit requires GCC 4.3 while Fedora 11 ships GCC 4.4. So I'll have to back down to Fedora 10. Here are some handy notes for setting up Fedora 10. I installed a number of RPMs to get CUDA to build.
sudo yum install eclipse-jdt eclipse-cdt \
freeglut freeglut-devel kernel-devel \
mesa-libGLU-devel libXmu-devel libXi-devel

The Eclipse stuff wasn't all necessary for CUDA but I wanted it.

In a comment to an earlier posting, Jesper told me about OpenCL, a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors. NVIDIA supports this and has an OpenCL implementation which required updating my NVIDIA drivers to version 190.29, more recent than the version 190.18 drivers on NVIDIA's CUDA 2.3 page. When I installed 190.29, it warned me that it was uninstalling the 190.18 drivers.

Python enthusiasts will be interested in PyOpenCL.

NVIDIA provides a lot of resources and literature for getting started with OpenCL.

Friday, October 09, 2009

Sampled profiling in Linux

A few years back I was doing some cross-platform work on a mostly-Python package with a bunch of math libraries. I happened to be jumping back and forth between Linux and Mac, and I discovered that the Mac had this magical performance measurement tool that, as far as I knew, didn't exist for Linux. How could that be? How could such a great idea, obviously amenable to open source, NOT have made it into the Linux world? I was flabbergasted.

That tool was a sampling profiler. It works like this. At regular intervals, you stop the processor and say, what are you doing? what process are you in? what thread are you in? what does your call stack look like? and you collect statistics on all this stuff, figure out the percentages of time spent in different places, and present it in an easily understandable visual tree format.
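Just to make the idea concrete, here's a toy in-process version in Python: a recurring timer signal interrupts the program and a handler records the Python-level call stack. This only samples the interpreter's own frames in one process, on a Unix system; the Mac tool does the same thing at the OS level across whole processes, but the principle is identical.

```python
import collections
import signal
import time

samples = collections.Counter()

def take_sample(signum, frame):
    # Walk the interrupted call stack, innermost frame first, and count it.
    stack = []
    while frame is not None:
        code = frame.f_code
        stack.append((code.co_filename, code.co_name, frame.f_lineno))
        frame = frame.f_back
    samples[tuple(stack)] += 1

# Fire roughly 100 times per second, like the Mac's `sample` utility.
signal.signal(signal.SIGALRM, take_sample)
signal.setitimer(signal.ITIMER_REAL, 0.01, 0.01)

def busy():
    total = 0
    for i in range(1_000_000):
        total += i * i
    return total

t0 = time.time()
while time.time() - t0 < 0.3:   # run some work while the sampler ticks
    busy()

signal.setitimer(signal.ITIMER_REAL, 0, 0)   # stop sampling
print(sum(samples.values()), "samples,", len(samples), "distinct stacks")
```

Turning the counted stacks into the histogram tree is then just bookkeeping over the `samples` dictionary.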

Thinking about it more, I saw that to make this work, you need to have built your code with debug enabled, so that all the symbols are present in the compiled binary. Most Linux software is built without debug symbols (or at least that was the common practice a few years ago) which is why this idea hadn't gotten traction in the Linux world.

So today I stumbled across a cool review of Fedora 11 Linux and near the bottom, it talks about some of the development work that Red Hat has been doing on the Eclipse IDE, including a sampling profiler called OProfile. There's a nice little video of a Red Hat engineer demonstrating OProfile. Very cool stuff. One of the most impressive things is that in addition to simply sampling at fixed time intervals, you can also choose to sample on events happening in the computer hardware, like when the processor drives data onto the front-side bus. Wow, yikes. I don't have a clear idea what that would help with, but I guess some performance issues can correlate to things happening in the hardware.

Here are some of my old notes I found on this topic. These notes are old and I didn't know about OProfile at the time, though OProfile originated in the same group at HP that did the QProf work described below. The old notes will be in italics to avoid confusion with more current information.

Finally, something the Mac has that Linux doesn't have

The Mac has this cool little utility called sample. Here's how it works. Suppose some program is running as process ID 1234. You type "sample 1234 60" in another terminal, and it interrupts that process 100 times per second for 60 seconds, getting a stack trace each time. You end up with a histogram showing the different threads and C call stacks, and it becomes very easy to figure out where the program is spending most of its time. It's very cool. It doesn't even significantly slow down the process under study.

I started looking to see if there was anything similar for Linux. There ought to be. This isn't so different from what GDB does. The closest thing I could find was something called pstack, which does this once, but doesn't seem so good at extracting the names of C routines. I've never seen pstack get a stack that was more than three calls deep. I also found a variant called lsstack.

I think the Mac can do this because Apple can dictate that every compile must have the -g switch turned on. The pstack utility came originally from the BSD world.

If there's ever a working sample utility for Linux, it will be a brilliant thing.

First you need a timer process that interrupts the target process. On each interrupt, you collect a stack trace. You do that by having an interrupt handler that gets the stack above it and throws it in some kind of data structure. In Mac OS, you can do all this without recompiling any of the target process code.

You run the timer for however long and collect all this stuff. You end up with a data structure with all these stacks in it. Now you can create a tree with histogram numbers for each node in the tree, where each node in the tree represents a program counter. Next you use map files to convert the program counters to line numbers in source files, and if you want brownie points, you even include the source line for each of them.

This should be feasible in Linux. Check out "man request_irq".

The only issue with making this work as well in Linux as it does on the Mac is that code that isn't built with "-g" will not have the debug information for pulling out function names and line numbers. Somebody (probably David Mosberger-Tang) has already done this project:

Tuesday, August 25, 2009

I blinked, and web apps advanced ten years

Back in the day, we all watched static HTML pages give way to CGI, followed by MySQL and Perl, followed by Java middleware. Back then I wrote NanoCAD, but client-side Java applets never got traction. Now we have Flash and Silverlight and AJAX. On the server side we have PHP, and on the client side, Javascript, which is totally unrelated to Java. Many modern browsers now support the canvas tag. Kinda makes me want to recast NanoCAD in more modern technology.

Simultaneously we now have netbooks - small cheap wifi-connected laptops that can run a decent web browser and not much more. Google and others are pitching the idea that all our previously shrink-wrapped apps can therefore live in the cloud. It's an intriguing idea.
When you're working in Java or Python, you can use an IDE like IntelliJ IDEA or IDLE. Is there anything like that for Javascript? If there is, I'm not aware of it.

Tuesday, July 21, 2009

Building a GPU machine

I've been reading lately about what NVIDIA has been doing with CUDA and it's quite impressive. CUDA is a programming environment for their GPU boards, available for Windows, Linux, and Mac. I am putting together a Linux box with an NVIDIA 9600GT board to play with this stuff. The NVIDIA board cost me $150 at Staples. Eventually I intend to replace it with a GTX280 or GTX285, which both have 240 processor cores to the 9600GT's 64. I purchased the following from Magic Micro, which was about $300 including shipping:
Intel Barebones #2


* Intel Pentium Dual Core E2220 2.4 GHz, 800FSB (Dual Core) 1024K
* Spire Socket 775 Intel fan
* ASRock 4Core1600, G31, 1600FSB, Onboard Video, PCI Express, Sound, LAN
* 4GB (2x2GB) PC6400 DDR2 800 Dual Channel
* AC 97 3D Full Duplex sound card (onboard)
* Ethernet network adapter (onboard)
* Nikao Black Neon ATX Case w/side window & front USB
* Okia 550W ATX Power Supply w/ 6pin PCI-E



I scavenged an old DVD-ROM drive and a 120-gig HD from an old machine, plus a keyboard, mouse, and 1024x768 LCD monitor. I installed Slackware Linux. I went to the CUDA download website and picked up the driver, the toolkit, the SDK, and the debugger.

This is the most powerful PC I've ever put together, and it was a total investment of just a few hundred dollars. For many years I've drooled at the prospect of networking a number of Linux boxes and using them for scientific computation, but now I can do it all in one box. It's a real live supercomputer sitting on my table, and it's affordable.

I am really starting to like NVIDIA. They provide a lot of support for scientific computation. They are very good about sharing their knowledge. They post lots of videos of scientific uses for their hardware.
NVIDIA's SDK includes several demos, some of them visually attractive: n-body, smoke particles, a Julia set, and a fluid dynamics demo. When running the n-body demo, the 9600GT claims to be going at 125 gigaflops.
A few more resources...

Friday, July 10, 2009

Moore's Law and GPUs

Way back when, Gordon Moore of Intel came up with his "law" that the number of transistors on a given area of silicon would double every 18 months. Currently chip manufacturers use a 45 nm process, and are preparing to move to a 32 nm process. There is an International Technology Roadmap for Semiconductors that lays all this out. As feature sizes shrink, we need progressively more exotic technology to fabricate chips. The ITRS timeframe for a 16 nm process is 2018, well beyond the expectation set by Moore's Law. There is a lot of punditry around these days about how Moore's Law is slowing down.

That's process technology. The other way to improve computer performance is processor architecture. As advances in process technology become more expensive and less frequent, architecture plays an increasingly important role. It's always been important, and in the last 20 years, microprocessors have taken on innovations that had previously appeared only in big iron, things like microcode, RISC, pipelining, cacheing of instructions and data, and branch prediction.

Every time process technology hits a bump in the road, it's a boost for parallelism. In the 1980s, a lot of start-ups tried to build massively parallel computers. I was a fan of Thinking Machines in Cambridge, having read Danny Hillis's PhD thesis. The premise of these machines was to make thousands of processors, individually fairly feeble, arranged in a broadcast architecture. The Transputer chip was another effort in a similar direction. One issue then was that people wanted compilers that would automatically parallelize code written for serial processors, but that turned out to be an intractable problem.

Given the slowing of Moore's Law these days, it's good to be a GPU manufacturer. The GPU guys never claim to offer a parallelizing compiler -- one that can be applied to existing code written for a serial computer -- instead they just make it very easy to write new parallel code. Take a look at NVIDIA's GPU Gems, and notice there's a lot of math and very little code. Because you write GPU code in plain old C, they don't need to spend a lot of ink explaining a lot of weird syntax.

Meanwhile the scientific community has realized over the last five years that despite the unsavory association with video games, GPUs are nowadays the most bang for your buck available in commodity computing hardware. Reading about NVIDIA's CUDA technology just makes me drool. The claims are that for scientific computation, an inexpensive GPU represents a speed-up of 20x to 100x over a typical CPU.

When I set out to write this, GPUs seemed to me like the historically inevitable next step. Having now recalled some of the earlier pendulum swings between process technology and processor architecture, I see that would be an overstatement of the case. But certainly GPU architecture and development will be important for those of us whose retirements are yet a few years off.

Wednesday, June 24, 2009

Whole-cell simulation

E-Cell is a software model of a biological cell. To use E-Cell (as it existed in its initial incarnation), you define a set of DNA sequences and chemical reactions, and the program iterates them over time, tracking the concentrations of different proteins and other goings-on in the cell.

That's very cool, but it does require you to tell it what kinds of reactions are possible, and their relative likelihoods. What you get out is concentrations, not information about individual protein molecules. That approach doesn't know the shapes of molecules, or anything detailed about how the molecules interact.
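That concentration-level style of simulation boils down to something like the following sketch: mass-action kinetics for a single made-up reaction A + B -> C, with an invented rate constant, integrated with simple Euler steps. It's not E-Cell's actual algorithm, just the flavor of it.

```python
# Toy concentration-level simulation: you specify the reactions and their
# rates up front, and all you get out is concentrations over time.
k = 0.5                        # made-up rate constant for A + B -> C
conc = {"A": 1.0, "B": 0.8, "C": 0.0}
dt = 0.01
for step in range(1000):       # simulate 10 time units
    rate = k * conc["A"] * conc["B"]   # mass-action rate law
    conc["A"] -= rate * dt
    conc["B"] -= rate * dt
    conc["C"] += rate * dt
print({species: round(c, 3) for species, c in conc.items()})
```

Note that mass is conserved (A + C and B + C stay constant), but nothing here knows the shape of a single molecule.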

E-Cell later evolved into an umbrella project that encompasses several different simulation subprojects, all with the goal of simulating an entire biological cell. So this isn't a limitation of E-Cell per se, just that particular simulation approach.

At the extreme end, one could imagine a full-blown molecular dynamics simulation of every atom in the cell. That would be great but for two problems. First, it would require a horrendous amount of computation. Cells have something like 10^15 atoms in them, and molecular dynamics simulations typically have time steps in the femtoseconds, while cellular activities frequently take place over tens of minutes.
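A few lines of arithmetic make the mismatch concrete (the ten-minute duration is just an illustrative cellular timescale):

```python
# Back-of-envelope numbers for brute-force whole-cell MD.
atoms = 1e15              # rough atom count for a cell
dt = 1e-15                # one femtosecond time step, in seconds
duration = 10 * 60        # ten minutes of cell time, in seconds
steps = duration / dt     # about 6e17 time steps
ops_per_step = atoms ** 2 # naive all-pairs force calculation
print(f"{steps:.1e} steps, {steps * ops_per_step:.1e} pair interactions")
```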

The second problem is making sense of the forest amid the trees. You're tracking every atom, but you're really curious about membranes and DNA and mitochondria and all those interesting little structures inside the cell. The computer needs to realize that this particular collection of atoms is a ribosome in this orientation, transcribing this particular base pair and grabbing that particular amino acid. So in addition to this monstrously impossible simulation, there are some tough problems in pattern recognition.

Nevertheless I hold out hope that whole-cell simulation on a scale considerably more detailed than E-Cell is still a worthwhile goal. I suspect that there is some middle road of simulation detail. Perhaps molecules and cell structures can be represented with rigid body mechanics, with surface representations of electric charge and other attractive and repulsive forces. Some are fairly rigid, but the mooshier ones can be represented by finite element models.

Why bother with whole-cell simulation? What can it do for humanity, or for you in particular? If a half-decent simulation could run on a single desktop computer (or maybe a small number of desktop computers) it would allow large numbers of researchers and hobbyists to perform biological experiments in silico. It might advance medical science quite rapidly. It might bring about cures for diseases that are currently untreatable. At the least it would provide a lot of educational opportunities that wouldn't otherwise exist.

Cool robot videos





Willow Garage is a Bay area robot company working on a platform intended to make it easier to build little robots for research and household use. There's a nice writeup about them on Foresight's website.
They are oriented to making an impact on the field of robotics rather than making an immediate profit. Cousins explained it in these terms: the average robotics PhD student spends 90% of his time building a robot and the remaining 10% extending the state of the art. If Willow Garage succeeds, those numbers will be reversed.
Neat. I'd love to get a chance to play with one.

Friday, May 29, 2009

Molecular modeling with Hadoop?

Hadoop is Apache's implementation of the MapReduce distributed computing scheme pioneered by Google. Amazon rents out Hadoop services on their cluster. It's fairly straightforward to set up Hadoop on a cluster of Linux boxes. Having myself a long-standing interest in distributed computing approaches to molecular modeling, I have been trying to figure out how Hadoop could be applied to do very large-scale molecular simulations.

MapReduce is great for problems where large chunks of computation can be done in isolation. The difficulty with molecular modeling is that every atom is pushing or pulling on every other atom on every single time step. The problem doesn't nicely partition into large isolated chunks. One could run a MapReduce cycle on each time step, but that would be horribly inefficient - on each time step, every map job needs as input the position and velocity of every atom in the entire simulation.
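To see the problem concretely, here's a caricature of one time step as a map phase and a reduce phase in plain Python. The 1-D force law is a toy stand-in, not a real MD potential; the thing to notice is that every map task takes the full position list as input.

```python
def map_step(i, positions):
    """Map task for atom i: needs ALL positions to compute its force."""
    xi = positions[i]
    force = 0.0
    for j, xj in enumerate(positions):
        if j != i:
            d = xj - xi
            force += d / abs(d) ** 3   # toy 1/r^2 attraction in one dimension
    return i, force

def reduce_step(positions, velocities, forces, dt=0.01):
    """Reduce phase: advance every atom one Euler step."""
    for i, f in forces:
        velocities[i] += f * dt
        positions[i] += velocities[i] * dt
    return positions, velocities

positions = [0.0, 1.0, 3.0]
velocities = [0.0, 0.0, 0.0]
for step in range(10):
    # One full MapReduce cycle per time step: the whole position list must
    # be shipped to every mapper, which is the inefficiency described above.
    forces = [map_step(i, positions) for i in range(len(positions))]
    positions, velocities = reduce_step(positions, velocities, forces)
```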

There are existing solutions like NAMD, which uses DPMTA for the long-range forces between atoms. For a cluster of limited size these are the appropriate tools. For large clusters with hundreds or thousands of machines, the rate of hardware failures becomes a consideration that can't be ignored.

MapReduce provides a few principles for working in the very-large-cluster domain:
  • Let your infrastructure handle hardware failures, just like the way the Internet invisibly routes around dead servers.
  • Individual machines are anonymous. You never write application code that directly addresses an individual machine.
  • Don't waste too much money trying to make the hardware more reliable. It won't pay off in the end.
  • Use a distributed file system that reliably retains the inputs to a task until that task has been successfully completed.

Could the tasks that NAMD assigns to each machine be anonymized with respect to which machine they run on, and the communications routed through a distributed filesystem like Hadoop's HDFS? Certainly it's possible in principle. Whether I'll be able to make any reasonable progress on it in my abundant spare time is another matter.

Thursday, May 28, 2009

More thinking about compensation models

I've been watching some of The Hunt for Gollum. The quality is quite good, and some of the camera effects are surprisingly clever.

I am interested in the question, how do you release a work so that it ultimately ends up in the public domain, but first make some money (perhaps a lot)? And how do you do this when your customer base is entirely aware that, in the long run, it will be available for free?

Back in the Eighties, Borland sold their Turbo Pascal development system for only $30 when competing products sold for hundreds, and did nothing in hardware or software to implement any sort of copy protection, while their competitors scrambled to field complicated but unsuccessful anti-piracy schemes. Borland's approach to copy protection was simply the honor system, plus making the product cheap enough that nobody minded paying for it.

The machinima Red vs. Blue is released serially as episodes. Those guys have an interesting approach:
Members of the official website can gain sponsor status for a fee of US$10 every six months. Sponsors can access videos a few days before the general public release, download higher-resolution versions of the episodes, and access special content released only to sponsors. For example, during season 5, Rooster Teeth began to release directors' commentary to sponsors for download. Additionally, while the public archive is limited to rotating sets of videos, sponsors can access content from previous seasons at any time.
They are smart guys who have been doing this for years now, so it's likely they've hit upon as optimal a solution as is practical. Of course it helps that they have a great product that attracts a lot of interest. They are following the Borland approach: sponsorship is inexpensive and there is no attempt at copy protection.

Computer performance vibes

There are a number of topics pertaining to computer performance that I want to learn more about. As an ex-EE, I should be keeping up with this stuff better.

Processors are fast, memory chips are slow. We put a cache between them so that the processor need not go out to memory on every read and write. There is a dense body of thought about cache design and optimization. I might blog about this stuff in future. It's a kinda heavy topic.

One way to make processors run really fast is to arrange steps in a pipeline. The CPU reads instruction one from instruction memory, and then it needs to read something from data memory, do an arithmetic operation on it, and put the result in a register. While reading from data memory, the CPU is simultaneously reading instruction two. While doing arithmetic, it's reading instruction three, and also doing the memory read for instruction two. And so forth, so that the processor is chopped up into a sequence of sub-processors, each busy all the time.
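The payoff is easy to quantify with a simple model (my own back-of-envelope sketch, ignoring stalls and hazards): a pipelined machine takes one pipeline-depth's worth of cycles to fill up, then retires one instruction per cycle after that.

```python
def cycles(n_instructions, n_stages, pipelined=True):
    """Clock cycles to retire n_instructions on a CPU with n_stages
    pipeline stages. Pipelined: the first instruction takes n_stages
    cycles to drain through, then one more completes every cycle.
    Unpipelined: each instruction runs start-to-finish by itself."""
    if pipelined:
        return n_stages + n_instructions - 1
    return n_stages * n_instructions

# A 5-stage pipeline running 1000 instructions:
#   unpipelined: cycles(1000, 5, pipelined=False)  ->  5000 cycles
#   pipelined:   cycles(1000, 5)                   ->  1004 cycles
```

So for long instruction streams the speedup approaches the pipeline depth, which is exactly why everything that disrupts the pipeline (see below) matters so much.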



Apple has a nice, more detailed discussion here.
But there is a complication with pipelining. Some of these instructions are branch instructions, which means that the next instruction could be either of two different ones. That's potentially a mess, because you've already got the pipeline full of stuff when you discover whether or not you're taking the branch, and you might find that the instructions you fetched were the wrong ones, so you have to go back and do all those same operations with a different sequence of instructions. Ick.
The work-around is branch prediction. The CPU tries to figure out, as accurately as it can, which sequence it will end up going down, and if all goes well, does that early enough to fill the pipeline correctly the first time. It doesn't have to be perfect, but it should try to guess correctly as often as possible.
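One of the simplest real prediction schemes is a two-bit saturating counter kept per branch. This sketch isn't modeled on any particular CPU; it just shows the idea: the predictor needs two consecutive mispredictions to change its mind, so the single not-taken branch at the bottom of a loop doesn't wreck its prediction for the next run through the loop.

```python
class TwoBitPredictor:
    """Two-bit saturating counter: states 0-1 predict not-taken,
    states 2-3 predict taken. From a 'strong' state, one
    misprediction weakens the counter but doesn't flip the
    prediction; it takes two misses in a row."""
    def __init__(self):
        self.state = 2  # start weakly taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

# A loop branch: taken nine times, then not-taken once at loop exit.
p = TwoBitPredictor()
correct = 0
history = [True] * 9 + [False]
for actual in history:
    correct += (p.predict() == actual)
    p.update(actual)
# correct ends up at 9 out of 10: only the loop exit is mispredicted.
```

Real CPUs layer much fancier machinery on top (branch target buffers, history-indexed tables), but the two-bit counter is the canonical starting point.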
There are a couple more things they're doing these days. One is to perform several memory transfers per clock cycle. Another is something Intel calls hyper-threading, where some of the CPU's registers are duplicated, allowing it to behave like two separate CPUs. This can be a win if one half is stalled waiting for a memory access; the other half just plows ahead.
That's all the vaguely intelligent stuff I have to say on this topic at the moment. Maybe I'll go into more detail in future, no promises.

Friday, May 01, 2009

Fan-made movie: The Hunt for Gollum

The Hunt for Gollum is a 40-minute high-def movie made by fans of the Lord of the Rings trilogy in general, and the Peter Jackson movies in particular. The trailers look beautiful, the cinematography looks about as good as the three movies. This is being done on a purely non-profit basis and the entire movie will be released free to the Internet on Sunday, May 3rd.

I kinda wish these guys had tried to make money with this, for a few reasons. First, they should be rewarded for such a monumental effort. No doubt many of the primary organizers will get their pick of sweet jobs, just as the primary developers of Apache, Linux, Python, etc. have gone on to sweet jobs after releasing useful free software, but other participants might have gotten some compensation for their time and effort.

Second, there was an opportunity here to experiment with compensation models whose endgame is the release of a work into the public domain. I've often wondered whether a big movie could be made by an independent group, released to the public domain, and still bring in significant money. My first idea was a ransom model: each frame of the movie would be encrypted and distributed, and encryption keys for frames or groups of frames would be published as various fractions of the total desired donation were reached. There would probably be a clause that the entire key set would be released unconditionally at some future date.

I think I have a better idea on further reflection. Before the public-domain release, find some buyers who are willing to pay for the privilege of having the movie before the big release. The buyers need to be informed that a public-domain release will occur and on what date, so that they understand that the window to make any money off the movie will be limited.

Another possibility is a ransom model with a linearly-decreasing donation threshold, with the public-domain release scheduled for when the donation threshold reaches zero. If the total donations cross the threshold before then, the release occurs at that time.
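The declining-threshold scheme is easy to state precisely. Here's a sketch of the release rule; all the numbers in the example are invented for illustration, and the function name and parameters are my own, not from any existing scheme.

```python
def release_day(threshold0, days_to_zero, daily_donations):
    """Day the work goes public-domain: the first day the cumulative
    donations meet the linearly declining threshold, or the day the
    threshold reaches zero, whichever comes first."""
    total = 0.0
    for day in range(days_to_zero + 1):
        threshold = threshold0 * (1 - day / days_to_zero)
        if total >= threshold:
            return day
        # donations received during this day count toward later checks
        total += daily_donations[day] if day < len(daily_donations) else 0.0
    return days_to_zero

# $1000 ransom declining to zero over 100 days, $20/day in donations:
# the running total crosses the falling threshold on day 34.
```

The nice property is that the release date is bounded: generous donors pull it earlier, but even with zero donations the work goes public on schedule.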

Anyway, kudos to the people who made "Hunt for Gollum", thanks for your efforts, and I am eagerly looking forward to seeing the movie.

Wednesday, April 22, 2009

Machines doing actual science, not just lab work

Here's the press release: Robot scientist becomes first machine to discover new scientific knowledge

In an earlier posting, I discussed the idea of computers participating in the reasoning process of the scientific method. There are, as far as I can see, two fields that are applicable to this. One is machine learning, where a computer studies a body of data to find patterns in it. When done with statistical methods, this is called data mining. The other is automated reasoning such as is done with semantic web technology.

So I was quite interested to see the news story linked above. Researchers in the UK have connected a computer to some lab robotics and developed a system that was able to generate new scientific hypotheses about yeast metabolism, and then design and perform experiments to confirm the hypotheses.

This is important because there will always be limits to what human science can accomplish. Humans are limited in their ability to do research, requiring breaks, sleep, and vacations. Humans are limited in their ability to collaborate, because of personality conflicts, politics, and conflicting financial interests. Human talent and intelligence are limited; the Earth is not crawling with Einsteins and Feynmans.

That's obviously not to say that computers would have an unlimited capacity to do science. But their limits would be different, and their areas of strength would be different, and science as a combined effort between humans and computers would be richer and more fruitful than either alone.

I still think it's important to establish specifications for distributing this effort geographically. I would imagine it makes sense to build this stuff on top of semantic web protocols.

I like the idea that with computer assistance, scientific and medical progress might greatly accelerate, curing diseases (hopefully including aging) and offering solutions to perennial social problems like boom-and-bust economic cycles. Then we could all live in a sci-fi paradise.

Tuesday, April 21, 2009

Obama talks about trains

I voted for Obama. It's great that he's the first black president, but it's not the most important thing to me. It's gratifying that he represents a big change from the previous administration. But for me, the real thing with Obama is, every time I hear him talk, he sounds like he's actually thinking. You sometimes see him pausing to think during press conferences. It's been so long overdue to have somebody in the White House who can sustain a thought process. So every time I hear him talk, I get another little bump of good feeling about him. Thank goodness he shares so many of my values; in the other camp he'd be a significant danger.

I don't know whether his economic policies will succeed. I hope so.
The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood; who strives valiantly; who errs, who comes short again and again, because there is no effort without error and shortcoming; but who does actually strive to do the deeds; who knows great enthusiasms, the great devotions; who spends himself in a worthy cause; who at the best knows in the end the triumph of high achievement, and who at the worst, if he fails, at least fails while daring greatly, so that his place shall never be with those cold and timid souls who neither know victory nor defeat.
— Theodore Roosevelt
Alright, enough gushing. But I really do love having this guy as my president.



A few days ago, Obama and Biden presented a vision of the future of railroads in America. I think it's pretty damn cool. I live in the Northeast Corridor where train service is the best in the country, and I haven't taken the train anywhere since college thirty-mumble years ago. I'm not a big train enthusiast. But I think this is the kind of thing that can stimulate national enthusiasm, not in a trivial meaningless way, but toward a goal that creates jobs and opportunities for new businesses that create more jobs.

Wednesday, February 11, 2009

Dr. Drexler's blog

Dr. Eric Drexler founded the field of advanced nanotechnology with a 1981 paper in the Proceedings of the National Academy of Sciences, and his book Engines of Creation published in 1986. These two publications laid the intellectual foundation for a complete revision of human manufacturing technology. Like any major shift in technology, there are risks to be aware of, but the promise of advanced nanotechnology is vast: clean cheap manufacturing processes for just about anything you can imagine, products that seem nearly magical by today's standards, medical instruments and treatments far more advanced than today's medicine.

Dr. Drexler has continued to work in the field for over twenty years, promoting research into developmental pathways and awareness of the potential risks. His thoughts on nanotechnology (and technology in general) are unique. With the publication of his Metamodern blog, these are now publicly available. His postings cover a broad range of topics, ranging from the books he's been reading lately to common and misleading errors in molecular animations to his most recent observations and insights on developmental pathways to advanced technologies.

Wednesday, January 14, 2009

Can computers do scientific investigation?

I came across a 2001 paper in Science (PDF) recently that lines up with some thinking I'd been doing myself. The web is full of futuristic literature that envisions man's intellectual legacy being carried forward by computers at a greatly increased pace; this is one of the ideas covered under the umbrella term "technological singularity".

In machine learning there are lots of approaches and algorithms relevant to the scientific method. The ML folks have long been working on the problem of taking a mass of data and searching it for organizing structure, which is an important part of how you would formulate a hypothesis. You would then design experiments to test the hypothesis. If you wanted to automate everything completely, you'd run the experiments in a robotic lab. Conceivably, science could be done by computers and robots without any human participation, and that's what the futurists envision.

The Science paper goes into pretty deep detail about the range and applicability of machine learning methods, as things stood in 2001. I find ML an interesting topic, but I can't claim any real knowledge about it. I'll assume that somebody somewhere can write code to do the things claimed by the paper's authors. It would be fascinating to try that myself some day.

To bring this idea closer to reality, what we need is a widely accepted machine-readable representation for hypotheses, experiments, and experimental results. Since inevitably humans would also participate in this process, we need representations for researchers (human, and possibly machine) and ratings (researcher X thinks hypothesis Y is important, or unimportant, or likely to be true, or likely to be false). So I have been puttering a little bit with some ideas for an XML specification for this sort of ontology.

Specifying experiments isn't that tricky: explain what equipment and conditions and procedure are required, and explain where to look for what outcome, and say which hypotheses are supported or invalidated depending on the outcome. Experimental results are likewise pretty simple. Results should refer to the experiments under test, identifying them in semantic web style with a unique permanently-assigned URI.
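To make that concrete, here's a hypothetical sketch of what an experiment record and a result record might look like. Every element name, attribute, and URI here is invented purely for illustration; this is not part of any existing standard.

```xml
<!-- Hypothetical sketch: all names and URIs are invented for illustration. -->
<experiment uri="http://example.org/experiments/yeast-growth-42">
  <requires>incubator, growth medium, spectrophotometer</requires>
  <procedure>Culture strain X at 30 C for 24 h; measure optical density.</procedure>
  <outcome observe="OD600">
    <supports hypothesis="http://example.org/hypotheses/17" if="OD600 &gt; 0.5"/>
    <invalidates hypothesis="http://example.org/hypotheses/17" if="OD600 &lt;= 0.5"/>
  </outcome>
</experiment>

<result experiment="http://example.org/experiments/yeast-growth-42"
        researcher="http://example.org/people/alice">
  <measurement name="OD600" value="0.62"/>
</result>
```

Note the semantic-web flavor: the experiment and the researcher each get a permanent URI, so results, ratings, and hypotheses can all refer to them unambiguously.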

The tricky part is an ontology for scientific hypotheses. You need a machine-readable language flexible enough to express complex scientific ideas, and that's potentially challenging. Besides, some of these ideas are naturally expressed in ways humans grasp easily but machines don't, for instance almost anything involving images.

Nevertheless an XML specification for describing hypotheses, experiments and results in a machine-readable way would be very interesting. I'm inclined to do some tinkering with all that, in my ridiculously abundant free time. Maybe I'll manage it.

Monday, December 29, 2008

Graphene memory device at Rice University

James Tour and colleagues at Rice University have demonstrated a switch (described in Nature Materials) composed of a layer of graphite about ten atoms thick. An array of such switches can be built in three dimensions, offering storage densities far exceeding what we now see in hard disks and flash memory USB widgets. The switch has been tested over 20,000 switching cycles with no apparent degradation. The abstract of the Nature Materials article reads:
Transistors are the basis for electronic switching and memory devices as they exhibit extreme reliabilities with on/off ratios of 10⁴–10⁵, and billions of these three-terminal devices can be fabricated on single planar substrates. On the other hand, two-terminal devices coupled with a nonlinear current–voltage response can be considered as alternatives provided they have large and reliable on/off ratios and that they can be fabricated on a large scale using conventional or easily accessible methods. Here, we report that two-terminal devices consisting of discontinuous 5–10 nm thin films of graphitic sheets grown by chemical vapour deposition on either nanowires or atop planar silicon oxide exhibit enormous and sharp room-temperature bistable current–voltage behaviour possessing stable, rewritable, non-volatile and non-destructive read memories with on/off ratios of up to 10⁷ and switching times of up to 1 μs (tested limit). A nanoelectromechanical mechanism is proposed for the unusually pronounced switching behaviour in the devices.
It will be several years before memories based on these switches are available for laptops and desktops, but it's a cool thing. To my knowledge, the mechanism is not yet known, so there may be some interesting new science involved as well.

Picasa images in blog posts

Here is an example of embedding a Picasa image in a Blogspot posting. I am including this to assist another blogger who is doing some interesting RepRap-related work but who's had a bit of trouble with Blogspot. To do this, I got the URL for the Picasa image by right-clicking on "View image" in Firefox so that I only had the image in the browser, and I used that in the "Add Image" thing in the "New post" top bar menu.

The HTML for this post (in part) looks like this:
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://lh6.ggpht.com/_Wq8BeMor5IQ/SVhFkzHai5I/AAAAAAAAACs/bnOuWjpyhIw/s640/SilicateGlueIngredients.jpg"><img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer; width: 267px; height: 200px;" src="http://lh6.ggpht.com/_Wq8BeMor5IQ/SVhFkzHai5I/AAAAAAAAACs/bnOuWjpyhIw/s640/SilicateGlueIngredients.jpg" alt="" border="0" /></a>Here is an example of embedding a Picasa image...

I hope this helps. It shouldn't be necessary to study the HTML. If the Blogspot controls are cooperative, it should be generated automatically. If they don't cooperate, the HTML could be cut and pasted.

Tuesday, December 23, 2008

Encouraging news about mechanosynthesis

Yesterday there was a very encouraging posting (by guest blogger Tihamer Toth-Fejel) on the Responsible Nanotechnology blog, regarding recent goings-on with mechanosynthesis. What the heck is mechanosynthesis? It is the idea that we will build molecules by putting atoms specifically where we want, rather than leaving them adrift in a sea of Brownian motion and random diffusion. Maybe not atoms per se, maybe instead small molecules or bits of molecules (a CH3 group here, an OH group there) with the result that we will build the molecules we really want, with little or no waste. The precise details about how we will do this are up for a certain amount of debate. We used to talk about assemblers, now we talk about nanofactories, but the idea of intentional design and manufacture of specific molecules remains.

The two items of real interest in the CRN blog posting are these.

First, Philip Moriarty, a scientist in the UK, has secured a healthy chunk of funding to do experimental work to validate the theoretical work done by Ralph Merkle and Rob Freitas in designing tooltips and processes for carbon-hydrogen mechanosynthesis, with the goal of being able to fabricate bits of diamondoid that have been specified at an atomic level. If all goes well, writes Toth-Fejel:
Four years from now, the Zyvex-led DARPA Tip-Based Nanofabrication project expects to be able to put down about ten million atoms per hour in atomically perfect nanostructures, though only in silicon (additional elements will undoubtedly follow; probably taking six months each).
Second is that people are now starting to use small machines to build other small machines, and to do so at interesting throughputs. An article at Small Times reports:
Dip-pen nanolithography (DPN) uses atomic force microscope (AFM) tips as pens and dips them into inks containing anything from DNA to semiconductors. The new array from Chad Mirkin’s group at Northwestern University in Evanston, Ill., has 55,000 pens - far more than the previous largest array, which had 250 pens.
So there are two take-home messages here. First, researchers are getting ready to work with the large numbers of atoms needed to build anything of reasonable size in a reasonable amount of time. Second, this stuff is actually happening rather than remaining a point of academic discussion.

Toth-Fejel writes:
What happens when we use probe-based nanofabrication to build more probes? ...What happens when productive nanosystems get built, and are used to build better productive nanosystems? The exponential increase in atomically precise manufacturing capability will make Moore’s law look like it’s standing still.
Interesting stuff.

Monday, December 08, 2008

Cui's work on cancer with granulocytes



Granulocytes are a particular sort of white blood cell. Read the original New Scientist article or a related article in Science Daily. See a video of granulocytes attacking cancer cells. The video above is a talk by the primary investigator, Zheng Cui. I learned about this in stumbling across the fact that Chris Heward is seeking granulocyte donors.

Zheng Cui at Wake Forest University School of Medicine in Winston-Salem, North Carolina, took blood samples from more than 100 people and mixed their granulocytes with cervical cancer cells. While granulocytes from one individual killed around 97 per cent of cancer cells within 24 hours, those from another healthy individual only killed around 2 per cent of cancer cells. Average cancer-killing ability appeared to be lower in adults over the age of 50 and even lower in people with cancer. It also fell when people were stressed, and at certain times of the year. "Nobody seems to have any cancer-killing ability during the winter months from November to April," says Cui.

Elsewhere, Cui wrote: "In 1999, we encountered a unique mouse that refused to succumb to repeated challenges with lethal cancer cells that uniformly killed all other laboratory mice, even at much lower doses. Further studies of this phenotype reveal that this unusual cancer resistance is inheritable and entirely mediated by the macrophages and neutrophils of the innate immunity. Transfer of leukocytes with this high level of cancer-killing activity (CKA) from these cancer-resistant mice cures the most aggressive advanced cancers in other mice without any side effect. Most surprisingly, a similar activity of killing cancer cells was discovered in the granulocytes and monocytes of some healthy people." When applied clinically, this is called LIFT, or "leukocyte infusion therapy".

Cui readily admits that he has not yet done much to explore the precise mechanisms involved; for the present, he is more interested in getting the treatment through clinical trials and into clinical practice. Perhaps partly as a result, he has gotten very little support from the medical community, and it's been difficult to secure funding for clinical trials.

Friday, December 05, 2008

Adventures in protein engineering

Proteins are a good material to consider for an early form of rationally designed nanotechnology. They are cheap and easy to manufacture, thoroughly studied, and they can do a lot of different things. Proteins are responsible for the construction of all the structures in your body, the trees outside your window, and most of your breakfast.

Why don't we already have a busy protein-based manufacturing base? Because the necessary technologies have arisen only in the last couple of decades, and because older technologies already have a solid hold on the various markets that might otherwise be interested in protein-based manufacturing. Finally, most researchers working with proteins aren't thinking about creating a new manufacturing base. But people in the nanotech community are thinking about it.

One of the classical scientific problems involving proteins is the "protein folding problem". Every protein is a sequence of amino acids. There are 20 different amino acids, which are strung together by a ribosome to create the protein. As the amino acids are strung together, the protein starts folding up into a compact structure. The "problem" with folding is that for any possible sequence of amino acids, it's not always possible to predict how it will fold up, or even whether it will always fold up the same way each time.

But maybe you don't need a solution for all possible sequences. Maybe you can limit yourself to just the sequences that are easy to predict. People have been studying proteins for a long time and it's easy to put together a much shorter list of proteins whose foldings are known. Discard any proteins that sometimes fold differently, to arrive at a subset of proteins whose foldings are well known and reliable.

The next issue is extensibility. Having identified a set of proteins whose foldings are easily predictable, would it be possible to use that knowledge to predict the foldings of larger novel amino acid sequences? A trivial analogy would be that if I know how to pronounce "ham" and I know how to pronounce "burger", then I should know how to pronounce "hamburger". A better analogy would be Lego bricks or an Erector set, where a small alphabet of basic units can be used to construct a vast diversity of larger structures.

If we can build a large diversity of big proteins and predict their foldings correctly, we're on to something. Then we can design things with parts that move in predictable ways. Some proteins (like the keratin in your fingernails or a horse's hooves) have a good deal of rigidity, and we can think about designing with gears, cams, transmissions, and other such stuff.

Thursday, November 20, 2008

Gustav Mahler's Symphony No. 4 in G major

http://en.wikipedia.org/wiki/Symphony_No._4_(Mahler)
performed by the Vienna Philharmonic Orchestra (Wiener Philharmoniker), conducted by Leonard Bernstein
  • 1st movement (1, 2)
  • 2nd movement (1)
  • 3rd movement (1, 2, 3)
  • 4th movement (1)

Friday, November 07, 2008

Cell-level simulation and hobbyist participation

There are a few simulators available for biological cells and processes.
A lot of the important things that happen in medicine are happening at the cellular level. Cell-level simulators might provide a way for large numbers of hobbyist medical researchers to construct and test hypotheses. The most promising hypotheses might be testable in real biology laboratories, and the results could be fed back to improve the accuracy of the simulators.

I'm not sure this would be an effective strategy for hastening the pace of medical progress. My intuition is biased because I've spent the last fifteen years working with open-source software (Linux, Apache, etc). I recognize that competition and profit are also powerful forces driving the rate of innovation, and that these seem to work best when people aren't sharing information so readily. The software/internet world has seen lots of progress in the last ten or twenty years, and it seems that a mixed environment with both open-source and closed-source approaches has pushed things along well.

Software is difficult but biology is much more difficult. At least it looks that way from my software engineer's point of view. The depth of expertise required for meaningful contribution to medical knowledge will likely exclude most would-be contributors. I don't know what to do about that. Perhaps cell simulators and on-line information can make that expertise more accessible.

Participation by hobbyists has become a very big part of the astronomy community. Maybe there is a legitimate place for hobbyists in the field of medical research.

Thursday, November 06, 2008

First post

Here is today's nifty piece of medical progress, an advance in the fight against cancer. A couple years ago, an unfortunate woman died of acute myelogenous leukemia, leaving behind samples of her cells, some healthy and some cancerous. A team at Washington University in St. Louis was able to sequence the DNA from the cells and compare her healthy DNA to her cancerous DNA. This became possible because the price of DNA sequencing equipment has come down by a very large factor in recent years.

They identified ten point mutations that differentiated the sick cells from the healthy ones. Two of the mutations were already known from earlier research, the other eight mutations were previously unknown. The team is continuing to study differences in the non-coding DNA as well, and they are also preparing to apply the same sequencing methodology to other cancers.

Because they were using DNA samples all from the same person, there would be very few differences among the healthy cells, just the infrequent cell-to-cell mutations that might occur in an average healthy person's body. So they had a good solid statistical baseline that made the ten cancer-related mutations really clear.

It may be years before this translates into clinical practice that saves lives. But it's nevertheless an important advance. It's something that has never been done before, and it does bring to light a few new facts. It looks strongly like point mutations are the cause of at least some, possibly all, cancers. We strongly suspected that before but this almost proves it. One of the ten mutations was present in only a fraction of the cancerous cells, suggesting that the mutations typically occur in a particular sequence, with the last one finally making the cell dangerous.

I'm interested in what social and economic factors could most hasten the rate of medical progress. My reason for this interest is simple: I'm not young any more. I'm curious about whether the development model that has been so successful for open-source software could somehow be applied to quicken the pace of medical progress.

Wednesday, September 03, 2008

Beethoven on Youtube

I've long been a fan of Beethoven's Violin Concerto (1, 2, 3). Here it is played by Yehudi Menuhin.

I also like some of the symphonies. Here's Herbert von Karajan conducting the Seventh Symphony, which I think doesn't get enough attention by comparison to the Third (1, 2, 3? 4?) or the Fifth or the Ninth (1, 2, 3a, 3b, 4). Why is it the only even-numbered Beethoven symphony you ever hear is the Sixth? In fact, none of Beethoven's power-of-two symphonies (numbers 1, 2, 4 and 8) get much airplay. Weird.

Monday, August 25, 2008

Bach on YouTube

Brandenburg Concerto number 3, probably my favorite: first movement, second movement, third movement. For the third Brandenburg, Bach didn't really write a second movement. He just wrote a couple of chords and allowed the musicians to improvise whatever they wanted within that minimal harmonic constraint. Different groups do different things with that freedom. The first time I heard this concerto was Walter Carlos's rendition on the Switched-on Bach album back in the seventies, which included a lot of interesting sounds that people now associate with old bad sci-fi movies. But at the time, Carlos was one of the first explorers of electronic music and there wasn't yet an esthetic for it. In a later recording Carlos did something a bit more conventional, a minimal expansion on Bach's two chords with just a few flourishes.

One thing I never quite got about the first Brandenburg (first, second, third, fourth movements) is some funny work in the horns in the first movement. There are points where they just seem off-tempo with everybody else. When I first heard this I assumed the musicians had gotten lost. But now I'm hearing it in this second recording, so I have to conclude that Bach wrote it that way. Maybe he was trying to make sure the listener was awake? Perplexing.

I once read a review of the sixth Brandenburg (first, second, third movements) suggesting that it was a musical description of goings-on in the Bach household. Bach had lots of kids, all presumably running and bouncing about as kids will do, and this is a very busy concerto with a lot happening. So that might be what Bach had in mind, and it especially sounds that way in the third movement which has a real bounce to it. In this recording the cellos (maybe basses? I'm never sure) at the right end seem to have many more than four strings.

Tuesday, August 12, 2008

Multimachine

Multimachine, built by Pat Delany of Palestine, Texas, is an inspiring project. It is...
a humanitarian, open source machine tool project for developing countries... The MultiMachine all-purpose machine tool can be built by a semi-skilled mechanic with just common hand tools. For machine construction, electricity can be replaced with "elbow grease" and the necessary material can come from discarded vehicle parts. What can the MultiMachine be used for in developing countries?
AGRICULTURE...
WATER SUPPLIES...
FOOD SUPPLIES: Building steel-rolling-and-bending machines for making fuel efficient cook stoves and other cooking equipment...
TRANSPORTATION...
EDUCATION...
JOB CREATION...
The project is open source and thoroughly documented. It uses commonly available parts. It explicitly seeks to address the needs of the developing world. It acknowledges the work people did in this area (1, 2) in years past. Cool stuff. We have all kinds of Industrial Revolution era mill buildings in the greater Boston area, and this would fit right in.

Frostbot, the work of Brian Schmalz, is another food fabber. It's designed to frost cookies. The CNC mechanism is from Fireball CNC. Brian's other tinkerings include a cool USB bit-whacking board available at Sparkfun.