Will Ware's blog: patient data

Friday, October 25, 2013

Bar Camp Boston 2013 talk on automation of science

This is an outline for a talk I gave at Bar Camp Boston 8 on the automation of science. It's a topic I've blogged and spoken about before. The shortened URL for this post is http://goo.gl/rv3Xik.

In 2004, a robot named Adam became the first machine in history to discover new scientific knowledge independently of its human creators. Without human guidance, Adam can create hypotheses to explain observations, design experiments to test those hypotheses, run the experiments using laboratory robotics, interpret the experimental results, and repeat the cycle to generate new knowledge. The principal investigator on the Adam project was Ross King, now at Manchester University, who published a paper on the automation of science (PDF) in 2009. Some of his other publications: 1, 2, 3.

Adam works in a very limited domain, in nearly complete isolation. There is plenty of laboratory automation but (apart from Adam) we don't yet have meaningful computer participation in the theoretical aspect of scientific work. A worldwide scientific collaboration of human and computer theoreticians working with human and computer experimentalists could advance science and medicine and solve human problems faster.

The first step is to formulate a linked language of science that machines can understand. Publish papers in formats like RDF/Turtle, JSON, JSON-LD, or YAML. Link scientific literature to existing semantic networks (DBpedia, Freebase, Google Knowledge Graph, LinkedData.org, Schema.org, etc.). Create schemas for scientific domains and for the scientific method (hypotheses, predictions, experiments, data). Provide tutorials, tools, and incentives to encourage researchers to publish machine-tractable papers. Create a distributed graph or database of these papers, in the role of scientific journals, accessible to people and machines everywhere. Maybe use Stack Overflow as a model for peer review.
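To make the idea of a schema for the scientific method a bit more concrete, here is a minimal sketch of what a machine-tractable paper might look like in a JSON-LD style. The vocabulary URL `http://example.org/sci#`, the term names (`sci:Hypothesis`, `sci:Prediction`, and so on), the paper ID, and the pendulum numbers are all made up for illustration; no such published schema is implied.

```python
import json

# Hypothetical vocabulary namespace, invented for this sketch.
SCI = "http://example.org/sci#"

def make_paper(paper_id, hypothesis, prediction, procedure, observations):
    """Build a JSON-LD-style document describing one small study."""
    return {
        "@context": {
            "sci": SCI,
            # Map the plain keys used in the data rows to vocabulary terms.
            "mass_kg": "sci:massKg",
            "period_s": "sci:periodS",
        },
        "@id": paper_id,
        "@type": "sci:Paper",
        "sci:hypothesis": {"@type": "sci:Hypothesis", "sci:statement": hypothesis},
        "sci:prediction": {"@type": "sci:Prediction", "sci:statement": prediction},
        "sci:experiment": {
            "@type": "sci:Experiment",
            "sci:procedure": procedure,
            "sci:data": observations,
        },
    }

# Illustrative high school physics example; the measurements are invented.
paper = make_paper(
    "http://example.org/papers/pendulum-1",
    "The period of a pendulum depends on its length but not its mass.",
    "Doubling the mass leaves the period unchanged.",
    "Time ten swings for each of two masses on the same string.",
    [{"mass_kg": 0.1, "period_s": 2.01}, {"mass_kg": 0.2, "period_s": 2.00}],
)

text = json.dumps(paper, indent=2)
print(text)
```

The point is only that hypothesis, prediction, experiment, and data become separately addressable pieces that a program can find without parsing English prose.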

Begin with very limited scientific domains (high school physics, high school chemistry) to avoid the full complexity and political wrangling of the professional scientific community in the initial stages. As this stuff approaches readiness for professional work, deploy it first in the domain of computer science and other scientific domains where it can hope to avoid overwhelming resistance.

Machine learning algorithms (clustering, classification, regression) can find patterns in data and help to identify useful abstractions. Supervised learning algorithms can provide tools for collaboration between people and computers.

The computational chemistry folks have a cool little program called Babel which translates between a large number of different file formats for representing molecular structures. It does this with a rich internal representation of structures, and pluggable read and write modules for each file format. At some point, something like this for different file formats of scientific literature might become useful, and might help to build consensus among different approaches.
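Here is a rough sketch of that architecture applied to statements from scientific literature: one internal representation, plus pluggable read and write modules registered per format. This is not how Babel itself is implemented; the format names `json` and `lines`, the registries, and the `convert` helper are assumptions made up for this example.

```python
import json

# Registries mapping a format name to its reader or writer function.
READERS = {}
WRITERS = {}

def reader(fmt):
    """Decorator that registers a function as the reader for a format."""
    def register(fn):
        READERS[fmt] = fn
        return fn
    return register

def writer(fmt):
    """Decorator that registers a function as the writer for a format."""
    def register(fn):
        WRITERS[fmt] = fn
        return fn
    return register

# Internal representation: a list of (subject, predicate, object) string tuples.

@reader("json")
def read_json(text):
    return [tuple(item) for item in json.loads(text)]

@writer("json")
def write_json(triples):
    return json.dumps([list(t) for t in triples])

@reader("lines")
def read_lines(text):
    # One statement per line, fields separated by " | ".
    return [tuple(line.split(" | ")) for line in text.splitlines() if line.strip()]

@writer("lines")
def write_lines(triples):
    return "\n".join(" | ".join(t) for t in triples)

def convert(text, src, dst):
    """Read text in the src format and write it back out in the dst format."""
    return WRITERS[dst](READERS[src](text))

source = '[["paper1", "hypothesis", "period is independent of mass"]]'
print(convert(source, "json", "lines"))
# paper1 | hypothesis | period is independent of mass
```

Supporting a new format means writing one reader and one writer against the internal representation, rather than a converter for every pair of formats.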

A treasure trove would be available in linked patient data. In the United States this is problematic because of the privacy restrictions associated with HIPAA regulations. In countries like Iceland and Norway, which have universal health care, there would be no equivalent of HIPAA, and those would be good places to initiate a Linked Patient Data project.

Tuesday, November 13, 2012

A Semantic Network of Patient Data

This idea has two inspirations. One is this TED talk by Dave deBronkart, also known as "e-Patient Dave". The other is the work that has been done on the semantic web and linked data.

Dave's talk is about patients taking control of their medical records and sharing them with other like-minded patients, so that they can learn from one another's histories and experiences. Some of these patients, including Dave, had terminal diagnoses and were able to improve or resolve those conditions because of having shared data with others.

The semantic web is the idea of formatting information so that computers can do more with it than simply store it or transmit it or display it on a screen. Computers can understand the meaning of the information much as a human would, so they can reason about it and draw new conclusions that aren't already spelled out. I first learned about it in a 2001 article in Scientific American. There are some more details here. I've blogged in the past about some of the basic ideas.

In the semantic web, all "things" (nouns, basically) are assigned URIs (web addresses). Relationships between things (and relationships are also things) are represented in RDF, where every statement is a triple of a subject, a predicate, and an object. These statements are often printed or transmitted in XML, but the N3 language is more readable for people. Typical relationships look something like this:
Will, town, "Framingham MA".
Will, name, "William Ware".
Will, pet, cat#12345.
cat#12345, name, "Kokopelli".
cat#12345, birthyear, 2003.
Strings ("William Ware", "Kokopelli") and numbers (2003) can be raw data; everything else is a URI. The idea is that a URI connects you to the rest of the semantic web of meaning, so if you don't know what a "pet" is, you can follow that URI, or query other triples with "pet" in them, to find out more.
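As a minimal sketch, the triples above can be held as plain Python tuples and queried by pattern. Short names like `Will` and `pet` stand in for full URIs here, and the `match` and `pet_names` helpers are made up for this example rather than taken from any RDF library.

```python
# The example triples as plain tuples. Short names stand in for full URIs.
TRIPLES = [
    ("Will", "town", "Framingham MA"),
    ("Will", "name", "William Ware"),
    ("Will", "pet", "cat#12345"),
    ("cat#12345", "name", "Kokopelli"),
    ("cat#12345", "birthyear", 2003),
]

def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every field that is not None."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]

def pet_names(triples, owner):
    """Follow owner -> pet -> name, joining two triple patterns."""
    names = []
    for _, _, pet in match(triples, subject=owner, predicate="pet"):
        for _, _, name in match(triples, subject=pet, predicate="name"):
            names.append(name)
    return names

print(pet_names(TRIPLES, "Will"))  # ['Kokopelli']
```

Chaining two patterns like this is the same "follow the URI" step described above, just done by a program.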

You might wonder if it's silly to have such a primitive representation for knowledge. It allows the same kinds of economies of scale that we get by representing information in a computer with ones and zeroes. Because the format is so simple and uniform, we can build processing architectures that can be very efficient, and people have been doing that for over ten years. We have scalable databases for RDF, and when we set up rules that mimic set theory, we can build reasoning engines that extract new conclusions from the data.
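To give a feel for rules that mimic set theory, here is a toy reasoning step, assuming just two rules: subclass relationships are transitive, and membership in a class implies membership in its superclasses. The predicate names `type` and `subClassOf` echo the RDF and RDFS terms, but the `infer` function and the sample facts are my own illustration, and a real reasoning engine would be far more efficient than this brute-force loop.

```python
def infer(triples):
    """Apply two set-like rules repeatedly until no new triples appear.

    Rule 1: subClassOf is transitive (A in B and B in C gives A in C).
    Rule 2: membership propagates (x is an A and A in B gives x is a B).
    """
    facts = set(triples)
    while True:
        new = set()
        for (a, p1, b) in facts:
            for (c, p2, d) in facts:
                # Only chain through a subClassOf statement about b.
                if b != c or p2 != "subClassOf":
                    continue
                if p1 == "subClassOf":
                    new.add((a, "subClassOf", d))
                elif p1 == "type":
                    new.add((a, "type", d))
        if new <= facts:
            return facts
        facts |= new

facts = infer([
    ("cat#12345", "type", "Cat"),
    ("Cat", "subClassOf", "Mammal"),
    ("Mammal", "subClassOf", "Animal"),
])
print(("cat#12345", "type", "Animal") in facts)  # True
```

Nothing in the input says the cat is an animal; that conclusion comes out of applying the rules.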

When data is formatted with an appropriate ontology, it can be searched in rich complex ways, and computers can look for patterns and correlations that a human might not notice. When applied to patients' medical data, the results might be new medical knowledge or new treatment options.

There are other ways to find new information hidden in patient data. Semantic web technology is great for pure logic, but for quantitative measures (a dosage increase in this medication seems to cause a decreased amount of that neurotransmitter) we can turn to machine learning, where progress in the last decade or two has been explosive, given the data available on the web and the economic rewards for finding patterns in it.
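For the quantitative side, the simplest possible sketch is an ordinary least squares line fit relating dosage to a measured level. The numbers below are invented for illustration and are not patient data, and `fit_line` is a hand-rolled helper rather than a call into any particular machine learning library.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up numbers, not real patient data: dose in mg, level in arbitrary units.
doses = [10, 20, 30, 40]
levels = [9.0, 7.0, 5.0, 3.0]

slope, intercept = fit_line(doses, levels)
print(slope, intercept)
```

A negative slope here corresponds to the "dosage increase seems to cause a decreased amount" pattern, though a fitted line by itself shows correlation, not causation.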

An idea I've blogged about in the past (and spoken about at a couple of very small conferences) is applying this to general scientific literature, with the goal of hastening scientific progress and in particular medical progress (since I'm an old fart now and interested in that sort of thing).

If this topic interests you and you wish to discuss it, I'm starting a Google Groups forum for that purpose.

UPDATE: I've discovered that there is a company in Cambridge, MA called PatientsLikeMe which already pools patient data into a database, and sells subscriptions to that database. I don't know if they place the same emphasis on machine-tractable formats that I have above. But knowing that somebody is doing it on a commercial basis, I don't see much point in trying to replicate that effort in my evenings and weekends.