 Hadoop is Apache's implementation of the MapReduce distributed computing scheme innovated by Google. Amazon rents out Hadoop services on their cluster. It's fairly straightforward to set up Hadoop on a cluster of Linux boxes. Having myself a long-standing interest in distributed computing approaches to molecular modeling, I have been trying to figure out how Hadoop could be applied to do very large-scale molecular simulations.
Hadoop is Apache's implementation of the MapReduce distributed computing scheme innovated by Google. Amazon rents out Hadoop services on their cluster. It's fairly straightforward to set up Hadoop on a cluster of Linux boxes. Having myself a long-standing interest in distributed computing approaches to molecular modeling, I have been trying to figure out how Hadoop could be applied to do very large-scale molecular simulations.MapReduce is great for problems where large chunks of computation can be done in isolation. The difficulty with molecular modeling is that every atom is pushing or pulling on every other atom on every single time step. The problem doesn't nicely partition into large isolated chunks. One could run a MapReduce cycle on each time step, but that would be horribly inefficient - on each time step, every map job needs as input the position and velocity of every atom in the entire simulation.
There are existing solutions like NAMD, which uses DPMTA for the long-range forces between atoms. For a cluster of limited size these are the appropriate tools. For large clusters with hundreds or thousands of machines, the rate of hardware failures becomes a consideration that can't be ignored.
MapReduce provides a few principles for working in the very-large-cluster domain:
- Let your infrastructure handle hardware failures, just like the way the Internet invisibly routes around dead servers.
- Individual machines are anonymous. You never write application code that directly addressses an individual machine.
- Don't waste too much money trying to make the hardware more reliable. It won't pay off in the end.
- Use a distributed file system that reliably retains the inputs to a task until that task has been successfully completed.
Could the tasks that NAMD assigns to each machine be anonymized with respect to which machine they run on, and the communications routed through a distributed filesystem like Hadoop's HDFS? Certainly it's possible in principle. Whether I'll be able to make any reasonable progress on it in my abundant spare time is another matter.
 
No comments:
Post a Comment