The Future of Hadoop in Bioinformatics: Hadoop and its ecosystem including MapReduce are the dominant open source Big Data solution by Bob Gourley.
From the post:
Earlier, I wrote on the use of Hadoop in the exciting, evolving field of Bioinformatics. I have since had the pleasure of speaking with Dr. Ron Taylor of Pacific Northwest National Laboratory, the author of “An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics”, on what’s changed in the half-year since its publication and what’s to come.
As Dr. Taylor expected, Hadoop and its “ecosystem”, including MapReduce, are the dominant open source Big Data solution for next generation DNA sequencing analysis. This is currently the sub-field generating the most data and requiring the most computationally expensive analysis. For example, de novo assembly pieces together tens of millions of short reads (which may be 50 bases long on ABI SOLiD sequencers). To do so, every read needs to be compared to the others, which scales in proportion to n log n. Even assuming reads that are 100 base pairs in length and a human genome of 3 billion base pairs, analyzing an entire human genome will take about 7.5 times longer than if it scaled linearly. By dividing the task up across a Hadoop cluster, the analysis will be faster and, unlike other high performance computing alternatives, it can run on regular commodity servers that are much cheaper than custom supercomputers. This, combined with the savings from using open source software, ease of use due to seamless scaling, and the strength of the Hadoop community, makes Hadoop and related software the parallelization solution of choice in next generation sequencing.
In other areas, however, traditional HPC is still more common and Hadoop has not yet caught on. Dr. Taylor believes that in the next year to 18 months, this will change due to the following trends:
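As an aside, the "7.5 times" figure in the quoted passage can be reproduced with a few lines of arithmetic. The sketch below is only a plausibility check, not the post's own calculation. It assumes one-fold coverage (3 billion bases split into 100-base reads) and a base-10 logarithm, which is the base that happens to yield 7.5. The function name `nlogn_overhead` is made up for illustration.

```python
import math


def nlogn_overhead(genome_bases: int, read_length: int) -> tuple[int, float]:
    """Return (number of reads n, log10(n)).

    Assumptions (not stated in the quoted post): one-fold coverage and a
    base-10 logarithm. log10(n) is the ratio of n*log10(n) to linear cost n.
    """
    n_reads = genome_bases // read_length
    return n_reads, math.log10(n_reads)


n_reads, factor = nlogn_overhead(3_000_000_000, 100)
print(f"{n_reads:,} reads; n*log10(n) is about {factor:.1f}x the linear cost n")
# prints: 30,000,000 reads; n*log10(n) is about 7.5x the linear cost n
```

With a natural or base-2 logarithm the factor would come out larger (roughly 17 or 25), so the 7.5 should be read as a rough illustration of super-linear growth rather than a precise runtime ratio.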
So, over the next year to eighteen months, what do you see as the evolution of topic map software and services?
Or what problems do you see becoming apparent in bioinformatics or other areas (like the Department of Energy’s knowledgebase) that will require topic maps?
(More on the DOE project later this week.)