The DNA Data Deluge by Michael C. Schatz & Ben Langmead.
From the post:
We’re still a long way from having anything as powerful as a Web search engine for sequencing data, but our research groups are trying to exploit what we already know about cloud computing and text indexing to make vast sequencing data archives more usable. Right now, agencies like the National Institutes of Health maintain public archives containing petabytes of genetic data. But without easy search methods, such databases are significantly underused, and all that valuable data is essentially dead. We need to develop tools that make each archive a useful living entity the way that Google makes the Web a useful living entity. If we can make these archives more searchable, we will empower researchers to pose scientific questions over much larger collections of data, enabling greater insights.
A very accessible article that makes a strong, and quite literal, case for the “DNA Data Deluge.”
The deluge of concern to the authors is raw genetic data.
They don’t address how we will connect genetic data to the semantic quagmire of clinical data and research publications.
Genetic knowledge disconnected from clinical experience will be interesting but not terribly useful.
If you want even more complex data requirements, add the other factors that interact with our genetic makeup, such as pollution, additives, and lifestyle.