HadoopDB: Efficient Processing of Data Warehousing Queries in a Split Execution Environment
From the post:
The buzz about Hadapt and HadoopDB has been around for a while now, as it is one of the first systems to combine ideas from two different approaches, namely parallel databases based on a shared-nothing architecture and MapReduce, to address the problem of large-scale data storage and analysis.
This early paper that introduced HadoopDB crisply summarizes some reasons why parallel database solutions haven’t scaled to hundreds of machines. The reasons include:
- As the number of nodes in a system increases, failures become more common.
- Parallel databases usually assume a homogeneous array of machines, which becomes impractical as the number of machines rises.
- They have not been tested at larger scales, as applications haven’t demanded more than tens of nodes for performance until recently.
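To put a rough number on the first reason, here is a minimal sketch of how the chance of seeing at least one failure grows with cluster size. It assumes nodes fail independently and with the same probability during a query. Both the independence assumption and the 1% per-node figure are illustrative choices of mine, not numbers from the paper or the post, and `prob_any_failure` is a hypothetical helper name.

```python
def prob_any_failure(per_node_failure_prob: float, num_nodes: int) -> float:
    """Probability that at least one of num_nodes nodes fails,
    assuming independent failures with the same per-node probability."""
    return 1.0 - (1.0 - per_node_failure_prob) ** num_nodes


if __name__ == "__main__":
    # Illustrative assumption: each node has a 1% chance of failing
    # during one long-running query.
    for n in (10, 100, 1000):
        print(f"{n:>5} nodes: {prob_any_failure(0.01, n):.3f}")
```

Under these assumptions a failure during a query is unlikely at ten nodes, more likely than not at a hundred, and close to certain at a thousand, which is why restarting a whole query on any failure stops being workable at scale.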
Interesting material to follow on the HPCC vs. Hadoop post.
Not to take sides, just the beginning of the type of analysis that will be required.