Clydesdale: Structured Data Processing on MapReduce by Tim Kaldewey, Eugene J. Shekita, Sandeep Tata.
MapReduce has emerged as a promising architecture for large scale data analytics on commodity clusters. The rapid adoption of Hive, a SQL-like data processing language on Hadoop (an open source implementation of MapReduce), shows the increasing importance of processing structured data on MapReduce platforms. MapReduce offers several attractive properties such as the use of low-cost hardware, fault-tolerance, scalability, and elasticity. However, these advantages have required a substantial performance sacrifice.
In this paper we introduce Clydesdale, a novel system for structured data processing on Hadoop – a popular implementation of MapReduce. We show that Clydesdale provides more than an order of magnitude in performance improvements compared to existing approaches without requiring any changes to the underlying platform. Clydesdale is aimed at workloads where the data fits a star schema. It draws on column oriented storage, tailored join-plans, and multicore execution strategies and carefully fits them into the constraints of a typical MapReduce platform. Using the star schema benchmark, we show that Clydesdale is on average 38x faster than Hive. This demonstrates that MapReduce in general, and Hadoop in particular, is a far more compelling platform for structured data processing than previous results suggest. (emphasis in original)
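To make the "tailored join-plans" idea concrete, here is a minimal sketch (not Clydesdale's actual code; all table and function names are illustrative) of the broadcast hash-join pattern commonly used for star schemas on Hadoop: the small dimension tables are held in memory at every mapper, so each fact row is joined with simple hash lookups instead of a shuffle-heavy repartition join.

```python
# Illustrative sketch of a map-side hash join over a star schema.
# In a real Hadoop job the dimension tables would be shipped to every
# mapper (e.g. via the distributed cache) and loaded into hash maps once.

# Toy dimension tables, keyed by their surrogate keys.
dim_customer = {1: "Acme Corp", 2: "Globex"}
dim_date = {20120101: "2012-01-01", 20120102: "2012-01-02"}

# Fact rows: (customer_key, date_key, revenue).
fact_sales = [
    (1, 20120101, 100.0),
    (2, 20120101, 250.0),
    (1, 20120102, 75.0),
]

def map_side_join(facts, customers, dates):
    """Join each fact row against the in-memory dimensions in one pass."""
    for cust_key, date_key, revenue in facts:
        # Hash lookups replace the reduce-side repartition join that a
        # naive plan would use, avoiding a shuffle of the large fact table.
        yield (customers[cust_key], dates[date_key], revenue)

joined = list(map_side_join(fact_sales, dim_customer, dim_date))
```

Because only the large fact table is streamed, this pattern fits MapReduce's constraints without platform changes, which is the spirit of the paper's approach.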
The authors make clear that Clydesdale is a research prototype and lacks many features needed for full production use.
But an improvement of one, and sometimes two, orders of magnitude should pique your interest in contributing to that work.
I find the “re-use” of existing Hadoop infrastructure particularly exciting.
Order-of-magnitude or better gains over current approaches are a signal that someone is thinking about the underlying issues and not simply throwing horsepower at a problem.
I first saw this in NoSQL Weekly, Issue 116.