Tenzing: A SQL Implementation On The MapReduce Framework by Biswapesh Chattopadhyay, Liang Lin, Weiran Liu, Sagar Mittal, Prathyusha Aragonda, Vera Lychagina, Younghee Kwon and Michael Wong.
Abstract:
Tenzing is a query engine built on top of MapReduce for ad hoc analysis of Google data. Tenzing supports a mostly complete SQL implementation (with several extensions) combined with several key characteristics such as heterogeneity, high performance, scalability, reliability, metadata awareness, low latency, support for columnar storage and structured data, and easy extensibility. Tenzing is currently used internally at Google by 1000+ employees and serves 10000+ queries per day over 1.5 petabytes of compressed data. In this paper, we describe the architecture and implementation of Tenzing, and present benchmarks of typical analytical queries.
Of the conclusions of the authors:
It is possible to create a fully functional SQL engine on top of the MapReduce framework, with extensions that go beyond SQL into deep analytics. With relatively minor enhancements to the MapReduce framework, it is possible to implement a large number of optimizations currently available in commercial database systems, and create a system which can compete with commercial MPP DBMS in terms of throughput and latency. The MapReduce framework provides a combination of high performance, high reliability and high scalability on cheap unreliable hardware, which makes it an excellent platform to build distributed applications that involve doing simple to medium complexity operations on large data volumes. By designing the engine and the optimizer to be aware of the characteristics of heterogeneous data sources, it is possible to create a smart system which can fully utilize the characteristics of the underlying data sources.
the last one is of the most interest to me. Which one interests you the most?
BTW, the authors mention:
We are working on various other enhancements and believe we can cut this time down to less than 5 seconds end-to-end, which is fairly acceptable to the analyst community.
I think the analyst community needs to use 2400 baud modems for a month or two. 😉
Sub-5 second performance is sometimes useful, even necessary. But as a general requirement?