Apache Tez: A New Chapter in Hadoop Data Processing by Bikas Saha.
From the post:
In this post we introduce the motivation behind Apache Tez (http://incubator.apache.org/projects/tez.html) and provide some background around the basic design principles for the project. As Carter discussed in our previous post on Stinger progress, Apache Tez is a crucial component of phase 2 of that project.
What is Apache Tez?
Apache Tez generalizes the MapReduce paradigm to execute a complex DAG (directed acyclic graph) of tasks. It also represents the logical next step for Hadoop 2, following the introduction of YARN and its more general-purpose resource management framework.
While MapReduce has served masterfully as the data processing backbone for Hadoop, its batch-oriented nature makes it unsuited for certain workloads like interactive query. Tez represents an alternative to traditional MapReduce that allows jobs to meet demands for fast response times and extreme throughput at petabyte scale. A great example of a beneficiary of this new approach is Apache Hive and the work being done in the Stinger Initiative.
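To make "a DAG of tasks" concrete, here is a minimal sketch using Tez's Java DAG-builder API (class and method names follow the Tez 0.5-era API; the processor and descriptor class names under com.example are hypothetical placeholders, not part of Tez):

```java
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.Edge;
import org.apache.tez.dag.api.EdgeProperty;
import org.apache.tez.dag.api.EdgeProperty.DataMovementType;
import org.apache.tez.dag.api.EdgeProperty.DataSourceType;
import org.apache.tez.dag.api.EdgeProperty.SchedulingType;
import org.apache.tez.dag.api.InputDescriptor;
import org.apache.tez.dag.api.OutputDescriptor;
import org.apache.tez.dag.api.ProcessorDescriptor;
import org.apache.tez.dag.api.Vertex;

public class TezDagSketch {

  public static DAG buildDag() {
    // Each vertex runs a user-supplied processor; the com.example
    // class names are hypothetical stand-ins.
    Vertex scan = Vertex.create("Scan",
        ProcessorDescriptor.create("com.example.ScanProcessor"), 4);
    Vertex aggregate = Vertex.create("Aggregate",
        ProcessorDescriptor.create("com.example.AggregateProcessor"), 2);
    Vertex join = Vertex.create("Join",
        ProcessorDescriptor.create("com.example.JoinProcessor"), 2);

    // A shuffle-style edge: scatter-gather movement of persisted output.
    EdgeProperty shuffle = EdgeProperty.create(
        DataMovementType.SCATTER_GATHER,
        DataSourceType.PERSISTED,
        SchedulingType.SEQUENTIAL,
        OutputDescriptor.create("com.example.PartitionedOutput"),
        InputDescriptor.create("com.example.ShuffledInput"));

    // Unlike MapReduce's fixed map -> reduce pipeline, vertices and
    // edges compose into an arbitrary acyclic graph, so a multi-stage
    // query runs as one job instead of several chained MR jobs.
    return DAG.create("scan-aggregate-join")
        .addVertex(scan)
        .addVertex(aggregate)
        .addVertex(join)
        .addEdge(Edge.create(scan, aggregate, shuffle))
        .addEdge(Edge.create(aggregate, join, shuffle));
  }
}
```

The design point is that the whole pipeline is expressed and scheduled as a single DAG, which is what lets Tez avoid the job-launch and intermediate-HDFS-write overhead of chaining separate MapReduce jobs.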
Motivation
Distributed data processing is the core application that Apache Hadoop is built around. Storing and analyzing large volumes and varieties of data efficiently has been the cornerstone use case driving large-scale adoption of Hadoop, and it has created enormous value for Hadoop adopters. Over the years, while building and running data processing applications based on MapReduce, we have learned a lot about the strengths and weaknesses of this framework and how we would like to evolve the Hadoop data processing framework to meet the evolving needs of Hadoop users. As the Hadoop compute platform moves into its next phase with YARN, it has decoupled itself from MapReduce as the only application and opened the opportunity to create a new data processing framework to meet these new challenges. Apache Tez aspires to live up to these lofty goals.
Is your topic map engine decoupled from a single merging algorithm?
I ask because SLAs may require different merging algorithms for different data sets or sources.
Leaked U.S. military documents may have a higher priority for completeness than half-human/half-bot posts on a Twitter stream.
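To make the question concrete, here is a hypothetical sketch (every name below is invented for illustration; this is not any existing engine's API) of what such decoupling could look like: the engine programs against a merging interface, so an SLA-driven policy can choose a completeness-first algorithm for high-stakes sources and a cheap heuristic for noisy streams.

```java
import java.util.List;

// Hypothetical types: a topic map engine decoupled from any single
// merging algorithm by depending only on this interface.
interface MergingAlgorithm {
  // Collapses topics that represent the same subject into one.
  List<Topic> merge(List<Topic> topics);
}

// Minimal stand-in for a topic; real engines carry identifiers,
// names, occurrences, etc.
record Topic(String id, String source) {}

// A thorough, completeness-first merge for high-priority sources ...
class ExhaustiveMerge implements MergingAlgorithm {
  public List<Topic> merge(List<Topic> topics) {
    // e.g., full pairwise subject-identity comparison (placeholder).
    return topics;
  }
}

// ... and a cheap heuristic that tolerates missed merges on
// low-priority, noisy streams.
class HeuristicMerge implements MergingAlgorithm {
  public List<Topic> merge(List<Topic> topics) {
    // e.g., hash on a single identifier and accept some misses
    // (placeholder).
    return topics;
  }
}

class TopicMapEngine {
  // SLA-driven selection: the source, not the engine, determines
  // which algorithm runs.
  static MergingAlgorithm forSource(String source) {
    return source.equals("leaked-documents")
        ? new ExhaustiveMerge()
        : new HeuristicMerge();
  }
}
```

The point is the seam, not the algorithms: swapping ExhaustiveMerge for HeuristicMerge per source requires no change to the engine itself, which is the same decoupling move YARN made for Hadoop's compute layer.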