Hadoop MapReduce: to Sort or Not to Sort by Tendu Yogurtcu.
From the post:
What is the big deal about Sort? Sort is fundamental to the MapReduce framework: the data is sorted between the Map and Reduce phases (see below). Syncsort’s contribution allows the native Hadoop sort to be replaced by an alternative sort implementation, on both the Map and Reduce sides, i.e. it makes the Sort phase pluggable.
Opening up the Sort phase to alternative implementations will facilitate new use cases and data flows in the MapReduce framework. Let’s look at some of these use cases:
Tendu goes on to cover:
- Optimized sort implementations.
- Hash-based aggregations.
- Ability to run a job with a subset of data.
- Optimized full joins.
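
To make the hash-based aggregation case concrete, here is a minimal sketch in plain Python. It is not Hadoop code and not taken from Tendu's post: the function names (`sort_based_sum`, `hash_based_sum`, `run_job`) and the sample data are hypothetical, chosen only to illustrate why a pluggable step between map and reduce matters. The default path sorts by key and then reduces each run of equal keys; the alternative skips the sort and accumulates into a hash table.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def sort_based_sum(pairs):
    """Sort (key, value) pairs by key, then sum each run of equal keys."""
    ordered = sorted(pairs, key=itemgetter(0))
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(ordered, key=itemgetter(0))
    }


def hash_based_sum(pairs):
    """Sum values per key with a hash table; no sort is performed."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run_job(pairs, aggregate=sort_based_sum):
    """Hypothetical 'pluggable' step: the caller picks the strategy."""
    return aggregate(pairs)


# Hypothetical map output: (key, count) pairs in arbitrary order.
pairs = [("b", 2), ("a", 1), ("b", 3), ("c", 4), ("a", 5)]

sorted_result = run_job(pairs)
hashed_result = run_job(pairs, aggregate=hash_based_sum)
```

Both strategies produce the same totals per key. The difference is that the sort-based version also yields keys in sorted order, at the cost of an O(n log n) sort, while the hash-based version runs in a single pass and makes no ordering promise. When a job only needs the aggregates, being able to swap the sort out is the point.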
See Tendu’s post for the details.
I first saw this at Use Cases for Hadoop’s New Pluggable Sort by Alex Popescu.