Uncovering mysteries of InputFormat: Providing better control for your Map Reduce execution, by Boris Lublinsky and Mike Segel.
From the post:
As more companies adopt Hadoop, there is a greater variety in the types of problems for which Hadoop’s framework is being utilized. As the scenarios where Hadoop is applied grow, it becomes critical to control how and where map tasks are executed. One key to such control is a custom InputFormat implementation.
The InputFormat class is one of the fundamental classes in the Hadoop Map Reduce framework. This class is responsible for defining two main things:
- Data splits
- Record reader
A data split is a fundamental concept in the Hadoop Map Reduce framework: it defines both the size of an individual Map task and its potential execution server. The Record Reader is responsible for actually reading records from the input file and submitting them (as key/value pairs) to the mapper. There are quite a few publications on how to implement a custom Record Reader (see, for example, [1]), but the information on splits is very sketchy. Here we will explain what a split is and how to implement custom splits for specific purposes.
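To make the two responsibilities concrete, here is a minimal sketch (not the authors' code) of a custom InputFormat using the org.apache.hadoop.mapreduce API. The class name CustomInputFormat and the choices made (reusing the default splitting and LineRecordReader, disabling per-file splitting) are illustrative assumptions, not the implementation described in the post.

```java
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical example class, for illustration only.
public class CustomInputFormat extends FileInputFormat<LongWritable, Text> {

    // Splits determine how many map tasks run, how much data each one reads,
    // and which hosts are preferred for locality.
    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException {
        // Delegates to FileInputFormat's block-based splitting; a custom
        // implementation could instead build its own list of FileSplit
        // objects (path, start, length, hosts) to change split sizes or
        // placement hints.
        return super.getSplits(context);
    }

    // One common reason for a custom InputFormat: force one map task per file
    // by declaring files non-splittable.
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }

    // The RecordReader turns a split's bytes into key/value pairs for the
    // mapper; the stock LineRecordReader is reused here, but a custom reader
    // could parse any record format.
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        return new LineRecordReader();
    }
}
```

A job would pick this up via job.setInputFormatClass(CustomInputFormat.class); the post goes into the split side of this picture in much more depth.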
See the post for the details.
Something for you to explore over the weekend!