Archive for the ‘MapReduce 2.0’ Category

Migrating to MapReduce 2 on YARN (For Users)

Saturday, November 9th, 2013

Migrating to MapReduce 2 on YARN (For Users) by Sandy Ryza.

From the post:

In Apache Hadoop 2, YARN and MapReduce 2 (MR2) are long-needed upgrades for scheduling, resource management, and execution in Hadoop. At their core, the improvements separate cluster resource management capabilities from MapReduce-specific logic. They enable Hadoop to share resources dynamically between MapReduce and other parallel processing frameworks, such as Cloudera Impala; allow more sensible and finer-grained resource configuration for better cluster utilization; and permit Hadoop to scale to accommodate more and larger jobs.

In this post, users of CDH (Cloudera’s distribution of Hadoop and related projects) who program MapReduce jobs will get a guide to the architectural and user-facing differences between MapReduce 1 (MR1) and MR2. (MR2 is the default processing framework in CDH 5, although MR1 will continue to be supported.) Operators/administrators can read a similar post designed for them here.

From further within the post:

MR2 supports both the old (“mapred”) and new (“mapreduce”) MapReduce APIs used for MR1, with a few caveats. The difference between the old and new APIs, which concerns user-facing changes, should not be confused with the difference between MR1 and MR2, which concerns changes to the underlying framework. CDH 4 and CDH 5 support the new and old MapReduce APIs as well as both MR1 and MR2. (Now, go back and read this paragraph again, because the naming is often a source of confusion.) (Emphasis added.)

And under Job Configuration:

As in MR1, job configuration options can be specified on the command line, in Java code, or in the mapred-site.xml on the client machine in the same way they previously were. Most job configuration options, with rare exceptions, that were available in MR1 work in MR2 as well. For consistency and clarity, many options have been given new names. The older names are deprecated, but will still work for the time being. The exceptions are mapred.child.ulimit and all options relating to JVM reuse, which are no longer supported. (Emphasis added.)

That’s all very reassuring.

Are your MapReduce engineers using the old (deprecated) names, the new names, or some combination of both?

As software evolves, name changes cannot be avoided, and no doubt Cloudera has tried to avoid gratuitous ones.

But bottom line, isn’t it your responsibility to track internal use of names? For consistency and maintenance?
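One way to start tracking is to scan your own job code, scripts, and configuration files for MR1-era property names. What follows is a minimal sketch, not a Cloudera or Hadoop tool. The function name `audit` and the tables are my own, and the rename table is a small illustrative subset that you should check against the deprecated-properties list for your Hadoop release. The unsupported set comes from the quoted passage: mapred.child.ulimit, plus mapred.job.reuse.jvm.num.tasks, which I am assuming is the JVM reuse option most likely to turn up.

```python
import re

# Deprecated MR1 name -> MR2 replacement.
# Illustrative subset only (an assumption of this sketch); verify against
# the deprecated-properties list shipped with your Hadoop release.
RENAMED = {
    "mapred.job.name": "mapreduce.job.name",
    "mapred.map.tasks": "mapreduce.job.maps",
    "mapred.reduce.tasks": "mapreduce.job.reduces",
    "mapred.task.timeout": "mapreduce.task.timeout",
    "mapred.output.compress": "mapreduce.output.fileoutputformat.compress",
}

# Options the quoted post says are no longer supported in MR2.
# The JVM reuse entry is the one name assumed here; there may be others.
UNSUPPORTED = {
    "mapred.child.ulimit",
    "mapred.job.reuse.jvm.num.tasks",
}

# Matches dotted names that start with "mapred." or "mapreduce.".
PROPERTY = re.compile(r"\bmapred(?:uce)?(?:\.[A-Za-z0-9_-]+)+")


def audit(text):
    """Return deprecated and unsupported property names found in text."""
    deprecated = {}
    unsupported = set()
    for name in PROPERTY.findall(text):
        if name in UNSUPPORTED:
            unsupported.add(name)
        elif name in RENAMED:
            deprecated[name] = RENAMED[name]
    return {"deprecated": deprecated, "unsupported": unsupported}


if __name__ == "__main__":
    sample = """
    conf.set("mapred.reduce.tasks", "4");
    hadoop jar job.jar -D mapred.job.name=nightly
    <name>mapred.child.ulimit</name>
    <name>mapreduce.job.maps</name>
    """
    report = audit(sample)
    for old, new in sorted(report["deprecated"].items()):
        print("deprecated:", old, "->", new)
    for name in sorted(report["unsupported"]):
        print("unsupported in MR2:", name)
```

Run it over the text of each file in a repository and you have at least a first inventory of which names are in use. Names already in the new style pass through unreported.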

Are You Confused? (About MR2 and YARN?) Help is on the way!

Monday, October 8th, 2012

MR2 and YARN Briefly Explained by Justin Kestelyn.

Justin writes:

With CDH4 onward, the Apache Hadoop component introduced two new terms for Hadoop users to wonder about: MR2 and YARN. Unfortunately, these terms are mixed up so much that many people are confused about them. Do they mean the same thing, or not?

No, but see Justin’s post for the details. (He also points to a longer post with more details.)

Experimenting with MapReduce 2.0

Monday, July 16th, 2012

Experimenting with MapReduce 2.0 by Ahmed Radwan.

In Building and Deploying MR2, we presented a brief introduction to MapReduce in Hadoop 0.23 and focused on the steps to setup a single-node cluster. In MapReduce 2.0 in Hadoop 0.23, we discussed the new architectural aspects of the MapReduce 2.0 design. This blog post highlights the main issues to consider when migrating from MapReduce 1.0 to MapReduce 2.0. Note that both MapReduce 1.0 and MapReduce 2.0 are included in CDH4.

It is important to note that, at the time of writing this blog post, MapReduce 2.0 is still Alpha, and it is not recommended to use it in production.

In the rest of this post, we shall first discuss the Client API, followed by configurations and testing considerations, and finally commenting on the new changes related to the Job History Server and Web Servlets. We will use the terms MR1 and MR2 to refer to MapReduce in Hadoop 1.0 and Hadoop 2.0, respectively.

How long MapReduce 2.0 remains in alpha is anyone’s guess. I suggest we start learning about it before that status passes.