Archive for the ‘Hadoop YARN’ Category

Philosophy behind YARN Resource Management

Saturday, February 23rd, 2013

Philosophy behind YARN Resource Management by Bikas Saha.

From the post:

YARN is part of the next generation Hadoop cluster compute environment. It creates a generic and flexible resource management framework to administer the compute resources in a Hadoop cluster. The YARN application framework allows multiple applications to negotiate resources for themselves and perform their application specific computations on a shared cluster. Thus, resource allocation lies at the heart of YARN.

YARN ultimately opens up Hadoop to additional compute frameworks, like Tez, so that an application can optimize compute for its specific requirements.

The YARN Resource Manager service is the central controlling authority for resource management and makes allocation decisions. It exposes a Scheduler API that is specifically designed to negotiate resources and not schedule tasks. Applications can request resources at different layers of the cluster topology such as nodes, racks etc. The scheduler determines how much and where to allocate based on resource availability and the configured sharing policy.
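To make the "negotiate resources, not schedule tasks" idea concrete, here is a minimal sketch in Python. It is not the YARN Scheduler API: the `Node`, `Request` and `allocate` names are hypothetical, memory is the only resource, and the "configured sharing policy" is left out entirely. It only shows a request being satisfied at the most specific topology level (node, then rack, then anywhere) that currently has room.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    rack: str
    free_mb: int  # memory still available on this node


@dataclass
class Request:
    memory_mb: int
    node: Optional[str] = None  # preferred node, if any (hypothetical field)
    rack: Optional[str] = None  # preferred rack, if any (hypothetical field)


def allocate(nodes: List[Node], request: Request) -> Optional[str]:
    """Grant the request on the most specific topology level that has room.

    Returns the chosen node name, or None if nothing fits right now.
    """
    tiers = [
        [n for n in nodes if n.name == request.node],  # exact node
        [n for n in nodes if n.rack == request.rack],  # same rack
        list(nodes),                                   # anywhere in the cluster
    ]
    for tier in tiers:
        for n in tier:
            if n.free_mb >= request.memory_mb:
                n.free_mb -= request.memory_mb  # record the grant
                return n.name
    return None


cluster = [Node("n1", "r1", 1024), Node("n2", "r1", 4096), Node("n3", "r2", 8192)]
print(allocate(cluster, Request(2048, node="n1", rack="r1")))
```

What the application runs inside the granted capacity is its own business, which is the point of the quoted passage.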

If YARN does become the cluster operating system, knowing the “why” of its behavior will be as important as knowing the “how.”

Introducing… Tez: Accelerating processing of data stored in HDFS

Wednesday, February 20th, 2013

Introducing… Tez: Accelerating processing of data stored in HDFS by Arun Murthy.

From the post:

MapReduce has served us well. For years it has been THE processing engine for Hadoop and has been the backbone upon which a huge amount of value has been created. While it is here to stay, new paradigms are also needed in order to enable Hadoop to serve an even greater number of usage patterns. A key and emerging example is the need for interactive query, which today is challenged by the batch-oriented nature of MapReduce. A key step to enabling this new world was Apache YARN and today the community proposes the next step… Tez.

What is Tez?

Tez – Hindi for “speed” – (currently under incubation vote within Apache) provides a general-purpose, highly customizable framework that simplifies data-processing tasks across both small-scale (low-latency) and large-scale (high-throughput) workloads in Hadoop. It generalizes the MapReduce paradigm to a more powerful framework by providing the ability to execute a complex DAG (directed acyclic graph) of tasks for a single job so that projects in the Apache Hadoop ecosystem such as Apache Hive, Apache Pig and Cascading can meet requirements for human-interactive response times and extreme throughput at petabyte scale (clearly MapReduce has been a key driver in achieving this).

With the emergence of Apache Hadoop YARN as the basis of next generation data-processing architectures, there is a strong need for an application which can execute a complex DAG of tasks which can then be shared by Apache Pig, Apache Hive, Cascading and others. The constrained DAG expressible in MapReduce (one set of maps followed by one set of reduces) often results in multiple MapReduce jobs which harm latency for short queries (overhead of launching multiple jobs) and throughput for large-scale queries (too much overhead for materializing intermediate job outputs to the filesystem). With Tez, we introduce a more expressive DAG of tasks, within a single application or job, that is better aligned with the required processing task – thus, for e.g., any given SQL query can be expressed as a single job using Tez.
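A rough sketch of the "single job, one DAG" idea, in Python. This is not the Tez API: `run_dag`, the task names and the toy data are all made up for illustration, and it assumes Python 3.9+ for the standard-library `graphlib` module. The scan, join and aggregate steps of a small query run in dependency order inside one job, and intermediate results stay in memory instead of being written out between separate MapReduce jobs.

```python
from graphlib import TopologicalSorter


def run_dag(tasks, deps):
    """Run every task once, after all of the tasks it depends on.

    tasks: task name -> function taking a dict of upstream results
    deps:  task name -> set of upstream task names
    """
    results = {}
    for name in TopologicalSorter(deps).static_order():
        inputs = {d: results[d] for d in deps.get(name, ())}
        results[name] = tasks[name](inputs)  # intermediate output stays in memory
    return results


def aggregate(inputs):
    # Sum the amounts per country from the joined rows.
    totals = {}
    for country, amount in inputs["join"]:
        totals[country] = totals.get(country, 0) + amount
    return totals


tasks = {
    "scan_orders": lambda _: [("alice", 30), ("bob", 20), ("alice", 5)],
    "scan_users": lambda _: {"alice": "US", "bob": "DE"},
    "join": lambda i: [(i["scan_users"][user], amt) for user, amt in i["scan_orders"]],
    "aggregate": aggregate,
}
deps = {
    "scan_orders": set(),
    "scan_users": set(),
    "join": {"scan_orders", "scan_users"},
    "aggregate": {"join"},
}

print(run_dag(tasks, deps)["aggregate"])
```

Expressed as plain MapReduce, the join and the aggregation would typically be separate jobs with a filesystem write between them, which is the overhead the quoted passage describes.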

If you are familiar with Michael Sperberg-McQueen and Claus Huitfeldt’s work on DAGs, you would be as excited as I am! (Goddag for example.)

On any day this would be awesome work.

Even more so coming on the heels of two other major project announcements: Securing Hadoop with Knox Gateway and The Stinger Initiative: Making Apache Hive 100 Times Faster, both from Hortonworks.

Announcing Apache Hadoop 2.0.3 Release and Roadmap

Saturday, February 16th, 2013

Announcing Apache Hadoop 2.0.3 Release and Roadmap by Arun Murthy.

From the post:

As the Release Manager for hadoop-2.x, I’m very pleased to announce the next major milestone for the Apache Hadoop community, the release of hadoop-2.0.3-alpha!

2.0 Enhancements in this Alpha Release

This release delivers significant enhancements and stability improvements over previous releases in the hadoop-2.x series. Notably, it includes:

  • QJM for HDFS HA for NameNode (HDFS-3077) and related stability fixes to HDFS HA
  • Multi-resource scheduling (CPU and memory) for YARN (YARN-2, YARN-3 & friends)
  • YARN ResourceManager Restart (YARN-230)
  • Significant stability at scale for YARN (over 30,000 nodes and 14 million applications so far, at time of release – see more details from folks at Yahoo! here)

A beta release is a couple of months off so now is your chance to review the alpha and contribute towards the beta.

Apache Hadoop YARN Meetup at Hortonworks

Thursday, October 18th, 2012

Apache Hadoop YARN Meetup at Hortonworks – Recap by Vinod Kumar Vavilapalli.

Just in case you missed the Apache Hadoop YARN meetup, summaries and slides are available for:

  • Chris Riccomini on “Building Applications on YARN”
  • YARN API Discussion
  • Efforts Underway

Enjoy!

Apache Hadoop 2.0.2-alpha Released!

Wednesday, October 17th, 2012

Apache Hadoop 2.0.2-alpha Released! by Arun Murthy.

From the post:

It gives me great pleasure to announce that the Apache Hadoop community has voted to release Apache Hadoop 2.0.2-alpha.

This is the second (alpha) release of the next generation release of Apache Hadoop 2.x and comes with significant enhancements to both the major components of Hadoop:

  • HDFS HA has undergone significant enhancements since the previous release for NameNode High Availability
  • YARN has undergone significant testing, stabilization and validation, and has been heavily battle-tested since the previous release.

These are exciting times indeed for the Apache Hadoop community – personally, this is very reminiscent of the period in 2009 when we finally saw the light at the end of the tunnel during the stabilization of Apache Hadoop 1.x (then called Apache Hadoop 0.20.x). A déjà vu, if you will – albeit of the pleasant kind! Yes, we have a few miles to clock, but it feels like the hardest part is already behind us. At the time of release, YARN has already been deployed on super-sized clusters with 2,000 nodes and 3,600 nodes (totaling nearly 6,000 nodes) at Yahoo alone.

Exciting times indeed!

Not unlike a star ship fast enough for time dilation to kick in.

Great!

But which way do you go first?

Hadoop 2.0 offers more efficient crunching of data. But efficient crunching of data is a means, not an end.

Which way will you go with Hadoop 2.0?

What questions will you ask that you can’t ask now?

How will you evaluate the answers?

Ready to Contribute to Apache Hadoop 2.0?

Tuesday, October 16th, 2012

User feedback is a contribution to a software project.

Software can only mature with feedback, your feedback.

Otherwise the final deliverable has a “works on my machine” outcome.

Don’t let Apache Hadoop 2.0 have a “works on my machine” outcome.

Download the preview and contribute your experiences back to the community.

We will all be glad you did!

Details:

Hortonworks Data Platform 2.0 Alpha is Now Available for Preview! by Jeff Sposetti.

From the post:

We are very excited to announce the Alpha release of the Hortonworks Data Platform 2.0 (HDP 2.0 Alpha).

HDP 2.0 Alpha is built around Apache Hadoop 2.0, which improves availability of HDFS with High Availability for the NameNode along with several performance and reliability enhancements. Apache Hadoop 2.0 also significantly advances data processing in the Hadoop ecosystem with the introduction of YARN, a generic resource-management and application framework to support MapReduce and other paradigms such as real-time processing and graph processing.

In addition to Apache Hadoop 2.0, this release includes the essential Hadoop ecosystem projects such as Apache HBase, Apache Pig, Apache Hive, Apache HCatalog, Apache ZooKeeper and Apache Oozie to provide a fully integrated and verified Apache Hadoop 2.0 stack.

Apache Hadoop 2.0 is well on the path to General Availability, and is already deployed at scale in several organizations; but it won’t get to the current maturity levels of the Hadoop 1.0 stack (available in Hortonworks Data Platform 1.x) without feedback and contributions from the community.

Hortonworks strongly believes that for open source technologies to mature and become widely adopted in the enterprise, you must balance innovation with stability. With HDP 2.0 Alpha, Hortonworks provides organizations an easy way to evaluate and gain experience with the Apache Hadoop 2.0 technology stack, and it presents the perfect opportunity to help bring stability to the platform and influence the future of the technology.

Are You Confused? (About MR2 and YARN?) Help is on the way!

Monday, October 8th, 2012

MR2 and YARN Briefly Explained by Justin Kestelyn.

Justin writes:

With CDH4 onward, the Apache Hadoop component introduced two new terms for Hadoop users to wonder about: MR2 and YARN. Unfortunately, these terms are mixed up so much that many people are confused about them. Do they mean the same thing, or not?

They do not, but see Justin’s post for the details. (He also points to a longer post with more details.)

YARN Meetup at Hortonworks on Friday, Oct 12

Thursday, October 4th, 2012

YARN Meetup at Hortonworks on Friday, Oct 12 by Russell Jurney.

From the post:

Hortonworks is hosting an Apache YARN Meetup on Friday, Oct 12, to solicit feedback on the YARN APIs. We’ve talked about YARN before in a four-part series on YARN, parts one, two, three and four.

YARN, or “Apache Hadoop NextGen MapReduce,” has come a long way this year. It is now a full-fledged sub-project of Apache Hadoop and has already been deployed on a massive 2,000 node cluster at Yahoo. Many projects, both open-source and otherwise, such as Storm and S4, are porting to work in YARN, and many of them are in fairly advanced stages. We also have several individuals implementing one-off or ad-hoc applications on YARN.

This meetup is a good time for YARN developers to catch up and talk more about YARN, its current status and medium-term and long-term roadmap.

OK, it’s probably too late to get cheap tickets but if you are in New York on the 12th of October, take advantage of the opportunity!

And please blog about the meeting, with a note to yours truly! I will post a link to your posting.

Apache Hadoop YARN – NodeManager

Wednesday, September 12th, 2012

Apache Hadoop YARN – NodeManager by Vinod Kumar Vavilapalli

From the post:

In the previous post, we briefly covered the internals of Apache Hadoop YARN’s ResourceManager. In this post, which is the fourth in the multi-part YARN blog series, we are going to dig deeper into the NodeManager internals and some of the key features that NodeManager exposes. Parts one, two and three are available.

Introduction

The NodeManager (NM) is YARN’s per-node agent, and takes care of the individual compute nodes in a Hadoop cluster. This includes keeping up to date with the ResourceManager (RM); overseeing containers’ life-cycle management; monitoring resource usage (memory, CPU) of individual containers; tracking node health; log management; and auxiliary services which may be exploited by different YARN applications.

Administration isn’t high on the “exciting” list, although without good administration, things can get very “exciting.”

NodeManager gives you the monitoring tools to help avoid the latter form of excitement.

Apache Hadoop YARN – Concepts and Applications

Sunday, September 9th, 2012

Apache Hadoop YARN – Concepts and Applications by Jim Walker.

From the post:

In our previous post we provided an overview and an outline of the motivation behind Apache Hadoop YARN, the latest Apache Hadoop subproject. In this post we cover the key YARN concepts and walk through how diverse user applications work within this new system.

I thought I had missed a post in this series and I had! ;-)

Enjoy!

New ‘The Future of Apache Hadoop’ Season!

Wednesday, September 5th, 2012

OK, the real title is: Four New Installments in ‘The Future of Apache Hadoop’ Webinar Series

From the post:

During the ‘Future of Apache Hadoop’ webinar series, Hortonworks founders and core committers will discuss the future of Hadoop and related projects including Apache Pig, Apache Ambari, Apache Zookeeper and Apache Hadoop YARN.

Apache Hadoop has rapidly evolved to become the leading platform for managing, processing and analyzing big data. Consequently there is a thirst for knowledge on the future direction for Hadoop related projects. The Hortonworks webinar series will feature core committers of the Apache projects discussing the essential components required in a Hadoop Platform, current advances in Apache Hadoop, relevant use-cases and best practices on how to get started with the open source platform. Each webinar will include a live Q&A with the individuals at the center of the Apache Hadoop movement.

Coming to a computer near you:

  • Pig Out on Hadoop (Alan Gates): Wednesday, September 12 at 10:00 a.m. PT / 1:00 p.m. ET
  • Deployment and Management of Hadoop Clusters with Ambari (Matt Foley): Wednesday, September 26 at 10:00 a.m. PT / 1:00 p.m. ET
  • Scaling Apache Zookeeper for the Next Generation of Hadoop Applications (Mahadev Konar): Wednesday, October 17 at 10:00 a.m. PT / 1:00 p.m. ET
  • YARN: The Future of Data Processing with Apache Hadoop (Arun C. Murthy): Wednesday, October 31 at 10:00 a.m. PT / 1:00 p.m. ET

Registration is open so get it on your calendar!

Apache Hadoop YARN – ResourceManager

Friday, August 31st, 2012

Apache Hadoop YARN – ResourceManager by Arun Murthy

From the post:

This is the third post in the multi-part series to cover important aspects of the newly formed Apache Hadoop YARN sub-project. In our previous posts (part one, part two), we provided the background and an overview of Hadoop YARN, and then covered the key YARN concepts and walked you through how diverse user applications work within this new system.

In this post, we are going to delve deeper into the heart of the system – the ResourceManager.

In case your data processing needs run towards the big/large end of the spectrum.

Apache Hadoop YARN – Background and an Overview

Thursday, August 9th, 2012

Apache Hadoop YARN – Background and an Overview by Arun Murthy.

From the post:

MapReduce – The Paradigm

Essentially, the MapReduce model consists of a first, embarrassingly parallel, map phase where input data is split into discrete chunks to be processed. It is followed by the second and final reduce phase where the output of the map phase is aggregated to produce the desired result. The simple, and fairly restricted, nature of the programming model lends itself to very efficient and extremely large-scale implementations across thousands of cheap, commodity nodes.
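For readers who have not seen the model spelled out, a single-process word count in Python is the usual illustration. This is only a sketch of the two phases, not Hadoop code: the function names are my own, and the grouping step in the middle (the "shuffle") is something a real framework does for you across machines.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn one chunk of input into (key, value) pairs, independently of other chunks."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group all values by key so each key can be reduced on its own."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: aggregate every value seen for one key."""
    return key, sum(values)


def word_count(chunks):
    pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["yarn runs mapreduce", "YARN runs tez"]))
```

Because each call to `map_phase` touches only its own chunk, the map side can be spread over as many nodes as there are chunks, which is the "embarrassingly parallel" part.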

Apache Hadoop MapReduce is the most popular open-source implementation of the MapReduce model.

In particular, when MapReduce is paired with a distributed file-system such as Apache Hadoop HDFS, which can provide very high aggregate I/O bandwidth across a large cluster, the economics of the system are extremely compelling – a key factor in the popularity of Hadoop.

One of the keys to this is the lack of data motion, i.e. move compute to data and do not move data to the compute node via the network. Specifically, the MapReduce tasks can be scheduled on the same physical nodes on which data is resident in HDFS, which exposes the underlying storage layout across the cluster. This significantly reduces network I/O and allows the majority of the I/O to happen on the local disk or within the same rack – a core advantage.
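The "move compute to data" point can be checked with a few lines of Python. This is a toy classification, not anything taken from Hadoop: `locality`, the node names and the `rack_of` mapping are assumptions for illustration. Given the nodes that hold replicas of a block and the node a task was placed on, it reports whether the read is node-local, rack-local or off-rack.

```python
from collections import Counter


def locality(replica_nodes, task_node, rack_of):
    """Classify where a task runs relative to the replicas of its input block."""
    if task_node in replica_nodes:
        return "node-local"   # read from local disk
    if rack_of[task_node] in {rack_of[n] for n in replica_nodes}:
        return "rack-local"   # read stays within the rack
    return "off-rack"         # read crosses racks


# Hypothetical four-node cluster spread over three racks.
rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r3"}

# (nodes holding the block, node the task was scheduled on)
placements = [
    ({"n1", "n3"}, "n1"),
    ({"n1"}, "n2"),
    ({"n1"}, "n4"),
]

print(Counter(locality(replicas, node, rack_of) for replicas, node in placements))
```

The more placements that land in the first two buckets, the less traffic crosses the network, which is the advantage the passage claims.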

An introduction to the architecture of Apache Hadoop YARN that starts with its roots in MapReduce.

Introducing Apache Hadoop YARN

Friday, August 3rd, 2012

Introducing Apache Hadoop YARN by Arun Murthy.

From the post:

I’m thrilled to announce that the Apache Hadoop community has decided to promote the next-generation Hadoop data-processing framework, i.e. YARN, to be a sub-project of Apache Hadoop in the ASF!

Apache Hadoop YARN joins Hadoop Common (core libraries), Hadoop HDFS (storage) and Hadoop MapReduce (the MapReduce implementation) as the sub-projects of Apache Hadoop which, itself, is a Top Level Project in the Apache Software Foundation. Until this milestone, YARN was a part of the Hadoop MapReduce project and now is poised to stand up on its own as a sub-project of Hadoop.

In a nutshell, Hadoop YARN is an attempt to take Apache Hadoop beyond MapReduce for data-processing.

As folks are aware, Hadoop HDFS is the data storage layer for Hadoop and MapReduce was the data-processing layer. However, the MapReduce algorithm, by itself, isn’t sufficient for the very wide variety of use-cases we see Hadoop being employed to solve. With YARN, Hadoop now has a generic resource-management and distributed application framework, whereby one can implement multiple data processing applications customized for the task at hand. Hadoop MapReduce is now one such application for YARN and I see several others given my vantage point – in the future you will see MPI, graph-processing, simple services etc.; all co-existing with MapReduce applications in a Hadoop YARN cluster.

Considering the explosive growth of Hadoop, what new data processing applications do you see emerging first in YARN?