Archive for the ‘Hadoop YARN’ Category

Announcing Apache Storm 0.9.3

Thursday, December 18th, 2014

Announcing Apache Storm 0.9.3 by Taylor Goetz

From the post:

With Apache Hadoop YARN as its architectural center, Apache Hadoop continues to attract new engines to run within the data platform, as organizations want to efficiently store their data in a single repository and interact with it for batch, interactive and real-time streaming use cases. Apache Storm brings real-time data processing capabilities to help capture new business opportunities by powering low-latency dashboards, security alerts, and operational enhancements integrated with other applications running in the Hadoop cluster.


Now there’s an early holiday surprise!



Sunday, December 14th, 2014

GearPump (GitHub)

From the wiki homepage:

GearPump is a lightweight, real-time, big data streaming engine. It is inspired by recent advances in the Akka framework and a desire to improve on existing streaming frameworks. GearPump draws from a number of existing frameworks including MillWheel, Apache Storm, Spark Streaming, Apache Samza, Apache Tez, and Hadoop YARN while leveraging Akka actors throughout its architecture.

What originally caught my attention was this passage on the GitHub page:

Per initial benchmarks we are able to process 11 million messages/second (100 bytes per message) with a 17ms latency on a 4-node cluster.

Think about that for a second.

Per initial benchmarks we are able to process 11 million messages/second (100 bytes per message) with a 17ms latency on a 4-node cluster.
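To put those figures in perspective, here is a back-of-the-envelope calculation using only the numbers quoted above. The even spread of load across the four nodes is my assumption; the benchmark description does not say how the work was distributed.

```python
# Figures quoted from the GearPump GitHub page.
messages_per_second = 11_000_000
bytes_per_message = 100
nodes = 4

# Aggregate payload moved through the cluster each second.
total_bytes_per_second = messages_per_second * bytes_per_message

# Per-node share, assuming (my assumption) the load is spread evenly.
per_node_messages = messages_per_second / nodes
per_node_bytes = total_bytes_per_second / nodes

print(f"aggregate: {total_bytes_per_second / 1e9:.1f} GB/s")
print(f"per node:  {per_node_messages / 1e6:.2f} million messages/s, "
      f"{per_node_bytes / 1e6:.0f} MB/s")
```

That works out to roughly 1.1 GB/s of payload in aggregate, or about 275 MB/s per node if the load is balanced.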

The GitHub page features a word count example and pointers to the wiki with more examples.

What if every topic “knew” the index value of every topic that should merge with it on display to a user?

When added to a topic map it broadcasts its merging property values and any topic with those values responds by transmitting its index value.

When you retrieve a topic, it has all the IDs necessary to create a merged view of the topic on the fly and on the client side.

There would be redundancy in the map but de-duplication for storage space went out with preferences for 7-bit character values to save memory space. So long as every topic returns the same result, who cares?

Well, it might make a difference when the CIA wants to give every contractor full access to its datastores 24×7 via their cellphones. But, until that is an actual requirement, I would not worry about the storage space overmuch.
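A minimal sketch of the broadcast-and-respond idea, assuming a single in-memory map. The class name `TopicMap`, the `merge_keys` argument and the sample topics are all hypothetical, invented for illustration; nothing here comes from GearPump or any topic map engine.

```python
from collections import defaultdict


class TopicMap:
    """Toy topic map in which every topic stores the indexes of its merge partners."""

    def __init__(self):
        self.topics = []                  # index -> dict of topic properties
        self.merge_ids = []               # index -> set of partner indexes
        self._by_key = defaultdict(set)   # merging property value -> topic indexes

    def add(self, merge_keys, **properties):
        """Add a topic, 'broadcasting' its merging values to the existing topics."""
        index = len(self.topics)
        self.topics.append(dict(properties))
        self.merge_ids.append(set())
        group = {index}
        for key in merge_keys:
            # Every topic holding this value "responds" with its index,
            # along with the partners it already knows about.
            for other in self._by_key[key]:
                group.add(other)
                group |= self.merge_ids[other]
            self._by_key[key].add(index)
        # Redundantly record the whole group on every member.
        for member in group:
            self.merge_ids[member] = group - {member}
        return index

    def merged_view(self, index):
        """Build the merged view on the fly from the stored partner indexes."""
        view = dict(self.topics[index])
        for other in sorted(self.merge_ids[index]):
            for name, value in self.topics[other].items():
                view.setdefault(name, value)
        return view


tm = TopicMap()
a = tm.add({"subject:yarn"}, title="Hadoop YARN")
b = tm.add({"subject:yarn", "url:example"}, category="resource management")
c = tm.add({"url:example"}, year=2014)
print(tm.merged_view(a))
```

Retrieving any of the three topics yields the same merged view, at the cost of storing the partner set on each one. That is the redundancy traded for client-side merging.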

I first saw this in a tweet from Suneel Marthi.

Hadoop Ecosystem Guide Chart

Wednesday, August 13th, 2014

As they say, you can’t tell the players without a program!

hadoop chart

From Greg Hill’s New To Hadoop? Here’s A Handy Guide To Get You Started (Part 1)

Greg’s post has a brief summary of each category.

Additional pieces that you will find handy are promised in a future post.

The Hadoop ecosystem is evolving rapidly so take this chart as a rough guide. More players are likely to appear in a matter of months if not weeks.

I first saw this in Joe Crobak’s Hadoop Weekly – July 28, 2014.

Hadoop Summit Content Curation

Thursday, July 24th, 2014

Hadoop Summit Content Curation by Jules S. Damji.

From the post:

Although the Hadoop Summit San Jose 2014 has come and gone, the invaluable content—keynotes, sessions, and tracks—is available here. We’ve selected a few sessions for Hadoop developers, practitioners, and architects, curating them under Apache Hadoop YARN, the architectural center and the data operating system.

In most of the keynotes and tracks three themes resonated:

  1. Enterprises are transitioning from traditional Hadoop to modern Hadoop 2.
  2. YARN is an enabler, the central orchestrator that facilitates multiple workloads, runs multiple data engines, and supports multiple access patterns—batch, interactive, streaming, and real-time—in Apache Hadoop 2.
  3. Apache Hadoop 2, as part of Modern Data Architecture (MDA), is enterprise ready.

It doesn’t matter if I have cable or DirecTV, there is never a shortage of material to watch. 😉


…Setting Up an R-Hadoop System

Friday, May 30th, 2014

Step-by-Step Guide to Setting Up an R-Hadoop System by Yanchang Zhao.

From the post:

This is a step-by-step guide to setting up an R-Hadoop system. I have tested it both on a single computer and on a cluster of computers. Note that this process is for Mac OS X and some steps or settings might be different for Windows or Ubuntu.

This looks like an excellent post on installing R-Hadoop. It is written for Mac OS X and I have yet to confirm its installation on either Windows or Ubuntu.

I won’t be installing this on Windows so if you can confirm any needed changes and post them I would appreciate it.

I first saw this in a tweet by Gregory Piatetsky.

Yahoo Betting on Apache Hive, Tez, and YARN

Sunday, May 18th, 2014

Yahoo Betting on Apache Hive, Tez, and YARN

With the usual caveats about test results:

On the other hand, Hive 0.13 query execution times were not only significantly better at higher volumes of data (Fig 3 and 4) but also executed successfully without failing. In our comparisons and observations with Shark, we saw most queries fail with the larger (10TB) dataset. These same queries ran successfully and much faster on Hive 0.13, allowing for better scale. This was extremely critical for us, as we needed a single query and BI solution on the Hadoop grid regardless of dataset size. The Hive solution resonates with our users, as they do not have to worry about learning multiple technologies and discerning which solution to use when. A common solution also results in cost and operational efficiencies from having to build, deploy, and maintain a single solution.

Successful 10TB query times and results should be enough to get your attention. Not that many of us have data in that range, today, but tomorrow, who can say?


I first saw this in a tweet by Joshua Lande.

Hortonworks Data Platform 2.1

Wednesday, April 2nd, 2014

Hortonworks Data Platform 2.1 by Jim Walker.

From the post:

The pace of innovation within the Apache Hadoop community is truly remarkable, enabling us to announce the availability of Hortonworks Data Platform 2.1, incorporating the very latest innovations from the Hadoop community in an integrated, tested, and completely open enterprise data platform.

A VM is available now, with full releases to follow later in April.

Just grabbing the headings from Jim’s post:

The Stinger Initiative: Apache Hive, Tez and YARN for Interactive Query

Data Governance with Apache Falcon

Security with Apache Knox

Stream Processing with Apache Storm

Searching Hadoop Data with Apache Solr

Advanced Operations with Apache Ambari

See Jim’s post for some of the details and the VM for others.

New Hue Demos:…

Friday, February 28th, 2014

New Hue Demos: Spark UI, Job Browser, Oozie Scheduling, and YARN Support by Justin Kestelyn.

From the post:

Hue, the open source Web UI that makes Apache Hadoop easier to use, is now a standard across the ecosystem — shipping within multiple software distributions and sandboxes. One of the reasons for its success is an agile developer community behind it that is constantly rolling out new features to its users.

Just as important, the Hue team is diligent in its documentation and demonstration of those new features via video demos. In this post, for your convenience, I bring you the most recent examples (released since December):

  • The new Spark Igniter App
  • Using YARN and Job Browser
  • Job Browser with YARN Security
  • Apache Oozie crontab scheduling

All short but all worthwhile. Nice way to start off your Saturday morning. The kids have cartoons and you have Hue. 😉

Empowering Half a Billion Users For Free – Would You?

Wednesday, January 22nd, 2014

How To Use Microsoft Excel to Visualize Hadoop Data by Saptak Sen.

From the post:

Microsoft and Hortonworks have been working together for over two years now with the goal of bringing the power of Big Data to a billion people. As a result of that work, today we announced the General Availability of HDP 2.0 for Windows with the full power of YARN.

There are already over half a billion Excel users on this planet.

So, we have put together a short tutorial on the Hortonworks Sandbox where we walk through the end-to-end data pipeline using HDP and Microsoft Excel in the shoes of a data analyst at a financial services firm where she:

  • Cleans and aggregates 10 years of raw stock tick data from NYSE
  • Enriches the data model by looking up additional attributes from Wikipedia
  • Creates an interactive visualization on the model

You can find the tutorial here.

As part of this process you will experience how simple it is to integrate HDP with the Microsoft Power BI platform.

This integration is made possible by the community work to design and implement WebHDFS, an open REST API in Apache Hadoop. Microsoft used the API from Power Query for Excel to make the integration to Microsoft Business Intelligence platform seamless.

Happy Hadooping!!!
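For readers who have not met WebHDFS, here is a rough sketch of what a request URL looks like. The `/webhdfs/v1/<path>?op=...` layout and the `LISTSTATUS` and `OPEN` operations come from the WebHDFS REST API. The helper name, host name and file path are made up for illustration, and port 50070 is an assumption (the usual NameNode HTTP port in Hadoop 2) that your cluster may not share.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, path, op, port=50070, **params):
    """Build a WebHDFS URL of the form http://host:port/webhdfs/v1/<path>?op=..."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Hypothetical host and directory, used only to show the URL shape.
print(webhdfs_url("sandbox.example.com", "/user/analyst/ticks", "LISTSTATUS"))
```

Any HTTP client, Power Query included, can issue a GET against such a URL. That is what makes the API a convenient integration point.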

Opening up Hadoop to half a billion users can’t do anything but drive the development of the Hadoop ecosystem.

Which will in turn return more benefits to the Excel user community, which will drive usage of Excel.

That’s what I call a smart business strategy.


PS: Where are similar strategies possible for subject identity?

HDP 2.0 for Windows is GA

Tuesday, January 21st, 2014

HDP 2.0 for Windows is GA by John Kreisa.

From the post:

We are excited to announce that the Hortonworks Data Platform 2.0 for Windows is publicly available for download. HDP 2 for Windows is the only Apache Hadoop 2.0 based platform that is certified for production usage on Windows Server 2008 R2 and Windows Server 2012 R2.

With this release, the latest in community innovation on Apache Hadoop is now available across all major Operating Systems. HDP 2.0 provides Hadoop coverage for more than 99% of the enterprises in the world, offering the most flexible deployment options from On-Premise to a variety of cloud solutions.

Unleashing YARN and Hadoop 2 on Windows

HDP 2.0 for Windows is a leap forward as it brings the power of Apache Hadoop YARN to Windows. YARN enables a user to interact with all data in multiple ways simultaneously – for instance making use of both realtime and batch processing – making Hadoop a true multi-use data platform and allowing it to take its place in a modern data architecture.


BTW, Microsoft is working with Hortonworks to make sure Apache Hadoop works seamlessly with Microsoft Windows and Azure.

I think they call that interoperability. Or something like that. 😉

Enron, Email, Kiji, Hive, YARN, Tez (Jan. 7th, DC)

Monday, January 6th, 2014

Exploring Enron Email Dataset with Kiji and Hive; Apache YARN and Apache Tez Hadoop-DC.

Tuesday, January 7, 2014 6:00 PM to 9:30 PM
Neustar (Room: Neuview) 21575 Ridgetop Circle, Sterling, VA

From the webpage:

Exploring Enron Email Dataset with Kiji and Hive

Lee Sheng, WibiData

Apache Hive is a data warehousing system for large volumes of data stored in Hadoop that provides SQL based access for exploring datasets. KijiSchema provides evolvable schemas of primitive and compound types on top of HBase. The integration between these provides the best aspects of both worlds (ad hoc SQL based querying on top of datasets using evolvable schemas containing complex objects). This talk will present examples of queries utilizing this integration to do exploratory analysis of the Enron email corpus. Delving into topics such as email responder pairs and sentiment analysis can expose many of the interesting points in the rise and fall of Enron.

Apache YARN & Apache Tez

Tom McCuch Technical Director, Hortonworks

Apache Hadoop has become synonymous with Big Data and powers large scale data processing across some of the biggest companies in the world. Hadoop 2 is the next generation release of Hadoop and marks a pivotal point in its maturity with YARN – the new Hadoop compute framework. YARN – Yet Another Resource Negotiator – is a complete re-architecture of the Hadoop compute stack with a clean separation between platform and application. This opens up Hadoop data processing to new applications that can be executed IN Hadoop instead of outside Hadoop, thus improving efficiency, performance, data sharing and lowering operation costs. The Big Data ecosystem is already converging on YARN with new applications like Apache Tez being written specifically for YARN. Apache Tez aims to provide high performance and efficiency out of the box, across the spectrum of low latency queries and heavy-weight batch processing. The talk will provide a brief overview of key Hadoop 2 innovations, focusing in on YARN and Tez – covering architecture, motivational use cases and future roadmap. Finally, the impact of YARN on the Hadoop community will be demonstrated through running interactive queries with both Hive on Tez and with Hive on MapReduce, and comparing their performance side-by-side on the same Hadoop 2 cluster.

When I saw the low tomorrow in DC is going to be 16F and the high 21F, I thought I should pass this along.

Does anyone have a very large set of phone metadata that is public?

Thinking rather than grinding over Enron’s stumbles, again, phone metadata could be hands-on training for a variety of careers. 😉

Looking forward to seeing videos of these presentations!

Building Hadoop-based Apps on YARN

Wednesday, December 18th, 2013

Building Hadoop-based Apps on YARN

Hortonworks has put together resources that may ease your way to your first Hadoop-based app on YARN.

The resources are organized in steps:

  • STEP 1. Understand the motivations and architecture for YARN.
  • STEP 2. Explore example applications on YARN.
  • STEP 3. Examine real world applications on YARN.

Further examples and real-world applications would be welcomed by anyone studying YARN.

Getting Started Writing YARN Applications [Webinar – December 18th]

Friday, December 13th, 2013

Getting Started Writing YARN Applications by Lisa Sensmeier.

From the post:

There is a lot of information available on the benefits of Apache YARN but how do you get started building applications? On December 18 at 9am Pacific Time, Hortonworks will host a webinar and go over just that: what independent software vendors (ISVs) and developers need to do to take the first steps towards developing applications or integrating existing applications on YARN.

Register for the webinar here.

My experience with webinars has been uneven to say the least.

Every Mike McCandless webinar (live or recorded) has been a real treat. Great presentation skills, high value content and well organized.

I have seen other webinars with poor presentation skills, low value or mostly ad content that were poorly organized.

No promises on what you will see on the 18th of December but let’s hope for the former and not the latter. (No pressure, no pressure. 😉 )

Fast Search and Analytics on Hadoop with Elasticsearch

Monday, November 25th, 2013

Fast Search and Analytics on Hadoop with Elasticsearch by Lisa Sensmeier.

From the post:

Hortonworks customers can now enhance their Hadoop applications with Elasticsearch real-time data exploration, analytics, logging and search features, all designed to help businesses ask better questions, get clearer answers and better analyze their business metrics in real-time.

Hortonworks Data Platform and Elasticsearch make for a powerful combination of technologies that are extremely useful to anyone handling large volumes of data on a day-to-day basis. With the ability of YARN to support multiple workloads, customers with current investments in flexible batch processing can also add real-time search applications from Elasticsearch.

Not much in the way of substantive content but it does have links to good resources on Hadoop and Elasticsearch.

Migrating to MapReduce 2 on YARN (For Users)

Saturday, November 9th, 2013

Migrating to MapReduce 2 on YARN (For Users) by Sandy Ryza.

From the post:

In Apache Hadoop 2, YARN and MapReduce 2 (MR2) are long-needed upgrades for scheduling, resource management, and execution in Hadoop. At their core, the improvements separate cluster resource management capabilities from MapReduce-specific logic. They enable Hadoop to share resources dynamically between MapReduce and other parallel processing frameworks, such as Cloudera Impala; allow more sensible and finer-grained resource configuration for better cluster utilization; and permit Hadoop to scale to accommodate more and larger jobs.

In this post, users of CDH (Cloudera’s distribution of Hadoop and related projects) who program MapReduce jobs will get a guide to the architectural and user-facing differences between MapReduce 1 (MR1) and MR2. (MR2 is the default processing framework in CDH 5, although MR1 will continue to be supported.) Operators/administrators can read a similar post designed for them here.

From further within the post:

MR2 supports both the old (“mapred”) and new (“mapreduce”) MapReduce APIs used for MR1, with a few caveats. The difference between the old and new APIs, which concerns user-facing changes, should not be confused with the difference between MR1 and MR2, which concerns changes to the underlying framework. CDH 4 and CDH 5 support the new and old MapReduce APIs as well as both MR1 and MR2. (Now, go back and read this paragraph again, because the naming is often a source of confusion.) (Emphasis added.)

And under Job Configuration:

As in MR1, job configuration options can be specified on the command line, in Java code, or in the mapred-site.xml on the client machine in the same way they previously were. Most job configuration options, with rare exceptions, that were available in MR1 work in MR2 as well. For consistency and clarity, many options have been given new names. The older names are deprecated, but will still work for the time being. The exceptions are mapred.child.ulimit and all options relating to JVM reuse, which are no longer supported. (Emphasis added.)

That’s all very reassuring.

Are your MapReduce engineers using the old names (deprecated) or the new names or some combination of both?

As software evolves, changing of names cannot be avoided and no doubt Cloudera has tried to avoid gratuitous name changes.

But at the bottom line, isn’t it your responsibility to track internal use of names? For consistency and maintenance?
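One way to start tracking is to audit the job configurations you already have. The sketch below is an illustration under stated assumptions, not Cloudera’s tooling. The three renames are a small sample that I believe matches Hadoop’s published deprecated-properties list, so verify them against the list for your version. `mapred.child.ulimit` is named in the quoted post as unsupported, and `mapred.job.reuse.jvm.num.tasks` is my assumption for the MR1 JVM reuse option it alludes to.

```python
# Small illustrative sample: old MR1-era name -> newer name.
# Check the deprecated-properties list for your Hadoop version before relying on it.
RENAMED = {
    "mapred.job.name": "mapreduce.job.name",
    "mapred.map.tasks": "mapreduce.job.maps",
    "mapred.reduce.tasks": "mapreduce.job.reduces",
}

# Options the quoted post says no longer work in MR2.
# The JVM reuse option name is an assumption on my part.
UNSUPPORTED = {"mapred.child.ulimit", "mapred.job.reuse.jvm.num.tasks"}


def audit(config):
    """Sort a job's option names into renamed, unsupported and untouched."""
    report = {"rename": {}, "unsupported": [], "ok": []}
    for name in sorted(config):
        if name in RENAMED:
            report["rename"][name] = RENAMED[name]
        elif name in UNSUPPORTED:
            report["unsupported"].append(name)
        else:
            report["ok"].append(name)
    return report


# Hypothetical job configuration mixing old and new names.
job = {
    "mapred.reduce.tasks": "8",
    "mapreduce.job.name": "nightly-rollup",
    "mapred.child.ulimit": "4194304",
}
print(audit(job))
```

Run over every job definition in a repository, a report like this answers the "old names, new names or both?" question with data rather than guesswork.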

Hortonworks Sandbox Version 2.0

Wednesday, October 30th, 2013

Hortonworks Sandbox Version 2.0

From the web page:

Sandbox is a personal, portable Hadoop environment that comes with a dozen interactive Hadoop tutorials. Sandbox includes many of the most exciting developments from the latest HDP distribution, packaged up in a virtual environment that you can get up and running in 15 minutes!

Sandbox comes with:

Component          Version
Apache Hadoop      2.2.0
Apache Hive        0.12.0
Apache HCatalog    0.12.0
Apache HBase       0.96.0
Apache ZooKeeper   3.4.5
Apache Pig         0.12.0
Apache Sqoop       1.4.4
Apache Flume       1.4.0
Apache Oozie       4.0.0
Apache Ambari      1.4.1
Apache Mahout      0.8.0
Hue                2.3.0

If you check the same listing at the Hortonworks page, you will see that Hue lacks a hyperlink. I had forgotten why until I ran the link down. 😉


HDP 2.0 and its YARN-based architecture…delivered!

Wednesday, October 23rd, 2013

HDP 2.0 and its YARN-based architecture…delivered! by Shaun Connolly.

From the post:

Typical delivery of enterprise software involves a very controlled date with a secret roadmap designed to wow prospects, customers, press and analysts…or at least that is the way it usually works. Open source, however, changes this equation.

As described here, the vision for extending Hadoop beyond its batch-only roots in support of interactive and real-time workloads was set by Arun Murthy back in 2008. The initiation of YARN, the key technology for enabling this vision, started in earnest in 2011, was declared GA by the community in the recent Apache Hadoop 2.2 release, and is now delivered for mainstream enterprises and the broader commercial ecosystem with the release of Hortonworks Data Platform 2.0.

HDP 2.0, and its YARN foundation, is a huge milestone for the Hadoop market since it unlocks that vision of gathering all data in Hadoop and interacting with that data in many ways and with predictable performance levels… but you know this because Apache Hadoop 2.2 went GA last week.

If you know Shaun, do you think he knows that only 6.4% of projects costing >= $10 million succeed? ( website ‘didn’t have a chance in hell’)

HDP 2.0 did take longer than

But putting 18 children in a room isn’t going to produce an 18 year old in a year.

Shaun does a great job in this post and points to other HDP 2.0 resources you will want to see.

BTW, maybe you should not mention the 6.4% success rate to Shaun. It might jinx how open source software development works and succeeds. 😉

Apache Hadoop 2 is now GA!

Wednesday, October 16th, 2013

Apache Hadoop 2 is now GA! by Arun Murthy.

From the post:

I’m thrilled to note that the Apache Hadoop community has declared Apache Hadoop 2.x as Generally Available with the release of hadoop-2.2.0!

This represents the realization of a massive effort by the entire Apache Hadoop community which started nearly 4 years to date, and we’re sure you’ll agree it’s cause for a big celebration. Equally, it’s a great credit to the Apache Software Foundation which provides an environment where contributors from various places and organizations can collaborate to achieve a goal which is as significant as Apache Hadoop v2.

Congratulations to everyone!
(emphasis in the original)

See Arun’s post for his summary of Hadoop 2.

Take the following graphic I stole from his post as motivation to do so:

Hadoop Stack

Apache Tez: A New Chapter in Hadoop Data Processing

Sunday, September 15th, 2013

Apache Tez: A New Chapter in Hadoop Data Processing by Bikas Saha.

From the post:

In this post we introduce the motivation behind Apache Tez and provide some background around the basic design principles for the project. As Carter discussed in our previous post on Stinger progress, Apache Tez is a crucial component of phase 2 of that project.

What is Apache Tez?

Apache Tez generalizes the MapReduce paradigm to execute a complex DAG (directed acyclic graph) of tasks. It also represents the next logical step for Hadoop 2 and the introduction of YARN and its more general-purpose resource management framework.

While MapReduce has served masterfully as the data processing backbone for Hadoop, its batch-oriented nature makes it unsuited for certain workloads like interactive query. Tez represents an alternate to the traditional MapReduce that allows for jobs to meet demands for fast response times and extreme throughput at petabyte scale. A great example of a benefactor of this new approach is Apache Hive and the work being done in the Stinger Initiative.


Distributed data processing is the core application that Apache Hadoop is built around. Storing and analyzing large volumes and variety of data efficiently has been the cornerstone use case that has driven large scale adoption of Hadoop, and has resulted in creating enormous value for the Hadoop adopters. Over the years, while building and running data processing applications based on MapReduce, we have understood a lot about the strengths and weaknesses of this framework and how we would like to evolve the Hadoop data processing framework to meet the evolving needs of Hadoop users. As the Hadoop compute platform moves into its next phase with YARN, it has decoupled itself from MapReduce being the only application, and opened the opportunity to create a new data processing framework to meet the new challenges. Apache Tez aspires to live up to these lofty goals.

Is your topic map engine decoupled from a single merging algorithm?

I ask because SLAs may require different algorithms for data sets or sources.

Leaked U.S. military documents may have a higher priority for completeness than half-human/half-bot posts on a Twitter stream.

Hoya (HBase on YARN) : Application Architecture

Friday, August 9th, 2013

Hoya (HBase on YARN) : Application Architecture by Steve Loughran.

From the post:

At Hadoop Summit in June, we introduced a little project we’re working on: Hoya: HBase on YARN. Since then the code has been reworked and is now up on Github. It’s still very raw, and requires some local builds of bits of Hadoop and HBase – but it is there for the interested.

In this article we’re going to look at the architecture, and a bit of the implementation.

We’re not going to look at YARN in this article -for that we have a dedicated section of the Hortonworks site -including sample chapters of Arun Murthy’s forthcoming book. Instead we’re going to cover how Hoya makes use of YARN.

If you are interested in where Hadoop is likely to go beyond MapReduce and don’t mind getting your hands dirty, this is for you.

Hortonworks Data Platform 2.0 Community…

Saturday, June 29th, 2013

Hortonworks Data Platform 2.0 Community Preview Now Available

June 26, 2013—Hortonworks, a leading contributor and provider to enterprise Apache™ Hadoop®, today announced the availability of the Hortonworks Data Platform (HDP) 2.0 Community Preview and the launch of the Hortonworks Certification Program for Apache Hadoop YARN to accelerate the availability of YARN-based partner solutions. Based on the next evolution of Apache Hadoop, including the first functional Apache YARN framework that has been more than four years in the making, the 100-percent open source HDP 2.0 features the latest advancements from the open source community that are igniting a new wave of Hadoop innovation.

[Jumping to the chase]

Please join Hortonworks for a webinar on HDP 2.0 on Wednesday, July 10 at 10 a.m. PT / 1:00 p.m. ET. To register for the webinar, please visit:


Hortonworks Data Platform 2.0 Community Preview is available today as a downloadable single-node instance that runs inside a virtual machine, and also as a complete installation for deployment to distributed infrastructure. To download HDP 2.0, please visit:

New in this release: Apache YARN, Apache Tez, and Stinger.

Hadoop YARN

Wednesday, June 26th, 2013

Hadoop YARN by Steve Loughran, Devaraj Das & Eric Baldeschwieler.

From the post:

A next-generation framework for Hadoop data processing.

Apache™ Hadoop® YARN is a sub-project of Hadoop at the Apache Software Foundation introduced in Hadoop 2.0 that separates the resource management and processing components. YARN was born of a need to enable a broader array of interaction patterns for data stored in HDFS beyond MapReduce. The YARN-based architecture of Hadoop 2.0 provides a more general processing platform that is not constrained to MapReduce.


As part of Hadoop 2.0, YARN takes the resource management capabilities that were in MapReduce and packages them so they can be used by new engines. This also streamlines MapReduce to do what it does best, process data. With YARN, you can now run multiple applications in Hadoop, all sharing a common resource management. Many organizations are already building applications on YARN in order to bring them IN to Hadoop.



One of the more accessible explanations of the importance of Hadoop YARN.

Likely not anything new to you but may be helpful when talking to others.

Introducing Hoya – HBase on YARN

Tuesday, June 25th, 2013

Introducing Hoya – HBase on YARN by Steve Loughran, Devaraj Das & Eric Baldeschwieler.

From the post:

In the last few weeks, we have been getting together a prototype, Hoya, running HBase On YARN. This is driven by a few top level use cases that we have been trying to address. Some of them are:

  • Be able to create on-demand HBase clusters easily, by and/or in apps
    • With different versions of HBase potentially (for testing etc.)
  • Be able to configure different HBase instances differently
    • For example, different configs for read/write workload instances
  • Better isolation
    • Run arbitrary co-processors in user’s private cluster
    • User will own the data that the hbase daemons create
  • MR jobs should find it simple to create (transient) HBase clusters
    • For Map-side joins where table data is all in HBase, for example
  • Elasticity of clusters for analytic / batch workload processing
    • Stop / Suspend / Resume clusters as needed
    • Expand / shrink clusters as needed
  • Be able to utilize cluster resources better
    • Run MR jobs while maintaining HBase’s low latency SLAs

If you are interested in getting in on the ground floor on a promising project, here’s your chance!

True, it is an HBase cluster management project but cluster management abounds in as many subjects as any other IT management area.

Not to mention that few of us ever do just “one job,” at most places. Having multiple skills makes you more marketable.

Streaming IN Hadoop: Yahoo! release Storm-YARN

Saturday, June 15th, 2013

Streaming IN Hadoop: Yahoo! release Storm-YARN by Jim Walker.

From the post:

Over the past year, customers have told us they want to store all their data in one place and interact with it in multiple ways… they want to use Hadoop, but in order to do so, it needs to extend beyond batch. It also needs to be interactive and real-time (among others).

This is the entire principle behind YARN, which together with others in the community, Arun Murthy and the team at Hortonworks have been working on for more than 5 years! The YARN based architecture of Hadoop 2.0 is hugely significant and we have been working closely with many partners to incorporate it into their applications.

Storm-YARN Released as Open Source

Yahoo! has been testing Hadoop 2 and its YARN-based architecture for quite some time. All the while they have worked on the convergence of the streaming framework Storm with Hadoop. This work has resulted in a YARN based version of Storm that will radically improve performance and resource management for streaming.

The release blog post from Yahoo.

Processing of data, even big data, is approaching “interactive and real-time,” although I suspect definitions of those terms vary. What is “interactive” for an automated trader might be too fast for a human trader.

What I haven’t seen is concurrent development on the handling of the semantics of big data.

After the initial hysteria over the scope of NSA snooping, the question remains: except for cases where the NSA was given the identity of a suspect (and not always then), was its data gathering of any use?

In topic map terms, the semantic impedance between the data systems was too great for useful manipulation of the data sets as one.

Streaming in Hadoop is welcome news, but until we can robustly manage the semantics of data in streams, much gold is going to pass uncollected from streams.

Philosophy behind YARN Resource Management

Saturday, February 23rd, 2013

Philosophy behind YARN Resource Management by Bikas Saha.

From the post:

YARN is part of the next generation Hadoop cluster compute environment. It creates a generic and flexible resource management framework to administer the compute resources in a Hadoop cluster. The YARN application framework allows multiple applications to negotiate resources for themselves and perform their application specific computations on a shared cluster. Thus, resource allocation lies at the heart of YARN.

YARN ultimately opens up Hadoop to additional compute frameworks, like Tez, so that an application can optimize compute for their specific requirements.

The YARN Resource Manager service is the central controlling authority for resource management and makes allocation decisions. It exposes a Scheduler API that is specifically designed to negotiate resources and not schedule tasks. Applications can request resources at different layers of the cluster topology such as nodes, racks etc. The scheduler determines how much and where to allocate based on resource availability and the configured sharing policy.
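To make "negotiate resources and not schedule tasks" concrete, here is a toy allocator, emphatically not the YARN Scheduler API. The `Node` class, the `allocate` function and the three-node cluster are hypothetical. The only idea borrowed from the description above is that a request names a layer of the topology (a node, a rack, or anywhere) and the allocator answers with where capacity was granted. There is no sharing policy in this sketch; it is first come, first served.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    rack: str
    free_mb: int


def allocate(nodes, location, memory_mb, containers):
    """Grant up to `containers` containers of `memory_mb` each at `location`.

    `location` may be a node name, a rack name, or "*" for anywhere.
    Returns the node names chosen; the list may be shorter than requested.
    """
    granted = []
    for node in nodes:
        if location not in ("*", node.name, node.rack):
            continue
        # Carve containers out of this node while it still has room.
        while node.free_mb >= memory_mb and len(granted) < containers:
            node.free_mb -= memory_mb
            granted.append(node.name)
    return granted


# Hypothetical three-node, two-rack cluster.
cluster = [Node("n1", "r1", 4096), Node("n2", "r1", 2048), Node("n3", "r2", 8192)]

rack_grant = allocate(cluster, "r1", 2048, 4)   # asks for more than rack r1 can give
any_grant = allocate(cluster, "*", 2048, 2)     # no locality preference
print(rack_grant, any_grant)
```

The allocator only hands back capacity. What runs inside each granted container is left entirely to the application, which is the separation the post describes.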

If YARN does become the cluster operating system, knowing the “why” of its behavior will be as important as knowing the “how.”
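To make the "how much and where" part of the quoted passage concrete, here is a minimal sketch of locality-aware allocation. It is not YARN's Scheduler API: the `Node`, `Request`, and `allocate` names are hypothetical, memory is the only resource, and the configured sharing policy (queues, fairness) is left out entirely. It only shows a request falling back from a preferred node, to a preferred rack, to anywhere in the cluster.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    # Hypothetical model of a cluster node; not a YARN class.
    name: str
    rack: str
    free_mb: int


@dataclass
class Request:
    # Hypothetical resource request with optional locality preferences.
    app: str
    mb: int
    node: Optional[str] = None
    rack: Optional[str] = None


def allocate(nodes: List[Node], request: Request) -> Optional[str]:
    """Return the name of the node that receives the container, or None."""
    # Locality levels, most specific first: node, then rack, then anywhere.
    levels = [
        lambda n: request.node is not None and n.name == request.node,
        lambda n: request.rack is not None and n.rack == request.rack,
        lambda n: True,
    ]
    for matches in levels:
        for node in nodes:
            if matches(node) and node.free_mb >= request.mb:
                node.free_mb -= request.mb  # reserve the memory on that node
                return node.name
    return None  # nothing in the cluster can satisfy the request


cluster = [
    Node("n1", "r1", 2048),
    Node("n2", "r1", 4096),
    Node("n3", "r2", 8192),
]

# n1 is too small, so the request falls back to another node on rack r1.
first = allocate(cluster, Request("app-a", 3072, node="n1", rack="r1"))
# Rack r1 is now nearly full, so this request lands off-rack.
second = allocate(cluster, Request("app-b", 6000, rack="r1"))
# No node has this much free memory.
third = allocate(cluster, Request("app-c", 10000))
```

Note that the sketch hands back a place to run, not a task schedule, which is the distinction the post draws: what the application does with the allocation is the application's business.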

Introducing… Tez: Accelerating processing of data stored in HDFS

Wednesday, February 20th, 2013

Introducing… Tez: Accelerating processing of data stored in HDFS by Arun Murthy.

From the post:

MapReduce has served us well. For years it has been THE processing engine for Hadoop and has been the backbone upon which a huge amount of value has been created. While it is here to stay, new paradigms are also needed in order to enable Hadoop to serve an even greater number of usage patterns. A key and emerging example is the need for interactive query, which today is challenged by the batch-oriented nature of MapReduce. A key step to enabling this new world was Apache YARN and today the community proposes the next step… Tez

What is Tez?

Tez – Hindi for “speed” – (currently under incubation vote within Apache) provides a general-purpose, highly customizable framework that creates simplifies data-processing tasks across both small scale (low-latency) and large-scale (high throughput) workloads in Hadoop. It generalizes the MapReduce paradigm to a more powerful framework by providing the ability to execute a complex DAG (directed acyclic graph) of tasks for a single job so that projects in the Apache Hadoop ecosystem such as Apache Hive, Apache Pig and Cascading can meet requirements for human-interactive response times and extreme throughput at petabyte scale (clearly MapReduce has been a key driver in achieving this).

With the emergence of Apache Hadoop YARN as the basis of next generation data-processing architectures, there is a strong need for an application which can execute a complex DAG of tasks which can then be shared by Apache Pig, Apache Hive, Cascading and others. The constrained DAG expressible in MapReduce (one set of maps followed by one set of reduces) often results in multiple MapReduce jobs which harm latency for short queries (overhead of launching multiple jobs) and throughput for large-scale queries (too much overhead for materializing intermediate job outputs to the filesystem). With Tez, we introduce a more expressive DAG of tasks, within a single application or job, that is better aligned with the required processing task – thus, for e.g., any given SQL query can be expressed as a single job using Tez.

If you are familiar with Michael Sperberg-McQueen and Claus Huitfeldt’s work on DAGs, you will be as excited as I am! (Goddag, for example.)

On any day this would be awesome work.

Even more so coming on the heels of two other major project announcements, both from Hortonworks: Securing Hadoop with Knox Gateway and The Stinger Initiative: Making Apache Hive 100 Times Faster.
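For readers who want to see the DAG point from the quoted passage in miniature, here is a rough sketch of running a small join-then-aggregate query as one DAG of tasks. It is not the Tez API: `run_dag`, the task names, and the sample data are all made up for illustration, and everything runs in one process. The point is only that each task receives its upstream outputs directly, rather than each stage being a separate job that writes its intermediate output to the filesystem.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(tasks, deps):
    """Run every task once, after all of its upstream tasks.

    tasks: name -> function taking a list of upstream outputs
    deps:  name -> list of upstream task names (order is preserved)
    """
    results = {}
    for name in TopologicalSorter(deps).static_order():
        inputs = [results[upstream] for upstream in deps.get(name, [])]
        results[name] = tasks[name](inputs)  # intermediate data stays in memory
    return results


def total_by_country(inputs):
    # inputs[0] is the joined list of (country, amount) pairs.
    totals = {}
    for country, amount in inputs[0]:
        totals[country] = totals.get(country, 0) + amount
    return totals


# Hypothetical sample data standing in for two tables.
tasks = {
    "scan_orders": lambda _: [("alice", 30), ("bob", 20), ("alice", 5)],
    "scan_users": lambda _: {"alice": "US", "bob": "DE"},
    # inputs[0] is the orders list, inputs[1] is the user -> country map.
    "join": lambda inputs: [(inputs[1][user], amt) for user, amt in inputs[0]],
    "aggregate": total_by_country,
}

deps = {
    "scan_orders": [],
    "scan_users": [],
    "join": ["scan_orders", "scan_users"],
    "aggregate": ["join"],
}

results = run_dag(tasks, deps)
print(results["aggregate"])
```

Expressed as plain MapReduce, the join and the aggregation would typically be separate jobs with the joined rows materialized in between. Here they are two vertices in the same graph.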

Announcing Apache Hadoop 2.0.3 Release and Roadmap

Saturday, February 16th, 2013

Announcing Apache Hadoop 2.0.3 Release and Roadmap by Arun Murthy.

From the post:

As the Release Manager for hadoop-2.x, I’m very pleased to announce the next major milestone for the Apache Hadoop community, the release of hadoop-2.0.3-alpha!

2.0 Enhancements in this Alpha Release

This release delivers significant major enhancements and stability over previous releases in hadoop-2.x series. Notably, it includes:

  • QJM for HDFS HA for NameNode (HDFS-3077) and related stability fixes to HDFS HA
  • Multi-resource scheduling (CPU and memory) for YARN (YARN-2, YARN-3 & friends)
  • YARN ResourceManager Restart (YARN-230)
  • Significant stability at scale for YARN (over 30,000 nodes and 14 million applications so far, at time of release – see more details from folks at Yahoo! here)

A beta release is a couple of months off so now is your chance to review the alpha and contribute towards the beta.

Apache Hadoop YARN Meetup at Hortonworks

Thursday, October 18th, 2012

Apache Hadoop YARN Meetup at Hortonworks – Recap by Vinod Kumar Vavilapalli.

Just in case you missed the Apache Hadoop YARN meetup, summaries and slides are available for:

  • Chris Riccomini on “Building Applications on YARN”
  • YARN API Discussion
  • Efforts Underway


Apache Hadoop 2.0.2-alpha Released!

Wednesday, October 17th, 2012

Apache Hadoop 2.0.2-alpha Released! by Arun Murthy.

From the post:

It gives me great pleasure to announce that the Apache Hadoop community has voted to release Apache Hadoop 2.0.2-alpha.

This is the second (alpha) release of the next generation release of Apache Hadoop 2.x and comes with significant enhancements to both the major components of Hadoop:

  • HDFS HA has undergone significant enhancements since the previous release for NameNode High Availability
  • YARN has undergone significant testing and stabilization and validation as is been heavily battle-tested since the previous release.

These are exciting times indeed for the Apache Hadoop community – personally, this is very reminiscent of the period in 2009 when we finally saw the light at the end of the tunnel during the stabilization of Apache Hadoop 1.x (then called Apache Hadoop 0.20.x). A déjà vu, if you will – albeit of the pleasant kind! Yes, we have a few miles to clock, but it feels like the hardest part is already behind us. At the time of release, YARN has already been deployed on super-sized clusters with 2,000 nodes and 3,600 nodes (totaling to nearly 6,000 nodes) at Yahoo alone*.

Exciting times indeed!

Not unlike a star ship fast enough for time dilation to kick in.


But which way do you go first?

Hadoop 2.0 offers more efficient crunching of data. But efficient crunching of data is a means, not an end.

Which way will you go with Hadoop 2.0?

What questions will you ask that you can’t ask now?

How will you evaluate the answers?

Ready to Contribute to Apache Hadoop 2.0?

Tuesday, October 16th, 2012

User feedback is a contribution to a software project.

Software can only mature with feedback, your feedback.

Otherwise the final deliverable has a “works on my machine” outcome.

Don’t let Apache Hadoop 2.0 have a “works on my machine” outcome.

Download the preview and contribute your experiences back to the community.

We will all be glad you did!


Hortonworks Data Platform 2.0 Alpha is Now Available for Preview! by Jeff Sposetti.

From the post:

We are very excited to announce the Alpha release of the Hortonworks Data Platform 2.0 (HDP 2.0 Alpha).

HDP 2.0 Alpha is built around Apache Hadoop 2.0, which improves availability of HDFS with High Availability for the NameNode along with several performance and reliability enhancements. Apache Hadoop 2.0 also significantly advances data processing in the Hadoop ecosystem with the introduction of YARN, a generic resource-management and application framework to support MapReduce and other paradigms such as real-time processing and graph processing.

In addition to Apache Hadoop 2.0, this release includes the essential Hadoop ecosystem projects such as Apache HBase, Apache Pig, Apache Hive, Apache HCatalog, Apache ZooKeeper and Apache Oozie to provide a fully integrated and verified Apache Hadoop 2.0 stack

Apache Hadoop 2.0 is well on the path to General Availability, and is already deployed at scale in several organizations; but it won’t get to the current maturity levels of the Hadoop 1.0 stack (available in Hortonworks Data Platform 1.x) without feedback and contributions from the community.

Hortonworks strongly believes that for open source technologies to mature and become widely adopted in the enterprise, you must balance innovation with stability. With HDP 2.0 Alpha, Hortonworks provides organizations an easy way to evaluate and gain experience with the Apache Hadoop 2.0 technology stack, and it presents the perfect opportunity to help bring stability to the platform and influence the future of the technology.