Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

June 26, 2012

Hadoop Beyond MapReduce, Part 1: Introducing Kitten

Filed under: Hadoop,MapReduce — Patrick Durusau @ 7:01 pm

Hadoop Beyond MapReduce, Part 1: Introducing Kitten by Josh Wills

From the post:

This week, a team of researchers at Google will be presenting a paper describing a system they developed that can learn to identify objects, including the faces of humans and cats, from an extremely large corpus of unlabeled training data. It is a remarkable accomplishment, both in terms of the system’s performance (a 70% improvement over the prior state-of-the-art) and its scale: the system runs on over 16,000 CPU cores and was trained on 10 million 200×200 pixel images extracted from YouTube videos.

Doug Cutting has described Apache Hadoop as “the kernel of a distributed operating system.” Until recently, Hadoop has been an operating system that was optimized for running a certain class of applications: the ones that could be structured as a short sequence of MapReduce jobs. Although MapReduce is the workhorse programming framework for distributed data processing, there are many difficult and interesting problems– including combinatorial optimization problems, large-scale graph computations, and machine learning models that identify pictures of cats– that can benefit from a more flexible execution environment.

Hadoop 0.23 introduced a substantial re-design of the core resource scheduling and task tracking system that will allow developers to create entirely new classes of applications for Hadoop. Cloudera’s Ahmed Radwan has written an excellent overview of the architecture of the new resource scheduling system, known as YARN. Hadoop’s open-source foundation and its broad adoption by industry, academia, and government labs means that, for the first time in history, developers can assume that a common platform for distributed computing will be available at organizations all over the world, and that there will be a market for applications that take advantage of that common platform to solve problems at scales that have never been considered before.

I suppose it would not be fair to point out that a fertile human couple could duplicate this feat without 10 million images from YouTube. 😉

And while YARN is a remarkable achievement, in the United States it isn’t possible to get federal agencies to share data, much less time on computing platforms. Developers may be able to presume a common platform, but access, well, that may be a more difficult issue.
