September 19, 2012
Analyzing Twitter Data with Hadoop by Jon Natkins
From the post:
Social media has gained immense popularity with marketing teams, and Twitter is an effective tool for a company to get people excited about its products. Twitter makes it easy to engage users and communicate directly with them, and in turn, users can provide word-of-mouth marketing for companies by discussing the products. Given limited resources, and knowing we may not be able to talk to everyone we want to target directly, marketing departments can be more efficient by being selective about whom we reach out to.
In this post, we’ll learn how we can use Apache Flume, Apache HDFS, Apache Oozie, and Apache Hive to design an end-to-end data pipeline that will enable us to analyze Twitter data. This will be the first post in a series. The posts to follow will describe, in more depth, how each component is involved and how the custom code operates. All the code and instructions necessary to reproduce this pipeline are available on the Cloudera Github.
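Once Flume has landed the tweets in HDFS and Hive has a table over them, the “analysis” can be as small as a single query. Here is a hedged sketch of running such a query from Java over Hive’s JDBC driver; the table and column names are placeholders of my own, not the schema the Cloudera tutorial actually defines:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TopTweeters {
  public static void main(String[] args) throws Exception {
    // HiveServer1-era JDBC driver; adjust host/port for your cluster.
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection conn =
        DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
    Statement stmt = conn.createStatement();
    // Table and column names are placeholders for whatever your Hive DDL defines.
    ResultSet rs = stmt.executeQuery(
        "SELECT user_screen_name, COUNT(1) AS cnt FROM tweets "
            + "GROUP BY user_screen_name ORDER BY cnt DESC LIMIT 10");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
    }
    conn.close();
  }
}
```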
Looking forward to more posts in this series!
Social media is a focus for marketing teams for obvious reasons.
Analysis of snaps, crackles and pops en masse.
What if you wanted to communicate securely with others using social media?
Thinking of something more robust and larger than two (or three) lovers agreeing on code words.
How would you hide in a public data stream?
Or the converse, how would you hunt for someone in a public data stream?
How would you use topic maps to manage the semantic side of such a process?
September 15, 2012
MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail! by Jimmy Lin.
Abstract:
Hadoop is currently the large-scale data analysis “hammer” of choice, but there exist classes of algorithms that aren’t “nails”, in the sense that they are not particularly amenable to the MapReduce programming model. To address this, researchers have proposed MapReduce extensions or alternative programming models in which these algorithms can be elegantly expressed. This essay espouses a very different position: that MapReduce is “good enough”, and that instead of trying to invent screwdrivers, we should simply get rid of everything that’s not a nail. To be more specific, much discussion in the literature surrounds the fact that iterative algorithms are a poor fit for MapReduce: the simple solution is to find alternative non-iterative algorithms that solve the same problem. This essay captures my personal experiences as an academic researcher as well as a software engineer in a “real-world” production analytics environment. From this combined perspective I reflect on the current state and future of “big data” research.
Following the abstract:
Author’s note: I wrote this essay specifically to be controversial. The views expressed herein are more extreme than what I believe personally, written primarily for the purposes of provoking discussion. If after reading this essay you have a strong reaction, then I’ve accomplished my goal 🙂
The author needs to work on being “controversial.” He gives away the pose “throw away everything not a nail” far too early and easily.
Without the warnings, flashing lights, etc, the hyperbole might be missed, but not by anyone who would benefit from the substance of the paper.
The paper reflects careful thought on MapReduce and its limitations. Merits a careful and close reading.
I first saw this mentioned by John D. Cook.
September 12, 2012
Cloudera Enterprise in Less Than Two Minutes by Justin Kestelyn.
I had to pause “Born Under A Bad Sign” by Cream to watch the video but it was worth it!
Good example of selling technique too!
Focused on common use cases and user concerns. Promises a solution without all the troublesome details.
Time enough for that after a prospect is interested. And even then, ease them into the details.
Welcome Hortonworks Data Platform 1.1 by Jim Walker.
From the post:
Hortonworks Data Platform 1.1 Brings Expanded High Availability and Streaming Data Capture, Easier Integration with Existing Tools to Improve Enterprise Reliability and Performance of Apache Hadoop
It is exactly three months to the day that Hortonworks Data Platform version 1.0 was announced. A lot has happened since that day…
- Our distribution has been downloaded by thousands and is delivering big value to organizations throughout the world,
- Hadoop Summit gathered over 2200 Hadoop enthusiasts into the San Jose Convention Center,
- And, our Hortonworks team grew by leaps and bounds!
In these same three months our growing team of committers, engineers, testers and writers have been busy knocking out our next release, Hortonworks Data Platform 1.1. We are delighted to announce availability of HDP 1.1 today! With this release, we expand our high availability options with the addition of Red Hat based HA, add streaming capability with Flume, expand monitoring API enhancements and have made significant performance improvements to the core platform.
New features include high availability, capturing data streams (Flume), improved operations management and performance increases.
For the details, see the post, documentation or even download Hortonworks Data Platform 1.1 for a spin.
Unlike Odo’s Klingon days, a day with several items from Hortonworks is a good day. Enjoy!
September 9, 2012
Apache Hadoop YARN – Concepts and Applications by Jim Walker.
From the post:
In our previous post we provided an overview and an outline of the motivation behind Apache Hadoop YARN, the latest Apache Hadoop subproject. In this post we cover the key YARN concepts and walk through how diverse user applications work within this new system.
I thought I had missed a post in this series and I had! 😉
Enjoy!
September 7, 2012
Images for your next Hadoop and Big Data presentation
Images that will help with your next Hadoop/Big Data presentation.
Question: What images will you use for your next topic map presentation?
Possibles:
A bit too tame for my tastes. And it doesn’t say “map” to me. You?
Hmmm, presumptuous don’t you think? Plus lacking that “map” quality as well.
It claims to be a map, of sorts. But scaring potential customers isn’t good strategy.
Will be familiar soon enough. Not sure anyone wants a reminder.
Suggestions?
September 6, 2012
Meet the Committer, Part One: Alan Gates by Kim Truong.
From the post:
Series Introduction
Hortonworks is on a mission to accelerate the development and adoption of Apache Hadoop. Through engineering open source Hadoop, our efforts with our distribution, Hortonworks Data Platform (HDP), a 100% open source data management platform, and partnerships with the likes of Microsoft, Teradata, Talend and others, we will accomplish this, one installation at a time.
What makes this mission possible is our all-star team of Hadoop committers. In this series, we’re going to profile those committers, to show you the face of Hadoop.
Alan Gates, Apache Pig and HCatalog Committer
Education is a key component of this mission. Helping companies gain a better understanding of the value of Hadoop through transparent communications of the work we’re doing is paramount. In addition to explaining core Hadoop projects (MapReduce and HDFS) we also highlight significant contributions to other ecosystem projects including Apache Ambari, Apache HCatalog, Apache Pig and Apache Zookeeper.
Alan Gates is a leader in our Hadoop education programs. That is why I’m incredibly excited to kick off the next phase of our “Future of Apache Hadoop” webinar series. We’re starting off this segment with a 4-webinar series on September 12 with “Pig out to Hadoop” with Alan Gates (twitter:@alanfgates). Alan is an original member of the engineering team that took Pig from a Yahoo! Labs research project to a successful Apache open source project. Alan is also a member of the Apache Software Foundation and a co-founder of Hortonworks.
My only complaint is that the interview is too short!
Looking forward to the Pig webinar!
September 5, 2012
OK, the real title is: Four New Installments in ‘The Future of Apache Hadoop’ Webinar Series
From the post:
During the ‘Future of Apache Hadoop’ webinar series, Hortonworks founders and core committers will discuss the future of Hadoop and related projects including Apache Pig, Apache Ambari, Apache Zookeeper and Apache Hadoop YARN.
Apache Hadoop has rapidly evolved to become the leading platform for managing, processing and analyzing big data. Consequently there is a thirst for knowledge on the future direction for Hadoop related projects. The Hortonworks webinar series will feature core committers of the Apache projects discussing the essential components required in a Hadoop Platform, current advances in Apache Hadoop, relevant use-cases and best practices on how to get started with the open source platform. Each webinar will include a live Q&A with the individuals at the center of the Apache Hadoop movement.
Coming to a computer near you:
- Pig Out on Hadoop (Alan Gates): Wednesday, September 12 at 10:00 a.m. PT / 1:00 p.m. ET
- Deployment and Management of Hadoop Clusters with Ambari (Matt Foley): Wednesday, September 26 at 10:00 a.m. PT / 1:00 p.m. ET
- Scaling Apache Zookeeper for the Next Generation of Hadoop Applications (Mahadev Konar): Wednesday, October 17 at 10:00 a.m. PT / 1:00 p.m. ET
- YARN: The Future of Data Processing with Apache Hadoop (Arun C. Murthy): Wednesday, October 31 at 10:00 a.m. PT / 1:00 p.m. ET
Registration is open so get it on your calendar!
What Do Real-Life Hadoop Workloads Look Like? by Yanpei Chen.
From the post:
Organizations in diverse industries have adopted Hadoop-based systems for large-scale data processing. As a leading force in Hadoop development with customers in half of the Fortune 50 companies, Cloudera is in a unique position to characterize and compare real-life Hadoop workloads. Such insights are essential as developers, data scientists, and decision makers reflect on current use cases to anticipate technology trends.
Recently we collaborated with researchers at UC Berkeley to collect and analyze a set of Hadoop traces. These traces come from Cloudera customers in e-commerce, telecommunications, media, and retail (Table 1). Here I will explain a subset of the observations, and the thoughts they triggered about challenges and opportunities in the Hadoop ecosystem, both present and in the future.
Specific (and useful) to Hadoop installations but I suspect more useful for semantic processing in general.
Questions like:
- What topics are “hot spots” of merging activity?
- Where do those topics originate?
- How do changes in merging rules impact the merging process?
are only some of the ones that may be of interest.
August 31, 2012
Developing CDH Applications with Maven and Eclipse by Jon Natkins
Learn how to configure a basic Maven project that will be able to build applications against CDH
Apache Maven is a build automation tool that can be used for Java projects. Since nearly all the Apache Hadoop ecosystem is written in Java, Maven is a great tool for managing projects that build on top of the Hadoop APIs. In this post, we’ll configure a basic Maven project that will be able to build applications against CDH (Cloudera’s Distribution Including Apache Hadoop) binaries.
Maven projects are defined using an XML file called pom.xml, which describes things like the project’s dependencies on other modules, the build order, and any other plugins that the project uses. A complete example of the pom.xml described below, which can be used with CDH, is available on Github. (To use the example, you’ll need at least Maven 2.0 installed.) If you’ve never set up a Maven project before, you can get a jumpstart by using Maven’s quickstart archetype, which generates a small initial project layout.
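The pom.xml itself is best copied from the Github example above. Just to show the sort of code such a project compiles against the CDH binaries, here is a minimal sketch that lists an HDFS directory with the stock FileSystem API (the class name is mine, and it assumes core-site.xml is on the classpath):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsDir {
  public static void main(String[] args) throws Exception {
    // Picks up fs.default.name from core-site.xml if it is on the classpath;
    // otherwise falls back to the local file system.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path dir = new Path(args.length > 0 ? args[0] : "/");
    for (FileStatus status : fs.listStatus(dir)) {
      System.out.println(status.getPath() + "\t" + status.getLen());
    }
  }
}
```

With the Hadoop dependency from the example pom.xml on the build path, mvn package gives you a jar you can run with the hadoop jar command.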
I don’t have a Fairness Doctrine, but since I had a post on make today, I thought one on Maven would not be out of place.
Both of them are likely to figure in active topic map/semantic application work.
BTW, since both “make” and “maven” have multiple meanings, how would you index this post to separate these uses from other meanings?
Would it make a difference if, as appears above, some instances are surrounded with hyperlinks?
How would I indicate that the hyperlinks are identity references?
Or some subset of hyperlinks are identity references?
August 27, 2012
HAIL – Only Aggressive Elephants are Fast Elephants
From the post:
Typically we store data based on any one of the different physical layouts (such as row, column, vertical, PAX etc). And this choice determines its suitability for a certain kind of workload while making it less optimal for other kinds of workloads. Can we store data under different layouts at the same time? Especially within a HDFS environment where each block is replicated a few times. This is the big idea that HAIL (Hadoop Aggressive Indexing Library) pursues.
At a very high level it looks like to understand the working of HAIL we will have to look at the three distinct workflows the system is organized around namely –
- The data/file upload pipeline
- The indexing pipeline
- The query pipeline
Every unit of information makes its journey through these three pipelines.
Be sure to see the original paper.
How much of what we “know” about modeling is driven by the needs of ancestral storage layouts?
Given the performance of modern chips, are those “needs” still valid considerations?
Or perhaps better, at what data store size or processing requirement do the needs of physical storage models re-assert themselves?
Not just a performance question but also one of uniformity of identification.
What was once a “performance” requirement, that data have some common system of identification, may no longer be the case.
Hadoop on your PC: Cloudera’s CDH4 virtual machine by Andrew Brust.
From the post:
Want to learn Hadoop without building your own cluster or paying for cloud resources? Then download Cloudera’s Hadoop distro and run it in a virtual machine on your PC. I’ll show you how.
As good a way as any to get your feet wet with Hadoop.
Don’t be surprised if in a week or two you have both the nucleus of a cluster and a cloud account. Reasoning that you need to be prepared for any client’s environment of choice.
Or at least that is what you will tell your significant other.
Pig as Hadoop Connector, Part Two: HBase, JRuby and Sinatra by Russell Jurney.
From the post:
Hadoop is about freedom as much as scale: providing you disk spindles and processor cores together to process your data with whatever tool you choose. Unleash your creativity. Pig as duct tape facilitates this freedom, enabling you to connect distributed systems at scale in minutes, not hours. In this post we’ll demonstrate how you can turn raw data into a web service using Hadoop, Pig, HBase, JRuby and Sinatra. In doing so we will demonstrate yet another way to use Pig as connector to publish data you’ve processed on Hadoop.
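The post serves its HBase rows through JRuby and Sinatra. Purely as a hedged illustration for Java-only readers, the same single-row lookup with the stock HBase client looks like this; the table, column family, and qualifier names are placeholders rather than what the post’s HBaseStorage step actually writes:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class EmailLookup {
  public static void main(String[] args) throws Exception {
    // Reads hbase-site.xml (ZooKeeper quorum, etc.) from the classpath.
    Configuration conf = HBaseConfiguration.create();
    // Table, column family and qualifier are placeholders for whatever your
    // Pig STORE ... USING HBaseStorage(...) step actually wrote.
    HTable table = new HTable(conf, "enron");
    Get get = new Get(Bytes.toBytes(args[0])); // row key, e.g. a message id
    Result result = table.get(get);
    byte[] body = result.getValue(Bytes.toBytes("email"), Bytes.toBytes("body"));
    System.out.println(body == null ? "(not found)" : Bytes.toString(body));
    table.close();
  }
}
```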
When (not if) the next big cache of emails or other “sensitive” documents drops, everyone who has followed this and similar tutorials should be ready.
August 24, 2012
Big Data on Heroku – Hadoop from Treasure Data by Istvan Szegedi.
From the post:
This time I write about Heroku and Treasure Data Hadoop solution – I found it really to be a ‘gem’ in the Big Data world.
Heroku is a cloud platform as a service (PaaS) owned by Salesforce.com. Originally it started with supporting Ruby as its main programming language but it has been extended to Java, Scala, Node.js, Python and Clojure, too. It also supports a long list of addons including – among others – RDBMS and NoSQL capabilities and Hadoop-based data warehouse developed by Treasure Data.
Not to leave the impression that your only cloud option is AWS.
I don’t know of any comparisons of cloud services/storage plus cost on an apples to apples basis.
Do you?
Process a Million Songs with Apache Pig by Justin Kestelyn.
From the post:
The following is a guest post kindly offered by Adam Kawa, a 26-year old Hadoop developer from Warsaw, Poland. This post was originally published in a slightly different form at his blog, Hakuna MapData!
Recently I have found an interesting dataset, called Million Song Dataset (MSD), which contains detailed acoustic and contextual data about a million songs. For each song we can find information like title, hotness, tempo, duration, danceability, and loudness as well as artist name, popularity, localization (latitude and longitude pair), and many other things. There are no music files included here, but the links to MP3 song previews at 7digital.com can be easily constructed from the data.
The dataset consists of 339 tab-separated text files. Each file contains about 3,000 songs and each song is represented as one separate line of text. The dataset is publicly available and you can find it at Infochimps or Amazon S3. Since the total size of this data sums up to around 218GB, processing it using one machine may take a very long time.
Definitely, a much more interesting and efficient approach is to use multiple machines and process the songs in parallel by taking advantage of open-source tools from the Apache Hadoop ecosystem (e.g. Apache Pig). If you have your own machines, you can simply use CDH (Cloudera’s Distribution including Apache Hadoop), which includes the complete Apache Hadoop stack. CDH can be installed manually (quickly and easily by typing a couple of simple commands) or automatically using Cloudera Manager Free Edition (which is Cloudera’s recommended approach). Both CDH and Cloudera Manager are freely downloadable here. Alternatively, you may rent some machines from Amazon with Hadoop already installed and process the data using Amazon’s Elastic MapReduce (here is a cool description written by Paul Lemere on how to use it and pay as little as $1, and here is my presentation about Elastic MapReduce given at the second meeting of the Warsaw Hadoop User Group).
An example of offering the reader their choice of implementation detail, on or off a cloud. 😉
Suspect that is going to become increasingly common.
August 21, 2012
Apache to Drill for big data in Hadoop
From the post:
A new Apache Incubator proposal should see the Drill project offering a new open source way to interactively analyse large scale datasets on distributed systems. Drill is inspired by Google’s Dremel but is designed to be more flexible in terms of supported query languages. Dremel has been in use by Google since 2006 and is now the engine that powers Google’s BigQuery analytics.
The project is being led at Apache by developers from MapR where the early Drill development was being done. Also contributing are Drawn To Scale and Concurrent. Requirements and design documentation will be contributed to the project by MapR. Hadoop is good for batch queries, but by allowing quicker queries of huge data sets, those data sets can be better explored. The Drill technology, like the Google Dremel technology, does not replace MapReduce or Hadoop systems. It works alongside them, offering a system which can analyse the output of the batch processing system and its pipelines, or be used to rapidly prototype larger scale computations.
Drill is comprised of a query language layer with parser and execution planner, a low latency execution engine for executing the plan, nested data formats for data storage and a scalable data source layer. The query language layer will focus on Drill’s own query language, DrQL, and the data source layer will initially use Hadoop as its source. The project overall will closely integrate with Hadoop, storing its data in Hadoop and supporting the Hadoop FileSystem and HBase and supporting Hadoop data formats. Apache’s Hive project is also being considered as the basis for the DrQL.
The developers hope that by developing in the open at Apache, they will be able to create and establish Drill’s own APIs and ensure a robust, flexible architecture which will support a broad range of data sources, formats and query languages. The project has been accepted into the incubator and so far has an empty subversion repository.
Q: Is anyone working on/maintaining a map between the various Hadoop related query languages?
Getting Started with R and Hadoop by David Smith.
From the post:
Last week's meeting of the Chicago area Hadoop User Group (a joint meeting with the Chicago R User Group, and sponsored by Revolution Analytics) focused on crunching Big Data with R and Hadoop. Jeffrey Breen, president of Atmosphere Research Group, frequently deals with large data sets in his airline consulting work, and R is his "go-to tool for anything data-related". His presentation, "Getting Started with R and Hadoop", focuses on the RHadoop suite of packages, and especially the rmr package to interface R and Hadoop. He lists four advantages of using rmr for big-data analytics with R and Hadoop:
- Well-designed API: code only needs to deal with basic R objects
- Very flexible I/O subsystem: handles common formats like CSV, and also allows complex line-by-line parsing
- Map-Reduce jobs can easily be daisy-chained to build complex workflows
- Concise code compared to other ways of interfacing R and Hadoop (the chart below compares the number of lines of code required to implement a map-reduce analysis using different systems)
Slides, detailed examples, presentation, pointers to other resources.
Other than processing your data set, doesn’t look like it leaves much out. 😉
Ironic that we talk about “big data” sets when the Concept Annotation in the CRAFT corpus took two and one-half years (that’s 30 months for you mythic developer types) to tag ninety-seven (97) medical articles.
That’s an average of a little over three (3) articles per month.
And I am sure the project leads would concede that more could be done.
Maybe “big” data should include some notion of “complex” data?
UC Irvine Medical Center: Improving Quality of Care with Apache Hadoop by Charles Boicey.
From the post:
With a single observation in early 2011, the Hadoop strategy at UC Irvine Medical Center started. While using Twitter, Facebook, LinkedIn and Yahoo we came to the conclusion that healthcare data although domain specific is structurally not much different than a tweet, Facebook posting or LinkedIn profile and that the environment powering these applications should be able to do the same with healthcare data.
In healthcare, data shares many of the same qualities as that found in the large web properties. Each has a seemingly infinite volume of data to ingest and it is all types and formats across structured, unstructured, video and audio. We also noticed that the near zero latency with which data was not only ingested but also rendered back to users was important. Intelligence was also apparent in that algorithms were employed to make suggestions such as people you may know.
We started to draw parallels to the challenges we were having with the typical characteristics of Big Data: volume, velocity and variety.
…
The start of a series on Hadoop in health care.
I am more interested in the variety question than volume or velocity but for practical applications, all three are necessary considerations.
From further within the post:
We saw this project as vehicle for demonstrating the value of Applied Clinical Informatics and promoting the translational effects of rapidly moving from “code side to bedside”. (emphasis added)
Just so you know to add the string “Applied Clinical Informatics” to your literature searches in this area.
The wheel will be re-invented often enough without your help.
August 19, 2012
Analogies – Romans, Train, Dabbawalla, Oil Refinery, Laundry – and Hadoop
Scroll down for the laundry analogy. It’s at least as accurate as, and more entertaining than, the IBM analogy. 😉
I encountered this at Hadoopsphere.com.
August 17, 2012
Marching Hadoop to Windows
From the post:
Bringing Hadoop to Windows and the two-year development of Hadoop 2.0 are two of the more exciting developments brought up by Hortonworks’s Cofounder and CTO, Eric Baldeschwieler, in a talk before a panel at the Cloud 2012 Conference in Honolulu.
(video omitted)
The panel, which was also attended by Baldeschwieler’s Cloudera counterpart Amr Awadallah, focused on insights into the big data world, a subject Baldeschwieler tackled almost entirely with Hadoop. The eighteen-minute discussion also featured a brief history of Hadoop’s rise to prominence, improvements to be made to Hadoop, and a few tips to enterprising researchers wishing to contribute to Hadoop.
“Bringing Hadoop to Windows,” says Baldeschwieler “turns out to be a very exciting initiative because there are a huge number of users in Windows operating system.” In particular, the Excel spreadsheet program is a popular one for business analysts, something analysts would like to see integrated with Hadoop’s database. That will not be possible until, as Baldeschwieler notes, Windows is integrated into Hadoop later this year, a move that will also considerably expand Hadoop’s reach.
However, that announcement pales in comparison to the possibilities provided by the impending Hadoop 2.0. “Hadoop 2.0 is a pretty major re-write of Hadoop that’s been in the works for two years. It’s now in usable alpha form…The real focus in Hadoop 2.0 is scale and opening it up for more innovation.” Baldeschwieler notes that Hadoop’s rise has been result of what he calls “a happy accident” where it was being developed by his Yahoo team for a specific use case: classifying, sorting, and indexing each of the URLs that were under Yahoo’s scope.
Integration of Excel and Hadoop?
Is that going to be echoes of Unix – The Hole Hawg?
August 16, 2012
Pig as Hadoop Connector, Part One: Pig, MongoDB and Node.js by Russell Jurney.
From the post:
Series Introduction
Apache Pig is a dataflow oriented, scripting interface to Hadoop. Pig enables you to manipulate data as tuples in simple pipelines without thinking about the complexities of MapReduce.
But Pig is more than that. Pig has emerged as the ‘duct tape’ of Big Data, enabling you to send data between distributed systems in a few lines of code. In this series, we’re going to show you how to use Hadoop and Pig to connect different distributed systems, to enable you to process data from wherever and to wherever you like.
Working code for this post as well as setup instructions for the tools we use are available at https://github.com/rjurney/enron-node-mongo and you can download the Enron emails we use in the example in Avro format at http://s3.amazonaws.com/rjurney.public/enron.avro. You can run our example Pig scripts in local mode (without Hadoop) with the -x local flag: pig -x local. This enables new Hadoop users to try out Pig without a Hadoop cluster.
Introduction
In this post we’ll be using Hadoop, Pig, mongo-hadoop, MongoDB and Node.js to turn Avro records into a web service. We do so to illustrate Pig’s ability to act as glue between distributed systems, and to show how easy it is to publish data from Hadoop to the web.
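Before wiring up the whole chain, it can help to peek at the Avro input by itself. A small sketch using Avro’s generic API, assuming nothing about the schema beyond what enron.avro itself declares:

```java
import java.io.File;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class PeekAvro {
  public static void main(String[] args) throws Exception {
    File input = new File(args.length > 0 ? args[0] : "enron.avro");
    DataFileReader<GenericRecord> reader =
        new DataFileReader<GenericRecord>(input, new GenericDatumReader<GenericRecord>());
    // The schema travels with the file, so nothing needs to be hard-coded here.
    System.out.println("Schema: " + reader.getSchema());
    int shown = 0;
    while (reader.hasNext() && shown++ < 5) {
      System.out.println(reader.next()); // GenericRecord renders as JSON-like text
    }
    reader.close();
  }
}
```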
I was tempted to add ‘duct tape’ as a category. But there could only be one entry. 😉
Take an early weekend and have some fun with this tomorrow. August will be over sooner than you think.
HBase Replication: Operational Overview by Himanshu Vashishtha
From the post:
This is the second blogpost about HBase replication. The previous blogpost, HBase Replication Overview, discussed use cases, architecture and different modes supported in HBase replication. This blogpost is from an operational perspective and will touch upon HBase replication configuration, and key concepts for using it — such as bootstrapping, schema change, and fault tolerance.
The sort of post that makes you long for one or more mini-clusters. 😉
August 13, 2012
CDH3 update 5 is now available by Arvind Prabhakar
From the post:
We are happy to announce the general availability of CDH3 update 5. This update is a maintenance release of CDH3 platform and provides a considerable amount of bug-fixes and stability enhancements. Alongside these fixes, we have also included a few new features, most notable of which are the following:
- Flume 1.2.0 – Provides a durable file channel and many more features over the previous release.
- Hive AvroSerDe – Replaces the Haivvreo SerDe and provides robust support for Avro data format.
- WebHDFS – A full read/write REST API to HDFS.
Maintenance release. Installation is good practice before major releases.
August 9, 2012
Groundhog: Hadoop Fork Testing by Anupam Seth.
From the post:
Hadoop is widely used at Yahoo! to do all kinds of processing. It is used for everything from counting ad clicks to optimizing what is shown on the front page for each individual user. Deploying a major release of Hadoop to all 40,000+ nodes at Yahoo! is a long and painful process that impacts all users of Hadoop. It involves doing a staged rollout onto different clusters of increasing importance (e.g. QA, sandbox, research, production) and asking all teams that use Hadoop to verify that their applications work with this new version. This is to harden the new release before it is deployed on clusters that directly impact revenue, but it comes at the expense of the users of these clusters because they have to share the pain of stabilizing a newer version. Further, this process can take over 6 months. Waiting 6 months to get a new feature, which users have asked for, onto a production system is way too long. It stifles innovation both for Hadoop and for the code running on Hadoop. Other software systems avoid these problems by more closely following continuous integration techniques.
Groundhog is an automated testing tool to help ensure backwards compatibility (in terms of API, functionality, and performance) between releases of Hadoop before deploying a new release onto clusters with a high QoS. Groundhog does this by providing an automated mechanism to capture user jobs (currently limited to pig scripts) as they are run on a cluster and then replay them on a different cluster with a different version of Hadoop to verify that they still produce the same results. The test cluster can take inevitable downtime and still help ensure that the latest version of Hadoop has not introduced any new regressions. It is called groundhog because that way Hadoop can relive a pig script over and over again until it gets it right, like the movie Groundhog Day. There is similarity in concept to traditional fork/T testing in that jobs are duplicated and run in another location. However, Hadoop fork testing differs in that the testing will not occur in real-time but instead the original job with all needed inputs and outputs will be captured and archived. Then at any later date, the archived job can be re-run.
The main idea is to reduce the deployment cycle of a new Hadoop release by making it easier to get user oriented testing started sooner and at a larger scope. Specifically, get testing running to quickly discover regressions and backwards incompatibility issues. Past efforts to bring up a test cluster and have Hadoop users run their jobs on the test cluster have been less successful than desired. Therefore, fork testing is a method for reducing the human effort needed to get user oriented testing run against a Hadoop cluster. Additionally, if the level of effort to capture and run tests is reduced, then testing can be performed more often and experiments can also be run. All of this must happen while following data governance policies though.
Thus, fork testing is a form of end to end testing. If there were a complete suite of end to end tests for Hadoop, the need for fork testing might not exist. Alas, the end to end suite does not exist and creating fork testing is deemed a faster path to achieving the testing goal.
Groundhog currently is limited to work only with pig jobs. The majority of user jobs run on Hadoop at Yahoo! are written in pig. This is what allows Groundhog to nevertheless have a good sampling of production jobs.
This is way cool!
Discovering problems, even errors, before they show up in live installations is always a good thing.
When you make changes to merging rules, how do you test the impact on your topic maps?
I first saw this at: Alex Popescu’s myNoSQL under Groundhog: Hadoop Automated Testing at Yahoo!
Apache Hadoop YARN – Background and an Overview by Arun Murthy.
From the post:
MapReduce – The Paradigm
Essentially, the MapReduce model consists of a first, embarrassingly parallel, map phase where input data is split into discrete chunks to be processed. It is followed by the second and final reduce phase where the output of the map phase is aggregated to produce the desired result. The simple, and fairly restricted, nature of the programming model lends itself to very efficient and extremely large-scale implementations across thousands of cheap, commodity nodes.
Apache Hadoop MapReduce is the most popular open-source implementation of the MapReduce model.
In particular, when MapReduce is paired with a distributed file-system such as Apache Hadoop HDFS, which can provide very high aggregate I/O bandwidth across a large cluster, the economics of the system are extremely compelling – a key factor in the popularity of Hadoop.
One of the keys to this is the lack of data motion, i.e. move compute to data and do not move data to the compute node via the network. Specifically, the MapReduce tasks can be scheduled on the same physical nodes on which data is resident in HDFS, which exposes the underlying storage layout across the cluster. This significantly reduces the network I/O patterns and allows for the majority of the I/O on the local disk or within the same rack – a core advantage.
An introduction to the architecture of Apache Hadoop YARN that starts with its roots in MapReduce.
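Since YARN generalizes the model sketched above, it is worth keeping the shape of a classic MapReduce job in mind. As a refresher only, here is a minimal word count against the org.apache.hadoop.mapreduce API (pre-YARN style, since that is what most readers will have installed):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: split each line into tokens and emit (token, 1).
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce phase: sum the counts for each token.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "wordcount");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```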
August 7, 2012
HttpFS for CDH3 – The Hadoop FileSystem over HTTP by Alejandro Abdelnur
From the post:
HttpFS is an HTTP gateway/proxy for Hadoop FileSystem implementations. HttpFS comes with CDH4 and replaces HdfsProxy (which only provided read access). Its REST API is compatible with WebHDFS (which is included in CDH4 and the upcoming CDH3u5).
HttpFS is a proxy so, unlike WebHDFS, it does not require clients be able to access every machine in the cluster. This allows clients to access a cluster that is behind a firewall via the WebHDFS REST API. HttpFS also allows clients to access CDH3u4 clusters via the WebHDFS REST API.
Given the constant interest we’ve seen by CDH3 users in Hoop, we have backported Apache Hadoop HttpFS to work with CDH3.
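Because the API is WebHDFS-compatible, a client needs no Hadoop jars at all. A hedged sketch of a directory listing over plain HTTP; the host, path, and user.name values are placeholders (14000 is the HttpFS default port):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class HttpFsList {
  public static void main(String[] args) throws Exception {
    String path = args.length > 0 ? args[0] : "/";
    String user = args.length > 1 ? args[1] : "hdfs";
    // WebHDFS-style REST call; swap the host/port for your HttpFS gateway.
    URL url = new URL("http://httpfs-host.example.com:14000/webhdfs/v1"
        + path + "?op=LISTSTATUS&user.name=" + user);
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
    String line;
    while ((line = in.readLine()) != null) {
      System.out.println(line); // JSON FileStatuses payload
    }
    in.close();
    conn.disconnect();
  }
}
```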
Another step in the evolution of Hadoop.
And another name change. (Hoop to HttpFS)
Not that changing names would confuse any techie types. Or their search engines. 😉
I wonder if Hadoop is a long tail community? Thoughts?
August 6, 2012
Twitter’s Scalding – Scala and Hadoop hand in hand by Istvan Szegedi.
From the post:
If you have read the paper published by Google’s Jeffrey Dean and Sanjay Ghemawat (MapReduce: Simplified Data Processing on Large Clusters), they revealed that their work was inspired by the concept of functional languages: “Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages….Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.”
Given the fact that Scala is a programming language that combines object-oriented and functional programming and runs on the JVM, it is a fairly natural evolution to introduce Scala in a Hadoop environment. That is what Twitter engineers did. (See more on how Scala is used at Twitter: “Twitter on Scala” and “The Why and How of Scala at Twitter”.) Scala has powerful support for mapping, filtering, and pattern matching (regular expressions), so it is a pretty good fit for MapReduce jobs.
Another guide to Scalding.
August 5, 2012
More Fun with Hadoop In Action Exercises (Pig and Hive) by Sujit Pal.
From the post:
In my last post, I described a few Java-based Hadoop Map-Reduce solutions from the Hadoop in Action (HIA) book. According to the Hadoop Fundamentals I course from Big Data University, part of being a Hadoop practitioner also includes knowing about the many tools that are part of the Hadoop ecosystem. The course briefly touches on the following four tools – Pig, Hive, Jaql and Flume.
Of these, I decided to focus (at least for the time being) on Pig and Hive (for the somewhat stupid reason that the HIA book covers these too). Both of these are high-level DSLs that produce sequences of Map-Reduce jobs. Pig provides a data flow language called PigLatin, and Hive provides a SQL-like language called HiveQL. Both tools provide a REPL shell, and both can be extended with UDFs (User Defined Functions). The reason they coexist in spite of so much overlap is that they are aimed at different users – Pig appears to be aimed at the programmer types and Hive at the analyst types.
The appeal of both Pig and Hive lies in the productivity gains – writing Map-Reduce jobs by hand gives you control, but it takes time to write. Once you master Pig and/or Hive, it is much faster to generate sequences of Map-Reduce jobs. In this post, I will describe three use cases (the first of which comes from the HIA book, and the other two I dreamed up).
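As a concrete example of the UDF extension point mentioned above, here is a trivial Pig EvalFunc in Java (package, class, and jar names are mine, not from the HIA book):

```java
package myudfs;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// A deliberately tiny Pig UDF: upper-cases its single chararray argument.
public class ToUpper extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return null;
    }
    return input.get(0).toString().toUpperCase();
  }
}
```

After REGISTER myudfs.jar; in a Pig script, the function can be called as myudfs.ToUpper(some_field). Hive UDFs follow a similar pattern via its UDF and GenericUDF base classes.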
More useful Hadoop exercise examples.
August 4, 2012
Fun With Hadoop In Action Exercises (Java) by Sujit Pal.
From the post:
As some of you know, I recently took some online courses from Coursera. Having taken these courses, I have come to the realization that my knowledge has some rather large blind spots. So far, I have gotten most of my education from books and websites, and I have tended to cherry pick subjects which I need at the moment for my work, as a result of which I tend to ignore stuff (techniques, algorithms, etc) that fall outside that realm. Obviously, this is Not A Good Thing™, so I have begun to seek ways to remedy that.
I first looked at Hadoop years ago, but never got much beyond creating proof of concept Map-Reduce programs (Java and Streaming/Python) for text mining applications. Lately, many subprojects (Pig, Hive, etc) have come up in order to make it easier to deal with large amounts of data using Hadoop, about which I know nothing. So in an attempt to ramp up relatively quickly, I decided to take some courses at BigData University.
The course uses BigInsights (IBM’s packaging of Hadoop), which runs only on Linux. VMWare images are available, but since I have a Macbook Pro, that wasn’t much use to me without a VMWare player (not free for Mac OSX). I then installed VirtualBox and tried to run a Fedora 10 64-bit image on it, and install BigInsights on Fedora, but it failed. I then tried to install Cloudera CDH4 (Cloudera’s packaging of Hadoop) on it (it’s a series of yum commands), but that did not work out either. Ultimately I decided to ditch VirtualBox altogether and do a pseudo-distributed installation of the stock Apache Hadoop (1.0.3) directly on my Mac following instructions on Michael Noll’s page.
The Hadoop Fundamentals I course which I was taking covers quite a few things, but I decided to stop and actually read all of Hadoop in Action (HIA) in order to get a more thorough coverage. I had purchased it some years before as part of Manning’s MEAP (Early Access) program, so it’s a bit dated (examples are mostly in the older 0.19 API), but it’s the only Hadoop book I possess, and the concepts are explained beautifully, and it’s not a huge leap to mentally translate code from the old API to the new, so it was well worth the read.
I also decided to tackle the exercises (in Java for now) and post my solutions on GitHub. Three reasons. First, it exposes me to a more comprehensive set of scenarios than I have had previously, and forces me to use techniques and algorithms that I wont otherwise. Second, hopefully some of my readers can walk circles around me where Hadoop is concerned, and they would be kind enough to provide criticism and suggestions for improvement. And third, there may be some who would benefit from having the HIA examples worked out. So anyway, here they are, my solutions to selected exercises from Chapters 4 and 5 of the HIA book for your reading pleasure.
Much good content follows!
This will be useful to a large number of people.
As well as setting a good example.
August 3, 2012
Introducing Apache Hadoop YARN by Arun Murthy.
From the post:
I’m thrilled to announce that the Apache Hadoop community has decided to promote the next-generation Hadoop data-processing framework, i.e. YARN, to be a sub-project of Apache Hadoop in the ASF!
Apache Hadoop YARN joins Hadoop Common (core libraries), Hadoop HDFS (storage) and Hadoop MapReduce (the MapReduce implementation) as the sub-projects of Apache Hadoop which, itself, is a Top Level Project in the Apache Software Foundation. Until this milestone, YARN was a part of the Hadoop MapReduce project and now is poised to stand up on its own as a sub-project of Hadoop.
In a nutshell, Hadoop YARN is an attempt to take Apache Hadoop beyond MapReduce for data-processing.
As folks are aware, Hadoop HDFS is the data storage layer for Hadoop and MapReduce was the data-processing layer. However, the MapReduce algorithm, by itself, isn’t sufficient for the very wide variety of use-cases we see Hadoop being employed to solve. With YARN, Hadoop now has a generic resource-management and distributed application framework, whereby one can implement multiple data processing applications customized for the task at hand. Hadoop MapReduce is now one such application for YARN and I see several others given my vantage point – in the future you will see MPI, graph-processing, simple services etc.; all co-existing with MapReduce applications in a Hadoop YARN cluster.
Considering the explosive growth of Hadoop, what new data processing applications do you see emerging first in YARN?