Another Word For It
Patrick Durusau on Topic Maps and Semantic Diversity

October 21, 2012

Full stack HA in Hadoop 1: HBase’s Resilience to Namenode Failover

Filed under: Hadoop,HBase,Systems Administration — Patrick Durusau @ 2:22 pm

Full stack HA in Hadoop 1: HBase’s Resilience to Namenode Failover by Devaraj Das.

From the post:

In this blog, I’ll cover how we tested Full Stack HA with NameNode HA in Hadoop 1, with Hadoop and HBase as components of the stack.

Yes, NameNode HA is finally available in the Hadoop 1 line. The test was done with Hadoop branch-1 and HBase-0.92.x on a cluster of roughly ten nodes. The aim was to try to keep a really busy HBase cluster up in the face of the cluster’s NameNode repeatedly going up and down. Note that HBase would remain functional during the time the NameNode was down. Only those operations that require a trip to the NameNode (for example, rolling of the WAL, or compaction, or flush) would be affected, and those would affect only the relevant end users (a user using the HBase get API may not be affected if that get didn’t require a new file open, for example).

A non-reliable cluster is just that, a non-reliable cluster. Not as bad as a backup that may or may not restore your data, but almost.

Regularly and routinely test any alleged HA capability along with backup restore capability. Document that testing.

As opposed to “testing” when either has to work or critical operations will fail or critical data will be lost.*

*Not Miller time but résumé time.

October 19, 2012

What’s New in CDH4.1 Hue

Filed under: Hadoop,Hue — Patrick Durusau @ 3:36 pm

What’s New in CDH4.1 Hue by Romain Rigaux.

From the post:

Hue is a Web-based interface that makes it easier to use Apache Hadoop. Hue 2.1 (included in CDH4.1) provides a new application on top of Apache Oozie (a workflow scheduler system for Apache Hadoop) for creating workflows and scheduling them repetitively. For example, Hue makes it easy to group a set of MapReduce jobs and Hive scripts and run them every day of the week.

In this post, we’re going to focus on the Workflow component of the new application.

“[E]very day of the week” includes the weekend.

That got your attention?

Let Hue manage the workflow and you enjoy the weekend.

Situational Aware Mappers with JAQL

Filed under: Hadoop,JAQL,MapReduce — Patrick Durusau @ 3:32 pm

Situational Aware Mappers with JAQL

From the post:

Adapting MapReduce for higher performance has been one of the popular discussion topics. Let’s continue our series on Adaptive MapReduce and explore the feature available via JAQL in the IBM BigInsights commercial offering. This implementation also points to a more vital corollary: enterprise offerings of Apache Hadoop are not mere packaging and resale but have bigger research initiatives going on beneath the covers.

Two papers are explored by the post:

[1] Rares Vernica, Andrey Balmin, Kevin S. Beyer, Vuk Ercegovac: Adaptive MapReduce using situation-aware mappers. EDBT 2012: 420-431

Abstract:

We propose new adaptive runtime techniques for MapReduce that improve performance and simplify job tuning. We implement these techniques by breaking a key assumption of MapReduce that mappers run in isolation. Instead, our mappers communicate through a distributed meta-data store and are aware of the global state of the job. However, we still preserve the fault-tolerance, scalability, and programming API of MapReduce. We utilize these “situation-aware mappers” to develop a set of techniques that make MapReduce more dynamic: (a) Adaptive Mappers dynamically take multiple data partitions (splits) to amortize mapper start-up costs; (b) Adaptive Combiners improve local aggregation by maintaining a cache of partial aggregates for the frequent keys; (c) Adaptive Sampling and Partitioning sample the mapper outputs and use the obtained statistics to produce balanced partitions for the reducers. Our experimental evaluation shows that adaptive techniques provide up to 3x performance improvement, in some cases, and dramatically improve performance stability across the board.
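The “Adaptive Combiners” in (b) are close in spirit to classic in-mapper combining: keep a bounded cache of partial aggregates for frequent keys and spill it when it fills. Here is a minimal sketch of that general pattern in plain Hadoop Java. It is my illustration, not the paper’s implementation (which also coordinates through a distributed meta-data store), and the cache size is an arbitrary assumption:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch of an in-mapper "adaptive combiner": partial sums for frequent
    // keys are kept in a bounded cache and flushed when the cache fills up.
    public class CachingWordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

      private static final int MAX_CACHE_SIZE = 10000; // assumption: tune to heap size
      private final Map<String, Integer> cache = new HashMap<String, Integer>();

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        for (String word : value.toString().split("\\s+")) {
          if (word.isEmpty()) continue;
          Integer count = cache.get(word);
          cache.put(word, count == null ? 1 : count + 1);
          if (cache.size() >= MAX_CACHE_SIZE) {
            flush(context); // spill partial aggregates, keep memory bounded
          }
        }
      }

      @Override
      protected void cleanup(Context context)
          throws IOException, InterruptedException {
        flush(context); // emit whatever is left at the end of the split
      }

      private void flush(Context context) throws IOException, InterruptedException {
        for (Map.Entry<String, Integer> e : cache.entrySet()) {
          context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
        cache.clear();
      }
    }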

[2] Andrey Balmin, Vuk Ercegovac, Rares Vernica, Kevin S. Beyer: Adaptive Processing of User-Defined Aggregates in Jaql. IEEE Data Eng. Bull. 34(4): 36-43 (2011)

Abstract:

Adaptive techniques can dramatically improve performance and simplify tuning for MapReduce jobs. However, their implementation often requires global coordination between map tasks, which breaks a key assumption of MapReduce that mappers run in isolation. We show that it is possible to preserve fault-tolerance, scalability, and ease of use of MapReduce by allowing map tasks to utilize a limited set of high-level coordination primitives. We have implemented these primitives on top of an open source distributed coordination service. We expose adaptive features in a high-level declarative query language, Jaql, by utilizing unique features of the language, such as higher-order functions and physical transparency. For instance, we observe that maintaining a small amount of global state could help improve performance for a class of aggregate functions that are able to limit the output based on a global threshold. Such algorithms arise, for example, in Top-K processing, skyline queries, and exception handling. We provide a simple API that facilitates safe and efficient development of such functions.
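To make the “global threshold” idea concrete: a mapper computing a Top-K only needs to emit records that could still beat the k-th best score seen anywhere in the job, and it learns that score through coordination primitives built on a distributed coordination service (ZooKeeper is the obvious candidate). A hedged sketch, with a hypothetical ThresholdStore interface standing in for that service:

    // Hypothetical interface over a coordination service (e.g. ZooKeeper);
    // the real Jaql implementation differs -- this only illustrates the idea.
    interface ThresholdStore {
      double currentThreshold();            // k-th best score seen globally
      void offer(double candidateScore);    // report a locally observed score
    }

    public class ThresholdedScan {

      private final ThresholdStore store;

      public ThresholdedScan(ThresholdStore store) {
        this.store = store;
      }

      // Returns true if the record is worth emitting: anything at or below the
      // global threshold can never make the final Top-K, so we drop it early
      // instead of shipping it to the reducers.
      public boolean shouldEmit(double score) {
        if (score <= store.currentThreshold()) {
          return false;
        }
        store.offer(score);  // may raise the threshold for every other mapper
        return true;
      }
    }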

The bar for excellence in the use of Hadoop keeps getting higher!

What’s New in CDH4.1 Pig

Filed under: Cloudera,Hadoop,Pig — Patrick Durusau @ 3:28 pm

What’s New in CDH4.1 Pig by Cheolsoo Park.

From the post:

Apache Pig is a platform for analyzing large data sets that provides a high-level language called Pig Latin. Pig users can write complex data analysis programs in an intuitive and compact manner using Pig Latin.

Among many other enhancements, CDH4.1, the newest release of Cloudera’s open-source Hadoop distro, upgrades Pig from version 0.9 to version 0.10. This post provides a summary of the top seven new features introduced in CDH4.1 Pig.

Cheolsoo covers these new features:

  • Boolean Data Type
  • Nested FOREACH and CROSS
  • Ruby UDFs
  • LIMIT / SPLIT by Expression
  • Default SPLIT Destination
  • Syntactical Sugar for TOTUPLE, TOBAG, and TOMAP
  • AvroStorage Improvements

Enjoy!

October 18, 2012

Axemblr’s Java Client for the Cloudera Manager API

Filed under: Cloud Computing,Cloudera,Hadoop — Patrick Durusau @ 10:38 am

Axemblr’s Java Client for the Cloudera Manager API by Justin Kestelyn.

From the post:

Axemblr, purveyors of a cloud-agnostic MapReduce Web Service, have recently announced the availability of an Apache-licensed Java Client for the Cloudera Manager API.

The task at hand, according to Axemblr, is to “deploy Hadoop on Cloud with as little user interaction as possible. We have the code to provision the hosts but we still need to install and configure Hadoop on all nodes and make it so the user has a nice experience doing it.” And voilà, the answer is Cloudera Manager, with the process made easy via the REST API introduced in Release 4.0.

Thus, says Axemblr: “In the pursuit of our greatest desire (second only to coffee early in the morning), we ended up writing a Java client for Cloudera Manager’s API. Thus we achieved to automate a CDH3 Hadoop installation on Amazon EC2 and Rackspace Cloud. We also decided to open source the client so other people can play along.”
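If you want to see the raw REST calls such a client wraps, a bare-bones GET against the Cloudera Manager API is only a few lines of Java. This is a hedged sketch: the /api/v1/clusters path and port 7180 are assumptions based on the Cloudera Manager 4 documentation as I recall it, and admin/admin is the out-of-the-box default credential, none of it taken from Axemblr’s code:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import javax.xml.bind.DatatypeConverter;

    // Minimal GET against the Cloudera Manager REST API; lists clusters as JSON.
    public class CmApiPing {
      public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "localhost";   // CM server
        URL url = new URL("http://" + host + ":7180/api/v1/clusters");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Assumption: default admin/admin credentials; change for a real server.
        String auth = DatatypeConverter.printBase64Binary("admin:admin".getBytes("UTF-8"));
        conn.setRequestProperty("Authorization", "Basic " + auth);

        BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = in.readLine()) != null) {
          System.out.println(line);   // raw JSON describing the clusters
        }
        in.close();
      }
    }

Axemblr’s client wraps the same endpoints in typed Java objects, which is what you want for anything beyond poking around.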

Another goodie to ease your way to Hadoop deployment on your favorite cloud.

Do you remember the lights at radio stations that would show “On Air?”

I need an “On Cloud” that lights up. More realistic than the data appliance.

October 17, 2012

Improving the integration between R and Hadoop: rmr 2.0 released

Filed under: Hadoop,R,RHadoop — Patrick Durusau @ 9:14 am

Improving the integration between R and Hadoop: rmr 2.0 released

David Smith reports:

The RHadoop project, the open-source project supported by Revolution Analytics to integrate R and Hadoop, continues to evolve. Now available is version 2 of the rmr package, which makes it possible for R programmers to write map-reduce tasks in the R language, and have them run within the Hadoop cluster. This update is the "simplest and fastest rmr yet", according to lead developer Antonio Piccolboni. While previous releases added performance-improving vectorization capabilities to the interface, this release simplifies the API while still improving performance (for example, by using native serialization where appropriate). This release also adds some convenience functions, for example for taking random samples from Big Data stored in Hadoop. You can find further details of the changes here, and download RHadoop here.

RHadoop Project: Changelog

As you know, I’m not one to complain, ;-), but I read from above:

…this release simplifies the API while still improving performance [a good thing]

as contradicting the release notes that read in part:

…At the same time, we increased the complexity of the API. With this version we tried to define a synthesis between all the modes (record-at-a-time, vectorized and structured) present in 1.3, with the following goals:

  • bring the footprint of the API back to 1.2 levels.
  • make sure that no matter what the corner of the API one is exercising, he or she can rely on simple properties and invariants; writing an identity mapreduce should be trivial.
  • encourage writing the most efficient and idiomatic R code from the start, as opposed to writing against a simple API first and then developing a vectorized version for speed.

After reading the change notes, I’m going with the “simplifies the API” riff.

Take a close look and see what you think.

October 16, 2012

Mortar Takes Aim at Hadoop Usability [girls keep out]

Filed under: Hadoop,Usability — Patrick Durusau @ 9:48 am

Maybe I am being overly sensitive but I don’t see a problem with:

…a phalanx of admins to oversee … a [Hadoop] operation

I mean, the reason they have software/hardware is to provide places for admins to gather and play. Right? 😉

Or NOT!

Maybe Ian Armas Foster in Mortar Takes Aim at Hadoop Usability has some good points:

“Have a pile of under-utilized data? Want to use Hadoop but can’t spend weeks or months getting started?” According to fresh startup Mortar, these are questions that should appeal to potential Hadoop users, who are looking to wrap their arms around the elephant without hiring a phalanx of admins to oversee the operation.

Mortar claims to make Hadoop more accessible to the people most responsible for garnering insight from big data: data scientists and engineers. The young startup took flight when a couple of architects at Wireless Generation decided that big data tools and approaches were complex enough to warrant a new breed of offering–one that could take the hardware element out of Hadoop use.

(video omitted)

Hadoop is a terrific open-source data tool that can process and perform analytics (sometimes predictive) on big data and large datasets. An unfortunate property of Hadoop is that it is difficult to use. Many companies looking to get into big data simply invest in Hadoop clusters without a vision as to how to use the cluster or without the resources, human or monetary, to execute said vision.

“Hadoop is an amazing technology but for most companies it was out of reach,” said Young in a presentation at the New York City Data Business Meetup in September.

To combat this, Mortar is building a web-based product-as-a-service in which someone need simply log on to the Mortar website and start writing code to do what they want with their pile of data. “We wanted to make operation very easy,” said Young, “because it’s very hard to hire people with Hadoop expertise and because Hadoop is sort of famously hard to operate.”

A bit further in the article, it is claimed that a “data scientist” can be up and using Hadoop in one (1) hour.

Can you name another technology that is “…famously hard to operate?”

Do data integration, semantics, semantic web, RDF, master data management, topic maps come to mind?

If they do, what do you think can be done to make them easier to operate?

Having a hard to operate approach, technology or tool may be thrilling, in a “girls keep out” clubhouse sort of way, but it isn’t the road to success, commercial or otherwise.

Analyzing Twitter Data with Hadoop, Part 2: Gathering Data with Flume

Filed under: Cloudera,Flume,Hadoop,Tweets — Patrick Durusau @ 9:15 am

Analyzing Twitter Data with Hadoop, Part 2: Gathering Data with Flume by Jon Natkins.

From the post:

This is the second article in a series about analyzing Twitter data using some of the components of the Hadoop ecosystem available in CDH, Cloudera’s open-source distribution of Hadoop and related projects. In the first article, you learned how to pull CDH components together into a single cohesive application, but to really appreciate the flexibility of each of these components, we need to dive deeper.

Every story has a beginning, and every data pipeline has a source. So, to build Hadoop applications, we need to get data from a source into HDFS.

Apache Flume is one way to bring data into HDFS using CDH. The Apache Flume website describes Flume as “a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.” At the most basic level, Flume enables applications to collect data from its origin and send it to a resting location, such as HDFS. At a slightly more detailed level, Flume achieves this goal by defining dataflows consisting of three primary structures: sources, channels and sinks. The pieces of data that flow through Flume are called events, and the processes that run the dataflow are called agents.
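If the source/channel/sink vocabulary is new to you, it helps to know that a Flume agent is just a properties file wiring the three together. A minimal hedged sketch (a generic netcat source and HDFS sink rather than the Twitter source this series builds; the agent and component names are arbitrary):

    # One agent named "agent1" with one source, one channel, one sink
    agent1.sources  = src1
    agent1.channels = ch1
    agent1.sinks    = sink1

    # Source: read events from a TCP port (the series uses a custom Twitter source instead)
    agent1.sources.src1.type = netcat
    agent1.sources.src1.bind = 0.0.0.0
    agent1.sources.src1.port = 44444
    agent1.sources.src1.channels = ch1

    # Channel: buffer events in memory between source and sink
    agent1.channels.ch1.type = memory
    agent1.channels.ch1.capacity = 10000

    # Sink: write events into HDFS
    agent1.sinks.sink1.type = hdfs
    agent1.sinks.sink1.hdfs.path = hdfs://namenode:8020/user/flume/events
    agent1.sinks.sink1.channel = ch1

An agent defined this way is started with the flume-ng agent command, pointing --name and --conf-file at the names and file above.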

A very good introduction to the use of Flume!

Does it seem to you that the number of examples using Twitter, not just for “big data” but in general, is on the rise?

Just a personal observation, subject to all the flaws (“all the buses were going the other way”) of such observations.

Judging from the state of my inbox, some people are still writing more than 140 characters at a time.

Will it make a difference in our tools/thinking if we focus on shorter strings as opposed to longer ones?

Ready to Contribute to Apache Hadoop 2.0?

Filed under: Hadoop,Hadoop YARN,Hortonworks — Patrick Durusau @ 4:08 am

User feedback is a contribution to a software project.

Software can only mature with feedback, your feedback.

Otherwise the final deliverable has a “works on my machine” outcome.

Don’t let Apache Hadoop 2.0 have a “works on my machine” outcome.

Download the preview and contribute your experiences back to the community.

We will all be glad you did!

Details:

Hortonworks Data Platform 2.0 Alpha is Now Available for Preview! by Jeff Sposetti.

From the post:

We are very excited to announce the Alpha release of the Hortonworks Data Platform 2.0 (HDP 2.0 Alpha).

HDP 2.0 Alpha is built around Apache Hadoop 2.0, which improves availability of HDFS with High Availability for the NameNode along with several performance and reliability enhancements. Apache Hadoop 2.0 also significantly advances data processing in the Hadoop ecosystem with the introduction of YARN, a generic resource-management and application framework to support MapReduce and other paradigms such as real-time processing and graph processing.

In addition to Apache Hadoop 2.0, this release includes the essential Hadoop ecosystem projects such as Apache HBase, Apache Pig, Apache Hive, Apache HCatalog, Apache ZooKeeper and Apache Oozie to provide a fully integrated and verified Apache Hadoop 2.0 stack.

Apache Hadoop 2.0 is well on the path to General Availability, and is already deployed at scale in several organizations; but it won’t get to the current maturity levels of the Hadoop 1.0 stack (available in Hortonworks Data Platform 1.x) without feedback and contributions from the community.

Hortonworks strongly believes that for open source technologies to mature and become widely adopted in the enterprise, you must balance innovation with stability. With HDP 2.0 Alpha, Hortonworks provides organizations an easy way to evaluate and gain experience with the Apache Hadoop 2.0 technology stack, and it presents the perfect opportunity to help bring stability to the platform and influence the future of the technology.

October 15, 2012

Real-time Big Data Analytics Engine – Twitter’s Storm

Filed under: BigData,Hadoop,Storm — Patrick Durusau @ 8:44 am

Real-time Big Data Analytics Engine – Twitter’s Storm by Istvan Szegedi.

From the post:

Hadoop is a batch-oriented big data solution at its heart and leaves gaps in ad-hoc and real-time data processing at massive scale, so some people have already started counting the days of Hadoop as we know it. As one of the alternatives, we have already seen Google BigQuery support ad-hoc analytics, and this time the post is about Twitter’s Storm real-time computation engine, which aims to provide a solution in the real-time data analytics world. Storm was originally developed by BackType and now runs under Twitter’s name, after BackType was acquired by them. The need for a dedicated real-time analytics solution was explained by Nathan Marz as follows: “There’s no hack that will turn Hadoop into a realtime system; realtime data processing has a fundamentally different set of requirements than batch processing…. The lack of a “Hadoop of realtime” has become the biggest hole in the data processing ecosystem. Storm fills that hole.”

Introduction to Storm, including a walk through the word count topology example that comes with the current download.
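If you want a feel for the topology API before downloading, the word count example boils down to a spout feeding two bolts. The sketch below is written against the backtype.storm package names current in 2012 and is my compression of the idea, not the exact example shipped with Storm:

    import java.util.HashMap;
    import java.util.Map;

    import backtype.storm.Config;
    import backtype.storm.LocalCluster;
    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;
    import backtype.storm.utils.Utils;

    public class WordCountSketch {

      // Spout: emits the same sentence over and over, just to feed the topology.
      public static class SentenceSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
          this.collector = collector;
        }
        public void nextTuple() {
          Utils.sleep(100);
          collector.emit(new Values("the quick brown fox jumps over the lazy dog"));
        }
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
          declarer.declare(new Fields("sentence"));
        }
      }

      // Bolt: splits sentences into words.
      public static class SplitBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector collector) {
          for (String word : input.getString(0).split("\\s+")) {
            collector.emit(new Values(word));
          }
        }
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
          declarer.declare(new Fields("word"));
        }
      }

      // Bolt: keeps a running count per word.
      public static class CountBolt extends BaseBasicBolt {
        private final Map<String, Integer> counts = new HashMap<String, Integer>();
        public void execute(Tuple input, BasicOutputCollector collector) {
          String word = input.getString(0);
          Integer count = counts.get(word);
          count = (count == null) ? 1 : count + 1;
          counts.put(word, count);
          collector.emit(new Values(word, count));
        }
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
          declarer.declare(new Fields("word", "count"));
        }
      }

      public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout());
        builder.setBolt("split", new SplitBolt()).shuffleGrouping("sentences");
        builder.setBolt("count", new CountBolt()).fieldsGrouping("split", new Fields("word"));

        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word-count-sketch", new Config(), builder.createTopology());
        Utils.sleep(10000);   // let it run for a few seconds, then shut down
        cluster.shutdown();
      }
    }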

A useful addition to your toolkit!

October 11, 2012

Big Data Security Part One: Introducing PacketPig

Filed under: BigData,Hadoop,PacketPig,Security,Systems Administration — Patrick Durusau @ 4:04 pm

Big Data Security Part One: Introducing PacketPig by Michael Baker.

From the post:

Packetloop CTO Michael Baker (@cloudjunky) made a big splash when he presented ‘Finding Needles in Haystacks (the Size of Countries)’ at Blackhat Europe earlier this year. The paper outlines a toolkit based on Apache Pig, Packetpig @packetpig (available on github), for doing network security monitoring and intrusion detection analysis on full packet captures using Hadoop.

In this series of posts, we’re going to introduce Big Data Security and explore using Packetpig on real full packet captures to understand and analyze networks. In this post, Michael will introduce big data security in the form of full data capture, Packetpig and Packetloop.

If you are a bit rusty on packets and TCP/IP, I could just wave my hands, say “see the various tutorials,” and off you would go to hunt something down.

Let me be more helpful than that and suggest: TCP/IP Tutorial and Technical Overview from the IBM RedBooks we were talking about earlier.

It’s not short (almost a thousand pages), but then it isn’t W. Richard Stevens in three volumes either. 😉

You won’t need all of either resource but it is better to start with too much than too little.

Hadoop/HBase Cluster ~ 1 Hour/$10 (What do you have to lose?)

Filed under: Hadoop,HBase — Patrick Durusau @ 3:08 pm

Set Up a Hadoop/HBase Cluster on EC2 in (About) an Hour by George London.

From the post:

I’m going to walk you through a (relatively) simple set of steps that will get you up and running MapReduce programs on a cloud-based, six-node distributed Hadoop/HBase cluster as fast as possible. This is all based on what I’ve picked up on my own, so if you know of better/faster methods, please let me know in comments!

We’re going to be running our cluster on Amazon EC2, and launching the cluster using Apache Whirr and configuring it using Cloudera Manager Free Edition. Then we’ll run some basic programs I’ve posted on Github that will parse data and load it into Apache HBase.

All together, this tutorial will take a bit over one hour and cost about $10 in server costs.

This is the sort of tutorial that I long to write for topic maps.

There is a longer version of this tutorial here.
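For a flavor of the Whirr half of that recipe, the cluster definition is a short properties file. This is a generic hedged sketch: roles, counts, instance size, and key paths are placeholders, and George’s tutorial layers Cloudera Manager on top, so his actual recipe differs:

    # Generic Apache Whirr recipe for a small Hadoop cluster on EC2 (illustrative only)
    whirr.cluster-name = hadoop-test
    whirr.instance-templates = 1 hadoop-namenode+hadoop-jobtracker,5 hadoop-datanode+hadoop-tasktracker
    whirr.provider = aws-ec2
    whirr.identity = ${env:AWS_ACCESS_KEY_ID}
    whirr.credential = ${env:AWS_SECRET_ACCESS_KEY}
    whirr.hardware-id = m1.large
    whirr.private-key-file = ${sys:user.home}/.ssh/id_rsa
    whirr.public-key-file = ${sys:user.home}/.ssh/id_rsa.pub

From there, whirr launch-cluster --config hadoop.properties brings the nodes up, and whirr destroy-cluster --config hadoop.properties tears them down when your $10 is spent.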

October 10, 2012

What is Hadoop Metrics2?

Filed under: Cloudera,Hadoop,Systems Administration — Patrick Durusau @ 4:17 pm

What is Hadoop Metrics2? by Ahmed Radwan.

I’ve been wondering about that. How about you? 😉

From the post:

Metrics are collections of information about Hadoop daemons, events and measurements; for example, data nodes collect metrics such as the number of blocks replicated, number of read requests from clients, and so on. For that reason, metrics are an invaluable resource for monitoring Hadoop services and an indispensable tool for debugging system problems.

This blog post focuses on the features and use of the Metrics2 system for Hadoop, which allows multiple metrics output plugins to be used in parallel, supports dynamic reconfiguration of metrics plugins, provides metrics filtering, and allows all metrics to be exported via JMX.
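Metrics2 is configured through hadoop-metrics2.properties. A hedged sketch of the common case, one file sink per daemon (the class and property names follow the stock example that ships with Hadoop; check your distribution’s copy for the authoritative version):

    # Send every daemon's metrics to a file sink, polled every 10 seconds
    *.sink.file.class = org.apache.hadoop.metrics2.sink.FileSink
    *.period = 10

    # One output file per daemon prefix
    namenode.sink.file.filename = namenode-metrics.out
    datanode.sink.file.filename = datanode-metrics.out
    jobtracker.sink.file.filename = jobtracker-metrics.out
    tasktracker.sink.file.filename = tasktracker-metrics.out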

However cool the software, you can’t ever really get away from managing it.

And it isn’t a bad skill to have. Read on!

October 8, 2012

Are You Confused? (About MR2 and YARN?) Help is on the way!

Filed under: Hadoop,Hadoop YARN,MapReduce 2.0 — Patrick Durusau @ 6:51 pm

MR2 and YARN Briefly Explained by Justin Kestelyn.

Justin writes:

With CDH4 onward, the Apache Hadoop component introduced two new terms for Hadoop users to wonder about: MR2 and YARN. Unfortunately, these terms are mixed up so much that many people are confused about them. Do they mean the same thing, or not?

Not but see Justin’s post for the details. (He also points to a longer post with more details.)

October 7, 2012

Gangnam Style Hadoop Learning

Filed under: Hadoop,Humor — Patrick Durusau @ 7:43 pm

Gangnam Style Hadoop Learning

Err, you will just have to see this one. It… defies description.

Not management appropriate, too many words. That would lead to questions.

Let’s start the week by avoiding management questions because of too many words in a video.

October 6, 2012

Applying Parallel Prediction to Big Data

Filed under: Hadoop,Mahout,Oracle,Pig,Weather Data,Weka — Patrick Durusau @ 3:20 pm

Applying Parallel Prediction to Big Data by Dan McClary (Principal Product Manager for Big Data and Hadoop at Oracle).

From the post:

One of the constants in discussions around Big Data is the desire for richer analytics and models. However, for those who don’t have a deep background in statistics or machine learning, it can be difficult to know not only just what techniques to apply, but on what data to apply them. Moreover, how can we leverage the power of Apache Hadoop to effectively operationalize the model-building process? In this post we’re going to take a look at a simple approach for applying well-known machine learning approaches to our big datasets. We’ll use Pig and Hadoop to quickly parallelize a standalone machine-learning program written in Jython.

Playing Weatherman

I’d like to predict the weather. Heck, we all would – there’s personal and business value in knowing the likelihood of sun, rain, or snow. Do I need an umbrella? Can I sell more umbrellas? Better yet, groups like the National Climatic Data Center offer public access to weather data stretching back to the 1930s. I’ve got a question I want to answer and some big data with which to do it. On first reaction, because I want to do machine learning on data stored in HDFS, I might be tempted to reach for a massively scalable machine learning library like Mahout.

For the problem at hand, that may be overkill and we can get it solved in an easier way, without understanding Mahout. Something becomes apparent on thinking about the problem: I don’t want my climate model for San Francisco to include the weather data from Providence, RI. Weather is a local problem and we want to model it locally. Therefore what we need is many models across different subsets of data. For the purpose of example, I’d like to model the weather on a state-by-state basis. But if I have to build 50 models sequentially, tomorrow’s weather will have happened before I’ve got a national forecast. Fortunately, this is an area where Pig shines.

Two quick observations:

First, Dan makes my point about your needing the “right” data, which may or may not be the same thing as “big data.” Decide what you want to do before you reach for big iron and data.

Second, I never hear references to the “weatherman” without remembering: “you don’t need to be a weatherman to know which way the wind blows.” (link to the manifesto) If you prefer a softer version, Subterranean Homesick Blues by Bob Dylan.

October 4, 2012

YARN Meetup at Hortonworks on Friday, Oct 12

Filed under: Hadoop,Hadoop YARN,Hortonworks — Patrick Durusau @ 4:35 pm

YARN Meetup at Hortonworks on Friday, Oct 12 by Russell Jurney.

From the post:

Hortonworks is hosting an Apache YARN Meetup on Friday, Oct 12, to solicit feedback on the YARN APIs. We’ve talked about YARN before in a four-part series on YARN, parts one, two, three and four.

YARN, or “Apache Hadoop NextGen MapReduce,” has come a long way this year. It is now a full-fledged sub-project of Apache Hadoop and has already been deployed on a massive 2,000 node cluster at Yahoo. Many projects, both open-source and otherwise, are porting to work on YARN, such as Storm and S4, and many of them are in fairly advanced stages. We also have several individuals implementing one-off or ad-hoc applications on YARN.

This meetup is a good time for YARN developers to catch up and talk more about YARN, its current status, and its medium-term and long-term roadmap.

OK, it’s probably too late to get cheap tickets, but if you are in the Bay Area on the 12th of October, take advantage of the opportunity!

And please blog about the meeting, with a note to yours truly! I will post a link to your posting.

Adapting MapReduce for realtime apps

Filed under: Hadoop,MapReduce — Patrick Durusau @ 4:28 pm

Adapting MapReduce for realtime apps

From the post:

As popular as MapReduce is, there is just as much discussion about making it even better, moving from a generalized approach to a more performance-oriented one. We will be discussing a few frameworks which have tried to adapt MapReduce for higher performance.

The first post in this series will discuss AMREF, an Adaptive MapReduce Framework designed for real-time, data-intensive applications. (published in the paper Fan Zhang, Junwei Cao, Xiaolong Song, Hong Cai, Cheng Wu: AMREF: An Adaptive MapReduce Framework for Real Time Applications. GCC 2010: 157-162.)

If you are interested in squeezing more performance out of MapReduce, this looks like a good starting place.

Scalable Machine Learning with Hadoop (most of the time)

Filed under: Hadoop,Machine Learning,Mahout — Patrick Durusau @ 1:44 pm

Scalable Machine Learning with Hadoop (most of the time) by Grant Ingersoll. (slides)

Grant’s slides from a presentation on machine learning with Hadoop in Taiwan!

Not quite like being there but still useful.

And a reminder that I need to get a copy of Taming Text!

October 3, 2012

CDH4.1 Now Released!

Filed under: Cloudera,Flume,Hadoop,HBase,HDFS,Hive,Pig — Patrick Durusau @ 8:28 pm

CDH4.1 Now Released! by Charles Zedlewski.

From the post:

We’re pleased to announce the availability of CDH4.1. We’ve seen excellent adoption of CDH4.0 since it went GA at the end of June and a number of exciting use cases have moved to production. CDH4.1 is an update that has a number of fixes but also a number of useful enhancements. Among them:

  • Quorum based storage – Quorum-based Storage for HDFS provides the ability for HDFS to store its own NameNode edit logs, allowing you to run a highly available NameNode without external storage or custom fencing.
  • Hive security and concurrency – we’ve fixed some long standing issues with running Hive. With CDH4.1, it is now possible to run a shared Hive instance where users submit queries using Kerberos authentication. In addition this new Hive server supports multiple users submitting queries at the same time.
  • Support for DataFu – the LinkedIn data science team was kind enough to open source their library of Pig UDFs that make it easier to perform common jobs like sessionization or set operations. Big thanks to the LinkedIn team!!!
  • Oozie workflow builder – since we added Oozie to CDH more than two years ago, we have often had requests to make it easier to develop Oozie workflows. The newly enhanced job designer in Hue enables users to use a visual tool to build and run Oozie workflows.
  • FlumeNG improvements – since its release, FlumeNG has become the backbone for some exciting data collection projects, in some cases collecting as much as 20TB of new event data per day. In CDH4.1 we added an HBase sink and metrics for monitoring, as well as a number of performance improvements.
  • Various performance improvements – CDH4.1 users should experience a boost in their MapReduce performance from CDH4.0.
  • Various security improvements – CDH4.1 enables users to configure the system to encrypt data in flight during the shuffle phase. CDH now also applies Hadoop security to users who access the filesystem via a FUSE mount.

It’s releases like this that make me wish I spent more time writing documentation for software. To try out all the cool features with no real goal other than trying them out.

Enjoy!

September 30, 2012

Quantcast File System for Hadoop

Filed under: Hadoop,QFS — Patrick Durusau @ 8:32 pm

Quantcast File System for Hadoop

From Alex Popescu’s myNoSQL, news of a new file system for Hadoop.

QFS

Pointers to comments?

September 29, 2012

Hadoop as Java Ecosystem “MVP”

Filed under: Hadoop,MapReduce — Patrick Durusau @ 3:12 pm

Apache Hadoop Wins Duke’s Choice Award, is a Java Ecosystem “MVP” by Justin Kestelyn.

From the post:

For those of you new to it, the Duke’s Choice Awards program was initiated by Sun Microsystems in 2002 in an effort to “celebrate extreme innovation in the world of Java technology” – in essence, it’s the “MVP” of the Java ecosystem. Since it acquired Sun in 2009, Oracle has continued the tradition of bestowing the award, and in fact has made the process more community-oriented by accepting nominations from the public and involving Java User Groups in the judging effort.

For the 2012 awards, I’m happy to report that Apache Hadoop is among the awardees – which also include the United Nations High Commission for Refugees, Liquid Robotics, and Java cloud company Jelastic Inc., among others.

Very cool!

Kudos to the Apache Hadoop project!

September 28, 2012

Alan Gates CHUGs HCatalog in Windy City

Filed under: Hadoop,HCatalog — Patrick Durusau @ 2:19 pm

Alan Gates CHUGs HCatalog in Windy City (Chicago Hadoop User Group) by Kim Truong

From the post:

Alan Gates presented HCatalog to the Chicago Hadoop User Group (CHUG) on 9/17/12. There was a great turnout, and the strength of CHUG is evidence that Chicago is a Hadoop city. Below are some kind words from the host, Mark Slusar.

“On 9/17/12, the Chicago Hadoop User Group (CHUG) was delighted to host Hortonworks Co-Founder Alan Gates to give an overview of HCatalog. In addition to downtown Chicago meetups, Allstate Insurance Company in Northbrook, IL hosts regular Chicago Hadoop User Group Meetups. After noshing on refreshments provided by Hortonworks, attendees were treated to an in-depth overview of HCatalog, its history, as well as how and when to use it. Alan’s experience and expertise were an excellent contribution to CHUG. Alan made a great connection with every attendee. With his detailed lecture, he answered many questions, and also joined a handful of attendees for drinks after the meetup. CHUG would be thrilled to have Alan & the Hortonworks team return in the future!” – Mark Slusar

What a great way to start the weekend!

Enjoy!

September 25, 2012

Location Sensitive Hashing in Map Reduce

Filed under: Hadoop,MapReduce — Patrick Durusau @ 3:23 pm

Location Sensitive Hashing in Map Reduce by Ricky Ho.

From the post:

Inspired by Dr. Gautam Shroff, who teaches the Web Intelligence and Big Data class on coursera.org: there are many scenarios where we want to compute similarity between large numbers of items (e.g. photos, products, persons, resumes, etc.). I want to add another algorithm to my Map/Reduce algorithm catalog.

For background on the Map/Reduce implementation on Hadoop, I have a previous post that covers the details.

“Location” here is not used in the geographic sense but as a general measure of distance. Could be geographic, but could be some other measure of location as well.
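Whatever the distance measure, the map-side trick is the same: hash each item so that similar items are likely to collide on at least one key, and let the shuffle bring the candidate pairs together. A hedged minhash/banding sketch in Java (the generic pattern, not Ricky’s code; the signature length and band size are arbitrary assumptions):

    import java.util.Arrays;
    import java.util.Set;

    // Sketch of the map side of LSH: items whose token sets are similar are
    // likely to share at least one band key, so the shuffle brings candidate
    // pairs together for a reducer to verify.
    public class MinhashBanding {

      private static final int NUM_HASHES = 32;  // signature length (assumption)
      private static final int BAND_SIZE = 4;    // hashes per band (assumption)

      // One cheap hash family: perturb the token's hashCode per hash function.
      static int hash(String token, int i) {
        int h = token.hashCode() ^ (0x9e3779b9 * (i + 1));
        return h ^ (h >>> 16);
      }

      // Minhash signature: for each hash function, the minimum over all tokens.
      static int[] signature(Set<String> tokens) {
        int[] sig = new int[NUM_HASHES];
        Arrays.fill(sig, Integer.MAX_VALUE);
        for (String t : tokens) {
          for (int i = 0; i < NUM_HASHES; i++) {
            sig[i] = Math.min(sig[i], hash(t, i));
          }
        }
        return sig;
      }

      // Band keys: in a real mapper you would context.write(bandKey, itemId).
      static String[] bandKeys(Set<String> tokens) {
        int[] sig = signature(tokens);
        String[] keys = new String[NUM_HASHES / BAND_SIZE];
        for (int b = 0; b < keys.length; b++) {
          int[] band = Arrays.copyOfRange(sig, b * BAND_SIZE, (b + 1) * BAND_SIZE);
          keys[b] = b + ":" + Arrays.hashCode(band);
        }
        return keys;
      }
    }

Items that land under the same band key become candidate pairs; a reducer then computes the real similarity on just those pairs instead of all n-squared combinations.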

Search Hadoop with Search-Hadoop.com

Filed under: Hadoop,Lucene — Patrick Durusau @ 10:46 am

Search Hadoop with Search-Hadoop.com by Russell Jurney.

As the Hadoop ecosystem has exploded into many projects, searching for the right answers when questions arise can be a challenge. That’s why I was thrilled to hear about search-hadoop.com, from Sematext. It has a sister site called search-lucene where you can… search Lucene!

Class! Class! Pay attention now.

These are examples of value-added services.

Both of these are going on my browser tool bar. How about you?

September 24, 2012

Schedule This! Strata + Hadoop World Speakers from Cloudera

Filed under: Cloudera,Conferences,Hadoop — Patrick Durusau @ 2:38 pm

Schedule This! Strata + Hadoop World Speakers from Cloudera by Justin Kestelyn.

Oct. 23-25, 2012, New York City

From the post:

We’re getting really close to Strata Conference + Hadoop World 2012 (just over a month away), schedule planning-wise. So you may want to consider adding the tutorials, sessions, and keynotes below to your calendar! (Start times are always subject to change of course.)

The ones listed below are led or co-led by Clouderans, but there is certainly a wide range of attractive choices beyond what you see here. We just want to ensure that you put these particular ones high on your consideration list.

Just in case the Clouderans aren’t enough incentive to attend (they should be), consider the full schedule for the conference.

September 23, 2012

Meet the Committer, Part Two: Matt Foley [Ambari herein]

Filed under: Clustering (servers),Hadoop — Patrick Durusau @ 7:30 pm

Meet the Committer, Part Two: Matt Foley by Kim Truong

From the post:

For the next installment of the “Future of Apache Hadoop” webinar series, I would like to introduce to you Matt Foley and Ambari. Matt is a member of Hortonworks technical staff, a Committer and PMC member for the Apache Hadoop core project, and will be our guest speaker for the September 26, 2012 @ 10am PDT / 1pm EDT webinar: Deployment and Management of Hadoop Clusters with AMBARI.

Get to know Matt in this second installment of our “Meet the Committer” series.

No pressure but I do hope this compares well to the Alan Gates webinar on Pig. No pressure. 😉

In case you want to investigate/learn/brush up on Ambari.

Pig Out to Hadoop (Replay) [Restore Your Faith in Webinars]

Filed under: Hadoop,Hortonworks,Pig — Patrick Durusau @ 3:08 pm

Pig Out to Hadoop with Alan Gates (Link to the webinar page at Hortonworks. Scroll down for this webinar. You have to register/login to view.)

From the description:

Pig has added some exciting new features in 0.10, including a boolean type, UDFs in JRuby, load and store functions for JSON, bloom filters, and performance improvements. Join Alan Gates, Hortonworks co-founder and long-time contributor to the Apache Pig and HCatalog projects, to discuss these new features, as well as talk about work the project is planning to do in the near future. In particular, we will cover how Pig can take advantage of changes in Hadoop 0.23.

I should have been watching more closely for this webinar recording to get posted.

Not only is it a great webinar on Pig, but it will restore your faith in webinars as a means of content delivery.

I have suffered through several lately where the introductions took more time than the actual technical content of the webinar.

Hard to know until you have already registered and spent time expecting substantive content.

Is there a public tally board for webinars on search, semantics, big data, etc.?

September 20, 2012

HCatalog Meetup at Twitter

Filed under: Hadoop,HCatalog,Pig — Patrick Durusau @ 7:22 pm

HCatalog Meetup at Twitter by Russell Jurney.

From the post:

Representatives from Twitter, Yahoo, LinkedIn, Hortonworks and IBM met at Twitter HQ on Thursday to talk HCatalog. Committers from HCatalog, Pig and Hive were on hand to discuss the state of HCatalog and its future.

Apache HCatalog is a table and storage management service for data created using Apache Hadoop.

See Russell’s post for more details.

Then brush up on HCatalog (if you aren’t already following it).
