Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 12, 2013

…Apache HBase REST Interface, Part 1

Filed under: Cloudera,HBase — Patrick Durusau @ 2:20 pm

How-to: Use the Apache HBase REST Interface, Part 1 by Jesse Anderson.

From the post:

There are various ways to access and interact with Apache HBase. The Java API provides the most functionality, but many people want to use HBase without Java.

There are two main approaches for doing that: One is the Thrift interface, which is the faster and more lightweight of the two options. The other way to access HBase is using the REST interface, which uses HTTP verbs to perform an action, giving developers a wide choice of languages and programs to use.

This series of how-to’s will discuss the REST interface and provide Python code samples for accessing it. The first post will cover HBase REST, some Python caveats, and table administration. The second post will explain how to insert multiple rows at a time using XML and JSON. The third post will show how to get multiple rows using XML and JSON. The full code samples can be found on my GitHub account.

Post also has a reminder about HBaseCon 2013 (June 13, San Francisco).
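
If you want a taste of the table administration calls before the series gets there, here is a minimal sketch (my own, not Jesse’s code) that creates a table and lists existing tables through the REST interface. It assumes an HBase REST server running on localhost:8080 and the Python requests library; the table and column family names are made up.

    # Minimal sketch, not the series' code. Assumes an HBase REST server on
    # localhost:8080; the table and column family names are hypothetical.
    import requests

    BASE_URL = "http://localhost:8080"

    def create_table(table, column_family):
        # Create a table by PUTting an XML schema to /<table>/schema
        schema = ('<?xml version="1.0" encoding="UTF-8"?>'
                  '<TableSchema name="{0}">'
                  '<ColumnSchema name="{1}" />'
                  '</TableSchema>').format(table, column_family)
        response = requests.put("{0}/{1}/schema".format(BASE_URL, table),
                                data=schema,
                                headers={"Content-Type": "text/xml"})
        response.raise_for_status()

    def list_tables():
        # List existing tables, asking for a JSON response
        response = requests.get(BASE_URL + "/",
                                headers={"Accept": "application/json"})
        response.raise_for_status()
        return response.json()

    if __name__ == "__main__":
        create_table("messages", "cf1")  # hypothetical names
        print(list_tables())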

January 9, 2013

Cloudera Impala: A Modern SQL Engine for Hadoop [Webinar – 10 Jan 2013]

Filed under: Cloudera,Hadoop,Impala — Patrick Durusau @ 12:04 pm

Cloudera Impala: A Modern SQL Engine for Hadoop

From the post:

Join us for this technical deep dive about Cloudera Impala, the project that makes scalable parallel database technology available to the Hadoop community for the first time. Impala is an open-sourced code base that allows users to issue low-latency queries to data stored in HDFS and Apache HBase using familiar SQL operators.

Presenter Marcel Kornacker, creator of Impala, will begin with an overview of Impala from the user’s perspective, followed by an overview of Impala’s architecture and implementation, and will conclude with a comparison of Impala with Apache Hive, commercial MapReduce alternatives and traditional data warehouse infrastructure.

Looking forward to the comparison part. Picking the right tool for a job is an important first step.

December 14, 2012

How-To: Run a MapReduce Job in CDH4

Filed under: Cloudera,Hadoop,MapReduce — Patrick Durusau @ 2:56 pm

How-To: Run a MapReduce Job in CDH4 by Sandy Ryza.

From the post:

This is the first post in a series that will get you going on how to write, compile, and run a simple MapReduce job on Apache Hadoop. The full code, along with tests, is available at http://github.com/cloudera/mapreduce-tutorial. The program will run on either MR1 or MR2.

We’ll assume that you have a running Hadoop installation, either locally or on a cluster, and your environment is set up correctly so that typing “hadoop” into your command line gives you some notes on usage. Detailed instructions for installing CDH, Cloudera’s open-source, enterprise-ready distro of Hadoop and related projects, are available here: https://ccp.cloudera.com/display/CDH4DOC/CDH4+Installation. We’ll also assume you have Maven installed on your system, as this will make compiling your code easier. Note that Maven is not a strict dependency; we could also compile using Java on the command line or with an IDE like Eclipse.

The Use Case

There’s been a lot of brawling on our pirate ship recently. Not so rarely, one of the mates will punch another one in the mouth, knocking a tooth out onto the deck. Our poor sailors will wake up the next day with an empty bottle of rum, wondering who’s responsible for the gap between their teeth. All this violence has gotten out of hand, so as a deterrent, we’d like to provide everyone with a list of everyone that’s ever left them with a gap. Luckily, we’ve been able to set up a Flume source so that every time someone punches someone else, it gets written out as a line in a big log file in Hadoop. To turn this data into these lists, we need a MapReduce job that can 1) invert the mapping from attacker to their victim, 2) group by victims, and 3) eliminate duplicates.

Cool!

Imagine using the same technique while you watch the evening news!

On second thought, that would take too much data entry and be depressing.

Stick to the pirates!
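
Before leaving the pirates, here is a rough sketch of the job’s logic as Hadoop Streaming scripts in Python. The tutorial’s real code is Java (see the GitHub link above); this is only an illustration, and it assumes, hypothetically, that each Flume log line has the form “attacker<TAB>victim”.

    # Illustration only: the tutorial's real code is Java. This sketches the
    # same logic as Hadoop Streaming scripts, assuming (hypothetically) that
    # each log line has the form "attacker<TAB>victim".
    import sys

    def run_mapper(stdin=sys.stdin):
        # mapper.py: invert the mapping, emitting victim as key, attacker as value
        for line in stdin:
            fields = line.strip().split("\t")
            if len(fields) == 2:
                attacker, victim = fields
                print("{0}\t{1}".format(victim, attacker))

    def run_reducer(stdin=sys.stdin):
        # reducer.py: Streaming sorts by key, so group by victim and de-duplicate
        current_victim, attackers = None, set()
        for line in stdin:
            victim, attacker = line.strip().split("\t")
            if current_victim is not None and victim != current_victim:
                print("{0}\t{1}".format(current_victim, ",".join(sorted(attackers))))
                attackers = set()
            current_victim = victim
            attackers.add(attacker)
        if current_victim is not None:
            print("{0}\t{1}".format(current_victim, ",".join(sorted(attackers))))

In practice mapper.py and reducer.py would be separate files handed to the Hadoop Streaming jar with its -mapper and -reducer options.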

December 11, 2012

Solving real world analytics problems with Apache Hadoop [Webinar]

Filed under: Cloudera,Hadoop — Patrick Durusau @ 7:14 pm

Solving real world analytics problems with Apache Hadoop

Thursday December 13, 2012 at 8:30 a.m. PST/11:30 a.m. EST

From the registration page:

Agenda:

  • Defining big data
  • What are the most critical components of a big data solution?
  • The business and technical challenges of delivering a solution
  • How Cloudera accelerates big data value?
  • Why partner with HP?
  • The HP AppSystem powered by Cloudera

Doesn’t look heavy on the technical side but on the other hand, attending means you will be entered in a raffle for an HP Mini Notebook.

December 5, 2012

Impala Beta (0.3) + Cloudera Manager 4.1.2 [Get ’em While They’re Hot!]

Filed under: Cloudera,Hadoop,Impala,MapReduce — Patrick Durusau @ 5:46 am

Cloudera Impala Beta (version 0.3) and Cloudera Manager 4.1.2 Now Available by Vinithra Varadharajan.

If you are keeping your Hadoop ecosystem skills up to date, drop by Cloudera for the latest Impala beta and a new release of Cloudera Manager.

Vinithra reports that new releases of Impala are going to drop every two to four weeks.

You can either wait for the final release of Impala or read along and contribute to the final product with your testing and comments.

December 4, 2012

New to Hadoop

Filed under: Cloudera,Hadoop,MapReduce — Patrick Durusau @ 12:08 pm

New to Hadoop

Cloudera has organized a seven step program for learning Hadoop!

  1. Read Up on Background
  2. Install Locally, Install a VM, or Spin Up on Cloud
  3. Explore Tutorials
  4. Get Trained Up
  5. Read Books
  6. Contribute!
  7. Participate!

It doesn’t list every possible resource but all the ones listed are high quality.

Following this program will build a solid basis for exploring the Hadoop ecosystem on your own.

December 3, 2012

Cloudera – Videos from Strata + Hadoop World 2012

Filed under: Cloudera,Hadoop,MapReduce — Patrick Durusau @ 7:20 pm

Cloudera – Videos from Strata + Hadoop World 2012

The link is to the main resources page, where you can find many other videos and other materials.

If you want Strata + Hadoop World 2012 videos specifically, search on Hadoop World 2012.

As of today, that pulls up 41 entries. Should be enough to keep you occupied for a day or so. 😉

November 27, 2012

SINAInnovation: Innovation and Data

Filed under: Bioinformatics,Cloudera,Data — Patrick Durusau @ 2:26 pm

SINAInnovation: Innovation and Data by Jeffrey Hammerbacher.

From the description:

Cloudera Co-founder Jeff Hammerbacher speaks about data and innovation in the biology and medicine fields.

Interesting presentation, particularly on creating structures for innovation.

One of his insights I would summarize as “break early, rebuild fast.” His term for it was “lower batch size.” Try new ideas and when they fail, try a new one.

I do wonder about his goal to “Lower the cost of data storage and processing to zero.”

It may get to be “too cheap to meter” but that isn’t the same thing as being zero. Somewhere in the infrastructure, someone is paying bills for storage and processing.

I mention that because some political parties think that infrastructure can exist without ongoing maintenance and care.

Failing infrastructures don’t lead to innovation.


SINAInnovation description:

SINAInnovations was a three-day conference at The Mount Sinai Medical Center that examined all aspects of innovation and therapeutic discovery within academic medical centers, from how it can be taught and fostered within academia, to how it can accelerate drug discovery and the commercialization of emerging biotechnologies.

November 19, 2012

The “Ask Bigger Questions” Contest!

Filed under: Cloudera,Contest,Hadoop — Patrick Durusau @ 7:17 pm

The “Ask Bigger Questions” Contest! by Ryan Goldman. (Deadline, Feb. 1 2013)

From the post:

Have you helped your company ask bigger questions? Our mission at Cloudera University is to equip Hadoop professionals with the skills to manage, process, analyze, and monetize more data than they ever thought possible.

Over the past three years, we’ve heard many great stories from our training participants about faster cluster deployments, complex data workflows made simple, and superhero troubleshooting moments. And we’ve heard from executives in all types of businesses that staffing Cloudera Certified professionals gives them confidence that their Hadoop teams have the skills to turn data into breakthrough insights.

Now, it’s your turn to tell us your bigger questions story! Cloudera University is seeking tales of Hadoop success originating with training and certification. How has an investment in your education paid dividends for your company, team, customer, or career?

The most compelling stories chosen from all entrants will receive prizes like Amazon gift cards, discounted Cloudera University training, autographed copies of Hadoop books from O’Reilly Media, and Cloudera swag. We may even turn your story into a case study!

Sign up to participate here. Submissions must be received by Friday, Feb. 1, 2013 to qualify for a prize.

A good marketing technique that might bear imitation.

You don’t have to seek out success stories; the incentives encourage people to bring them to you.

You get good marketing material that is likely to resonate with other users.

Something to think about.

BioInformatics: A Data Deluge with Hadoop to the Rescue

Filed under: Bioinformatics,Cloudera,Hadoop,Impala — Patrick Durusau @ 4:10 pm

BioInformatics: A Data Deluge with Hadoop to the Rescue by Marty Lurie.

From the post:

Cloudera Cofounder and Chief Scientist Jeff Hammerbacher is leading a revolutionary project with Mount Sinai School of Medicine to apply the power of Cloudera’s Big Data platform to critical problems in predicting and understanding the process and treatment of disease.

“We are at the cutting edge of disease prevention and treatment, and the work that we will do together will reshape the landscape of our field,” said Dennis S. Charney, MD, Anne and Joel Ehrenkranz Dean, Mount Sinai School of Medicine and Executive Vice President for Academic Affairs, The Mount Sinai Medical Center. “Mount Sinai is thrilled to join minds with Cloudera.” (Please see http://www.cloudera.com/content/cloudera/en/about/press-center/press-releases/release.html?ReleaseID=1747809 for more details.)

Cloudera is active in many other areas of BioInformatics. Due to Cloudera’s market leadership in Big Data, many DNA mapping programs have specific installation instructions for CDH (Cloudera’s 100% open-source, enterprise-ready distribution of Hadoop and related projects). But rather than just tell you about Cloudera, let’s do a worked example of BioInformatics data – specifically FAERS.

A sponsored piece by Cloudera but walks you through using Impala with the FDA data on adverse drug reactions.

Demonstrates getting started with Impala isn’t hard. Which is true.

What’s lacking is a measure of how hard it is to get good results.

Any old result, good or bad, probably isn’t of interest to most users.

November 15, 2012

Cloudera Glossary

Filed under: Cloudera,Hadoop — Patrick Durusau @ 3:47 pm

Cloudera Glossary

A goodly collection of terms used with Cloudera (Hadoop and related) technology.

I have a weakness for dictionaries, lexicons, grammars and the like so your mileage may vary.

Includes links to additional resources.

November 14, 2012

Cloudera Impala – Fast, Interactive Queries with Hadoop

Filed under: Cloudera,Hadoop,Impala — Patrick Durusau @ 5:50 am

Cloudera Impala – Fast, Interactive Queries with Hadoop by Istvan Szegedi.

From the post:

As discussed in the previous post about Twitter’s Storm, Hadoop is a batch-oriented solution that lacks support for ad-hoc, real-time queries. Many of the players in Big Data have realised the need for fast, interactive queries besides the traditional Hadoop approach. Cloudera, one of the key solution vendors in the Big Data/Hadoop domain, has just recently launched Cloudera Impala that addresses this gap.

As the Cloudera Engineering team described in their blog, their work was inspired by the Google Dremel paper, which is also the basis for Google BigQuery. Cloudera Impala provides a HiveQL-like query language for a wide variety of SELECT statements with WHERE, GROUP BY, and HAVING clauses, with ORDER BY (though currently LIMIT is mandatory with ORDER BY), joins (LEFT, RIGHT, FULL, OUTER, INNER), UNION ALL, external tables, etc. It also supports arithmetic and logical operators and Hive built-in functions such as COUNT, SUM, LIKE, IN, or BETWEEN. It can access data stored on HDFS but it does not use MapReduce; instead it is based on its own distributed query engine.

The current Impala release (Impala 1.0beta) does not support DDL statements (CREATE, ALTER, DROP TABLE); all table creation/modification/deletion has to be executed via Hive and then refreshed in the Impala shell.

Cloudera Impala is open source under the Apache License; the code can be retrieved from GitHub. Its components are written in C++, Java, and Python.

Will get you off to a good start with Impala.
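
To make the ORDER BY / LIMIT caveat concrete, here is a hedged sketch of firing a query at Impala from Python by shelling out to impala-shell. The host, table, and column names are invented, and the -i/-q flags are my assumption about the shell’s options; treat it as an outline rather than tested code.

    # Sketch only: host, table, and column names below are hypothetical.
    import subprocess

    # Note the beta-era restriction the post mentions: ORDER BY requires LIMIT.
    QUERY = """
    SELECT artist, COUNT(*) AS plays
    FROM songs
    WHERE year BETWEEN 2000 AND 2012
    GROUP BY artist
    HAVING COUNT(*) > 10
    ORDER BY plays DESC
    LIMIT 20
    """

    def run_impala_query(query, impalad="impala-host:21000"):
        # -i picks the impalad to connect to, -q runs a single query and exits
        return subprocess.check_output(["impala-shell", "-i", impalad, "-q", query])

    if __name__ == "__main__":
        print(run_impala_query(QUERY))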

November 7, 2012

Impala: Real-time Queries in Hadoop [Recorded Webinar]

Filed under: Cloudera,Hadoop,Impala — Patrick Durusau @ 4:05 pm

Impala: Real-time Queries in Hadoop

From the description:

Learn how Cloudera Impala empowers you to:

  1. Perform interactive, real-time analysis directly on source data stored in Hadoop
  2. Interact with data in HDFS and HBase at the “speed of thought”
  3. Reduce data movement between systems & eliminate double storage

You can also grab the slides here.

Almost fifty-nine minutes.

The speedup over Hive on MapReduce is reported to be 4-30X.

I’m not sure about #2 above but then I lack a skull-jack. Maybe this next Christmas. 😉

October 31, 2012

One To Watch: Apache Crunch

Filed under: Apache Crunch,Cloudera,Hadoop,MapReduce — Patrick Durusau @ 4:37 pm

One To Watch: Apache Crunch by Chris Mayer.

From the post:

Over the past few years, the Apache Software Foundation has become the hub for big data-focused projects. An array of companies have recognised the worth of housing their latest innovative projects at the ASF, with Apache Hadoop and Apache Cassandra two shining examples.

Amongst the number of projects arriving in the Apache Incubator was Apache Crunch. Crunch is a Java library created to eliminate the tedium of writing a MapReduce pipeline. It aims to take hold of the entire process, making writing, testing, and running MapReduce pipelines more efficient and “even fun” (if this Cloudera blog post is to be believed).

That’s a tall order, to make MapReduce pipelines “even fun.” On the other hand, remarkable things have emerged from Apache for decades now.

A project to definitely keep in sight.

October 25, 2012

Cloudera’s Impala and the Semantic “Mosh Pit”

Filed under: Cloudera,Hadoop,Impala — Patrick Durusau @ 4:30 am

Cloudera’s Impala tool binds Hadoop with business intelligence apps by Christina Farr.

From the post:

In traditional circles, Hadoop is viewed as a bright but unruly problem child.

Indeed, it is still in the nascent stages of development. However the scores of “big data” startups that leverage Hadoop will tell you that it is here to stay.

Cloudera, the venture-backed startup that ushered the mainstream deployment of Hadoop, has unveiled a new technology at the Hadoop World, the data-focused conference in New York.

Its new product, known as “Impala”, addresses many of the concerns that large enterprises still have about Hadoop, namely that it does not integrate well with traditional business intelligence applications.

“We have heard this criticism,” said Charles Zedlewski, Cloudera’s VP of Product in a phone interview with VentureBeat. “That’s why we decided to do something about it,” he said.

Impala enables its users to store vast volumes of unwieldy data and run queries in HBase, Hadoop’s NoSQL database. What’s interesting is that it is built to maximise speed: it runs on top of Hadoop storage, but speaks to SQL and works with pre-existing drivers.

Legacy data is a well known concept.

Are we approaching the point of legacy applications? Applications that are too widely/deeply embedded in IT infrastructure to be replaced?

Or at least not replaced quickly?

The semantics of legacy data are known to be fair game for topic maps. Do the semantics of legacy applications offer the same possibilities?

Mapping the semantics of “legacy” applications, their ancestors and descendants, data, legacy and otherwise, results in a semantic mosh pit.

Some strategies for a semantic “mosh pit”:

  1. Prohibit it (we know the success rate on that option)
  2. Ignore it (costly but more “successful” than #1)
  3. Create an app on top of the legacy app (an error repeated isn’t an error, it’s following precedent)
  4. Sample it (but what are you missing?)
  5. Map it (being mindful of cost/benefit)

Which one are you going to choose?

October 20, 2012

Sneak Peek into Skybox Imaging’s Cloudera-powered Satellite System [InaaS?]

Filed under: BigData,Cloudera,Geographic Data,Geography,Intelligence — Patrick Durusau @ 3:02 pm

Sneak Peek into Skybox Imaging’s Cloudera-powered Satellite System by Justin Kestelyn (@kestelyn)

This is a guest post by Oliver Guinan, VP Ground Software, at Skybox Imaging. Oliver is a 15-year veteran of the internet industry and is responsible for all ground system design, architecture and implementation at Skybox.

One of the great promises of the big data movement is using networks of ubiquitous sensors to deliver insights about the world around us. Skybox Imaging is attempting to do just that for millions of locations across our planet.

Skybox is developing a low cost imaging satellite system and web-accessible big data processing platform that will capture video or images of any location on Earth within a couple of days. The low cost nature of the satellite opens the possibility of deploying tens of satellites which, when integrated together, have the potential to image any spot on Earth within an hour.

Skybox satellites are designed to capture light in the harsh environment of outer space. Each satellite captures multiple images of a given spot on Earth. Once the images are transferred from the satellite to the ground, the data needs to be processed and combined to form a single image, similar to those seen within online mapping portals.

With any sensor network, capturing raw data is only the beginning of the story. We at Skybox are building a system to ingest and process the raw data, allowing data scientists and end users to ask arbitrary questions of the data, then publish the answers in an accessible way and at a scale that grows with the number of satellites in orbit. We selected Cloudera to support this deployment.

Now is the time to start planning topic map based products that can incorporate this type of data.

There are lots of folks who are “curious” about what is happening next door, in the next block, a few “klicks” away, across the border, etc.

Not all of them have the funds for private “keyhole” satellites and vacuum data feeds. But they may have money to pay you for efficient and effective collation of intelligence data.

Topic maps empowering “Intelligence as a Service (InaaS)”?

October 19, 2012

What’s New in CDH4.1 Pig

Filed under: Cloudera,Hadoop,Pig — Patrick Durusau @ 3:28 pm

What’s New in CDH4.1 Pig by Cheolsoo Park.

From the post:

Apache Pig is a platform for analyzing large data sets that provides a high-level language called Pig Latin. Pig users can write complex data analysis programs in an intuitive and compact manner using Pig Latin.

Among many other enhancements, CDH4.1, the newest release of Cloudera’s open-source Hadoop distro, upgrades Pig from version 0.9 to version 0.10. This post provides a summary of the top seven new features introduced in CDH4.1 Pig.

Cheolsoo covers these new features:

  • Boolean Data Type
  • Nested FOREACH and CROSS
  • Ruby UDFs
  • LIMIT / SPLIT by Expression
  • Default SPLIT Destination
  • Syntactical Sugar for TOTUPLE, TOBAG, and TOMAP
  • AvroStorage Improvements

Enjoy!

October 18, 2012

Axemblr’s Java Client for the Cloudera Manager API

Filed under: Cloud Computing,Cloudera,Hadoop — Patrick Durusau @ 10:38 am

Axemblr’s Java Client for the Cloudera Manager API by Justin Kestelyn.

From the post:

Axemblr, purveyors of a cloud-agnostic MapReduce Web Service, have recently announced the availability of an Apache-licensed Java Client for the Cloudera Manager API.

The task at hand, according to Axemblr, is to “deploy Hadoop on Cloud with as little user interaction as possible. We have the code to provision the hosts but we still need to install and configure Hadoop on all nodes and make it so the user has a nice experience doing it.” And voila, the answer is Cloudera Manager, with the process made easy via the REST API introduced in Release 4.0.

Thus, says Axemblr: “In the pursuit of our greatest desire (second only to coffee early in the morning), we ended up writing a Java client for Cloudera Manager’s API. Thus we achieved to automate a CDH3 Hadoop installation on Amazon EC2 and Rackspace Cloud. We also decided to open source the client so other people can play along.”

Another goodie to ease your way to Hadoop deployment on your favorite cloud.

Do you remember the lights at radio stations that would show “On Air?”

I need an “On Cloud” sign that lights up. More realistic than the data appliance.
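
If Java isn’t your thing, the same REST API is reachable from Python. A rough sketch, assuming a Cloudera Manager instance at cm-host:7180, default admin credentials, and the /api/v1 resource paths (all assumptions about a typical CM 4.x install):

    # Rough sketch: host, port, credentials, and paths are assumptions, not
    # something taken from the Axemblr client or the post above.
    import requests

    CM_API = "http://cm-host:7180/api/v1"
    AUTH = ("admin", "admin")  # hypothetical credentials

    def get(resource):
        # Fetch a resource such as /clusters or /hosts and return parsed JSON
        response = requests.get(CM_API + resource, auth=AUTH)
        response.raise_for_status()
        return response.json()

    if __name__ == "__main__":
        for cluster in get("/clusters").get("items", []):
            print(cluster.get("name"))
        for host in get("/hosts").get("items", []):
            print(host.get("hostname"))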

October 16, 2012

Analyzing Twitter Data with Hadoop, Part 2: Gathering Data with Flume

Filed under: Cloudera,Flume,Hadoop,Tweets — Patrick Durusau @ 9:15 am

Analyzing Twitter Data with Hadoop, Part 2: Gathering Data with Flume by Jon Natkins.

From the post:

This is the second article in a series about analyzing Twitter data using some of the components of the Hadoop ecosystem available in CDH, Cloudera’s open-source distribution of Hadoop and related projects. In the first article, you learned how to pull CDH components together into a single cohesive application, but to really appreciate the flexibility of each of these components, we need to dive deeper.

Every story has a beginning, and every data pipeline has a source. So, to build Hadoop applications, we need to get data from a source into HDFS.

Apache Flume is one way to bring data into HDFS using CDH. The Apache Flume website describes Flume as “a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.” At the most basic level, Flume enables applications to collect data from its origin and send it to a resting location, such as HDFS. At a slightly more detailed level, Flume achieves this goal by defining dataflows consisting of three primary structures: sources, channels and sinks. The pieces of data that flow through Flume are called events, and the processes that run the dataflow are called agents.

A very good introduction to the use of Flume!

Does it seem to you that the number of examples using Twitter, not just for “big data” but in general, is on the rise?

Just a personal observation, subject to all the flaws of such observations (“all the buses were going the other way”).

Judging from the state of my inbox, some people are still writing more than 140 characters at a time.

Will it make a difference in our tools/thinking if we focus on shorter strings as opposed to longer ones?

October 10, 2012

What is Hadoop Metrics2?

Filed under: Cloudera,Hadoop,Systems Administration — Patrick Durusau @ 4:17 pm

What is Hadoop Metrics2? by Ahmed Radwan.

I’ve been wondering about that. How about you? 😉

From the post:

Metrics are collections of information about Hadoop daemons, events and measurements; for example, data nodes collect metrics such as the number of blocks replicated, number of read requests from clients, and so on. For that reason, metrics are an invaluable resource for monitoring Hadoop services and an indispensable tool for debugging system problems.

This blog post focuses on the features and use of the Metrics2 system for Hadoop, which allows multiple metrics output plugins to be used in parallel, supports dynamic reconfiguration of metrics plugins, provides metrics filtering, and allows all metrics to be exported via JMX.

However cool the software, you can’t ever really get away from managing it.

And it isn’t a bad skill to have. Read on!

October 3, 2012

CDH4.1 Now Released!

Filed under: Cloudera,Flume,Hadoop,HBase,HDFS,Hive,Pig — Patrick Durusau @ 8:28 pm

CDH4.1 Now Released! by Charles Zedlewski.

From the post:

We’re pleased to announce the availability of CDH4.1. We’ve seen excellent adoption of CDH4.0 since it went GA at the end of June and a number of exciting use cases have moved to production. CDH4.1 is an update that has a number of fixes but also a number of useful enhancements. Among them:

  • Quorum based storage – Quorum-based Storage for HDFS provides the ability for HDFS to store its own NameNode edit logs, allowing you to run a highly available NameNode without external storage or custom fencing.
  • Hive security and concurrency – we’ve fixed some long standing issues with running Hive. With CDH4.1, it is now possible to run a shared Hive instance where users submit queries using Kerberos authentication. In addition this new Hive server supports multiple users submitting queries at the same time.
  • Support for DataFu – the LinkedIn data science team was kind enough to open source their library of Pig UDFs that make it easier to perform common jobs like sessionization or set operations. Big thanks to the LinkedIn team!!!
  • Oozie workflow builder – since we added Oozie to CDH more than two years ago, we have often had requests to make it easier to develop Oozie workflows. The newly enhanced job designer in Hue enables users to use a visual tool to build and run Oozie workflows.
  • FlumeNG improvements – since its release, FlumeNG has become the backbone for some exciting data collection projects, in some cases collecting as much as 20TB of new event data per day. In CDH4.1 we added an HBase sink, metrics for monitoring, and a number of performance improvements.
  • Various performance improvements – CDH4.1 users should experience a boost in their MapReduce performance from CDH4.0.
  • Various security improvements – CDH4.1 enables users to configure the system to encrypt data in flight during the shuffle phase. CDH now also applies Hadoop security to users who access the filesystem via a FUSE mount.

It’s releases like this that make me wish I spent more time writing documentation for software. To try out all the cool features with no real goal other than trying them out.

Enjoy!

September 24, 2012

Schedule This! Strata + Hadoop World Speakers from Cloudera

Filed under: Cloudera,Conferences,Hadoop — Patrick Durusau @ 2:38 pm

Schedule This! Strata + Hadoop World Speakers from Cloudera by Justin Kestelyn.

Oct. 23-25, 2012, New York City

From the post:

We’re getting really close to Strata Conference + Hadoop World 2012 (just over a month away), schedule planning-wise. So you may want to consider adding the tutorials, sessions, and keynotes below to your calendar! (Start times are always subject to change of course.)

The ones listed below are led or co-led by Clouderans, but there is certainly a wide range of attractive choices beyond what you see here. We just want to ensure that you put these particular ones high on your consideration list.

Just in case the Clouderans aren’t enough incentive to attend (they should be), consider the full schedule for the conference.

September 19, 2012

Analyzing Twitter Data with Hadoop [Hiding in a Public Data Stream]

Filed under: Cloudera,Flume,Hadoop,HDFS,Hive,Oozie,Tweets — Patrick Durusau @ 10:46 am

Analyzing Twitter Data with Hadoop by Jon Natkins

From the post:

Social media has gained immense popularity with marketing teams, and Twitter is an effective tool for a company to get people excited about its products. Twitter makes it easy to engage users and communicate directly with them, and in turn, users can provide word-of-mouth marketing for companies by discussing the products. Given limited resources, and knowing we may not be able to talk to everyone we want to target directly, marketing departments can be more efficient by being selective about whom we reach out to.

In this post, we’ll learn how we can use Apache Flume, Apache HDFS, Apache Oozie, and Apache Hive to design an end-to-end data pipeline that will enable us to analyze Twitter data. This will be the first post in a series. The posts to follow will describe, in more depth, how each component is involved and how the custom code operates. All the code and instructions necessary to reproduce this pipeline are available on the Cloudera Github.

Looking forward to more posts in this series!

Social media is a focus for marketing teams for obvious reasons.

Analysis of snaps, crackles and pops en masse.

What if you wanted to communicate securely with others using social media?

Thinking of something more robust and larger than two (or three) lovers agreeing on code words.

How would you hide in a public data stream?

Or the converse, how would you hunt for someone in a public data stream?

How would you use topic maps to manage the semantic side of such a process?

September 12, 2012

Cloudera Enterprise in Less Than Two Minutes

Filed under: Cloud Computing,Cloudera,Hadoop,MapReduce — Patrick Durusau @ 4:10 pm

Cloudera Enterprise in Less Than Two Minutes by Justin Kestelyn.

I had to pause “Born Under A Bad Sign” by Cream to watch the video but it was worth it!

Good example of selling technique too!

Focused on common use cases and user concerns. Promises a solution without all the troublesome details.

Time enough for that after a prospect is interested. And even then, ease them into the details.

September 5, 2012

What Do Real-Life Hadoop Workloads Look Like?

Filed under: Cloudera,Hadoop — Patrick Durusau @ 3:23 pm

What Do Real-Life Hadoop Workloads Look Like? by Yanpei Chen.

From the post:

Organizations in diverse industries have adopted Hadoop-based systems for large-scale data processing. As a leading force in Hadoop development with customers in half of the Fortune 50 companies, Cloudera is in a unique position to characterize and compare real-life Hadoop workloads. Such insights are essential as developers, data scientists, and decision makers reflect on current use cases to anticipate technology trends.

Recently we collaborated with researchers at UC Berkeley to collect and analyze a set of Hadoop traces. These traces come from Cloudera customers in e-commerce, telecommunications, media, and retail (Table 1). Here I will explain a subset of the observations, and the thoughts they triggered about challenges and opportunities in the Hadoop ecosystem, both present and in the future.

Specific (and useful) to Hadoop installations, but I suspect this kind of workload analysis is also useful for semantic processing in general.

Questions like:

  • What topics are “hot spots” of merging activity?
  • Where do those topics originate?
  • How do changes in merging rules impact the merging process?

are only some of the ones that may be of interest.

August 31, 2012

Developing CDH Applications with Maven and Eclipse

Filed under: Cloudera,Eclipse,Hadoop,Maven — Patrick Durusau @ 10:54 am

Developing CDH Applications with Maven and Eclipse by Jon Natkins

Learn how to configure a basic Maven project that will be able to build applications against CDH

Apache Maven is a build automation tool that can be used for Java projects. Since nearly all the Apache Hadoop ecosystem is written in Java, Maven is a great tool for managing projects that build on top of the Hadoop APIs. In this post, we’ll configure a basic Maven project that will be able to build applications against CDH (Cloudera’s Distribution Including Apache Hadoop) binaries.

Maven projects are defined using an XML file called pom.xml, which describes things like the project’s dependencies on other modules, the build order, and any other plugins that the project uses. A complete example of the pom.xml described below, which can be used with CDH, is available on Github. (To use the example, you’ll need at least Maven 2.0 installed.) If you’ve never set up a Maven project before, you can get a jumpstart by using Maven’s quickstart archetype, which generates a small initial project layout.

I don’t have a Fairness Doctrine but thought since I had a post on make today that one on Maven would not be out of place.

Both of them are likely to figure in active topic map/semantic application work.

BTW, since both “make” and “maven” have multiple meanings, how would you index this post to separate its uses of those terms from other meanings?

Would it make a difference if, as appears above, some instances are surrounded with hyperlinks?

How would I indicate that the hyperlinks are identity references?

Or some subset of hyperlinks are identity references?

August 27, 2012

Hadoop on your PC: Cloudera’s CDH4 virtual machine

Filed under: Cloudera,Hadoop — Patrick Durusau @ 2:10 pm

Hadoop on your PC: Cloudera’s CDH4 virtual machine by Andrew Brust.

From the post:

Want to learn Hadoop without building your own cluster or paying for cloud resources? Then download Cloudera’s Hadoop distro and run it in a virtual machine on your PC. I’ll show you how.

As good a way as any to get your feet wet with Hadoop.

Don’t be surprised if in a week or two you have both the nucleus of a cluster and a cloud account. Reasoning that you need to be prepared for any client’s environment of choice.

Or at least that is what you will tell your significant other.

August 24, 2012

Process a Million Songs with Apache Pig

Filed under: Amazon Web Services AWS,Cloudera,Data Mining,Hadoop,Pig — Patrick Durusau @ 3:22 pm

Process a Million Songs with Apache Pig by Justin Kestelyn.

From the post:

The following is a guest post kindly offered by Adam Kawa, a 26-year-old Hadoop developer from Warsaw, Poland. This post was originally published in a slightly different form at his blog, Hakuna MapData!

Recently I have found an interesting dataset, called Million Song Dataset (MSD), which contains detailed acoustic and contextual data about a million songs. For each song we can find information like title, hotness, tempo, duration, danceability, and loudness as well as artist name, popularity, localization (latitude and longitude pair), and many other things. There are no music files included here, but the links to MP3 song previews at 7digital.com can be easily constructed from the data.

The dataset consists of 339 tab-separated text files. Each file contains about 3,000 songs and each song is represented as one separate line of text. The dataset is publicly available and you can find it at Infochimps or Amazon S3. Since the total size of this data sums up to around 218GB, processing it using one machine may take a very long time.

Definitely, a much more interesting and efficient approach is to use multiple machines and process the songs in parallel by taking advantage of open-source tools from the Apache Hadoop ecosystem (e.g. Apache Pig). If you have your own machines, you can simply use CDH (Cloudera’s Distribution including Apache Hadoop), which includes the complete Apache Hadoop stack. CDH can be installed manually (quickly and easily by typing a couple of simple commands) or automatically using Cloudera Manager Free Edition (which is Cloudera’s recommended approach). Both CDH and Cloudera Manager are freely downloadable here. Alternatively, you may rent some machines from Amazon with Hadoop already installed and process the data using Amazon’s Elastic MapReduce (here is a cool description written by Paul Lemere on how to use it and pay as little as $1, and here is my presentation about Elastic MapReduce given at the second meeting of the Warsaw Hadoop User Group).

An example of offering the reader their choice of implementation detail, on or off a cloud. 😉

Suspect that is going to become increasingly common.

August 13, 2012

CDH3 update 5 is now available

Filed under: Avro,Cloudera,Flume,Hadoop,HDFS,Hive — Patrick Durusau @ 4:17 pm

CDH3 update 5 is now available by Arvind Prabhakar

From the post:

We are happy to announce the general availability of CDH3 update 5. This update is a maintenance release of CDH3 platform and provides a considerable amount of bug-fixes and stability enhancements. Alongside these fixes, we have also included a few new features, most notable of which are the following:

  • Flume 1.2.0 – Provides a durable file channel and many more features over the previous release.
  • Hive AvroSerDe – Replaces the Haivvreo SerDe and provides robust support for Avro data format.
  • WebHDFS – A full read/write REST API to HDFS.

A maintenance release. Keeping up with maintenance releases is good practice before major releases arrive.
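
The WebHDFS item is the one I am most likely to poke at first. A minimal sketch of what that REST API looks like from Python, assuming the NameNode’s HTTP port (50070 by default in this era) and a hypothetical path:

    # Minimal sketch: namenode host, port, user, and paths are assumptions.
    import requests

    WEBHDFS = "http://namenode:50070/webhdfs/v1"

    def list_directory(path, user="hdfs"):
        # op=LISTSTATUS returns directory contents as JSON
        response = requests.get(WEBHDFS + path,
                                params={"op": "LISTSTATUS", "user.name": user})
        response.raise_for_status()
        return response.json()["FileStatuses"]["FileStatus"]

    def read_file(path, user="hdfs"):
        # op=OPEN redirects to a datanode; requests follows the redirect
        response = requests.get(WEBHDFS + path,
                                params={"op": "OPEN", "user.name": user})
        response.raise_for_status()
        return response.content

    if __name__ == "__main__":
        for status in list_directory("/user"):
            print(status["pathSuffix"], status["type"])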

August 3, 2012

Column Statistics in Hive

Filed under: Cloudera,Hive,Merging,Statistics — Patrick Durusau @ 2:48 pm

Column Statistics in Hive by Shreepadma Venugopalan.

From the post:

Over the last couple of months the Hive team at Cloudera has been working hard to bring a bunch of exciting new features to Hive. In this blog post, I’m going to talk about one such feature – Column Statistics in Hive – and how Hive’s query processing engine can benefit from it. The feature is currently a work in progress but we expect it to be available for review imminently.

Motivation

While there are many possible execution plans for a query, some plans are more optimal than others. The query optimizer is responsible for generating an efficient execution plan for a given SQL query from the space of all possible plans. Currently, Hive’s query optimizer uses rules of thumb to generate an efficient execution plan for a query. While such rule-of-thumb optimizations transform the query plan into a more efficient one, the resulting plan is not always the most efficient execution plan.

In contrast, the query optimizer in a traditional RDBMS is cost based; it uses the statistical properties of the input column values to estimate the cost of alternative query plans and chooses the plan with the lowest cost. The cost model for query plans assigns an estimated execution cost to the plans. The cost model is based on the CPU and I/O costs of query execution for every operator in the query plan. As an example, consider a query that represents a join among {A, B, C} with the predicate {A.x == B.x == C.x}. Assume table A has a total of 500 records, table B has a total of 6000 records, and table C has a total of 1000 records. In the absence of cost based query optimization, the system picks the join order specified by the user. In our example, let us further assume that the result of joining A and B yields 2000 records and the result of joining A and C yields 50 records. Hence the cost of performing the join between A, B and C, without join reordering, is the cost of joining A and B plus the cost of joining the output of A Join B with C. In our example this would result in a cost of (500 * 6000) + (2000 * 1000). On the other hand, a cost based optimizer (CBO) in a RDBMS would pick the more optimal alternate order [(A Join C) Join B], thus resulting in a cost of (500 * 1000) + (50 * 6000). However, in order to pick the more optimal join order the CBO needs cardinality estimates on the join column.

Today, Hive supports statistics at the table and partition level – count of files, raw data size, count of rows etc, but doesn’t support statistics on column values. These table and partition level statistics are insufficient for the purpose of building a CBO because they don’t provide any information about the individual column values. Hence obtaining the statistical summary of the column values is the first step towards building a CBO for Hive.

In addition to join reordering, Hive’s query optimizer will be able to take advantage of column statistics to decide whether to perform a map side aggregation as well as estimate the cardinality of operators in the execution plan better.
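
The arithmetic in the example is worth working through once; a few lines make the payoff of join reordering concrete:

    # The join-ordering example from the post, worked through directly.
    rows_a, rows_b, rows_c = 500, 6000, 1000
    rows_a_join_b = 2000  # result size of A join B (from the post)
    rows_a_join_c = 50    # result size of A join C (from the post)

    # User-specified order: (A join B), then join the result with C
    cost_user_order = rows_a * rows_b + rows_a_join_b * rows_c  # 5,000,000

    # CBO-chosen order: (A join C), then join the result with B
    cost_cbo_order = rows_a * rows_c + rows_a_join_c * rows_b   # 800,000

    print(cost_user_order, cost_cbo_order)  # the reordered plan is over 6x cheaper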

Some days I wonder where improvements to algorithms and data structures are going to lead?

Other days, I just enjoy the news.

Today is one of the latter.

PS: What would a cost based optimizer (CBO) look like for merging operations? Or perhaps better, a merge cost estimator (MCE)? Metered merging, anyone?

