Archive for the ‘Cloudera’ Category

Cloudera Search: The Newest Hadoop Framework for CDH Users and Developers

Wednesday, June 5th, 2013

Cloudera Search: The Newest Hadoop Framework for CDH Users and Developers by Doug Cutting.

From the post:

One of the unexpected pleasures of open source development is the way that technologies adapt and evolve for uses you never originally anticipated.

Seven years ago, Apache Hadoop sprang from a project based on Apache Lucene, aiming to solve a search problem: how to scalably store and index the internet. Today, it’s my pleasure to announce Cloudera Search, which uses Lucene (among other things) to make search solve a Hadoop problem: how to let non-technical users interactively explore and analyze data in Hadoop.

Cloudera Search is released to public beta, as of today. (See a demo here; get installation instructions here.) Powered by Apache Solr 4.3, Cloudera Search allows hundreds of users to search petabytes of Hadoop data interactively.

In the context of our platform, CDH (Cloudera’s Distribution including Apache Hadoop), Cloudera Search is another framework much like MapReduce and Cloudera Impala. It’s another way for users to interact with Hadoop data and for developers to build Hadoop applications. Each framework in our platform is designed to cater to different families of applications and users:

(…)

Did you catch the line:

Powered by Apache Solr 4.3, Cloudera Search allows hundreds of users to search petabytes of Hadoop data interactively.

Does that make you feel better about scale issues?

Also see: Cloudera Search Webinar, Wednesday, June 19, 2013 11AM-12PM PT.

A serious step up in capabilities.

Apache Pig Editor in Hue 2.3

Saturday, May 25th, 2013

Apache Pig Editor in Hue 2.3

From the post:

In the previous installment of the demo series about Hue — the open source Web UI that makes Apache Hadoop easier to use — you learned how to analyze data with Hue using Apache Hive via Hue’s Beeswax and Catalog applications. In this installment, we’ll focus on using the new editor for Apache Pig in Hue 2.3.

Complementing the editors for Hive and Cloudera Impala, the Pig editor provides a great starting point for exploration and real-time interaction with Hadoop. This new application lets you edit and run Pig scripts interactively in an editor tailored for a great user experience. Features include:

  • UDFs and parameters (with default value) support
  • Autocompletion of Pig keywords, aliases, and HDFS paths
  • Syntax highlighting
  • One-click script submission
  • Progress, result, and logs display
  • Interactive single-page application

Here’s a short video demoing its capabilities and ease of use:

(…)

How are you editing your Pig scripts now?

How are you documenting the semantics of your Pig scripts?

How do you search across your Pig scripts?

How-to: Configure Eclipse for Hadoop Contributions

Thursday, May 16th, 2013

How-to: Configure Eclipse for Hadoop Contributions by Karthik Kambatla.

From the post:

Contributing to Apache Hadoop or writing custom pluggable modules requires modifying Hadoop’s source code. While it is perfectly fine to use a text editor to modify Java source, modern IDEs simplify navigation and debugging of large Java projects like Hadoop significantly. Eclipse is a popular choice thanks to its broad user base and multitude of available plugins.

This post covers configuring Eclipse to modify Hadoop’s source. (Developing applications against CDH using Eclipse is covered in a different post.) Hadoop has changed a great deal since our previous post on configuring Eclipse for Hadoop development; here we’ll revisit configuring Eclipse for the latest “flavors” of Hadoop. Note that trunk and other release branches differ in their directory structure, feature set, and build tools they use. (The EclipseEnvironment Hadoop wiki page is a good starting point for development on trunk.)

A post to ease your way towards contributing to the Hadoop project!

Or if you simply want to know the code you are running cold.

Or something in between!

Analyzing Twitter: An End-to-End Data Pipeline Recap

Monday, May 13th, 2013

Analyzing Twitter: An End-to-End Data Pipeline Recap by Jason Barbour.

Jason reviews presentations at a recent Data Science MD meeting:

Starting off the night, Joey Echeverria, a Principal Solutions Architect, first discussed a big data architecture and how key components of a relational data management system can be replaced with current big data technologies. With Twitter being increasingly popular with marketing teams, analyzing Twitter data becomes a perfect use case to demonstrate a complete big data pipeline.

(…)

Following Joey, Sean Busbey, a Solutions Architect at Cloudera, discussed working with Mahout, a scalable machine learning library for Hadoop. Sean first introduced the three C’s of machine learning: classification, clustering, and collaborative filtering. With classification, learning from a training set is supervised, and new examples can be categorized. Clustering allows examples to be grouped together with common features, while collaborative filtering allows new candidates to be suggested.

Great summaries, links to additional resources and the complete slides.

Check the DC Data Community Events Calendar if you plan to visit the DC area. (I assume residents already do.)

Cloudera Development Kit (CDK)…

Tuesday, May 7th, 2013

Cloudera Development Kit (CDK): Hadoop Application Development Made Easier by Eric Sammer & Tom White.

From the post:

At Cloudera, we have the privilege of helping thousands of developers learn Apache Hadoop, as well as build and deploy systems and applications on top of Hadoop. While we (and many of you) believe that platform is fast becoming a staple system in the data center, we’re also acutely aware of its complexities. In fact, this is the entire motivation behind Cloudera Manager: to make the Hadoop platform easy for operations staff to deploy and manage.

So, we’ve made Hadoop much easier to “consume” for admins and other operators — but what about for developers, whether working for ISVs, SIs, or users? Until now, they’ve largely been on their own.

That’s why we’re really excited to announce the Cloudera Development Kit (CDK), a new open source project designed to help developers get up and running to build applications on CDH, Cloudera’s open source distribution including Hadoop, faster and easier than before. The CDK is a collection of libraries, tools, examples, and documentation engineered to simplify the most common tasks when working with the platform. Just like CDH, the CDK is 100% free, open source, and licensed under the same permissive Apache License v2, so you can use the code any way you choose in your existing commercial code base or open source project.

The CDK lives on GitHub where users can freely browse, download, fork, and contribute back to the source. Community contributions are not only welcome but strongly encouraged. Since most Java developers use tools such as Maven (or tools that are compatible with Maven repositories), artifacts are also available from the Cloudera Maven Repository for easy project integration.

The CDK is a collection of libraries, tools, examples, and docs engineered to simplify common tasks.

What’s In There Today

Our goal is to release a number of CDK modules over time. The first module that can be found in the current release is the CDK Data module; a set of APIs to drastically simplify working with datasets in Hadoop filesystems such as HDFS and the local filesystem. The Data module handles automatic serialization and deserialization of Java POJOs as well as Avro Records, automatic compression, file and directory layout and management, automatic partitioning based on configurable functions, and a metadata provider plugin interface to integrate with centralized metadata management systems (including HCatalog). All Data APIs are fully documented with javadoc. A reference guide is available to walk you through the important parts of the module, as well. Additionally, a set of examples is provided to help you see the APIs in action immediately.

Here’s to hoping that vendor support as shown for Hadoop, Lucene/Solr, R, (who am I missing?), continues and spreads to other areas of software development.

Impala 1.0

Wednesday, May 1st, 2013

Impala 1.0: Industry’s First Production-Ready SQL-on-Hadoop Solution

From the post:

Cloudera, the category leader that sets the standard for Apache Hadoop in the enterprise, today announced the general availability of Cloudera Impala, its open source, interactive SQL query engine for analyzing data stored in Hadoop clusters in real time. Cloudera was first-to-market with its SQL-on-Hadoop offering, releasing Impala to open source as a public beta offering in October 2012. Since that time, it has worked closely with customers and open source users, rigorously testing and refining the platform in real world applications to deliver today’s production-hardened and customer validated release, designed from the ground-up for enterprise workloads. The company noted that adoption of the platform has been strong: over 40 enterprise customers and open source users are using Impala today, including 37signals, Expedia, Six3 Systems, Stripe, and Trion Worlds. With its 1.0 release, Impala extends Cloudera’s unified Platform for Big Data, which is designed specifically to bring different computation frameworks and applications to a single pool of data, using a common set of system resources.

The bigger data pools get, the more opportunity there is for semantic confusion.

Or to put that more positively, the greater the market for tools to lessen or avoid semantic confusion.

;-)

Cloudera ML:…

Friday, March 22nd, 2013

Cloudera ML: New Open Source Libraries and Tools for Data Scientists by Josh Wills.

From the post:

Today, I’m pleased to introduce Cloudera ML, an Apache licensed collection of Java libraries and command line tools to aid data scientists in performing common data preparation and model evaluation tasks. Cloudera ML is intended to be an educational resource and reference implementation for new data scientists that want to understand the most effective techniques for building robust and scalable machine learning models on top of Hadoop.

…[details about clustering omitted]

If you were paying at least somewhat close attention, you may have noticed that the algorithms I’m describing above are essentially clever sampling techniques. With all of the hype surrounding big data, sampling has gotten a bit of a bad rap, which is unfortunate, since most of the work of a data scientist involves finding just the right way to turn a large data set into a small one. Of course, it usually takes a few hundred tries to find that right way, and Hadoop is a powerful tool for exploring the space of possible features and how they should be weighted in order to achieve our objectives.

Wherever possible, we want to minimize the amount of parameter tuning required for any model we create. At the very least, we should try to provide feedback on the quality of the model that is created by different parameter settings. For k-means, we want to help data scientists choose a good value of K, the number of clusters to create. In Cloudera ML, we integrate the process of selecting a value of K into the data sampling and cluster fitting process by allowing data scientists to evaluate multiple values of K during a single run of the tool and reporting statistics about the stability of the clusters, such as the prediction strength.

Finally, we want to investigate the anomalous events in our clustering- those points that don’t fit well into any of the larger clusters. Cloudera ML includes a tool for using the clusters that were identified by the scalable k-means algorithm to compute an assignment of every point in our large data set to a particular cluster center, including the distance from that point to its assigned center. This information is created via a MapReduce job that outputs a CSV file that can be analyzed interactively using Cloudera Impala or your preferred analytical application for processing data stored in Hadoop.
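The assignment step described above can be sketched in a few lines of Python. This is only an in-memory illustration of the idea (nearest center plus distance, then look at the farthest points), not Cloudera ML's code or its MapReduce job; the function names and sample data are hypothetical.

```python
import math

def assign_to_centers(points, centers):
    """Return (point, center_index, distance) for each point's nearest center."""
    rows = []
    for point in points:
        # Euclidean distance from this point to every candidate center
        distances = [math.dist(point, center) for center in centers]
        nearest = min(range(len(centers)), key=distances.__getitem__)
        rows.append((point, nearest, distances[nearest]))
    return rows

def farthest_points(rows, count=1):
    """Points with the largest distance to their assigned center (anomaly candidates)."""
    return sorted(rows, key=lambda row: row[2], reverse=True)[:count]

# Hypothetical data: two cluster centers and three points
centers = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.0, 1.0), (10.0, 9.0), (5.0, 0.0)]

assignments = assign_to_centers(points, centers)
for point, center_id, distance in farthest_points(assignments):
    print(point, center_id, round(distance, 2))
```

Sorting by distance is the simplest possible way to surface poorly fitting points; on a large data set the same per-point calculation would run in parallel and the ranking would happen in the query tool.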

Cloudera ML is under active development, and we are planning to add support for pivot tables, Hive integration via HCatalog, and tools for building ensemble classifiers over the next few weeks. We’re eager to get feedback on bug fixes and things that you would like to see in the tool, either by opening an issue or a pull request on our GitHub repository. We’re also having a conversation about training a new generation of data scientists next Tuesday, March 26th, at 2pm ET/11am PT, and I hope that you will be able to join us.

Another great project by Cloudera!

Training a New Generation of Data Scientists

Thursday, March 21st, 2013

Training a New Generation of Data Scientists by Ryan Goldman.

From the post:

Data scientists drive data as a platform to answer previously unimaginable questions. These multi-talented data professionals are in demand like never before because they identify or create some of the most exciting and potentially profitable business opportunities across industries. However, a scarcity of existing external talent will require companies of all sizes to find, develop, and train their people with backgrounds in software engineering, statistics, or traditional business intelligence as the next generation of data scientists.

Join us for the premiere of Training a New Generation of Data Scientists on Tuesday, March 26, at 2pm ET/11am PT. In this video, Cloudera’s Senior Director of Data Science, Josh Wills, will discuss what data scientists do, how they think about problems, the relationship between data science and Hadoop, and how Cloudera training can help you join this increasingly important profession. Following the video, Josh will answer your questions about data science, Hadoop, and Cloudera’s Introduction to Data Science: Building Recommender Systems course.

This could be fun!

And if nothing else, it will give you the tools to distinguish legitimate training, like Cloudera’s, from the “How to make $millions in real estate” sort of training, offered by the guy who makes his money selling lectures and books.

As “hot” as data science is, you don’t have to look far to find that sort of training.

…Apache HBase REST Interface, Part 1

Tuesday, March 12th, 2013

How-to: Use the Apache HBase REST Interface, Part 1 by Jesse Anderson.

From the post:

There are various ways to access and interact with Apache HBase. The Java API provides the most functionality, but many people want to use HBase without Java.

There are two main approaches for doing that: One is the Thrift interface, which is the faster and more lightweight of the two options. The other way to access HBase is using the REST interface, which uses HTTP verbs to perform an action, giving developers a wide choice of languages and programs to use.

This series of how-to’s will discuss the REST interface and provide Python code samples for accessing it. The first post will cover HBase REST, some Python caveats, and table administration. The second post will explain how to insert multiple rows at a time using XML and JSON. The third post will show how to get multiple rows using XML and JSON. The full code samples can be found on my GitHub account.
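As a rough idea of what the JSON side of a multi-row insert involves, here is a minimal Python sketch that only builds a request body and makes no HTTP call. It assumes the HBase REST JSON layout as I understand it (a `Row` list whose `key`, `column`, and `$` fields are base64-encoded); check that against the HBase documentation and the author's samples before relying on it. The function names, table data, and column names are hypothetical.

```python
import base64
import json

def b64(text):
    """Base64-encode a string, as the REST JSON format is assumed to require."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

def build_rows_payload(rows):
    """Build a JSON body from {row_key: {"family:qualifier": value}}.

    The field names ("Row", "key", "Cell", "column", "$") are an assumption
    about the HBase REST JSON schema, not taken from the post.
    """
    return json.dumps({
        "Row": [
            {
                "key": b64(row_key),
                "Cell": [
                    {"column": b64(column), "$": b64(value)}
                    for column, value in cells.items()
                ],
            }
            for row_key, cells in rows.items()
        ]
    })

# Hypothetical rows: two keys, one column family "cf"
payload = build_rows_payload({
    "row1": {"cf:message": "hello"},
    "row2": {"cf:message": "world"},
})
print(payload)
```

Keeping payload construction separate from the HTTP call makes the encoding easy to check by hand before anything is sent to a server.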

Post also has a reminder about HBaseCon 2013 (June 13, San Francisco).

Cloudera Impala: A Modern SQL Engine for Hadoop [Webinar - 10 Jan 2013]

Wednesday, January 9th, 2013

Cloudera Impala: A Modern SQL Engine for Hadoop

From the post:

Join us for this technical deep dive about Cloudera Impala, the project that makes scalable parallel database technology available to the Hadoop community for the first time. Impala is an open-sourced code base that allows users to issue low-latency queries to data stored in HDFS and Apache HBase using familiar SQL operators.

Presenter Marcel Kornacker, creator of Impala, will begin with an overview of Impala from the user’s perspective, followed by an overview of Impala’s architecture and implementation, and will conclude with a comparison of Impala with Apache Hive, commercial MapReduce alternatives and traditional data warehouse infrastructure.

Looking forward to the comparison part. Picking the right tool for a job is an important first step.

How-To: Run a MapReduce Job in CDH4

Friday, December 14th, 2012

How-To: Run a MapReduce Job in CDH4 by Sandy Ryza.

From the post:

This is the first post in a series that will get you going on how to write, compile, and run a simple MapReduce job on Apache Hadoop. The full code, along with tests, is available at http://github.com/cloudera/mapreduce-tutorial. The program will run on either MR1 or MR2.

We’ll assume that you have a running Hadoop installation, either locally or on a cluster, and your environment is set up correctly so that typing “hadoop” into your command line gives you some notes on usage. Detailed instructions for installing CDH, Cloudera’s open-source, enterprise-ready distro of Hadoop and related projects, are available here: https://ccp.cloudera.com/display/CDH4DOC/CDH4+Installation. We’ll also assume you have Maven installed on your system, as this will make compiling your code easier. Note that Maven is not a strict dependency; we could also compile using Java on the command line or with an IDE like Eclipse.

The Use Case

There’s been a lot of brawling on our pirate ship recently. Not so rarely, one of the mates will punch another one in the mouth, knocking a tooth out onto the deck. Our poor sailors will wake up the next day with an empty bottle of rum, wondering who’s responsible for the gap between their teeth. All this violence has gotten out of hand, so as a deterrent, we’d like to provide everyone with a list of everyone that’s ever left them with a gap. Luckily, we’ve been able to set up a Flume source so that every time someone punches someone else, it gets written out as a line in a big log file in Hadoop. To turn this data into these lists, we need a MapReduce job that can 1) invert the mapping from attacker to their victim, 2) group by victims, and 3) eliminate duplicates.
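The three steps (invert, group, deduplicate) can be tried out in plain Python before writing any Hadoop code. This is only an in-memory sketch, not the tutorial's Java job; the log format of one whitespace-separated `attacker victim` pair per line and the function names are assumptions made for illustration.

```python
from collections import defaultdict

def map_punch(line):
    """Map step: invert an 'attacker victim' line into a (victim, attacker) pair."""
    attacker, victim = line.split()
    return victim, attacker

def reduce_attackers(pairs):
    """Reduce step: group by victim; using a set drops duplicate attackers."""
    grouped = defaultdict(set)
    for victim, attacker in pairs:
        grouped[victim].add(attacker)
    return {victim: sorted(attackers) for victim, attackers in grouped.items()}

# Hypothetical log lines in the assumed format
log_lines = [
    "blackbeard redbeard",
    "redbeard blackbeard",
    "blackbeard redbeard",  # a repeat offence, should appear only once
    "anne redbeard",
]

victims = reduce_attackers(map_punch(line) for line in log_lines)
for victim, attackers in victims.items():
    print(victim, attackers)
```

In a real job the grouping by key is done by the framework's shuffle, so the reducer only has to remove duplicates from the attackers it receives for one victim.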

Cool!

Imagine using the same technique while you watch the evening news!

On second thought, that would take too much data entry and be depressing.

Stick to the pirates!

Solving real world analytics problems with Apache Hadoop [Webinar]

Tuesday, December 11th, 2012

Solving real world analytics problems with Apache Hadoop

Thursday December 13, 2012 at 8:30 a.m. PST/11:30 a.m. EST

From the registration page:

Agenda:

  • Defining big data
  • What are the most critical components of a big data solution?
  • The business and technical challenges of delivering a solution
  • How Cloudera accelerates big data value?
  • Why partner with HP?
  • The HP AppSystem powered by Cloudera

Doesn’t look heavy on the technical side but on the other hand, attending means you will be entered in a raffle for an HP Mini Notebook.

Impala Beta (0.3) + Cloudera Manager 4.1.2 [Get'm While They're Hot!]

Wednesday, December 5th, 2012

Cloudera Impala Beta (version 0.3) and Cloudera Manager 4.1.2 Now Available by Vinithra Varadharajan.

If you are keeping your Hadoop ecosystem skills up to date, drop by Cloudera for the latest Impala beta and a new release of Cloudera Manager.

Vinithra reports that new releases of Impala are going to drop every two to four weeks.

You can either wait for the final release of Impala or read along and contribute to the final product with your testing and comments.

New to Hadoop

Tuesday, December 4th, 2012

New to Hadoop

Cloudera has organized a seven step program for learning Hadoop!

  1. Read Up on Background
  2. Install Locally, Install a VM, or Spin Up on Cloud
  3. Explore Tutorials
  4. Get Trained Up
  5. Read Books
  6. Contribute!
  7. Participate!

It doesn’t list every possible resource but all the ones listed are high quality.

Following this program will build a solid basis for exploring the Hadoop ecosystem on your own.

Cloudera – Videos from Strata + Hadoop World 2012

Monday, December 3rd, 2012

Cloudera – Videos from Strata + Hadoop World 2012

The link is to the main resources page, where you can find many other videos and other materials.

If you want Strata + Hadoop World 2012 videos specifically, search on Hadoop World 2012.

As of today, that pulls up 41 entries. Should be enough to keep you occupied for a day or so. ;-)

SINAInnovation: Innovation and Data

Tuesday, November 27th, 2012

SINAInnovation: Innovation and Data by Jeffrey Hammerbacher.

From the description:

Cloudera Co-founder Jeff Hammerbacher speaks about data and innovation in the biology and medicine fields.

Interesting presentation, particularly on creating structures for innovation.

One of his insights I would summarize as “break early, rebuild fast.” His term for it was “lower batch size.” Try new ideas and when they fail, try a new one.

I do wonder about his goal to: “Lower the cost of data storage and processing to zero.”

It may get to be “too cheap to meter” but that isn’t the same thing as being zero. Somewhere in the infrastructure, someone is paying bills for storage and processing.

I mention that because some political parties think that infrastructure can exist without ongoing maintenance and care.

Failing infrastructures don’t lead to innovation.


SINAInnovation description:

SINAInnovations was a three-day conference at The Mount Sinai Medical Center that examined all aspects of innovation and therapeutic discovery within academic medical centers, from how it can be taught and fostered within academia, to how it can accelerate drug discovery and the commercialization of emerging biotechnologies.

The “Ask Bigger Questions” Contest!

Monday, November 19th, 2012

The “Ask Bigger Questions” Contest! by Ryan Goldman. (Deadline, Feb. 1 2013)

From the post:

Have you helped your company ask bigger questions? Our mission at Cloudera University is to equip Hadoop professionals with the skills to manage, process, analyze, and monetize more data than they ever thought possible.

Over the past three years, we’ve heard many great stories from our training participants about faster cluster deployments, complex data workflows made simple, and superhero troubleshooting moments. And we’ve heard from executives in all types of businesses that staffing Cloudera Certified professionals gives them confidence that their Hadoop teams have the skills to turn data into breakthrough insights.

Now, it’s your turn to tell us your bigger questions story! Cloudera University is seeking tales of Hadoop success originating with training and certification. How has an investment in your education paid dividends for your company, team, customer, or career?

The most compelling stories chosen from all entrants will receive prizes like Amazon gift cards, discounted Cloudera University training, autographed copies of Hadoop books from O’Reilly Media, and Cloudera swag. We may even turn your story into a case study!

Sign up to participate here. Submissions must be received by Friday, Feb. 1, 2013 to qualify for a prize.

A good marketing technique that might bear imitation.

Don’t have to seek out success stories. Incentives for people to bring them to you.

You get good marketing material that is likely to resonate with other users.

Something to think about.

BioInformatics: A Data Deluge with Hadoop to the Rescue

Monday, November 19th, 2012

BioInformatics: A Data Deluge with Hadoop to the Rescue by Marty Lurie.

From the post:

Cloudera Cofounder and Chief Scientist Jeff Hammerbacher is leading a revolutionary project with Mount Sinai School of Medicine to apply the power of Cloudera’s Big Data platform to critical problems in predicting and understanding the process and treatment of disease.

“We are at the cutting edge of disease prevention and treatment, and the work that we will do together will reshape the landscape of our field,” said Dennis S. Charney, MD, Anne and Joel Ehrenkranz Dean, Mount Sinai School of Medicine and Executive Vice President for Academic Affairs, The Mount Sinai Medical Center. “Mount Sinai is thrilled to join minds with Cloudera.” (Please see http://www.cloudera.com/content/cloudera/en/about/press-center/press-releases/release.html?ReleaseID=1747809 for more details.)

Cloudera is active in many other areas of BioInformatics. Due to Cloudera’s market leadership in Big Data, many DNA mapping programs have specific installation instructions for CDH (Cloudera’s 100% open-source, enterprise-ready distribution of Hadoop and related projects). But rather than just tell you about Cloudera let’s do a worked example of BioInformatics data – specifically FAERS.

A sponsored piece by Cloudera but walks you through using Impala with the FDA data on adverse drug reactions.

Demonstrates getting started with Impala isn’t hard. Which is true.

What’s lacking is a measure of the difficulty of good results.

Any old result, good or bad, probably isn’t of interest to most users.

Cloudera Glossary

Thursday, November 15th, 2012

Cloudera Glossary

A goodly collection of terms used with Cloudera (Hadoop and related) technology.

I have a weakness for dictionaries, lexicons, grammars and the like so your mileage may vary.

Includes links to additional resources.

Cloudera Impala – Fast, Interactive Queries with Hadoop

Wednesday, November 14th, 2012

Cloudera Impala – Fast, Interactive Queries with Hadoop by Istvan Szegedi.

From the post:

As discussed in the previous post about Twitter’s Storm, Hadoop is a batch oriented solution that has a lack of support for ad-hoc, real-time queries. Many of the players in Big Data have realised the need for fast, interactive queries besides the traditional Hadoop approach. Cloudera, one of the key solution vendors in the Big Data/Hadoop domain, has just recently launched Cloudera Impala that addresses this gap.

As the Cloudera Engineering team described in their blog, their work was inspired by the Google Dremel paper, which is also the basis for Google BigQuery. Cloudera Impala provides a HiveQL-like query language for a wide variety of SELECT statements with WHERE, GROUP BY, HAVING clauses, with ORDER BY (though currently LIMIT is mandatory with ORDER BY), joins (LEFT, RIGHT, FULL, OUTER, INNER), UNION ALL, external tables, etc. It also supports arithmetic and logical operators and Hive built-in functions such as COUNT, SUM, LIKE, IN or BETWEEN. It can access data stored on HDFS but it does not use mapreduce; instead it is based on its own distributed query engine.

The current Impala release (Impala 1.0beta) does not support DDL statements (CREATE, ALTER, DROP TABLE); all the table creation/modification/deletion functions have to be executed via Hive and then refreshed in Impala shell.

Cloudera Impala is open-source under the Apache Licence; the code can be retrieved from GitHub. Its components are written in C++, Java and Python.

Will get you off to a good start with Impala.

Impala: Real-time Queries in Hadoop [Recorded Webinar]

Wednesday, November 7th, 2012

Impala: Real-time Queries in Hadoop

From the description:

Learn how Cloudera Impala empowers you to:

  1. Perform interactive, real-time analysis directly on source data stored in Hadoop
  2. Interact with data in HDFS and HBase at the “speed of thought”
  3. Reduce data movement between systems & eliminate double storage

You can also grab the slides here.

Almost fifty-nine minutes.

Speedup over Hive on MapReduce is reported to be 4-30X.

I’m not sure about #2 above but then I lack a skull-jack. Maybe this next Christmas. ;-)

One To Watch: Apache Crunch

Wednesday, October 31st, 2012

One To Watch: Apache Crunch by Chris Mayer.

From the post:

Over the past few years, the Apache Software Foundation has become the hub for big data-focused projects. An array of companies have recognised the worth of housing their latest innovative projects at the ASF, with Apache Hadoop and Apache Cassandra two shining examples.

Amongst the number of projects arriving in the Apache Incubator was Apache Crunch. Crunch is a Java library created to eliminate the tedium of writing a MapReduce pipeline. It aims to take hold of the entire process, making writing, testing, and running MapReduce pipelines more efficient and “even fun” (if this Cloudera blog post is to be believed).

That’s a tall order, to make MapReduce pipelines “even fun.” On the other hand, remarkable things have emerged from Apache for decades now.

A project to definitely keep in sight.

Cloudera’s Impala and the Semantic “Mosh Pit”

Thursday, October 25th, 2012

Cloudera’s Impala tool binds Hadoop with business intelligence apps by Christina Farr.

From the post:

In traditional circles, Hadoop is viewed as a bright but unruly problem child.

Indeed, it is still in the nascent stages of development. However the scores of “big data” startups that leverage Hadoop will tell you that it is here to stay.

Cloudera, the venture-backed startup that ushered the mainstream deployment of Hadoop, has unveiled a new technology at the Hadoop World, the data-focused conference in New York.

Its new product, known as “Impala”, addresses many of the concerns that large enterprises still have about Hadoop, namely that it does not integrate well with traditional business intelligence applications.

“We have heard this criticism,” said Charles Zedlewski, Cloudera’s VP of Product in a phone interview with VentureBeat. “That’s why we decided to do something about it,” he said.

Impala enables its users to store vast volumes of unwieldy data and run queries in HBase, Hadoop’s NoSQL database. What’s interesting is that it is built to maximise speed: it runs on top of Hadoop storage, but speaks to SQL and works with pre-existing drivers.

Legacy data is a well known concept.

Are we approaching the point of legacy applications? Applications that are too widely/deeply embedded in IT infrastructure to be replaced?

Or at least not replaced quickly?

The semantics of legacy data are known to be fair game for topic maps. Do the semantics of legacy applications offer the same possibilities?

Mapping the semantics of “legacy” applications, their ancestors and descendants, data, legacy and otherwise, results in a semantic mosh pit.

Some strategies for a semantic “mosh pit:”

  1. Prohibit it (we know the success rate on that option)
  2. Ignore it (costly but more “successful” than #1)
  3. Create an app on top of the legacy app (an error repeated isn’t an error, it’s following precedent)
  4. Sample it (but what are you missing?)
  5. Map it (being mindful of cost/benefit)

Which one are you going to choose?

Sneak Peek into Skybox Imaging’s Cloudera-powered Satellite System [InaaS?]

Saturday, October 20th, 2012

Sneak Peek into Skybox Imaging’s Cloudera-powered Satellite System by Justin Kestelyn (@kestelyn)

This is a guest post by Oliver Guinan, VP Ground Software, at Skybox Imaging. Oliver is a 15-year veteran of the internet industry and is responsible for all ground system design, architecture and implementation at Skybox.

One of the great promises of the big data movement is using networks of ubiquitous sensors to deliver insights about the world around us. Skybox Imaging is attempting to do just that for millions of locations across our planet.

Skybox is developing a low cost imaging satellite system and web-accessible big data processing platform that will capture video or images of any location on Earth within a couple of days. The low cost nature of the satellite opens the possibility of deploying tens of satellites which, when integrated together, have the potential to image any spot on Earth within an hour.

Skybox satellites are designed to capture light in the harsh environment of outer space. Each satellite captures multiple images of a given spot on Earth. Once the images are transferred from the satellite to the ground, the data needs to be processed and combined to form a single image, similar to those seen within online mapping portals.

With any sensor network, capturing raw data is only the beginning of the story. We at Skybox are building a system to ingest and process the raw data, allowing data scientists and end users to ask arbitrary questions of the data, then publish the answers in an accessible way and at a scale that grows with the number of satellites in orbit. We selected Cloudera to support this deployment.

Now is the time to start planning topic map based products that can incorporate this type of data.

There are lots of folks who are “curious” about what is happening next door, in the next block, a few “klicks” away, across the border, etc.

Not all of them have the funds for private “keyhole” satellites and vacuum data feeds. But they may have money to pay you for efficient and effective collation of intelligence data.

Topic maps empowering “Intelligence as a Service (InaaS)”?

What’s New in CDH4.1 Pig

Friday, October 19th, 2012

What’s New in CDH4.1 Pig by Cheolsoo Park.

From the post:

Apache Pig is a platform for analyzing large data sets that provides a high-level language called Pig Latin. Pig users can write complex data analysis programs in an intuitive and compact manner using Pig Latin.

Among many other enhancements, CDH4.1, the newest release of Cloudera’s open-source Hadoop distro, upgrades Pig from version 0.9 to version 0.10. This post provides a summary of the top seven new features introduced in CDH4.1 Pig.

Cheolsoo covers these new features:

  • Boolean Data Type
  • Nested FOREACH and CROSS
  • Ruby UDFs
  • LIMIT / SPLIT by Expression
  • Default SPLIT Destination
  • Syntactical Sugar for TOTUPLE, TOBAG, and TOMAP
  • AvroStorage Improvements
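
To give a feel for the “Default SPLIT Destination” item, here is a rough Python sketch of the routing idea as I understand it: each record goes to every output whose condition it satisfies, and records matching no condition land in a default output. This is not Pig Latin and not Pig's implementation; the function name `split` and the output name `otherwise` are my own illustrative choices.

```python
def split(records, conditions):
    """Route records to named outputs; unmatched records go to 'otherwise'.

    `conditions` maps an output name to a predicate. A record may land in
    more than one output if several predicates accept it (an assumption
    about the semantics being illustrated).
    """
    outputs = {name: [] for name in conditions}
    outputs["otherwise"] = []  # hypothetical name for the default destination
    for record in records:
        matched = False
        for name, predicate in conditions.items():
            if predicate(record):
                outputs[name].append(record)
                matched = True
        if not matched:
            outputs["otherwise"].append(record)
    return outputs


result = split(
    [1, 4, 7, 10],
    {"small": lambda x: x < 5, "large": lambda x: x > 8},
)
print(result)
```

Here 7 satisfies neither condition, so it ends up in the default output rather than being silently dropped.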

Enjoy!

Axemblr’s Java Client for the Cloudera Manager API

Thursday, October 18th, 2012

Axemblr’s Java Client for the Cloudera Manager API by Justin Kestelyn.

From the post:

Axemblr, purveyors of a cloud-agnostic MapReduce Web Service, have recently announced the availability of an Apache-licensed Java Client for the Cloudera Manager API.

The task at hand, according to Axemblr, is to “deploy Hadoop on Cloud with as little user interaction as possible. We have the code to provision the hosts but we still need to install and configure Hadoop on all nodes and make it so the user has a nice experience doing it.” And voila, the answer is Cloudera Manager, with the process made easy via the REST API introduced in Release 4.0.

Thus, says Axemblr: “In the pursuit of our greatest desire (second only to coffee early in the morning), we ended up writing a Java client for Cloudera Manager’s API. Thus we achieved to automate a CDH3 Hadoop installation on Amazon EC2 and Rackspace Cloud. We also decided to open source the client so other people can play along.”

Another goodie to ease your way to Hadoop deployment on your favorite cloud.
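
For a sense of what talking to a REST API like this involves, here is a small Python sketch that only builds a request and never sends it. It is not Axemblr's Java client. The host name, the credentials, the port (7180) and the `/api/v1/clusters` path are assumptions on my part, so check them against the Cloudera Manager API documentation for your release.

```python
import base64
import urllib.request


def build_clusters_request(host, user, password, port=7180):
    """Build (but do not send) an HTTP GET request that would list clusters.

    The port and the /api/v1/clusters path are assumptions for illustration.
    """
    url = f"http://{host}:{port}/api/v1/clusters"
    # HTTP Basic authentication: base64 of "user:password".
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


# Hypothetical host and credentials.
request = build_clusters_request("cm.example.com", "admin", "admin")
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would actually send it, which is deliberately left out here.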

Do you remember the lights at radio stations that would show “On Air?”

I need an “On Cloud” light. It would be more realistic than the data appliance.

Analyzing Twitter Data with Hadoop, Part 2: Gathering Data with Flume

Tuesday, October 16th, 2012

Analyzing Twitter Data with Hadoop, Part 2: Gathering Data with Flume by Jon Natkins.

From the post:

This is the second article in a series about analyzing Twitter data using some of the components of the Hadoop ecosystem available in CDH, Cloudera’s open-source distribution of Hadoop and related projects. In the first article, you learned how to pull CDH components together into a single cohesive application, but to really appreciate the flexibility of each of these components, we need to dive deeper.

Every story has a beginning, and every data pipeline has a source. So, to build Hadoop applications, we need to get data from a source into HDFS.

Apache Flume is one way to bring data into HDFS using CDH. The Apache Flume website describes Flume as “a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.” At the most basic level, Flume enables applications to collect data from its origin and send it to a resting location, such as HDFS. At a slightly more detailed level, Flume achieves this goal by defining dataflows consisting of three primary structures: sources, channels and sinks. The pieces of data that flow through Flume are called events, and the processes that run the dataflow are called agents.

A very good introduction to the use of Flume!
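
The source, channel, sink vocabulary from the excerpt can be sketched as a toy model in Python. This is not the Flume API or its configuration format; every class name below is hypothetical and exists only to show how an agent moves events from a source, through a channel, to a sink.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Event:
    """One unit of data moving through the flow (toy stand-in for an event)."""
    body: str
    headers: dict = field(default_factory=dict)


class MemoryChannel:
    """Buffers events between a source and a sink."""

    def __init__(self, capacity=100):
        self.capacity = capacity
        self._queue = deque()

    def put(self, event):
        if len(self._queue) >= self.capacity:
            raise OverflowError("channel is full")
        self._queue.append(event)

    def take(self):
        # Return the oldest buffered event, or None when the channel is empty.
        return self._queue.popleft() if self._queue else None


class ListSource:
    """Produces events from an in-memory list (a real source might read a stream)."""

    def __init__(self, lines):
        self.lines = lines

    def events(self):
        for line in self.lines:
            yield Event(body=line)


class CollectingSink:
    """Collects event bodies in a list (a real sink might write to HDFS)."""

    def __init__(self):
        self.written = []

    def process(self, event):
        self.written.append(event.body)


class Agent:
    """Runs the dataflow: source -> channel -> sink."""

    def __init__(self, source, channel, sink):
        self.source = source
        self.channel = channel
        self.sink = sink

    def run(self):
        # Fill the channel from the source, then drain it into the sink.
        for event in self.source.events():
            self.channel.put(event)
        while True:
            event = self.channel.take()
            if event is None:
                break
            self.sink.process(event)


sink = CollectingSink()
Agent(ListSource(["tweet one", "tweet two"]), MemoryChannel(), sink).run()
print(sink.written)
```

The channel is the piece worth noticing: it decouples the rate at which data arrives from the rate at which it is written out.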

Does it seem to you that the number of examples using Twitter, not just for “big data” but in general, is on the rise?

Just a personal observation, subject to all the flaws of such observations (“all the buses were going the other way”).

Judging from the state of my inbox, some people are still writing more than 140 characters at a time.

Will it make a difference in our tools/thinking if we focus on shorter strings as opposed to longer ones?

What is Hadoop Metrics2?

Wednesday, October 10th, 2012

What is Hadoop Metrics2? by Ahmed Radwan.

I’ve been wondering about that. How about you? ;-)

From the post:

Metrics are collections of information about Hadoop daemons, events and measurements; for example, data nodes collect metrics such as the number of blocks replicated, number of read requests from clients, and so on. For that reason, metrics are an invaluable resource for monitoring Hadoop services and an indispensable tool for debugging system problems.

This blog post focuses on the features and use of the Metrics2 system for Hadoop, which allows multiple metrics output plugins to be used in parallel, supports dynamic reconfiguration of metrics plugins, provides metrics filtering, and allows all metrics to be exported via JMX.

However cool the software, you can’t ever really get away from managing it.

And it isn’t a bad skill to have. Read on!
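
The two ideas from the excerpt that are easiest to picture in code are several output plugins running side by side and per-plugin filtering. Here is a toy Python sketch of just those two ideas. It is not the Hadoop Metrics2 API, and the class names and metric names are made up for illustration.

```python
class MemorySink:
    """Hypothetical output plugin that keeps the latest value per metric."""

    def __init__(self):
        self.records = {}

    def put(self, name, value):
        self.records[name] = value


class MetricsSystem:
    """Toy registry that fans each metric out to every registered sink."""

    def __init__(self):
        self._sinks = []  # list of (sink, accept) pairs

    def register_sink(self, sink, accept=lambda name: True):
        # `accept` is a filter: the sink only sees metrics it returns True for.
        self._sinks.append((sink, accept))

    def publish(self, name, value):
        for sink, accept in self._sinks:
            if accept(name):
                sink.put(name, value)


everything = MemorySink()
blocks_only = MemorySink()

metrics = MetricsSystem()
metrics.register_sink(everything)
metrics.register_sink(blocks_only, accept=lambda name: name.startswith("blocks_"))

# Hypothetical metric names, loosely modeled on the examples in the excerpt.
metrics.publish("blocks_replicated", 12)
metrics.publish("read_requests", 340)

print(everything.records)
print(blocks_only.records)
```

The first sink receives both metrics, while the filtered sink receives only the one whose name starts with `blocks_`.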

CDH4.1 Now Released!

Wednesday, October 3rd, 2012

CDH4.1 Now Released! by Charles Zedlewski.

From the post:

We’re pleased to announce the availability of CDH4.1. We’ve seen excellent adoption of CDH4.0 since it went GA at the end of June and a number of exciting use cases have moved to production. CDH4.1 is an update that has a number of fixes but also a number of useful enhancements. Among them:

  • Quorum based storage – Quorum-based Storage for HDFS provides the ability for HDFS to store its own NameNode edit logs, allowing you to run a highly available NameNode without external storage or custom fencing.
  • Hive security and concurrency – we’ve fixed some long standing issues with running Hive. With CDH4.1, it is now possible to run a shared Hive instance where users submit queries using Kerberos authentication. In addition this new Hive server supports multiple users submitting queries at the same time.
  • Support for DataFu – the LinkedIn data science team was kind enough to open source their library of Pig UDFs that make it easier to perform common jobs like sessionization or set operations. Big thanks to the LinkedIn team!!!
  • Oozie workflow builder – since we added Oozie to CDH more than two years ago, we have often had requests to make it easier to develop Oozie workflows. The newly enhanced job designer in Hue enables users to use a visual tool to build and run Oozie workflows.
  • FlumeNG improvements – since its release, FlumeNG has become the backbone for some exciting data collection projects, in some cases collecting as much as 20TB of new event data per day. In CDH4.1 we added an HBase sink as well as metrics for monitoring as well as a number of performance improvements.
  • Various performance improvements – CDH4.1 users should experience a boost in their MapReduce performance from CDH4.0.
  • Various security improvements – CDH4.1 enables users to configure the system to encrypt data in flight during the shuffle phase. CDH now also applies Hadoop security to users who access the filesystem via a FUSE mount.
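
The quorum-based storage item in the list above is the one I find easiest to picture in code. A minimal Python sketch, assuming “quorum” means that a strict majority of log-storing nodes must acknowledge each edit before it counts as written. The `JournalNode` class and `quorum_write` function are hypothetical names, not HDFS code.

```python
class JournalNode:
    """Hypothetical stand-in for one node that stores a copy of the edit log."""

    def __init__(self, name, available=True):
        self.name = name
        self.available = available
        self.edits = []

    def append(self, edit):
        # An unavailable node stores nothing and does not acknowledge.
        if not self.available:
            return False
        self.edits.append(edit)
        return True


def quorum_write(nodes, edit):
    """Return True when a strict majority of nodes stored the edit."""
    acks = sum(1 for node in nodes if node.append(edit))
    return acks > len(nodes) // 2


nodes = [JournalNode("jn1"), JournalNode("jn2"), JournalNode("jn3", available=False)]
print(quorum_write(nodes, "mkdir /data"))   # two of three nodes acknowledge

nodes[1].available = False
print(quorum_write(nodes, "delete /tmp"))   # only one of three acknowledges
```

With three nodes the write survives the loss of one node but not two. Note that this toy version leaves the edit on the one node that did accept it even when the quorum fails, which a real system would have to reconcile.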

It’s releases like this that make me wish I spent more time writing documentation for software, if only to try out all the cool features with no real goal other than trying them out.

Enjoy!

Schedule This! Strata + Hadoop World Speakers from Cloudera

Monday, September 24th, 2012

Schedule This! Strata + Hadoop World Speakers from Cloudera by Justin Kestelyn.

Oct. 23-25, 2012, New York City

From the post:

We’re getting really close to Strata Conference + Hadoop World 2012 (just over a month away), schedule planning-wise. So you may want to consider adding the tutorials, sessions, and keynotes below to your calendar! (Start times are always subject to change of course.)

The ones listed below are led or co-led by Clouderans, but there is certainly a wide range of attractive choices beyond what you see here. We just want to ensure that you put these particular ones high on your consideration list.

Just in case the Clouderans aren’t enough incentive to attend (they should be), consider the full schedule for the conference.