Archive for the ‘Cloudera’ Category

Top 5 Cloudera Engineering Blogs of 2017

Tuesday, January 9th, 2018

Top 5 Cloudera Engineering Blogs of 2017

From the post:

1. Working with UDFs in Apache Spark

2. Offset Management For Apache Kafka With Apache Spark Streaming

3. Performance comparison of different file formats and storage engines in the Apache Hadoop ecosystem

4. Up and running with Apache Spark on Apache Kudu

5. Apache Impala Leads Traditional Analytic Database

Kudos to Cloudera for a useful list of “top” blog posts for 2017.

We might disagree on the top five but it’s a manageable number of posts and represents the quality of Cloudera postings all year long.


Cloudera Introduces Topic Maps Extra-Lite

Wednesday, May 10th, 2017

New in Cloudera Enterprise 5.11: Hue Data Search and Tagging by Romain Rigaux.

From the post:

Have you ever struggled to remember table names related to your project? Does it take much too long to find those columns or views? Hue now lets you easily search for any table, view, or column across all databases in the cluster. With the ability to search across tens of thousands of tables, you’re able to quickly find the tables that are relevant for your needs for faster data discovery.

In addition, you can also now tag objects with names to better categorize them and group them to different projects. These tags are searchable, expediting the exploration process through easier, more intuitive discovery.

Through an integration with Cloudera Navigator, existing tags and indexed objects show up automatically in Hue, any additional tags you add appear back in Cloudera Navigator, and the familiar Cloudera Navigator search syntax is supported.
… (emphasis in original)

Seventeen (17) years ago, ISO/IEC 13250:2000 offered users the ability to have additional names for tables, columns and/or any other subject of interest.

Additional names that could have scope (think range of application, such as a language), that could exist in relationships to their creators/users, exposing as much or as little information to a particular user as desired.

For commonplace needs, perhaps tagging objects with names, displayed as simple strings, is sufficient.

But if viewed from a topic maps perspective, that string display to one user could in fact represent that string, along with who created it, what names it is used with, who uses similar names, just to name a few of the possibilities.

All of which makes me think topic maps should ask users:

  • What subjects do you need to talk about?
  • How do you want to identify those subjects?
  • What do you want to say about those subjects?
  • Do you need to talk about associations/relationships?

It could be that for day-to-day users, a string tag/name is sufficient. That doesn’t mean that greater semantics don’t lurk just below the surface. Perhaps even on demand.
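To make the contrast between a bare string tag and a topic-map-style name concrete, here is a minimal sketch in Python. It is not the ISO/IEC 13250 data model and not anything Hue or Cloudera Navigator exposes; the class names, the `scope` labels, and the sample table are all hypothetical, chosen only to show a name that carries its scope and creator along with the string.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Name:
    """One name for a subject, plus the context a plain tag would drop."""
    value: str
    scope: frozenset = frozenset()   # hypothetical labels, e.g. {"lang:pt"}
    creator: str = "unknown"


@dataclass
class Subject:
    """A table, column, or anything else worth talking about."""
    identifier: str                  # hypothetical identifier scheme
    names: list = field(default_factory=list)

    def add_name(self, value, scope=(), creator="unknown"):
        self.names.append(Name(value, frozenset(scope), creator))

    def names_in_scope(self, *scope):
        """Return the names that are valid in all of the requested scopes."""
        wanted = set(scope)
        return [n.value for n in self.names if wanted <= n.scope]


orders = Subject("warehouse.orders")
orders.add_name("orders", scope={"lang:en"}, creator="alice")
orders.add_name("pedidos", scope={"lang:pt"}, creator="bruno")
orders.add_name("sales_fact", scope={"lang:en", "project:finance"}, creator="alice")

print(orders.names_in_scope("lang:pt"))                      # ['pedidos']
print(orders.names_in_scope("lang:en", "project:finance"))   # ['sales_fact']
```

A day-to-day user would still see only a string; who created it and where it applies remain available underneath when someone asks.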

Achieving a 300% speedup in ETL with Apache Spark

Tuesday, January 3rd, 2017

Achieving a 300% speedup in ETL with Apache Spark by Eric Maynard.

From the post:

A common design pattern often emerges when teams begin to stitch together existing systems and an EDH cluster: file dumps, typically in a format like CSV, are regularly uploaded to EDH, where they are then unpacked, transformed into optimal query format, and tucked away in HDFS where various EDH components can use them. When these file dumps are large or happen very often, these simple steps can significantly slow down an ingest pipeline. Part of this delay is inevitable; moving large files across the network is time-consuming because of physical limitations and can’t be readily sped up. However, the rest of the basic ingest workflow described above can often be improved.

Campaign finance data suffers more from complexity and obscurity than volume.

However, there are data problems where volume and not deceit is the issue. In those cases, you may find Eric’s advice quite helpful.

Ibis on Impala: Python at Scale for Data Science

Tuesday, July 21st, 2015

Ibis on Impala: Python at Scale for Data Science by Marcel Kornacker and Wes McKinney.

From the post:

Ibis: Same Great Python Ecosystem at Hadoop Scale

Co-founded by the respective architects of the Python pandas toolkit and Impala and now incubating in Cloudera Labs, Ibis is a new data analysis framework with the goal of enabling advanced data analysis on a 100% Python stack with full-fidelity data. With Ibis, for the first time, developers and data scientists will be able to utilize the last 15 years of advances in high-performance Python tools and infrastructure in a Hadoop-scale environment—without compromising user experience for performance. It’s exactly the same Python you know and love, only at scale!

In this initial (unsupported) Cloudera Labs release, Ibis offers comprehensive support for the analytical capabilities presently provided by Impala, enabling Python users to run Big Data workloads in a manner similar to that of “small data” tools like pandas. Next, we’ll extend Impala and Ibis in several ways to make the Python ecosystem a seamless part of the stack:

  • First, Ibis will enable more natural data modeling by leveraging Impala’s upcoming support for nested types (expected by end of 2015).
  • Second, we’ll add support for Python user-defined logic so that Ibis will integrate with the existing Python data ecosystem—enabling custom Python functions at scale.
  • Finally, we’ll accelerate performance further through low-level integrations between Ibis and Impala with a new Python-friendly, in-memory columnar format and Python-to-LLVM code generation. These updates will accelerate Python to run at native hardware speed.

See: Getting Started with Ibis and How to Contribute (same authors, opposite order) in order to cut to the chase and get started.


New Advanced Analytics and Data Wrangling Tutorials on Cloudera Live

Thursday, January 8th, 2015

New Advanced Analytics and Data Wrangling Tutorials on Cloudera Live by Alex Gutow.

From the post:

When it comes to learning Apache Hadoop and CDH (Cloudera’s open source platform including Hadoop), there is no better place to start than Cloudera Live. With a quick, one-button deployment option, Cloudera Live launches a four-node Cloudera cluster that you can learn and experiment in free for two-weeks. To help plan and extend the capabilities of your cluster, we also offer various partner deployments. Building on the addition of interactive tutorials and Tableau and Zoomdata integration, we have added a new tutorial on Apache Spark and a new Trifacta partner deployment.

One of the most popular tools in the Hadoop ecosystem is Apache Spark. This easy-to-use, general-purpose framework is extensible across multiple use cases – including batch processing, iterative advanced analytics, and real-time stream processing. With support and development from multiple industry vendors and partner tools, Spark has quickly become a standard within Hadoop.

With the new tutorial, “Relationship Strength Analytics Using Spark,” it will walk you through the basics of Spark and how you can utilize the same, unified enterprise data hub to launch into advanced analytics. Using the example of product relationships, it will walk you through how to discover what products are commonly viewed together, how to optimize product campaigns together for better sales, and discover other insights about product relationships to help build advanced recommendations.
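The "commonly viewed together" idea in the quoted description can be sketched without a cluster. What follows is plain Python, not the Spark code from the tutorial, and the browsing sessions are made up for illustration; it simply counts how often each pair of products appears in the same session.

```python
from collections import Counter
from itertools import combinations

# Made-up browsing sessions; each set holds the products viewed together.
sessions = [
    {"tent", "sleeping bag", "stove"},
    {"tent", "sleeping bag"},
    {"stove", "fuel"},
    {"tent", "sleeping bag", "fuel"},
]

pair_counts = Counter()
for viewed in sessions:
    # Sort so that (a, b) and (b, a) count as the same pair.
    for pair in combinations(sorted(viewed), 2):
        pair_counts[pair] += 1

strongest_pair, times_seen = pair_counts.most_common(1)[0]
print(strongest_pair, times_seen)   # ('sleeping bag', 'tent') 3
```

The same pair-counting shape is what one would presumably distribute across a cluster once the sessions no longer fit on one machine.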

There is enough high grade educational material for data science that I think with some slicing and dicing, an entire curriculum could be fashioned out of online resources alone.

A number of Cloudera tutorials would find their way into such a listing.


New in CDH 5.3: Transparent Encryption in HDFS

Wednesday, January 7th, 2015

New in CDH 5.3: Transparent Encryption in HDFS by Charles Lamb, Yi Liu & Andrew Wang

From the post:

Apache Hadoop 2.6 adds support for transparent encryption to HDFS. Once configured, data read from and written to specified HDFS directories will be transparently encrypted and decrypted, without requiring any changes to user application code. This encryption is also end-to-end, meaning that data can only be encrypted and decrypted by the client. HDFS itself never handles unencrypted data or data encryption keys. All these characteristics improve security, and HDFS encryption can be an important part of an organization-wide data protection story.

Cloudera’s HDFS and Cloudera Navigator Key Trustee (formerly Gazzang zTrustee) engineering teams did this work under HDFS-6134 in collaboration with engineers at Intel as an extension of earlier Project Rhino work. In this post, we’ll explain how it works, and how to use it.

Excellent news! Especially for data centers that are responsible for the data of others.

The authors do mention the problem of rogue users, that is on the client side:

Finally, since each file is encrypted with a unique DEK and each EZ can have a different key, the potential damage from a single rogue user is limited. A rogue user can only access EDEKs and ciphertext of files for which they have HDFS permissions, and can only decrypt EDEKs for which they have KMS permissions. Their ability to access plaintext is limited to the intersection of the two. In a secure setup, both sets of permissions will be heavily restricted.
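The "intersection of the two" in the quoted passage is easy to model with sets. This is only a toy model in Python, not the HDFS or KMS API; the file paths, key names, and grants below are hypothetical.

```python
# Hypothetical files and the encryption-zone key each one falls under.
file_zone_key = {
    "/finance/q1.csv": "finance-key",
    "/finance/q2.csv": "finance-key",
    "/hr/salaries.csv": "hr-key",
    "/legal/contract.pdf": "legal-key",
}

# Hypothetical grants for a single (possibly rogue) user.
hdfs_readable = {"/finance/q1.csv", "/hr/salaries.csv"}   # HDFS permissions
kms_decryptable = {"finance-key", "legal-key"}            # KMS permissions


def plaintext_reach(file_zone_key, hdfs_readable, kms_decryptable):
    """Files the user can both fetch (EDEK plus ciphertext) and decrypt."""
    can_decrypt = {
        path for path, key in file_zone_key.items() if key in kms_decryptable
    }
    return hdfs_readable & can_decrypt


print(sorted(plaintext_reach(file_zone_key, hdfs_readable, kms_decryptable)))
# ['/finance/q1.csv']
```

Of four files, the user reaches plaintext for one: the only file covered by both sets of permissions.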

Just so you know, it won’t be a security problem with Hadoop 2.6 if Sony is hacked while running Hadoop 2.6 at a data center. Anyone who copies the master access codes from sticky notes will be able to do a lot of damage. North Korea will be the whipping boy for major future cyberhacks. That’s policy, not facts talking.

For users who do understand what secure environments should look like, this is a great advance.

Cloudera Live (Update)

Thursday, December 25th, 2014

Cloudera Live (Update)

I thought I had updated: Cloudera Live (beta) but apparently not!

Let me correct that today:

Cloudera Live is the fastest and easiest way to get started with Apache Hadoop and it now includes self-guided, interactive demos and tutorials. With a one-button deployment option, you can spin up a four-node cluster of CDH, Cloudera’s open source Hadoop platform, within minutes. This free, cloud-based Hadoop environment lets you:

  • Learn the basics of Hadoop (and CDH) through pre-loaded, hands-on tutorials
  • Plan your Hadoop project using your own datasets
  • Explore the latest features in CDH
  • Extend the capabilities of Hadoop and CDH through familiar partner tools, including Tableau and Zoomdata

Caution: The free trial is for fourteen (14) days only. To prevent billing to your account, you must delete the four machine cluster that you create.

I understand the need for a time limit but fourteen (14) days seems rather short to me, considering the number of options in the Hadoop ecosystem.

There is a read-only CDH option which is limited to three hour sessions.


Cloudera Enterprise 5.3 is Released

Wednesday, December 24th, 2014

Cloudera Enterprise 5.3 is Released by Justin Kestelyn.

From the post:

We’re pleased to announce the release of Cloudera Enterprise 5.3 (comprising CDH 5.3, Cloudera Manager 5.3, and Cloudera Navigator 2.2).

This release continues the drumbeat for security functionality in particular, with HDFS encryption (jointly developed with Intel under Project Rhino) now recommended for production use. This feature alone should justify upgrades for security-minded users (and an improved CDH upgrade wizard makes that process easier).

Here are some of the highlights (incomplete; see the respective Release Notes for CDH, Cloudera Manager, and Cloudera Navigator for full lists of features and fixes):

You are unlikely to see this until after the holidays but do pay attention to the security aspects of this release. Ask yourself, “Does my employer want to be the next Sony?” Then upgrade your current installation.

Other goodies are included so it isn’t just an upgrade for security reasons.


New in Cloudera Labs: SparkOnHBase

Friday, December 19th, 2014

New in Cloudera Labs: SparkOnHBase by Ted Malaska.

From the post:

Apache Spark is making a huge impact across our industry, changing the way we think about batch processing and stream processing. However, as we progressively migrate from MapReduce toward Spark, we shouldn’t have to “give up” anything. One of those capabilities we need to retain is the ability to interact with Apache HBase.

In this post, we will share the work being done in Cloudera Labs to make integrating Spark and HBase super-easy in the form of the SparkOnHBase project. (As with everything else in Cloudera Labs, SparkOnHBase is not supported and there is no timetable for possible support in the future; it’s for experimentation only.) You’ll learn common patterns of HBase integration with Spark and see Scala and Java examples for each. (It may be helpful to have the SparkOnHBase repository open as you read along.)

Is it too late to amend my wish list to include an eighty-hour week with Spark? 😉

This is an excellent opportunity to follow along with lab quality research on an important technology.

The Cloudera Labs discussion group strikes me as dreadfully underused.


The Top 10 Posts of 2014 from the Cloudera Engineering Blog

Thursday, December 18th, 2014

The Top 10 Posts of 2014 from the Cloudera Engineering Blog by Justin Kestelyn.

From the post:

Our “Top 10” list of blog posts published during a calendar year is a crowd favorite (see the 2013 version here), in particular because it serves as informal, crowdsourced research about popular interests. Page views don’t lie (although skew for publishing date—clearly, posts that publish earlier in the year have pole position—has to be taken into account).

In 2014, a strong interest in various new components that bring real time or near-real time capabilities to the Apache Hadoop ecosystem is apparent. And we’re particularly proud that the most popular post was authored by a non-employee.

See Justin’s post for the top ten (10) list!

The Cloudera blog always has high quality content so this is the cream of the crop!


The Impala Cookbook

Thursday, December 11th, 2014

The Impala Cookbook by Justin Kestelyn.

From the post:

Impala, the open source MPP analytic database for Apache Hadoop, is now firmly entrenched in the Big Data mainstream. How do we know this? For one, Impala is now the standard against which alternatives measure themselves, based on a proliferation of new benchmark testing. Furthermore, Impala has been adopted by multiple vendors as their solution for letting customers do exploratory analysis on Big Data, natively and in place (without the need for redundant architecture or ETL). Also significant, we’re seeing the emergence of best practices and patterns out of customer experiences.

As an effort to streamline deployments and shorten the path to success, Cloudera’s Impala team has compiled a “cookbook” based on those experiences, covering:

  • Physical and Schema Design
  • Memory Usage
  • Cluster Sizing and Hardware Recommendations
  • Benchmarking
  • Multi-tenancy Best Practices
  • Query Tuning Basics
  • Interaction with Apache Hive, Apache Sentry, and Apache Parquet

By using these recommendations, Impala users will be assured of proper configuration, sizing, management, and measurement practices to provide an optimal experience. Happy cooking!

I must confess to some confusion when I first read Justin’s post. I thought the slide set was a rather long description of the cookbook and not the cookbook itself. I was searching for the cookbook and kept finding the slides. 😉

Oh, the slides are very much worth your time but I would reserve the term “cookbook” for something a bit more substantive.

Then again, O’Reilly thinks a few more than 800 responses constitute a “survey” of data scientists, with survey results that are free from any mention of Impala. Another reason to use that “survey” with caution.

Lab Report: The Final Grade [Normalizing Corporate Small Data]

Sunday, December 7th, 2014

Lab Report: The Final Grade by Dr. Geoffrey Malafsky.

From the post:

We have completed our TechLab series with Cloudera. Its objective was to explore the ability of Hadoop in general, and Cloudera’s distribution in particular, to meet the growing need for rapid, secure, adaptive merging and correction of core corporate data. I call this Corporate Small Data which is:

“Structured data that is the fuel of an organization’s main activities, and whose problems with accuracy and trustworthiness are past the stage of being alleged. This includes financial, customer, company, inventory, medical, risk, supply chain, and other primary data used for decision making, applications, reports, and Business Intelligence. This is Small Data relative to the much ballyhooed Big Data of the Terabyte range.”1

Corporate Small Data does not include the predominant Big Data examples which are almost all stochastic use cases. These can succeed even if there is error in the source data and uncertainty in the results since the business objective is getting trends or making general associations. In stark contrast are deterministic use cases, where the ramifications for wrong results are severely negative, such as for executive decision making, accounting, risk management, regulatory compliance, and security.

Dr. Malafsky gives Cloudera high marks (A-) for use in enterprises and what he describes as “data normalization.” Not in the relational database sense but more in the data cleaning sense.

While testing a Cloudera distribution at your next data cleaning exercise, ask yourself this question: OK, the processing worked great, but how do I avoid collecting all the information I needed for this project again in the future?

Introducing Cloudera Labs: An Open Look into Cloudera Engineering R&D

Sunday, November 30th, 2014

Introducing Cloudera Labs: An Open Look into Cloudera Engineering R&D by Justin Kestelyn.

From the announcement of Cloudera Labs, a list of existing projects and a call for your suggestions of others:

Apache Kafka is among the “charter members” of this program. Since its origin as proprietary LinkedIn infrastructure just a couple years ago for highly scalable and resilient real-time data transport, it’s now one of the hottest projects associated with Hadoop. To stimulate feedback about Kafka’s role in enterprise data hubs, today we are making a Kafka-Cloudera Labs parcel (unsupported) available for installation.

Other initial Labs projects include:

  • Exhibit
    Exhibit is a library of Apache Hive UDFs that usefully let you treat array fields within a Hive row as if they were “mini-tables” and then execute SQL statements against them for deeper analysis.
  • Hive-on-Spark Integration
    A broad community effort is underway to bring Apache Spark-based data processing to Apache Hive, reducing query latency considerably and allowing IT to further standardize on Spark for data processing.
  • Impyla
    Impyla is a Python (2.6 and 2.7) client for Impala, the open source MPP query engine for Hadoop. It communicates with Impala using the same standard protocol as ODBC/JDBC drivers.
  • Oryx
    Oryx, a project jointly spearheaded by Cloudera Engineering and Intel, provides simple, real-time infrastructure for large-scale machine learning/predictive analytics applications.
  • RecordBreaker
    RecordBreaker, a project jointly developed by Hadoop co-founder Mike Cafarella and Cloudera, automatically turns your text-formatted data into structured Avro data–dramatically reducing data prep time.

As time goes on, and some of the projects potentially graduate into CDH components (or otherwise remain as Labs projects), more names will join the list. And of course, we’re always interested in hearing your suggestions for new Labs projects.
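The Exhibit idea in the list above, treating an array field inside one row as a "mini-table" and running SQL against it, can be imitated with Python's built-in `sqlite3` module. This is an analogy only: it is not Exhibit's UDF interface or Hive syntax, and the `query_array` helper, the table name `t`, and the sample order are hypothetical.

```python
import sqlite3


def query_array(items, sql):
    """Load one row's array of structs into a throwaway table named t, then run sql."""
    conn = sqlite3.connect(":memory:")
    columns = list(items[0])   # field names of the first struct
    conn.execute(f"CREATE TABLE t ({', '.join(columns)})")
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f"INSERT INTO t VALUES ({placeholders})",
        [tuple(item[c] for c in columns) for item in items],
    )
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


# A made-up row whose line_items field is an array of structs.
order = {
    "order_id": 1,
    "line_items": [
        {"sku": "a", "qty": 2, "price": 5.0},
        {"sku": "b", "qty": 1, "price": 20.0},
    ],
}

print(query_array(order["line_items"], "SELECT SUM(qty * price) FROM t"))
# [(30.0,)]
```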

Do you take the rapid development of the Hadoop ecosystem as a lesson about investment in R&D by companies both large and small?

Is one of your first questions to a startup: What are your plans for investing in open source R&D?

Other R&D labs that I should call out for special mention?

Apache Hive on Apache Spark: The First Demo

Friday, November 21st, 2014

Apache Hive on Apache Spark: The First Demo by Brock Noland.

From the post:

Apache Spark is quickly becoming the programmatic successor to MapReduce for data processing on Apache Hadoop. Over the course of its short history, it has become one of the most popular projects in the Hadoop ecosystem, and is now supported by multiple industry vendors—ensuring its status as an emerging standard.

Two months ago Cloudera, Databricks, IBM, Intel, MapR, and others came together to port Apache Hive and the other batch processing engines to Spark. In October at Strata + Hadoop World New York, the Hive on Spark project lead Xuefu Zhang shared the project status and provided a demo of our work. The same week at the Bay Area Hadoop User Group, Szehon Ho discussed the project and demo’ed the work completed. Additionally, Xuefu and Suhas Satish will be speaking about Hive on Spark at the Bay Area Hive User Group on Dec. 3.

The community has committed more than 140 changes to the Spark branch as part of HIVE-7292 – Hive on Spark. We are proud to say that queries are now functionally able to run, as you can see in the demo below of a multi-node Hive-on-Spark query (query 28 from TPC-DS with a scale factor of 20 on a TPC-DS derived dataset).

After seeing the demo, you will want to move Spark up on your “technologies to master” list!

The Definitive “Getting Started” Tutorial for Apache Hadoop + Your Own Demo Cluster

Tuesday, October 7th, 2014

The Definitive “Getting Started” Tutorial for Apache Hadoop + Your Own Demo Cluster by Justin Kestelyn.

From the post:

Most Hadoop tutorials take a piecemeal approach: they either focus on one or two components, or at best a segment of the end-to-end process (just data ingestion, just batch processing, or just analytics). Furthermore, few if any provide a business context that makes the exercise pragmatic.

This new tutorial closes both gaps. It takes the reader through the complete Hadoop data lifecycle—from data ingestion through interactive data discovery—and does so while emphasizing the business questions concerned: What products do customers view on the Web, what do they like to buy, and is there a relationship between the two?

Getting those answers is a task that organizations with traditional infrastructure have been doing for years. However, the ones that bought into Hadoop do the same thing at greater scale, at lower cost, and on the same storage substrate (with no ETL, that is) upon which many other types of analysis can be done.

To learn how to do that, in this tutorial (and assuming you are using our sample dataset) you will:

  • Load relational and clickstream data into HDFS (via Apache Sqoop and Apache Flume respectively)
  • Use Apache Avro to serialize/prepare that data for analysis
  • Create Apache Hive tables
  • Query those tables using Hive or Impala (via the Hue GUI)
  • Index the clickstream data using Flume, Cloudera Search, and Morphlines, and expose a search GUI for business users/analysts

I can’t imagine what “other” tutorials that Justin has in mind. 😉

To be fair, I haven’t taken this particular tutorial. Hadoop tutorials you suggest as comparisons to this one? Your comparisons of Hadoop tutorials?

Cloudera Navigator Demo

Tuesday, August 26th, 2014

Cloudera Navigator Demo

A short (9:50) but useful demo of Cloudera Navigator.

There was a surprise or two.

The first one was the suggestion that if there are multiple columns with different names for zip code (the equivalent of postal codes), that you should normalize all the columns to one name.

Understandable, but what if the column has a name that is non-intuitive to the user? Such as CEP (the Brazilian postal code)?

It appears that “searching” is on surface tokens and we all know the perils of that type of searching. More robust searching would allow for searching for any variant name of postal code, for example, and return the columns that shared the property of being a postal code, without regard to the column name.

The second surprise was that “normalization” as described sets the stage for repeating normalization with each data import. That sounds subject to human error as more and more data sets are imported.

The interface itself appears easy to use, assuming you are satisfied with opaque tokens for which you have to guess the semantics. You could be right but then on the other hand, you could be wrong.
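The more robust search I have in mind, where any variant name for postal code returns every column that shares the property of being a postal code, might look something like the sketch below. It is plain Python and has nothing to do with Cloudera Navigator's actual search syntax; the catalog, the subject labels, and the `find_columns` helper are hypothetical.

```python
# Hypothetical catalog: (table, column) -> the subject the column is about.
column_subject = {
    ("customers_us", "zip"): "postal-code",
    ("customers_uk", "postcode"): "postal-code",
    ("clientes_br", "CEP"): "postal-code",
    ("customers_us", "phone"): "phone-number",
}

# Variant names a user might type, each pointing at the same subject.
name_to_subject = {
    "zip": "postal-code",
    "zip code": "postal-code",
    "postcode": "postal-code",
    "postal code": "postal-code",
    "cep": "postal-code",
    "phone": "phone-number",
}


def find_columns(query):
    """Return every column about the queried subject, whatever the column is called."""
    subject = name_to_subject.get(query.strip().lower())
    if subject is None:
        return []
    return sorted(col for col, subj in column_subject.items() if subj == subject)


print(find_columns("postal code"))
# [('clientes_br', 'CEP'), ('customers_uk', 'postcode'), ('customers_us', 'zip')]
```

Note that no column had to be renamed: the mapping is recorded once, so a new import only needs its columns pointed at an existing subject rather than another round of normalization.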

Kite SDK 0.15.0

Tuesday, July 29th, 2014

What’s New in Kite SDK 0.15.0? by Ryan Blue.

From the post:

Recently, Kite SDK, the open source toolset that helps developers build systems on the Apache Hadoop ecosystem, became a 0.15.0. In this post, you’ll get an overview of several new features and bug fixes.

Covered by this quick recap:

Working with Datasets by URI

Improved Configuration for MR and Apache Crunch Jobs

Parent POM for Kite Applications

Java Class Hints [more informative error messages]

More Docs and Tutorials

The last addition this release is a new user guide on, where we’re adding new tutorials and background articles. We’ve also updated the examples for the new features, which is a great place to learn more about Kite.

Also, watch this technical webinar on-demand to learn more about working with datasets in Kite.

I think you are going to like this.

Cloudera Live (beta)

Thursday, April 17th, 2014

Cloudera Live (beta)

From the webpage:

Try a live demo of Hadoop, right now.

Cloudera Live is a new way to get started with Apache Hadoop, online. No downloads, no installations, no waiting. Watch tutorial videos and work with real-world examples of the complete Hadoop stack included with CDH, Cloudera’s completely open source Hadoop platform, to:

  • Learn Hue, the Hadoop User Interface developed by Cloudera
  • Query data using popular projects like Apache Hive, Apache Pig, Impala, Apache Solr, and Apache Spark (new!)
  • Develop workflows using Apache Oozie

Great news for people interested in Hadoop!

Question: Will this become the default delivery model for test driving software and training?


How-to: Process Data using Morphlines (in Kite SDK)

Friday, April 11th, 2014

How-to: Process Data using Morphlines (in Kite SDK) by Janos Matyas.

From the post:

SequenceIQ has an Apache Hadoop-based platform and API that consume and ingest various types of data from different sources to offer predictive analytics and actionable insights. Our datasets are structured, unstructured, log files, and communication records, and they require constant refining, cleaning, and transformation.

These datasets come from different sources (industry-standard and proprietary adapters, Apache Flume, MQTT, iBeacon, and so on), so we need a flexible, embeddable framework to support our ETL process chain. Hello, Morphlines! (As you may know, originally the Morphlines library was developed as part of Cloudera Search; eventually, it graduated into the Kite SDK as a general-purpose framework.)

To define a Morphline transformation chain, you need to describe the steps in a configuration file, and the framework will then turn into an in-memory container for transformation commands. Commands perform tasks such as transforming, loading, parsing, and processing records, and they can be linked in a processing chain.

In this blog post, I’ll demonstrate such an ETL process chain containing custom Morphlines commands (defined via config file and Java), and use the framework within MapReduce jobs and Flume. For the sample ETL with Morphlines use case, we have picked a publicly available “million song” dataset from The raw data consist of one JSON file/entry for each track; the dictionary contains the following keywords:

A welcome demonstration of Morphlines but I do wonder about the statement:

Our datasets are structured, unstructured, log files, and communication records, and they require constant refining, cleaning, and transformation. (Emphasis added.)

If you don’t have experience with S3 and this pipeline, it is a good starting point for your investigations.
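For readers who have not met the command-chain idea in the quoted post, here is a rough imitation in plain Python. It is not the Morphlines configuration format or its Java API; the command names and the sample records are hypothetical, and the only point is the shape: each command takes a record and passes a (possibly changed) record to the next, or drops it.

```python
import json


def parse_json(record):
    """Replace the raw line with its parsed fields."""
    return json.loads(record["raw"])


def drop_missing_title(record):
    """Returning None drops the record from the chain."""
    return record if record.get("title") else None


def normalize_artist(record):
    """Trim and lowercase the artist field."""
    return {**record, "artist": record.get("artist", "").strip().lower()}


def run_chain(commands, records):
    """Pass each record through the commands in order, skipping dropped ones."""
    for record in records:
        for command in commands:
            record = command(record)
            if record is None:
                break
        else:
            yield record


# Made-up input, one JSON entry per track.
raw_lines = [
    {"raw": '{"title": "Song A", "artist": "  The Band "}'},
    {"raw": '{"artist": "Nobody"}'},
]

chain = [parse_json, drop_missing_title, normalize_artist]
cleaned = list(run_chain(chain, raw_lines))
print(cleaned)   # [{'title': 'Song A', 'artist': 'the band'}]
```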

Use Parquet with Impala, Hive, Pig, and MapReduce

Saturday, March 22nd, 2014

How-to: Use Parquet with Impala, Hive, Pig, and MapReduce by John Russell.

From the post:

The CDH software stack lets you use your tool of choice with the Parquet file format, offering the benefits of columnar storage at each phase of data processing.

An open source project co-founded by Twitter and Cloudera, Parquet was designed from the ground up as a state-of-the-art, general-purpose, columnar file format for the Apache Hadoop ecosystem. In particular, Parquet has several features that make it highly suited to use with Cloudera Impala for data warehouse-style operations:

  • Columnar storage layout: A query can examine and perform calculations on all values for a column while reading only a small fraction of the data from a data file or table.
  • Flexible compression options: The data can be compressed with any of several codecs. Different data files can be compressed differently. The compression is transparent to applications that read the data files.
  • Innovative encoding schemes: Sequences of identical, similar, or related data values can be represented in ways that save disk space and memory, yet require little effort to decode. The encoding schemes provide an extra level of space savings beyond the overall compression for each data file.
  • Large file size: The layout of Parquet data files is optimized for queries that process large volumes of data, with individual files in the multi-megabyte or even gigabyte range.

Impala can create Parquet tables, insert data into them, convert data from other file formats to Parquet, and then perform SQL queries on the resulting data files. Parquet tables created by Impala can be accessed by Apache Hive, and vice versa.

That said, the CDH software stack lets you use the tool of your choice with the Parquet file format, for each phase of data processing. For example, you can read and write Parquet files using Apache Pig and MapReduce jobs. You can convert, transform, and query Parquet tables through Impala and Hive. And you can interchange data files between all of those components — including ones external to CDH, such as Cascading and Apache Tajo.

In this blog post, you will learn the most important principles involved.
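Two of the properties in the quoted list, the columnar layout and the encoding of repeated values, can be illustrated in a few lines of Python. This is a toy, not Parquet's on-disk format or any Parquet library; the rows are made up, and run-length encoding stands in as one simple example of an encoding that exploits runs of identical values.

```python
from itertools import groupby

# Made-up rows, as they would sit in a row-oriented file such as CSV.
rows = [
    ("2014-03-01", "US", 12),
    ("2014-03-01", "US", 7),
    ("2014-03-01", "BR", 3),
    ("2014-03-02", "BR", 9),
]

# Columnar layout: one sequence per column instead of one tuple per row.
columns = {
    "day": [r[0] for r in rows],
    "country": [r[1] for r in rows],
    "clicks": [r[2] for r in rows],
}

# A query such as SUM(clicks) only has to touch a single column.
total_clicks = sum(columns["clicks"])


def run_length_encode(values):
    """Collapse runs of identical values into [value, count] pairs."""
    return [[value, len(list(run))] for value, run in groupby(values)]


print(total_clicks)                        # 31
print(run_length_encode(columns["day"]))   # [['2014-03-01', 3], ['2014-03-02', 1]]
```

Storing a column contiguously is what makes the runs visible in the first place; in the row-oriented form the repeated dates are interleaved with other fields.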

Since I mentioned ROOT files yesterday, I am curious what you make of the use of Thrift metadata definitions to read Parquet files?

It’s great that data can be documented for reading, but reading doesn’t imply to me that its semantics have been captured.

A wide variety of products can read data; it is less certain that they can document data semantics.


I first saw this in a tweet by Patrick Hunt.

Kite Software Development Kit

Thursday, March 13th, 2014

Kite Software Development Kit

From the webpage:

The Kite Software Development Kit (Apache License, Version 2.0), or Kite for short, is a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.

  • Codifies expert patterns and practices for building data-oriented systems and applications
  • Lets developers focus on business logic, not plumbing or infrastructure
  • Provides smart defaults for platform choices
  • Supports gradual adoption via loosely-coupled modules

Version 0.12.0 was released March 10, 2014.

Do note that unlike some “pattern languages,” these are legitimate patterns, based on expert patterns and practices. (There are “patterns” produced like Uncle Bilius (Harry Potter and the Deathly Hallows, Chapter Eight) after downing a bottle of firewhiskey. You should avoid such patterns.)

Data Science Challenge

Tuesday, March 11th, 2014

Data Science Challenge

Some details from the registration page:

Prerequisite: Data Science Essentials (DS-200)
Schedule: Twice per year
Duration: Three months from launch date
Next Challenge Date: March 31, 2014
Language: English
Price: USD $600

From the webpage:

Cloudera will release a Data Science Challenge twice each year. Each bi-quarterly project is based on a real-world data science problem involving a large data set and is open to candidates for three months to complete. During the open period, candidates may work on their project individually and at their own pace.

Current Data Science Challenge

The new Data Science Challenge: Detecting Anomalies in Medicare Claims will be available starting March 31, 2014, and will cost USD $600.

In the U.S., Medicare reimburses private providers for medical procedures performed for covered individuals. As such, it needs to verify that the type of procedures performed and the cost of those procedures are consistent and reasonable. Finally, it needs to detect possible errors or fraud in claims for reimbursement from providers. You have been hired to analyze a large amount of data from Medicare and try to detect abnormal data — providers, areas, or patients with unusual procedures and/or claims.

Register for the challenge.

Build a Winning Model

CCP candidates compete against each other and against a benchmark set by a committee including some of the world’s elite data scientists. Participants who surpass evaluation benchmarks receive the CCP: Data Scientist credential.

Lead the Field

Those with the highest scores from each Challenge will have an opportunity to share their solutions and promote their work on and via press and social media outlets. All candidates retain the full rights to their own work and may leverage their models outside of the Challenge as they choose.

Useful way to develop some street cred in data science.

Data Scientist Solution Kit

Friday, March 7th, 2014

Data Scientist Solution Kit

From the post:

The explosion of data is leading to new business opportunities that draw on advanced analytics and require a broader, more sophisticated skills set, including software development, data engineering, math and statistics, subject matter expertise, and fluency in a variety of analytics tools. Brought together by data scientists, these capabilities can lead to deeper market insights, more focused product innovation, faster anomaly detection, and more effective customer engagement for the business.

The Data Science Challenge Solution Kit is your best resource to get hands-on experience with a real-world data science challenge in a self-paced, learner-centric environment. The free solution kit includes a live data set, a step-by-step tutorial, and a detailed explanation of the processes required to arrive at the correct outcomes.

Data Science at Your Desk

The Web Analytics Challenge includes five sections that simulate the experience of exploring, then cleaning, and ultimately analyzing web log data. First, you will work through some of the common issues a data scientist encounters with log data and data in JSON format. Second, you will clean and prepare the data for modeling. Third, you will develop an alternate approach to building a classifier, with a focus on data structure and accuracy. Fourth, you will learn how to use tools like Cloudera ML to discover clusters within a data set. Finally, you will select an optimal recommender algorithm and extract ratings predictions using Apache Mahout.

With the ongoing confusion about what it means to be a “data scientist,” having a certification or two isn’t going to hurt your chances for employment.

And you may learn something in the bargain. 😉

CDH 4.6, Cloudera Manager 4.8.2, and Search 1.2

Friday, February 28th, 2014

Announcing: CDH 4.6, Cloudera Manager 4.8.2, and Search 1.2 by Justin Kestelyn.

Mostly bug-fix releases, but now is as good a time as any to upgrade, before you are in crunch mode.

Secrets of Cloudera Support:…

Wednesday, February 26th, 2014

Secrets of Cloudera Support: Inside Our Own Enterprise Data Hub by Adam Warrington.

From the post:

Here at Cloudera, we are constantly pushing the envelope to give our customers world-class support. One of the cornerstones of this effort is the Cloudera Support Interface (CSI), which we’ve described in prior blog posts (here and here). Through CSI, our support team is able to quickly reason about a customer’s environment, search for information related to a case currently being worked, and much more.

In this post, I’m happy to write about a new feature in CSI, which we call Monocle Stack Trace.

Stack Trace Exploration with Search

Hadoop log messages and the stack traces in those logs are critical information in many of the support cases Cloudera handles. We find that our customer operation engineers (COEs) will regularly search for stack traces they find referenced in support cases to try to determine where else that stack trace has shown up, and in what context it would occur. This could be in the many sources we were already indexing as part of Monocle Search in CSI: Apache JIRAs, Apache mailing lists, internal Cloudera JIRAs, internal Cloudera mailing lists, support cases, Knowledge Base articles, Cloudera Community Forums, and the customer diagnostic bundles we get from Cloudera Manager.

It turns out that doing routine document searches for stack traces doesn’t always yield the best results. Stack traces are relatively long compared to normal search terms, so search indexes won’t always return the relevant results in the order you would expect. It’s also hard for a user to churn through the search results to figure out if the stack trace was actually an exact match in the document to figure out how relevant it actually is.

To solve this problem, we took an approach similar to what Google does when it wants to allow searching over a type that isn’t best suited for normal document search (such as images): we created an independent index and search result page for stack-trace searches. In Monocle Stack Trace, the search results show a list of unique stack traces grouped with every source of data in which that unique stack trace was discovered. Each source can be viewed in-line in the search result page, or the user can go to it directly by following a link.

We also give visual hints as to how the stack trace for which the user searched differs from the stack traces that show up in the search results. A green highlighted line in a search result indicates a matching call stack line. Yellow indicates a call stack line that only differs in line number, something that may indicate the same stack trace on a different version of the source code. A screenshot showing the grouping of sources and visual highlighting is below:

See Adam’s post for the details.
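
The highlighting rule in the quoted passage is simple enough to sketch. What follows is my own minimal illustration, not Cloudera's code: the frame pattern, the function names, and the position-by-position comparison are all assumptions, since the post does not say how Monocle parses or aligns stack traces.

```python
import re

# Assumed shape of a Java stack frame line, e.g.
# "at org.example.Foo.bar(Foo.java:42)". Hypothetical; the post does not
# describe the parser Monocle Stack Trace actually uses.
FRAME = re.compile(
    r"^\s*at\s+(?P<method>[\w.$<>]+)"
    r"\((?P<file>[^:()]+)(?::(?P<line>\d+))?\)\s*$"
)


def parse_frame(text):
    """Return (method, file, line) for a frame line, or None if it does not parse."""
    m = FRAME.match(text)
    if m is None:
        return None
    return m.group("method"), m.group("file"), m.group("line")


def classify(query_frame, result_frame):
    """Label one pair of frames using the colors named in the post."""
    q = parse_frame(query_frame)
    r = parse_frame(result_frame)
    if q is None or r is None:
        return "none"
    if q == r:
        return "green"   # identical call stack line
    if q[:2] == r[:2]:
        return "yellow"  # same method and file, different line number
    return "none"


def highlight(query_trace, result_trace):
    """Compare two traces frame by frame (a simplification: no alignment step)."""
    return [classify(q, r) for q, r in zip(query_trace, result_trace)]
```

Feeding it a searched trace and a candidate trace that differ only in one line number should label the first frame green and the second yellow, which is the "same stack trace on a different version of the source code" case the post mentions.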

I like the imaginative modification of standard search.

Not all data is the same, and searching it as if it were leaves a lot of useful data unfound.

Forbes on Graphs

Thursday, February 13th, 2014

Big Data Solutions Through The Combination Of Tools by Ben Lorica.

From the post:

As a user who tends to mix-and-match many different tools, not having to deal with configuring and assembling a suite of tools is a big win. So I’m really liking the recent trend towards more integrated and packaged solutions. A recent example is the relaunch of Cloudera’s Enterprise Data Hub, to include Spark and Spark Streaming. Users benefit by gaining automatic access to analytic engines that come with Spark. Besides simplifying things for data scientists and data engineers, easy access to analytic engines is critical for streamlining the creation of big data applications.

Another recent example is Dendrite – an interesting new graph analysis solution from Lab41. It combines Titan (a distributed graph database), GraphLab (for graph analytics), and a front-end that leverages AngularJS, into a Graph exploration and analysis tool for business analysts:

Another contender in the graph space!

Interesting that Spark comes up a second time today.

Having Forbes notice a technology gives it credence, don’t you think?

I first saw this in a tweet by aurelius.

Write and Run Giraph Jobs on Hadoop

Sunday, February 9th, 2014

Write and Run Giraph Jobs on Hadoop by Mirko Kämpf.

From the post:

Create a test environment for writing and testing Giraph jobs, or just for playing around with Giraph and small sample datasets.

Apache Giraph is a scalable, fault-tolerant implementation of graph-processing algorithms in Apache Hadoop clusters of up to thousands of computing nodes. Giraph is in use at companies like Facebook and PayPal, for example, to help represent and analyze the billions (or even trillions) of connections across massive datasets. Giraph was inspired by Google’s Pregel framework and integrates well with Apache Accumulo, Apache HBase, Apache Hive, and Cloudera Impala.

Currently, the upstream “quick start” document explains how to deploy Giraph on a Hadoop cluster with two nodes running Ubuntu Linux. Although this setup is appropriate for lightweight development and testing, using Giraph with an enterprise-grade CDH-based cluster requires a slightly more robust approach.

In this how-to, you will learn how to use Giraph 1.0.0 on top of CDH 4.x using a simple example dataset, and run example jobs that are already implemented in Giraph. You will also learn how to set up your own Giraph-based development environment. The end result will be a setup (not intended for production) for writing and testing Giraph jobs, or just for playing around with Giraph and small sample datasets. (In future posts, I will explain how to implement your own graph algorithms and graph generators as well as how to export your results to Gephi, the “Adobe Photoshop for graphs”, through Impala and JDBC for further inspection.)

The first in a series of posts on Giraph.

This is great stuff!

It should keep you busy during your first conference call and/or staff meeting on Monday morning.

Monday won’t seem so bad. 😉

Create a Simple Hadoop Cluster with VirtualBox (< 1 Hour)

Wednesday, January 29th, 2014

How-to: Create a Simple Hadoop Cluster with VirtualBox by Christian Javet.

From the post:

I wanted to get familiar with the big data world, and decided to test Hadoop. Initially, I used Cloudera’s pre-built virtual machine with its full Apache Hadoop suite pre-configured (called Cloudera QuickStart VM), and gave it a try. It was a really interesting and informative experience. The QuickStart VM is fully functional and you can test many Hadoop services, even though it is running as a single-node cluster.

I wondered what it would take to install a small four-node cluster…

I did some research and I found this excellent video on YouTube presenting a step by step explanation on how to setup a cluster with VMware and Cloudera. I adapted this tutorial to use VirtualBox instead, and this article describes the steps used.

Watch for the line:

Overall we will allocate 14GB of memory, so ensure that the host machine has sufficient memory, otherwise this will impact your experience negatively.

Yes, “…impact your experience negatively.”



The Cloudera Developer Newsletter: It’s For You!

Friday, January 10th, 2014

The Cloudera Developer Newsletter: It’s For You! by Justin Kestelyn.

From the post:

Developers and data scientists, we realize you’re special – as are operators and analysts, in their own particular ways.

For that reason, we are very happy to kick off 2014 with a new free service designed for you and other technical end-users in the Cloudera ecosystem: the Cloudera Developer Newsletter.

This new email-based newsletter contains links to a curated list of new how-to’s, docs, tools, engineer and community interviews, training, projects, conversations, videos, and blog posts to help you get a new Apache Hadoop-based enterprise data hub deployment off the ground, or get the most value out of an existing deployment. Look for a new issue every month!

All you have to do is click the button below, provide your name and email address, tick the “Developer Community” check-box, and submit. Done! (Of course, you can also opt-in to several other communication channels if you wish.)

The first newsletter is due to appear at the end of January, 2014.

Given the quality of other Cloudera resources, I look forward to this newsletter!

Impala v Hive

Sunday, December 22nd, 2013

Impala v Hive by Mike Olson.

From the post:

We introduced Cloudera Impala more than a year ago. It was a good launch for us — it made our platform better in ways that mattered to our customers, and it’s allowed us to win business that was previously unavailable because earlier products simply couldn’t tackle interactive SQL workloads.

As a side effect, though, that launch ignited fierce competition among vendors for SQL market share in the Apache Hadoop ecosystem, with claims and counter-claims flying. Chest-beating on performance abounds (and we like our numbers pretty well), but I want to approach the matter from a different direction here.

I get asked all the time about Cloudera’s decision to develop Impala from the ground up as a new project, rather than improving the existing Apache Hive project. If there’s existing code, the thinking goes, surely it’s best to start there — right?

Well, no. We thought long and hard about it, and we concluded that the best thing to do was to create a new open source project, designed on different principles from Hive. Impala is that system. Our experiences over the last year increase our conviction on that strategy.

Let me walk you through our thinking.

Mike makes a very good argument for building Impala.

Whether you agree with it or not, it centers on requirements and users.

I won’t preempt his argument here, but suffice it to say that Cloudera saw the need for robust SQL support over Hadoop data stores and weighed user demand for a language like SQL against demand for a newer language like Pig.

Personally, I found it refreshing for someone to explicitly consider user habits, as opposed to taking a “…users need to learn the right way (my way) to query/store/annotate data…” approach.

You know the outcome; now go read the reasons Cloudera made the decisions it did.