Archive for the ‘Impala’ Category

The Impala Cookbook

Thursday, December 11th, 2014

The Impala Cookbook by Justin Kestelyn.

From the post:

Impala, the open source MPP analytic database for Apache Hadoop, is now firmly entrenched in the Big Data mainstream. How do we know this? For one, Impala is now the standard against which alternatives measure themselves, based on a proliferation of new benchmark testing. Furthermore, Impala has been adopted by multiple vendors as their solution for letting customers do exploratory analysis on Big Data, natively and in place (without the need for redundant architecture or ETL). Also significant, we’re seeing the emergence of best practices and patterns out of customer experiences.

As an effort to streamline deployments and shorten the path to success, Cloudera’s Impala team has compiled a “cookbook” based on those experiences, covering:

  • Physical and Schema Design
  • Memory Usage
  • Cluster Sizing and Hardware Recommendations
  • Benchmarking
  • Multi-tenancy Best Practices
  • Query Tuning Basics
  • Interaction with Apache Hive, Apache Sentry, and Apache Parquet

By using these recommendations, Impala users will be assured of proper configuration, sizing, management, and measurement practices to provide an optimal experience. Happy cooking!

I must confess to some confusion when I first read Justin’s post. I thought the slide set was a rather long description of the cookbook and not the cookbook itself. I was searching for the cookbook and kept finding the slides. 😉

Oh, the slides are very much worth your time but I would reserve the term “cookbook” for something a bit more substantive.

Then again, O’Reilly thinks a few more than 800 responses constitute a “survey” of data scientists. Those results are free of any mention of Impala. Another reason to use that “survey” with caution.

Open-sourcing tools for Hadoop

Saturday, November 22nd, 2014

Open-sourcing tools for Hadoop by Colin Marc.

From the post:

Stripe’s batch data infrastructure is built largely on top of Apache Hadoop. We use these systems for everything from fraud modeling to business analytics, and we’re open-sourcing a few pieces today:

Timberlake

Timberlake is a dashboard that gives you insight into the Hadoop jobs running on your cluster. Jeff built it as a replacement for YARN’s ResourceManager and MRv2’s JobHistory server, and it has some features we’ve found useful:

  • Map and reduce task waterfalls and timing plots
  • Scalding and Cascading awareness
  • Error tracebacks for failed jobs

Brushfire

Avi wrote a Scala framework for distributed learning of ensemble decision tree models called Brushfire. It’s inspired by Google’s PLANET, but built on Hadoop and Scalding. Designed to be highly generic, Brushfire can build and validate random forests and similar models from very large amounts of training data.

Sequins

Sequins is a static database for serving data in Hadoop’s SequenceFile format. I wrote it to provide low-latency access to key/value aggregates generated by Hadoop. For example, we use it to give our API access to historical fraud modeling features, without adding an online dependency on HDFS.

Herringbone

At Stripe, we use Parquet extensively, especially in tandem with Cloudera Impala. Danielle, Jeff, and Avi wrote Herringbone (a collection of small command-line utilities) to make working with Parquet and Impala easier.
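The Sequins idea above – low-latency reads over static, pre-built Hadoop output, with no online HDFS dependency – can be sketched in a few lines. This is an illustrative Python toy, not Sequins itself (which is written in Go and reads SequenceFiles): sort the key/value pairs once at “build” time, then serve lookups with binary search.

```python
import bisect

class StaticKVStore:
    """Read-only key/value store over a pre-sorted list of (key, value) pairs.

    Illustrative sketch only: Sequins reads Hadoop SequenceFiles and serves
    lookups over HTTP, but the core idea is the same – data is baked at
    build time, so serving requires no live Hadoop dependency.
    """

    def __init__(self, pairs):
        self._pairs = sorted(pairs)               # sort once at build time
        self._keys = [k for k, _ in self._pairs]  # parallel key index

    def get(self, key, default=None):
        i = bisect.bisect_left(self._keys, key)   # O(log n) lookup
        if i < len(self._keys) and self._keys[i] == key:
            return self._pairs[i][1]
        return default

# Hypothetical fraud-modeling features, as in Stripe's example use case:
store = StaticKVStore([("card:123", {"fraud_score": 0.02}),
                       ("card:456", {"fraud_score": 0.91})])
print(store.get("card:456"))  # {'fraud_score': 0.91}
```

Rebuilding the whole store on each Hadoop run, rather than mutating it in place, is what keeps the serving path simple and predictable.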

More open source tools for your Hadoop installation!

I am considering creating a list of closed source tools for Hadoop. It would be shorter and easier to maintain than a list of open source tools for Hadoop. 😉

6X Performance with Impala

Sunday, May 4th, 2014

In-memory Caching in HDFS: Lower latency, same great taste by Andrew Wang.

From the post:

My coworker Colin McCabe and I recently gave a talk at Hadoop Summit Amsterdam titled “In-memory Caching in HDFS: Lower latency, same great taste.” I’m very pleased with how this feature turned out, since it was approximately a year-long effort going from initial design to production system. Combined with Impala, we showed up to a 6x performance improvement by running on cached data, and that number will only improve with time. Slides and video of our presentation are available online.

Finding data the person who signs the checks will be interested in seeing with 6X performance is left as an exercise for the reader. 😉

Cloudera Live (beta)

Thursday, April 17th, 2014

Cloudera Live (beta)

From the webpage:

Try a live demo of Hadoop, right now.

Cloudera Live is a new way to get started with Apache Hadoop, online. No downloads, no installations, no waiting. Watch tutorial videos and work with real-world examples of the complete Hadoop stack included with CDH, Cloudera’s completely open source Hadoop platform, to:

  • Learn Hue, the Hadoop User Interface developed by Cloudera
  • Query data using popular projects like Apache Hive, Apache Pig, Impala, Apache Solr, and Apache Spark (new!)
  • Develop workflows using Apache Oozie

Great news for people interested in Hadoop!

Question: Will this become the default delivery model for test driving software and training?

Enjoy!

Use Parquet with Impala, Hive, Pig, and MapReduce

Saturday, March 22nd, 2014

How-to: Use Parquet with Impala, Hive, Pig, and MapReduce by John Russell.

From the post:

The CDH software stack lets you use your tool of choice with the Parquet file format, offering the benefits of columnar storage at each phase of data processing.

An open source project co-founded by Twitter and Cloudera, Parquet was designed from the ground up as a state-of-the-art, general-purpose, columnar file format for the Apache Hadoop ecosystem. In particular, Parquet has several features that make it highly suited to use with Cloudera Impala for data warehouse-style operations:

  • Columnar storage layout: A query can examine and perform calculations on all values for a column while reading only a small fraction of the data from a data file or table.
  • Flexible compression options: The data can be compressed with any of several codecs. Different data files can be compressed differently. The compression is transparent to applications that read the data files.
  • Innovative encoding schemes: Sequences of identical, similar, or related data values can be represented in ways that save disk space and memory, yet require little effort to decode. The encoding schemes provide an extra level of space savings beyond the overall compression for each data file.
  • Large file size: The layout of Parquet data files is optimized for queries that process large volumes of data, with individual files in the multi-megabyte or even gigabyte range.

Impala can create Parquet tables, insert data into them, convert data from other file formats to Parquet, and then perform SQL queries on the resulting data files. Parquet tables created by Impala can be accessed by Apache Hive, and vice versa.

That said, the CDH software stack lets you use the tool of your choice with the Parquet file format, for each phase of data processing. For example, you can read and write Parquet files using Apache Pig and MapReduce jobs. You can convert, transform, and query Parquet tables through Impala and Hive. And you can interchange data files between all of those components — including ones external to CDH, such as Cascading and Apache Tajo.

In this blog post, you will learn the most important principles involved.
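The “innovative encoding schemes” mentioned above can be illustrated with run-length encoding: collapsing runs of identical values into (value, count) pairs. This is a simplified sketch of the idea, not Parquet’s actual RLE/bit-packed hybrid encoding:

```python
def rle_encode(values):
    """Collapse runs of identical values into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return [(v, n) for v, n in runs]

def rle_decode(runs):
    """Expand (value, count) pairs back into the original sequence."""
    return [v for v, n in runs for _ in range(n)]

# A low-cardinality column – common in warehouse-style tables – shrinks
# from thousands of entries to a handful of pairs:
column = ["US"] * 1000 + ["CA"] * 500 + ["MX"] * 2
encoded = rle_encode(column)
assert rle_decode(encoded) == column
print(encoded)  # [('US', 1000), ('CA', 500), ('MX', 2)]
```

Columnar layout is what makes this pay off: values of one column sit next to each other on disk, so runs of identical or similar values are long and the encoding bites.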

Since I mentioned ROOT files yesterday, I am curious: what do you make of the use of Thrift metadata definitions to read Parquet files?

It’s great that data can be documented for reading, but reading doesn’t imply to me that its semantics have been captured.

A wide variety of products read data; I am less certain they can document data semantics.

You?

I first saw this in a tweet by Patrick Hunt.

Impala v Hive

Sunday, December 22nd, 2013

Impala v Hive by Mike Olson.

From the post:

We introduced Cloudera Impala more than a year ago. It was a good launch for us — it made our platform better in ways that mattered to our customers, and it’s allowed us to win business that was previously unavailable because earlier products simply couldn’t tackle interactive SQL workloads.

As a side effect, though, that launch ignited fierce competition among vendors for SQL market share in the Apache Hadoop ecosystem, with claims and counter-claims flying. Chest-beating on performance abounds (and we like our numbers pretty well), but I want to approach the matter from a different direction here.

I get asked all the time about Cloudera’s decision to develop Impala from the ground up as a new project, rather than improving the existing Apache Hive project. If there’s existing code, the thinking goes, surely it’s best to start there — right?

Well, no. We thought long and hard about it, and we concluded that the best thing to do was to create a new open source project, designed on different principles from Hive. Impala is that system. Our experiences over the last year increase our conviction on that strategy.

Let me walk you through our thinking.

Mike makes a very good argument for building Impala.

Whether you agree with it or not, it centers on requirements and users.

I won’t preempt his argument here but suffice it to say that Cloudera saw the need for robust SQL support over Hadoop data stores and estimated user demand for a language like SQL versus a newer language like Pig.

Personally I found it refreshing for someone to explicitly consider user habits as opposed to a “…users need to learn the right way (my way) to query/store/annotate data…” type approach.

You know the outcome, now go read the reasons Cloudera made the decisions it did.

CDH 4.5, Manager 4.8, Impala 1.2.1, Search 1.1

Tuesday, November 26th, 2013

Announcing: CDH 4.5, Cloudera Manager 4.8, Cloudera Impala 1.2.1, and Cloudera Search 1.1

Before your nieces and nephews (at least in the U.S.) start chewing up your bandwidth over the Thanksgiving Holidays, you may want to grab the most recent releases from Cloudera.

If you are traveling, it will give you something to do during airport delays. 😉

Integrating R with Cloudera Impala…

Monday, November 25th, 2013

Integrating R with Cloudera Impala for Real-Time Queries on Hadoop by Istvan Szegedi.

From the post:

Cloudera Impala supports low-latency, interactive queries on Hadoop data sets stored either in the Hadoop Distributed File System (HDFS) or in HBase, the distributed NoSQL database for Hadoop. Impala’s notion is to use Hadoop as a storage engine but move away from MapReduce algorithms. Instead, Impala uses distributed queries, a concept inherited from massively parallel processing databases. As a result, Impala supports a SQL-like query language (in the same way as Apache Hive), but can execute queries 10-100 times faster than Hive, which converts them into MapReduce. You can find more details on Impala in one of the previous posts.

R is one of the most popular open source packages for statistical computing and graphics. It can work with various data sources, from comma-separated files to web content referred to by URLs, to relational databases, to NoSQL stores (e.g. MongoDB or Cassandra) and Hadoop.

Thanks to the generic Impala ODBC driver, R can be integrated with Impala, too. The solution will provide fast, interactive queries running on top of Hadoop data sets and then the data can be further processed or visualized within R.
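Istvan’s walkthrough hinges on registering a DSN for the Impala ODBC driver. A sketch of what the odbc.ini stanza might look like on Linux follows; the driver path, host name, and port are illustrative assumptions, so consult the driver’s own documentation for the actual settings:

```ini
[ImpalaDSN]
; All values below are illustrative - adjust to your installation.
Description = Cloudera Impala via ODBC
Driver      = /opt/cloudera/impalaodbc/lib/64/libclouderaimpalaodbc64.so
HOST        = impala-host.example.com
PORT        = 21050
```

On the R side, the connection is then typically opened by DSN name (e.g. RODBC’s odbcConnect("ImpalaDSN")), after which queries run against Impala and return ordinary R data frames.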

Have you noticed that newer technologies (Hadoop) are becoming accessible to more traditional tools (R)?

Which will move traditional tool users towards newer technologies.

The combining of the new with the traditional has a distinct odor.

I think it is called success. 😉

Use MADlib Pre-built Analytic Functions….

Wednesday, October 30th, 2013

How-to: Use MADlib Pre-built Analytic Functions with Impala by Victor Bittorf.

From the post:

Cloudera Impala is an exciting project that unlocks interactive queries and SQL analytics on big data. Over the past few months I have been working with the Impala team to extend Impala’s analytic capabilities. Today I am happy to announce the availability of pre-built mathematical and statistical algorithms for the Impala community under a free open-source license. These pre-built algorithms combine recent theoretical techniques for shared nothing parallelization for analytics and the new user-defined aggregations (UDA) framework in Impala 1.2 in order to achieve big data scalability. This initial release has support for logistic regression, support vector machines (SVMs), and linear regression.

Having recently completed my master’s degree while working in the database systems group at the University of Wisconsin-Madison, I’m excited to work with the Impala team on this project while I continue my research as a visiting student at Stanford. I’m going to go through some details about what we’ve implemented and how to use it.

As interest in data analytics increases, there is growing demand for deploying analytic algorithms in enterprise systems. One approach that has received much attention from researchers, engineers and data scientists is the integration of statistical data analysis into databases. One example of this is MADlib, which leverages the data-processing capabilities of an RDBMS to analyze data.

Victor walks through several examples of data analytics but for those of you who want to cut to the chase:

This package uses UDAs and UDFs when training and evaluating analytic models. While all of these tasks can be done in pure SQL using the Impala shell, we’ve put together some front-end scripts to streamline the process. The source code for the UDAs, UDFs, and scripts are all on GitHub.
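The shared-nothing parallelization behind these UDAs follows an update/merge/finalize pattern: each node folds its partition of the data into a small partial state, the partial states are merged, and the model is read off the merged state. Here is a toy Python sketch of that pattern for simple linear regression via sufficient statistics; it is my illustration of the idea, not the actual Impala UDA code:

```python
def update(state, x, y):
    """Fold one (x, y) observation into a partial state of sufficient stats."""
    n, sx, sy, sxx, sxy = state
    return (n + 1, sx + x, sy + y, sxx + x * x, sxy + x * y)

def merge(a, b):
    """Combine two partial states - the step that makes this shared-nothing."""
    return tuple(u + v for u, v in zip(a, b))

def finalize(state):
    """Solve least squares for y = slope * x + intercept from merged stats."""
    n, sx, sy, sxx, sxy = state
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return slope, (sy - slope * sx) / n

ZERO = (0, 0.0, 0.0, 0.0, 0.0)

# Two "nodes", each scanning its own partition of points on y = 2x + 1:
partitions = [[(1.0, 3.0), (2.0, 5.0)], [(3.0, 7.0), (4.0, 9.0)]]
partials = []
for part in partitions:
    s = ZERO
    for x, y in part:
        s = update(s, x, y)
    partials.append(s)

merged = merge(partials[0], partials[1])
print(finalize(merged))  # (2.0, 1.0)
```

Because merge is associative and commutative, partitions can be aggregated in any order across any number of nodes, which is exactly the property a UDA needs to scale out.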

Usual cautions apply: The results of your script or model may or may not have any resemblance to “facts” as experienced by others.

Visualization on Impala: Big, Real-Time, and Raw

Tuesday, August 20th, 2013

Visualization on Impala: Big, Real-Time, and Raw by Justin Kestelyn.

From the post:

What if you could affordably manage billions of rows of raw Big Data and let typical business people analyze it at the speed of thought in beautiful, interactive visuals? What if you could do all the above without worrying about structuring that data in a data warehouse schema, moving it, and pre-defining reports and dashboards? With the approach I’ll describe below, you can.

The traditional Apache Hadoop approach — in which you store all your data in HDFS and do batch processing through MapReduce — works well for data geeks and data scientists, who can write MapReduce jobs and wait hours for them to run before asking the next question. But many businesses have never even heard of Hadoop, don’t employ a data scientist, and want their data questions answered in a second or two — not in hours.

We at Zoomdata, working with the Cloudera team, have figured out how to make Big Data simple, useful, and instantly accessible across an organization, with Cloudera Impala being a key element. Zoomdata is a next-generation user interface for data, and addresses streams of data as opposed to sets. Zoomdata performs continuous math across data streams in real-time to drive visualizations on touch, gestural, and legacy web interfaces. As new data points come in, it re-computes their values and turns them into visuals in milliseconds.

To handle historical data, Zoomdata re-streams the historical raw data through the same stream-processing engine, the same way you’d rewind a television show on your home DVR. The amount of the data involved can grow rapidly, so the ability to crunch billions of rows of raw data in a couple of seconds is important, which is where Impala comes in.

With Impala on top of raw HDFS data, we can run flights of tiny queries, each to do a tiny fraction of the overall work. Zoomdata adds the ability to process the resulting stream of micro-result sets instead of processing the raw data. We call this approach “micro-aggregate delegation”; it enables users to see results immediately, allowing for instantaneous analysis of arbitrarily large amounts of raw data. The approach also allows for joining micro-aggregate streams from disparate Hadoop, NoSQL, and legacy sources together while they are in-flight, an approach we call the “Death Star Join” (more on that in a future blog post).

The demo below shows how this works, by visualizing a dataset of 1 billion raw records per day nearly instantaneously, with no pre-aggregation, no indexing, no database, no star schema, no pre-built reports, and no data movement — just billions of rows of raw data in HDFS with Impala and Zoomdata on top.
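The “micro-aggregate delegation” idea – many tiny queries, each returning a partial aggregate that the presentation layer folds together as results arrive – can be sketched in a few lines. This is my illustration of the merging step only; the real Zoomdata/Impala pipeline is far more involved:

```python
from collections import Counter

def merge_micro_aggregates(result_sets):
    """Merge a stream of per-query partial counts into one running total.

    Each "micro-result set" stands in for the answer to one tiny Impala
    query over a slice of the raw data. Because merging counts is
    associative, results can be folded in as they arrive, so users see
    a picture immediately that sharpens as more slices report in.
    """
    total = Counter()
    for partial in result_sets:
        total.update(partial)   # add this slice's counts to the total
    return total

# Hypothetical partial counts from three query slices:
micro_results = [
    {"fraud": 3, "ok": 997},
    {"fraud": 1, "ok": 999},
    {"fraud": 5, "ok": 995},
]
print(merge_micro_aggregates(micro_results))
# Counter({'ok': 2991, 'fraud': 9})
```

The same associativity is what would let micro-aggregate streams from disparate sources be joined while in flight, as the post’s “Death Star Join” teaser suggests.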

The demo is very impressive! A must see.

The riff that Zoomdata is a “DVR” for data will resonate with many users.

My only caveat is caution with regard to the cleanliness of your data. The demo presumes that the underlying data is clean and the relationships displayed are relevant to the user’s query.

Neither of those assumptions may be correct in your particular case. Not the fault of Zoomdata because no software can correct a poor choice of data for analysis.

See Zoomdata.

…Sentry: Fine-Grained Authorization for Impala and Apache Hive

Wednesday, July 24th, 2013

…Sentry: Fine-Grained Authorization for Impala and Apache Hive

From the post:

Cloudera, the leader in enterprise analytic data management powered by Apache Hadoop™, today unveiled the next step in the evolution of enterprise-class big data security, introducing Sentry: a new Apache licensed open source project that delivers the industry’s first fine-grained authorization framework for Hadoop. An independent security module that integrates with open source SQL query engines Apache Hive and Cloudera Impala, Sentry delivers advanced authorization controls to enable multi-user applications and cross-functional processes for enterprise datasets. This level of granular control, available for the first time in Hadoop, is imperative to meet enterprise Role Based Access Control (RBAC) requirements of highly regulated industries, like healthcare, financial services and government. Sentry alleviates the security concerns that have prevented some organizations from opening Hadoop data systems to a more diverse set of users, extending the power of Hadoop and making it suitable for new industries, organizations and enterprise use cases. Concurrently, the company confirmed it plans to submit the Sentry security module to the Apache Incubator at the Apache Software Foundation later this year.

Welcome news but I could not bring myself to include all the noise words in the press release title. 😉

For technical details, see: http://cloudera.com/content/cloudera/en/Campaign/introducing-sentry.html.

Just a word of advice: This doesn’t “solve” big data security issues. It is one aspect of big data security.

Another aspect of big data security is not allowing people to bring in and leave your facility with magnetic media. Ever.

Not to mention using glue to permanently close all USB ports and CD/DVD drives.

There is always tension between how much security do you need versus the cost and inconvenience.

Another form of security: Have your supervisor’s approval in writing for deviations from known “good” security practices.

Impala 1.0

Wednesday, May 1st, 2013

Impala 1.0: Industry’s First Production-Ready SQL-on-Hadoop Solution

From the post:

Cloudera, the category leader that sets the standard for Apache Hadoop in the enterprise, today announced the general availability of Cloudera Impala, its open source, interactive SQL query engine for analyzing data stored in Hadoop clusters in real time. Cloudera was first-to-market with its SQL-on-Hadoop offering, releasing Impala to open source as a public beta offering in October 2012. Since that time, it has worked closely with customers and open source users, rigorously testing and refining the platform in real world applications to deliver today’s production-hardened and customer validated release, designed from the ground-up for enterprise workloads. The company noted that adoption of the platform has been strong: over 40 enterprise customers and open source users are using Impala today, including 37signals, Expedia, Six3 Systems, Stripe, and Trion Worlds. With its 1.0 release, Impala extends Cloudera’s unified Platform for Big Data, which is designed specifically to bring different computation frameworks and applications to a single pool of data, using a common set of system resources.

The bigger data pools get, the more opportunity there is for semantic confusion.

Or to put that more positively, the greater the market for tools to lessen or avoid semantic confusion.

😉

What’s New in Hue 2.3

Sunday, April 28th, 2013

What’s New in Hue 2.3

From the post:

We’re very happy to announce the 2.3 release of Hue, the open source Web UI that makes Apache Hadoop easier to use.

Hue 2.3 comes only two months after 2.2 but contains more than 100 improvements and fixes. In particular, two new apps were added (including an Apache Pig editor) and the query editors are now easier to use.

Here’s the new features list:

  • Pig Editor: new application for editing and running Apache Pig scripts with UDFs and parameters
  • Table Browser: new application for managing Apache Hive databases, viewing table schemas and sample of content
  • Apache Oozie Bundles are now supported
  • SQL highlighting and auto-completion for Hive/Impala apps
  • Multi-query and highlight/run a portion of a query
  • Job Designer was totally restyled and now supports all Oozie actions
  • Oracle databases (11.2 and later) are now supported

Time to upgrade!

Cloudera Impala: A Modern SQL Engine for Hadoop [Webinar – 10 Jan 2013]

Wednesday, January 9th, 2013

Cloudera Impala: A Modern SQL Engine for Hadoop

From the post:

Join us for this technical deep dive about Cloudera Impala, the project that makes scalable parallel database technology available to the Hadoop community for the first time. Impala is an open-sourced code base that allows users to issue low-latency queries to data stored in HDFS and Apache HBase using familiar SQL operators.

Presenter Marcel Kornacker, creator of Impala, will begin with an overview of Impala from the user’s perspective, followed by an overview of Impala’s architecture and implementation, and will conclude with a comparison of Impala with Apache Hive, commercial MapReduce alternatives and traditional data warehouse infrastructure.

Looking forward to the comparison part. Picking the right tool for a job is an important first step.

Impala Beta (0.3) + Cloudera Manager 4.1.2 [Get ’em While They’re Hot!]

Wednesday, December 5th, 2012

Cloudera Impala Beta (version 0.3) and Cloudera Manager 4.1.2 Now Available by Vinithra Varadharajan.

If you are keeping your Hadoop ecosystem skills up to date, drop by Cloudera for the latest Impala beta and a new release of Cloudera Manager.

Vinithra reports that new releases of Impala are going to drop every two to four weeks.

You can either wait for the final release of Impala or read along and contribute to the final product with your testing and comments.

BioInformatics: A Data Deluge with Hadoop to the Rescue

Monday, November 19th, 2012

BioInformatics: A Data Deluge with Hadoop to the Rescue by Marty Lurie.

From the post:

Cloudera Cofounder and Chief Scientist Jeff Hammerbacher is leading a revolutionary project with Mount Sinai School of Medicine to apply the power of Cloudera’s Big Data platform to critical problems in predicting and understanding the process and treatment of disease.

“We are at the cutting edge of disease prevention and treatment, and the work that we will do together will reshape the landscape of our field,” said Dennis S. Charney, MD, Anne and Joel Ehrenkranz Dean, Mount Sinai School of Medicine and Executive Vice President for Academic Affairs, The Mount Sinai Medical Center. “Mount Sinai is thrilled to join minds with Cloudera.” (Please see http://www.cloudera.com/content/cloudera/en/about/press-center/press-releases/release.html?ReleaseID=1747809 for more details.)

Cloudera is active in many other areas of BioInformatics. Due to Cloudera’s market leadership in Big Data, many DNA mapping programs have specific installation instructions for CDH (Cloudera’s 100% open-source, enterprise-ready distribution of Hadoop and related projects). But rather than just tell you about Cloudera let’s do a worked example of BioInformatics data – specifically FAERS.

A sponsored piece by Cloudera but walks you through using Impala with the FDA data on adverse drug reactions.

Demonstrates getting started with Impala isn’t hard. Which is true.

What’s lacking is a measure of the difficulty of getting good results.

Any old result, good or bad, probably isn’t of interest to most users.

Cloudera Impala – Fast, Interactive Queries with Hadoop

Wednesday, November 14th, 2012

Cloudera Impala – Fast, Interactive Queries with Hadoop by Istvan Szegedi.

From the post:

As discussed in the previous post about Twitter’s Storm, Hadoop is a batch-oriented solution that lacks support for ad-hoc, real-time queries. Many of the players in Big Data have realised the need for fast, interactive queries besides the traditional Hadoop approach. Cloudera, one of the key solution vendors in the Big Data/Hadoop domain, has just recently launched Cloudera Impala, which addresses this gap.

As the Cloudera engineering team described in their blog, their work was inspired by Google’s Dremel paper, which is also the basis for Google BigQuery. Cloudera Impala provides a HiveQL-like query language for a wide variety of SELECT statements with WHERE, GROUP BY, and HAVING clauses, and with ORDER BY (though currently LIMIT is mandatory with ORDER BY), joins (LEFT, RIGHT, FULL, OUTER, INNER), UNION ALL, external tables, etc. It also supports arithmetic and logical operators and Hive built-in functions such as COUNT, SUM, LIKE, IN, or BETWEEN. It can access data stored on HDFS, but it does not use MapReduce; instead it is based on its own distributed query engine.

The current Impala release (Impala 1.0beta) does not support DDL statements (CREATE, ALTER, DROP TABLE), all the table creation/modification/deletion functions have to be executed via Hive and then refreshed in Impala shell.

Cloudera Impala is open source under the Apache License; the code can be retrieved from GitHub. Its components are written in C++, Java, and Python.

Will get you off to a good start with Impala.

Impala: Real-time Queries in Hadoop [Recorded Webinar]

Wednesday, November 7th, 2012

Impala: Real-time Queries in Hadoop

From the description:

Learn how Cloudera Impala empowers you to:

  1. Perform interactive, real-time analysis directly on source data stored in Hadoop
  2. Interact with data in HDFS and HBase at the “speed of thought”
  3. Reduce data movement between systems & eliminate double storage

You can also grab the slides here.

Almost fifty-nine minutes.

The speedup over Hive on MapReduce is reported to be 4-30X.

I’m not sure about #2 above but then I lack a skull-jack. Maybe this next Christmas. 😉

Cloudera’s Impala and the Semantic “Mosh Pit”

Thursday, October 25th, 2012

Cloudera’s Impala tool binds Hadoop with business intelligence apps by Christina Farr.

From the post:

In traditional circles, Hadoop is viewed as a bright but unruly problem child.

Indeed, it is still in the nascent stages of development. However the scores of “big data” startups that leverage Hadoop will tell you that it is here to stay.

Cloudera, the venture-backed startup that ushered the mainstream deployment of Hadoop, has unveiled a new technology at the Hadoop World, the data-focused conference in New York.

Its new product, known as “Impala”, addresses many of the concerns that large enterprises still have about Hadoop, namely that it does not integrate well with traditional business intelligence applications.

“We have heard this criticism,” said Charles Zedlewski, Cloudera’s VP of Product in a phone interview with VentureBeat. “That’s why we decided to do something about it,” he said.

Impala enables its users to store vast volumes of unwieldy data and run queries in HBase, Hadoop’s NoSQL database. What’s interesting is that it is built to maximise speed: it runs on top of Hadoop storage, but speaks SQL and works with pre-existing drivers.

Legacy data is a well known concept.

Are we approaching the point of legacy applications? Applications that are too widely/deeply embedded in IT infrastructure to be replaced?

Or at least not replaced quickly?

The semantics of legacy data are known to be fair game for topic maps. Do the semantics of legacy applications offer the same possibilities?

Mapping the semantics of “legacy” applications, their ancestors and descendants, data, legacy and otherwise, results in a semantic mosh pit.

Some strategies for a semantic “mosh pit:”

  1. Prohibit it (we know the success rate on that option)
  2. Ignore it (costly but more “successful” than #1)
  3. Create an app on top of the legacy app (an error repeated isn’t an error, it’s following precedent)
  4. Sample it (but what are you missing?)
  5. Map it (being mindful of cost/benefit)

Which one are you going to choose?