The Impala Cookbook by Justin Kestelyn.
From the post:
Impala, the open source MPP analytic database for Apache Hadoop, is now firmly entrenched in the Big Data mainstream. How do we know this? For one, Impala is now the standard against which alternatives measure themselves, based on a proliferation of new benchmark testing. Furthermore, Impala has been adopted by multiple vendors as their solution for letting customers do exploratory analysis on Big Data, natively and in place (without the need for redundant architecture or ETL). Also significant, we’re seeing the emergence of best practices and patterns out of customer experiences.
As an effort to streamline deployments and shorten the path to success, Cloudera’s Impala team has compiled a “cookbook” based on those experiences, covering:
- Physical and Schema Design
- Memory Usage
- Cluster Sizing and Hardware Recommendations
- Multi-tenancy Best Practices
- Query Tuning Basics
- Interaction with Apache Hive, Apache Sentry, and Apache Parquet
By using these recommendations, Impala users will be assured of proper configuration, sizing, management, and measurement practices to provide an optimal experience. Happy cooking!
I must confess to some confusion when I first read Justin’s post. I thought the slide set was a rather long description of the cookbook and not the cookbook itself. I was searching for the cookbook and kept finding the slides. 😉
Oh, the slides are very much worth your time, but I would reserve the term “cookbook” for something a bit more substantive.
Then again, O’Reilly thinks a few more than 800 responses constitute a “survey” of data scientists, and the results of that survey are free from any mention of Impala. One more reason to use that “survey” with caution.