Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

August 3, 2012

Column Statistics in Hive

Filed under: Cloudera, Hive, Merging, Statistics — Patrick Durusau @ 2:48 pm

Column Statistics in Hive by Shreepadma Venugopalan.

From the post:

Over the last couple of months the Hive team at Cloudera has been working hard to bring a bunch of exciting new features to Hive. In this blog post, I’m going to talk about one such feature – Column Statistics in Hive – and how Hive’s query processing engine can benefit from it. The feature is currently a work in progress but we expect it to be available for review imminently.

Motivation

While there are many possible execution plans for a query, some plans are more efficient than others. The query optimizer is responsible for generating an efficient execution plan for a given SQL query from the space of all possible plans. Currently, Hive’s query optimizer uses rules of thumb to generate an efficient execution plan for a query. While such rule-of-thumb optimizations transform the query plan into a more efficient one, the resulting plan is not always the most efficient execution plan.

In contrast, the query optimizer in a traditional RDBMS is cost based; it uses the statistical properties of the input column values to estimate the cost of alternative query plans and chooses the plan with the lowest cost. The cost model for query plans assigns an estimated execution cost to each plan, based on the CPU and I/O costs of query execution for every operator in the plan.

As an example, consider a query that represents a join among {A, B, C} with the predicate {A.x == B.x == C.x}. Assume table A has 500 records, table B has 6000 records, and table C has 1000 records. In the absence of cost based query optimization, the system picks the join order specified by the user. In our example, let us further assume that joining A and B yields 2000 records and joining A and C yields 50 records. Hence the cost of performing the join between A, B and C, without join reordering, is the cost of joining A and B plus the cost of joining the output of A Join B with C. In our example this results in a cost of (500 * 6000) + (2000 * 1000). On the other hand, a cost based optimizer (CBO) in an RDBMS would pick the more efficient order [(A Join C) Join B], resulting in a cost of (500 * 1000) + (50 * 6000). However, in order to pick the more efficient join order, the CBO needs cardinality estimates on the join column.
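
To make the arithmetic concrete, here is that example written out as SQL. The tables A, B, C and join column x are the post’s hypothetical example, and the cost figures in the comments simply restate the numbers above under the post’s simple row-count cost model.

-- Join order as written, without reordering: (A JOIN B) JOIN C
-- cost = (500 * 6000) + (2000 * 1000) = 5,000,000
SELECT *
FROM A
JOIN B ON A.x = B.x
JOIN C ON B.x = C.x;

-- Join order a CBO would pick: (A JOIN C) JOIN B
-- cost = (500 * 1000) + (50 * 6000) = 800,000
SELECT *
FROM A
JOIN C ON A.x = C.x
JOIN B ON A.x = B.x;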

Today, Hive supports statistics at the table and partition level – count of files, raw data size, count of rows, etc. – but doesn’t support statistics on column values. These table and partition level statistics are insufficient for building a CBO because they provide no information about the individual column values. Hence, obtaining a statistical summary of the column values is the first step towards building a CBO for Hive.
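
For reference, a sketch of what gathering these statistics looks like in HiveQL, again using the hypothetical table A. The table-level form already existed when this was written; the FOR COLUMNS form shows the syntax the column-statistics work later settled on, so treat it as illustrative rather than final.

-- Table and partition level statistics Hive supports today
-- (row count, file count, raw data size):
ANALYZE TABLE A COMPUTE STATISTICS;

-- Column-level statistics, the feature described in the post;
-- the syntax shown is the form that eventually shipped in Hive:
ANALYZE TABLE A COMPUTE STATISTICS FOR COLUMNS x;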

In addition to join reordering, Hive’s query optimizer will be able to use column statistics to decide whether to perform a map-side aggregation, as well as to better estimate the cardinality of operators in the execution plan.
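
As a rough sketch of the map-side aggregation case: Hive already exposes a flag for it, and column statistics would let the optimizer make that call from the number of distinct values of the grouping key rather than from a heuristic. The query below reuses the hypothetical table A and column x from the join example.

-- Map-side (partial) aggregation pays off when the GROUP BY key
-- has few distinct values relative to the row count: the per-mapper
-- hash table stays small and the shuffle volume shrinks.
SET hive.map.aggr = true;

SELECT x, COUNT(*) AS cnt
FROM A
GROUP BY x;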

Some days I wonder where improvements to algorithms and data structures are going to lead.

Other days, I just enjoy the news.

Today is one of the latter.

PS: What would a cost based optimizer (CBO) look like for merging operations? Or perhaps better, a merge cost estimator (MCE)? Metered merging, anyone?
