Imperative and Declarative Hadoop: TPC-H in Pig and Hive

Imperative and Declarative Hadoop: TPC-H in Pig and Hive by Russell Jurney.

From the post:

According to the Transaction Processing Council, TPC-H is:

The TPC Benchmark™H (TPC-H) is a decision support benchmark. It consists of a suite of business oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance. This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions.

TPC-H was implemented for Hive in HIVE-600 and for Pig in PIG-2397 by Hortonworks intern Jie Li. In going over this work, I was struck by how it outlined differences between Pig and SQL.

There seems to be tendency for simple SQL to provide greater clarity than Pig. At some point as the TPC-H queries become more demanding, complex SQL seems to have less clarity than the comparable Pig. Lets take a look.
(emphasis in original)

A refresher in the lesson that what solution you need, in this case Hive or PIg, depends upon your requirements.

Use either one blindly at the risk of poor performance or failing to meet other requirements.

Comments are closed.