Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

November 26, 2012

Shark (Hive on Spark)

Filed under: Shark,Spark — Patrick Durusau @ 4:57 pm

Shark (Hive on Spark)

From the webpage:

Shark is a large-scale data warehouse system for Spark designed to be compatible with Apache Hive. It can answer Hive QL queries up to 70 times faster than Hive without modification to the existing data or queries. Shark supports Hive’s query language, metastore, serialization formats, and user-defined functions.

We released Shark 0.2 on Oct 15, 2012. The new version is much more stable and also features significant performance improvements.

Getting Started

See our documentation on Github to get started. It takes around 5 mins to set up Shark on a single node for a quick spin, and about 20 mins on an Amazon EC2 cluster.

Fast Execution Engine

Shark is built on top of Spark, a data-parallel execution engine that is fast and fault-tolerant. Even if data are on disk, Shark can be noticeably faster than Hive because of the fast execution engine. It avoids the high task launching overhead of Hadoop MapReduce and does not require materializing intermediate data between stages on disk. Thanks to this fast engine, Shark can answer queries in sub-second latency.
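To make the point about intermediate data concrete, here is a toy sketch in Python. It is not Shark or Spark code: every function name is hypothetical, and the "spill" step is only a stand-in for what a MapReduce-style engine does when it writes one stage's output to disk before the next stage reads it. Both versions compute the same answer; the pipelined one simply never touches the disk between stages.

```python
import json
import os
import tempfile


def parse(lines):
    """Stage 1: turn 'word,count' lines into (word, int) pairs."""
    for line in lines:
        word, count = line.split(",")
        yield word, int(count)


def keep_large(pairs, threshold=2):
    """Stage 2: keep only pairs whose count reaches the threshold."""
    for word, count in pairs:
        if count >= threshold:
            yield word, count


def spill(pairs, directory):
    """Write a stage's output to disk and read it back.

    Hypothetical stand-in for materializing intermediate data
    between stages; real engines do this very differently.
    """
    path = os.path.join(directory, "stage.json")
    with open(path, "w") as handle:
        json.dump(list(pairs), handle)
    with open(path) as handle:
        return [tuple(pair) for pair in json.load(handle)]


def run_materialized(lines):
    """Run both stages, round-tripping stage 1's output through disk."""
    with tempfile.TemporaryDirectory() as directory:
        stage1 = spill(parse(lines), directory)
        return list(keep_large(stage1))


def run_pipelined(lines):
    """Run both stages as one in-memory pipeline of generators."""
    return list(keep_large(parse(lines)))


lines = ["shark,3", "hive,1", "spark,5"]
print(run_materialized(lines))
print(run_pipelined(lines))
```

The sketch says nothing about how large the difference is in practice; it only shows where the extra disk round trip sits.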

They say that imitation is the sincerest form of flattery.

In software, do claims of compatibility with your software mean the same thing?

It isn’t possible to know which database solutions will be around in five years, but the rapid emergence of alternative solutions is certainly exciting!
