Archive for the ‘STINGER’ Category

Hive 0.13 and Stinger!

Monday, April 21st, 2014

Announcing Apache Hive 0.13 and Completion of the Stinger Initiative! by Harish Butani.

From the post:

The Apache Hive community has voted on and released version 0.13 today. This is a significant release that represents a major effort from over 70 members who worked diligently to close out over 1080 JIRA tickets.

Hive 0.13 also delivers the third and final phase of the Stinger Initiative, a broad community based initiative to drive the future of Apache Hive, delivering 100x performance improvements at petabyte scale with familiar SQL semantics. These improvements extend Hive beyond its traditional roots and bring true interactive SQL query to Hadoop.

Ultimately, over 145 developers representing 44 companies, from across the Apache Hive community contributed over 390,000 lines of code to the project in just 13 months, nearly doubling the Hive code base.

The three phases of this important project spanned Hive versions 0.11, 0.12 and 0.13. Additionally, the Apache Hive team coordinated this 0.13 release with the simultaneous release of Apache Tez 0.4. Tez’s DAG execution speeds Hive queries run on Tez.

Kudos to one and all!

Open source work at its very best!

…Stinger Phase 3 Technical Preview

Saturday, December 21st, 2013

Announcing Stinger Phase 3 Technical Preview by Carter Shanklin.

From the post:

As an early Christmas present, we’ve made a technical preview of Stinger Phase 3 available. While just a preview by moniker, the release marks a significant milestone in the transformation of Hadoop from a batch-oriented system to a data platform capable of interactive data processing at scale and delivering on the aims of the Stinger Initiative.

Apache Tez and SQL: Interactive Query-IN-Hadoop

Tez is a low-level runtime engine not aimed directly at data analysts or data scientists. Frameworks need to be built on top of Tez to expose it to a broad audience… enter SQL and interactive query in Hadoop.

Stinger Phase 3 Preview combines the Tez execution engine with Apache Hive, Hadoop’s native SQL engine. Now, anyone who uses SQL tools in Hadoop can enjoy truly interactive data query and analysis.

We have already seen Apache Pig move to adopt Tez, and we will soon see others like Cascading do the same, unlocking many forms of interactive data processing natively in Hadoop. Tez is the technology that takes Hadoop beyond batch and into interactive, and we’re excited to see it available in a way that is easy to use and accessible to any SQL user.

….

Further on in the blog Carter mentions that for real fun you need four (4) physical nodes and a fairly large dataset.

I have yet to figure out the price break point between a local cluster and using a cloud service. Suggestions on that score?

Delivering on Stinger:…

Thursday, October 31st, 2013

Delivering on Stinger: a Phase 3 Progress Update by Arun Murthy.

From the post:

With the attention of the Hadoop community on Strata/Hadoop World in New York this week, it seems an appropriate time to give everyone an early update on continued community development of Apache Hive. This progress well and truly cements Hive as the standard open-source SQL solution for the Apache Hadoop ecosystem for not just extremely large-scale, batch queries but also for low-latency, human-interactive queries.

Many of you have heard of Project Stinger already, but for those who have not, Stinger is a community-facing roadmap laid out to improve Hive’s performance 100x and bring true interactive query to Hadoop. You can read more at www.hortonworks.com/labs/stinger.

We’ve gotten really excited lately as we’ve started to piece together the performance gains brought on by the past 9 months of hard work, including more than 700 closed Hive JIRAs and the launch of Apache Tez, which moves Hadoop beyond batch into a truly interactive big data platform.

I won’t replicate the performance graphics but I can hint that 200x improvements are worth your attention.

That’s right. 200x improvement in query performance.

Don’t take my word for it, read Arun’s post.

Apache Hive 0.12: Stinger Phase Two… DELIVERED [Unlike Obamacare]

Monday, October 21st, 2013

Apache Hive 0.12: Stinger Phase Two… DELIVERED by Thejas Nair.

From the post:

Stinger is not a product. Stinger is a broad community based initiative to bring interactive query at petabyte scale to Hadoop. And today, as representatives of this open, community led effort we are very proud to announce delivery of Apache Hive 0.12, which represents the critical second phase of this project!

Only five months in the making, Apache Hive 0.12 comprises over 420 closed JIRA tickets contributed by ten companies, with nearly 150 thousand lines of code! This work is perfectly representative of our approach… it is a substantial release with major contributions from a wide group of talented engineers from Microsoft, Facebook, Yahoo and others.

Delivery of SQL-IN-Hadoop Marches

The Stinger Initiative was announced in February and as promised, we have seen consistent regular delivery of new features and improvements as outlined in the Stinger plan. There are three roadmap vectors for Stinger: Speed, Scale and SQL. Each phase of the initiative advances on all three goals and this release provides a significant increase in SQL semantics, adding the VARCHAR and DATE datatypes and improving the performance of ORDER BY and GROUP BY. Several features to optimize queries have also been added.

We also contributed numerous “under the hood” improvements, i.e., refactoring code and making it easier to build on top of Hive – getting rid of some of the technical debt. This helps us deliver further optimizations in the long term, especially for the upcoming Apache Tez integration.

A complete list of the notable improvements included in the release is listed here and expect an updated performance benchmark soon!

It is so nice to see a successful software project!

And an open source one at that!

Unlike the no bid IT mega-failure that is Obamacare.

Maybe there is something to having a good infrastructure for code development as opposed to contractors billing by the phone call, lunch meeting and hour.

BTW, all the protests about the volume of users trying to register with Obamacare? More managerial incompetence.

When you are rolling out a system for potentially 300 million+ users, don’t you anticipate load as part of the requirements?

If you didn’t, there is the start of the trail of managerial incompetence in Obamacare.

Stinger Phase 2:…

Thursday, September 5th, 2013

Stinger Phase 2: The Journey to 100x Faster Hive on Hadoop by Carter Shanklin.

From the post:

The Stinger Initiative is Hortonworks’ community-facing roadmap laying out the investments Hortonworks is making to improve Hive performance 100x and evolve Hive to SQL compliance to simplify migrating SQL workloads to Hive.

We launched the Stinger Initiative along with Apache Tez to evolve Hadoop beyond its MapReduce roots into a data processing platform that satisfies the need for both interactive query AND petabyte scale processing. We believe it’s more feasible to evolve Hadoop to cover interactive needs rather than move traditional architectures into the era of big data.

If you don’t think SQL is all that weird, ;-), this is a status update for you!

Serious progress is being made by a broad coalition of more than 60 developers.

Take the challenge and download HDP 2.0 Beta.

You can help build the future of SQL-IN-Hadoop.

But only if you participate.

Apache Hive 0.11: Stinger Phase 1 Delivered

Saturday, May 18th, 2013

Apache Hive 0.11: Stinger Phase 1 Delivered by Owen O’Malley.

From the post:

In February, we announced the Stinger Initiative, which outlined an approach to bring interactive SQL-query into Hadoop. Simply put, our choice was to double down on Hive to extend it so that it could address human-time use cases (i.e. queries in the 5-30 second range). So, with input and participation from the broader community we established a fairly audacious goal of 100X performance improvement and SQL compatibility.

Introducing Apache Hive 0.11 – 386 JIRA tickets closed

As representatives of this open, community led effort we are very proud to announce the first release of the new and improved Apache Hive, version 0.11. This substantial release embodies the work of a wide group of people from Microsoft, Facebook, Yahoo, SAP and others. Together we have addressed 386 JIRA tickets, of which there were 28 new features and 276 bug fixes. There were FIFTY-FIVE developers involved in this and I would like to thank every one of them. See below for a full list.

Delivering on the promise of Stinger Phase 1

As promised we have delivered phase 1 of the Stinger Initiative in late spring. This release is another proof point that the open community can innovate at a rate unequaled by any proprietary vendor. As part of phase 1 we promised windowing, new data types, the optimized RC (ORC) file and base optimizations to the Hive Query engine and the community has delivered these key features.

Welcome news for the Hive and SQL communities alike!

STINGER: Spatio-Temporal Interaction Networks and Graphs (STING) Extensible Representation

Monday, August 1st, 2011

STINGER: Spatio-Temporal Interaction Networks and Graphs (STING) Extensible Representation by David A. Bader, Georgia Institute of Technology; Jonathan Berry, Sandia National Laboratories; Adam Amos-Binks, Carleton University, Canada; Daniel Chavarría-Miranda, Pacific Northwest National Laboratory; Charles Hastings, Hayden Software Consulting, Inc.; Kamesh Madduri, Lawrence Berkeley National Laboratory; and, Steven C. Poulos, U.S. Department of Defense. Dated May 9, 2009.

Abstract:

In this document, we propose a dynamic graph data structure that can serve as a common data structure for multiple real-world applications. The extensible representation for dynamic complex networks is space-efficient, allows parallelism over vertices and edges independently, and can be used for efficient checkpoint/restart of the data.

Describes a deeply interesting data structure for graphs that can be used on different frameworks.

See the Stinger wiki page (with source code as attachments).

And, see D. Ediger, K. Jiang, J. Riedy, and D.A. Bader, “Massive Streaming Data Analytics: A Case Study with Clustering Coefficients,” 4th Workshop on Multithreaded Architectures and Applications (MTAAP), Atlanta, GA, April 23, 2010.

Abstract:

We present a new approach for parallel massive graph analysis of streaming, temporal data with a dynamic and extensible representation. Handling the constant stream of new data from health care, security, business, and social network applications requires new algorithms and data structures. We examine data structure and algorithm trade-offs that extract the parallelism necessary for high-performance updating analysis of massive graphs. Static analysis kernels often rely on storing input data in a specific structure. Maintaining these structures for each possible kernel with high data rates incurs a significant performance cost. A case study computing clustering coefficients on a general-purpose data structure demonstrates incremental updates can be more efficient than global recomputation. Within this kernel, we compare three methods for dynamically updating local clustering coefficients: a brute-force local recalculation, a sorting algorithm, and our new approximation method using a Bloom filter. On 32 processors of a Cray XMT with a synthetic scale-free graph of 2^24 ≈ 16 million vertices and 2^29 ≈ 537 million edges, the brute-force method processes a mean of over 50,000 updates per second and our Bloom filter approaches 200,000 updates per second.
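
To make the "incremental updates instead of global recomputation" idea concrete, here is a minimal single-threaded sketch in Python. It is not the authors' STINGER code and it is not their Bloom filter method. The class name StreamingClustering and its methods are hypothetical, made up for illustration. It assumes an undirected simple graph and keeps a per-vertex triangle count, so that an edge insertion or deletion only touches the two endpoints and their common neighbors.

```python
from collections import defaultdict


class StreamingClustering:
    """Hypothetical illustration: undirected simple graph with
    per-vertex triangle counts maintained on every edge update."""

    def __init__(self):
        self.adj = defaultdict(set)        # vertex -> set of neighbors
        self.triangles = defaultdict(int)  # vertex -> triangles through it

    def insert_edge(self, u, v):
        # Ignore self-loops and edges that are already present.
        if u == v or v in self.adj[u]:
            return
        # Every common neighbor w closes one new triangle (u, v, w).
        common = self.adj[u] & self.adj[v]
        for w in common:
            self.triangles[w] += 1
        self.triangles[u] += len(common)
        self.triangles[v] += len(common)
        self.adj[u].add(v)
        self.adj[v].add(u)

    def remove_edge(self, u, v):
        if v not in self.adj[u]:
            return
        self.adj[u].discard(v)
        self.adj[v].discard(u)
        # Every common neighbor w loses the triangle (u, v, w).
        common = self.adj[u] & self.adj[v]
        for w in common:
            self.triangles[w] -= 1
        self.triangles[u] -= len(common)
        self.triangles[v] -= len(common)

    def coefficient(self, v):
        # Local clustering coefficient: triangles through v divided by
        # the number of neighbor pairs, d * (d - 1) / 2.
        d = len(self.adj[v])
        if d < 2:
            return 0.0
        return 2.0 * self.triangles[v] / (d * (d - 1))


if __name__ == "__main__":
    g = StreamingClustering()
    for a, b in [("a", "b"), ("b", "c"), ("a", "c"), ("a", "d")]:
        g.insert_edge(a, b)
    print(g.coefficient("a"))
```

The set intersection here is an exact stand-in for the common-neighbor search. The abstract's three methods (brute-force recalculation, sorting, and a Bloom filter approximation) are different ways of doing that one step quickly, and the parallel data structure itself is beyond this sketch.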

The authors refer to their approach as “massive streaming data analytics”. I think you will agree.

OK, admittedly they used a Cray XMT. But such processing power will be available to the average site sooner than you think. Soon enough that reading along these lines will put you ahead of the next curve.