From the post:
When our team at Databricks planned our contributions to the upcoming Apache Spark 2.0 release, we set out with an ambitious goal by asking ourselves: Apache Spark is already pretty fast, but can we make it 10x faster?
This question led us to fundamentally rethink the way we built Spark’s physical execution layer. When you look into a modern data engine (e.g. Spark or other MPP databases), a majority of the CPU cycles are spent in useless work, such as making virtual function calls or reading or writing intermediate data to CPU cache or memory. Optimizing performance by reducing the amount of CPU cycles wasted in this useless work has been a long-time focus of modern compilers.
Apache Spark 2.0 will ship with the second generation Tungsten engine. Built upon ideas from modern compilers and MPP databases and applied to data processing queries, Tungsten emits (SPARK-12795) optimized bytecode at runtime that collapses the entire query into a single function, eliminating virtual function calls and leveraging CPU registers for intermediate data. As a result of this streamlined strategy, called “whole-stage code generation,” we significantly improve CPU efficiency and gain performance.
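To see why collapsing a query into one function helps, here is a toy pure-Python sketch (not Spark's actual generated code, and no bytecode generation involved): the "Volcano-style" version pipes each row through a chain of iterators, one per operator, while the "fused" version is the single loop a whole-stage code generator would emit for the same filter-project-sum query.

```python
# Volcano-style evaluation: one iterator per operator; every row
# crosses a function-call boundary at each stage of the pipeline.
def volcano_filter(rows, predicate):
    for row in rows:
        if predicate(row):
            yield row

def volcano_project(rows, fn):
    for row in rows:
        yield fn(row)

def volcano_sum(rows):
    total = 0
    for row in rows:
        total += row
    return total

# The fused version: the whole query collapsed into a single loop,
# with intermediate values living in local variables (registers)
# instead of being handed between operators.
def fused(rows):
    total = 0
    for x in rows:
        if x % 2 == 0:       # filter
            total += x * 2   # project + sum
    return total

data = range(1_000_000)
result_volcano = volcano_sum(
    volcano_project(volcano_filter(data, lambda x: x % 2 == 0),
                    lambda x: x * 2))
result_fused = fused(data)
assert result_volcano == result_fused
```

Both versions compute the same answer; the fused loop simply does it without the per-row call overhead, which is the effect Tungsten's generated bytecode achieves on the JVM.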
(emphasis in original)
How much better, you ask?
cost per row (in nanoseconds, single thread)

primitive                              Spark 1.6   Spark 2.0
filter                                 15 ns       1.1 ns
sum w/o group                          14 ns       0.9 ns
sum w/ group                           79 ns       10.7 ns
hash join                              115 ns      4.0 ns
sort (8-bit entropy)                   620 ns      5.3 ns
sort (64-bit entropy)                  620 ns      40 ns
sort-merge join                        750 ns      700 ns
Parquet decoding (single int column)   120 ns      13 ns
Don’t just stare at the numbers:
What’s the matter?
Haven’t you ever seen a 1-billion-record join in 0.8 seconds? (Down from 61.7 seconds.)
If all that weren’t impressive enough, the post walks you through the currently dominant query evaluation strategy as a setup for Spark 2.0, and then into why “whole-stage code generation is so powerful.”
A must read!