Open Sourcing Cubert: A High Performance Computation Engine for Complex Big Data Analytics by Maneesh Varshney and Srinivas Vemuri.
From the post:
Cubert was built with the primary focus on better algorithms that can maximize map-side aggregations, minimize intermediate data, partition work in balanced chunks based on cost-functions, and ensure that the operators scan data that is resident in memory. Cubert has introduced a new paradigm of computation that:
- organizes data in a format that is ideally suited for scalable execution of subsequent query processing operators
- provides a suite of specialized operators (such as MeshJoin, Cube, Pivot) using algorithms that exploit the organization to provide significantly improved CPU and resource utilization
Cubert was shown to outperform other engines by a factor of 5-60X even when the data set sizes extend into 10s of TB and cannot fit into main memory.
The Cubert operators and algorithms were developed to specifically address real-life big data analytics needs:
- Complex Joins and aggregations frequently arise in the context of analytics on various user level metrics which are gathered on a daily basis from a user facing website. Cubert provides the unique MeshJoin algorithm that can process data sets running into terabytes over large time windows.
- Reporting workflows are distinct from ad-hoc queries by virtue of the fact that the computation pattern is regular and repetitive, allowing for efficiency gains from partial result caching and incremental processing, a feature exploited by the Cubert runtime for significantly improved efficiency and resource footprint.
- Cubert provides the new power-horse CUBE operator that can efficiently (CPU and memory) compute additive, non-additive (e.g. Count Distinct) and exact percentile rank (e.g. Median) statistics; can roll up inner dimensions on-the-fly and compute multiple measures within a single job.
- Cubert provides novel algorithms for graph traversal and aggregations for large-scale graph analytics.
Finally, Cubert Script is a developer-friendly language that takes out the hints, guesswork and surprises when running the script. The script provides the developers complete control over the execution plan (without resorting to low-level programming!), and is extremely extensible by adding new functions, aggregators and even operators.
and the source/documentation:
Cubert source code and documentation
The source code is open sourced under Apache v2 License and is available at https://github.com/linkedin/Cubert
The documentation, user guide and javadoc are available at http://linkedin.github.io/Cubert
The abstractions for data organization and calculations were present in the following paper:
“Execution Primitives for Scalable Joins and Aggregations in Map Reduce”, Srinivas Vemuri, Maneesh Varshney, Krishna Puttaswamy, Rui Liu. 40th International Conference on Very Large Data Bases (VLDB), Hangzhou, China, Sept 2014. (PDF)
Another advance in the processing of big data!
Now if we could just see a similar advance in the identification of entities/subjects/concepts/relationships in big data.
Nothing wrong with faster processing but a PB of poorly understood data is a PB of poorly understood data no matter how fast you process it.