I saw a tweet from Stratosphere today saying: “20 secs per iteration for PageRank on a billion scale graph using #stratosphere’s iterative data flows.” Enough to get me to look further! 😉
Tracking the source of the tweet, I found the homepage of Stratosphere and there read:
Stratosphere is a DFG-funded research project investigating “Information Management on the Cloud” and creating the Stratosphere System for Big Data Analytics. The current openly released version is 0.2 with many new features and enhancements for usability, robustness, and performance. See the Change Log for a complete list of new features.
What is the Stratosphere System?
The Stratosphere System is an open-source cluster/cloud computing framework for Big Data analytics. It comprises a rich stack of components with different programming abstractions for complex analytics tasks:
- An extensible higher level language (Meteor) to quickly compose queries for common and recurring use cases. Internally, Meteor scripts are translated into Sopremo algebra and optimized.
- A parallel programming model (PACT, an extension of MapReduce) to run user-defined operations. PACT is based on second-order functions and features an optimizer that chooses parallelization strategies.
- An efficient massively parallel runtime (Nephele) for fault tolerant execution of acyclic data flows.
Stratosphere is open source under the Apache License, Version 2.0. Feel free to download it, try it out and give feedback or ask for help on our mailing lists.
Meteor Language
Meteor is a textual higher-level language for rapid composition of queries. It uses a JSON-like data model and features in its core typical operation for analysis and transformation of (semi-) structured nested data.
The meteor language is highly extensible and supports the addition of custom operations that integrate fluently with the syntax, in order to create problem specific Domain Languages. Meteor queries are translated into Sopremo algebra, optimized, and transformed into PACT programs by the compiler.
PACT Programming Model
The PACT programming model is an extension of the well known MapReduce programming model. PACT features a richer set of second-order functions (Map/Reduce/Match/CoGroup/Cross) that can be flexibly composed as DAGs into programs. PACT programs use a generic schema-free tuple data model to ease composition of more complex programs.
PACT programs are parallelized by a cost-based compiler that picks data shipping and local processing strategies such that network- and disk I/O is minimized. The compiler incorporates user code properties (when possible) to find better plans; it thus alleviates the need for many manual optimizations (such as job merging) that one typically does to create efficient MapReduce programs. Compiled PACT programs are executed by the Nephele Data Flow Engine.
Nephele Data Flow Engine
Nephele is a massively parallel data flow engine dealing with resource management, work scheduling, communication, and fault tolerance. Nephele can run on top of a cluster and govern the resources itself, or directly connect to an IaaS cloud service to allocate computing resources on demand.
Another big data contender!