Spark: Cluster Computing with Working Sets
From the post:
One aspect you can’t miss, even as you begin reading this paper, is the strong scent of functional programming that Spark’s design bears. FP idioms appear throughout its architecture: the ability to rebuild a lost partition by re-applying a closure, operations such as reduce and map/collect, distributed accumulators, and so on. Suffice it to say that it is a very functional system. Pun intended!
Spark is written in Scala and is well suited for applications that reuse a working set of data across multiple parallel operations. It claims to outperform Hadoop by 10x on iterative machine learning jobs, and has been used to interactively query a 39 GB dataset with sub-second response time!
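For a feel of what that working-set reuse looks like, here is a minimal sketch in Spark’s Scala API (the present-day API, which differs a little from the 2010-era one in the paper; the HDFS path and the filter strings are just placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WorkingSetDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("WorkingSetDemo").setMaster("local[*]"))

    // Load the dataset once and keep the filtered working set in memory.
    // The path and the filter strings are placeholders.
    val lines  = sc.textFile("hdfs://namenode/data/events.log")
    val errors = lines.filter(_.contains("ERROR")).cache()

    // Several parallel operations reuse the same cached working set,
    // so only the first one pays the cost of scanning the file.
    val total    = errors.count()
    val timeouts = errors.filter(_.contains("timeout")).count()
    val sample   = errors.take(10)

    println(s"errors=$total, timeouts=$timeouts")
    sample.foreach(println)

    sc.stop()
  }
}
```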
It is built on top of Mesos, a resource management infrastructure that lets multiple parallel applications share a cluster in a fine-grained manner and provides an API for applications to launch tasks on a cluster.
Developers write a driver program that orchestrates various parallel operations. Spark’s programming model provides two abstractions for working with large datasets: resilient distributed datasets (RDDs) and parallel operations. In addition, it supports two kinds of shared variables: broadcast variables and accumulators.
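As a sketch of that model, the following driver program builds an RDD from a local collection, runs parallel operations over it, and uses both kinds of shared variables (again in today’s Scala API, where accumulators are created via longAccumulator; the names and numbers are purely illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DriverProgramDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("DriverProgramDemo").setMaster("local[*]"))

    // A resilient distributed dataset built from a local collection.
    val nums = sc.parallelize(1 to 1000)

    // Broadcast variable: a read-only value shipped once to each worker.
    val threshold = sc.broadcast(500)

    // Accumulator: workers add to it; only the driver reads the final value.
    val aboveCount = sc.longAccumulator("above-threshold")

    // Parallel operations expressed as closures over the shared variables.
    val above = nums.filter(_ > threshold.value).cache()
    above.foreach(_ => aboveCount.add(1)) // update the accumulator inside an action
    val sumAbove = above.reduce(_ + _)

    println(s"count=${aboveCount.value}, sum=$sumAbove")
    sc.stop()
  }
}
```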
If more technical papers had previews like this one, more technical papers would be read!
Interesting approach at first blush. I’m not sure I make that much of sub-second queries on a 39 GB dataset, as that is a physical memory issue these days. I do like the idea of sets of data subject to repeated operations.
New: Spark Project Homepage.