M3R: Increased Performance for In-Memory Hadoop Jobs by Avraham Shinnar, David Cunningham, Benjamin Herta, Vijay Saraswat.
Abstract:
Main Memory Map Reduce (M3R) is a new implementation of the Hadoop Map Reduce (HMR) API targeted at online analytics on high mean-time-to-failure clusters. It does not support resilience, and supports only those workloads which can fit into cluster memory. In return, it can run HMR jobs unchanged – including jobs produced by compilers for higher-level languages such as Pig, Jaql, and SystemML and interactive front-ends like IBM BigSheets – while providing significantly better performance than the Hadoop engine on several workloads (e.g. 45x on some input sizes for sparse matrix vector multiply). M3R also supports extensions to the HMR API which can enable Map Reduce jobs to run faster on the M3R engine, while not affecting their performance under the Hadoop engine.
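To make the “runs HMR jobs unchanged” claim concrete: a job written against the standard org.apache.hadoop.mapreduce API should, per the abstract, run on either engine without modification. As a minimal sketch (the canonical word-count job, not code from the paper), assuming only Hadoop’s stock Mapper, Reducer, and Job classes:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Mapper: emit (word, 1) for every token in the input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sum the counts for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The paper’s pitch is that the same jar could be submitted to the M3R engine instead of the Hadoop engine, trading resilience and out-of-memory workloads for speed.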
The authors start from the assumption of “clean” data that has already been reduced to terabytes in size and that can be held in the main memory of “scores” of nodes, as opposed to thousands of nodes. (A score = 20; twenty nodes with, say, 256 GB of RAM apiece already amount to roughly 5 TB of aggregate memory.)
And they make the point that main memory capacities are only going to increase in the coming years.
While the goal is phrased as “interactive analytics (e.g. interactive machine learning),” I wonder if the real design point is avoiding non-random-access storage (i.e., disk)?
And what consequences will entirely random-access memory have for algorithm design? Or for the assumptions that drive algorithmic design?
One way to test the impact of large memory on design would be to award access to a cluster holding several terabytes of data in main memory, on a competitive basis and for a limited time period, with all the code, data, runs, etc. being streamed to a public forum.
One qualification: the user should not already have access to that level of computing power at work. 😉
I first saw this at Alex Popescu’s post, Paper: M3R – Increased Performance for In-Memory Hadoop Jobs.