Archive for the ‘Peregrine’ Category

Map-Reduce-Merge

Friday, January 13th, 2012

Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters by Hung-chih Yang, Ali Dasdan, Ruey-Lung Hsiao and D. Stott Parker.

Abstract:

Map-Reduce is a programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. Through a simple interface with two functions, map and reduce, this model facilitates parallel implementation of many real-world tasks such as data processing for search engines and machine learning.

However, this model does not directly support processing multiple related heterogeneous datasets. While processing relational data is a common need, this limitation causes difficulties and/or inefficiency when Map-Reduce is applied on relational operations like joins.

We improve Map-Reduce into a new model called Map-Reduce-Merge. It adds to Map-Reduce a Merge phase that can efficiently merge data already partitioned and sorted (or hashed) by map and reduce modules. We also demonstrate that this new model can express relational algebra operators as well as implement several join algorithms.
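To make the Merge phase in the abstract concrete, here is a minimal in-memory sketch in Python. It is not code from the paper or from any framework. The helper names (`map_reduce`, `merge`), the sample employee and department records, and the choice of a sort-merge join are all assumptions made for illustration. Each dataset goes through its own map and reduce pass, and the merge step then joins the two key-sorted outputs.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Run one in-memory map-reduce pass (hypothetical helper).

    Returns a list of (key, reduced_value) pairs sorted by key.
    """
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    reduced = [(key, reduce_fn(key, values)) for key, values in groups.items()]
    return sorted(reduced, key=lambda pair: pair[0])


def merge(left, right, merge_fn):
    """Sort-merge join of two key-sorted reduced outputs (hypothetical helper).

    Each key appears at most once per side, because reduce has already
    collapsed all values for a key into one.
    """
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, left_value = left[i]
        right_key, right_value = right[j]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            out.append(merge_fn(left_key, left_value, right_value))
            i += 1
            j += 1
    return out


# Two heterogeneous datasets (made-up sample data).
employees = [("alice", "d1", 100), ("bob", "d2", 200), ("carol", "d1", 50)]
departments = [("d1", "Search"), ("d2", "Ads"), ("d3", "Maps")]

# Lineage 1: total bonus per department id.
bonus_by_dept = map_reduce(
    employees,
    map_fn=lambda rec: [(rec[1], rec[2])],
    reduce_fn=lambda key, values: sum(values),
)

# Lineage 2: department name per department id.
name_by_dept = map_reduce(
    departments,
    map_fn=lambda rec: [(rec[0], rec[1])],
    reduce_fn=lambda key, values: values[0],
)

# Merge phase: join the two reduced outputs on department id.
result = merge(bonus_by_dept, name_by_dept, lambda key, total, name: (name, total))
print(result)
```

In plain Map-Reduce the same join would typically require tagging records from both datasets and pushing them through a single map and reduce pair. Here the two datasets keep their own map and reduce logic, and only the final step has to know about both.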

As of today, I count sixty-three (63) citations of this paper. I only discovered it today, and it is going to take some time to work through all the citing materials and then the materials that cite those papers.

The Peregrine software I mentioned in another post implements this map-reduce-merge framework.

Peregrine

Friday, January 13th, 2012

Peregrine

From the webpage:

Peregrine is a map reduce framework designed for running iterative jobs across partitions of data. Peregrine is designed to be FAST for executing map reduce jobs by supporting a number of optimizations and features not present in other map reduce frameworks.

Among its many “modern” features, Peregrine includes: “MapReduceMerge style computations including a new merge() operation.”

I will have a separate blog entry on a paper describing MapReduceMerge computations for heterogeneous data sets.

This looks very important for the future of topic maps in a big data (heterogeneous) universe.