Efficient Parallel Set-Similarity Joins Using MapReduce

Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

July 11, 2010

Filed under: Data Integration,Heterogeneous Data,Information Retrieval,MapReduce,Semantic Diversity,Software — Patrick Durusau @ 9:36 am

Efficient Parallel Set-Similarity Joins Using MapReduce is by Rares Vernica, Michael J. Carey, and Chen Li, Department of Computer Science, University of California, Irvine. The authors used Citeseer (1.3M publications) and DBLP (1.2M publications) and “…increased their sizes as needed.”

The contributions of this paper are:

“We describe efficient ways to partition a large dataset across nodes in order to balance the workload and minimize the need for replication. Compared to the equi-join case, the set-similarity joins case requires “partitioning” the data based on set contents.
We describe efficient solutions that exploit the MapReduce framework. We show how to efficiently deal with problems such as partitioning, replication, and multiple inputs by manipulating the keys used to route the data in the framework.
We present methods for controlling the amount of data kept in memory during a join by exploiting the properties of the data that needs to be joined.
We provide algorithms for answering set-similarity self-join queries end-to-end, where we start from records containing more than just the join attribute and end with actual pairs of joined records.
We show how our set-similarity self-join algorithms can be extended to answer set-similarity R-S join queries.
We present strategies for exceptional situations where, even if we use the finest-granularity partitioning method, the data that needs to be held in the main memory of one node is too large to fit.”
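
To make the idea of “partitioning” by set contents a little more concrete, here is a minimal single-process sketch in Python. It is not the authors' code, and the post does not say which filtering technique the paper relies on. It assumes Jaccard similarity and uses prefix filtering, a common technique in the set-similarity join literature, to choose routing keys. The function names (`jaccard`, `similarity_self_join`) and the in-memory dictionary standing in for the MapReduce shuffle are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import ceil


def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def similarity_self_join(records, threshold):
    """Toy set-similarity self-join (illustrative only).

    records:   dict mapping record id -> set of tokens
    threshold: minimum Jaccard similarity for a pair to be reported
    Returns a sorted list of (id1, id2, similarity) tuples.
    """
    # Stage 1: a global token order, rarest tokens first.
    freq = Counter(tok for toks in records.values() for tok in toks)

    # Stage 2, "map": route each record id under every token in its prefix.
    # Two sets with Jaccard >= threshold must share at least one prefix token,
    # so only records that meet under some key need to be compared.
    groups = defaultdict(list)  # stands in for the shuffle phase
    for rid, toks in records.items():
        ordered = sorted(toks, key=lambda tok: (freq[tok], tok))
        prefix_len = len(ordered) - ceil(threshold * len(ordered)) + 1
        for tok in ordered[:prefix_len]:
            groups[tok].append(rid)

    # Stage 2, "reduce": verify candidate pairs that share a routing key.
    results = {}
    for rids in groups.values():
        for r1, r2 in combinations(sorted(rids), 2):
            if (r1, r2) not in results:
                sim = jaccard(records[r1], records[r2])
                if sim >= threshold:
                    results[(r1, r2)] = sim
    return sorted((r1, r2, sim) for (r1, r2), sim in results.items())
```

Calling `similarity_self_join` with a dictionary of token sets and a threshold such as 0.5 returns the pairs whose Jaccard similarity meets the threshold. In a real MapReduce job the `groups` dictionary would be replaced by the framework's key-based routing, and the same pair may be verified on more than one reducer, which is one source of the replication the authors say they try to minimize.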

There are a number of lessons and insights relevant to topic maps in this paper.

It makes me think of domain-specific (and possibly one or more “general”) set-similarity join interchange languages! What are you thinking of?

2 Comments

[…] Parallel Platform for Semistructured Data Management and Analysis is one of the projects behind the self-similarity and MapReduce […]

Pingback by ASTERIX: A Highly Scalable Parallel Platform for Semistructured Data Management and Analysis – SITE « Another Word For It — July 13, 2010 @ 5:22 am

[…] The FLAMINGO Project on Data Cleaning is the other project that has influenced the self-similarity work with MapReduce. […]

Pingback by The FLAMINGO Project on Data Cleaning – Site « Another Word For It — July 13, 2010 @ 5:28 am
