Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

September 11, 2011

Fast Clustering using MapReduce

Filed under: Clustering,MapReduce — Patrick Durusau @ 7:09 pm

Fast Clustering using MapReduce by Alina Ene, Sungjin Im, and Benjamin Moseley.

Abstract:

Clustering problems have numerous applications and are becoming more challenging as the size of the data increases. In this paper, we consider designing clustering algorithms that can be used in MapReduce, the most popular programming environment for processing large datasets. We focus on the practical and popular clustering problems, k-center and k-median. We develop fast clustering algorithms with constant factor approximation guarantees. From a theoretical perspective, we give the first analysis that shows several clustering algorithms are in MRC^0, a theoretical MapReduce class introduced by Karloff et al. [KarloffSV10]. Our algorithms use sampling to decrease the data size and they run a time consuming clustering algorithm such as local search or Lloyd's algorithm on the resulting data set. Our algorithms have sufficient flexibility to be used in practice since they run in a constant number of MapReduce rounds. We complement these results by performing experiments using our algorithms. We compare the empirical performance of our algorithms to several sequential and parallel algorithms for the k-median problem. The experiments show that our algorithms' solutions are similar to or better than the other algorithms' solutions. Furthermore, on data sets that are sufficiently large, our algorithms are faster than the other parallel algorithms that we tested.
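
To make the "sample, then cluster" idea from the abstract concrete, here is a minimal single-machine sketch. It is not the authors' MapReduce algorithm: the sample fraction, the sample_then_cluster name, and the use of scikit-learn's KMeans (Lloyd's algorithm) are my own illustrative assumptions.

```python
# A minimal single-machine sketch of the "sample, then cluster" idea from the
# abstract: shrink the data by sampling, run Lloyd's algorithm (k-means) on the
# sample, then assign every original point to its nearest sampled center.
# This is NOT the authors' MapReduce algorithm; sample_fraction and the use of
# scikit-learn's KMeans are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

def sample_then_cluster(points, k, sample_fraction=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = len(points)
    # Step 1: sample a small subset of the data (a MapReduce version would
    # do this in parallel across mappers).
    sample_size = max(k, int(n * sample_fraction))
    sample = points[rng.choice(n, size=sample_size, replace=False)]
    # Step 2: run a (relatively) expensive clustering algorithm on the sample only.
    model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(sample)
    # Step 3: assign every original point to the nearest center found on the sample.
    labels = model.predict(points)
    return model.cluster_centers_, labels

# Example usage on synthetic data.
if __name__ == "__main__":
    X = np.random.default_rng(1).normal(size=(100_000, 2))
    centers, labels = sample_then_cluster(X, k=10)
    print(centers.shape, labels.shape)
```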

In the Introduction the authors note:

In the clustering problems that we consider in this paper, the goal is to partition the data into subsets, called clusters, such that the data points assigned to the same cluster are similar according to some metric.

At first glance a trivial observation, but on reflection, perhaps not.

First, clustering is described as a means of collecting "data points" (I would say "subjects"; read further) that are "similar" by some measure. The result is a collective subject, such as all the members of a football team, the books in a library, or the users in a chat room at some point in time.

Second, and I think this goes unsaid too often, if the degree of similarity is high enough, we may well conclude that a single “subject” is described by the clustering.

But the choice between a collective and a single subject has no permanent or universal resolution. So it is best to state which one is in operation.
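
As a toy illustration of that point (my own sketch, not anything from the paper), one way to make the choice explicit is to declare the similarity threshold under which a cluster is read as a single subject:

```python
# A toy illustration (my own, not from the paper) of the point above: the same
# clustering output can be read as a collective subject or as a single subject,
# depending on an explicit similarity threshold that the analyst must declare.
import numpy as np

def interpret_cluster(points, single_subject_radius=0.1):
    """Return 'single subject' if every point lies within the declared radius
    of the cluster centroid, otherwise 'collective subject'."""
    centroid = points.mean(axis=0)
    max_dist = np.linalg.norm(points - centroid, axis=1).max()
    return "single subject" if max_dist <= single_subject_radius else "collective subject"

# Example: a tight cluster reads as one subject, a loose one as a collective.
tight = np.random.default_rng(0).normal(scale=0.01, size=(50, 2))
loose = np.random.default_rng(0).normal(scale=1.0, size=(50, 2))
print(interpret_cluster(tight), interpret_cluster(loose))
```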

Algorithms for MapReduce are an exciting area to watch.
