Parallel Implementation of Classification Algorithms Based on MapReduce
Authors: Qing He, Fuzhen Zhuang, Jincheng Li and Zhongzhi Shi
Keywords: Data Mining, Classification, Parallel Implementation, Large Dataset, MapReduce
Abstract:
Data mining has attracted extensive research for several decades. As an important task of data mining, classification plays an important role in information retrieval, web searching, CRM, etc. Most of the present classification techniques are serial, which become impractical for large dataset. The computing resource is under-utilized and the executing time is not waitable. Provided the program mode of MapReduce, we propose the parallel implementation methods of several classification algorithms, such as k-nearest neighbors, naive bayesian model and decision tree, etc. Preparatory experiments show that the proposed parallel methods can not only process large dataset, but also can be extended to execute on a cluster, which can significantly improve the efficiency.
From the paper:
In this paper, we introduced the parallel implementation of several classification algorithms based on MapReduce, which make them be applicable to mine large dataset. The key is to design the proper key/value pairs. (emphasis in original)
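To make the point about key/value design concrete, here is a minimal sketch of how the counting step of naive Bayes training might be expressed as map and reduce functions. It is an illustration, not the authors' design: the key layouts `("class", label)` and `("feature", label, index, value)`, the function names, and the tiny in-memory driver standing in for a real MapReduce cluster are all assumptions made for this example.

```python
from collections import defaultdict


def map_record(record):
    """Map step: emit (key, 1) pairs for one training record.

    A record is (features, label). The key layouts are hypothetical,
    chosen so that a plain sum in the reducer yields the counts that
    naive Bayes needs.
    """
    features, label = record
    # One pair per record for the class prior.
    yield ("class", label), 1
    # One pair per attribute for the conditional counts.
    for index, value in enumerate(features):
        yield ("feature", label, index, value), 1


def reduce_counts(key, values):
    """Reduce step: sum all the 1s emitted under the same key."""
    return key, sum(values)


def run_mapreduce(records):
    """In-memory stand-in for the shuffle phase (an assumption, not Hadoop)."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_record(record):
            grouped[key].append(value)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


# Made-up toy data, for illustration only.
training = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "yes"),
    (("rainy", "mild"), "yes"),
]
counts = run_mapreduce(training)
```

Because the reducer only ever sums values that share a key, the choice of key decides what gets counted. Dropping `label` from the feature key, for example, would yield attribute frequencies but lose the per-class counts the classifier depends on.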
Questions:
- Annotated bibliography of parallel classification algorithms (newer than this paper, 3-5 pages, citations)
- Report for class on application of parallel classification algorithms (report + paper)
- Application of parallel classification algorithm to a library dataset (project)
- Can the key/value pairs be interchanged with others? Yes/no, why? (3-5 pages, no citations.)