Parallel WaveCluster: A linear scaling parallel clustering algorithm implementation with application to very large datasets by Ahmet Artu Yıldırım and Cem Özdoğana.
Abstract:
A linear scaling parallel clustering algorithm implementation and its application to very large datasets for cluster analysis is reported. WaveCluster is a novel clustering approach based on wavelet transforms. Despite this approach has an ability to detect clusters of arbitrary shapes in an efficient way, it requires considerable amount of time to collect results for large sizes of multi-dimensional datasets. We propose the parallel implementation of the WaveCluster algorithm based on the message passing model for a distributed-memory multiprocessor system. In the proposed method, communication among processors and memory requirements are kept at minimum to achieve high efficiency. We have conducted the experiments on a dense dataset and a sparse dataset to measure the algorithm behavior appropriately. Our results obtained from performed experiments demonstrate that developed parallel WaveCluster algorithm exposes high speedup and scales linearly with the increasing number of processors.
This paper mentions, although it doesn’t treat, an important issue in the use of clustering algorithms with semantic data. In its description of the WaveCluster algorithm, the authors say:
WaveCluster algorithm contains three phases. In the first phase, algorithm quantizes feature space and then assigns the objects to the units.
(This is important work and deserved a better editing from its publisher.)
The problem is the quantizes feature space when the feature space is a semantic one.
Measuring a semantic dimension is not an easy task nor are the results free from doubt. The Wikipedia article on psychological testing is woefully below par for Wikipedia. I did manage to locate course materials for a psychology “research methods” course at UC Davis. PSC 41 Research Methods In particular, take a look at the Scaling module to get an idea of what it means to construct a scale for a semantic dimension.
I don’t have a handle on current social science research methods but the links in Free Resources for Program Evaluation and Social Research Methods should give you a place to start.
BTW, the original paper on WaveCluster: Wavecluster: A multi-resolution clustering approach for very large spatial databases has the following abstract:
Many applications require the management of spatial data. Clustering large spatial databases is an important problem which tries to find the densely populated regions in the feature space to be used in data mining, knowledge discovery, or efficient information retrieval. A good clustering approach should be efficient and detect clusters of arbitrary shape. It must be insensitive to the outliers (noise) and the order of input data. We pro-pose WaveCluster, a novel clustering approach based on wavelet transforms, which satisfies all the above requirements. Using multi-resolution property of wavelet transforms, we can effectively identify arbitrary shape clus-ters at different degrees of accuracy. We also demonstrate that WaveCluster is highly effi-cient in terms of time complexity. Experi-mental results on very large data sets are pre-sented which show the efficiency and effective-ness of the proposed approach compared to the other recent clustering methods. (Emphasis added.)
The ranges of spatial data dimensions are not in doubt so quantization makes sense. The “range” of semantic dimensions, on the other hand, are not nearly so certain.
As I pointed out at the beginning, this is very important research and used with appropriate data sets it can make a real difference. Used with inappropriate data sets and you have just cost your employer and possibly customers a good deal of time and effort.