Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

June 11, 2013

The Filtering vs. Clustering Dilemma

Filed under: Clustering,Graphs,Mathematics,Topological Data Analysis,Topology — Patrick Durusau @ 1:10 pm

Hierarchical information clustering by means of topologically embedded graphs by Won-Min Song, T. Di Matteo, and Tomaso Aste.

Abstract:

We introduce a graph-theoretic approach to extract clusters and hierarchies in complex data-sets in an unsupervised and deterministic manner, without the use of any prior information. This is achieved by building topologically embedded networks containing the subset of most significant links and analyzing the network structure. For a planar embedding, this method provides both the intra-cluster hierarchy, which describes the way clusters are composed, and the inter-cluster hierarchy which describes how clusters gather together. We discuss performance, robustness and reliability of this method by first investigating several artificial data-sets, finding that it can outperform significantly other established approaches. Then we show that our method can successfully differentiate meaningful clusters and hierarchies in a variety of real data-sets. In particular, we find that the application to gene expression patterns of lymphoma samples uncovers biologically significant groups of genes which play key-roles in diagnosis, prognosis and treatment of some of the most relevant human lymphoid malignancies.

I like the framing of the central issue a bit further in the paper:

Filtering information out of complex datasets is becoming a central issue and a crucial bottleneck in any scientifi c endeavor. Indeed, the continuous increase in the capability of automatic data acquisition and storage is providing an unprecedented potential for science. However, the ready accessibility of these technologies is posing new challenges concerning the necessity to reduce data-dimensionality by fi ltering out the most relevant and meaningful information with the aid of automated systems. In complex datasets information is often hidden by a large degree of redundancy and grouping the data into clusters of elements with similar features is essential in order to reduce complexity [1]. However, many clustering methods require some a priori information and must be performed under expert supervision. The requirement of any prior information is a potential problem because often the fi ltering is one of the preliminary processing on the data and therefore it is performed at a stage where very little information about the system is available. Another difficulty may arise from the fact that, in some cases, the reduction of the system into a set of separated local communities may hide properties associated with the global organization. For instance, in complex systems, relevant features are typically both local and global and di fferent levels of organization emerge at diff erent scales in a way that is intrinsically not reducible. We are therefore facing the problem of catching simultaneously two complementary aspects: on one side there is the need to reduce the complexity and the dimensionality of the data by identifying clusters which are associated with local features; but, on the other side, there is a need of keeping the information about the emerging global organization that is responsible for cross-scale activity. It is therefore essential to detect clusters together with the diff erent hierarchical gatherings above and below the cluster levels. (emphasis added)

Simplification of data is always lossy. The proposed technique does not avoid all loss but hopes to mitigate its consequences.

Briefly the technique relies upon building a network of the “most significant” links and analyzing the network structure. The synthetic and real data sets show that the technique works quite well. At least for data sets where we can judge the outcome.

What of larger data sets? Where the algorithmic approaches are the only feasible means of analysis? How do we judge accuracy in those cases?

A revised version of this paper appears as: Hierarchical Information Clustering by Means of Topologically Embedded Graphs by Won-Min Song, T. Di Matteo, and Tomaso Aste.

The original development of the technique used here can be found in: A tool for filtering information in complex systems by M. Tumminello, T. Aste, T. Di Matteo, and R. N. Mantegna.

No Comments

No comments yet.

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress