Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

May 7, 2012

Parallel clustering with CFinder

Filed under: CFinder,Clustering,Networks,Parallel Programming — Patrick Durusau @ 7:18 pm

Parallel clustering with CFinder by Peter Pollner, Gergely Palla, and Tamas Vicsek.

Abstract:

The amount of available data about complex systems is increasing every year; measurements of larger and larger systems are collected and recorded. A natural representation of such data is given by networks, whose size follows the size of the original system. The current trend of multiple cores in computing infrastructures calls for a parallel reimplementation of earlier methods. Here we present the grid version of CFinder, which can locate overlapping communities in directed, weighted or undirected networks based on the clique percolation method (CPM). We show that the computation of the communities can be distributed among several CPUs or computers. Although switching to the parallel version does not necessarily lead to a gain in computing time, it definitely makes the community structure of extremely large networks accessible.

If you aren’t familiar with CFinder, you should be.
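For readers new to CFinder, the clique percolation idea is simple to sketch. Below is a toy Python illustration for k = 3 (my own sketch, not CFinder's implementation): find all triangles, then merge triangles that share k - 1 = 2 nodes into communities.

```python
from itertools import combinations

def cpm_communities(edges, k=3):
    """Toy clique percolation (CPM) for k=3: find triangles, then merge
    triangles sharing k-1 = 2 nodes into communities."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # all k-cliques (for k=3: triangles)
    cliques = [frozenset(c) for c in combinations(sorted(adj), k)
               if all(b in adj[a] for a, b in combinations(c, 2))]
    # union-find over cliques: adjacent if they share k-1 nodes
    parent = list(range(len(cliques)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)
    groups = {}
    for i, c in enumerate(cliques):
        groups.setdefault(find(i), set()).update(c)
    return list(groups.values())

# two triangles sharing an edge form one community;
# an isolated triangle forms another
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (2, 4), (5, 6), (6, 7), (5, 7)]
comms = cpm_communities(edges)
```

Note that, as in CPM proper, a node can belong to more than one community when it sits in cliques from different chains.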

April 22, 2012

Finding New Story Links Through Blog Clustering

Filed under: Blogs,Clustering,Searching — Patrick Durusau @ 7:08 pm

Finding New Story Links Through Blog Clustering

Matthew Hurst writes:

The basic mechanism used in track // microsoft to cluster articles is similar to that used by Techmeme. A fixed set of blogs are crawled and clustered based on specific features such as link structure and content (and in the case of Techmeme, additional human input). However, what about blogs that aren't known to the system?

I recently added a feature to track // microsoft which analyses clusters for popular urls and adds those to the bottom of the cluster. The title of the web page is used as a simple description of the popular page.

In the recent story about Nuno Silva's mistaken comment regarding the future of Windows Phone devices, there were many links to Nuno's own blog post. In addition to the large cluster of known blogs that were determined to be talking about the story, track // microsoft also surfaced Nuno's post through analysing the popular links discovered within the cluster.

Interesting blog discovery method.

April 11, 2012

Close Counts In Horseshoes, Hand Grenades and Clustering

Filed under: Clustering,Machine Learning,R — Patrick Durusau @ 6:18 pm

Machine Learning in R: Clustering by Ricky Ho.

Ricky writes:

Clustering is a very common technique in unsupervised machine learning to discover groups of data that are “close-by” to each other. It is broadly used in customer segmentation and outlier detection.

It is based on some notion of “distance” (the inverse of similarity) between data points and uses that to identify data points that are close-by to each other. In the following, we discuss some very basic algorithms to come up with clusters, and use R as examples.

Covers K-Means, Hierarchical Clustering, Fuzzy C-Means, Multi-Gaussian with Expectation-Maximization, and Density-based Cluster algorithms.

Good introduction to the basics of clustering in R.
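The "close-by" intuition Ricky describes reduces, for k-means, to Lloyd's algorithm: assign each point to its nearest center, then move each center to the mean of its points. A minimal one-dimensional sketch in Python (his post uses R; this is just the bare algorithm, with made-up data):

```python
import random

def kmeans_1d(points, k, iters=20, seed=0):
    """Minimal Lloyd's algorithm on 1-D data: assign each point to the
    nearest center, then move each center to the mean of its points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

data = [1.0, 1.2, 0.8, 10.0, 10.3, 9.7]
centers = kmeans_1d(data, 2)
# the two centers converge to the cluster means, 1.0 and 10.0
```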

Reviews and Natural Language Processing: Clustering

Filed under: Clustering,Natural Language Processing — Patrick Durusau @ 6:15 pm

Reviews and Natural Language Processing: Clustering

From the post:

This quote initiated a Natural Language investigation into the HomeAway Review corpus: do the Traveler reviews (of properties) adhere to some set of standards? Reviews contain text and a “star” rating; does the text align with the rating? Analyzing its various corpora with Natural Language Processing tools allows HomeAway to better listen to – and better serve – its customers.

Interesting. HomeAway is a vacation rental marketplace and so has a pressing interest in the analysis of reviews.

Promises to be a very good grounding in NLP as applied to reviews. Worth watching closely.

April 8, 2012

Lumia Review Cluster

Filed under: Clustering,Marketing — Patrick Durusau @ 4:20 pm

Lumia Review Cluster

Matthew Hurst has clustered reviews of the Nokia Lumia 900.

His blog post is an image of the cluster at a point in time so you will have to go to http://d8taplex.com/track/microsoft-widescreen.html to interact with the cluster.

What would you add to this cluster to make it more useful? Such as sub-clustering strictly reviews of the Nokia Lumia 900 or perhaps clustering based on the mentioning of other phones for comparison?

Other navigation?

You have the subject (what is being buzzed about) and you have a relative measure of the “buzz.” How do we take advantage of that “buzz?”

Navigating to a place is great fun but most people expect something at the end of the journey.

What does your topic map enable at the end of a journey?

April 3, 2012

Tracking Video Game Buzz

Filed under: Blogs,Clustering,Data Mining,Tweets — Patrick Durusau @ 4:17 pm

Tracking Video Game Buzz

Matthew Hurst writes:

Briefly, I pushed out an experimental version of track // games to track topics in the blogosphere relating to video games. As with track // microsoft it gathers posts from blogs, clusters them and uses an attention metric based on Bitly and Twitter to rank the clusters, new posts and videos.

Currently at the top of the stack is Bungie Waves Goodbye To Halo.

Wonder if Matthew could be persuaded to do the same for the elections this Fall in the United States? 😉

April 2, 2012

UCR Time Series Classification/Clustering Page

Filed under: Classification,Clustering,Dataset,Time Series — Patrick Durusau @ 5:46 pm

UCR Time Series Classification/Clustering Page

From the webpage:

This webpage has been created as a public service to the data mining/machine learning community, to encourage reproducible research for time series classification and clustering.

While chasing the details on Eamonn Keogh and his time series presentation, I encountered this collection of data sets.

April 1, 2012

A Very Simple Explanation of Spectral Clustering

Filed under: Clustering,Spectral Clustering — Patrick Durusau @ 7:11 pm

A Very Simple Explanation of Spectral Clustering

Akshay writes:

I’ve been wanting to write a post about some of the research I’ve been working on but I realized that my work has gotten really technical and not very accessible to someone not in machine learning. So I thought as a precursor I would write about spectral clustering at a very high level and with lots of pictures. I hope you can follow along. For a more technical introduction to spectral clustering, this tutorial is very nice and much more complete than this article.

Clustering

To start out, it is important to understand what the clustering problem is and why it is important. Clustering is an extremely useful tool in data analysis whereby a set of objects are organized into groups based on their similarity. We typically want to form groups so that objects in the same group are highly similar and objects in different groups are less similar. This is typically known as unsupervised learning because the data set is not labeled (differentiating it from other tasks like regression and classification).

I think the point that eigenvectors are “stable under noise” could be made earlier and without a lot of technical detail.

I would insert/change:

The changes to the rows and columns introduce “noise” into the example.

The eigenvectors are (1 0 1 0)^T and (0 1 0 1)^T, indicating that the first and third objects are in one cluster and the second and fourth are in the other. The eigenvectors identify the clusters correctly, despite the introduction of noise.

Why eigenvectors perform well in the presence of noise is beyond the scope of this post.
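The eigenvector claim is easy to check numerically. A minimal sketch, using an illustrative 4x4 block similarity matrix of my own (not taken from Akshay's post):

```python
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

# Similarity matrix for four objects: objects 1 and 3 form one cluster,
# objects 2 and 4 the other (illustrative block structure).
A = [[1, 0, 1, 0],
     [0, 1, 0, 1],
     [1, 0, 1, 0],
     [0, 1, 0, 1]]

v1 = [1, 0, 1, 0]
v2 = [0, 1, 0, 1]

# Both are eigenvectors with eigenvalue 2; the nonzero entries
# mark the members of each cluster.
assert matvec(A, v1) == [2 * x for x in v1]
assert matvec(A, v2) == [2 * x for x in v2]
clusters = [[i + 1 for i, x in enumerate(v) if x] for v in (v1, v2)]
```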

Still, highly recommended as an introduction to spectral clustering.

March 6, 2012

Using your Lucene index as input to your Mahout job – Part I

Filed under: Clustering,Collocation,Lucene,Mahout — Patrick Durusau @ 8:08 pm

Using your Lucene index as input to your Mahout job – Part I

From the post:

This blog shows you how to use an upcoming Mahout feature, the lucene2seq program or https://issues.apache.org/jira/browse/MAHOUT-944. This program reads the contents of stored fields in your Lucene index and converts them into text sequence files, to be used by a Mahout text clustering job. The tool contains both a sequential and MapReduce implementation and can be run from the command line or from Java using a bean configuration object. In this blog I demonstrate how to use the sequential version on an index of Wikipedia.

Access to original text can help with improving clustering results. See the blog post for details.

February 25, 2012

XML data clustering: An overview

Filed under: Clustering,Data Clustering,XML,XML Data Clustering — Patrick Durusau @ 7:39 pm

XML data clustering: An overview by Alsayed Algergawy, Marco Mesiti, Richi Nayak, and Gunter Saake.

Abstract:

In the last few years we have observed a proliferation of approaches for clustering XML documents and schemas based on their structure and content. The presence of such a huge amount of approaches is due to the different applications requiring the clustering of XML data. These applications need data in the form of similar contents, tags, paths, structures, and semantics. In this article, we first outline the application contexts in which clustering is useful, then we survey approaches so far proposed relying on the abstract representation of data (instances or schema), on the identified similarity measure, and on the clustering algorithm. In this presentation, we aim to draw a taxonomy in which the current approaches can be classified and compared. We aim at introducing an integrated view that is useful when comparing XML data clustering approaches, when developing a new clustering algorithm, and when implementing an XML clustering component. Finally, the article moves into the description of future trends and research issues that still need to be faced.

I thought this survey article would be of particular interest since it covers the syntax and semantics of XML that contains data.

Not to mention that our old friend, heterogeneous data, isn’t far behind:

Since XML data are engineered by different people, they often have different structural and terminological heterogeneities. The integration of heterogeneous data sources requires many tools for organizing and making their structure and content homogeneous. XML data integration is a complex activity that involves reconciliation at different levels: (1) at schema level, reconciling different representations of the same entity or property, and (2) at instance level, determining if different objects coming from different sources represent the same real-world entity. Moreover, the integration of Web data increases the integration process challenges in terms of heterogeneity of data. Such data come from different resources and it is quite hard to identify the relationship with the business subjects. Therefore, a first step in integrating XML data is to find clusters of the XML data that are similar in semantics and structure [Lee et al. 2002; Viyanon et al. 2008]. This allows system integrators to concentrate on XML data within each cluster. We remark that reconciling similar XML data is an easier task than reconciling XML data that are different in structures and semantics, since the latter involves more restructuring. (emphasis added)
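That first step, finding XML data "similar in semantics and structure," presupposes a structural similarity measure. One crude but common choice is Jaccard similarity over tag paths; a sketch of my own, not one of the surveyed algorithms:

```python
import xml.etree.ElementTree as ET

def tag_paths(xml_text):
    """Collect the set of root-to-node tag paths as a structural fingerprint."""
    paths = set()
    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.add(path)
        for child in node:
            walk(child, path)
    walk(ET.fromstring(xml_text), "")
    return paths

def structural_similarity(a, b):
    """Jaccard similarity of tag-path sets: a crude structure-only measure."""
    pa, pb = tag_paths(a), tag_paths(b)
    return len(pa & pb) / len(pa | pb)

doc1 = "<book><title/><author/></book>"
doc2 = "<book><title/><year/></book>"
sim = structural_similarity(doc1, doc2)  # shares /book and /book/title: 2/4
```

Real systems, as the survey makes clear, also weigh content, order, and semantics, not just tag structure.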

Two comments to bear in mind while reading this paper.

First, print out or photocopy Table II on page 35, “Features of XML Clustering Approaches.” It will be a handy reminder/guide as you read the coverage of the various techniques.

Second, on the last page, page 41, note that the article was accepted in October of 2009 but not published until October of 2011. It’s great that the ACM has an abundance of excellent survey articles, but a two-year delay in publication is unreasonable.

Surveys in rapidly developing fields are of most interest when they are timely. Electronic publication upon final acceptance should be the rule at an organization such as the ACM.

January 6, 2012

General Purpose Computer-Assisted Clustering and Conceptualization

Filed under: Clustering,Conceptualizations — Patrick Durusau @ 11:39 am

General Purpose Computer-Assisted Clustering and Conceptualization by Justin Grimmer and Gary King.

Abstract:

We develop a computer-assisted method for the discovery of insightful conceptualizations, in the form of clusterings (i.e., partitions) of input objects. Each of the numerous fully automated methods of cluster analysis proposed in statistics, computer science, and biology optimize a different objective function. Almost all are well defined, but how to determine before the fact which one, if any, will partition a given set of objects in an “insightful” or “useful” way for a given user is unknown and difficult, if not logically impossible. We develop a metric space of partitions from all existing cluster analysis methods applied to a given data set (along with millions of other solutions we add based on combinations of existing clusterings), and enable a user to explore and interact with it, and quickly reveal or prompt useful or insightful conceptualizations. In addition, although uncommon in unsupervised learning problems, we offer and implement evaluation designs that make our computer-assisted approach vulnerable to being proven suboptimal in specific data types. We demonstrate that our approach facilitates more efficient and insightful discovery of useful information than either expert human coders or many existing fully automated methods.

Despite my misgivings about metric spaces for semantics, the central theme, that clustering (dare I say merging?) cannot be determined in advance of some user viewing the data, makes sense to me. Not every user will want or perhaps even need to do interactive clustering, but I think this theme represents a substantial advance in this area.
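A metric space of partitions requires some distance (or agreement) measure between clusterings. The Rand index is the simplest such measure; a sketch for illustration (not the authors' own construction, which adds millions of combined solutions):

```python
from itertools import combinations

def rand_index(p1, p2):
    """Rand index: the fraction of object pairs on which two clusterings
    agree (same cluster in both, or different clusters in both)."""
    items = list(p1)
    agree = 0
    for a, b in combinations(items, 2):
        agree += (p1[a] == p1[b]) == (p2[a] == p2[b])
    n = len(items)
    return agree / (n * (n - 1) // 2)

# two clusterings of five objects, given as {object: cluster label}
c1 = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2}
c2 = {"a": 0, "b": 0, "c": 1, "d": 2, "e": 2}
ri = rand_index(c1, c2)  # they disagree only on pairs (c, d) and (d, e)
```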

The publication appeared in the Proceeding of the National Academy of Sciences of the United States of America and the authors are from Stanford and Harvard, respectively. Institutions that value technical and scientific brilliance.

December 4, 2011

Clustering Large Attributed Graphs: An Efficient Incremental Approach

Filed under: Algorithms,Clustering,Graphs — Patrick Durusau @ 8:19 pm

Clustering Large Attributed Graphs: An Efficient Incremental Approach by Yang Zhou, Hong Cheng, and Jeffrey Xu Yu. (PDF file)

Abstract:

In recent years, many networks have become available for analysis, including social networks, sensor networks, biological networks, etc. Graph clustering has shown its effectiveness in analyzing and visualizing large networks. The goal of graph clustering is to partition vertices in a large graph into clusters based on various criteria such as vertex connectivity or neighborhood similarity. Many existing graph clustering methods mainly focus on the topological structures, but largely ignore the vertex properties which are often heterogeneous. Recently, a new graph clustering algorithm, SA-Cluster, has been proposed which combines structural and attribute similarities through a unified distance measure. SA-Cluster performs matrix multiplication to calculate the random walk distances between graph vertices. As the edge weights are iteratively adjusted to balance the importance between structural and attribute similarities, matrix multiplication is repeated in each iteration of the clustering process to recalculate the random walk distances which are affected by the edge weight update.

In order to improve the efficiency and scalability of SA-Cluster, in this paper, we propose an efficient algorithm Inc-Cluster to incrementally update the random walk distances given the edge weight increments. Complexity analysis is provided to estimate how much runtime cost Inc-Cluster can save. Experimental results demonstrate that Inc-Cluster achieves significant speedup over SA-Cluster on large graphs, while achieving exactly the same clustering quality in terms of intra-cluster structural cohesiveness and attribute value homogeneity.
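The random walk distances at the heart of SA-Cluster can be sketched as a truncated, decaying power series over the row-normalized transition matrix. The weights and cutoff below are illustrative, not the paper's exact formulation:

```python
def random_walk_scores(A, c=0.2, steps=4):
    """Truncated random-walk proximity, in the spirit of SA-Cluster's
    distance: score(i, j) = sum over l of c*(1-c)**l * (P**l)[i][j],
    where P is the row-normalized adjacency matrix. A sketch only."""
    n = len(A)
    deg = [sum(row) for row in A]
    P = [[A[i][j] / deg[i] if deg[i] else 0.0 for j in range(n)]
         for i in range(n)]
    def matmul(X, Y):
        return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]
    S = [[0.0] * n for _ in range(n)]
    Pl = P
    for l in range(1, steps + 1):
        w = c * (1 - c) ** l
        for i in range(n):
            for j in range(n):
                S[i][j] += w * Pl[i][j]
        Pl = matmul(Pl, P)  # next power of the transition matrix
    return S

# path graph 0-1-2: node 0 scores higher (closer) to 1 than to 2
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
S = random_walk_scores(A)
```

The repeated matrix multiplication in that inner loop is exactly the cost Inc-Cluster's incremental updates are designed to avoid.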

Seeing this reminded me that I need to review the other papers presented at the 2010 IEEE International Conference on Data Mining. The problem is that papers that seem the most relevant at one time, six months later don’t seem as relevant as they once did. Same papers, same person looking at them. Passage of time and other papers I suspect.

Graph algorithms continue to improve. If you are working with large graphs, I suggest you give some time to this paper.

November 19, 2011

Updates to d8taplex News Stream

Filed under: Clustering,News,Visualization — Patrick Durusau @ 10:22 pm

Updates to d8taplex News Stream by Matthew Hurst.

From the post:

I’ve made some updates to the news stream experiment on d8taplex. The page now displays a stream of news articles with clusters of related articles being shown with emboldened titles. Each article also shows the number of clicks tracked by bit.ly. The bit.ly data is coloured according to how hot it is (red represents the most clicked articles, orange represents ‘warm’ stories and gray represents stories that haven’t yet received any significant attention).

With two views of news (the clusters represent what the media companies are interested in and the bit.ly data represents what the weberati are interested in) the stream provides a slightly different overview of the news cycle. One can find clusters for which there is no significant bit.ly data, yet also find stories that are getting a lot of clicks but which don’t belong to any clusters. In addition, as each story is annotated with the source (CNN, BBC, etc.) you can also find clusters generated by only one source.

Matthew continues his experiments with news streams.

I wonder if media company clusters are related to their advertisers? Or owners? Or is the relationship more subtle? That is, are the Wall Street Journal's clusters driven more by the self-selection of people who work at the Journal than by the support of its mainstream financial backers? Not sure how you would even start to tease that apart.

October 20, 2011

AutoComplete with Suggestion Groups

Filed under: AutoComplete,Clustering,Interface Research/Design — Patrick Durusau @ 6:41 pm

AutoComplete with Suggestion Groups from Sematext.

From the post:

While Otis is talking about our new Search Analytics (it’s open and free now!) and Scalable Performance Monitoring (it’s also open and free now!) services at Lucene Eurocon in Barcelona, Pascal, one of the new members of our multinational team at Sematext, is working on making improvements to our search-lucene.com and search-hadoop.com sites. One of the recent improvements is on the AutoComplete functionality we have there. If you’ve used these sites before, you may have noticed that the AutoComplete there now groups suggestions. In the screen capture below you can see several suggestion groups divided with pink lines. Suggestions can be grouped by any criterion, and here we have them grouped by the source of suggestions. The very first suggestion is from “mail # general”, which is our name for the “general” mailing list that some of the projects we index have. The next two suggestions are from “mail # user”, followed by two suggestions from “mail # dev”, and so on. On the left side of each suggestion you can see icons that signify the type of suggestion and help people more quickly focus on types of suggestions they are after.

Very nice!

Curious how you would distinguish “grouping suggestions” from faceted navigation? (I have a distinction in mind but curious about yours.)
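As for the mechanics, grouping suggestions by source is easy to sketch. A toy version in Python (the sources and per-group limit here are invented, not Sematext's implementation):

```python
def grouped_suggestions(entries, prefix, per_group=2):
    """Group prefix matches by source, keeping a few suggestions per group,
    roughly like the grouped AutoComplete described in the post."""
    groups = {}
    for text, source in entries:
        if text.startswith(prefix) and len(groups.get(source, [])) < per_group:
            groups.setdefault(source, []).append(text)
    return groups

# (suggestion text, source) pairs; sources echo the post's naming style
entries = [
    ("lucene query syntax", "mail # user"),
    ("lucene scoring", "mail # dev"),
    ("lucene query parser bug", "mail # dev"),
    ("lucene quickstart", "wiki"),
    ("hadoop streaming", "mail # user"),
]
sugg = grouped_suggestions(entries, "lucene")
```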

October 3, 2011

Algorithms of the Intelligent Web Review

Algorithms of the Intelligent Web Review by Pearlene McKinley

From the post:

I have always had an interest in AI, machine learning, and data mining but I found the introductory books too mathematical and focused mostly on solving academic problems rather than real-world industrial problems. So, I was curious to see what this book was about.

I have read the book front-to-back (twice!) before I write this report. I started reading the electronic version a couple of months ago and read the paper print again over the weekend. This is the best practical book in machine learning that you can buy today — period. All the examples are written in Java and all algorithms are explained in plain English. The writing style is superb! The book was written by one author (Marmanis) while the other (Babenko) contributed the source code, so there are no gaps in the narrative; it is engaging, pleasant, and fluent. The author leads the reader from the very introductory concepts to some fairly advanced topics. Some of the topics are covered in the book and some are left as an exercise at the end of each chapter (there is a “To Do” section, which was a wonderful idea!). I did not like some of the figures (they were probably made by the authors not an artist) but this was only a minor aesthetic inconvenience.

The book covers four cornerstones of machine learning and intelligence, i.e. intelligent search, recommendations, clustering, and classification. It also covers a subject that today you can find only in the academic literature, i.e. combination techniques. Combination techniques are very powerful and although the author presents the techniques in the context of classifiers, it is clear that the same can be done for recommendations — as the BellKor team did for the Netflix prize.

Wonder if this will be useful in the Stanford AI course that starts next week with more than 130,000 students? Introduction to Artificial Intelligence – Stanford Class

I am going to order a copy, if for no other reason than to evaluate the reviewer’s claim of explanations “in plain English.” I have seen some fairly clever explanations of AI algorithms and would like to see how these stack up.

September 27, 2011

The ubiquity of small-world networks

Filed under: Clustering,Graphs,Networks — Patrick Durusau @ 6:46 pm

The ubiquity of small-world networks by Qawi K. Telesford, Karen E. Joyce, Satoru Hayasaka, Jonathan H. Burdette, and Paul J. Laurienti.

Abstract:

Small-world networks by Watts and Strogatz are a class of networks that are highly clustered, like regular lattices, yet have small characteristic path lengths, like random graphs. These characteristics result in networks with unique properties of regional specialization with efficient information transfer. Social networks are intuitive examples of this organization with cliques or clusters of friends being interconnected, but each person is really only 5-6 people away from anyone else. While this qualitative definition has prevailed in network science theory, in application, the standard quantitative application is to compare path length (a surrogate measure of distributed processing) and clustering (a surrogate measure of regional specialization) to an equivalent random network. It is demonstrated here that comparing network clustering to that of a random network can result in aberrant findings and networks once thought to exhibit small-world properties may not. We propose a new small-world metric, ω (omega), which compares network clustering to an equivalent lattice network and path length to a random network, as Watts and Strogatz originally described. Example networks are presented that would be interpreted as small-world when clustering is compared to a random network but are not small-world according to ω. These findings have significant implications in network science as small-world networks have unique topological properties, and it is critical to accurately distinguish them from networks without simultaneous high clustering and low path length.
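The two ingredients of the proposed metric, the clustering coefficient C and the characteristic path length L (the paper defines ω = L_rand/L - C/C_latt, against lattice and random baselines), can be computed directly. A sketch for small undirected graphs, with an example graph of my own:

```python
from collections import deque
from itertools import combinations

def clustering_coefficient(adj):
    """Average local clustering coefficient over all nodes."""
    total = 0.0
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # degree < 2 contributes 0
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)

def avg_path_length(adj):
    """Mean shortest-path length over node pairs (connected graph), via BFS."""
    total, pairs = 0, 0
    for s in adj:
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    q.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

# a triangle plus a pendant node
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
C = clustering_coefficient(adj)
L = avg_path_length(adj)
```

Computing ω then only requires the same two quantities on an equivalent random network and an equivalent lattice.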

What sort of network is your topic map?

Wonder if there will emerge classes of topic maps? Some of which are small-world networks and others that are not? I ask because knowing the conditions/requirements that lead to one type or the other would be another tool for designing topic maps for particular purposes.

September 16, 2011

Active Learning for Node Classification in Assortative and Disassortative Networks

Filed under: Clustering,Networks — Patrick Durusau @ 6:42 pm

Active Learning for Node Classification in Assortative and Disassortative Networks by Cristopher Moore, Xiaoran Yan, Yaojia Zhu, Jean-Baptiste Rouquier, and Terran Lane.

Abstract:

In many real-world networks, nodes have class labels, attributes, or variables that affect the network’s topology. If the topology of the network is known but the labels of the nodes are hidden, we would like to select a small subset of nodes such that, if we knew their labels, we could accurately predict the labels of all the other nodes. We develop an active learning algorithm for this problem which uses information-theoretic techniques to choose which nodes to explore. We test our algorithm on networks from three different domains: a social network, a network of English words that appear adjacently in a novel, and a marine food web. Our algorithm makes no initial assumptions about how the groups connect, and performs well even when faced with quite general types of network structure. In particular, we do not assume that nodes of the same class are more likely to be connected to each other—only that they connect to the rest of the network in similar ways.

If the abstract doesn’t recommend this paper as weekend reading, perhaps the following quote from the paper will:

our focus is on the discovery of functional communities in the network, and our underlying generative model is designed around the assumption that these communities exist.

You will recall from Don’t Trust Your Instincts that we are likely to see what we expect to see in text, or in this case, networks. Not that using this approach frees us from introducing bias, but it does ensure the observer bias is uniformly applied across the data set. Which may lead to results that startle us, interest us or that we consider to be spurious. In any event, this is one more approach to test and possibly illuminate our understanding of a network.

PS: Are communities the equivalent of clusters?

September 13, 2011

Discovering, Summarizing and Using Multiple Clusterings

Filed under: Clustering,Data Analysis,Data Mining — Patrick Durusau @ 7:16 pm

Proceedings of the 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings
Athens, Greece, September 5, 2011.

This collection of papers reflects what I think is rapidly becoming the consensus view: There is no one/right way to look at data.

That is important because by the application of multiple techniques, in these papers clustering techniques, you may make unanticipated discoveries about your data. Recording the trail you followed, as all explorers should, will help others duplicate your steps, to test them or to go further. In topic map terms, I would say you are discovering and identifying subjects.

Edited by

Emmanuel Müller *
Stephan Günnemann **
Ira Assent ***
Thomas Seidl **

* Karlsruhe Institute of Technology, Germany
** RWTH Aachen University, Germany
*** Aarhus University, Denmark


Complete workshop proceedings as one file (~16 MB).

Table of Contents

Invited Talks

  1. Combinatorial Approaches to Clustering and Feature Selection
     Michael E. Houle
  2. Cartification: Turning Similarities into Itemset Frequencies
     Bart Goethals

Research Papers

  1. When Pattern Met Subspace Cluster
     Jilles Vreeken, Arthur Zimek
  2. Fast Multidimensional Clustering of Categorical Data
     Tengfei Liu, Nevin L. Zhang, Kin Man Poon, Yi Wang, Hua Liu
  3. Factorial Clustering with an Application to Plant Distribution Data
     Manfred Jaeger, Simon Lyager, Michael Vandborg, Thomas Wohlgemuth
  4. Subjectively Interesting Alternative Clusters
     Tijl De Bie
  5. Evaluation of Multiple Clustering Solutions
     Hans-Peter Kriegel, Erich Schubert, Arthur Zimek
  6. Browsing Robust Clustering-Alternatives
     Martin Hahmann, Dirk Habich, Wolfgang Lehner
  7. Generating a Diverse Set of High-Quality Clusterings
     Jeff M. Phillips, Parasaran Raman, Suresh Venkatasubramanian

September 12, 2011

A Variational HEM Algorithm for Clustering Hidden Markov Models

Filed under: Clustering,Hidden Markov Model — Patrick Durusau @ 8:27 pm

A Variational HEM Algorithm for Clustering Hidden Markov Models by: Emanuele Coviello, Antoni B. Chan, and Gert R.G. Lanckriet.

Abstract:

The hidden Markov model (HMM) is a generative model that treats sequential data under the assumption that each observation is conditioned on the state of a discrete hidden variable that evolves in time as a Markov chain. In this paper, we derive a novel algorithm to cluster HMMs through their probability distributions. We propose a hierarchical EM algorithm that i) clusters a given collection of HMMs into groups of HMMs that are similar, in terms of the distributions they represent, and ii) characterizes each group by a “cluster center”, i.e., a novel HMM that is representative for the group. We present several empirical studies that illustrate the benefits of the proposed algorithm.

Warning: Heavy sledding but the examples of improved hierarchical motion clustering, music tagging, and online hand writing recognition are quite compelling.

September 11, 2011

On Clustering on Graphs with Multiple Edge Types

Filed under: Clustering,Graphs — Patrick Durusau @ 7:12 pm

On Clustering on Graphs with Multiple Edge Types by Matthew Rocklin and Ali Pinar.

Abstract:

We study clustering on graphs with multiple edge types. Our main motivation is that similarities between objects can be measured in many different metrics. For instance similarity between two papers can be based on common authors, where they are published, keyword similarity, citations, etc. As such, graphs with multiple edges is a more accurate model to describe similarities between objects. Each edge/metric provides only partial information about the data; recovering full information requires aggregation of all the similarity metrics. Clustering becomes much more challenging in this context, since in addition to the difficulties of the traditional clustering problem, we have to deal with a space of clusterings. We generalize the concept of clustering in single-edge graphs to multi-edged graphs and investigate problems such as: Can we find a clustering that remains good, even if we change the relative weights of metrics? How can we describe the space of clusterings efficiently? Can we find unexpected clusterings (a good clustering that is distant from all given clusterings)? If given the ground-truth clustering, can we recover how the weights for edge types were aggregated? In this paper, we discuss these problems and the underlying algorithmic challenges and propose some solutions. We also present two case studies: one based on papers on Arxiv and one based on CIA World Factbook.

From the introduction:

In many real-world problems, similarities between objects can be defined by many different relationships. For instance, similarity between two scientific articles can be defined based on authors, citations to, citations from, keywords, titles, where they are published, text similarity and many more. Relationships between people can be based on the nature of the relationship (e.g., business, family, friendships) a means of communication (e.g., emails, phone, in person), etc. Electronic files can be grouped by their type (Latex, C, html), names, the time they are created, or the pattern they are accessed. In these examples, there are multiple graphs that define relationships between the subjects. We may choose to reduce this multivariate information to construct a single composite graph. This is convenient as it enables application of many strong results from the literature. However, information being lost during this aggregation may be crucial, and we believe working on graphs with multiple edge types is a more precise representation of the problem, and thus will lead to more accurate analyses. Despite its importance, the literature on clustering graphs with multiple edge types is very rare….

Sounds familiar, doesn’t it? A good deal like associations in a topic map, where implicit information, such as roles and which role is being played, becomes explicit.

The clustering described in this paper is, of course, a way to group and identify collective subjects.
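The multi-metric setup is easy to make concrete. The sketch below is my own toy illustration, not the paper's method: two invented similarity metrics over four papers are aggregated with weights, and a simple single-link threshold clustering shows how changing the relative weights of the metrics changes the resulting clusters.

```python
# Two toy similarity metrics over four papers: shared-author similarity
# and keyword similarity, each keyed by unordered node pairs. The values
# are invented for illustration.
author_sim = {("p1", "p2"): 0.9, ("p1", "p3"): 0.1, ("p2", "p3"): 0.2,
              ("p1", "p4"): 0.0, ("p2", "p4"): 0.0, ("p3", "p4"): 0.8}
keyword_sim = {("p1", "p2"): 0.3, ("p1", "p3"): 0.7, ("p2", "p3"): 0.6,
               ("p1", "p4"): 0.1, ("p2", "p4"): 0.2, ("p3", "p4"): 0.9}

def aggregate(metrics, weights):
    """Collapse several edge types into one composite similarity."""
    pairs = metrics[0].keys()
    return {p: sum(w * m[p] for w, m in zip(weights, metrics)) for p in pairs}

def threshold_clusters(sim, threshold):
    """Single-link clustering: union pairs whose similarity clears the threshold."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for (a, b), s in sim.items():
        if s >= threshold:
            parent[find(a)] = find(b)
    nodes = {n for pair in sim for n in pair}
    clusters = {}
    for n in nodes:
        clusters.setdefault(find(n), set()).add(n)
    return sorted(map(sorted, clusters.values()))

# Weighting authors heavily groups {p1,p2} and {p3,p4}; weighting
# keywords heavily merges everything into one cluster.
print(threshold_clusters(aggregate([author_sim, keyword_sim], [0.8, 0.2]), 0.5))
print(threshold_clusters(aggregate([author_sim, keyword_sim], [0.2, 0.8]), 0.5))
```

The point of the exercise: neither clustering is wrong; each is a faithful view of the same data under a different weighting of the edge types, which is exactly the "space of clusterings" the abstract describes.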

You will probably be interested in the Graclus software that the authors use for splitting graphs.

Fast Clustering using MapReduce

Filed under: Clustering,MapReduce — Patrick Durusau @ 7:09 pm

Fast Clustering using MapReduce by Alina Ene, Sungjin Im, and Benjamin Moseley.

Abstract:

Clustering problems have numerous applications and are becoming more challenging as the size of the data increases. In this paper, we consider designing clustering algorithms that can be used in MapReduce, the most popular programming environment for processing large datasets. We focus on the practical and popular clustering problems, $k$-center and $k$-median. We develop fast clustering algorithms with constant factor approximation guarantees. From a theoretical perspective, we give the first analysis that shows several clustering algorithms are in $\mathcal{MRC}^0$, a theoretical MapReduce class introduced by Karloff et al. \cite{KarloffSV10}. Our algorithms use sampling to decrease the data size and they run a time consuming clustering algorithm such as local search or Lloyd’s algorithm on the resulting data set. Our algorithms have sufficient flexibility to be used in practice since they run in a constant number of MapReduce rounds. We complement these results by performing experiments using our algorithms. We compare the empirical performance of our algorithms to several sequential and parallel algorithms for the $k$-median problem. The experiments show that our algorithms’ solutions are similar to or better than the other algorithms’ solutions. Furthermore, on data sets that are sufficiently large, our algorithms are faster than the other parallel algorithms that we tested.

In the Introduction the authors note:

In the clustering problems that we consider in this paper, the goal is to partition the data into subsets, called clusters, such that the data points assigned to the same cluster are similar according to some metric

At first this seems a trivial observation, but on reflection, perhaps not.

First, clustering is described as a means of collecting “data points” (I would say “subjects”; read further) that are “similar” by some measure. That describes a collective subject, such as all the members of a football team, the books in a library, or the users in a chat room at some point in time.

Second, and I think this goes unsaid too often, if the degree of similarity is high enough, we may well conclude that a single “subject” is described by the clustering.

But the choice between a collective and a single subject has no permanent or universal resolution, so it is best to state which one is in operation.
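Stating which one is in operation can be as simple as publishing a rule. The sketch below is my own illustration, under the assumption (mine, not from the paper) that a single pairwise-similarity threshold decides the question:

```python
from itertools import combinations

# If every pairwise similarity inside a cluster clears a stated threshold,
# treat the cluster as records describing one subject; otherwise treat it
# as a collective subject. The threshold 0.9 is an illustrative choice,
# not a standard value.

def subject_kind(cluster, sim, single_threshold=0.9):
    pairs = combinations(sorted(cluster), 2)
    if all(sim[p] >= single_threshold for p in pairs):
        return "single subject"
    return "collective subject"

near_duplicates = {("a", "b"): 0.97, ("a", "c"): 0.95, ("b", "c"): 0.96}
loose_group = {("x", "y"): 0.6, ("x", "z"): 0.5, ("y", "z"): 0.7}

print(subject_kind({"a", "b", "c"}, near_duplicates))
print(subject_kind({"x", "y", "z"}, loose_group))
```

The rule itself is arbitrary; what matters is that it is explicit, so a reader of the clustering knows whether a cluster claims identity or mere similarity.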

Algorithms for MapReduce is an exciting area to watch.

September 7, 2011

Web Pages Clustering: A New Approach

Filed under: Clustering,Dictionary — Patrick Durusau @ 7:02 pm

Web Pages Clustering: A New Approach by Jeevan H E, Prashanth P P, Punith Kumar S N, and Vinay Hegde.

Abstract:

The rapid growth of web has resulted in vast volume of information. Information availability at a rapid speed to the user is vital. English language (or any for that matter) has lot of ambiguity in the usage of words. So there is no guarantee that a keyword based search engine will provide the required results. This paper introduces the use of dictionary (standardised) to obtain the context with which a keyword is used and in turn cluster the results based on this context. These ideas can be merged with a metasearch engine to enhance the search efficiency.

The first part of this paper is concerned with the use of a dictionary to create separate queries for each “sense” of a term. I am not sure that is an innovation.

I don’t have the citation at hand, but I seem to recall that query rewriting has used something very much like a dictionary. Perhaps not a “dictionary” in the conventional sense, but I would not bet on that either. Does anyone with a better memory than mine, or anyone working in query rewriting, recall?
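Here is the dictionary idea as I read it, in sketch form. The tiny dictionary below is invented for illustration; a real system would use a standardized dictionary (or WordNet-style sense inventories):

```python
# One query per dictionary sense of an ambiguous keyword: the keyword
# plus the context terms of that sense. The entries are invented.
senses = {
    "bank": [
        {"gloss": "financial institution", "context": ["money", "loan", "deposit"]},
        {"gloss": "side of a river", "context": ["river", "shore", "erosion"]},
    ],
}

def per_sense_queries(keyword, dictionary):
    """Expand one ambiguous keyword into one query per dictionary sense."""
    return [[keyword] + s["context"] for s in dictionary.get(keyword, [])]

for q in per_sense_queries("bank", senses):
    print(" ".join(q))
```

Each expanded query can then be sent to a search engine (or metasearch engine, as the paper suggests) and the results clustered by the sense that produced them.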

September 2, 2011

Discovering, Summarizing and Using Multiple Clusterings

Filed under: Clustering,Summarization — Patrick Durusau @ 8:00 pm

Discovering, Summarizing and Using Multiple Clusterings

Proceedings of the 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings

Athens, Greece, September 5, 2011.

Where you will find:

Invited Talks

1. Combinatorial Approaches to Clustering and Feature Selection, Michael E. Houle

2. Cartification: Turning Similarities into Itemset Frequencies, Bart Goethals

Research Papers

3. When Pattern Met Subspace Cluster, Jilles Vreeken, Arthur Zimek

4. Fast Multidimensional Clustering of Categorical Data, Tengfei Liu, Nevin L. Zhang, Kin Man Poon, Yi Wang, Hua Liu

5. Factorial Clustering with an Application to Plant Distribution Data, Manfred Jaeger, Simon Lyager, Michael Vandborg, Thomas Wohlgemuth

6. Subjectively Interesting Alternative Clusters, Tijl De Bie

7. Evaluation of Multiple Clustering Solutions, Hans-Peter Kriegel, Erich Schubert, Arthur Zimek

8. Browsing Robust Clustering-Alternatives, Martin Hahmann, Dirk Habich, Wolfgang Lehner

9. Generating a Diverse Set of High-Quality Clusterings, Jeff M. Phillips, Parasaran Raman, Suresh Venkatasubramanian

Parallel WaveCluster:…

Filed under: Clustering — Patrick Durusau @ 7:50 pm

Parallel WaveCluster: A linear scaling parallel clustering algorithm implementation with application to very large datasets by Ahmet Artu Yıldırım and Cem Özdoğan.

Abstract:

A linear scaling parallel clustering algorithm implementation and its application to very large datasets for cluster analysis is reported. WaveCluster is a novel clustering approach based on wavelet transforms. Despite this approach has an ability to detect clusters of arbitrary shapes in an efficient way, it requires considerable amount of time to collect results for large sizes of multi-dimensional datasets. We propose the parallel implementation of the WaveCluster algorithm based on the message passing model for a distributed-memory multiprocessor system. In the proposed method, communication among processors and memory requirements are kept at minimum to achieve high efficiency. We have conducted the experiments on a dense dataset and a sparse dataset to measure the algorithm behavior appropriately. Our results obtained from performed experiments demonstrate that developed parallel WaveCluster algorithm exposes high speedup and scales linearly with the increasing number of processors.

This paper mentions, although it doesn’t treat, an important issue in the use of clustering algorithms with semantic data. In its description of the WaveCluster algorithm, the authors say:

WaveCluster algorithm contains three phases. In the first phase, algorithm quantizes feature space and then assigns the objects to the units.

(This is important work and deserved a better editing from its publisher.)

The problem is quantizing the feature space when the feature space is a semantic one.

Measuring a semantic dimension is not an easy task nor are the results free from doubt. The Wikipedia article on psychological testing is woefully below par for Wikipedia. I did manage to locate course materials for a psychology “research methods” course at UC Davis. PSC 41 Research Methods In particular, take a look at the Scaling module to get an idea of what it means to construct a scale for a semantic dimension.

I don’t have a handle on current social science research methods but the links in Free Resources for Program Evaluation and Social Research Methods should give you a place to start.

BTW, the original paper on WaveCluster: Wavecluster: A multi-resolution clustering approach for very large spatial databases has the following abstract:

Many applications require the management of spatial data. Clustering large spatial databases is an important problem which tries to find the densely populated regions in the feature space to be used in data mining, knowledge discovery, or efficient information retrieval. A good clustering approach should be efficient and detect clusters of arbitrary shape. It must be insensitive to the outliers (noise) and the order of input data. We propose WaveCluster, a novel clustering approach based on wavelet transforms, which satisfies all the above requirements. Using multi-resolution property of wavelet transforms, we can effectively identify arbitrary shape clusters at different degrees of accuracy. We also demonstrate that WaveCluster is highly efficient in terms of time complexity. Experimental results on very large data sets are presented which show the efficiency and effectiveness of the proposed approach compared to the other recent clustering methods. (Emphasis added.)

The ranges of spatial data dimensions are not in doubt so quantization makes sense. The “range” of semantic dimensions, on the other hand, are not nearly so certain.
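For spatial data, that first WaveCluster phase is straightforward because the dimension ranges are known. A minimal sketch of quantizing 2-D points into grid units (my illustration, not the paper's implementation):

```python
# Quantize a 2-D feature space into fixed-size grid cells and assign each
# point to the cell containing it. For spatial data, cell_size is easy to
# choose because the coordinate ranges are known; for a "semantic
# dimension" there is no comparable agreed-upon range or cell size.

def quantize(points, cell_size):
    """Map each point to the grid unit (cell) that contains it."""
    units = {}
    for x, y in points:
        cell = (int(x // cell_size), int(y // cell_size))
        units.setdefault(cell, []).append((x, y))
    return units

points = [(0.2, 0.3), (0.4, 0.1), (5.1, 5.2), (5.3, 5.4)]
units = quantize(points, cell_size=1.0)
print(sorted(units))  # two occupied cells: (0, 0) and (5, 5)
```

The dense cells are then the raw material for the wavelet transform; the quantization step itself is where a semantic feature space, with no agreed scale, causes trouble.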

As I pointed out at the beginning, this is very important research, and used with appropriate data sets it can make a real difference. Used with inappropriate data sets, it may cost your employer, and possibly your customers, a good deal of time and effort.

August 18, 2011

BMC Bioinformatics

Filed under: Bioinformatics,Biomedical,Clustering — Patrick Durusau @ 6:49 pm

BMC Bioinformatics

From the webpage:

BMC Bioinformatics is an open access journal publishing original peer-reviewed research articles in all aspects of the development, testing and novel application of computational and statistical methods for the modeling and analysis of all kinds of biological data, as well as other areas of computational biology. BMC Bioinformatics (ISSN 1471-2105) is indexed/tracked/covered by PubMed, MEDLINE, BIOSIS, CAS, EMBASE, Scopus, ACM, CABI, Thomson Reuters (ISI) and Google Scholar.

Let me give you a sample of what you will find here:

MINE: Module Identification in Networks by Kahn Rhrissorrakrai and Kristin C Gunsalus. BMC Bioinformatics 2011, 12:192 doi:10.1186/1471-2105-12-192.

Abstract:

Graphical models of network associations are useful for both visualizing and integrating multiple types of association data. Identifying modules, or groups of functionally related gene products, is an important challenge in analyzing biological networks. However, existing tools to identify modules are insufficient when applied to dense networks of experimentally derived interaction data. To address this problem, we have developed an agglomerative clustering method that is able to identify highly modular sets of gene products within highly interconnected molecular interaction networks.

Medicine isn’t my field by profession (although I enjoy reading about it) but it doesn’t take much to see the applicability of an “agglomerative clustering method” to other highly interconnected networks.
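This is not the MINE algorithm itself, but a generic agglomerative (single-link) sketch shows the basic mechanic: repeatedly merge the closest pair of clusters until a distance cutoff is reached. The distance matrix below is invented for illustration.

```python
# Generic single-link agglomerative clustering over a pairwise distance
# dict keyed by sorted node pairs. Merging stops once the closest pair of
# clusters is farther apart than the cutoff.

def single_link(dist, cutoff):
    clusters = [{n} for n in sorted({n for pair in dist for n in pair})]
    def d(c1, c2):
        return min(dist[tuple(sorted((a, b)))] for a in c1 for b in c2)
    while len(clusters) > 1:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: d(clusters[ij[0]], clusters[ij[1]]))
        if d(clusters[i], clusters[j]) > cutoff:
            break
        clusters[i] |= clusters.pop(j)
    return sorted(map(sorted, clusters))

dist = {("a", "b"): 1.0, ("a", "c"): 4.0, ("b", "c"): 4.5,
        ("a", "d"): 9.0, ("b", "d"): 9.5, ("c", "d"): 8.0}
print(single_link(dist, cutoff=5.0))
```

On dense interaction networks, as the MINE abstract notes, naive approaches like this one break down, which is precisely the problem their method addresses.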

Reading across domain specific IR publications can help keep you from re-inventing the wheel or perhaps sparking an idea for a better wheel of your own making.

June 21, 2011

Investigating thousands (or millions) of documents by visualizing clusters

Filed under: Clustering,Visualization — Patrick Durusau @ 7:11 pm

Investigating thousands (or millions) of documents by visualizing clusters

From the post:

This is a recording of my talk at the NICAR (National Institute of Computer-Assisted Reporting) conference last week, where I discuss some of our recent work at the AP with the Iraq and Afghanistan war logs.

Very good presentation which includes:

  • “A full-text visualization of the Iraq war logs”, a detailed writeup of the technique used to generate the first set of maps presented in the talk.
  • The Glimmer high-performance, parallel multi-dimensional scaling algorithm, which is the software I presented in the live demo portion. It will be the basis of our clustering work going forward. (We are also working on other large-scale visualizations which may be more appropriate for e.g. email dumps.)
  • “Quantitative Discovery from Qualitative Information: A General-Purpose Document Clustering Methodology.” Justin Grimmer, Gary King, 2009. A paper that everyone working in document clustering needs to read. It clearly makes the point that there is no “best” clustering, just different algorithms that correspond to different pre-conceived frames on the story — and gives a method to compare clusterings (though I don’t think it will scale well to millions of docs.)
  • Wikipedia pages for bag of words model, tf-idf, and cosine similarity, the basic text processing techniques we’re using.
  • Gephi, a free graph visualization system, which we used for the one-month Iraq map. It will work up to a few tens of thousands of nodes.
  • Knight News Challenge application for “Overview,” the open-source system we’d like to build for doing this and other kinds of visual explorations of large document sets. If you like our work, why not leave a comment on our proposal?
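The bag-of-words, tf-idf, and cosine-similarity techniques from the last bullet fit in a few lines. A toy sketch (no stemming or stop-word removal, invented documents):

```python
import math
from collections import Counter

# Bag of words -> tf-idf weighting -> cosine similarity between documents.
docs = ["troops patrol the road",
        "patrol reports troops on road",
        "market opens in baghdad"]
bags = [Counter(d.split()) for d in docs]
vocab = sorted(set().union(*bags))
N = len(docs)
# idf: rarer terms count for more.
idf = {t: math.log(N / sum(1 for b in bags if t in b)) for t in vocab}
vecs = [{t: b[t] * idf[t] for t in b} for b in bags]

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Two documents about the same events score higher than unrelated ones.
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))
```

Everything in the talk, clustering included, is built on pairwise similarities of roughly this kind, just at a vastly larger scale.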

The war logs are not “yesterday’s news.” People of all nationalities are still dying. And those responsible go unpunished.

Public newspapers and military publications will list prior postings of military personnel. Indexing those against the locations in the war logs using a topic map could produce a first round of people to ask about events and decisions reported in the war logs. (Of course, our guardians have cleansed all the versions I know of and pronounced them safe for our reading. Unaltered versions would work better. I can only imagine what would have happened to Shakespeare as a war document.)

June 16, 2011

Clustering with outliers

Filed under: Clustering — Patrick Durusau @ 3:41 pm

Clustering with outliers

Part of a series of posts on clustering. From the most recent post (13 June 2011):

When abstracting the clustering problem, we often assume that the data is perfectly clusterable and so we only need to find the right clusters. But what if your data is not so perfect? Maybe there’s background noise, or a few random points were added to your data by an adversary. Some clustering formulations, in particular k-center or k-means, are not stable — the addition of a single point can dramatically change the optimal clustering. For the case of k-center, if a single point x is added far away from all of the original data points, it will become its own center in the optimum solution, necessitating that the other points are only clustered with k−1 centers.
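That instability is easy to demonstrate with the greedy farthest-point heuristic for k-center (a standard approximation algorithm, used here purely for illustration on invented 1-D data):

```python
# Greedy 2-center (farthest-point heuristic) on 1-D data, before and after
# an adversary adds a single far-away outlier.

def greedy_k_center(points, k):
    centers = [points[0]]
    while len(centers) < k:
        # pick the point farthest from all current centers
        centers.append(max(points, key=lambda p: min(abs(p - c) for c in centers)))
    # assign each point to its nearest center
    return {c: [p for p in points if min(centers, key=lambda x: abs(p - x)) == c]
            for c in centers}

data = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
print(sorted(greedy_k_center(data, 2)))            # a center near each natural group
print(sorted(greedy_k_center(data + [1000.0], 2))) # the outlier steals a center
```

With the outlier present, all six original points end up sharing a single center, exactly the "clustered with k−1 centers" effect described above.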

Clustering is relevant to topic maps in a couple of ways.

First, there are numerous collective subjects (sports teams, music groups, military units, etc.) that all have some characteristic by which they can be gathered together.

Second, in a very real sense, when all the information about a subject is brought together, clustering is a fair description of that activity. True, it is clustering with some extra processing thrown in, but it is still clustering, just a bit more fine-grained.

Not to mention that researchers have been working on clustering algorithms for years and they certainly should be part of any topic map authoring tool.

June 12, 2011

clusterPy: Library of spatially constrained clustering algorithms

Filed under: Clustering,Geo Analytics,Geographic Data,Geographic Information Retrieval — Patrick Durusau @ 4:13 pm

clusterPy: Library of spatially constrained clustering algorithms

From the webpage:

Analytical regionalization (also known as spatially constrained clustering) is a scientific way to decide how to group a large number of geographic areas or points into a smaller number of regions based on similarities in one or more variables (i.e., income, ethnicity, environmental condition, etc.) that the researcher believes are important for the topic at hand. Conventional conceptions of how areas should be grouped into regions may either not be relevant to the information one is trying to illustrate (i.e., using political regions to map air pollution) or may actually be designed in ways to bias aggregated results.

May 12, 2011

Intuition & Data-Driven Machine Learning

Filed under: Clustering,Machine Learning — Patrick Durusau @ 7:59 am

Intuition & Data-Driven Machine Learning

Ilya Grigorik includes his Intelligent Ruby: Getting Started with Machine Learning presentation and asks the following question:

… to perform a clustering, we need to define a function to measure the “pairwise distance” between all pairs of objects. Can you think of a generic way to do so?

Think about it and then go to his blog post to see the answer.
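One common generic approach, whether or not it is the answer Ilya has in mind, is to map each object to a feature vector and then reuse a standard vector metric for every pair. A sketch on invented records:

```python
import math

# Turn heterogeneous objects into numeric feature vectors, then any
# standard metric (here Euclidean) gives a pairwise distance for free.

def featurize(obj, keys):
    """Project an object onto a fixed, ordered list of numeric features."""
    return [float(obj.get(k, 0)) for k in keys]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

objs = [{"age": 30, "visits": 5}, {"age": 32, "visits": 4}, {"age": 60, "visits": 40}]
keys = ["age", "visits"]
vecs = [featurize(o, keys) for o in objs]
dists = [[euclidean(u, v) for v in vecs] for u in vecs]
print(round(dists[0][1], 3))  # the two similar records are close
```

The hard part, of course, is choosing the features and their scaling, which is where the intuition-versus-data tension of the presentation lives.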

March 31, 2011

Unified analysis of streaming news

Filed under: Aggregation,Analytics,Clustering — Patrick Durusau @ 3:38 pm

Unified analysis of streaming news by Amr Ahmed, Qirong Ho, Jacob Eisenstein, and Eric Xing of Carnegie Mellon University, Pittsburgh, USA, and Alexander J. Smola and Choon Hui Teo of Yahoo! Research, Santa Clara, CA, USA.

News clustering, categorization and analysis are key components of any news portal. They require algorithms capable of dealing with dynamic data to cluster, interpret and to temporally aggregate news articles. These three tasks are often solved separately. In this paper we present a unified framework to group incoming news articles into temporary but tightly-focused storylines, to identify prevalent topics and key entities within these stories, and to reveal the temporal structure of stories as they evolve. We achieve this by building a hybrid clustering and topic model. To deal with the available wealth of data we build an efficient parallel inference algorithm by sequential Monte Carlo estimation. Time and memory costs are nearly constant in the length of the history, and the approach scales to hundreds of thousands of documents. We demonstrate the efficiency and accuracy on the publicly available TDT dataset and data of a major internet news site.

From the article:

Such an approach combines the strengths of clustering and topic models. We use topics to describe the content of each cluster, and then we draw articles from the associated story. This is a more natural fit for the actual process of how news is created: after an event occurs (the story), several journalists write articles addressing various aspects of the story. While their vocabulary and their view of the story may differ, they will by necessity agree on the key issues related to a story (at least in terms of their vocabulary). Hence, to analyze a stream of incoming news we need to infer a) which (possibly new) cluster could have generated the article and b) which topic mix describes the cluster best.

I single out that part of the paper to remark that the authors first say the vocabulary for a story may vary, and then in the next breath say the vocabulary will agree on key issues.

Given the success of their results, could it be that news reporting is more homogeneous in its vocabulary than other forms of writing?

Perhaps news compression, where duplicated content is suppressed but the “fact” of reportage is retained, could make an interesting topic map.

