Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 30, 2011

Machine Learning

Filed under: Classification,Clustering,Machine Learning,Regression — Patrick Durusau @ 12:35 pm

Machine Learning

From the site:

This page documents all the machine learning algorithms present in the library. In particular, there are algorithms for performing classification, regression, clustering, anomaly detection, and feature ranking, as well as algorithms for doing more specialized computations.

A good tutorial and introduction to the general concepts used by most of the objects in this part of the library can be found in the svm example program. After reading this example another good one to consult would be the model selection example program. Finally, if you came here looking for a binary classification or regression tool then I would try the krr_trainer first as it is generally the easiest method to use.

The major design goal of this portion of the library is to provide a highly modular and simple architecture for dealing with kernel algorithms….
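
Since the quoted advice points to krr_trainer (kernel ridge regression) as the easiest entry point, here is a minimal sketch of the same technique in Python, using scikit-learn's KernelRidge as a stand-in; dlib's krr_trainer is a C++ class, so this is an analogue of the method, not dlib's API.

    import numpy as np
    from sklearn.kernel_ridge import KernelRidge

    # toy regression problem: recover y = sin(x) from noisy samples
    rng = np.random.default_rng(0)
    X = rng.uniform(0, 6, size=(100, 1))
    y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)

    # RBF kernel ridge regression, the same family of method krr_trainer implements
    model = KernelRidge(kernel="rbf", alpha=0.1, gamma=1.0).fit(X, y)
    print(model.predict([[1.5]]))   # close to sin(1.5), roughly 0.997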

Update: Dlib – machine learning. Why I left out the library name I cannot say. Sorry!

January 28, 2011

Building a Better Word Cloud – Post

Filed under: Clustering — Patrick Durusau @ 7:31 am

Building a Better Word Cloud

Drew Conway talks about why word clouds don’t work (space-based display of non-spatial data is how I would summarize it, but see for yourself).

He then proceeds to create a comparative word cloud. Palin and Obama on the Arizona shootings.

I include this post here as a caution that space-based clustering can be misleading if not outright deceptive.
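
The comparative cloud's underlying computation is just a signed difference of term frequencies between two texts; a minimal sketch with invented token lists (real versions normalize by document length and scale font size by the score):

    from collections import Counter

    # invented stand-ins for the two speech transcripts
    text_a = "economy jobs taxes freedom economy".split()
    text_b = "economy healthcare community hope".split()

    a, b = Counter(text_a), Counter(text_b)
    # positive score: more characteristic of text_a; negative: of text_b
    scores = {w: a[w] - b[w] for w in set(a) | set(b)}
    for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{word:12s} {score:+d}")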

January 19, 2011

Topic-based Index Partitions for Efficient and Effective Selective Search

Filed under: Clustering,Search Interface,Searching — Patrick Durusau @ 11:10 am

Topic-based Index Partitions for Efficient and Effective Selective Search Authors: Anagha Kulkarni and Jamie Callan

Abstract:

Indexes for large collections are often divided into shards that are distributed across multiple computers and searched in parallel to provide rapid interactive search. Typically, all index shards are searched for each query. This paper investigates document allocation policies that permit searching only a few shards for each query (selective search) without sacrificing search quality. Three types of allocation policies (random, source-based and topic-based) are studied. K-means clustering is used to create topic-based shards. We manage the computational cost of applying these techniques to large datasets by defining topics on a subset of the collection. Experiments with three large collections demonstrate that selective search using topic-based shards reduces search costs by at least an order of magnitude without reducing search accuracy.

What is unclear to me is whether mapping shards across independent and distinct collections, each with topic-based shards, would be as effective.

That would depend on the similarity of the shards but that is measurable. Not to mention mappable by a topic map.

It would be interesting if large collections started offering topic-based shard APIs to their contents, such that a distributed query could search only those shards mapped as relevant to a particular query.
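
A minimal sketch of the topic-based allocation idea: cluster documents with k-means (as the paper does), treat each cluster as a shard, and route a query only to the most similar shard. Documents and parameters below are invented for illustration.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["oil prices rise", "crude oil exports fall",
            "striker scores twice", "cup final goes to penalties",
            "new vaccine trial", "hospital staffing shortage"]
    vec = TfidfVectorizer()
    X = vec.fit_transform(docs)

    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    shards = {k: [d for d, lbl in zip(docs, km.labels_) if lbl == k]
              for k in range(3)}

    q = vec.transform(["oil markets"]).toarray().ravel()
    best = int(np.argmax(km.cluster_centers_ @ q))   # route to the closest shard
    print(shards[best])   # selective search: only this shard gets searched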

December 18, 2010

Self-organization in Distributed Semantic Repositories – Presentation

Filed under: Clustering,Self-organization,Similarity — Patrick Durusau @ 6:19 am

Kia Teymourian. Video and slides from SOKS: Self-Organising Knowledge Systems, Amsterdam, 29 April 2010.

Abstract:

Principles from nature-inspired self-organization can help to attack the massive scalability challenges in future internet infrastructures. We researched into ant-like mechanisms for clustering semantic information. We outline algorithms to store related information within clusters to facilitate efficient and scalable retrieval.

At the core are similarity measures that cannot consider global information such as a completely shared ontology. Mechanisms for syntax-based URI-similarity and the usage of a dynamic partial view on an ontology for path-length based similarity are described and evaluated. We give an outlook on how to consider application specific relations for clustering with a use case in geo-information systems.

Research questions:

  1. What about a similarity function where “sim = 1.0”?
  2. What about ants with different similarity functions?
  3. The similarity measure is RDF bound. What other similarity measures are in use?

Observation: The WordNet ontology is used for the evaluation. It occurred to me that WordNet gets used a lot, but never reused. Or rather, the results of using WordNet are never reused.

Isn’t it odd that we keep reasoning about sparrows being like ducks, over and over again? Seems like we should be able to take the results of others and build upon them. What prevents that from happening, either in searching or in ontology systems?
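
On question 3 and the sparrow/duck point: one widely used, RDF-free similarity is WordNet path-length similarity, available in NLTK. A sketch (this is the generic measure, not the paper's dynamic partial-view variant; requires nltk.download("wordnet")):

    from nltk.corpus import wordnet as wn

    sparrow = wn.synset("sparrow.n.01")
    duck = wn.synset("duck.n.01")
    car = wn.synset("car.n.01")

    # path similarity: inverse shortest hypernym-path length, 1.0 = identical node
    print(sparrow.path_similarity(duck))   # relatively high: both are birds
    print(sparrow.path_similarity(car))    # much lower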

December 9, 2010

Mining of Massive Datasets – eBook

Mining of Massive Datasets

Jeff Dalton, at Jeff’s Search Engine Caffè, reports a new data mining book by Anand Rajaraman and Jeffrey D. Ullman (yes, that Jeffrey D. Ullman; think “dragon book”).

A free eBook no less.

Read Jeff’s post on your way to get a copy.

Look for more comments as I read through it.

Has anyone written a comparison of the recent search engine titles? Just curious.


Update: A new version is out in hard copy and the e-book remains available. See: Mining Massive Data Sets – Update

December 4, 2010

Probabilistic User Modeling in the Presence of Drifting Concepts

Probabilistic User Modeling in the Presence of Drifting Concepts Author(s): Vikas Bhardwaj, Ramaswamy Devarajan

Abstract:

We investigate supervised prediction tasks which involve multiple agents over time, in the presence of drifting concepts. The motivation behind choosing the topic is that such tasks arise in many domains which require predicting human actions. An example of such a task is recommender systems, where it is required to predict the future ratings, given features describing items and context along with the previous ratings assigned by the users. In such a system, the relationships among the features and the class values can vary over time. A common challenge to learners in such a setting is that this variation can occur both across time for a given agent, and also across different agents, (i.e. each agent behaves differently). Furthermore, the factors causing this variation are often hidden. We explore probabilistic models suitable for this setting, along with efficient algorithms to learn the model structure. Our experiments use the Netflix Prize dataset, a real world dataset which shows the presence of time variant concepts. The results show that the approaches we describe are more accurate than alternative approaches, especially when there is a large variation among agents. All the data and source code would be made open-source under the GNU GPL.

Interesting because not only do concepts drift from user to user, but modeling users as existing in neighborhoods of other users proved more accurate than purely homogeneous or heterogeneous models.
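
The paper's probabilistic models aside, the core difficulty is easy to reproduce: a drifting concept erodes a model's accuracy unless the model keeps updating. A minimal sketch with synthetic data, not the Netflix setting:

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    rng = np.random.default_rng(0)
    clf = SGDClassifier(random_state=0)
    boundary = 0.0   # the "concept": a decision threshold that drifts over time

    for t in range(10):
        X = rng.normal(size=(200, 2))
        y = (X.sum(axis=1) > boundary).astype(int)
        if t > 0:
            print(t, round(clf.score(X, y), 2))  # accuracy on the new batch
        clf.partial_fit(X, y, classes=[0, 1])    # incremental update tracks the drift
        boundary += 0.3                          # gradual concept drift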

Questions:

  1. If there is a “neighborhood” effect on users, what, if anything, does that imply for co-occurrence of terms? (3-5 pages, no citations)
  2. How would you determine “neighborhood” boundaries for terms? (3-5 pages, citations)
  3. Do “neighborhoods” for terms vary by semantic domains? (3-5 pages, citations)

*****
Be aware that the Netflix dataset is no longer available. Possibly in response to privacy concerns. A demonstration of the utility of such concerns and their advocates.

November 30, 2010

Apache Mahout – Website

Filed under: Classification,Clustering,Data Mining,Mahout,Pattern Recognition,Software — Patrick Durusau @ 8:54 pm

Apache Mahout

From the website:

Apache Mahout’s goal is to build scalable machine learning libraries. With scalable we mean:

Scalable to reasonably large data sets. Our core algorithms for clustering, classification and batch based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm. However we do not restrict contributions to Hadoop based implementations: Contributions that run on a single node or on a non-Hadoop cluster are welcome as well. The core libraries are highly optimized to allow for good performance also for non-distributed algorithms.

Current capabilities include:

  • Collaborative Filtering
  • User and Item based recommenders
  • K-Means, Fuzzy K-Means clustering
  • Mean Shift clustering
  • Dirichlet process clustering
  • Latent Dirichlet Allocation
  • Singular value decomposition
  • Parallel Frequent Pattern mining
  • Complementary Naive Bayes classifier
  • Random forest decision tree based classifier
  • High performance Java collections (previously Colt collections)

A topic maps class will only have enough time to show a few examples of using Mahout. Perhaps an informal group could dig deeper?
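
As a concrete (if toy) illustration of the user-based recommenders listed above, here is the classic similarity-weighted prediction in plain Python; Mahout runs the same idea at Hadoop scale, so treat this as a sketch of the technique, not of Mahout's API.

    import numpy as np

    # toy user-item ratings, 0 = unrated (invented data)
    R = np.array([[5, 3, 0, 1],
                  [4, 0, 0, 1],
                  [1, 1, 0, 5],
                  [0, 1, 5, 4]], dtype=float)

    def cosine(u, v):
        both = (u > 0) & (v > 0)   # compare only co-rated items
        if not both.any():
            return 0.0
        return u[both] @ v[both] / (np.linalg.norm(u[both]) * np.linalg.norm(v[both]))

    user, item = 0, 2   # predict user 0's rating of item 2
    sims = np.array([cosine(R[user], R[v]) for v in range(len(R))])
    sims[user] = 0.0
    raters = R[:, item] > 0
    print(sims[raters] @ R[raters, item] / sims[raters].sum())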

November 20, 2010

Classification and Pattern Discovery of Mood in Weblogs

Filed under: Classification,Clustering,Pattern Recognition — Patrick Durusau @ 10:18 am

Classification and Pattern Discovery of Mood in Weblogs Author(s): Thin Nguyen, Dinh Phung, Brett Adams, Truyen Tran, Svetha Venkatesh

Abstract:

Automatic data-driven analysis of mood from text is an emerging problem with many potential applications. Unlike generic text categorization, mood classification based on textual features is complicated by various factors, including its context- and user-sensitive nature. We present a comprehensive study of different feature selection schemes in machine learning for the problem of mood classification in weblogs. Notably, we introduce the novel use of a feature set based on the affective norms for English words (ANEW) lexicon studied in psychology. This feature set has the advantage of being computationally efficient while maintaining accuracy comparable to other state-of-the-art feature sets experimented with. In addition, we present results of data-driven clustering on a dataset of over 17 million blog posts with mood groundtruth. Our analysis reveals an interesting, and readily interpreted, structure to the linguistic expression of emotion, one that comprises valuable empirical evidence in support of existing psychological models of emotion, and in particular the dipoles pleasure-displeasure and activation-deactivation.

The classification and pattern discovery of sentiment in weblogs will be a high priority for some topic maps.

Detection of teenagers who post to MySpace about violence, for example.
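
A sketch of the lexicon idea: score a post by the mean affective rating of its recognized words. The ratings below are invented placeholders; the real ANEW values must be requested via the form linked at the end of this post.

    import re

    # invented placeholder valence ratings, NOT the real ANEW values
    valence = {"happy": 8.2, "love": 8.7, "win": 7.9,
               "angry": 2.9, "alone": 3.0, "lost": 2.8}

    def mean_valence(post):
        words = re.findall(r"[a-z']+", post.lower())
        hits = [valence[w] for w in words if w in valence]
        return sum(hits) / len(hits) if hits else None

    print(mean_valence("So happy we win, love this team"))  # high valence
    print(mean_valence("Angry and alone, we lost again"))   # low valence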

Questions:

  1. How would you use this technique for research on weblogs? (3-5 pages, no citations)
  2. What other word lists could be applied to research on weblogs? Thoughts on how they could be applied? (3-5 pages, citations)
  3. Does the “mood” of a text impact its classification in traditional schemes? How would you test that question? (3-5 pages, no citations)

Additional resources:

Affective Norms for English Words (ANEW) Instruction Manual and Affective Ratings

ANEW Message: Request form for ANEW word list.

November 17, 2010

Normalized Kernels as Similarity Indices (and algorithm bias)

Filed under: Clustering,Kernel Methods,Similarity — Patrick Durusau @ 8:20 am

Normalized Kernels as Similarity Indices Author(s): Julien Ah-Pine Keywords: Kernel normalization, similarity indices, kernel PCA based clustering

Abstract:

Measuring similarity between objects is a fundamental issue for numerous applications in data-mining and machine learning domains. In this paper, we are interested in kernels. We particularly focus on kernel normalization methods that aim at designing proximity measures that better fit the definition and the intuition of a similarity index. To this end, we introduce a new family of normalization techniques which extends the cosine normalization. Our approach aims at refining the cosine measure between vectors in the feature space by considering another geometrical based score which is the mapped vectors’ norm ratio. We show that the designed normalized kernels satisfy the basic axioms of a similarity index unlike most unnormalized kernels. Furthermore, we prove that the proposed normalized kernels are also kernels. Finally, we assess these different similarity measures in the context of clustering tasks by using a kernel PCA based clustering approach. Our experiments employing several real-world datasets show the potential benefits of normalized kernels over the cosine normalization and the Gaussian RBF kernel.

Points out that some methods don’t result in an object being found to be most similar to…itself. What an odd result.

Moreover, it is possible for vectors that represent different scores to be treated as identical.
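
Both problems are easy to see numerically. In the sketch below, cosine normalization restores self-similarity of 1.0 but also collapses parallel vectors of different lengths; a norm-ratio factor (in the spirit of the paper's refinement, though not its exact formula) separates them again.

    import numpy as np

    X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 1.0]])  # rows 0 and 1 are parallel
    K = X @ X.T                                         # unnormalized linear kernel
    d = np.sqrt(np.diag(K))

    K_cos = K / np.outer(d, d)   # cosine normalization
    print(np.diag(K_cos))        # [1. 1. 1.]: every object most similar to itself
    print(K_cos[0, 1])           # 1.0: two different vectors treated as identical

    # norm-ratio refinement in the spirit of the paper (not its exact formula)
    ratio = np.minimum.outer(d, d) / np.maximum.outer(d, d)
    print((K_cos * ratio)[0, 1])  # 0.5: the two vectors are distinguished again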

Questions:

  1. What axioms of similarity indexes should we take notice of? (3-5 pages, citations)
  2. What methods treat vectors with different scores as identical? (3-5 pages, citations)
  3. Are geometric based similarity indices measuring semantic or geometric similarity? Are those the same concepts or different concepts? (10-15 pages, citations, you can make this a final paper if you like.)

November 12, 2010

As Time Goes by: Discovering Eras in Evolving Social Networks

Filed under: Clustering,Data Mining,Evolutionary — Patrick Durusau @ 6:21 pm

As Time Goes by: Discovering Eras in Evolving Social Networks Author(s): Michele Berlingerio, Michele Coscia, Fosca Giannotti, Anna Monreale, Dino Pedreschi

Abstract:

Within the large body of research in complex network analysis, an important topic is the temporal evolution of networks. Existing approaches aim at analyzing the evolution on the global and the local scale, extracting properties of either the entire network or local patterns. In this paper, we focus instead on detecting clusters of temporal snapshots of a network, to be interpreted as eras of evolution. To this aim, we introduce a novel hierarchical clustering methodology, based on a dissimilarity measure (derived from the Jaccard coefficient) between two temporal snapshots of the network. We devise a framework to discover and browse the eras, either in top-down or a bottom-up fashion, supporting the exploration of the evolution at any level of temporal resolution. We show how our approach applies to real networks, by detecting eras in an evolving co-authorship graph extracted from a bibliographic dataset; we illustrate how the discovered temporal clustering highlights the crucial moments when the network had profound changes in its structure. Our approach is finally boosted by introducing a meaningful labeling of the obtained clusters, such as the characterizing topics of each discovered era, thus adding a semantic dimension to our analysis.

Deeply interesting work.
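
A sketch of the core move: compute Jaccard dissimilarities between the edge sets of temporal snapshots, then cluster the snapshots hierarchically into “eras.” The snapshots are toys, and the paper's measure is derived from, not identical to, the plain Jaccard coefficient.

    from scipy.cluster.hierarchy import fcluster, linkage

    # toy edge sets for five temporal snapshots of a co-authorship graph
    snaps = [{(1, 2), (2, 3), (3, 4)},
             {(1, 2), (2, 3), (3, 5)},
             {(1, 2), (5, 6), (6, 7)},
             {(5, 6), (6, 7), (7, 8)},
             {(5, 6), (6, 8), (7, 8)}]

    def jaccard_dissim(a, b):
        return 1.0 - len(a & b) / len(a | b)

    n = len(snaps)
    condensed = [jaccard_dissim(snaps[i], snaps[j])
                 for i in range(n) for j in range(i + 1, n)]
    Z = linkage(condensed, method="average")        # hierarchy over snapshots
    print(fcluster(Z, t=2, criterion="maxclust"))   # two eras, e.g. [1 1 2 2 2]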

Questions:

  1. Is it a fair assumption that terms used by one scholar will be used the same way by scholars that cite them? (discussion)
  2. If you think #1 is true, then does entity resolution (or however you want to talk about recognition of subjects) apply from the first scholar outwards? If so, how far? (discussion)
  3. If you think #1 is false, why? (discussion)
  4. How would you go about designing a project to identify usages of terms in a body of literature? Such that you could detect changes in usage? What questions would you have to ask? (3-5 pages, citations)

PS: Another way to think about this area is: Do terms have social lives? Is that a useful way to talk about them?

November 10, 2010

Semantic-Distance Based Clustering for XML Keyword Search

Filed under: Clustering,Uncategorized — Patrick Durusau @ 1:45 pm

Semantic-Distance Based Clustering for XML Keyword Search Author(s): Weidong Yang, Hao Zhu Keywords: XML, Keyword Search, Clustering

Abstract:

XML Keyword Search is a user-friendly information discovery technique, which is well-suited to schema-free XML documents. We propose a novel scheme for XML keyword search called XKLUSTER, in which a novel semantic-distance model is proposed to specify the set of nodes contained in a result. Based on this model, we use clustering approaches to generate all meaningful results in XML keyword search. A ranking mechanism is also presented to sort the results.

The authors develop an interesting notion of “semantic distance” and then say:

Strictly speaking, the searching intentions of users can never be confirmed accurately; so different than existing researches, we suggest that all keyword nodes are useful more or less and should be included in results. Based on the semantic distance model, we divide the set of keyword nodes X into a group of smaller sets, and each of them is called a “cluster”.

Well… but the goal is to present the user with results relevant to their query, not results relevant to some query.

Still, an interesting paper and one that XML types will enjoy reading.

November 9, 2010

Rule Synthesizing from Multiple Related Databases

Filed under: Clustering,Data Mining,Heterogeneous Data,Uncategorized — Patrick Durusau @ 7:33 pm

Rule Synthesizing from Multiple Related Databases Author(s): Dan He, Xindong Wu, Xingquan Zhu Keywords: Association rule mining, rule synthesizing, multiple databases, clustering

Abstract:

In this paper, we study the problem of rule synthesizing from multiple related databases where items representing the databases may be different, and the databases may not be relevant, or similar to each other. We argue that, for such multi-related databases, simple rule synthesizing without a detailed understanding of the databases is not able to reveal meaningful patterns inside the data collections. Consequently, we propose a two-step clustering on the databases at both item and rule levels such that the databases in the final clusters contain both similar items and similar rules. A weighted rule synthesizing method is then applied on each such cluster to generate final rules. Experimental results demonstrate that the new rule synthesizing method is able to discover important rules which can not be synthesized by other methods.

The authors observe:

…existing rule synthesizing methods for distributed mining commonly assumes that related databases are relevant, share similar data distributions, and have identical items. This is equivalent to the assumption that all stores have the same type of business with identical meta-data structures, which is hardly the case in practice.

I should start collecting quotes that recognize semantic diversity as the rule rather than the exception.

More on that later. Enjoy the article.

November 7, 2010

Orthogonal Nonnegative Matrix Tri-factorization for Semi-supervised Document Co-clustering

Filed under: Clustering,Indexing — Patrick Durusau @ 8:26 pm

Orthogonal Nonnegative Matrix Tri-factorization for Semi-supervised Document Co-clustering Authors: Huifang Ma, Weizhong Zhao, Qing Tan and Zhongzhi Shi Keywords: Semi-supervised Clustering, Pairwise Constraints, Word-Level Constraints, Nonnegative Matrix tri-Factorization

Abstract:

Semi-supervised clustering is often viewed as using labeled data to aid the clustering process. However, existing algorithms fail to consider dual constraints between data points (e.g. documents) and features (e.g. words). To address this problem, in this paper, we propose a novel semi-supervised document co-clustering model OSS-NMF via orthogonal nonnegative matrix tri-factorization. Our model incorporates prior knowledge both on document and word side to aid the new word-category and document-cluster matrices construction. Besides, we prove the correctness and convergence of our model to demonstrate its mathematical rigorous. Our experimental evaluations show that the proposed document clustering model presents remarkable performance improvements with certain constraints.
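
For orientation, plain NMF already gives a document-topic and topic-word factorization; the paper adds the third factor, orthogonality, and the dual constraints. A minimal sketch of the unconstrained relative, not the OSS-NMF model itself:

    from sklearn.decomposition import NMF
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["stocks fall on rate fears", "central bank raises rates",
            "team wins league title", "coach praises winning squad"]
    X = TfidfVectorizer().fit_transform(docs)

    model = NMF(n_components=2, init="nndsvd", random_state=0)
    W = model.fit_transform(X)     # document-topic weights
    H = model.components_          # topic-word weights
    print(W.argmax(axis=1))        # cluster label per document, e.g. [0 0 1 1]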

Questions:

  1. Relies on user input, but is the user input transferable? Or is it document/collection specific? (3-5 pages, no citations)
  2. Is document level retrieval too coarse? (discussion)
  3. Subset selection, understandable for testing/development. Doesn’t it seem odd no tests were done against entire collections? (discussion)
  4. What of the exclusion of words that occur less than 3 times? Aren’t infrequent terms more likely to be significant? (3-5 pages, no citations)

November 4, 2010

A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise

Filed under: Authoring Topic Maps,Clustering,Data Mining — Patrick Durusau @ 11:26 am

A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise (1996) Authors: Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu Keywords: Clustering Algorithms, Arbitrary Shape of Clusters, Efficiency on Large Spatial Databases, Handling Noise.

Before you decide to skip this paper as “old” consider that it has > 600 citations in CiteSeer.

Abstract:

Clustering algorithms are attractive for the task of class identification in spatial databases. However, the application to large spatial databases rises the following requirements for clustering algorithms: minimal requirements of domain knowledge to determine the input parameters, discovery of clusters with arbitrary shape and good efficiency on large databases. The well-known clustering algorithms offer no solution to the combination of these requirements. In this paper, we present the new clustering algorithm DBSCAN relying on a density-based notion of clusters which is designed to discover clusters of arbitrary shape. DBSCAN requires only one input parameter and supports the user in determining an appropriate value for it. We performed an experimental evaluation of the effectiveness and efficiency of DBSCAN using synthetic data and real data of the SEQUOIA 2000 benchmark. The results of our experiments demonstrate that (1) DBSCAN is significantly more effective in discovering clusters of arbitrary shape than the well-known algorithm CLARANS, and that (2) DBSCAN outperforms CLARANS by a factor of more than 100 in terms of efficiency.

Discovery of classes is always an issue in topic map authoring/design and clustering is one way to find classes, perhaps even ones you did not suspect existed.
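
DBSCAN has long since landed in the standard toolkits; a minimal sketch on data with arbitrarily shaped clusters (scikit-learn exposes eps and min_samples, one more knob than the single parameter the paper emphasizes):

    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons

    # two interleaved half-moons: arbitrary shapes that k-means handles badly
    X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
    labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
    print(sorted(set(labels)))   # cluster ids; -1, if present, marks noise points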

October 21, 2010

mloss.org – machine learning open source software

mloss.org – machine learning open source software

Open source repository of machine learning software.

Not only are subjects being recognized by these software packages but their processes and choices are subjects as well. Not to mention their description in the literature.

Fruitful grounds for adaptation to topic maps as well as being the subject of topic maps.

There are literally hundreds of software packages here so I welcome suggestions, comments, etc. on any and all of them.

Questions:

  1. Examples of vocabulary mismatch in machine learning literature?
  2. Using one sample data set, how would you integrate results from different packages? Assume you are not merging classifiers.
  3. What if the classifiers are unknown? That is, all you have are the final results. Is your result different? Reliable?
  4. Describe a (singular) merging of classifiers in subject identity terms.

October 20, 2010

GPM: A Graph Pattern Matching Kernel with Diffusion for Chemical Compound Classification

GPM: A Graph Pattern Matching Kernel with Diffusion for Chemical Compound Classification
Authors: Aaron Smalter, Jun Huan and Gerald Lushington

Abstract:

Classifying chemical compounds is an active topic in drug design and other cheminformatics applications. Graphs are general tools for organizing information from heterogeneous sources and have been applied in modeling many kinds of biological data. With the fast accumulation of chemical structure data, building highly accurate predictive models for chemical graphs emerges as a new challenge.

In this paper, we demonstrate a novel technique called Graph Pattern Matching kernel (GPM). Our idea is to leverage existing frequent pattern discovery methods and explore their application to kernel classifiers (e.g. support vector machine) for graph classification. In our method, we first identify all frequent patterns from a graph database. We then map subgraphs to graphs in the database and use a diffusion process to label nodes in the graphs. Finally the kernel is computed using a set matching algorithm. We performed experiments on 16 chemical structure data sets and have compared our methods to other major graph kernels. The experimental results demonstrate excellent performance of our method.

The authors also note:

Publicly-available large-scale chemical compound databases have offered tremendous opportunities for creating highly efficient in silico drug design methods. Many machine learning and data mining algorithms have been applied to study the structure-activity relationship of chemicals with the goal of building classifiers for graph-structured data.

In other words, with a desktop machine, public data and a little imagination, you can make a fundamental contribution to drug design methods. (FYI, the pharmaceutical companies are making money hand over fist.)

Integrating your contribution or its results into existing information, such as with topic maps, will only increase its value.
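
The pattern-matching core of such methods is easy to demonstrate: check whether a labeled pattern graph occurs as a subgraph of a molecule graph. A sketch with an invented toy molecule, using NetworkX; the paper's frequent-pattern mining, diffusion, and set-matching steps are all omitted.

    import networkx as nx
    from networkx.algorithms import isomorphism

    # invented toy molecule as a graph with element labels
    mol = nx.Graph()
    mol.add_edges_from([("c1", "c2"), ("c2", "o1"), ("c2", "c3")])
    nx.set_node_attributes(mol, {"c1": "C", "c2": "C", "c3": "C", "o1": "O"}, "elem")

    # pattern: a carbon bonded to an oxygen
    pat = nx.Graph()
    pat.add_edge("a", "b")
    nx.set_node_attributes(pat, {"a": "C", "b": "O"}, "elem")

    gm = isomorphism.GraphMatcher(
        mol, pat, node_match=isomorphism.categorical_node_match("elem", None))
    print(gm.subgraph_is_isomorphic())   # True: the pattern occurs in the molecule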

Integrating Biological Data – Not A URL In Sight!

Actual title: Kernel methods for integrating biological data by Dick de Ridder, The Delft Bioinformatics Lab, Delft University of Technology.

Biological data integration to improve protein expression – read: hugely profitable industrial processes based on biology.

Need to integrate biological data, including “prior knowledge.”

In case kernel methods aren’t your “thing,” one important point:

There are vast seas of economically important data unsullied by URLs.

Kernel methods are one method to integrate some of that data.
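
The simplest integration device: any nonnegatively weighted sum of kernels is itself a kernel, so heterogeneous data sources can be combined at the kernel-matrix level. A sketch with invented “views” of the same six samples; the weights here are fixed, where multiple kernel learning would fit them.

    import numpy as np
    from sklearn.metrics.pairwise import linear_kernel, rbf_kernel

    rng = np.random.default_rng(0)
    expr = rng.normal(size=(6, 10))   # hypothetical expression measurements
    seq = rng.normal(size=(6, 25))    # hypothetical sequence-derived features

    K1 = rbf_kernel(expr)
    K2 = linear_kernel(seq)
    K2 *= np.trace(K1) / np.trace(K2)   # crude scale alignment

    K = 0.6 * K1 + 0.4 * K2             # still positive semi-definite, so a kernel
    print(K.shape)                      # feed to e.g. SVC(kernel="precomputed")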

Questions:

  1. How to integrate kernel methods into topic maps? (research project)
  2. Subjects in a kernel method? (research paper, limit to one method)
  3. Modeling specific uses of kernels in topic maps. (research project)
  4. Edges of kernels? Are there subject limits to kernels? (research project)

October 15, 2010

EigenSpokes: Surprising Patterns and Scalable Community Chipping in Large Graphs

EigenSpokes: Surprising Patterns and Scalable Community Chipping in Large Graphs. Authors: B. Aditya Prakash, Ashwin Sridharan, Mukund Seshadri, Sridhar Machiraju, and Christos Faloutsos Keywords: EigenSpokes – Communities – Graphs

Abstract:

We report a surprising, persistent pattern in large sparse social graphs, which we term EigenSpokes. We focus on large Mobile Call graphs, spanning about 186K nodes and millions of calls, and find that the singular vectors of these graphs exhibit a striking EigenSpokes pattern wherein, when plotted against each other, they have clear, separate lines that often neatly align along specific axes (hence the term “spokes”). Furthermore, analysis of several other real-world datasets e.g., Patent Citations, Internet, etc. reveals similar phenomena indicating this to be a more fundamental attribute of large sparse graphs that is related to their community structure.

This is the first contribution of this paper. Additional ones include (a) study of the conditions that lead to such EigenSpokes, and (b) a fast algorithm for spotting and extracting tightly-knit communities, called SpokEn, that exploits our findings about the EigenSpokes pattern.

The notion of “chipping” off communities for further study from a large graph is quite intriguing.

In part because those communities (need I say subjects?) are found as the result of a process of exploration rather than declaration.

To be sure, those subjects can be “declared” in a topic map, but finding, identifying, and deciding on subject identity properties for subjects is a lot more fun.
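
To look for spokes yourself, compute the leading singular vectors of the (sparse) adjacency matrix and plot them against each other. A sketch; the random graph below is a placeholder, since the pattern itself appears in real call or citation graphs.

    import scipy.sparse as sp
    from scipy.sparse.linalg import svds

    # placeholder sparse graph; substitute a real call graph to see actual spokes
    A = sp.random(1000, 1000, density=0.005, format="csr", random_state=0)
    A = A + A.T   # symmetrize

    U, s, Vt = svds(A, k=4)   # four leading singular vectors
    # scatter-plotting U[:, i] against U[:, j] is where the "spokes" would show
    print(U.shape, s.round(2))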

October 14, 2010

linloglayout

Filed under: Clustering,Graphs,Subject Identity — Patrick Durusau @ 10:45 am

linloglayout

Overview:

LinLogLayout is a simple program for computing graph layouts (positions of graph nodes in two- or three-dimensional space) and graph clusterings. It reads a graph from a file, computes a layout and a clustering, writes the layout and the clustering to a file, and displays them in a dialog. LinLogLayout can be used to identify groups of densely connected nodes in graphs, like communities of friends or collaborators in social networks, related documents in hyperlink structures (e.g. web graphs), cohesive subsystems in software systems, etc. With a change of a parameter in the main method, it can also compute classical “nice” (i.e. readable) force-directed layouts.

Finding “densely connected nodes” is one step towards finding subjects.

Subject finding tool kits will include a variety of such techniques.
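
The “densely connected nodes end up close together” effect can be previewed with any force-directed layout; a sketch with NetworkX, whose spring_layout is the classic force-directed analogue rather than LinLog's energy model:

    import networkx as nx

    G = nx.karate_club_graph()          # small social network with two factions
    pos = nx.spring_layout(G, seed=42)  # force-directed: dense groups cluster spatially
    print(pos[0], pos[33])              # coordinates of the two faction leaders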

October 9, 2010

Evolutionary Clustering and Analysis of Heterogeneous Information Networks

Filed under: Clustering,Evolutionary,Heterogeneous Data,Networks — Patrick Durusau @ 4:48 pm

Evolutionary Clustering and Analysis of Heterogeneous Information Networks Authors: Manish Gupta; Charu Aggarwal; Jiawei Han; Yizhou Sun Keywords: ENetClus, evolutionary clustering, typed-clustering, DBLP, bibliographic networks

Abstract:

In this paper, we study the problem of evolutionary clustering of multi-typed objects in a heterogeneous bibliographic network. The traditional methods of homogeneous clustering methods do not result in a good typed-clustering. The design of heterogeneous methods for clustering can help us better understand the evolution of each of the types apart from the evolution of the network as a whole. In fact, the problem of clustering and evolution diagnosis are closely related because of the ability of the clustering process to summarize the network and provide insights into the changes in the objects over time. We present such a tightly integrated method for clustering and evolution diagnosis of heterogeneous bibliographic information networks. We present an algorithm, ENetClus, which performs such an agglomerative evolutionary clustering which is able to show variations in the clusters over time with a temporal smoothness approach. Previous work on clustering networks is either based on homogeneous graphs with evolution, or it does not account for evolution in the process of clustering heterogeneous networks. This paper provides the first framework for evolution-sensitive clustering and diagnosis of heterogeneous information networks. The ENetClus algorithm generates consistent typed-clusterings across time, which can be used for further evolution diagnosis and insights. The framework of the algorithm is specifically designed in order to facilitate insights about the evolution process. We use this technique in order to provide novel insights about bibliographic information networks.

Exploring heterogeneous information networks is a first step towards discovery/recognition of new subjects. Only future research can say what other novel insights will emerge from work on heterogeneous information networks.

October 4, 2010

Finding your way in a multi-dimensional semantic space with Luminoso

Filed under: Clustering,Interface Research/Design,Natural Language Processing — Patrick Durusau @ 4:53 am

Finding your way in a multi-dimensional semantic space with Luminoso Authors: Robert H. Speer, Catherine Havasi, K. Nichole Treadway, Henry Lieberman Keywords: common sense, n-dimensional visualization, natural language processing, SVD

Abstract:

In AI, we often need to make sense of data that can be measured in many different dimensions — thousands of dimensions or more — especially when this data represents natural language semantics. Dimensionality reduction techniques can make this kind of data more understandable and more powerful, by projecting the data into a space of many fewer dimensions, which are suggested by the computer. Still, frequently, these results require more dimensions than the human mind can grasp at once to represent all the meaningful distinctions in the data.

We present Luminoso, a tool that helps researchers to visualize and understand a multi-dimensional semantic space by exploring it interactively. It also streamlines the process of creating such a space, by inputting text documents and optionally including common-sense background information. This interface is based on the fundamental operation of “grabbing” a point, which simultaneously allows a user to rotate their view using that data point, view associated text and statistics, and compare it to other data points. This also highlights the point’s neighborhood of semantically-associated points, providing clues for reasons as to why the points were classified along the dimensions they were. We show how this interface can be used to discover trends in a text corpus, such as free-text responses to a survey.

I particularly like the interactive rotation about a data point.

Makes me think of rotating among identifications, or even within complexes of subjects.

I suspect the presentation of “rotation” will be domain specific.

The “geek” graph/node presentation probably isn’t the best one for all audiences. Open question as to what might work better.

See: Luminoso (homepage) and Luminoso (Github)
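
The space-construction step is ordinary dimensionality reduction; a minimal sketch of projecting free-text survey responses with truncated SVD (Luminoso layers common-sense background knowledge and the interactive “grabbing” on top of this):

    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer

    responses = ["price is too high", "great value for the price",
                 "shipping was slow", "delivery took forever",
                 "love the colors", "the color options are great"]
    X = TfidfVectorizer().fit_transform(responses)
    Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
    print(Z.round(2))   # each response as a point in a 2-D semantic space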

September 29, 2010

LingPipe

Filed under: Classification,Clustering,Entity Extraction,Full-Text Search,Searching — Patrick Durusau @ 7:06 am

LingPipe.

The tutorial listing for LingPipe is the best summary of its capabilities.

Its sandbox is another “must see” location.

There may be better introductions to linguistic processing but I haven’t seen them.

September 17, 2010

New Approach for Automated Categorizing and Finding Similarities in Online Persian News

Filed under: Clustering,Similarity — Patrick Durusau @ 4:39 am

New Approach for Automated Categorizing and Finding Similarities in Online Persian News Authors: Naser Ezzati Jivan, Mahlagha Fazeli and Khadije Sadat Yousefi Keywords: Categorization of web pages – category – automatic categorization of Persian news – feature – similarity – clustering – structure of web pages.

Abstract:

The Web is a great source of information where data are stored in different formats, e.g., web-pages, archive files and images. Algorithms and tools which automatically categorize web-pages have wide applications in real-life situations. A web-site which collects news from different sources can be an example of such situations. In this paper, an algorithm for categorizing news is proposed. The proposed approach is specialized to work with documents (news) written in the Persian language but it can be easily generalized to work with documents in other languages, too. There is no standard test-bench or measure to evaluate the performance of this kind of algorithms as the amount of similarity between two documents (news) is not well-defined. To test the performance of the proposed algorithm, we implemented a web-site which uses the proposed approach to find similar news. Some of the similar news items found by the algorithm has been reported.

Similarity: The first step towards subject identification.

September 16, 2010

Data Clustering: 50 Years Beyond K-Means

Filed under: Clustering,Subject Identity — Patrick Durusau @ 4:23 am

Data Clustering: 50 Years Beyond K-Means Author: Anil K. Jain Keywords: clustering, clustering algorithms, semi-supervised clustering, ensemble clustering, simultaneous feature selection, data clustering, large scale data clustering.

Excellent survey and history of clustering.

September 14, 2010

Towards a Principled Theory of Clustering

Filed under: Clustering — Patrick Durusau @ 4:06 am

Towards a Principled Theory of Clustering Author: Reza Bosagh Zadeh Keywords: Clustering functions, Single-Linkage, Max-Sum, Minimum/Maximum Spanning Trees, Effective Similarity.

Exploration of methods to characterize clustering algorithms “…in terms of the effective similarity between two points.” A line of research that may make choice of clustering algorithms less arbitrary.
