Archive for the ‘Similarity’ Category

DELTACON: A Principled Massive-Graph Similarity Function

Saturday, April 20th, 2013

DELTACON: A Principled Massive-Graph Similarity Function by Danai Koutra, Joshua T. Vogelstein, Christos Faloutsos.

Abstract:

How much did a network change since yesterday? How different is the wiring between Bob’s brain (a left-handed male) and Alice’s brain (a right-handed female)? Graph similarity with known node correspondence, i.e. the detection of changes in the connectivity of graphs, arises in numerous settings. In this work, we formally state the axioms and desired properties of the graph similarity functions, and evaluate when state-of-the-art methods fail to detect crucial connectivity changes in graphs. We propose DeltaCon, a principled, intuitive, and scalable algorithm that assesses the similarity between two graphs on the same nodes (e.g. employees of a company, customers of a mobile carrier). Experiments on various synthetic and real graphs showcase the advantages of our method over existing similarity measures. Finally, we employ DeltaCon to real applications: (a) we classify people to groups of high and low creativity based on their brain connectivity graphs, and (b) do temporal anomaly detection in the who-emails-whom Enron graph.

How different is your current topic map from a prior version?

Could be an interesting marketing ploy to colorize the distinct portions of the graph.

Not to mention using “similarity” to mean the same subject for some purposes. Group subjects come to mind.

And for other types of analysis.

Learning Hash Functions Using Column Generation

Monday, March 11th, 2013

Learning Hash Functions Using Column Generation by Xi Li, Guosheng Lin, Chunhua Shen, Anton van den Hengel, Anthony Dick.

Abstract:

Fast nearest neighbor searching is becoming an increasingly important tool in solving many large-scale problems. Recently a number of approaches to learning data-dependent hash functions have been developed. In this work, we propose a column generation based method for learning data-dependent hash functions on the basis of proximity comparison information. Given a set of triplets that encode the pairwise proximity comparison information, our method learns hash functions that preserve the relative comparison relationships in the data as well as possible within the large-margin learning framework. The learning procedure is implemented using column generation and hence is named CGHash. At each iteration of the column generation procedure, the best hash function is selected. Unlike most other hashing methods, our method generalizes to new data points naturally; and has a training objective which is convex, thus ensuring that the global optimum can be identified. Experiments demonstrate that the proposed method learns compact binary codes and that its retrieval performance compares favorably with state-of-the-art methods when tested on a few benchmark datasets.

Interesting review of hashing techniques.

Raises the question of customized similarity (read sameness) hashing algorithms for topic maps.

I first saw this in a tweet by Stefano Bertolo.

Similarity Search and Applications

Friday, January 18th, 2013

International Conference on Similarity Search and Applications (SISAP 2013)

From the webpage:

The International Conference on Similarity Search and Applications (SISAP) is an annual forum for researchers and application developers in the area of similarity data management. It aims at the technological problems shared by numerous application domains, such as data mining, information retrieval, computer vision, pattern recognition, computational biology, geography, biometrics, machine learning, and many others that need similarity searching as a necessary supporting service.

The SISAP initiative (www.sisap.org) aims to become a forum to exchange real-world, challenging and innovative examples of applications, new indexing techniques, common test-beds and benchmarks, source code and up-to-date literature through its web page, serving the similarity search community. Traditionally, SISAP puts emphasis on the distance-based searching, but in general the conference concerns both the effectiveness and efficiency aspects of any similarity search problem.

Dates:

Paper submission: April 2013
Notification: June 2013
Final version: July 2013
Conference: October 2, 3, and 4, 2013

The specific topics include, but are not limited to:

  • Similarity queries – k-NN, range, reverse NN, top-k, etc.
  • Similarity operations – joins, ranking, classification, categorization, filtering, etc.
  • Evaluation techniques for similarity queries and operations
  • Merging/combining multiple similarity modalities
  • Cost models and analysis for similarity data processing
  • Scalability issues and high-performance similarity data management
  • Feature extraction for similarity-based data findability
  • Test collections and benchmarks
  • Performance studies, benchmarks, and comparisons
  • Similarity Search over outsourced data repositories
  • Similarity search cloud services
  • Languages for similarity databases
  • New modes of similarity for complex data understanding
  • Applications of similarity-based operations
  • Image, video, voice, and music (multimedia) retrieval systems
  • Similarity for forensics and security

You should be able to find one or more topics that interest you. ;-)

How similar must two or more references to an entity be before they are identifying the same entity?

Or for that matter, is similarity an association between two or more references?

PubChem3D: conformer ensemble accuracy

Sunday, January 13th, 2013

PubChem3D: conformer ensemble accuracy by Sunghwan Kim, Evan E Bolton and Stephen H Bryant. (Journal of Cheminformatics 2013, 5:1 doi:10.1186/1758-2946-5-1)

Abstract:

Background

PubChem is a free and publicly available resource containing substance descriptions and their associated biological activity information. PubChem3D is an extension to PubChem containing computationally-derived three-dimensional (3-D) structures of small molecules. All the tools and services that are a part of PubChem3D rely upon the quality of the 3-D conformer models. Construction of the conformer models currently available in PubChem3D involves a clustering stage to sample the conformational space spanned by the molecule. While this stage allows one to downsize the conformer models to more manageable size, it may result in a loss of the ability to reproduce experimentally determined “bioactive” conformations, for example, found for PDB ligands. This study examines the extent of this accuracy loss and considers its effect on the 3-D similarity analysis of molecules.

Results

The conformer models consisting of up to 100,000 conformers per compound were generated for 47,123 small molecules whose structures were experimentally determined, and the conformers in each conformer model were clustered to reduce the size of the conformer model to a maximum of 500 conformers per molecule. The accuracy of the conformer models before and after clustering was evaluated using five different measures: root-mean-square distance (RMSD), shape-optimized shape-Tanimoto (STST-opt) and combo-Tanimoto (ComboTST-opt), and color-optimized color-Tanimoto (CTCT-opt) and combo-Tanimoto (ComboTCT-opt). On average, the effect of clustering decreased the conformer model accuracy, increasing the conformer ensemble’s RMSD to the bioactive conformer (by 0.18 +/- 0.12 A), and decreasing the STST-opt, ComboTST-opt, CTCT-opt, and ComboTCT-opt scores (by 0.04 +/- 0.03, 0.16 +/- 0.09, 0.09 +/- 0.05, and 0.15 +/- 0.09, respectively).

Conclusion

This study shows the RMSD accuracy performance of the PubChem3D conformer models is operating as designed. In addition, the effect of PubChem3D sampling on 3-D similarity measures shows that there is a linear degradation of average accuracy with respect to molecular size and flexibility. Generally speaking, one can likely expect the worst-case minimum accuracy of 90% or more of the PubChem3D ensembles to be 0.75, 1.09, 0.43, and 1.13, in terms of STST-opt, ComboTST-opt, CTCT-opt, and ComboTCT-opt, respectively. This expected accuracy improves linearly as the molecule becomes smaller or less flexible.

If I were to say, potential shapes of a subject, would that the importance of this work clearer?

Wikipedia has this two-liner that may also help:

A macromolecule is usually flexible and dynamic. It can change its shape in response to changes in its environment or other factors; each possible shape is called a conformation, and a transition between them is called a conformational change. A macromolecular conformational change may be induced by many factors such as a change in temperature, pH, voltage, ion concentration, phosphorylation, or the binding of a ligand.

Subjects and the manner of their identification is a very deep and rewarding field of study.

An identification method in isolation is no better or worse than any other identification method.

Only your requirements (which are also subjects) can help with the process of choosing one or more identification methods over others.

Music Network Visualization

Tuesday, December 11th, 2012

Music Network Visualization by Dimiter Toshkov.

From the post:

My music interests have always been rather, hmm…, eclectic. Somehow IDM, ambient, darkwave, triphop, acid jazz, bossa nova, qawali, Mali blues and other more or less obscure genres have managed to happily co-exist in my music collection. The sheer diversity always invited the question whether there is some structure to the collection, or each genre is an island of its own. Sounds like a job for network visualization!

Now, there are plenty of music network viz applications on the web. But they don’t show my collection, and just seem unsatisfactory for various reasons. So I decided to craft my own visualization using R and igraph.

Interesting for the visualization but also the use of similarity measures.

The test for identity of a subject, particularly collective subjects, artists “similar” to X, is as unlimited as your imagination.

Infinite Jukebox plays your favorite songs forever

Sunday, November 25th, 2012

Infinite Jukebox plays your favorite songs forever by Nathan Yau.

From the post:

You know those songs that you love so much that you cry because they’re over? Well, cry no more with the Inifinite Jukebox by Paul Lamere. Inspired by Infinite Gangnam Style, the Infinite Jukebox lets you upload a song, and it’ll figure out how to cut the beats and piece them back together for a version of that song that goes forever.

Requires advanced web audio so you need to fire up a late version of Chrome or Safari. (I am on Ubuntu so can tell you about IE. In a VM?)

I tried it with Metallica’s Unforgiven.

Very impressive, although that assessment will vary based on your taste in music.

Would make an interesting interface for exploring textual features.

To have calculation of features and automatic navigation based on some pseudo-randomness. So you encounter data or text you would not otherwise have seen.

Many would argue we navigate with intention and rational purpose, but to be honest, that’s comfort analysis. It’s an explanation we use to compliment ourselves. (see, Thinking, Fast and Slow) Research suggests decision making is complex and almost entirely non-rational.

Constructing Topological Spaces — A Primer

Friday, November 16th, 2012

Constructing Topological Spaces — A Primer by Jeremy Kun.

From the post:

Last time we investigated the (very unintuitive) concept of a topological space as a set of “points” endowed with a description of which subsets are open. Now in order to actually arrive at a discussion of interesting and useful topological spaces, we need to be able to take simple topological spaces and build them up into more complex ones. This will take the form of subspaces and quotients, and through these we will make rigorous the notion of “gluing” and “building” spaces.

More heavy sledding but pay special attention to the discussion of sets and equivalences.

Jeremy concludes with pointers to books for additional reading.

Any personal favorites you would like to add to the list?

Topological Spaces — A Primer

Friday, November 16th, 2012

Topological Spaces — A Primer by Jeremy Kun.

From the post:

In our last primer we looked at a number of interesting examples of metric spaces, that is, spaces in which we can compute distance in a reasonable way. Our goal for this post is to relax this assumption. That is, we want to study the geometric structure of space without the ability to define distance. That is not to say that some notion of distance necessarily exists under the surface somewhere, but rather that we include a whole new class of spaces for which no notion of distance makes sense. Indeed, even when there is a reasonable notion of a metric, we’ll still want to blur the lines as to what kinds of things we consider “the same.”

The reader might wonder how we can say anything about space if we can’t compute distances between things. Indeed, how could it even really be “space” as we know it? The short answer is: the reader shouldn’t think of a topological space as a space in the classical sense. While we will draw pictures and say some very geometric things about topological spaces, the words we use are only inspired by their classical analogues. In fact the general topological space will be a much wilder beast, with properties ranging from absolute complacency to rampant hooliganism. Even so, topological spaces can spring out of every mathematical cranny. They bring at least a loose structure to all sorts of problems, and so studying them is of vast importance.

Just before we continue, we should give a short list of how topological spaces are applied to the real world. In particular, this author is preparing a series of posts dedicated to the topological study of data. That is, we want to study the loose structure of data potentially embedded in a very high-dimensional metric space. But in studying it from a topological perspective, we aim to eliminate the dependence on specific metrics and parameters (which can be awfully constricting, and even impertinent to the overall structure of the data). In addition, topology has been used to study graphics, image analysis and 3D modelling, networks, semantics, protein folding, solving systems of polynomial equations, and loads of topics in physics.

Topology offers an alternative to the fiction of metric distances between the semantics of words. It is a useful fiction, but a fiction none the less.

Deep sledding but well worth the time.

Efficient similarity search on multimedia databases [No Petraeus Images, Yet]

Wednesday, November 14th, 2012

Efficient similarity search on multimedia databases by Mariela Lopresti, Natalia Miranda, Fabiana Piccoli, Nora Reyes.

Abstract:

Manipulating and retrieving multimedia data has received increasing attention with the advent of cloud storage facilities. The ability of querying by similarity over large data collections is mandatory to improve storage and user interfaces. But, all of them are expensive operations to solve only in CPU; thus, it is convenient to take into account High Performance Computing (HPC) techniques in their solutions. The Graphics Processing Unit (GPU) as an alternative HPC device has been increasingly used to speedup certain computing processes. This work introduces a pure GPU architecture to build the Permutation Index and to solve approximate similarity queries on multimedia databases. The empirical results of each implementation have achieved different level of speedup which are related with characteristics of GPU and the particular database used.

No images have been published, yet, in the widening scandal around David Petraeus.

When they do, searching multimedia databases such as Flickr, Facebook, YouTube and others will be a hot issue.

Once found, there is the problem of finding unique ones again and duplicates not again.

item similarity by bipartite graph dispersion

Monday, October 15th, 2012

item similarity by bipartite graph dispersion by Mat Kelcey.

From the post:

the basis of most recommendation systems is the ability to rate similarity between items. there are lots of different ways to do this.

one model is based the idea of an interest graph where the nodes of the graph are users and items and the edges of the graph represent an interest, whatever that might mean for the domain.

if we only allow edges between users and items the graph is bipartite.

As Mat says, there are lots of ways to judge similarity.

How “similar” two subject representatives need be to represent the same subject is also wide open.

For reliable interchange, however, it is best that you declare (disclose) your similarity measure and what happens when it is met.

The Topic Maps Data Model declares similarity based on matching IRIs within sets of IRIs and prescribes what follows when matches are found. If you need an example of disclosure.

Estimating Subject Sameness?

Friday, October 12th, 2012

If you think about it, graph isomorphism is a type of subject sameness problem.

Sadly, graph isomorphism remains a research problem so not immediately applicable to problems you encounter with topic maps.

However, Alex Smola in The Weisfeiler-Lehman algorithm and estimation on graphs covers some “cheats” that you may find useful.

Imagine you have two graphs \(G\) and \(G’\) and you’d like to check how similar they are. If all vertices have unique attributes this is quite easy:

FOR ALL vertices \(v \in G \cup G’\) DO

  • Check that \(v \in G\) and that \(v \in G’\)
  • Check that the neighbors of v are the same in \(G\) and \(G’\)

This algorithm can be carried out in linear time in the size of the graph, alas many graphs do not have vertex attributes, let alone unique vertex attributes. In fact, graph isomorphism, i.e. the task of checking whether two graphs are identical, is a hard problem (it is still an open research question how hard it really is). In this case the above algorithm cannot be used since we have no idea which vertices we should match up.

The Weisfeiler-Lehman algorithm is a mechanism for assigning fairly unique attributes efficiently. Note that it isn’t guaranteed to work, as discussed in this paper by Douglas – this would solve the graph isomorphism problem after all. The idea is to assign fingerprints to vertices and their neighborhoods repeatedly. We assume that vertices have an attribute to begin with. If they don’t then simply assign all of them the attribute 1. Each iteration proceeds as follows:

Curious if you find the negative approach, “these two graphs are not isomorphic,” as useful as a positive one (where it works), “these two graphs are isomorphic?”

Or is it sufficient to reliably know that graphs are different?

Item based similarity with GraphChi

Tuesday, September 11th, 2012

Item based similarity with GraphChi by Danny Bickson.

From the post:

Item based collaborative filtering is one of the most popular collaborative filtering methods used in more than 70% of the companies I am talking to. Following my mega collaborator Justin Yan‘s advice, I have started to implement some item based similarity methods in GraphChi.

Item based methods compare all pairs of items together, for computing similarity metric between each pair of items. This task becomes very quickly computation intensive. For example, Netflix data has around 17K movies. If we want to compute all pairs of movies to find the most similar ones, we need to compare around 290M pairs!

If we use a symmetric similarity measure, the distance between movie A and B, is similar to the distance between movie B and A. Thus for the Netflix example we have around 145M pairs to compute. To reduce the work furthermore, we only compare movies which where watched together by at least X users, for example X=5. Otherwise, those movies are not considered similar.

When the dataset is big, it is not possible to load it fully into memory at a single machine. That is where GraphChi comes in. My preliminary implementation of the item similarity computes similarity between all pairs without fully reading the dataset into memory. The idea is to load a chunk of the items (called pivots) into memory, and then stream over the rest of the items by comparing the pivots to the rest of the items.

I need to check on Danny’s blog and the GraphChi/GraphLab webpages more often!

Measuring Similarity in Large-scale Folksonomies [Users vs. Authorities]

Tuesday, September 11th, 2012

Measuring Similarity in Large-scale Folksonomies by Giovanni Quattrone, Emilio Ferrara, Pasquale De Meo, and Licia Capra.

Abstract:

Social (or folksonomic) tagging has become a very popular way to describe content within Web 2.0 websites. Unlike taxonomies, which overimpose a hierarchical categorisation of content, folksonomies enable end-users to freely create and choose the categories (in this case, tags) that best describe some content. However, as tags are informally defined, continually changing, and ungoverned, social tagging has often been criticised for lowering, rather than increasing, the efficiency of searching, due to the number of synonyms, homonyms, polysemy, as well as the heterogeneity of users and the noise they introduce. To address this issue, a variety of approaches have been proposed that recommend users what tags to use, both when labelling and when looking for resources.

As we illustrate in this paper, real world folksonomies are characterized by power law distributions of tags, over which commonly used similarity metrics, including the Jaccard coefficient and the cosine similarity, fail to compute. We thus propose a novel metric, specifically developed to capture similarity in large-scale folksonomies, that is based on a mutual reinforcement principle: that is, two tags are deemed similar if they have been associated to similar resources, and vice-versa two resources are deemed similar if they have been labelled by similar tags. We offer an efficient realisation of this similarity metric, and assess its quality experimentally, by comparing it against cosine similarity, on three large-scale datasets, namely Bibsonomy, MovieLens and CiteULike.

Studying language (tags) as used tells you about users.

Studying language as proscribed by an authority, tells you about that authority.

Which one is of interest to you?

Graphlet-based edge clustering reveals pathogen-interacting proteins

Monday, September 10th, 2012

Graphlet-based edge clustering reveals pathogen-interacting proteins by R. W. Solava, R. P. Michaels and T. Milenković. (Bioinformatics (2012) 28 (18): i480-i486. doi: 10.1093/bioinformatics/bts376)

Abstract:

Motivation: Prediction of protein function from protein interaction networks has received attention in the post-genomic era. A popular strategy has been to cluster the network into functionally coherent groups of proteins and assign the entire cluster with a function based on functions of its annotated members. Traditionally, network research has focused on clustering of nodes. However, clustering of edges may be preferred: nodes belong to multiple functional groups, but clustering of nodes typically cannot capture the group overlap, while clustering of edges can. Clustering of adjacent edges that share many neighbors was proposed recently, outperforming different node clustering methods. However, since some biological processes can have characteristic ‘signatures’ throughout the network, not just locally, it may be of interest to consider edges that are not necessarily adjacent.

Results: We design a sensitive measure of the ‘topological similarity’ of edges that can deal with edges that are not necessarily adjacent. We cluster edges that are similar according to our measure in different baker’s yeast protein interaction networks, outperforming existing node and edge clustering approaches. We apply our approach to the human network to predict new pathogen-interacting proteins. This is important, since these proteins represent drug target candidates.

Availability: Software executables are freely available upon request.

Contact: tmilenko@nd.edu

Of interest for bioinformatics but more broadly for its insights into topological similarity and edge clustering by topological similarity.

Being mindful that an “edge” is for all intents and purposes a “node” that records a connection between two (non-hyperedge and non-looping edge) other nodes. Nodes could, but don’t generally record their connection to other nodes, that connection being represented by an edge.

Introduction to Recommendations with Map-Reduce and mrjob [Ode to Similarity, Music]

Saturday, August 25th, 2012

Introduction to Recommendations with Map-Reduce and mrjob by Marcel Caraciolo

From the post:

In this post I will present how can we use map-reduce programming model for making recommendations. Recommender systems are quite popular among shopping sites and social network thee days. How do they do it ? Generally, the user interaction data available from items and products in shopping sites and social networks are enough information to build a recommendation engine using classic techniques such as Collaborative Filtering.

Usual recommendation post except for the emphasis on multiple tests of similarity.

Useful because simply reporting that two (or more) items are “similar” isn’t all that helpful. At least unless or until you know the basis for the comparison.

And have the expectation that a similar notion of “similarity” works for your audience.

For example, I read an article this morning about a “new” invention that will change the face of sheet music publishing, in three to five years. Invention Will Strike a Chord With Musicians

Despite the lack of terms like “markup,” “HyTime,” “SGML,” “XML,” “Music Encoding Initiative (MEI),” or “MusicXML,” all of those seemed quite “similar” to me. That may not be the “typical” experience but it is mine.

If you don’t want to wait three to five years for the sheet music revolution, you can check out MusicXML. It has been reported that more than 150 applications support MusicXML. Oh, that would be today, not three to five years from now.

You might want to pass the word along in the music industry before the next “revolution” in sheet music starts up.

Bi-directional semantic similarity….

Sunday, August 19th, 2012

Bi-directional semantic similarity for gene ontology to optimize biological and clinical analyses by Sang Jay Bien, Chan Hee Park, Hae Jin Shim, Woongcheol Yang, Jihun Kim and Ju Han Kim.

Abstract:

Background Semantic similarity analysis facilitates automated semantic explanations of biological and clinical data annotated by biomedical ontologies. Gene ontology (GO) has become one of the most important biomedical ontologies with a set of controlled vocabularies, providing rich semantic annotations for genes and molecular phenotypes for diseases. Current methods for measuring GO semantic similarities are limited to considering only the ancestor terms while neglecting the descendants. One can find many GO term pairs whose ancestors are identical but whose descendants are very different and vice versa. Moreover, the lower parts of GO trees are full of terms with more specific semantics.

Methods This study proposed a method of measuring semantic similarities between GO terms using the entire GO tree structure, including both the upper (ancestral) and the lower (descendant) parts. Comprehensive comparison studies were performed with well-known information content-based and graph structure-based semantic similarity measures with protein sequence similarities, gene expression-profile correlations, protein–protein interactions, and biological pathway analyses.

Conclusion The proposed bidirectional measure of semantic similarity outperformed other graph-based and information content-based methods.

Makes me curious what the experience with direction and identification has been with other ontologies?

Measuring similarity and distance function

Sunday, August 12th, 2012

Measuring similarity and distance function by Ricky Ho.

Ricky covers:

  • Distance between numeric data points
  • Distance between categorical data points
  • Distance between mixed categorical and numeric data points
  • Distance between sequence (String, TimeSeries)
  • Distance between nodes in a network
  • Distance between population distribution

Not every measure or function but enough to get you started.

Scholarly network similarities…

Monday, June 25th, 2012

Scholarly network similarities: How bibliographic coupling networks, citation networks, cocitation networks, topical networks, coauthorship networks, and coword networks relate to each other by Erjia Yan and Ying Ding. (Journal of the American Society for Information Science and Technology Volume 63, Issue 7, pages 1313–1326, July 2012)

Abstract:

This study explores the similarity among six types of scholarly networks aggregated at the institution level, including bibliographic coupling networks, citation networks, cocitation networks, topical networks, coauthorship networks, and coword networks. Cosine distance is chosen to measure the similarities among the six networks. The authors found that topical networks and coauthorship networks have the lowest similarity; cocitation networks and citation networks have high similarity; bibliographic coupling networks and cocitation networks have high similarity; and coword networks and topical networks have high similarity. In addition, through multidimensional scaling, two dimensions can be identified among the six networks: Dimension 1 can be interpreted as citation-based versus noncitation-based, and Dimension 2 can be interpreted as social versus cognitive. The authors recommend the use of hybrid or heterogeneous networks to study research interaction and scholarly communications.

Interesting that I should come across this article after posting about data sets. See http://info.slis.indiana.edu/~eyan/papers/citation/ the post-processing data and interactive visualizations that are reproduced as static views in the article.

At page 1323 the authors say:

In addition to social networks versus information networks, another distinction of real connection-based networks versus artificial connection-based networks can be made. Coauthorship networks and citation networks are constructed based on real connections, whereas cocitation, bibliographic coupling, topical, and coword networks are constructed based on artificial connections,5 usually in the form of similarity measurements.

I am not sure about the “real” versus “artificial” connection that comes so easily to hand. In part because authors, in my experience, tend to use terminology similar to other scholars with who they agree. So the connection between the work of scholars isn’t “artificial,” although it isn’t accounted for in this study.

There is much to admire and even imitate in this article but the interaction between scholars is more complex than its representation by the networks here.

Neo4j + last.fm in Java

Saturday, April 14th, 2012

Neo4j + last.fm in Java by Niklas Lindblad.

Video that walks through the use of Neo4j and the music service last.fm. (Full code: https://niklaslindblad.se/kod/Database.java.html)

Watching the import, I was reminded of Iron Maiden, a group I had not thought of for years.

Directional “similarity” relationships/edges. Can also store “measures” of similarity along the edges.

Question: For subject identity purposes, can we do more than boolean yes/no?

That is for some purposes, if a subject representative is => .90 similar, can we merge it with another node?

Or for other purposes we might use a different measure of similarity?

Is it possible for boolean and measure of similarity topic maps to exist in the same topic map?

On nonmetric similarity search problems in complex domains

Saturday, February 25th, 2012

On nonmetric similarity search problems in complex domains by Tomáš Skopal and Benjamin Bustos.

Abstract:

The task of similarity search is widely used in various areas of computing, including multimedia databases, data mining, bioinformatics, social networks, etc. In fact, retrieval of semantically unstructured data entities requires a form of aggregated qualification that selects entities relevant to a query. A popular type of such a mechanism is similarity querying. For a long time, the database-oriented applications of similarity search employed the definition of similarity restricted to metric distances. Due to its topological properties, metric similarity can be effectively used to index a database which can then be queried efficiently by so-called metric access methods. However, together with the increasing complexity of data entities across various domains, in recent years there appeared many similarities that were not metrics—we call them nonmetric similarity functions. In this article we survey domains employing nonmetric functions for effective similarity search, and methods for efficient nonmetric similarity search. First, we show that the ongoing research in many of these domains requires complex representations of data entities. Simultaneously, such complex representations allow us to model also complex and computationally expensive similarity functions (often represented by various matching algorithms). However, the more complex similarity function one develops, the more likely it will be a nonmetric. Second, we review state-of-the-art techniques for efficient (fast) nonmetric similarity search, concerning both exact and approximate search. Finally, we discuss some open problems and possible future research trends.

The first paragraph of the conclusion of this survey on nonmetric similarity is an argument for topic maps (or at least the result of using a topic map):

In this article, we have surveyed the current situation concerning the employment of nonmetric similarity functions for effective and efficient similarity search in complex domains. One of the main results of the article is a surprising revelation that nonmetric similarity measuring is widely used in isolated domains, spanning many areas of interdisciplinary research. This includes multimedia databases, time series, and medical, scientific, chemical, and bioinformatic tasks, among others. (emphasis added)

True enough, survey articles such as this one may tempt a few researchers and possibly graduate students to peek over the discipline walls, however briefly. But research articles need to routinely cite the literature of other disciplines, betraying a current awareness of other fields. To take advantage of advances in other fields as well as to serve as an example for the next generation of researchers.

Sentence based semantic similarity measure for blog-posts

Saturday, January 14th, 2012

Sentence based semantic similarity measure for blog-posts by Mehwish Aziz and Muhammad Rafi.

Abstract:

Blogs-Online digital diary like application on web 2.0 has opened new and easy way to voice opinion, thoughts, and like-dislike of every Internet user to the World. Blogosphere has no doubt the largest user-generated content repository full of knowledge. The potential of this knowledge is still to be explored. Knowledge discovery from this new genre is quite difficult and challenging as it is totally different from other popular genre of web-applications like World Wide Web (WWW). Blog-posts unlike web documents are small in size, thus lack in context and contain relaxed grammatical structures. Hence, standard text similarity measure fails to provide good results. In this paper, specialized requirements for comparing a pair of blog-posts is thoroughly investigated. Based on this we proposed a novel algorithm for sentence oriented semantic similarity measure of a pair of blog-posts. We applied this algorithm on a subset of political blogosphere of Pakistan, to cluster the blogs on different issues of political realm and to identify the influential bloggers.

I am not sure I agree that “relaxed grammatical structures” are peculiar to blog posts. ;-)

A “measure” of similarity that I have not seen (would appreciate a citation if you have) is the listing of one blog by another by another in its “blogroll.” On the theory that blogs may cite blogs they disagree with both semantically and otherwise in post but are unlikely to list blogs in their “blogroll” that they find disagreeable.

Efficient Similarity Query Processing (Previously Efficient Exact Similarity Join)

Saturday, December 24th, 2011

Efficient Similarity Query Processing (Previously Efficient Exact Similarity Join)

From the webpage:

Given a similarity function and two sets of objects, a similarity join returns all pairs of objects (from each set respectively) such that their similarity value satisifies a given criterion. A typical example is to find pairs of documents such that their cosine similarity is above a constant threshold (say, 0.95), as they are likely to be near duplicate documents.

In this project, we focus on efficient algorithms to perform the similarity join on (multi-) sets or strings both exactly and approximately. Commonly used similarity functions for (multi-) sets or strings are more complex than Euclidean distance functions. As a result, many previous approaches have to compute the approximate results instead.

We have developed several fast algorithms that address the above performance issue. They work for Jaccard, Dice, Cosine similarities and Edit distance constraints. We have also devised algorithms to compute top-k similar pairs progressively so that a user does not need to specify a similarity threshold for some unknown dataset. Recently, we have obtained several interesting new results on edit similarity search and joins by exploiting asymmetric signature schemes. The resulting IndexChunkTurbo algorithm can process most of the similarity queries efficiently while occupying almost the least amount of index space.

We also investigated the rpbolem of approximate entity matching. Our SIGMOD09 work can extract approximate mentions of entities in a document given an entity dictionary.

The first five (5) papers:

  • Xiang Zhao, Chuan Xiao, Xuemin Lin, Wei Wang. Efficient Graph Similarity Joins with Edit Distance Constraints. ICDE 2012.

    Summary: An efficient algorithm to compute graphs within certain (graph) edit distance away from a query graph.

  • Sunanda Patro, Wei Wang. Learning Top-k Transformation Rules. DEXA 2011.

    Summary: A new algorithm to learn transformation rules (e.g., VLDB = Very Large Data Base) from unannotated, potentially noisy data. Compared with previous approaches, our method can find much more valid rules in the top-k output.

  • Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu, Guoren Wang. Efficient Similarity Joins for Near Duplicate Detection. ACM Transaction of Database Systems (TODS).

    Summary: This is the journal version of our WWW 2008 paper, with extension to implementing the PPJoin family of algorithms on top of relational database systems.

  • Jianbin Qin, Wei Wang, Yifei Lu, Chuan Xiao, Xuemin Lin. Efficient Exact Edit Similarity Query Processing with Asymmetric Signature Schemes. SIGMOD 2011.

    Summary: Two simple yet highly efficient algorithms are proposed in this paper that works very well for both edit similarity search and joins. Perhaps equally intereting is a comprehensive experiment involding Flamingo (ver 3), PartEnum, Ed-Join, Bed-tree, Trie-Join, NGPP, VGRAM, NS09.

  • Chaokun Wang, Jianmin Wang, Xuemin Lin, Wei Wang, Haixun Wang, Hongsong Li, Wanpeng Tian, Jun Xu, Rui Li. MapDupReducer: Detecting Near Duplicates over Massive Datasets. SIGMOD 2010. PPT

    Summary: This work essentially ports the ppjoin+ algorithm to the Map-Reduce framework in order to deal with huge volume of data.

The first paper is saying exactly what your suspect: similarity of arbitrary sub-graphs. Nothing like picking a hard problem is there? Not fully general solution but see what you think of theirs.

Did I mention there is working code at this site as well? Definitely a group to watch in the future.

Bed-tree: an all-purpose index structure for string similarity search based on edit distance

Friday, December 23rd, 2011

Bed-tree: an all-purpose index structure for string similarity search based on edit distance by Zhenjie Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, and Divesh Srivastava.

Abstract:

Strings are ubiquitous in computer systems and hence string processing has attracted extensive research effort from computer scientists in diverse areas. One of the most important problems in string processing is to efficiently evaluate the similarity between two strings based on a specified similarity measure. String similarity search is a fundamental problem in information retrieval, database cleaning, biological sequence analysis, and more. While a large number of dissimilarity measures on strings have been proposed, edit distance is the most popular choice in a wide spectrum of applications. Existing indexing techniques for similarity search queries based on edit distance, e.g., approximate selection and join queries, rely mostly on n-gram signatures coupled with inverted list structures. These techniques are tailored for specific query types only, and their performance remains unsatisfactory especially in scenarios with strict memory constraints or frequent data updates. In this paper we propose the Bed-tree, a B+-tree based index structure for evaluating all types of similarity queries on edit distance and normalized edit distance. We identify the necessary properties of a mapping from the string space to the integer space for supporting searching and pruning for these queries. Three transformations are proposed that capture different aspects of information inherent in strings, enabling efficient pruning during the search process on the tree. Compared to state-of-the-art methods on string similarity search, the Bed-tree is a complete solution that meets the requirements of all applications, providing high scalability and fast response time.

The authors classify similarity queries as:

Type Example
range address in customer database
top-k results of search engine
all-pairs joins
pairs of proteins or genes

There are a couple of things that trouble me about the paper.

First:

6.3 Top-K construction

In many cases, top-k queries are more practical than range queries. However, existing indexing schemes with inverted lists do not naturally support such queries. To illustrate
the performance benefits of our proposals, we implemented a simple strategy with Flamingo, by increasing the range query threshold gradually until more than k string results
are found. Notice that we use the same Bed-tree structures to support all different types of queries. Thus, we skip the performance comparisons on index construction but focus on query processing efficiency. (emphasis added)

I am not sure what is meant by inverted lists “…do not naturally support …[top-k queries].” Inverted list structures are wildly popular among WWW search engines so I would like to know more about this notion of “…naturally support….”

Moreover, indexes aren’t simply used, they are created as well. Puzzling that we are left to wonder how long it will take to have a Bed-tree database that we can use.

Second, there are a couple of fairly serious mis-citation errors in the paper. The authors refer to “Flamingo” and “Mismatch” (from 2008) as comparisons but the articles cited: “[15] C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, pages 257–266, 2008″ and “C. Xiao, W. Wang, and X. Lin. Ed-join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB, 1(1):933–944, 2008, respectively, are innocent of any such implementations.

C. Li is the faculty adviser for the Flamingo project, which has a new release since I mentioned it at: The FLAMINGO Project on Data Cleaning, but you don’t cite a project by picking a paper at random that doesn’t mention the project. (If you haven’t looked at the FLAMINGO project, its software and papers you really should.)

C. Xiao and company have a “mismatch” filter but it isn’t ever used as the name of an implementation.

Tracing the history of advances in computer science is easier or at least less time consuming if researchers don’t have to chase rabbits in the form of bad citations. Not to mention that if you aren’t cited properly, you may not get full credit for all the work you have actually done. Good citation practices are in everyone’s interest.

R package DOSE released

Saturday, December 17th, 2011

R package DOSE released

From the post:

Disease Ontology (DO) provides an open source ontology for the integration of biomedical data that is associated with human disease. DO analysis can lead to interesting discoveries that deserve further clinical investigation.

DOSE was designed for semantic similarity measure and enrichment analysis.

Four information content (IC)-based methods, proposed by Resnik, Jiang, Lin and Schlicker, and one graph structure-based method, proposed by Wang, were implemented. The calculation details can be referred to the vignette of R package GOSemSim.

Hypergeometric test was implemented for enrichment analysis.

I like that “…semantic similarity measure and enrichment analysis.”

What semantic similarity measures are you using?

From datasets to algorithms in R

Tuesday, December 13th, 2011

From datasets to algorithms in R by John Johnson.

From the post:

Many statistical algorithms are taught and implemented in terms of linear algebra. Statistical packages often borrow heavily from optimized linear algebra libraries such as LINPACK, LAPACK, or BLAS. When implementing these algorithms in systems such as Octave or MATLAB, it is up to you to translate the data from the use case terms (factors, categories, numerical variables) into matrices.

In R, much of the heavy lifting is done for you through the formula interface. Formulas resemble y ~ x1 + x2 + …, and are defined in relation to a data.frame….

Interesting to consider if R would be useful language for exploring similarity measures? After all, in Analysis of Amphibian Biodiversity Data I pointed out work that reviewed forty-six (46) similarity measures. I suspect that is a small percentage of all similarity measures. I remember a report that said (in an astronomy context) that more than 100 algorithms/data models for data integration appear every month.

Obviously a rough guess/estimate but one that should give us pause in terms of being too wedded to one measure of similarity or another.

Suggestions of existing collections of similarity measures? Either in literature or code?

Thinking it would be instructive to throw some of the open government data sources against similarity measures.

Similarity as Association?

Tuesday, November 29th, 2011

I was listening to Ian Robinson’s recent presentation on Dr. Who and Neo4j when Ian remarked that similarity could be modeled as a relationship.

It seemed like an off-hand remark at the time but it struck me as having immediate relevance to using Neo4j with topic maps.

One of my concerns for using Neo4j with topic maps has been the TMDM specification of merging topic items as:

1. Create a new topic item C.
2. Replace A by C wherever it appears in one of the following properties of an information item: [topics], [scope], [type], [player], and [reifier].
3. Repeat for B.
4. Set C’s [topic names] property to the union of the values of A and B’s [topic names] properties.
5. Set C’s [occurrences] property to the union of the values of A and B’s [occurrences] properties.
6. Set C’s [subject identifiers] property to the union of the values of A and B’s [subject identifiers] properties.
7. Set C’s [subject locators] property to the union of the values of A and B’s [subject locators] properties.
8. Set C’s [item identifiers] property to the union of the values of A and B’s [item identifiers] properties.
(TMDM, 6.2 Merging Topic Items)

Obviously the TMDM is specifying an end result and not how you get there but still, there has to be a mechanism by which a query that finds A or B, also results in the “union of the values of A and B’s [topic names] properties.” (And the other operations specified by the TMDM here and elsewhere.)

Ian’s reference to similarity being modeled as a relationship made me realize that similarity relationships could be created between nodes that share the same [subject identifiers} property value (and other conditions for merging). Thus, when querying a topic map, there should be the main query, followed by a query for “sameness” relationships for any returned objects.

This avoids the performance “hit” of having to update pointers to information items that are literally re-written with new identifiers. Not to mention that processing the information items that will be presented to the user as one object could even be off-loaded onto the client, with a further savings in server side processing.

There is an added bonus to this approach, particularly for complex merging conditions beyond the TMDM. Since the directed edges have properties, it would be possible to dynamically specify merging conditions beyond those of the TMDM based on those properties. Which means that “merging” operations could be “unrolled” as it were.

Or would that be “rolled down?” Thinking that a user could step through each addition of a “merging” condition and observe the values as they were added, along with their source. Perhaps even set “break points” as in debugging software.

Will have to think about this some more and work up some examples in Neo4j. Comments/suggestions most welcome!

PS: You know, if this works, Neo4j already has a query language, Cypher. I don’t know if Cypher supports declaration of routines that can be invoked as parts of queries but investigate that possibility. Just to keep users from having to write really complex queries to gather up all the information items on a subject. That won’t help people using other key/value stores but there are some interesting possibilities there as well. Will depend on the use cases and nature of “merging” requirements.

eTBLAST: a text-similarity based search engine

Friday, November 18th, 2011

eTBLAST: a text-similarity based search engine

eTBLAST from the Virginia Bioinformatics Institute at Virginia Tech lies at the heart of the Deja vu search service.

Unlike Deja vu, the eTBLAST interface is not limited to citations in PubMed, but includes:

  • MEDLINE
  • CRISP
  • NASA
  • Medical Cases
  • PMC Full Text
  • PMC METHODS
  • PMC INTRODUCTION
  • PMC RESULTS
  • PMC (paragraphs)
  • PMC Medical Cases
  • Clinical Trials
  • Arxiv
  • Wikipedia
  • VT Courses

There is a set of APIs, limited to some of the medical material.

While I was glad to see Arxiv included, given my research interests, CiteSeerX, or the ACM, IEEE, ALA or other information/CS would be of greater interest.

Among other things, I would like to have the ability to create a map of synonyms (can you say “topic map?”) that could be substituted during the comparison of the literature.

But, in the meantime, definitely will try the interface on Arxiv to see how it impacts my research on techniques relevant to topic maps.

Indexing Sound: Musical Riffs to Gunshots

Thursday, November 10th, 2011

Sound, Digested: New Software Tool Provides Unprecedented Searches of Sound, from Musical Riffs to Gunshots

From the post:

Audio engineers have developed a novel artificial intelligence system for understanding and indexing sound, a unique tool for both finding and matching previously un-labeled audio files.

Having concluded beta testing with one of the world’s largest Hollywood sound studios and leading media streaming and hosting services, Imagine Research of San Francisco, Calif., is now releasing MediaMined™ (http://www.imagine-research.com/node/51) for applications ranging from music composition to healthcare.

….

One of the key innovations of the new technology is the ability to perform sound-similarity searches. Now, when a musician wants a track with a matching feel to mix into a song, or an audio engineer wants a slightly different sound effect to work into a film, the process can be as simple as uploading an example file and browsing the detected matches.

“There are many tools to analyze and index sound, but the novel, machine-learning approach of MediaMined™ was one reason we felt the technology could prove important,” says Errol Arkilic, the NSF program director who helped oversee the Imagine Research grants. “The software enables users to go beyond finding unique objects, allowing similarity searches–free of the burden of keywords–that generate previously hidden connections and potentially present entirely new applications.”

Or from the Imagine Research Applications page:

Organize Sound

Automatically index the acoustic content of video, audio, and live streams across a companies web services. Analyze web-crawled data, user-generated content, professional broadcast content, and live streaming events.

Benefits:

  • Millions of minutes of content are now searchable
  • Recommending related content increases audience and viewer consumption
  • Better content discovery, intelligent navigation within media files
  • Search audio content with ease and accuracy
  • Audio content-aware targeted ads – improves ad performance and revenue

Search Sound

Perform sound-similarity searches for sounds and music by using example sounds.
Search for production music that matches a given track.
Perform the rhythmic similarity searches

Benefits:

  • Recommending related content increases audience and viewer consumption
  • Music/Audio licensing portals provide a unique-selling point: find content based on an input seed track.
  • Improved monetization of existing content with similarity-search and recommendations

And if you have a topic map of music, producers, studios, albums, etc., this could supplement your topic map based on similarity measures every time a new recording is released, or music is uploaded to a website or posted to YouTube. So you know who to contact for whatever reason.

A topic map of videos could serve up matches and links with thumbnails for videos with similar sound content based on a submitted sample, a variation as it were on “more of same.”

A topic map of live news feeds could detect repetition of news stories and with timing information could map the copying of content from one network to the next. Or provide indexing of news accounts without the necessity of actually sitting through the broadcasts. That is an advantage not mentioned above.

Sound recognition isn’t my area so if this is old news or there are alternatives to suggest to topic mappers, please sing out! (Sorry!)

Computing Document Similarity using Lucene Term Vectors

Wednesday, October 26th, 2011

Computing Document Similarity using Lucene Term Vectors

From the post:

Someone asked me a question recently about implementing document similarity, and since he was using Lucene, I pointed him to the Lucene Term Vector API. I hadn’t used the API myself, but I knew in general how it worked, so when he came back and told me that he could not make it work, I wanted to try it out for myself, to give myself a basis for asking further questions.

I already had a Lucene index (built by SOLR) of about 3000 medical articles for whose content field I had enabled term vectors as part of something I was trying out for highlighting, so I decided to use that. If you want to follow along and have to build your index from scratch, you can either use a field definition in your SOLR schema.xml file similar to this:

Nice walk through on document vectors.

Plus a reminder that “document” similarity can only take you so far. Once you find a relevant document, you still have to search for the subject of interest. Not to mention that you view that subject absent its relationship to other subjects, etc.

High Performance Computing with Python – Part 03

Thursday, October 6th, 2011

High Performance Computing with Python – Part 03

From the post:

In this series we will analyze how to optimize the statistical Spearman Rank’s Correlation coefficient, which it is a particular measure used to compute the similarity between items in recommender systems and assesses how well the relationship between two variables can be described using a monotonic function. The source code for this metric can be found in the first post.

Great series and all the greater for covering measures of similarity between items. If something becomes similar enough, it could be considered to be the same thing.