Similarity « Another Word For It

November 25, 2012

Infinite Jukebox plays your favorite songs forever

Filed under: Interface Research/Design,Music,Navigation,Similarity — Patrick Durusau @ 11:51 am

Infinite Jukebox plays your favorite songs forever by Nathan Yau.

From the post:

You know those songs that you love so much that you cry because they’re over? Well, cry no more with the Inifinite Jukebox by Paul Lamere. Inspired by Infinite Gangnam Style, the Infinite Jukebox lets you upload a song, and it’ll figure out how to cut the beats and piece them back together for a version of that song that goes forever.

Requires advanced web audio so you need to fire up a late version of Chrome or Safari. (I am on Ubuntu so can tell you about IE. In a VM?)

I tried it with Metallica’s Unforgiven.

Very impressive, although that assessment will vary based on your taste in music.

Would make an interesting interface for exploring textual features.

To have calculation of features and automatic navigation based on some pseudo-randomness. So you encounter data or text you would not otherwise have seen.

Many would argue we navigate with intention and rational purpose, but to be honest, that’s comfort analysis. It’s an explanation we use to compliment ourselves. (see, Thinking, Fast and Slow) Research suggests decision making is complex and almost entirely non-rational.

Comments Off

November 16, 2012

Constructing Topological Spaces — A Primer

Filed under: Identity,Similarity,Topology — Patrick Durusau @ 10:58 am

Constructing Topological Spaces — A Primer by Jeremy Kun.

From the post:

Last time we investigated the (very unintuitive) concept of a topological space as a set of “points” endowed with a description of which subsets are open. Now in order to actually arrive at a discussion of interesting and useful topological spaces, we need to be able to take simple topological spaces and build them up into more complex ones. This will take the form of subspaces and quotients, and through these we will make rigorous the notion of “gluing” and “building” spaces.

More heavy sledding but pay special attention to the discussion of sets and equivalences.

Jeremy concludes with pointers to books for additional reading.

Any personal favorites you would like to add to the list?

Comments Off

Topological Spaces — A Primer

Filed under: Identity,Similarity,Topology — Patrick Durusau @ 10:46 am

Topological Spaces — A Primer by Jeremy Kun.

From the post:

In our last primer we looked at a number of interesting examples of metric spaces, that is, spaces in which we can compute distance in a reasonable way. Our goal for this post is to relax this assumption. That is, we want to study the geometric structure of space without the ability to define distance. That is not to say that some notion of distance necessarily exists under the surface somewhere, but rather that we include a whole new class of spaces for which no notion of distance makes sense. Indeed, even when there is a reasonable notion of a metric, we’ll still want to blur the lines as to what kinds of things we consider “the same.”

The reader might wonder how we can say anything about space if we can’t compute distances between things. Indeed, how could it even really be “space” as we know it? The short answer is: the reader shouldn’t think of a topological space as a space in the classical sense. While we will draw pictures and say some very geometric things about topological spaces, the words we use are only inspired by their classical analogues. In fact the general topological space will be a much wilder beast, with properties ranging from absolute complacency to rampant hooliganism. Even so, topological spaces can spring out of every mathematical cranny. They bring at least a loose structure to all sorts of problems, and so studying them is of vast importance.

Just before we continue, we should give a short list of how topological spaces are applied to the real world. In particular, this author is preparing a series of posts dedicated to the topological study of data. That is, we want to study the loose structure of data potentially embedded in a very high-dimensional metric space. But in studying it from a topological perspective, we aim to eliminate the dependence on specific metrics and parameters (which can be awfully constricting, and even impertinent to the overall structure of the data). In addition, topology has been used to study graphics, image analysis and 3D modelling, networks, semantics, protein folding, solving systems of polynomial equations, and loads of topics in physics.

Topology offers an alternative to the fiction of metric distances between the semantics of words. It is a useful fiction, but a fiction none the less.

Deep sledding but well worth the time.

Comments Off

November 14, 2012

Efficient similarity search on multimedia databases [No Petraeus Images, Yet]

Filed under: GPU,Multimedia,Searching,Similarity — Patrick Durusau @ 11:42 am

Efficient similarity search on multimedia databases by Mariela Lopresti, Natalia Miranda, Fabiana Piccoli, Nora Reyes.

Abstract:

Manipulating and retrieving multimedia data has received increasing attention with the advent of cloud storage facilities. The ability of querying by similarity over large data collections is mandatory to improve storage and user interfaces. But, all of them are expensive operations to solve only in CPU; thus, it is convenient to take into account High Performance Computing (HPC) techniques in their solutions. The Graphics Processing Unit (GPU) as an alternative HPC device has been increasingly used to speedup certain computing processes. This work introduces a pure GPU architecture to build the Permutation Index and to solve approximate similarity queries on multimedia databases. The empirical results of each implementation have achieved different level of speedup which are related with characteristics of GPU and the particular database used.

No images have been published, yet, in the widening scandal around David Petraeus.

When they do, searching multimedia databases such as Flickr, Facebook, YouTube and others will be a hot issue.

Once found, there is the problem of finding unique ones again and duplicates not again.

Comments Off

October 15, 2012

item similarity by bipartite graph dispersion

Filed under: Graphs,Similarity — Patrick Durusau @ 10:39 am

From the post:

the basis of most recommendation systems is the ability to rate similarity between items. there are lots of different ways to do this.

one model is based the idea of an interest graph where the nodes of the graph are users and items and the edges of the graph represent an interest, whatever that might mean for the domain.

if we only allow edges between users and items the graph is bipartite.

As Mat says, there are lots of ways to judge similarity.

How “similar” two subject representatives need be to represent the same subject is also wide open.

For reliable interchange, however, it is best that you declare (disclose) your similarity measure and what happens when it is met.

The Topic Maps Data Model declares similarity based on matching IRIs within sets of IRIs and prescribes what follows when matches are found. If you need an example of disclosure.

Comments Off

October 12, 2012

Estimating Subject Sameness?

Filed under: Graphs,Networks,Similarity — Patrick Durusau @ 5:26 am

If you think about it, graph isomorphism is a type of subject sameness problem.

Sadly, graph isomorphism remains a research problem so not immediately applicable to problems you encounter with topic maps.

However, Alex Smola in The Weisfeiler-Lehman algorithm and estimation on graphs covers some “cheats” that you may find useful.

Imagine you have two graphs \(G\) and \(G’\) and you’d like to check how similar they are. If all vertices have unique attributes this is quite easy:

FOR ALL vertices \(v \in G \cup G’\) DO

Check that \(v \in G\) and that \(v \in G’\)

Check that the neighbors of v are the same in \(G\) and \(G’\)

This algorithm can be carried out in linear time in the size of the graph, alas many graphs do not have vertex attributes, let alone unique vertex attributes. In fact, graph isomorphism, i.e. the task of checking whether two graphs are identical, is a hard problem (it is still an open research question how hard it really is). In this case the above algorithm cannot be used since we have no idea which vertices we should match up.

The Weisfeiler-Lehman algorithm is a mechanism for assigning fairly unique attributes efficiently. Note that it isn’t guaranteed to work, as discussed in this paper by Douglas – this would solve the graph isomorphism problem after all. The idea is to assign fingerprints to vertices and their neighborhoods repeatedly. We assume that vertices have an attribute to begin with. If they don’t then simply assign all of them the attribute 1. Each iteration proceeds as follows:

Curious if you find the negative approach, “these two graphs are not isomorphic,” as useful as a positive one (where it works), “these two graphs are isomorphic?”

Or is it sufficient to reliably know that graphs are different?

Comments Off

September 11, 2012

Item based similarity with GraphChi

Filed under: GraphChi,GraphLab,Similarity — Patrick Durusau @ 10:15 am

Item based similarity with GraphChi by Danny Bickson.

From the post:

Item based collaborative filtering is one of the most popular collaborative filtering methods used in more than 70% of the companies I am talking to. Following my mega collaborator Justin Yan‘s advice, I have started to implement some item based similarity methods in GraphChi.

Item based methods compare all pairs of items together, for computing similarity metric between each pair of items. This task becomes very quickly computation intensive. For example, Netflix data has around 17K movies. If we want to compute all pairs of movies to find the most similar ones, we need to compare around 290M pairs!

If we use a symmetric similarity measure, the distance between movie A and B, is similar to the distance between movie B and A. Thus for the Netflix example we have around 145M pairs to compute. To reduce the work furthermore, we only compare movies which where watched together by at least X users, for example X=5. Otherwise, those movies are not considered similar.

When the dataset is big, it is not possible to load it fully into memory at a single machine. That is where GraphChi comes in. My preliminary implementation of the item similarity computes similarity between all pairs without fully reading the dataset into memory. The idea is to load a chunk of the items (called pivots) into memory, and then stream over the rest of the items by comparing the pivots to the rest of the items.
…

I need to check on Danny’s blog and the GraphChi/GraphLab webpages more often!

Comments Off

Measuring Similarity in Large-scale Folksonomies [Users vs. Authorities]

Filed under: Cosine Similarity,Folksonomy,Jaccard Similarity,Power Law Distributions,Similarity — Patrick Durusau @ 4:50 am

Abstract:

Social (or folksonomic) tagging has become a very popular way to describe content within Web 2.0 websites. Unlike taxonomies, which overimpose a hierarchical categorisation of content, folksonomies enable end-users to freely create and choose the categories (in this case, tags) that best describe some content. However, as tags are informally defined, continually changing, and ungoverned, social tagging has often been criticised for lowering, rather than increasing, the efficiency of searching, due to the number of synonyms, homonyms, polysemy, as well as the heterogeneity of users and the noise they introduce. To address this issue, a variety of approaches have been proposed that recommend users what tags to use, both when labelling and when looking for resources.

As we illustrate in this paper, real world folksonomies are characterized by power law distributions of tags, over which commonly used similarity metrics, including the Jaccard coefficient and the cosine similarity, fail to compute. We thus propose a novel metric, specifically developed to capture similarity in large-scale folksonomies, that is based on a mutual reinforcement principle: that is, two tags are deemed similar if they have been associated to similar resources, and vice-versa two resources are deemed similar if they have been labelled by similar tags. We offer an efficient realisation of this similarity metric, and assess its quality experimentally, by comparing it against cosine similarity, on three large-scale datasets, namely Bibsonomy, MovieLens and CiteULike.

Studying language (tags) as used tells you about users.

Studying language as proscribed by an authority, tells you about that authority.

Which one is of interest to you?

Comments Off

September 10, 2012

Graphlet-based edge clustering reveals pathogen-interacting proteins

Filed under: Clustering,Graphs,Networks,Similarity,Topology — Patrick Durusau @ 1:06 pm

Graphlet-based edge clustering reveals pathogen-interacting proteins by R. W. Solava, R. P. Michaels and T. Milenković. (Bioinformatics (2012) 28 (18): i480-i486. doi: 10.1093/bioinformatics/bts376)

Abstract:

Motivation: Prediction of protein function from protein interaction networks has received attention in the post-genomic era. A popular strategy has been to cluster the network into functionally coherent groups of proteins and assign the entire cluster with a function based on functions of its annotated members. Traditionally, network research has focused on clustering of nodes. However, clustering of edges may be preferred: nodes belong to multiple functional groups, but clustering of nodes typically cannot capture the group overlap, while clustering of edges can. Clustering of adjacent edges that share many neighbors was proposed recently, outperforming different node clustering methods. However, since some biological processes can have characteristic ‘signatures’ throughout the network, not just locally, it may be of interest to consider edges that are not necessarily adjacent.

Results: We design a sensitive measure of the ‘topological similarity’ of edges that can deal with edges that are not necessarily adjacent. We cluster edges that are similar according to our measure in different baker’s yeast protein interaction networks, outperforming existing node and edge clustering approaches. We apply our approach to the human network to predict new pathogen-interacting proteins. This is important, since these proteins represent drug target candidates.

Availability: Software executables are freely available upon request.

Contact: tmilenko@nd.edu

Of interest for bioinformatics but more broadly for its insights into topological similarity and edge clustering by topological similarity.

Being mindful that an “edge” is for all intents and purposes a “node” that records a connection between two (non-hyperedge and non-looping edge) other nodes. Nodes could, but don’t generally record their connection to other nodes, that connection being represented by an edge.

Comments Off

August 25, 2012

Introduction to Recommendations with Map-Reduce and mrjob [Ode to Similarity, Music]

Filed under: MapReduce,Music,Music Retrieval,Similarity — Patrick Durusau @ 10:56 am

Introduction to Recommendations with Map-Reduce and mrjob by Marcel Caraciolo

From the post:

In this post I will present how can we use map-reduce programming model for making recommendations. Recommender systems are quite popular among shopping sites and social network thee days. How do they do it ? Generally, the user interaction data available from items and products in shopping sites and social networks are enough information to build a recommendation engine using classic techniques such as Collaborative Filtering.

Usual recommendation post except for the emphasis on multiple tests of similarity.

Useful because simply reporting that two (or more) items are “similar” isn’t all that helpful. At least unless or until you know the basis for the comparison.

And have the expectation that a similar notion of “similarity” works for your audience.

For example, I read an article this morning about a “new” invention that will change the face of sheet music publishing, in three to five years. Invention Will Strike a Chord With Musicians

Despite the lack of terms like “markup,” “HyTime,” “SGML,” “XML,” “Music Encoding Initiative (MEI),” or “MusicXML,” all of those seemed quite “similar” to me. That may not be the “typical” experience but it is mine.

If you don’t want to wait three to five years for the sheet music revolution, you can check out MusicXML. It has been reported that more than 150 applications support MusicXML. Oh, that would be today, not three to five years from now.

You might want to pass the word along in the music industry before the next “revolution” in sheet music starts up.

Comments Off

August 19, 2012

Bi-directional semantic similarity….

Filed under: Bioinformatics,Biomedical,Semantics,Similarity — Patrick Durusau @ 6:32 pm

Bi-directional semantic similarity for gene ontology to optimize biological and clinical analyses by Sang Jay Bien, Chan Hee Park, Hae Jin Shim, Woongcheol Yang, Jihun Kim and Ju Han Kim.

Abstract:

Background Semantic similarity analysis facilitates automated semantic explanations of biological and clinical data annotated by biomedical ontologies. Gene ontology (GO) has become one of the most important biomedical ontologies with a set of controlled vocabularies, providing rich semantic annotations for genes and molecular phenotypes for diseases. Current methods for measuring GO semantic similarities are limited to considering only the ancestor terms while neglecting the descendants. One can find many GO term pairs whose ancestors are identical but whose descendants are very different and vice versa. Moreover, the lower parts of GO trees are full of terms with more specific semantics.

Methods This study proposed a method of measuring semantic similarities between GO terms using the entire GO tree structure, including both the upper (ancestral) and the lower (descendant) parts. Comprehensive comparison studies were performed with well-known information content-based and graph structure-based semantic similarity measures with protein sequence similarities, gene expression-profile correlations, protein–protein interactions, and biological pathway analyses.

Conclusion The proposed bidirectional measure of semantic similarity outperformed other graph-based and information content-based methods.

Makes me curious what the experience with direction and identification has been with other ontologies?

Comments Off

August 12, 2012

Measuring similarity and distance function

Filed under: Distance,Similarity — Patrick Durusau @ 4:42 pm

Ricky covers:

Distance between numeric data points

Distance between categorical data points

Distance between mixed categorical and numeric data points

Distance between sequence (String, TimeSeries)

Distance between nodes in a network

Distance between population distribution

Not every measure or function but enough to get you started.

Comments Off

June 25, 2012

Scholarly network similarities…

Filed under: Networks,Similarity — Patrick Durusau @ 4:40 pm

Scholarly network similarities: How bibliographic coupling networks, citation networks, cocitation networks, topical networks, coauthorship networks, and coword networks relate to each other by Erjia Yan and Ying Ding. (Journal of the American Society for Information Science and Technology Volume 63, Issue 7, pages 1313–1326, July 2012)

Abstract:

This study explores the similarity among six types of scholarly networks aggregated at the institution level, including bibliographic coupling networks, citation networks, cocitation networks, topical networks, coauthorship networks, and coword networks. Cosine distance is chosen to measure the similarities among the six networks. The authors found that topical networks and coauthorship networks have the lowest similarity; cocitation networks and citation networks have high similarity; bibliographic coupling networks and cocitation networks have high similarity; and coword networks and topical networks have high similarity. In addition, through multidimensional scaling, two dimensions can be identified among the six networks: Dimension 1 can be interpreted as citation-based versus noncitation-based, and Dimension 2 can be interpreted as social versus cognitive. The authors recommend the use of hybrid or heterogeneous networks to study research interaction and scholarly communications.

Interesting that I should come across this article after posting about data sets. See http://info.slis.indiana.edu/~eyan/papers/citation/ the post-processing data and interactive visualizations that are reproduced as static views in the article.

At page 1323 the authors say:

In addition to social networks versus information networks, another distinction of real connection-based networks versus artificial connection-based networks can be made. Coauthorship networks and citation networks are constructed based on real connections, whereas cocitation, bibliographic coupling, topical, and coword networks are constructed based on artificial connections,⁵ usually in the form of similarity measurements.

I am not sure about the “real” versus “artificial” connection that comes so easily to hand. In part because authors, in my experience, tend to use terminology similar to other scholars with who they agree. So the connection between the work of scholars isn’t “artificial,” although it isn’t accounted for in this study.

There is much to admire and even imitate in this article but the interaction between scholars is more complex than its representation by the networks here.

Comments Off

April 14, 2012

Neo4j + last.fm in Java

Filed under: Graphs,Neo4j,Similarity — Patrick Durusau @ 6:27 pm

Neo4j + last.fm in Java by Niklas Lindblad.

Video that walks through the use of Neo4j and the music service last.fm. (Full code: https://niklaslindblad.se/kod/Database.java.html)

Watching the import, I was reminded of Iron Maiden, a group I had not thought of for years.

Directional “similarity” relationships/edges. Can also store “measures” of similarity along the edges.

Question: For subject identity purposes, can we do more than boolean yes/no?

That is for some purposes, if a subject representative is => .90 similar, can we merge it with another node?

Or for other purposes we might use a different measure of similarity?

Is it possible for boolean and measure of similarity topic maps to exist in the same topic map?

Comments Off

February 25, 2012

On nonmetric similarity search problems in complex domains

Filed under: Nonmetric Similarity,Searching,Similarity,Similarity Retrieval — Patrick Durusau @ 7:39 pm

On nonmetric similarity search problems in complex domains by Tomáš Skopal and Benjamin Bustos.

Abstract:

The task of similarity search is widely used in various areas of computing, including multimedia databases, data mining, bioinformatics, social networks, etc. In fact, retrieval of semantically unstructured data entities requires a form of aggregated qualification that selects entities relevant to a query. A popular type of such a mechanism is similarity querying. For a long time, the database-oriented applications of similarity search employed the definition of similarity restricted to metric distances. Due to its topological properties, metric similarity can be effectively used to index a database which can then be queried efficiently by so-called metric access methods. However, together with the increasing complexity of data entities across various domains, in recent years there appeared many similarities that were not metrics—we call them nonmetric similarity functions. In this article we survey domains employing nonmetric functions for effective similarity search, and methods for efficient nonmetric similarity search. First, we show that the ongoing research in many of these domains requires complex representations of data entities. Simultaneously, such complex representations allow us to model also complex and computationally expensive similarity functions (often represented by various matching algorithms). However, the more complex similarity function one develops, the more likely it will be a nonmetric. Second, we review state-of-the-art techniques for efficient (fast) nonmetric similarity search, concerning both exact and approximate search. Finally, we discuss some open problems and possible future research trends.

The first paragraph of the conclusion of this survey on nonmetric similarity is an argument for topic maps (or at least the result of using a topic map):

In this article, we have surveyed the current situation concerning the employment of nonmetric similarity functions for effective and efficient similarity search in complex domains. One of the main results of the article is a surprising revelation that nonmetric similarity measuring is widely used in isolated domains, spanning many areas of interdisciplinary research. This includes multimedia databases, time series, and medical, scientific, chemical, and bioinformatic tasks, among others. (emphasis added)

True enough, survey articles such as this one may tempt a few researchers and possibly graduate students to peek over the discipline walls, however briefly. But research articles need to routinely cite the literature of other disciplines, betraying a current awareness of other fields. To take advantage of advances in other fields as well as to serve as an example for the next generation of researchers.

Comments Off

January 14, 2012

Sentence based semantic similarity measure for blog-posts

Filed under: Nutch,Semantics,Similarity — Patrick Durusau @ 7:40 pm

Sentence based semantic similarity measure for blog-posts by Mehwish Aziz and Muhammad Rafi.

Abstract:

Blogs-Online digital diary like application on web 2.0 has opened new and easy way to voice opinion, thoughts, and like-dislike of every Internet user to the World. Blogosphere has no doubt the largest user-generated content repository full of knowledge. The potential of this knowledge is still to be explored. Knowledge discovery from this new genre is quite difficult and challenging as it is totally different from other popular genre of web-applications like World Wide Web (WWW). Blog-posts unlike web documents are small in size, thus lack in context and contain relaxed grammatical structures. Hence, standard text similarity measure fails to provide good results. In this paper, specialized requirements for comparing a pair of blog-posts is thoroughly investigated. Based on this we proposed a novel algorithm for sentence oriented semantic similarity measure of a pair of blog-posts. We applied this algorithm on a subset of political blogosphere of Pakistan, to cluster the blogs on different issues of political realm and to identify the influential bloggers.

I am not sure I agree that “relaxed grammatical structures” are peculiar to blog posts. 😉

A “measure” of similarity that I have not seen (would appreciate a citation if you have) is the listing of one blog by another by another in its “blogroll.” On the theory that blogs may cite blogs they disagree with both semantically and otherwise in post but are unlikely to list blogs in their “blogroll” that they find disagreeable.

Comments (1)

December 24, 2011

Efficient Similarity Query Processing (Previously Efficient Exact Similarity Join)

Filed under: Query Language,Similarity — Patrick Durusau @ 4:42 pm

From the webpage:

Given a similarity function and two sets of objects, a similarity join returns all pairs of objects (from each set respectively) such that their similarity value satisifies a given criterion. A typical example is to find pairs of documents such that their cosine similarity is above a constant threshold (say, 0.95), as they are likely to be near duplicate documents.

In this project, we focus on efficient algorithms to perform the similarity join on (multi-) sets or strings both exactly and approximately. Commonly used similarity functions for (multi-) sets or strings are more complex than Euclidean distance functions. As a result, many previous approaches have to compute the approximate results instead.

We have developed several fast algorithms that address the above performance issue. They work for Jaccard, Dice, Cosine similarities and Edit distance constraints. We have also devised algorithms to compute top-k similar pairs progressively so that a user does not need to specify a similarity threshold for some unknown dataset. Recently, we have obtained several interesting new results on edit similarity search and joins by exploiting asymmetric signature schemes. The resulting IndexChunkTurbo algorithm can process most of the similarity queries efficiently while occupying almost the least amount of index space.

We also investigated the rpbolem of approximate entity matching. Our SIGMOD09 work can extract approximate mentions of entities in a document given an entity dictionary.

The first five (5) papers:

Xiang Zhao, Chuan Xiao, Xuemin Lin, Wei Wang. Efficient Graph Similarity Joins with Edit Distance Constraints. ICDE 2012.
Summary: An efficient algorithm to compute graphs within certain (graph) edit distance away from a query graph.

Sunanda Patro, Wei Wang. Learning Top-k Transformation Rules. DEXA 2011.
Summary: A new algorithm to learn transformation rules (e.g., VLDB = Very Large Data Base) from unannotated, potentially noisy data. Compared with previous approaches, our method can find much more valid rules in the top-k output.

Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu, Guoren Wang. Efficient Similarity Joins for Near Duplicate Detection. ACM Transaction of Database Systems (TODS).
Summary: This is the journal version of our WWW 2008 paper, with extension to implementing the PPJoin family of algorithms on top of relational database systems.

Jianbin Qin, Wei Wang, Yifei Lu, Chuan Xiao, Xuemin Lin. Efficient Exact Edit Similarity Query Processing with Asymmetric Signature Schemes. SIGMOD 2011.
Summary: Two simple yet highly efficient algorithms are proposed in this paper that works very well for both edit similarity search and joins. Perhaps equally intereting is a comprehensive experiment involding Flamingo (ver 3), PartEnum, Ed-Join, B^ed-tree, Trie-Join, NGPP, VGRAM, NS09.

Chaokun Wang, Jianmin Wang, Xuemin Lin, Wei Wang, Haixun Wang, Hongsong Li, Wanpeng Tian, Jun Xu, Rui Li. MapDupReducer: Detecting Near Duplicates over Massive Datasets. SIGMOD 2010. PPT
Summary: This work essentially ports the ppjoin+ algorithm to the Map-Reduce framework in order to deal with huge volume of data.

The first paper is saying exactly what your suspect: similarity of arbitrary sub-graphs. Nothing like picking a hard problem is there? Not fully general solution but see what you think of theirs.

Did I mention there is working code at this site as well? Definitely a group to watch in the future.

Comments Off

December 23, 2011

B^ed-tree: an all-purpose index structure for string similarity search based on edit distance

Filed under: Edit Distance,Indexing,Similarity,String Matching — Patrick Durusau @ 4:28 pm

B^ed-tree: an all-purpose index structure for string similarity search based on edit distance by Zhenjie Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, and Divesh Srivastava.

Abstract:

Strings are ubiquitous in computer systems and hence string processing has attracted extensive research effort from computer scientists in diverse areas. One of the most important problems in string processing is to efficiently evaluate the similarity between two strings based on a specified similarity measure. String similarity search is a fundamental problem in information retrieval, database cleaning, biological sequence analysis, and more. While a large number of dissimilarity measures on strings have been proposed, edit distance is the most popular choice in a wide spectrum of applications. Existing indexing techniques for similarity search queries based on edit distance, e.g., approximate selection and join queries, rely mostly on n-gram signatures coupled with inverted list structures. These techniques are tailored for specific query types only, and their performance remains unsatisfactory especially in scenarios with strict memory constraints or frequent data updates. In this paper we propose the B^ed-tree, a B⁺-tree based index structure for evaluating all types of similarity queries on edit distance and normalized edit distance. We identify the necessary properties of a mapping from the string space to the integer space for supporting searching and pruning for these queries. Three transformations are proposed that capture different aspects of information inherent in strings, enabling efficient pruning during the search process on the tree. Compared to state-of-the-art methods on string similarity search, the Bed-tree is a complete solution that meets the requirements of all applications, providing high scalability and fast response time.

The authors classify similarity queries as:

Type	Example
range	address in customer database
top-k	results of search engine
all-pairs joins
pairs of proteins or genes

There are a couple of things that trouble me about the paper.

First:

6.3 Top-K construction

In many cases, top-k queries are more practical than range queries. However, existing indexing schemes with inverted lists do not naturally support such queries. To illustrate
the performance benefits of our proposals, we implemented a simple strategy with Flamingo, by increasing the range query threshold gradually until more than k string results
are found. Notice that we use the same Bed-tree structures to support all different types of queries. Thus, we skip the performance comparisons on index construction but focus on query processing efficiency. (emphasis added)

I am not sure what is meant by inverted lists “…do not naturally support …[top-k queries].” Inverted list structures are wildly popular among WWW search engines so I would like to know more about this notion of “…naturally support….”

Moreover, indexes aren’t simply used, they are created as well. Puzzling that we are left to wonder how long it will take to have a B^ed-tree database that we can use.

Second, there are a couple of fairly serious mis-citation errors in the paper. The authors refer to “Flamingo” and “Mismatch” (from 2008) as comparisons but the articles cited: “[15] C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, pages 257–266, 2008” and “C. Xiao, W. Wang, and X. Lin. Ed-join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB, 1(1):933–944, 2008, respectively, are innocent of any such implementations.

C. Li is the faculty adviser for the Flamingo project, which has a new release since I mentioned it at: The FLAMINGO Project on Data Cleaning, but you don’t cite a project by picking a paper at random that doesn’t mention the project. (If you haven’t looked at the FLAMINGO project, its software and papers you really should.)

C. Xiao and company have a “mismatch” filter but it isn’t ever used as the name of an implementation.

Tracing the history of advances in computer science is easier or at least less time consuming if researchers don’t have to chase rabbits in the form of bad citations. Not to mention that if you aren’t cited properly, you may not get full credit for all the work you have actually done. Good citation practices are in everyone’s interest.

Comments (2)

December 17, 2011

R package DOSE released

Filed under: Enrichment,R,Similarity — Patrick Durusau @ 7:53 pm

R package DOSE released

From the post:

Disease Ontology (DO) provides an open source ontology for the integration of biomedical data that is associated with human disease. DO analysis can lead to interesting discoveries that deserve further clinical investigation.

DOSE was designed for semantic similarity measure and enrichment analysis.

Four information content (IC)-based methods, proposed by Resnik, Jiang, Lin and Schlicker, and one graph structure-based method, proposed by Wang, were implemented. The calculation details can be referred to the vignette of R package GOSemSim.

Hypergeometric test was implemented for enrichment analysis.

I like that “…semantic similarity measure and enrichment analysis.”

What semantic similarity measures are you using?

Comments Off

December 13, 2011

From datasets to algorithms in R

Filed under: R,Similarity — Patrick Durusau @ 10:43 am

From datasets to algorithms in R by John Johnson.

From the post:

Many statistical algorithms are taught and implemented in terms of linear algebra. Statistical packages often borrow heavily from optimized linear algebra libraries such as LINPACK, LAPACK, or BLAS. When implementing these algorithms in systems such as Octave or MATLAB, it is up to you to translate the data from the use case terms (factors, categories, numerical variables) into matrices.

In R, much of the heavy lifting is done for you through the formula interface. Formulas resemble y ~ x1 + x2 + …, and are defined in relation to a data.frame….

Interesting to consider if R would be useful language for exploring similarity measures? After all, in Analysis of Amphibian Biodiversity Data I pointed out work that reviewed forty-six (46) similarity measures. I suspect that is a small percentage of all similarity measures. I remember a report that said (in an astronomy context) that more than 100 algorithms/data models for data integration appear every month.

Obviously a rough guess/estimate but one that should give us pause in terms of being too wedded to one measure of similarity or another.

Suggestions of existing collections of similarity measures? Either in literature or code?

Thinking it would be instructive to throw some of the open government data sources against similarity measures.

Comments Off

November 29, 2011

Similarity as Association?

Filed under: Associations,Neo4j,Similarity — Patrick Durusau @ 8:43 pm

I was listening to Ian Robinson’s recent presentation on Dr. Who and Neo4j when Ian remarked that similarity could be modeled as a relationship.

It seemed like an off-hand remark at the time but it struck me as having immediate relevance to using Neo4j with topic maps.

One of my concerns for using Neo4j with topic maps has been the TMDM specification of merging topic items as:

1. Create a new topic item C.
2. Replace A by C wherever it appears in one of the following properties of an information item: [topics], [scope], [type], [player], and [reifier].
3. Repeat for B.
4. Set C’s [topic names] property to the union of the values of A and B’s [topic names] properties.
5. Set C’s [occurrences] property to the union of the values of A and B’s [occurrences] properties.
6. Set C’s [subject identifiers] property to the union of the values of A and B’s [subject identifiers] properties.
7. Set C’s [subject locators] property to the union of the values of A and B’s [subject locators] properties.
8. Set C’s [item identifiers] property to the union of the values of A and B’s [item identifiers] properties.
(TMDM, 6.2 Merging Topic Items)

Obviously the TMDM is specifying an end result and not how you get there but still, there has to be a mechanism by which a query that finds A or B, also results in the “union of the values of A and B’s [topic names] properties.” (And the other operations specified by the TMDM here and elsewhere.)

Ian’s reference to similarity being modeled as a relationship made me realize that similarity relationships could be created between nodes that share the same [subject identifiers} property value (and other conditions for merging). Thus, when querying a topic map, there should be the main query, followed by a query for “sameness” relationships for any returned objects.

This avoids the performance “hit” of having to update pointers to information items that are literally re-written with new identifiers. Not to mention that processing the information items that will be presented to the user as one object could even be off-loaded onto the client, with a further savings in server side processing.

There is an added bonus to this approach, particularly for complex merging conditions beyond the TMDM. Since the directed edges have properties, it would be possible to dynamically specify merging conditions beyond those of the TMDM based on those properties. Which means that “merging” operations could be “unrolled” as it were.

Or would that be “rolled down?” Thinking that a user could step through each addition of a “merging” condition and observe the values as they were added, along with their source. Perhaps even set “break points” as in debugging software.

Will have to think about this some more and work up some examples in Neo4j. Comments/suggestions most welcome!

PS: You know, if this works, Neo4j already has a query language, Cypher. I don’t know if Cypher supports declaration of routines that can be invoked as parts of queries but investigate that possibility. Just to keep users from having to write really complex queries to gather up all the information items on a subject. That won’t help people using other key/value stores but there are some interesting possibilities there as well. Will depend on the use cases and nature of “merging” requirements.

Comments Off

November 18, 2011

eTBLAST: a text-similarity based search engine

Filed under: eTBLAST,Search Engines,Similarity — Patrick Durusau @ 9:37 pm

eTBLAST: a text-similarity based search engine

eTBLAST from the Virginia Bioinformatics Institute at Virginia Tech lies at the heart of the Deja vu search service.

Unlike Deja vu, the eTBLAST interface is not limited to citations in PubMed, but includes:

MEDLINE
CRISP
NASA
Medical Cases
PMC Full Text
PMC METHODS
PMC INTRODUCTION
PMC RESULTS
PMC (paragraphs)
PMC Medical Cases
Clinical Trials
Arxiv
Wikipedia
VT Courses

There is a set of APIs, limited to some of the medical material.

While I was glad to see Arxiv included, given my research interests, CiteSeerX, or the ACM, IEEE, ALA or other information/CS would be of greater interest.

Among other things, I would like to have the ability to create a map of synonyms (can you say “topic map?”) that could be substituted during the comparison of the literature.

But, in the meantime, definitely will try the interface on Arxiv to see how it impacts my research on techniques relevant to topic maps.

Comments Off

November 10, 2011

Indexing Sound: Musical Riffs to Gunshots

Filed under: Indexing,Similarity,Sound — Patrick Durusau @ 6:37 pm

Sound, Digested: New Software Tool Provides Unprecedented Searches of Sound, from Musical Riffs to Gunshots

From the post:

Audio engineers have developed a novel artificial intelligence system for understanding and indexing sound, a unique tool for both finding and matching previously un-labeled audio files.

Having concluded beta testing with one of the world’s largest Hollywood sound studios and leading media streaming and hosting services, Imagine Research of San Francisco, Calif., is now releasing MediaMined™ (http://www.imagine-research.com/node/51) for applications ranging from music composition to healthcare.

….

One of the key innovations of the new technology is the ability to perform sound-similarity searches. Now, when a musician wants a track with a matching feel to mix into a song, or an audio engineer wants a slightly different sound effect to work into a film, the process can be as simple as uploading an example file and browsing the detected matches.

“There are many tools to analyze and index sound, but the novel, machine-learning approach of MediaMined™ was one reason we felt the technology could prove important,” says Errol Arkilic, the NSF program director who helped oversee the Imagine Research grants. “The software enables users to go beyond finding unique objects, allowing similarity searches–free of the burden of keywords–that generate previously hidden connections and potentially present entirely new applications.”

Or from the Imagine Research Applications page:

Organize Sound

Automatically index the acoustic content of video, audio, and live streams across a companies web services. Analyze web-crawled data, user-generated content, professional broadcast content, and live streaming events.

Benefits:

Millions of minutes of content are now searchable

Recommending related content increases audience and viewer consumption

Better content discovery, intelligent navigation within media files

Search audio content with ease and accuracy

Audio content-aware targeted ads – improves ad performance and revenue

Search Sound

Perform sound-similarity searches for sounds and music by using example sounds.
Search for production music that matches a given track.
Perform the rhythmic similarity searches

Benefits:

Recommending related content increases audience and viewer consumption

Music/Audio licensing portals provide a unique-selling point: find content based on an input seed track.

Improved monetization of existing content with similarity-search and recommendations

And if you have a topic map of music, producers, studios, albums, etc., this could supplement your topic map based on similarity measures every time a new recording is released, or music is uploaded to a website or posted to YouTube. So you know who to contact for whatever reason.

A topic map of videos could serve up matches and links with thumbnails for videos with similar sound content based on a submitted sample, a variation as it were on “more of same.”

A topic map of live news feeds could detect repetition of news stories and with timing information could map the copying of content from one network to the next. Or provide indexing of news accounts without the necessity of actually sitting through the broadcasts. That is an advantage not mentioned above.

Sound recognition isn’t my area so if this is old news or there are alternatives to suggest to topic mappers, please sing out! (Sorry!)

Comments Off

October 26, 2011

Computing Document Similarity using Lucene Term Vectors

Filed under: Lucene,Similarity,Vectors — Patrick Durusau @ 6:58 pm

Computing Document Similarity using Lucene Term Vectors

From the post:

Someone asked me a question recently about implementing document similarity, and since he was using Lucene, I pointed him to the Lucene Term Vector API. I hadn’t used the API myself, but I knew in general how it worked, so when he came back and told me that he could not make it work, I wanted to try it out for myself, to give myself a basis for asking further questions.

I already had a Lucene index (built by SOLR) of about 3000 medical articles for whose content field I had enabled term vectors as part of something I was trying out for highlighting, so I decided to use that. If you want to follow along and have to build your index from scratch, you can either use a field definition in your SOLR schema.xml file similar to this:

Nice walk through on document vectors.

Plus a reminder that “document” similarity can only take you so far. Once you find a relevant document, you still have to search for the subject of interest. Not to mention that you view that subject absent its relationship to other subjects, etc.

Comments Off

October 6, 2011

High Performance Computing with Python – Part 03

Filed under: Python,Similarity — Patrick Durusau @ 5:29 pm

High Performance Computing with Python – Part 03

From the post:

In this series we will analyze how to optimize the statistical Spearman Rank’s Correlation coefficient, which it is a particular measure used to compute the similarity between items in recommender systems and assesses how well the relationship between two variables can be described using a monotonic function. The source code for this metric can be found in the first post.

Great series and all the greater for covering measures of similarity between items. If something becomes similar enough, it could be considered to be the same thing.

Comments Off

October 5, 2011

C-Rank: A Link-based Similarity Measure for Scientific Literature Databases

Filed under: C-Rank,Similarity — Patrick Durusau @ 6:55 pm

C-Rank: A Link-based Similarity Measure for Scientific Literature Databases by Seok-Ho Yoon, Sang-Wook Kim, and Sunju Park.

Abstract:

As the number of people who use scientific literature databases grows, the demand for literature retrieval services has been steadily increased. One of the most popular retrieval services is to find a set of papers similar to the paper under consideration, which requires a measure that computes similarities between papers. Scientific literature databases exhibit two interesting characteristics that are different from general databases. First, the papers cited by old papers are often not included in the database due to technical and economic reasons. Second, since a paper references the papers published before it, few papers cite recently-published papers. These two characteristics cause all existing similarity measures to fail in at least one of the following cases: (1) measuring the similarity between old, but similar papers, (2) measuring the similarity between recent, but similar papers, and (3) measuring the similarity between two similar papers: one old, the other recent. In this paper, we propose a new link-based similarity measure called C-Rank, which uses both in-link and out-link by disregarding the direction of references. In addition, we discuss the most suitable normalization method for scientific literature databases and propose an evaluation method for measuring the accuracy of similarity measures. We have used a database with real-world papers from DBLP and their reference information crawled from Libra for experiments and compared the performance of C-Rank with those of existing similarity measures. Experimental results show that C-Rank achieves a higher accuracy than existing similarity measures.

Reviews other link-based similarity measures compares them to the proposed C-Rank measure both in theory as well as actual experiments.

Interesting use of domain experts to create baseline similarity measures, against which the similarity measures were compared.

I am not quite convinced by the “difference” argument for scientific literature:

First, few papers exist which are referenced by old papers. This is because very old papers are often not included in the database due to technical and economic reasons. Second, since a paper can reference only the papers published before it (and never the papers published after it), there exist few papers which reference recently-published papers.

As far as old papers not being included in the database, the authors should try philosophy articles which cite a wid range of material that is very unlikely to be a literature database. (Google Books may be changing that for “recent” literature.)

On scientific papers not citing recent papers, I suspect that simply isn’t true for scientific papers. David P. Hamilton (Science, 251:25, 1991) in Research Papers: Who’s Uncited Now?, commenting on work by David Pendlebury of the Institute for Scientific Information that demonstrated “Atomic, molecular, and chemical physics, a field in which onlv 9.2% of articles go uncited…” within five years of publication. That sounds like recent papers being cited to me.

If you are interested in citation practices for monographs, see: Citation Characteristics and Intellectual Acceptance of Scholarly Monographs by Rong Tang (Coll. res. libr. July 2008 69:356-369).

If it isn’t already on your regular reading list, College & Research Libraries should be.

I mention all that to point out that exploring the characteristics of information collections may turn up surprising facts, facts that can influence the development of algorithms for search, similarity and ultimately for use by a user.

Comments Off

September 21, 2011

Using Machine Learning to Detect Malware Similarity

Filed under: Machine Learning,Malware,Similarity — Patrick Durusau @ 7:07 pm

Using Machine Learning to Detect Malware Similarity by Sagar Chaki.

From the post:

Malware, which is short for “malicious software,” consists of programming aimed at disrupting or denying operation, gathering private information without consent, gaining unauthorized access to system resources, and other inappropriate behavior. Malware infestation is of increasing concern to government and commercial organizations. For example, according to the Global Threat Report from Cisco Security Intelligence Operations, there were 287,298 “unique malware encounters” in June 2011, double the number of incidents that occurred in March. To help mitigate the threat of malware, researchers at the SEI are investigating the origin of executable software binaries that often take the form of malware. This posting augments a previous posting describing our research on using classification (a form of machine learning) to detect “provenance similarities” in binaries, which means that they have been compiled from similar source code (e.g., differing by only minor revisions) and with similar compilers (e.g., different versions of Microsoft Visual C++ or different levels of optimization).

Interesting study in the development of ways to identify a subject that is trying to hide. Not to mention some hard core disassembly and other techniques.

Comments Off

September 19, 2011

Recommender Systems

Filed under: Recommendation,Similarity — Patrick Durusau @ 7:55 pm

Recommender Systems

This website provides support for “Recommender Systems: An Introduction” and “Recommender Systems Handbook.”

Recommender systems are an important area of research for topic maps because recommendation of necessity involves recognition (or attempted recognition) of subjects similar to an example subject. That recommendation may be captured in relationship to a particular set of user characteristics or it can be used as the basis for identifying a subject.

The site offers pointers to very strong teaching materials (as of 19 September 2011):

Slides

Slides for Recommender Systems: An Introduction (August 2011)

Tutorials

Tutorial slides (presented at IJCAI 2011)

Courses

Francesco Ricci’s Information Search and Retrieval course

If you want to contribute teaching materials, please contact dietmar.jannach (at) udo.edu.

Comments Off

August 25, 2011

SERIMI…. (Have you washed your data?)

Filed under: Linked Data,LOD,RDF,Similarity — Patrick Durusau @ 7:04 pm

SERIMI – Resource Description Similarity, RDF Instance Matching and Interlinking

From the website:

The interlinking of datasets published in the Linked Data Cloud is a challenging problem and a key factor for the success of the Semantic Web. Manual rule-based methods are the most effective solution for the problem, but they require skilled human data publishers going through a laborious, error prone and time-consuming process for manually describing rules mapping instances between two datasets. Thus, an automatic approach for solving this problem is more than welcome. We propose a novel interlinking method, SERIMI, for solving this problem automatically. SERIMI matches instances between a source and a target datasets, without prior knowledge of the data, domain or schema of these datasets. Experiments conducted with benchmark collections demonstrate that our approach considerably outperforms state-of-the-art automatic approaches for solving the interlinking problem on the Linked Data Cloud.

SERIMI-TECH-REPORT-v2.pdf

From the Results section:

The poor performance of SERIMI in the Restaurant1-Reataurant2 is mainly due to missing alignment in the reference set. The poor performance in the Person21-Person22 pair is due to the nature of the data. These datasets where built by adding spelling mistakes to the properties and literals values of their original datasets. Also only instances of class Person were retrieved into the pseudo-homonym sets during the interlinking process.

Impressive work overall but isn’t dirty data really the test? Just about any process can succeed with clean data.

Or is that really the weakness of the Semantic Web? That it requires clean data?

Comments (2)

August 7, 2011

The joy of algorithms and NoSQL: a MongoDB example

Filed under: Cheminformatics,Similarity — Patrick Durusau @ 7:05 pm

The joy of algorithms and NoSQL: a MongoDB example

From the post:

In one of my previous blog posts, I debated the superficial idea that you should own billions of data records before you are eligible to use NoSQL/Big Data technologies. In this article, I try to illustrate my point, by employing NoSQL, and more specifically MongoDB, to solve a specific Chemoinformatics problem in a truly elegant and efficient way. The complete source code can be found on the Datablend public GitHub repository.

1. Molecular similarity theory

Molecular similarity refers to the similarity of chemical compounds with respect to their structural and/or functional qualities. By calculating molecular similarities, Chemoinformatics is able to help in the design of new drugs by screening large databases for potentially interesting chemical compounds. (This by applying the hypothesis that similar compounds generally exhibit similar biological activities.) Unfortunately, finding substructures in chemical compounds is a NP-complete problem. Hence, calculating similarities for a particular target compound can take a very long time when considering millions of input compounds. Scientist solved this problem by introducing the notion of structural keys and fingerprints.

If similarity is domain specific, what are the similarity measures in your favorite domain?

Comments Off

« Newer Posts — Older Posts »

Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

November 25, 2012

November 16, 2012

November 14, 2012

October 15, 2012

October 12, 2012

September 11, 2012

September 10, 2012

August 25, 2012

August 19, 2012

August 12, 2012

June 25, 2012

April 14, 2012

February 25, 2012

January 14, 2012

December 24, 2011

December 23, 2011

December 17, 2011

December 13, 2011

November 29, 2011

November 18, 2011

November 10, 2011

Organize Sound

Search Sound

October 26, 2011

October 6, 2011

October 5, 2011

September 21, 2011

September 19, 2011

August 25, 2011

August 7, 2011