Archive for the ‘Similarity’ Category

Introducing arxiv-sanity

Sunday, September 18th, 2016

Only a small part of Arxiv appears at: but it is enough to show the feasibility of this approach.

What captures my interest is the potential to substitute/extend the program to use other similarity measures.

Bearing in mind that searching is only the first step towards the acquisition and preservation of knowledge.

PS: I first saw this in a tweet by Data Science Renee.

Deep Learning: Image Similarity and Beyond (Webinar, May 10, 2016)

Friday, May 6th, 2016

Deep Learning: Image Similarity and Beyond (Webinar, May 10, 2016)

From the registration page:

Deep Learning is a powerful machine learning method for image tagging, object recognition, speech recognition, and text analysis. In this demo, we’ll cover the basic concept of deep learning and walk you through the steps to build an application that finds similar images using an already-trained deep learning model.

Recommended for:

  • Data scientists and engineers
  • Developers and technical team managers
  • Technical product managers

What you’ll learn:

  • How to leverage existing deep learning models
  • How to extract deep features and use them using GraphLab Create
  • How to build and deploy an image similarity service using Dato Predictive Services

What we’ll cover:

  • Using an already-trained deep learning model
  • Extracting deep features
  • Building and deploying an image similarity service for pictures 

Deep learning has difficulty justifying its choices, just like human judges of similarity, but could it play a role in assisting topic map authors in constructing explicit decisions for merging?

Once trained, could deep learning suggest properties and/or values to consider for merging it has not yet experienced?

I haven’t seen any webinars recently so I am ready to gamble on this being an interesting one.


Similar Pages for Wikipedia – Lateral – Indexing Practices

Saturday, April 23rd, 2016

Similar Pages for Wikipedia (Chrome extension)

I started looking at this software with a mis-impression that I hope you can avoid.

I installed the extension and as advertised, if I am on a Wikipedia page, it recommends “similar” Wikipedia pages.

Unless I’m billing time, plowing through page after page of tangentially related material isn’t my idea of a good time.

Ah, but I confused “document” with “page.”

I discovered that error while reading Adding Documents at Lateral, which gives the following example:


Ah! So “document” means as much or as little text as I choose to use when I add the document.

Which means if I were creating a document store of graph papers, I would capture only the new material and not the inevitable a “graph consists of nodes and edges….”

There are pre-populatd data sets, News 350,000+ news and blog articles, updated every 15 mins; arXiv 1M+ papers (all), updated daily; PubMed 6M+ medical journals from before July 2014; SEC 6,000+ yearly financial reports / 10-K filings from 2014; Wikipedia 463,000 pages which had 20+ page views in 2013.

I suspect the granularity on the pre-populated data sets is “document” in the usual sense size.

Glad to see the option to define a “document” to be an arbitrary span of text.

I don’t need to find more “documents” (in the usual sense) but more relevant snippets that are directly on point.

Hmmm, perhaps indexing at the level of paragraphs instead of documents (usual sense)?

Which makes me wonder why we index at the level of documents (usual sense) anyway? Is it simply tradition from when indexes were prepared by human indexers? And indexes were limited by physical constraints?

Webinar: Image Similarity: Deep Learning and Beyond (January 12th/Register for Recording)

Monday, January 11th, 2016

Webinar: Image Similarity: Deep Learning and Beyond by Dato.

From the webpage:

In this talk, we will extract features from the convolutional networks applied to real estate images to build a similarity graph and then do label propagation on the images to label different images in our dataset.

Recommended for:

  • Data scientists and engineers
  • Developers and technical team managers
  • Technical product managers

What you’ll learn:

  • How to extract features from a convolutional network using GraphLab Create
  • How to build similarity graphs using nearest neighbors
  • How to implement graph algorithms such as PageRank using GraphLab Create

What we’ll cover:

  • Extracting features from convolutional networks
  • Building similarity graphs using nearest neighbors
  • Clustering: kmeans and beyond
  • Graph algorithms: PageRank and label propagation

I had mixed results with webinars in 2015.

Looking forward to this one because of the coverage of similarity graphs.

From a subject identity perspective, how much similarity do you need to be the “same” subject?

If I have two books, one printed within the copyright period and another copy printed after the work came into the public domain, are they the same subject?

For some purposes yes and for other purposes not.

The strings we give web browsers, usually starting with “https://” these days, are crude measures of subject identity, don’t you think?

I say “the strings we give web browsers” as the efforts of TBL and his cronies to use popularity as a measure of success, continue their efforts to conflate URI, IRI, and URL into only URL. The simplification doesn’t bother me as much as the attempts to conceal it.

It’s one way to bolster a claim to have anyways been right, just re-write the records that anyone is likely to remember. I prefer my history with warts and all.

arXiv Sanity Preserver

Sunday, November 29th, 2015

arXiv Sanity Preserver by Andrej Karpathy.

From the webpage:

There are way too many arxiv papers, so I wrote a quick webapp that lets you search and sort through the mess in a pretty interface, similar to my pretty conference format.

It’s super hacky and was written in 4 hours. I’ll keep polishing it a bit over time perhaps but it serves its purpose for me already. The code uses Arxiv API to download the most recent papers (as many as you want – I used the last 1100 papers over last 3 months), and then downloads all papers, extracts text, creates tfidf vectors for each paper, and lastly is a flask interface for searching through and filtering similar papers using the vectors.

Main functionality is a search feature, and most useful is that you can click “sort by tfidf similarity to this”, which returns all the most similar papers to that one in terms of tfidf bigrams. I find this quite useful.


You can see this rather remarkable tool online at:

Beyond its obvious utility for researchers, this could be used as a framework for experimenting with other similarity measures.


I first saw this in a tweet by Lynn Cherny.

Infinite Dimensional Word Embeddings [Variable Representation, Death to Triples]

Thursday, November 19th, 2015

Infinite Dimensional Word Embeddings by Eric Nalisnick and Sachin Ravi.


We describe a method for learning word embeddings with stochastic dimensionality. Our Infinite Skip-Gram (iSG) model specifies an energy-based joint distribution over a word vector, a context vector, and their dimensionality, which can be defined over a countably infinite domain by employing the same techniques used to make the Infinite Restricted Boltzmann Machine (Cote & Larochelle, 2015) tractable. We find that the distribution over embedding dimensionality for a given word is highly interpretable and leads to an elegant probabilistic mechanism for word sense induction. We show qualitatively and quantitatively that the iSG produces parameter-efficient representations that are robust to language’s inherent ambiguity.

Even better from the introduction:

To better capture the semantic variability of words, we propose a novel embedding method that produces vectors with stochastic dimensionality. By employing the same mathematical tools that allow the definition of an Infinite Restricted Boltzmann Machine (Côté & Larochelle, 2015), we describe ´a log-bilinear energy-based model–called the Infinite Skip-Gram (iSG) model–that defines a joint distribution over a word vector, a context vector, and their dimensionality, which has a countably infinite domain. During training, the iSGM allows word representations to grow naturally based on how well they can predict their context. This behavior enables the vectors of specific words to use few dimensions and the vectors of vague words to elongate as needed. Manual and experimental analysis reveals this dynamic representation elegantly captures specificity, polysemy, and homonymy without explicit definition of such concepts within the model. As far as we are aware, this is the first word embedding method that allows representation dimensionality to be variable and exhibit data-dependent growth.

Imagine a topic map model that “allow[ed] representation dimensionality to be variable and exhibit data-dependent growth.

Simple subjects, say the sort you find at, can have simple representations.

More complex subjects, say the notion of “person” in U.S. statutory law (no, I won’t attempt to list them here), can extend its dimensional representation as far as is necessary.

Of course in this case, the dimensions are learned from a corpus but I don’t see any barrier to the intentional creation of dimensions for subjects and/or a combined automatic/directed creation of dimensions.

Or as I put it in the title, Death to All Triples.

More precisely, not just triples but any pre-determined limit on representation.

Looking forward to taking a slow read on this article and those it cites. Very promising.

Editors’ Choice: An Introduction to the Textreuse Package [+ A Counter Example]

Tuesday, November 10th, 2015

Editors’ Choice: An Introduction to the Textreuse Package by Lincoln Mullen.

From the post:

A number of problems in digital history/humanities require one to calculate the similarity of documents or to identify how one text borrows from another. To give one example, the Viral Texts project, by Ryan Cordell, David Smith, et al., has been very successful at identifying reprinted articles in American newspapers. Kellen Funk and I have been working on a text reuse problem in nineteenth-century legal history, where we seek to track how codes of civil procedure were borrowed and modified in jurisdictions across the United States.

As part of that project, I have recently released the textreuse package for R to CRAN. (Thanks to Noam Ross for giving this package a very thorough open peer review for rOpenSci, to whom I’ve contributed the package.) This package is a general purpose implementation of several algorithms for detecting text reuse, as well as classes and functions for investigating a corpus of texts. Put most simply, full text goes in and measures of similarity come out. (emphasis added)

Kudos to Lincoln on this important contribution to the digital humanities! Not to mention the package will also be useful for researchers who want to compare the “similarity” of texts as “subjects” for purposes of elimination of duplication (called merging in some circles) for presentation to a reader.

I highlighted

Put most simply, full text goes in and measures of similarity come out.

to offer a cautionary tale about the assumption that a high measure of similarity is an indication of the “source” of a text.

Louisiana, my home state, is the only civilian jurisdiction in the United States. Louisiana law, more at one time than now, is based upon Roman law.

Roman law and laws based upon it have a very deep and rich history that I won’t even attempt to summarize.

It is sufficient for present purposes to say the Digest of the Civil Laws now in Force in the Territory of Orleans (online version, English/French) was enacted in 1808.

A scholarly dispute arose (1971-1972) between Professor Batiza (Tulane), who considered the Digest to reflect the French civil code and Professor Pascal (LSU), who argued that despite quoting the French civil code quite liberally, that the redactors intended to codify the Spanish civil law in force at the time of the Louisiana Purchase.

The Batiza vs. Pascal debate was carried out at length and in public:

Batiza, The Louisiana Civil Code of 1808: Its Actual Sources and Present Relevance, 46 TUL. L. REV. 4 (1971); Pascal, Sources of the Digest of 1808: A Reply to Professor Batiza, 46 TUL.L.REV. 603 (1972); Sweeney, Tournament of Scholars over the Sources of the Civil Code of 1808, 46 TUL. L. REV. 585 (1972); Batiza, Sources of the Civil Code of 1808, Facts and Speculation: A Rejoinder, 46 TUL. L. REV. 628 (1972).

I could not find any freely available copies of those articles online. (Don’t encourage paywalls accessing such material. Find it at your local law library.)

There are a couple of secondary articles that discuss the dispute: A.N. Yiannopoulos, The Civil Codes of Louisiana, 1 CIV. L. COMMENT. 1, 1 (2008) at, and John W. Cairns, The de la Vergne Volume and the Digest of 1808, 24 Tulane European & Civil Law Forum 31 (2009), which are freely available online.

You won’t get the full details from the secondary articles but they do capture some of the flavor of the original dispute. I can report (happily) that over time, Pascal’s position has prevailed. Textual history is more complex than rote counting techniques can capture.

A far more complex case of “text similarity” than Lincoln addresses in the Textreuse package, but once you move beyond freshman/doctoral plagiarism, the “interesting cases” are all complicated.

Fast k-NN search

Wednesday, September 23rd, 2015

Fast k-NN search by Ville Hyvönen, Teemu Pitkänen, Sotiris Tasoulis, Liang Wang, Teemu Roos, Jukka Corander.


Random projection trees have proven to be effective for approximate nearest neighbor searches in high dimensional spaces where conventional methods are not applicable due to excessive usage of memory and computational time. We show that building multiple trees on the same data can improve the performance even further, without significantly increasing the total computational cost of queries when executed in a modern parallel computing environment. Our experiments identify suitable parameter values to achieve accurate searches with extremely fast query times, while also retaining a feasible complexity for index construction.

Not a quick read but an important one if you want to use multiple dimensions for calculation of similarity or sameness between two or more topics.

The technique requires you to choose a degree of similarity that works for your use case.

This paper makes a nice jumping off point for discussing how much precision does a particular topic map application need? Absolute precision is possible, but only in a limited number of cases and I suspect at high cost.

For some use cases, such as searching for possible suspects in crimes, some lack of precision is necessary to build up a large enough pool of suspects to include the guilty parties.

Any examples of precision and topic maps that come to mind?

Breaking the Similarity Bottleneck

Saturday, May 9th, 2015

Ultra-Fast Data-Mining Hardware Architecture Based on Stochastic Computing by Antoni Morro, et al.


Minimal hardware implementations able to cope with the processing of large amounts of data in reasonable times are highly desired in our information-driven society. In this work we review the application of stochastic computing to probabilistic-based pattern-recognition analysis of huge database sets. The proposed technique consists in the hardware implementation of a parallel architecture implementing a similarity search of data with respect to different pre-stored categories. We design pulse-based stochastic-logic blocks to obtain an efficient pattern recognition system. The proposed architecture speeds up the screening process of huge databases by a factor of 7 when compared to a conventional digital implementation using the same hardware area.

I haven’t included the hyperlinks, but:

In this work we present a highly efficient methodology for data mining based on probabilistic processing. High dimensional data is inherently complex in clustering, classification and similarity search [15]. The proposed approach is evaluated showing its application to a similarity search over a huge database. Most data mining algorithms use similarity search as a subroutine core [16–18], and thus the time taken for this task is the bottleneck of virtually all data mining algorithms [19]. Similarity search plays a fundamental role in many data mining and machine learning problems, e.g. text categorization [20], collaborative filtering [21], time-series analysis [22,23], protein sequencing [24] or any application-specific task as petroglyphs comparison [25]. At the same time, the mining of huge datasets implies the use of large computer clusters [26,27]. The proposed approach based on the use of probabilistic processing shows large improvements in terms of hardware resources when compared with conventional solutions.

Sorry they omitted topic maps but what is a merging criteria if it isn’t a type of “similarity?”

From the conclusion:

This implementation uses less hardware resources than conventional digital methodologies (based on binary and not probabilistic logic) and is able to process the order of 13GBytes of information per second (in contrast to the estimated 2GBytes/s of speed that could be achieved by the conventional implementation using the same hardware area). With the 12-dimensional space used to allocate each vector in the example shown in this paper we obtain the order of 1 billion of comparisons per second. A patent application has been done for this new mining methodology [32].

The patent was filed in Spanish but English and French auto-translations are available.

Hopefully the patent will be used in such a way as to promote widespread implementation of this technique.

I could stand 1 billion comparisons a second, quite easily. Interactive development of merging algorithms anyone?

I first saw this in a tweet by Stefano Bertolo.

Topic Extraction and Bundling of Related Scientific Articles

Wednesday, May 6th, 2015

Topic Extraction and Bundling of Related Scientific Articles by Shameem A Puthiya Parambath.


Automatic classification of scientific articles based on common characteristics is an interesting problem with many applications in digital library and information retrieval systems. Properly organized articles can be useful for automatic generation of taxonomies in scientific writings, textual summarization, efficient information retrieval etc. Generating article bundles from a large number of input articles, based on the associated features of the articles is tedious and computationally expensive task. In this report we propose an automatic two-step approach for topic extraction and bundling of related articles from a set of scientific articles in real-time. For topic extraction, we make use of Latent Dirichlet Allocation (LDA) topic modeling techniques and for bundling, we make use of hierarchical agglomerative clustering techniques.

We run experiments to validate our bundling semantics and compare it with existing models in use. We make use of an online crowdsourcing marketplace provided by Amazon called Amazon Mechanical Turk to carry out experiments. We explain our experimental setup and empirical results in detail and show that our method is advantageous over existing ones.

On “bundling” from the introduction:

Effective grouping of data requires a precise definition of closeness between a pair of data items and the notion of closeness always depend on the data and the problem context. Closeness is defined in terms of similarity of the data pairs which in turn is measured in terms of dissimilarity or distance between pair of items. In this report we use the term similarity,dissimilarity and distance to denote the measure of closeness between data items. Most of the bundling scheme start with identifying the common attributes(metadata) of the data set, here scientific articles, and create bundling semantics based on the combination of these attributes. Here we suggest a two step algorithm to bundle scientific articles. In the first step we group articles based on the latent topics in the documents and in the second step we carry out agglomerative hierarchical clustering based on the inter-textual distance and co-authorship similarity between articles. We run experiments to validate the bundling semantics and to compare it with content only based similarity. We used 19937 articles related to Computer Science from arviv [htt12a] for our experiments.

Is a “bundle” the same thing as a topic that represents “all articles on subject X?”

I have seen a number of topic map examples that use the equivalent proper noun, a proper subject, that is a singular and unique subject.

But there is no reason why I could not have a topic that represents all the articles on deep learning written in 2014, for example. Methods such as the bundling techniques described here could prove to be quite useful in such cases.

When Similarity Breaks Down

Sunday, January 4th, 2015

From the description:

Clustering data is a fundamental technique in data mining and machine learning. The basic problem can be specified as follows: “Given a set of data, partition the data into a set of groups so that each member of a given group is as similar as possible to the other members of that group and as dissimilar as possible to members of other groups”. In this talk I will try to unpack some of the complexities inherent in this seemingly straightforward description. Specifically, I will discuss some of the issues involved in measuring similarity and try to provide some intuitions into the decisions that need to be made when using such metrics to cluster data.

iPython notebook, useful because the slides don’t show up well in the video. (Be aware the github link is broken.)

Nothing new but Bart does a good job of pointing out that similarity breaks down with high dimensionality. The more dimensions that you are representing in Euclidian space, the less distance there will be along some of those dimensions.

At our evening meal today, I was telling my wife about how similarity breaks down under high dimensionality and she recalled a story about how an autistic child was able to distinguish dogs from cats.

The child had struggled to make the distinction for a number of years and then one day succeeded and was able to distinguish dogs from cats.

Think of distinguishing dogs and cats as a high dimensionality problem:

Dimension Cat Dog
4 legs Y Y
2 eyes Y Y
2 ears Y Y
fur Y Y
tail Y Y
color Y Y
mammals Y Y
pet Y Y
jumps Y Y
runs Y Y
plays Y Y
toys Y Y

There’s twelve that a child would notice. It could go a lot higher with more technical information.

The high number of dimensions won’t help an autistic child distinguish dogs from cats. Not likely to help a machine algorithm either.

When the child did learn how to distinguish dogs from cats, they were asked what had helped them tell the difference between dogs and cats?

The child responded that they looked only at the end of the nose of any animal presented as a dog or a cat to tell the difference. In machine learning lingo, they reduced all the dimensions above and many others to a single one. That one dimension was enough to enable them to reliably tell dogs from cats.

Like you, I had to go find images to see if this was possible:



Do you see it?

The shape of a cat’s nose, just the end of it, is a distinct T shape.

The shape of a dog’s nose, just the end of it, is round.

Every time. T versus round.

I can’t think of a better illustration of breaking down of similarity in high dimensions or of the power of dimensional reduction.

A topic for another day but it also makes me wonder about dynamically choosing dimensions along which to identify subjects.

ADW (Align, Disambiguate and Walk) [Semantic Similarity]

Tuesday, October 14th, 2014

ADW (Align, Disambiguate and Walk) version 1.0 by Mohammad Taher Pilehvar.

From the webpage:

This package provides a Java implementation of ADW, a state-of-the-art semantic similarity approach that enables the comparison of lexical items at different lexical levels: from senses to texts. For more details about the approach please refer to:

The abstract for the paper reads:

Semantic similarity is an essential component of many Natural Language Processing applications. However, prior methods for computing semantic similarity often operate at different levels, e.g., single words or entire documents, which requires adapting the method for each data type. We present a unified approach to semantic similarity that operates at multiple levels, all the way from comparing word senses to comparing text documents. Our method leverages a common probabilistic representation over word senses in order to compare different types of linguistic data. This unified representation shows state-of-the-art performance on three tasks: semantic textual similarity, word similarity, and word sense coarsening.

Online Demo.

The strength of this approach is the use of multiple levels of semantic similarity. It relies on WordNet but the authors promise to extend their approach to named entities and other tokens not appearing in WordNet (like your company or industry’s internal vocabulary).

The bibliography of the paper cites much of the recent work in this area so that will be an added bonus for perusing the paper.

I first saw this in a tweet by Gregory Piatetsky.

Twitter open sourced a recommendation algorithm for massive datasets

Wednesday, September 24th, 2014

Twitter open sourced a recommendation algorithm for massive datasets by Derrick Harris.

From the post:

Late last month, Twitter open sourced an algorithm that’s designed to ease the computational burden on systems trying to recommend content — contacts, articles, products, whatever — across seemingly endless sets of possibilities. Called DIMSUM, short for Dimension Independent Matrix Square using MapReduce (rolls off the tongue, no?), the algorithm trims the list of potential combinations to a reasonable number, so other recommendation algorithms can run in a reasonable amount of time.

Reza Zadeh, the former Twitter data scientist and current Stanford consulting professor who helped create the algorithm, describes it in terms of the famous handshake problem. Two people in a room? One handshake; no problem. Ten people in a room? Forty-five handshakes; still doable. However, he explained, “The number of handshakes goes up quadratically … That makes the problem very difficult when x is a million.”

Twitter claims 271 million active users.

DIMSUM works primarily in two different areas: (1) matching promoted ads with the right users, and (2) suggesting similar people to follow after users follow someone. Running through all the possible combinations would take days even on a large cluster of machines, Zadeh said, but sampling the user base using DIMSUM takes significantly less time and significantly fewer machines.

The “similarity” of two or more people or bits of content is a variation on the merging rules of the TMDM.

In recommendation language, two or more topics are “similar” if:

  • at least one equal string in their [subject identifiers] properties,
  • at least one equal string in their [item identifiers] properties,
  • at least one equal string in their [subject locators] properties,
  • an equal string in the [subject identifiers] property of the one topic item and the [item identifiers] property of the other, or
  • the same information item in their [reified] properties.

TMDM 5.3.5 Properties

The TMDM says “equal” and not “similar” but the point being that you can arbitrarily decide on how “similar” two or more topics must be in order to trigger merging.

That realization opens up the entire realm of “similarity” and “recommendation” algorithms and techniques for application to topic maps.

Which brings us back to the algorithm just open sourced by Twitter.

With DIMSUM, you don’t have to do a brute force topic by topic comparison for merging purposes. Some topics will not meet a merging “threshold” and not be considered by merging routines.

Of course, with the TMDM, merging being either true or false, you may be stuck with brute force. Suggestions?

But if you have other similarity measures, you may be able to profit from DIMSUM.

BTW, I would not follow #dimsum on Twitter because it is apparently a type of dumpling. 😉

Update: All-pairs similarity via DIMSUM DIMSUM has been implemented in Spark!

Improving sparse word similarity models…

Tuesday, August 26th, 2014

Improving sparse word similarity models with asymmetric measures by Jean Mark Gawron.


We show that asymmetric models based on Tversky (1977) improve correlations with human similarity judgments and nearest neighbor discovery for both frequent and middle-rank words. In accord with Tversky’s discovery that asymmetric similarity judgments arise when comparing sparse and rich representations, improvement on our two tasks can be traced to heavily weighting the feature bias toward the rarer word when comparing high- and mid- frequency words.

From the introduction:

A key assumption of most models of similarity is that a similarity relation is symmetric. This assumption is foundational for some conceptions, such as the idea of a similarity space, in which similarity is the inverse of distance; and it is deeply embedded into many of the algorithms that build on a similarity relation among objects, such as clustering algorithms. The symmetry assumption is not, however, universal, and it is not essential to all applications of similarity, especially when it comes to modeling human similarity judgments.

What assumptions underlie your “similarity” measures?

Not that we can get away from “assumptions” but are your assumptions based on evidence or are they unexamined assumptions?

Do you know of any general techniques for discovering assumptions in algorithms?

[O]ne Billion Tweets

Saturday, May 31st, 2014

Streaming Similarity Search over one Billion Tweets using Parallel Locality-Sensitive Hashing by Narayanan Sundaram, et al.


Finding nearest neighbors has become an important operation on databases, with applications to text search, multimedia indexing, and many other areas. One popular algorithm for similarity search, especially for high dimensional data (where spatial indexes like kd-trees do not perform well) is Locality Sensitive Hashing (LSH), an approximation algorithm for finding similar objects.

In this paper, we describe a new variant of LSH, called Parallel LSH (PLSH) designed to be extremely efficient, capable of scaling out on multiple nodes and multiple cores, and which supports high-throughput streaming of new data. Our approach employs several novel ideas, including: cache-conscious hash table layout, using a 2-level merge algorithm for hash table construction; an efficient algorithm for duplicate elimination during hash-table querying; an insert-optimized hash table structure and efficient data expiration algorithm for streaming data; and a performance model that accurately estimates performance of the algorithm and can be used to optimize parameter settings. We show that on a workload where we perform similarity search on a dataset of
> 1 Billion tweets, with hundreds of millions of new tweets per day, we can achieve query times of 1–2.5 ms. We show that this is an order of magnitude faster than existing indexing schemes, such as inverted indexes. To the best of our knowledge, this is the fastest implementation of LSH, with table construction times up to 3:7x faster and query times that are 8:3x faster than a basic implementation.

In the introduction, the authors report “…typical queries taking 1-2.5ms. In comparison to other text search schemes, such as inverted indexes, our approach is an order of magnitude faster.”

I looked but did not find any open-source code for PLSH.

Caution: If you search for other research, the string “PLSH” is unlikely to be helpful. One my first search I found:

  • PL/sh is a procedural language handler for PostgreSQL
  • Partia Liberale Shqiptare (Albanian Liberal Party, Kosovo)
  • Pet Loss Support Hotline
  • Promised Land Spanish Horses
  • Polish courses (Abbreviation at Brown University)
  • Point Loma High School

The history of the vector space model

Wednesday, May 14th, 2014

The history of the vector space model by Suresh Venkatasubramanian.

From the post:

Gerald Salton is generally credited with the invention of the vector space model: the idea that we could represent a document as a vector of keywords and use things like cosine similarity and dimensionality reduction to compare documents and represent them.

But the path to this modern interpretation was a lot twistier than one might think. David Dubin wrote an article in 2004 titled ‘The Most Influential Paper Gerard Salton Never Wrote‘. In it, he points out that most citations that refer to the vector space model refer to a paper that doesn’t actually exist (hence the title). Taking that as a starting point, he then traces the lineage of the ideas in Salton’s work.

Suresh summarizes some of the discoveries made by Dubin in his post but this sounds like an interesting research project. To take Dubin’s article as a starting point and follow the development of the vector space model.

Particularly since it is used so often for “similarity.” Understanding the mathematics is good, understanding how that particular model was arrived at would be even better.


…Locality Sensitive Hashing for Unstructured Data

Friday, May 9th, 2014

Practical Applications of Locality Sensitive Hashing for Unstructured Data by Jake Drew.

From the post:

The purpose of this article is to demonstrate how the practical Data Scientist can implement a Locality Sensitive Hashing system from start to finish in order to drastically reduce the search time typically required in high dimensional spaces when finding similar items. Locality Sensitive Hashing accomplishes this efficiency by exponentially reducing the amount of data required for storage when collecting features for comparison between similar item sets. In other words, Locality Sensitive Hashing successfully reduces a high dimensional feature space while still retaining a random permutation of relevant features which research has shown can be used between data sets to determine an accurate approximation of Jaccard similarity [2,3].

Complete with code and references no less!

How “similar” do two items need to be to count as the same item?

If two libraries own a physical copy of the same book, for some purposes they are distinct items but for annotations/reviews, you could treat them as one item.

If that sounds like a topic map-like question, your right!

What measures of similarity are you applying to what subjects?

Open and transparent altmetrics for discovery

Sunday, February 9th, 2014

Open and transparent altmetrics for discovery by Peter Kraker.

From the post:

Altmetrics are a hot topic in scientific community right now. Classic citation-based indicators such as the impact factor are amended by alternative metrics generated from online platforms. Usage statistics (downloads, readership) are often employed, but links, likes and shares on the web and in social media are considered as well. The altmetrics promise, as laid out in the excellent manifesto, is that they assess impact quicker and on a broader scale.

The main focus of altmetrics at the moment is evaluation of scientific output. Examples are the article-level metrics in PLOS journals, and the Altmetric donut. ImpactStory has a slightly different focus, as it aims to evaluate the oeuvre of an author rather than an individual paper.

This is all good and well, but in my opinion, altmetrics have a huge potential for discovery that goes beyond rankings of top papers and researchers. A potential that is largely untapped so far.

How so? To answer this question, it is helpful to shed a little light on the history of citation indices.

Peter observes that co-citation is a measure of subject similarity, without the need to use the same terminology (Science Citation Index). Peter discovered in his PhD research that co-readership is also an indicator of subject similarity.

But more research is needed on co-readership to make it into a reproducible and well understood measure.

Peter is appealing for data sets suitable for this research.

It is subject similarity at the document level but if as useful as co-citation analysis has proven to be, it will be well worth the effort.

Help out if you are able.

I first saw this in a tweet by Jason Priem.

Parallel implementation of 3D protein structure similarity searches…

Monday, February 3rd, 2014

Parallel implementation of 3D protein structure similarity searches using a GPU and the CUDA by Dariusz Mrozek, Milosz Brozek, and Bozena Malysiak-Mrozek.


Searching for similar 3D protein structures is one of the primary processes employed in the field of structural bioinformatics. However, the computational complexity of this process means that it is constantly necessary to search for new methods that can perform such a process faster and more efficiently. Finding molecular substructures that complex protein structures have in common is still a challenging task, especially when entire databases containing tens or even hundreds of thousands of protein structures must be scanned. Graphics processing units (GPUs) and general purpose graphics processing units (GPGPUs) can perform many time-consuming and computationally demanding processes much more quickly than a classical CPU can. In this paper, we describe the GPU-based implementation of the CASSERT algorithm for 3D protein structure similarity searching. This algorithm is based on the two-phase alignment of protein structures when matching fragments of the compared proteins. The GPU (GeForce GTX 560Ti: 384 cores, 2GB RAM) implementation of CASSERT (“GPU-CASSERT”) parallelizes both alignment phases and yields an average 180-fold increase in speed over its CPU-based, single-core implementation on an Intel Xeon E5620 (2.40GHz, 4 cores). In this paper, we show that massive parallelization of the 3D structure similarity search process on many-core GPU devices can reduce the execution time of the process, allowing it to be performed in real time. GPU-CASSERT is available at:

Seventeen pages of heavy sledding but an average of 180-fold increase in speed? That’s worth the effort.

Sorry, I got distracted. How difficult did you say your subject similarity/identity problem was? 😉

Using Lucene Similarity in Item-Item Recommenders

Saturday, December 14th, 2013

Using Lucene Similarity in Item-Item Recommenders by Sujit Pal.

From the post:

Last week, I implemented 4 (of 5) recommenders from the Programming Assignments of the Introduction to Recommender Systems course on Coursera, but using Apache Mahout and Scala instead of Lenskit and Java. This week, I implement an Item Item Collaborative Filtering Recommender that uses Lucene (more specifically, Lucene’s More Like This query) as the item similarity provider.

By default, Lucene stores document vectors keyed by terms, but can be configured to store term vectors by setting the field attribute TermVector.YES. In case of text documents, words (or terms) are the features which are used to compute similarity between documents. I am using the same dataset as last week, where movies (items) correspond to documents and movie tags correspond to the words. So we build a movie “document” by preprocessing the tags to form individual tokens and concatenating them into a tags field in the index.

Three scenarios are covered. The first two are similar to the scenarios covered with the item-item collaborative filtering recommender from last week, where the user is on a movie page, and we need to (a) predict the rating a user would given a specified movie and (b) find movies similar to a given movie. The third scenario is recommending movies to a given user. We describe each algorithm briefly, and how Lucene fits in.

I’m curious how easy/difficult it would be to re-purpose similarity algorithms to detect common choices in avatar characteristics, acquisitions, interaction with others, goals, etc.?

Thinking that while obvious repetitions are easy enough to avoid, gender, age, names, etc., there are other, more subtle characteristics of interaction with others that would be far harder to be aware of. Much less to mask effectively.

It would require a lot of data on interaction but I assume that isn’t all that difficult to whistle up on any of the major systems.

If you have any pointers to that sort of research, forward them along. I will be posting a collection of pointers and will credit anyone who wants to be credited.

Similarity in Elasticsearch

Wednesday, November 27th, 2013

Similarity in Elasticsearch by Konrad G. Beiske.

From the post:

A similarity model is a set of abstractions and metrics to define to what extent things are similar. That’s quite a general definition. In this article I will only consider textual similarity. In this context, the uses of similarity models can be divided into two categories: classification of documents, with a finite set of categories where the categories are known; and information retrieval where the problem can be defined as ‘find the the most relevant documents to a given query’. In this article I will look into the latter category.

Elasticsearch provides the following similarity models: default, bm25, drf and ib. I have limited the scope of this article to default and bm25. The divergence from randomness and information based similarities may feature in a future article.

Konrad goes on to talk about the default similarity model in Elasticsearch, Tf/idf and BM25 (aka Okapi BM25), a probabilistic model.

He also points the reader to: The Probabilistic Relevance Framework: BM25 and Beyond for further details on BM25.

A good post if you want to learn more about tuning similarity in Elasticsearch.

BTW, documentation on similarity module for 0.90.

While the build-in similarity models offer a lot of mileage no doubt, I am more intrigued by the potential for creating a custom similarity model.

As you know, some people think English words are just English words. Search engines tend to ignore time, social class, context of use, etc., in returning all the “instances” of an English word.

That is to say the similarity model for one domain or period could be quite different from the similarity model for another.

Domain or period specific similarity models would be difficult to construct and certainly less general.

Given the choice, of being easy, general and less accurate versus being harder, less general and more accurate, which would you choose?

Does your answer change if you are a consumer looking for the best results or a developer trying to sell “good enough” results?

Similarity maps…

Thursday, September 26th, 2013

Similarity maps – a visualization strategy for molecular fingerprints and machine-learning methods by Sereina Riniker and Gregory A Landrum.


Fingerprint similarity is a common method for comparing chemical structures. Similarity is an appealing approach because, with many fingerprint types, it provides intuitive results: a chemist looking at two molecules can understand why they have been determined to be similar. This transparency is partially lost with the fuzzier similarity methods that are often used for scaffold hopping and tends to vanish completely when molecular fingerprints are used as inputs to machine-learning (ML) models. Here we present similarity maps, a straightforward and general strategy to visualize the atomic contributions to the similarity between two molecules or the predicted probability of a ML model. We show the application of similarity maps to a set of dopamine D3 receptor ligands using atom-pair and circular fingerprints as well as two popular ML methods: random forests and na?ve Bayes. An open-source implementation of the method is provided.

If you are doing topic maps in areas where molecular fingerprints are relevant, this could be quite useful.

Despite my usual warnings that semantics are continuous versus the discrete structures treated here, this may also be useful in “fuzzier” areas where topic maps are found.

Sherlock’s Last Case

Sunday, July 14th, 2013

Sherlock’s Last Case by Joe Armstrong.

Joe states the Sherlock problem as given one X and millions of Yi’s, “Which Yi is “nearer to X?”

For some measure of “nearer,” or as we prefer, similarity.

One solution is given in Programming Erlang: Software for a Concurrent World, 2nd ed., 2013, by Joe Armstrong.

Joe describes two possibly better solutions in this lecture.

Great lecture even if he omits a fundamental weakness in TF-IDF.

From the Wikipedia entry:

Suppose we have a set of English text documents and wish to determine which document is most relevant to the query “the brown cow”. A simple way to start out is by eliminating documents that do not contain all three words “the”, “brown”, and “cow”, but this still leaves many documents. To further distinguish them, we might count the number of times each term occurs in each document and sum them all together; the number of times a term occurs in a document is called its term frequency.

However, because the term “the” is so common, this will tend to incorrectly emphasize documents which happen to use the word “the” more frequently, without giving enough weight to the more meaningful terms “brown” and “cow”. The term “the” is not a good keyword to distinguish relevant and non-relevant documents and terms, unlike the less common words “brown” and “cow”. Hence an inverse document frequency factor is incorporated which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.

For example, TF-IDF would not find a document with “the brown heifer,” for a query of “the brown cow.”

TF-IDF does not account for relationships between terms, such as synonymy or polysemy.

Juam Ramos states as much in describing the limitations of TF-IDF in: Using TF-IDF to Determine Word Relevance in Document Queries:

Despite its strength, TF-IDF has its limitations. In terms of synonyms, notice that TF-IDF does not make the jump to the relationship between words. Going back to (Berger & Lafferty, 1999), if the user wanted to find information about, say, the word ‘priest’, TF-IDF would not consider documents that might be relevant to the query but instead use the word ‘reverend’. In our own experiment, TF-IDF could not equate the word ‘drug’ with its plural ‘drugs’, categorizing each instead as separate words and slightly decreasing the word’s wd value. For large document collections, this could present an escalating problem.

Ramos cites Information Retrieval as Statistical Translation by Adam Berger and John Lafferty to support his comments on synonymy or polysemy.

The Berger and Lafferty treat synonymy and polysemy, issues that TF-IDF misses, as statistical translation issues:

Ultimately document retrieval systems must be sophisticated enough to handle polysemy and synonymyto know for instance that pontiff and pope are related terms The eld of statistical translation concerns itself with how to mine large text databases to automatically discover such semantic relations Brown et al [3, 4] showed for instance how a system can learn to associate French terms with their English translations given only a collection of bilingual FrenchEnglish sentences We shall demonstrate how in a similar fashion an IR system can from a collection of documents automatically learn which terms are related and exploit these relations to better nd and rank the documents it returns to the user

Merging powered by the results of statistical translation?

The Berger and Lafferty paper is more than a decade old so I will be running the research forward.

K-nearest neighbour search for PostgreSQL [Data Types For Distance Operator]

Wednesday, May 29th, 2013

K-nearest neighbour search for PostgreSQL by Oleg Bartunov and Teodor Sigaev.

Excellent presentation from PGCon-2010 on the KNN index type in Postgres.

And an exception to the rule about wordy slides.

Or at least wordy slides are better for non-attendees.

KNN uses the <-> distance operator.

And the slides say:

distance operator, should be provided for data type

Looking at the Postgre documentation (9.3, but same for 9.1 and 9.2), I read:

In addition to the typical B-tree search operators, btree_gist also provides index support for <> (“not equals”). This may be useful in combination with an exclusion constraint, as described below.

Also, for data types for which there is a natural distance metric, btree_gist defines a distance operator <->, and provides GiST index support for nearest-neighbor searches using this operator. Distance operators are provided for int2, int4, int8, float4, float8, timestamp with time zone, timestamp without time zone, time without time zone, date, interval, oid, and money.

What collective subjects would you like to gather up using the defined data types for the distance operator?

How would you represent properties to take advantage of the defined data types?

Are there other data types that you would define for the distance operator?

Similarity for Topic Maps?

Tuesday, May 28th, 2013

The Weisfeiler-Lehman algorithm and estimation on graphs by Alex Smola.

From the post:

Imagine you have two graphs \(G\) and \(G’\) and you’d like to check how similar they are. If all vertices have unique attributes this is quite easy:

FOR ALL vertices \(v \in G \cup G’\) DO

  • Check that \(v \in G\) and that \(v \in G’\)
  • Check that the neighbors of v are the same in \(G\) and \(G’\)

This algorithm can be carried out in linear time in the size of the graph, alas many graphs do not have vertex attributes, let alone unique vertex attributes. In fact, graph isomorphism, i.e. the task of checking whether two graphs are identical, is a hard problem (it is still an open research question how hard it really is). In this case the above algorithm cannot be used since we have no idea which vertices we should match up.

The Weisfeiler-Lehman algorithm is a mechanism for assigning fairly unique attributes efficiently. Note that it isn’t guaranteed to work, as discussed in this paper by Douglas – this would solve the graph isomorphism problem after all. The idea is to assign fingerprints to vertices and their neighborhoods repeatedly. We assume that vertices have an attribute to begin with. If they don’t then simply assign all of them the attribute 1. Each iteration proceeds as follows:


Something a bit more sophisticated than comparing canonical representations in syntax.

DELTACON: A Principled Massive-Graph Similarity Function

Saturday, April 20th, 2013

DELTACON: A Principled Massive-Graph Similarity Function by Danai Koutra, Joshua T. Vogelstein, Christos Faloutsos.


How much did a network change since yesterday? How different is the wiring between Bob’s brain (a left-handed male) and Alice’s brain (a right-handed female)? Graph similarity with known node correspondence, i.e. the detection of changes in the connectivity of graphs, arises in numerous settings. In this work, we formally state the axioms and desired properties of the graph similarity functions, and evaluate when state-of-the-art methods fail to detect crucial connectivity changes in graphs. We propose DeltaCon, a principled, intuitive, and scalable algorithm that assesses the similarity between two graphs on the same nodes (e.g. employees of a company, customers of a mobile carrier). Experiments on various synthetic and real graphs showcase the advantages of our method over existing similarity measures. Finally, we employ DeltaCon to real applications: (a) we classify people to groups of high and low creativity based on their brain connectivity graphs, and (b) do temporal anomaly detection in the who-emails-whom Enron graph.

How different is your current topic map from a prior version?

Could be an interesting marketing ploy to colorize the distinct portions of the graph.

Not to mention using “similarity” to mean the same subject for some purposes. Group subjects come to mind.

And for other types of analysis.

Learning Hash Functions Using Column Generation

Monday, March 11th, 2013

Learning Hash Functions Using Column Generation by Xi Li, Guosheng Lin, Chunhua Shen, Anton van den Hengel, Anthony Dick.


Fast nearest neighbor searching is becoming an increasingly important tool in solving many large-scale problems. Recently a number of approaches to learning data-dependent hash functions have been developed. In this work, we propose a column generation based method for learning data-dependent hash functions on the basis of proximity comparison information. Given a set of triplets that encode the pairwise proximity comparison information, our method learns hash functions that preserve the relative comparison relationships in the data as well as possible within the large-margin learning framework. The learning procedure is implemented using column generation and hence is named CGHash. At each iteration of the column generation procedure, the best hash function is selected. Unlike most other hashing methods, our method generalizes to new data points naturally; and has a training objective which is convex, thus ensuring that the global optimum can be identified. Experiments demonstrate that the proposed method learns compact binary codes and that its retrieval performance compares favorably with state-of-the-art methods when tested on a few benchmark datasets.

Interesting review of hashing techniques.

Raises the question of customized similarity (read sameness) hashing algorithms for topic maps.

I first saw this in a tweet by Stefano Bertolo.

Similarity Search and Applications

Friday, January 18th, 2013

International Conference on Similarity Search and Applications (SISAP 2013)

From the webpage:

The International Conference on Similarity Search and Applications (SISAP) is an annual forum for researchers and application developers in the area of similarity data management. It aims at the technological problems shared by numerous application domains, such as data mining, information retrieval, computer vision, pattern recognition, computational biology, geography, biometrics, machine learning, and many others that need similarity searching as a necessary supporting service.

The SISAP initiative ( aims to become a forum to exchange real-world, challenging and innovative examples of applications, new indexing techniques, common test-beds and benchmarks, source code and up-to-date literature through its web page, serving the similarity search community. Traditionally, SISAP puts emphasis on the distance-based searching, but in general the conference concerns both the effectiveness and efficiency aspects of any similarity search problem.


Paper submission: April 2013
Notification: June 2013
Final version: July 2013
Conference: October 2, 3, and 4, 2013

The specific topics include, but are not limited to:

  • Similarity queries – k-NN, range, reverse NN, top-k, etc.
  • Similarity operations – joins, ranking, classification, categorization, filtering, etc.
  • Evaluation techniques for similarity queries and operations
  • Merging/combining multiple similarity modalities
  • Cost models and analysis for similarity data processing
  • Scalability issues and high-performance similarity data management
  • Feature extraction for similarity-based data findability
  • Test collections and benchmarks
  • Performance studies, benchmarks, and comparisons
  • Similarity Search over outsourced data repositories
  • Similarity search cloud services
  • Languages for similarity databases
  • New modes of similarity for complex data understanding
  • Applications of similarity-based operations
  • Image, video, voice, and music (multimedia) retrieval systems
  • Similarity for forensics and security

You should be able to find one or more topics that interest you. 😉

How similar must two or more references to an entity be before they are identifying the same entity?

Or for that matter, is similarity an association between two or more references?

PubChem3D: conformer ensemble accuracy

Sunday, January 13th, 2013

PubChem3D: conformer ensemble accuracy by Sunghwan Kim, Evan E Bolton and Stephen H Bryant. (Journal of Cheminformatics 2013, 5:1 doi:10.1186/1758-2946-5-1)



PubChem is a free and publicly available resource containing substance descriptions and their associated biological activity information. PubChem3D is an extension to PubChem containing computationally-derived three-dimensional (3-D) structures of small molecules. All the tools and services that are a part of PubChem3D rely upon the quality of the 3-D conformer models. Construction of the conformer models currently available in PubChem3D involves a clustering stage to sample the conformational space spanned by the molecule. While this stage allows one to downsize the conformer models to more manageable size, it may result in a loss of the ability to reproduce experimentally determined “bioactive” conformations, for example, found for PDB ligands. This study examines the extent of this accuracy loss and considers its effect on the 3-D similarity analysis of molecules.


The conformer models consisting of up to 100,000 conformers per compound were generated for 47,123 small molecules whose structures were experimentally determined, and the conformers in each conformer model were clustered to reduce the size of the conformer model to a maximum of 500 conformers per molecule. The accuracy of the conformer models before and after clustering was evaluated using five different measures: root-mean-square distance (RMSD), shape-optimized shape-Tanimoto (STST-opt) and combo-Tanimoto (ComboTST-opt), and color-optimized color-Tanimoto (CTCT-opt) and combo-Tanimoto (ComboTCT-opt). On average, the effect of clustering decreased the conformer model accuracy, increasing the conformer ensemble’s RMSD to the bioactive conformer (by 0.18 +/- 0.12 A), and decreasing the STST-opt, ComboTST-opt, CTCT-opt, and ComboTCT-opt scores (by 0.04 +/- 0.03, 0.16 +/- 0.09, 0.09 +/- 0.05, and 0.15 +/- 0.09, respectively).


This study shows the RMSD accuracy performance of the PubChem3D conformer models is operating as designed. In addition, the effect of PubChem3D sampling on 3-D similarity measures shows that there is a linear degradation of average accuracy with respect to molecular size and flexibility. Generally speaking, one can likely expect the worst-case minimum accuracy of 90% or more of the PubChem3D ensembles to be 0.75, 1.09, 0.43, and 1.13, in terms of STST-opt, ComboTST-opt, CTCT-opt, and ComboTCT-opt, respectively. This expected accuracy improves linearly as the molecule becomes smaller or less flexible.

If I were to say, potential shapes of a subject, would that the importance of this work clearer?

Wikipedia has this two-liner that may also help:

A macromolecule is usually flexible and dynamic. It can change its shape in response to changes in its environment or other factors; each possible shape is called a conformation, and a transition between them is called a conformational change. A macromolecular conformational change may be induced by many factors such as a change in temperature, pH, voltage, ion concentration, phosphorylation, or the binding of a ligand.

Subjects and the manner of their identification is a very deep and rewarding field of study.

An identification method in isolation is no better or worse than any other identification method.

Only your requirements (which are also subjects) can help with the process of choosing one or more identification methods over others.

Music Network Visualization

Tuesday, December 11th, 2012

Music Network Visualization by Dimiter Toshkov.

From the post:

My music interests have always been rather, hmm…, eclectic. Somehow IDM, ambient, darkwave, triphop, acid jazz, bossa nova, qawali, Mali blues and other more or less obscure genres have managed to happily co-exist in my music collection. The sheer diversity always invited the question whether there is some structure to the collection, or each genre is an island of its own. Sounds like a job for network visualization!

Now, there are plenty of music network viz applications on the web. But they don’t show my collection, and just seem unsatisfactory for various reasons. So I decided to craft my own visualization using R and igraph.

Interesting for the visualization but also the use of similarity measures.

The test for identity of a subject, particularly collective subjects, artists “similar” to X, is as unlimited as your imagination.