Archive for the ‘Similarity Retrieval’ Category

Webinar: Image Similarity: Deep Learning and Beyond (January 12th/Register for Recording)

Monday, January 11th, 2016

Webinar: Image Similarity: Deep Learning and Beyond by Dato.

From the webpage:

In this talk, we will extract features from the convolutional networks applied to real estate images to build a similarity graph and then do label propagation on the images to label different images in our dataset.

Recommended for:

  • Data scientists and engineers
  • Developers and technical team managers
  • Technical product managers

What you’ll learn:

  • How to extract features from a convolutional network using GraphLab Create
  • How to build similarity graphs using nearest neighbors
  • How to implement graph algorithms such as PageRank using GraphLab Create

What we’ll cover:

  • Extracting features from convolutional networks
  • Building similarity graphs using nearest neighbors
  • Clustering: kmeans and beyond
  • Graph algorithms: PageRank and label propagation
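The pipeline the talk describes (features, then a nearest-neighbor similarity graph, then label propagation) can be sketched in plain Python. This is only an illustration under stated assumptions: it does not use GraphLab Create, the two-dimensional feature tuples stand in for real convolutional network outputs, and the names `knn_graph` and `propagate_labels` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_graph(features, k):
    """Map each item index to its k most similar neighbors as (index, weight)."""
    graph = {}
    for i, fi in enumerate(features):
        sims = [(cosine(fi, fj), j) for j, fj in enumerate(features) if j != i]
        sims.sort(reverse=True)
        graph[i] = [(j, s) for s, j in sims[:k]]
    return graph

def propagate_labels(graph, seeds, rounds=10):
    """Spread seed labels over the graph by similarity-weighted neighbor vote."""
    labels = dict(seeds)
    for _ in range(rounds):
        updated = dict(labels)
        for node, neighbors in graph.items():
            if node in seeds:
                continue  # hand-assigned labels stay fixed
            votes = {}
            for j, weight in neighbors:
                if j in labels:
                    votes[labels[j]] = votes.get(labels[j], 0.0) + weight
            if votes:
                updated[node] = max(votes, key=votes.get)
        if updated == labels:
            break  # nothing changed, so the labeling has settled
        labels = updated
    return labels

# Hypothetical features: items 0-2 point one way, items 3-5 another.
features = [(1.0, 0.1), (0.9, 0.2), (0.8, 0.3),
            (0.1, 1.0), (0.2, 0.9), (0.3, 0.8)]
graph = knn_graph(features, k=2)
labels = propagate_labels(graph, seeds={0: "kitchen", 3: "bedroom"})
```

Two hand-labeled images are enough here to label the other four, which is the appeal of the approach when labels are expensive.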

I had mixed results with webinars in 2015.

Looking forward to this one because of the coverage of similarity graphs.

From a subject identity perspective, how much similarity do you need to be the “same” subject?

If I have two books, one printed within the copyright period and another copy printed after the work came into the public domain, are they the same subject?

For some purposes yes and for other purposes not.

The strings we give web browsers, usually starting with “https://” these days, are crude measures of subject identity, don’t you think?

I say “the strings we give web browsers” because TBL and his cronies, who use popularity as a measure of success, continue their efforts to conflate URI, IRI, and URL into only URL. The simplification doesn’t bother me as much as the attempts to conceal it.

It’s one way to bolster a claim to have always been right: just re-write the records that anyone is likely to remember. I prefer my history with warts and all.

arXiv Sanity Preserver

Sunday, November 29th, 2015

arXiv Sanity Preserver by Andrej Karpathy.

From the webpage:

There are way too many arxiv papers, so I wrote a quick webapp that lets you search and sort through the mess in a pretty interface, similar to my pretty conference format.

It’s super hacky and was written in 4 hours. I’ll keep polishing it a bit over time perhaps but it serves its purpose for me already. The code uses Arxiv API to download the most recent papers (as many as you want – I used the last 1100 papers over last 3 months), and then downloads all papers, extracts text, creates tfidf vectors for each paper, and lastly is a flask interface for searching through and filtering similar papers using the vectors.

Main functionality is a search feature, and most useful is that you can click “sort by tfidf similarity to this”, which returns all the most similar papers to that one in terms of tfidf bigrams. I find this quite useful.
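For readers who want to experiment, "sort by tfidf similarity" over bigrams can be sketched with the standard library alone. This is not Karpathy's code: the whitespace tokenizer, the smoothed idf formula, and the function names are assumptions made for illustration.

```python
import math
from collections import Counter

def bigrams(text):
    """Split text on whitespace and return its word bigrams."""
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]

def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (a dict) per document, over word bigrams."""
    counts = [Counter(bigrams(d)) for d in docs]
    df = Counter(term for c in counts for term in c)  # document frequency
    n = len(docs)
    return [
        {t: tf * (math.log((1 + n) / (1 + df[t])) + 1) for t, tf in c.items()}
        for c in counts
    ]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def most_similar(index, vectors):
    """Rank all other documents by cosine similarity to vectors[index]."""
    scores = [(cosine(vectors[index], v), j)
              for j, v in enumerate(vectors) if j != index]
    return [j for _, j in sorted(scores, reverse=True)]

# Hypothetical paper titles standing in for extracted full text.
papers = [
    "deep convolutional networks for image classification",
    "convolutional networks for image segmentation",
    "locality sensitive hashing for nearest neighbor search",
]
vectors = tfidf_vectors(papers)
ranking = most_similar(0, vectors)
```

Swapping `cosine` for another measure is a one-line change.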


You can see this rather remarkable tool online at:

Beyond its obvious utility for researchers, this could be used as a framework for experimenting with other similarity measures.


I first saw this in a tweet by Lynn Cherny.

Editors’ Choice: An Introduction to the Textreuse Package [+ A Counter Example]

Tuesday, November 10th, 2015

Editors’ Choice: An Introduction to the Textreuse Package by Lincoln Mullen.

From the post:

A number of problems in digital history/humanities require one to calculate the similarity of documents or to identify how one text borrows from another. To give one example, the Viral Texts project, by Ryan Cordell, David Smith, et al., has been very successful at identifying reprinted articles in American newspapers. Kellen Funk and I have been working on a text reuse problem in nineteenth-century legal history, where we seek to track how codes of civil procedure were borrowed and modified in jurisdictions across the United States.

As part of that project, I have recently released the textreuse package for R to CRAN. (Thanks to Noam Ross for giving this package a very thorough open peer review for rOpenSci, to whom I’ve contributed the package.) This package is a general purpose implementation of several algorithms for detecting text reuse, as well as classes and functions for investigating a corpus of texts. Put most simply, full text goes in and measures of similarity come out. (emphasis added)

Kudos to Lincoln on this important contribution to the digital humanities! Not to mention the package will also be useful for researchers who want to compare the “similarity” of texts as “subjects” for purposes of elimination of duplication (called merging in some circles) for presentation to a reader.

I highlighted

Put most simply, full text goes in and measures of similarity come out.

to offer a cautionary tale about the assumption that a high measure of similarity is an indication of the “source” of a text.
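To make "full text goes in and measures of similarity come out" concrete, here is a minimal sketch of one common text reuse measure, Jaccard similarity over word n-grams (shingles). It is written in Python rather than R and does not reproduce the textreuse API; the function names and sample sentences are invented for illustration.

```python
def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Shared shingles divided by all distinct shingles in either text."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Invented sentences, loosely in the style of a procedure code.
original = "the court may grant relief on such terms as may be just"
borrowed = "the court may grant relief upon such terms as may be proper"
unrelated = "similarity search over one billion tweets with hashing"

score_borrowed = jaccard(shingles(original), shingles(borrowed))
score_unrelated = jaccard(shingles(original), shingles(unrelated))
```

The score reports how much wording two texts share. It says nothing about which text came first, or whether both draw on a third.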

Louisiana, my home state, is the only civilian jurisdiction in the United States. Louisiana law, more at one time than now, is based upon Roman law.

Roman law and laws based upon it have a very deep and rich history that I won’t even attempt to summarize.

It is sufficient for present purposes to say the Digest of the Civil Laws now in Force in the Territory of Orleans (online version, English/French) was enacted in 1808.

A scholarly dispute arose (1971-1972) between Professor Batiza (Tulane), who considered the Digest to reflect the French civil code, and Professor Pascal (LSU), who argued that, despite quoting the French civil code quite liberally, the redactors intended to codify the Spanish civil law in force at the time of the Louisiana Purchase.

The Batiza vs. Pascal debate was carried out at length and in public:

Batiza, The Louisiana Civil Code of 1808: Its Actual Sources and Present Relevance, 46 TUL. L. REV. 4 (1971); Pascal, Sources of the Digest of 1808: A Reply to Professor Batiza, 46 TUL.L.REV. 603 (1972); Sweeney, Tournament of Scholars over the Sources of the Civil Code of 1808, 46 TUL. L. REV. 585 (1972); Batiza, Sources of the Civil Code of 1808, Facts and Speculation: A Rejoinder, 46 TUL. L. REV. 628 (1972).

I could not find any freely available copies of those articles online. (Don’t encourage paywalls by accessing such material through them. Find it at your local law library.)

There are a couple of secondary articles that discuss the dispute: A.N. Yiannopoulos, The Civil Codes of Louisiana, 1 CIV. L. COMMENT. 1, 1 (2008), and John W. Cairns, The de la Vergne Volume and the Digest of 1808, 24 Tulane European & Civil Law Forum 31 (2009), which are freely available online.

You won’t get the full details from the secondary articles but they do capture some of the flavor of the original dispute. I can report (happily) that over time, Pascal’s position has prevailed. Textual history is more complex than rote counting techniques can capture.

A far more complex case of “text similarity” than Lincoln addresses in the Textreuse package, but once you move beyond freshman/doctoral plagiarism, the “interesting cases” are all complicated.

[O]ne Billion Tweets

Saturday, May 31st, 2014

Streaming Similarity Search over one Billion Tweets using Parallel Locality-Sensitive Hashing by Narayanan Sundaram, et al.


Finding nearest neighbors has become an important operation on databases, with applications to text search, multimedia indexing, and many other areas. One popular algorithm for similarity search, especially for high dimensional data (where spatial indexes like kd-trees do not perform well) is Locality Sensitive Hashing (LSH), an approximation algorithm for finding similar objects.

In this paper, we describe a new variant of LSH, called Parallel LSH (PLSH) designed to be extremely efficient, capable of scaling out on multiple nodes and multiple cores, and which supports high-throughput streaming of new data. Our approach employs several novel ideas, including: cache-conscious hash table layout, using a 2-level merge algorithm for hash table construction; an efficient algorithm for duplicate elimination during hash-table querying; an insert-optimized hash table structure and efficient data expiration algorithm for streaming data; and a performance model that accurately estimates performance of the algorithm and can be used to optimize parameter settings. We show that on a workload where we perform similarity search on a dataset of > 1 Billion tweets, with hundreds of millions of new tweets per day, we can achieve query times of 1–2.5 ms. We show that this is an order of magnitude faster than existing indexing schemes, such as inverted indexes. To the best of our knowledge, this is the fastest implementation of LSH, with table construction times up to 3.7x faster and query times that are 8.3x faster than a basic implementation.

In the introduction, the authors report “…typical queries taking 1-2.5ms. In comparison to other text search schemes, such as inverted indexes, our approach is an order of magnitude faster.”
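For orientation, the basic idea behind LSH can be sketched in a few lines. What follows is plain random-hyperplane LSH for cosine similarity, not PLSH: it has none of the paper's cache-conscious layout, parallelism, or streaming support, and the class name, parameters, and defaults are assumptions chosen for illustration.

```python
import random
from collections import defaultdict

class HyperplaneLSH:
    """Toy LSH index: vectors on the same side of every random hyperplane
    in a table share a bucket, so similar vectors tend to collide."""

    def __init__(self, dim, bits=8, tables=4, seed=0):
        rng = random.Random(seed)
        # One independent set of `bits` random hyperplanes per hash table.
        self.planes = [
            [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]
            for _ in range(tables)
        ]
        self.tables = [defaultdict(list) for _ in range(tables)]

    def _key(self, planes, vec):
        # One bit per hyperplane: which side of the plane the vector falls on.
        return tuple(
            sum(p * x for p, x in zip(plane, vec)) >= 0 for plane in planes
        )

    def insert(self, item_id, vec):
        for planes, table in zip(self.planes, self.tables):
            table[self._key(planes, vec)].append(item_id)

    def candidates(self, vec):
        """Union of matching buckets across tables; the set removes duplicates."""
        found = set()
        for planes, table in zip(self.planes, self.tables):
            found.update(table.get(self._key(planes, vec), []))
        return found

index = HyperplaneLSH(dim=3)
index.insert("a", [1.0, 2.0, 3.0])
same_direction = index.candidates([2.0, 4.0, 6.0])
opposite_direction = index.candidates([-1.0, -2.0, -3.0])
```

Candidates still need an exact similarity check afterwards; the hash tables only narrow down what has to be compared.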

I looked but did not find any open-source code for PLSH.

Caution: If you search for other research, the string “PLSH” is unlikely to be helpful. On my first search I found:

  • PL/sh is a procedural language handler for PostgreSQL
  • Partia Liberale Shqiptare (Albanian Liberal Party, Kosovo)
  • Pet Loss Support Hotline
  • Promised Land Spanish Horses
  • Polish courses (Abbreviation at Brown University)
  • Point Loma High School

Parallel implementation of 3D protein structure similarity searches…

Monday, February 3rd, 2014

Parallel implementation of 3D protein structure similarity searches using a GPU and the CUDA by Dariusz Mrozek, Milosz Brozek, and Bozena Malysiak-Mrozek.


Searching for similar 3D protein structures is one of the primary processes employed in the field of structural bioinformatics. However, the computational complexity of this process means that it is constantly necessary to search for new methods that can perform such a process faster and more efficiently. Finding molecular substructures that complex protein structures have in common is still a challenging task, especially when entire databases containing tens or even hundreds of thousands of protein structures must be scanned. Graphics processing units (GPUs) and general purpose graphics processing units (GPGPUs) can perform many time-consuming and computationally demanding processes much more quickly than a classical CPU can. In this paper, we describe the GPU-based implementation of the CASSERT algorithm for 3D protein structure similarity searching. This algorithm is based on the two-phase alignment of protein structures when matching fragments of the compared proteins. The GPU (GeForce GTX 560Ti: 384 cores, 2GB RAM) implementation of CASSERT (“GPU-CASSERT”) parallelizes both alignment phases and yields an average 180-fold increase in speed over its CPU-based, single-core implementation on an Intel Xeon E5620 (2.40GHz, 4 cores). In this paper, we show that massive parallelization of the 3D structure similarity search process on many-core GPU devices can reduce the execution time of the process, allowing it to be performed in real time. GPU-CASSERT is available at:

Seventeen pages of heavy sledding but an average of 180-fold increase in speed? That’s worth the effort.

Sorry, I got distracted. How difficult did you say your subject similarity/identity problem was? 😉

Similarity in Elasticsearch

Wednesday, November 27th, 2013

Similarity in Elasticsearch by Konrad G. Beiske.

From the post:

A similarity model is a set of abstractions and metrics to define to what extent things are similar. That’s quite a general definition. In this article I will only consider textual similarity. In this context, the uses of similarity models can be divided into two categories: classification of documents, with a finite set of categories where the categories are known; and information retrieval where the problem can be defined as ‘find the most relevant documents to a given query’. In this article I will look into the latter category.

Elasticsearch provides the following similarity models: default, bm25, dfr and ib. I have limited the scope of this article to default and bm25. The divergence from randomness and information based similarities may feature in a future article.

Konrad goes on to talk about the default similarity model in Elasticsearch, Tf/idf and BM25 (aka Okapi BM25), a probabilistic model.

He also points the reader to: The Probabilistic Relevance Framework: BM25 and Beyond for further details on BM25.
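As a rough guide to what BM25 computes, here is a minimal sketch of the commonly cited Okapi BM25 scoring function. It is not Elasticsearch's implementation: the whitespace tokenizer, the non-negative idf variant, and the parameter values k1=1.2 and b=0.75 are assumptions chosen because they are widely used.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n
    # Document frequency: in how many documents each term appears.
    df = Counter(t for d in tokenized for t in set(d))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms count for more than common ones.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates (k1) and is length-normalized (b).
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical documents for illustration.
docs = [
    "similarity search in elasticsearch",
    "tuning similarity similarity models",
    "cooking with cast iron",
]
scores = bm25_scores("similarity models", docs)
```

Unlike raw TF/IDF, repeating a term keeps adding to the score but with diminishing returns, which is what k1 controls.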

A good post if you want to learn more about tuning similarity in Elasticsearch.

BTW, documentation on similarity module for 0.90.

While the built-in similarity models offer a lot of mileage no doubt, I am more intrigued by the potential for creating a custom similarity model.

As you know, some people think English words are just English words. Search engines tend to ignore time, social class, context of use, etc., in returning all the “instances” of an English word.

That is to say the similarity model for one domain or period could be quite different from the similarity model for another.

Domain or period specific similarity models would be difficult to construct and certainly less general.

Given the choice of being easy, general and less accurate versus being harder, less general and more accurate, which would you choose?

Does your answer change if you are a consumer looking for the best results or a developer trying to sell “good enough” results?

Similarity Search and Applications

Friday, January 18th, 2013

International Conference on Similarity Search and Applications (SISAP 2013)

From the webpage:

The International Conference on Similarity Search and Applications (SISAP) is an annual forum for researchers and application developers in the area of similarity data management. It aims at the technological problems shared by numerous application domains, such as data mining, information retrieval, computer vision, pattern recognition, computational biology, geography, biometrics, machine learning, and many others that need similarity searching as a necessary supporting service.

The SISAP initiative aims to become a forum to exchange real-world, challenging and innovative examples of applications, new indexing techniques, common test-beds and benchmarks, source code and up-to-date literature through its web page, serving the similarity search community. Traditionally, SISAP puts emphasis on the distance-based searching, but in general the conference concerns both the effectiveness and efficiency aspects of any similarity search problem.


Paper submission: April 2013
Notification: June 2013
Final version: July 2013
Conference: October 2, 3, and 4, 2013

The specific topics include, but are not limited to:

  • Similarity queries – k-NN, range, reverse NN, top-k, etc.
  • Similarity operations – joins, ranking, classification, categorization, filtering, etc.
  • Evaluation techniques for similarity queries and operations
  • Merging/combining multiple similarity modalities
  • Cost models and analysis for similarity data processing
  • Scalability issues and high-performance similarity data management
  • Feature extraction for similarity-based data findability
  • Test collections and benchmarks
  • Performance studies, benchmarks, and comparisons
  • Similarity Search over outsourced data repositories
  • Similarity search cloud services
  • Languages for similarity databases
  • New modes of similarity for complex data understanding
  • Applications of similarity-based operations
  • Image, video, voice, and music (multimedia) retrieval systems
  • Similarity for forensics and security

You should be able to find one or more topics that interest you. 😉

How similar must two or more references to an entity be before they are identifying the same entity?

Or for that matter, is similarity an association between two or more references?

On nonmetric similarity search problems in complex domains

Saturday, February 25th, 2012

On nonmetric similarity search problems in complex domains by Tomáš Skopal and Benjamin Bustos.


The task of similarity search is widely used in various areas of computing, including multimedia databases, data mining, bioinformatics, social networks, etc. In fact, retrieval of semantically unstructured data entities requires a form of aggregated qualification that selects entities relevant to a query. A popular type of such a mechanism is similarity querying. For a long time, the database-oriented applications of similarity search employed the definition of similarity restricted to metric distances. Due to its topological properties, metric similarity can be effectively used to index a database which can then be queried efficiently by so-called metric access methods. However, together with the increasing complexity of data entities across various domains, in recent years there appeared many similarities that were not metrics—we call them nonmetric similarity functions. In this article we survey domains employing nonmetric functions for effective similarity search, and methods for efficient nonmetric similarity search. First, we show that the ongoing research in many of these domains requires complex representations of data entities. Simultaneously, such complex representations allow us to model also complex and computationally expensive similarity functions (often represented by various matching algorithms). However, the more complex similarity function one develops, the more likely it will be a nonmetric. Second, we review state-of-the-art techniques for efficient (fast) nonmetric similarity search, concerning both exact and approximate search. Finally, we discuss some open problems and possible future research trends.
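A small example may help with what "nonmetric" means in practice. A metric must satisfy the triangle inequality, and metric access methods rely on it to prune the search. Squared Euclidean distance is a familiar dissimilarity that breaks it. The sketch below is only an illustration with hypothetical function names, not something taken from the survey.

```python
import math

def euclidean(u, v):
    """Ordinary Euclidean distance, which is a metric."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def squared_euclidean(u, v):
    """Squared Euclidean distance: cheap to compute, but not a metric."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def violates_triangle(dist, x, y, z):
    """True if going from x to z directly 'costs' more than going via y."""
    return dist(x, z) > dist(x, y) + dist(y, z)

# Three points on a line: 0, 1 and 2.
x, y, z = (0.0,), (1.0,), (2.0,)

# Squared distances: direct 4, via y 1 + 1 = 2, so the inequality fails.
nonmetric = violates_triangle(squared_euclidean, x, y, z)

# Plain distances: direct 2, via y 1 + 1 = 2, so the inequality holds.
metric_ok = not violates_triangle(euclidean, x, y, z)
```

An index that prunes candidates using the triangle inequality can silently discard correct answers when the function behaves like `squared_euclidean`.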

The first paragraph of the conclusion of this survey on nonmetric similarity is an argument for topic maps (or at least the result of using a topic map):

In this article, we have surveyed the current situation concerning the employment of nonmetric similarity functions for effective and efficient similarity search in complex domains. One of the main results of the article is a surprising revelation that nonmetric similarity measuring is widely used in isolated domains, spanning many areas of interdisciplinary research. This includes multimedia databases, time series, and medical, scientific, chemical, and bioinformatic tasks, among others. (emphasis added)

True enough, survey articles such as this one may tempt a few researchers and possibly graduate students to peek over the discipline walls, however briefly. But research articles need to routinely cite the literature of other disciplines, betraying a current awareness of other fields, both to take advantage of advances in those fields and to serve as an example for the next generation of researchers.