Archive for the ‘K-Nearest-Neighbors’ Category

Fast k-NN search

Wednesday, September 23rd, 2015

Fast k-NN search by Ville Hyvönen, Teemu Pitkänen, Sotiris Tasoulis, Liang Wang, Teemu Roos, Jukka Corander.


Random projection trees have proven to be effective for approximate nearest neighbor searches in high dimensional spaces where conventional methods are not applicable due to excessive usage of memory and computational time. We show that building multiple trees on the same data can improve the performance even further, without significantly increasing the total computational cost of queries when executed in a modern parallel computing environment. Our experiments identify suitable parameter values to achieve accurate searches with extremely fast query times, while also retaining a feasible complexity for index construction.

Not a quick read but an important one if you want to use multiple dimensions for calculation of similarity or sameness between two or more topics.

The technique requires you to choose a degree of similarity that works for your use case.

This paper makes a nice jumping off point for discussing how much precision does a particular topic map application need? Absolute precision is possible, but only in a limited number of cases and I suspect at high cost.

For some use cases, such as searching for possible suspects in crimes, some lack of precision is necessary to build up a large enough pool of suspects to include the guilty parties.

Any examples of precision and topic maps that come to mind?

Buffer k-d Trees: Processing Massive Nearest Neighbor Queries on GPUs

Saturday, February 1st, 2014

Buffer k-d Trees: Processing Massive Nearest Neighbor Queries on GPUs by Fabian Gieseke, Justin Heinermann, Cosmin Oancea, Christian Igel. (Journal of Machine Learning Research W&CP 32 (1): 172-180, 2014)


We present a new approach for combining k-d trees and graphics processing units for nearest neighbor search. It is well known that a direct combination of these tools leads to a non-satisfying performance due to conditional computations and suboptimal memory accesses. To alleviate these problems, we propose a variant of the classical k-d tree data structure, called buffer k-d tree, which can be used to reorganize the search. Our experiments show that we can take advantage of both the hierarchical subdivision induced by k-d trees and the huge computational resources provided by today’s many-core devices. We demonstrate the potential of our approach in astronomy, where hundreds of million nearest neighbor queries have to be processed.

The complexity of your data may or may not be a barrier to this technique. The authors report that for feature spaces < 4, specialized solutions exist and benefits exist for feature spaces up to 27.

Pure speculation on my part but it could be the case that some level of size or complexity of data, a general solution that is equally applicable to every data set, isn’t possible.

I first saw this in a tweet by Stefano Bertolo

Easy k-NN Document Classification with Solr and Python

Thursday, October 3rd, 2013

Easy k-NN Document Classification with Solr and Python by John Berryman.

From the post:

You’ve got a problem: You have 1 buzzillion documents that must all be classified. Naturally, tagging them by hand is completely infeasible. However you are fortunate enough to have several thousand documents that have already been tagged. So why not…

Build a k-Nearest Neighbors Classifier!

The concept of a k-NN document classifier is actually quite simple. Basically, given a new document, find the k most similar documents within the tagged collection, retrieve the tags from those documents, and declare the input document to have the same tag as that which was most common among the similar documents. Now, taking a page from Taming Text (page 189 to be precise), do you know of any opensource products that are really good at similarity-based document retrieval? That’s right, Solr! Basically, given a new input document, all we have to do is scoop out the “statistically interesting” terms, submit a search composed of these terms, and count the tags that come back. And it even turns out that Solr takes care of identifying the “statistically interesting” terms. All we have to do is submit the document to the Solr MoreLikeThis handler. MoreLikeThis then scans through the document and extracts “Goldilocks” terms – those terms that are not too long, not too short, not too common, and not too rare… they’re all just right.

I don’t know how timely John’s post is for you but it is very timely for me. 😉

I was being asked yesterday about devising a rough cut over a body of texts.

Looking forward to putting this approach through its paces.

K-nearest neighbour search for PostgreSQL [Data Types For Distance Operator]

Wednesday, May 29th, 2013

K-nearest neighbour search for PostgreSQL by Oleg Bartunov and Teodor Sigaev.

Excellent presentation from PGCon-2010 on the KNN index type in Postgres.

And an exception to the rule about wordy slides.

Or at least wordy slides are better for non-attendees.

KNN uses the <-> distance operator.

And the slides say:

distance operator, should be provided for data type

Looking at the Postgre documentation (9.3, but same for 9.1 and 9.2), I read:

In addition to the typical B-tree search operators, btree_gist also provides index support for <> (“not equals”). This may be useful in combination with an exclusion constraint, as described below.

Also, for data types for which there is a natural distance metric, btree_gist defines a distance operator <->, and provides GiST index support for nearest-neighbor searches using this operator. Distance operators are provided for int2, int4, int8, float4, float8, timestamp with time zone, timestamp without time zone, time without time zone, date, interval, oid, and money.

What collective subjects would you like to gather up using the defined data types for the distance operator?

How would you represent properties to take advantage of the defined data types?

Are there other data types that you would define for the distance operator?

K-Nearest Neighbors: dangerously simple

Saturday, April 6th, 2013

K-Nearest Neighbors: dangerously simple by Cathy O’Neil.

From the post:

I spend my time at work nowadays thinking about how to start a company in data science. Since there are tons of companies now collecting tons of data, and they don’t know what do to do with it, nor who to ask, part of me wants to design (yet another) dumbed-down “analytics platform” so that business people can import their data onto the platform, and then perform simple algorithms themselves, without even having a data scientist to supervise.

After all, a good data scientist is hard to find. Sometimes you don’t even know if you want to invest in this whole big data thing, you’re not sure the data you’re collecting is all that great or whether the whole thing is just a bunch of hype. It’s tempting to bypass professional data scientists altogether and try to replace them with software.

I’m here to say, it’s not clear that’s possible. Even the simplest algorithm, like k-Nearest Neighbor (k-NN), can be naively misused by someone who doesn’t understand it well. Let me explain.

The devil is all in the detail of what you mean by close. And to make things trickier, as in easier to be deceptively easy, there are default choices you could make (and which you would make) which would probably be totally stupid. Namely, the raw numbers, and Euclidean distance.

Read and think about Cathy’s post.

All those nice, clean, clear number values and a simple math equation, muddied by meaning.

Undocumented meaning.

And undocumented relationships between the variables the number values represent.

You could document your meaning and the relationships between variables and still make dumb decisions.

The hope is you or your successor will use documented meaning and relationships to make better decisions.

For documentation you can:

  • Try to remember the meaning of “close” and the relationships for all uses of K-Nearest Neighbors where you work.
  • Write meaning and relationships down on sticky notes collected in your desk draw.
  • Write meaning and relationships on paper or in electronic files, the latter somewhere on the server.
  • Document meaning and relationships with a topic map, so you can leverage on information already known. Including identifiers for the VP who ordered you to use particular values, for example. (Along with digitally signed copies of the email(s) in question.)

Which one are you using?

PS: This link was forwarded to me by Sam Hunting.

Kaggle Digit Recognizer: A K-means attempt

Wednesday, October 24th, 2012

Kaggle Digit Recognizer: A K-means attempt by Michael Needham.

From the post:

Over the past couple of months Jen and I have been playing around with the Kaggle Digit Recognizer problem – a ‘competition’ created to introduce people to Machine Learning.

The goal in this competition is to take an image of a handwritten single digit, and determine what that digit is.

You are given an input file which contains multiple rows each containing 784 pixel values representing a 28×28 pixel image as well as a label indicating which number that image actually represents.

One of the algorithms that we tried out for this problem was a variation on the k-means clustering one whereby we took the values at each pixel location for each of the labels and came up with an average value for each pixel.

The results of machine learning are likely to be direct or indirect input into your topic maps.

Useful evaluation of that input will depend your understanding of machine learning.

K-Nearest-Neighbors and Handwritten Digit Classification

Monday, August 27th, 2012

K-Nearest-Neighbors and Handwritten Digit Classification by Jeremy Kun.

From the post:

The Recipe for Classification

One important task in machine learning is to classify data into one of a fixed number of classes. For instance, one might want to discriminate between useful email and unsolicited spam. Or one might wish to determine the species of a beetle based on its physical attributes, such as weight, color, and mandible length. These “attributes” are often called “features” in the world of machine learning, and they often correspond to dimensions when interpreted in the framework of linear algebra. As an interesting warm-up question for the reader, what would be the features for an email message? There are certainly many correct answers.

The typical way of having a program classify things goes by the name of supervised learning. Specifically, we provide a set of already-classified data as input to a training algorithm, the training algorithm produces an internal representation of the problem (a model, as statisticians like to say), and a separate classification algorithm uses that internal representation to classify new data. The training phase is usually complex and the classification algorithm simple, although that won’t be true for the method we explore in this post.

More often than not, the input data for the training algorithm are converted in some reasonable way to a numerical representation. This is not as easy as it sounds. We’ll investigate one pitfall of the conversion process in this post, but in doing this we separate the data from the application domain in a way that permits mathematical analysis. We may focus our questions on the data and not on the problem. Indeed, this is the basic recipe of applied mathematics: extract from a problem the essence of the question you wish to answer, answer the question in the pure world of mathematics, and then interpret the results.

We’ve investigated data-oriented questions on this blog before, such as, “is the data linearly separable?” In our post on the perceptron algorithm, we derived an algorithm for finding a line which separates all of the points in one class from the points in the other, assuming one exists. In this post, however, we make a different structural assumption. Namely, we assume that data points which are in the same class are also close together with respect to an appropriate metric. Since this is such a key point, it bears repetition and elevation in the typical mathematical fashion. The reader should note the following is not standard terminology, and it is simply a mathematical restatement of what we’ve already said.

Modulo my concerns about assigning non-metric data to metric spaces, this is a very good post on classification.