Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

January 4, 2015

When Similarity Breaks Down

Filed under: Clustering, Similarity — Patrick Durusau @ 7:21 pm

From the description:

Clustering data is a fundamental technique in data mining and machine learning. The basic problem can be specified as follows: “Given a set of data, partition the data into a set of groups so that each member of a given group is as similar as possible to the other members of that group and as dissimilar as possible to members of other groups”. In this talk I will try to unpack some of the complexities inherent in this seemingly straightforward description. Specifically, I will discuss some of the issues involved in measuring similarity and try to provide some intuitions into the decisions that need to be made when using such metrics to cluster data.

The IPython notebook is useful because the slides don’t show up well in the video. (Be aware the GitHub link is broken.)

Nothing new, but Bart does a good job of pointing out that similarity breaks down at high dimensionality. The more dimensions you represent in Euclidean space, the more pairwise distances concentrate: the nearest and farthest points end up at nearly the same distance, so distance stops discriminating between them.
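A quick sketch of that effect (my own illustration, not from the talk): draw random points in the unit hypercube and compare the nearest and farthest Euclidean distances from one query point. As the dimension grows, the ratio climbs toward 1, meaning “near” and “far” stop being meaningfully different.

```python
import math
import random

def nearest_farthest_ratio(n_points: int, dims: int, seed: int = 0) -> float:
    """Ratio of nearest to farthest Euclidean distance from one query
    point to a cloud of random points in the unit hypercube."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dims)] for _ in range(n_points)]
    query = points[0]
    dists = sorted(math.dist(query, p) for p in points[1:])
    return dists[0] / dists[-1]

# As dims grows, the ratio approaches 1: everything is roughly equidistant.
for d in (2, 10, 100, 1000):
    print(f"dims={d:4d}  nearest/farthest = {nearest_farthest_ratio(500, d):.3f}")
```

With 500 random points, the ratio in 2 dimensions is tiny, while in 1,000 dimensions the nearest and farthest neighbors are almost the same distance away.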

At our evening meal today, I was telling my wife about how similarity breaks down under high dimensionality and she recalled a story about how an autistic child was able to distinguish dogs from cats.

The child had struggled to make the distinction for a number of years and then one day succeeded.

Think of distinguishing dogs and cats as a high dimensionality problem:

Dimension   Cat   Dog
4 legs       Y     Y
2 eyes       Y     Y
2 ears       Y     Y
fur          Y     Y
tail         Y     Y
color        Y     Y
mammals      Y     Y
pet          Y     Y
jumps        Y     Y
runs         Y     Y
plays        Y     Y
toys         Y     Y

That’s twelve dimensions a child would notice, and the count could go a lot higher with more technical information.
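Written out as binary feature vectors (my own toy encoding of the table above), every one of those dimensions agrees, so any distance computed over them is zero and carries no signal:

```python
FEATURES = ["4 legs", "2 eyes", "2 ears", "fur", "tail", "color",
            "mammals", "pet", "jumps", "runs", "plays", "toys"]

# Both animals score Y (1) on every dimension in the table.
cat = [1] * len(FEATURES)
dog = [1] * len(FEATURES)

# Hamming distance: the number of dimensions where the vectors differ.
distance = sum(c != d for c, d in zip(cat, dog))
print(distance)  # 0 -- these twelve dimensions cannot separate cat from dog
```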

The high number of dimensions won’t help an autistic child distinguish dogs from cats. It is not likely to help a machine learning algorithm either.

When the child did learn to distinguish dogs from cats, they were asked what had helped them tell the difference.

The child responded that they looked only at the end of the nose of any animal presented as a dog or a cat to tell the difference. In machine learning lingo, they reduced all the dimensions above and many others to a single one. That one dimension was enough to enable them to reliably tell dogs from cats.

Like you, I had to go find images to see if this was possible:

[Image: close-up of a cat (Kittyply_edit1)]

[Image: Border Collie portrait (Border_Collie_liver_portrait)]

Do you see it?

The shape of a cat’s nose, just the end of it, is a distinct T shape.

The shape of a dog’s nose, just the end of it, is round.

Every time. T versus round.
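In code, the child’s trick amounts to throwing away every shared dimension and keeping the one that discriminates. A minimal sketch (the field names here are hypothetical, just to illustrate the reduction):

```python
def classify(animal: dict) -> str:
    """Ignore every other dimension; decide on the nose shape alone.
    nose_shape is "T" for cats and "round" for dogs."""
    return "dog" if animal["nose_shape"] == "round" else "cat"

# Every dimension agrees except the one the child actually used.
cat = {"legs": 4, "eyes": 2, "fur": True, "pet": True, "nose_shape": "T"}
dog = {"legs": 4, "eyes": 2, "fur": True, "pet": True, "nose_shape": "round"}

print(classify(cat), classify(dog))  # cat dog
```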

I can’t think of a better illustration of the breakdown of similarity in high dimensions, or of the power of dimensionality reduction.

A topic for another day but it also makes me wonder about dynamically choosing dimensions along which to identify subjects.
