Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

April 6, 2013

K-Nearest Neighbors: dangerously simple

Filed under: Data Mining,K-Nearest-Neighbors,Marketing,Topic Maps — Patrick Durusau @ 10:31 am

K-Nearest Neighbors: dangerously simple by Cathy O’Neil.

From the post:

I spend my time at work nowadays thinking about how to start a company in data science. Since there are tons of companies now collecting tons of data, and they don’t know what to do with it, nor who to ask, part of me wants to design (yet another) dumbed-down “analytics platform” so that business people can import their data onto the platform, and then perform simple algorithms themselves, without even having a data scientist to supervise.

After all, a good data scientist is hard to find. Sometimes you don’t even know if you want to invest in this whole big data thing, you’re not sure the data you’re collecting is all that great or whether the whole thing is just a bunch of hype. It’s tempting to bypass professional data scientists altogether and try to replace them with software.

I’m here to say, it’s not clear that’s possible. Even the simplest algorithm, like k-Nearest Neighbor (k-NN), can be naively misused by someone who doesn’t understand it well. Let me explain.

The devil is all in the detail of what you mean by close. And to make things trickier, as in easier to be deceptively easy, there are default choices you could make (and which you would make) which would probably be totally stupid. Namely, the raw numbers, and Euclidean distance.

Read and think about Cathy’s post.
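
To make the "raw numbers and Euclidean distance" trap concrete, here is a minimal sketch in Python. It is not from Cathy's post: the customers, their ages and incomes, and the helper names (`euclidean`, `nearest`, `rescale`) are all made up for illustration, and min-max rescaling is only one of several possible ways to put features on a comparable footing.

```python
import math

# Hypothetical customers: (age in years, annual income in dollars).
customers = {
    "A": (25, 50_000),
    "B": (60, 50_500),
    "C": (26, 52_000),
}
query = (25, 50_800)  # hypothetical new customer


def euclidean(p, q):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def nearest(points, target):
    """Name of the single nearest neighbor (k = 1) of target."""
    return min(points, key=lambda name: euclidean(points[name], target))


def rescale(points, target):
    """Min-max scale each feature using the range of the known points."""
    columns = list(zip(*points.values()))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1 for c in columns]  # avoid dividing by 0

    def scale(p):
        return tuple((v - lo) / s for v, lo, s in zip(p, lows, spans))

    return {name: scale(p) for name, p in points.items()}, scale(target)


# Raw numbers: income differences (hundreds of dollars) swamp age
# differences (tens of years), so the 60-year-old looks "closest".
print(nearest(customers, query))

# After rescaling, age counts as much as income and the answer changes.
scaled_customers, scaled_query = rescale(customers, query)
print(nearest(scaled_customers, scaled_query))
```

With the raw values the nearest neighbor of a 25-year-old is the 60-year-old B, because a $300 income gap is numerically smaller than an $800 one and the 35-year age gap barely registers. After rescaling, the nearest neighbor is A. Neither answer is "right" on its own: the rescaling is itself a decision about what "close" means, which is exactly the kind of decision that tends to go undocumented.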

All those nice, clean, clear number values and a simple math equation, muddied by meaning.

Undocumented meaning.

And undocumented relationships between the variables the number values represent.

You could document your meaning and the relationships between variables and still make dumb decisions.

The hope is you or your successor will use documented meaning and relationships to make better decisions.

For documentation you can:

  • Try to remember the meaning of “close” and the relationships for all uses of K-Nearest Neighbors where you work.
  • Write meaning and relationships down on sticky notes collected in your desk drawer.
  • Write meaning and relationships on paper or in electronic files, the latter somewhere on the server.
  • Document meaning and relationships with a topic map, so you can build on information already known, including, for example, identifiers for the VP who ordered you to use particular values (along with digitally signed copies of the email(s) in question).

Which one are you using?

PS: This link was forwarded to me by Sam Hunting.

2 Comments

  1. […] Works for same company as the basis for a recommendation to play against Rik? Remember the perils of K-Nearest Neighbors: dangerously simple. […]

    Pingback by Graphs for Gaming [Neo4j] « Another Word For It — April 6, 2013 @ 4:52 pm

  2. […] Assignment of meaning is fraught with peril, as we saw in K-Nearest Neighbors: dangerously simple. […]

    Pingback by Big Data Is Not the New Oil « Another Word For It — April 7, 2013 @ 3:05 pm
