K-Nearest Neighbors: dangerously simple by Cathy O’Neil (aka mathbabe).
From the post:
I spend my time at work nowadays thinking about how to start a company in data science. Since there are tons of companies now collecting tons of data, and they don’t know what do to do with it, nor who to ask, part of me wants to design (yet another) dumbed-down “analytics platform” so that business people can import their data onto the platform, and then perform simple algorithms themselves, without even having a data scientist to supervise.
After all, a good data scientist is hard to find. Sometimes you don’t even know if you want to invest in this whole big data thing, you’re not sure the data you’re collecting is all that great or whether the whole thing is just a bunch of hype. It’s tempting to bypass professional data scientists altogether and try to replace them with software.
I’m here to say, it’s not clear that’s possible. Even the simplest algorithm, like k-Nearest Neighbor (k-NN), can be naively misused by someone who doesn’t understand it well. Let me explain.
…
Cathy’s post is a real hoot! You may not roll out of your chair, but memories of similar episodes from your own past will flash by.
She makes a compelling case that the “democratization of data science” effort is not only misguided but dangerous, at least to the users who rely on data democracy services.
Or should I say that data democracy services are taking advantage of users? 😉
My only real concern is that users may blame data science for their disasters, rather than their own incompetence with data tools. (That seems like the most likely outcome.)
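For a sense of how such a disaster can sneak up on someone, here is a minimal sketch of one well-known k-NN pitfall: features measured on very different scales. This is my own toy illustration, not code from Cathy’s post. The data (age in years, income in dollars), the labels, and the helper names `knn_predict` and `standardize` are all made up, and the labels are constructed to track age.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k training points closest to query (Euclidean)."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

def standardize(points):
    """Return a function that rescales each feature to zero mean and unit variance."""
    columns = list(zip(*points))
    means = [sum(c) / len(c) for c in columns]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c))
            for c, m in zip(columns, means)]
    return lambda p: tuple((x - m) / s for x, m, s in zip(p, means, stds))

# Hypothetical toy data: (age in years, income in dollars) -> label.
# By construction, the label depends on age, not income.
train = [
    ((25, 50000), "A"), ((27, 50100), "A"), ((30, 49900), "A"),
    ((60, 52000), "B"), ((62, 48000), "B"), ((65, 51500), "B"),
]
query = (61, 50000)  # by age, this clearly belongs with the "B" group

# Naive use: raw units, so income differences swamp age differences.
raw_prediction = knn_predict(train, query, k=3)

# Same algorithm after putting both features on a comparable scale.
scale = standardize([point for point, _ in train])
scaled_train = [(scale(point), label) for point, label in train]
scaled_prediction = knn_predict(scaled_train, scale(query), k=3)

print(raw_prediction, scaled_prediction)  # A B
```

Same data, same algorithm, opposite answers. Nothing in the output warns the user that the first answer was driven almost entirely by the choice of units, which is the kind of mistake a push-button platform would happily let through.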
Suggested counters to the “data democracy for everyone” rhetoric?
PS: Sam Hunting reminded me of this post from Cathy O’Neil.