Scientific consent, data, and doubling down on the internet by Oliver Keyes.
From the post:
There is an excellent Tim Minchin song called If You Open Your Mind Too Much, Your Brain Will Fall Out. I’m sad to report that the same is also true of your data and your science.
At this point in the story I’d like to introduce you to Emil Kirkegaard, a self-described “polymath” at the University of Aarhus who has neatly managed to tie every single way to be irresponsible and unethical in academic publishing into a single research project. This is going to be a bit long, so here’s a TL;DR: linguistics grad student with no identifiable background in sociology or social computing doxes 70,000 people so he can switch from publishing pseudoscientific racism to publishing pseudoscientific homophobia in the vanity journal that he runs.
Yeah, it’s just as bad as it sounds.
The Data
Yesterday morning I woke up to a Twitter friend pointing me to a release of OKCupid data, by Kirkegaard. Having now spent some time exploring the data, and reading both public statements on the work and the associated paper: this is without a doubt one of the most grossly unprofessional, unethical and reprehensible data releases I have ever seen.
There are two reasons for that. The first is very simple; Kirkegaard never asked anyone. He didn’t ask OKCupid, he didn’t ask the users covered by the dataset – he simply said ‘this is public so people should expect it’s going to be released’.
This is bunkum. A fundamental underpinning of ethical and principled research – which is not just an ideal but a requirement in many nations and in many fields – is informed consent. The people you are studying or using as a source should know that you are doing so and why you are doing so.
And the crucial element there is “informed”. They need to know precisely what is going on. It’s not enough to simply say ‘hey, I handed them a release buried in a pile of other paperwork and they signed it’: they need to be explicitly and clearly informed.
Studying OKCupid data doesn’t allow me to go through that process. Sure: the users “put it on the internet” where everything tends to end up public (even when it shouldn’t). Sure: the users did so on a site where the terms of service explicitly note they can’t protect your information from browsing. But the fact of the matter is that I work in this field and I don’t read the ToS, and most people have a deeply naive view of how ‘safe’ online data is and how easy it is to backtrace seemingly-meaningless information to a real life identity.
In fact, gathering of the data began in 2014, meaning that a body of the population covered had no doubt withdrawn their information from the site – and thus had a pretty legitimate reason to believe that information was gone – when Kirkegaard published. Not only is there not informed consent, there’s good reason to believe there’s an implicit refusal of consent.
The actual data gathered is extensive. It covers gender identity, sexuality, race, geographic location; it covers BDSM interests, it covers drug usage and similar criminal activity, it covers religious beliefs and their intensity, social and political views. And it does this for seventy thousand different people. Hell, the only reason it doesn’t include profile photos, according to the paper, is that it’d take up too much hard-drive space.
Which nicely segues into the second reason this is a horrifying data dump: it is not anonymised in any way. There’s no aggregation, there’s no replacement-of-usernames-with-hashes, nothing: this is detailed demographic information in a context that we know can have dramatic repercussions for subjects.
This isn’t academic: it’s willful obtuseness from a place of privilege. Every day, marginalised groups are ostracised, excluded and persecuted. People made into the Other by their gender identity, sexuality, race, sexual interests, religion or politics. By individuals or by communities or even by nation states, vulnerable groups are just that: vulnerable.
This kind of data release pulls back the veil from those vulnerable people – it makes their outsider interests or traits clear and renders them easily identifiable to their friends and communities. It’s happened before. This sort of release is nothing more than a playbook and checklist for stalkers, harassers, rapists.
It’s the doxing of 70,000 people for a fucking paper.
…
I offer no defense for Emil Kirkegaard's paper, its methods, or its conclusions.
I have more sympathy for Oliver's concerns over consent and anonymised data than I do for, say, the International Consortium of Investigative Journalists (ICIJ) and its concealment of details from the Panama Papers, but only just.
It is in the very nature of data “leaks” that no consent is asked of or given by those exposed by the “leak.”
Moreover, anonymised data sounds suspiciously like the ICIJ saying it can protect the privacy of the “innocents” in the Panama Papers leak.
I don’t know, hiding from the tax man doesn’t raise a presumption of innocence for me. You?
Someone has to decide who the “innocents” are, or who merits the protection of anonymised data. To claim either one means you have someone in mind to fill that august role.
In our gender-skewed academic systems, would that be your department head, who is more than likely male?
My caveat to Oliver’s post is that, even with good intentions, the power to censor data releases is a very dangerous one. One that reinforces the power of those who possess it.
The less dangerous strategy is to teach users that if information is recorded, it will be leaked. Perhaps not today, maybe not tomorrow, but certainly by the day after that.
Choose what information you record carefully.