On Taxis and Rainbows: Lessons from NYC’s improperly anonymized taxi logs by Vijay Pandurangan.
From the post:
Recently, thanks to a Freedom of Information request, Chris Whong received and made public a complete dump of historical trip and fare logs from NYC taxis. It’s pretty incredible: there are over 20GB of uncompressed data comprising more than 173 million individual trips. Each trip record includes the pickup and dropoff location and time, anonymized hack licence number and medallion number (i.e. the taxi’s unique id number, 3F38, in my photo above), and other metadata.
These data are a veritable trove for people who love cities, transit, and data visualization. But there’s a big problem: the personally identifiable information (the driver’s licence number and taxi number) hasn’t been anonymized properly — what’s worse, it’s trivial to undo, and with other publicly available data, one can even figure out which person drove each trip. In the rest of this post, I’ll describe the structure of the data, what the person/people who released the data did wrong, how easy it is to deanonymize, and the lessons other agencies should learn from this. (And yes, I’ll also explain how rainbows fit in).
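To make the “trivial to undo” claim concrete, here is a minimal sketch of my own, not code from the post. The excerpt above does not say how the identifiers were anonymized, so the sketch assumes they were replaced with an unsalted MD5 hash, and it assumes medallion numbers follow the digit-letter-digit-digit pattern of the 3F38 example. Both are assumptions for illustration. Under them, every possible medallion can be hashed in advance and the “anonymized” value simply looked up.

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits


def md5_hex(text: str) -> str:
    """Hex MD5 digest of a string (the ASSUMED anonymization step)."""
    return hashlib.md5(text.encode("ascii")).hexdigest()


def build_lookup() -> dict[str, str]:
    """Map hash -> medallion for every id of the ASSUMED form digit-letter-digit-digit."""
    table = {}
    for d1, letter, d2, d3 in product(digits, ascii_uppercase, digits, digits):
        medallion = f"{d1}{letter}{d2}{d3}"
        table[md5_hex(medallion)] = medallion
    return table


lookup = build_lookup()

# Hypothetical released record: only the hash of the medallion is visible.
anonymized = md5_hex("3F38")

# Reversing the hash is a single dictionary lookup.
recovered = lookup[anonymized]
print(len(lookup), recovered)
```

The whole identifier space under that assumed pattern is only 26,000 values, so the table builds almost instantly. Hashing hides nothing when the set of possible inputs is small enough to enumerate.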
I mention this because you may be interested in the data in large chunks or small chunks.
The other reason to mention this data set is the concern over “proper” anonymization of the data. As if failing to do that resulted in a loss of privacy for the drivers.
I see no loss of privacy for the drivers.
I say that because the New York City Taxi and Limousine Commission already had the data. The question was: Will members of the public have access to the same data? Whatever privacy a taxi driver had was breached when the data went to the NYC Taxi and Limousine Commission.
That’s an important distinction. “Privacy” will be a regular stick the government trots out to defend its possessing data and not sharing it with you.
The government has no real interest in your privacy. Witness the rogue intelligence agencies in Washington if you have any doubts on that issue. The government wants to conceal your information, which it gained by fair and/or foul methods, from both you and the rest of us.
Why? I don’t know with any certainty. But based on my observations in both the “real world” and academia, most of it stems from “I know something you don’t,” and that makes them feel important.
I can’t imagine any sadder basis for feeling important. The NSA could print out a million pages of its most secret files and stack them outside my office. I doubt I would be curious enough to turn over the first page.
The history of speculation, petty office rivalries, snide remarks about foreign government officials, etc. is of no interest to me. I already assumed they were spying on everyone, so having “proof” of that is hardly a big whoop.
But we should not be deterred by calls for privacy as we force government to disgorge data it has collected, including that of the NSA. Perhaps even licensing chunks of the NSA data for use in spy novels. That offers some potential for return on the investment in the NSA.