Sane Data Updates Are Harder Than You Think by Adrian Holovaty.
From the post:
This is the first in a series of three case studies about data-parsing problems from a journalist’s perspective. This will be meaty, this will be hairy, this will be firmly in the weeds.
We’re in the middle of an open-data renaissance. It’s easier than ever for somebody with basic tech skills to find a bunch of government data, explore it, combine it with other sources, and republish it. See, for instance, the City of Chicago Data Portal, which has hundreds of data sets available for immediate download.
But the simplicity can be deceptive. Sure, the mechanics of getting data are easy, but once you start working with it, you’ll likely face a variety of rather subtle problems revolving around data correctness, completeness, and freshness.
Here I’ll examine some of the most deceptively simple problems you may face, based on my eight years’ experience dealing with government data in a journalistic setting—most recently as founder of EveryBlock, and before that as creator of chicagocrime.org and web developer at washingtonpost.com. EveryBlock, which was shut down by its parent company NBC News in February 2013, was a site that gathered and sorted dozens of civic data sets geographically. It gave you a “news feed for your block”—a frequently updated feed of news and discussions relevant to your home address. In building this huge public-data-parsing machine, we dealt with many different data situations and problems, from a wide variety of sources.
My goal here is to raise your awareness of various problems that may not be immediately obvious and give you reasonable solutions. My first theme in this series is getting new or changed records.
A great introduction to the deep problems lurking just below the surface of any available data set.
Not only do data sets change, but reactions to and criticisms of those data sets change as well.
What would you offer as an example of “stable” data?
I tried to think of one for this post and came up empty.
You could claim the text of the King James Bible is “stable” data.
But only from a very narrow point of view.
The printed text is stable, but the opinions, criticisms, and commentaries on the King James Bible have been anything but stable.
Imagine a stock price ticker application that reports only the current price of some stock X.
Is that sufficient, or would it be more useful if it also reported the change over the last four hours as a percentage?
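As a rough sketch of the difference (the function name, data layout, and sample prices here are purely illustrative assumptions, not anything from Holovaty’s post), the richer report might look something like this in Python:

```python
from datetime import datetime, timedelta

def ticker_report(history, window=timedelta(hours=4)):
    """Hypothetical sketch: given (timestamp, price) samples sorted oldest
    to newest, return the latest price and its percentage change over the
    given window."""
    latest_ts, current = history[-1]
    cutoff = latest_ts - window
    # Use the earliest sample that falls inside the window as the baseline.
    baseline = next(price for ts, price in history if ts >= cutoff)
    pct_change = (current - baseline) / baseline * 100
    return current, pct_change

# The same "current price" reads very differently once it has context.
history = [
    (datetime(2013, 3, 1, 9, 0), 102.0),
    (datetime(2013, 3, 1, 11, 0), 98.5),
    (datetime(2013, 3, 1, 13, 0), 96.0),
]
price, change = ticker_report(history)
print(f"X: {price:.2f} ({change:+.1f}% over the last 4 hours)")
```

The bare price and the price-with-change are both “the data,” yet they answer different questions—which is exactly the point about data never being read the same way twice.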
Perhaps we need a modern data Heraclitus to proclaim:
“No one ever reads the same data twice”