Opening data: Have you checked your pipes? by Bob Lannon.
From the post:
Code for America alum Dave Guarino had a post recently entitled “ETL for America”. In it, he highlights something that open data practitioners face with every new project: the problem of Extracting data from old databases, Transforming it to suit a new application or analysis and Loading it into the new datastore that will support that new application or analysis. Almost every technical project (and every idea for one) has this process as an initial cost. This cost is so pervasive that it’s rarely discussed by anyone except for the wretched “data plumber” (Dave’s term) who has no choice but to figure out how to move the important resources from one place to another.
Why aren’t we talking about it?
The up-front costs of ETL don’t come up very often in the open data and civic hacking community. At hackathons, in funding pitches, and in our definitions of success, we tend to focus on outputs (apps, APIs, visualizations) and treat the data preparation as a collateral task, unavoidable and necessary but not worth “getting into the weeds” about. Quoting Dave:
The fact that I can go months hearing about “open data” without a single mention of ETL is a problem. ETL is the pipes of your house: it’s how you open data.
It’s difficult to point to evidence that this is really the case, but I personally share Dave’s experience. To me, it’s still the elephant in the room during the proceedings of any given hackathon or open data challenge. I worry that the open data community is somehow under the false impression that, eventually in the sunny future, data will be released in a cleaner way and that this cost will decrease over time.
It won’t. Open data might get cleaner, but no data source can evolve to the point where it serves all possible needs. Regardless of how easy it is to read, the data published by government probably wasn’t prepared with your new app idea in mind.
Data transformation will always be necessary, and it’s worth considering apart from the development of the next cool interface. It’s a permanent cost of developing new things in our space, so why aren’t we putting more resources toward addressing it as a problem in its own right? Why not spend some quality time (and money) focused on data preparation itself, and then let a thousand apps bloom?
…
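To make that up-front cost concrete, here is a minimal sketch in Python of the extract/transform/load sequence Dave and Bob are describing. Every file name, column name, and table name below is hypothetical, chosen only for illustration; your own “pipes” will look different.

    # Minimal ETL sketch: hypothetical file, column, and table names throughout.
    import csv
    import sqlite3

    def extract(path):
        # Extract: read rows out of a legacy CSV export.
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(rows):
        # Transform: reshape rows to suit the new application.
        cleaned = []
        for row in rows:
            cleaned.append({
                "agency": row["AGENCY_NM"].strip().title(),    # assumed legacy column name
                "amount": float(row["AMT"].replace(",", "")),  # assumed "1,234.56" style amounts
            })
        return cleaned

    def load(rows, db_path):
        # Load: write the reshaped rows into the new datastore.
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS spending (agency TEXT, amount REAL)")
        conn.executemany("INSERT INTO spending VALUES (:agency, :amount)", rows)
        conn.commit()
        conn.close()

    load(transform(extract("legacy_export.csv")), "new_app.db")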
If you take away only this one line from Bob’s entire post:
Open data might get cleaner, but no data source can evolve to the point where it serves all possible needs. (emphasis added)
then reading it will have been time well spent.
Your “clean data” will at times be my “dirty data” and vice versa.
Documenting the semantics we “see” in data, the semantics that drive our transformations into “clean” data for us, stands a chance of helping the next person in line who needs to use that data.
Think of it as an accumulation of experience with a data set and the results obtained from it.
Or you can just “wing it” with every data set you encounter, and so shall we all.
Your call.
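If you do choose to document, the cheapest version is to keep the assumptions right next to the cleaning code and ship them with the data. A rough sketch in Python, with hypothetical field and file names:

    # Record what "clean" meant here, so the next user can see the assumptions.
    import json

    ASSUMPTIONS = {
        "amount": "blank AMT treated as 0.0, not as missing",
        "agency": "AGENCY trimmed and upper-cased; no deduplication attempted",
        "date": "DATE_FILED passed through unparsed; assumed MM/DD/YYYY",
    }

    def clean(row):
        # Apply exactly the transformations described in ASSUMPTIONS to one raw row.
        return {
            "date": row["DATE_FILED"],         # left as-is; format documented above
            "amount": float(row["AMT"] or 0),  # blank becomes 0.0, per ASSUMPTIONS
            "agency": row["AGENCY"].strip().upper(),
        }

    # Ship the assumptions with the cleaned data, not in someone's head.
    with open("cleaning_notes.json", "w") as f:
        json.dump(ASSUMPTIONS, f, indent=2)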
I first saw this in a tweet by Dave Guarino.