Five Stages of Data Grief by Jeni Tennison.
From the post:
As organisations come to recognise how important and useful data could be, they start to think about using the data that they have been collecting in new ways. Often data has been collected over many years as a matter of routine, to drive specific processes or sometimes just for the sake of it. Suddenly that data is repurposed. It is probed, analysed and visualised in ways that haven’t been tried before.
Data analysts have a maxim:
If you don’t think you have a quality problem with your data, you haven’t looked at it yet.
Every dataset has its quirks, whether it’s data that has been wrongly entered in the first place, automated processing that has introduced errors, irregularities that come from combining datasets into a consistent structure or simply missing information. Anyone who works with data knows that far more time is needed to clean data into something that can be analysed, and to understand what to leave out, than in actually performing the analysis itself. They also know that analysis and visualisation of data will often reveal bugs that you simply can’t see by staring at a spreadsheet.
But for the people who have collected and maintained such data — or more frequently their managers, who don’t work with the data directly — this realisation can be a bit of a shock. In our last ODI Board meeting, Sir Tim Berners-Lee suggested that the process data curators need to go through was something like the five stages of grief described by the Kübler-Ross model.
Jeni covers the five stages of grief from a data quality standpoint and offers a sixth stage. (No spoilers follow; read her post.)
Correcting input/transformation errors is one level of data cleaning.
But the near-collapse of HealthCare.gov shows how streams of “clean” data can combine into a large pool of “dirty” data.
Every contributor supplied “clean” data, but when combined with other “clean” data, confusion was the result.
Keeping data “clean” is an ongoing process at two separate levels:
Level 1: Traditional correction of input/transformation errors (as per Jeni).
Level 2: Preparation of data for transformation into “clean” data for new purposes.
The first level is familiar.
The second we all know as ad-hoc ETL.
Enough knowledge is gained to make a transformation work, but that knowledge isn’t passed on with the data or more generally.
Or as we all learned from television: “Lather, rinse, repeat.”
A good slogan if you are trying to maximize sales of shampoo, but a wasteful one when describing ETL for data.
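To make that concrete, here is a minimal sketch of what an ad-hoc cleaning step often looks like. The file, column names, and coding conventions are hypothetical; the point is that every cleaning decision lives only inside a throwaway script, so whoever repurposes the data next has to rediscover them.

```python
# Hypothetical ad-hoc ETL: clean a facilities extract for one analysis.
# Every decision below is domain knowledge that never travels with the data.
import csv
from datetime import datetime

cleaned = []
with open("facilities.csv", newline="") as f:
    for row in csv.DictReader(f):
        # "99" is the source system's code for "not reported"; that fact lives only here.
        row["bed_count"] = None if row["bed_count"] == "99" else int(row["bed_count"])
        # This feed uses day-first dates; also known only to this script.
        row["opened"] = datetime.strptime(row["opened"], "%d/%m/%Y").date()
        # Closed facilities are silently dropped for this particular analysis.
        if row["status"].strip().lower() == "closed":
            continue
        cleaned.append(row)
```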
What if data curators captured the knowledge required for ETL, making every subsequent ETL less resource intensive and less error prone?
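Here is one hedged sketch of what capturing that knowledge might look like, reusing the hypothetical columns from above: the cleaning decisions are written down as declarative rules and stored alongside the data, so each later ETL run starts from the recorded rules instead of rediscovering them. The rule vocabulary and file names are assumptions, not any particular standard.

```python
# Hypothetical sketch: cleaning knowledge captured as data that travels with
# the dataset, instead of living only inside a throwaway script.
import json
from datetime import datetime

CLEANING_RULES = {
    "bed_count": {"type": "int", "missing_codes": ["99"],
                  "note": "99 means 'not reported' in the source system"},
    "opened":    {"type": "date", "format": "%d/%m/%Y",
                  "note": "this feed uses day-first dates"},
    "status":    {"drop_values": ["closed"],
                  "note": "closed facilities excluded from this analysis"},
}

def clean_value(column, value):
    """Apply the recorded rule for one column; coded missing values become None."""
    rule = CLEANING_RULES.get(column, {})
    if value in rule.get("missing_codes", []):
        return None
    if rule.get("type") == "int":
        return int(value)
    if rule.get("type") == "date":
        return datetime.strptime(value, rule["format"]).date()
    return value

# Ship the rules with the cleaned data so the next transformation can reuse
# (and extend) them rather than starting from scratch.
with open("facilities.cleaning-rules.json", "w") as f:
    json.dump(CLEANING_RULES, f, indent=2)
```

However the rules are expressed, the effect is the same: the knowledge becomes part of the data’s documentation rather than a cost paid again on every reuse.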
I think that would qualify as data cleaning.
You?