Dancing With Dirty Data Thanks to SAP Visual Intelligence by Timo Elliott.
From the post:
Here’s my entry for the SAP Ultimate Data Geek Challenge, a contest designed to “show off your inner geek and let the rest of world know your data skills are second to none.” There have already been lots of great submissions with people using the new SAP Visual Intelligence data discovery product.
I thought I’d focus on one of the things I find most powerful: the ability to create visualizations quickly and easily even from real-life, messy data sources. Since it’s election season in the US, I thought I’d use some polling data on whether voters believe the country is “headed in the right direction.” There is lots of different polling data on this (and other topics) available at pollingreport.com.
Below you can see the data set I grabbed: as you can see, the polling date field is particularly messy, since it has extra letters (e.g. RV for “registered voter”), includes polls that were carried out over several days, and is not consistent (the month is not always included, sometimes spaces around the middle dash, sometimes not…).
Take a closer look at Timo’s definition of “dirty” data: “…polling date field is particularly messy, since it has extra letters (e.g. RV for “registered voter”), includes polls that were carried out over several days, and is not consistent….”
Sure, that’s “dirty” data all right, but it is only one form of dirty data: the kind that arises from typographical inconsistency, inconsistency that prevents reliable automated processing.
Another form of dirty data arises from identifier inconsistency. That is, more than one identifier is used for the same subject, and/or the same identifier is used for different subjects.
I take the second form, identifier inconsistency, to be distinct from typographical inconsistency. The two can turn out to overlap, but conceptually I find it helpful to distinguish them.
Resolution of either form of inconsistency requires judgement about the reference being made by the identifiers.
Question: If you are resolving typographical inconsistency, do you keep a map of the resolution? If not, why not?
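To make the first question concrete, here is a minimal Python sketch of what keeping a map of the resolution might look like for the kind of polling date field Timo describes. The sample values, the two cleanup rules, and the function names are my own assumptions for illustration; they are not taken from Timo's data set or from SAP Visual Intelligence.

```python
import re


def normalize_poll_date(raw):
    """Return (cleaned, rules), where rules lists each resolution applied."""
    rules = []
    cleaned = raw.strip()

    # Rule 1 (assumed): drop a trailing population tag such as "RV" or "LV".
    match = re.search(r"\s*\b(RV|LV)\b\s*$", cleaned)
    if match:
        cleaned = cleaned[:match.start()]
        rules.append("dropped tag " + match.group(1))

    # Rule 2 (assumed): remove any spaces around the dash in a date range.
    tightened = re.sub(r"\s*-\s*", "-", cleaned)
    if tightened != cleaned:
        cleaned = tightened
        rules.append("removed spaces around dash")

    return cleaned, rules


def build_resolution_map(raw_values):
    """Map every raw value to its cleaned form plus the rules that produced it."""
    resolution_map = {}
    for raw in raw_values:
        cleaned, rules = normalize_poll_date(raw)
        resolution_map[raw] = {"cleaned": cleaned, "rules": rules}
    return resolution_map


# Hypothetical sample values in the spirit of the messy date field.
samples = ["10/25-28/12 RV", "9/30 - 10/2/12", "8/16-20/12"]
resolution_map = build_resolution_map(samples)

for raw, entry in resolution_map.items():
    print(repr(raw), "->", repr(entry["cleaned"]), entry["rules"])
```

The point of the map is that the raw value is never thrown away: anyone who later disagrees with a judgement (say, that the RV tag can be dropped) can see exactly which values were affected and why.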
Question: Same questions for identifier inconsistency.
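For identifier inconsistency, a comparable sketch records each judgement as "this identifier, in this context, refers to this subject," and then audits the map for both symptoms: several identifiers for one subject, and one identifier for several subjects. The identifiers, contexts, and subject keys below are hypothetical examples of mine, and this is only one of many ways such a map could be kept.

```python
from collections import defaultdict

# Each entry records one judgement: (identifier, context, subject).
# All values here are hypothetical and chosen only for illustration.
IDENTIFIER_MAP = [
    ("RV", "polling", "registered-voters"),
    ("registered voters", "polling", "registered-voters"),
    ("reg. voters", "polling", "registered-voters"),
    ("RV", "travel", "recreational-vehicle"),
]


def resolve(identifier, context, mapping=IDENTIFIER_MAP):
    """Return the subject recorded for this identifier in this context, or None."""
    for ident, ctx, subject in mapping:
        if ident == identifier and ctx == context:
            return subject
    return None


def audit(mapping=IDENTIFIER_MAP):
    """Report subjects with several identifiers and identifiers with several subjects."""
    by_subject = defaultdict(set)
    by_identifier = defaultdict(set)
    for ident, _ctx, subject in mapping:
        by_subject[subject].add(ident)
        by_identifier[ident].add(subject)
    many_identifiers = {s: sorted(i) for s, i in by_subject.items() if len(i) > 1}
    many_subjects = {i: sorted(s) for i, s in by_identifier.items() if len(s) > 1}
    return many_identifiers, many_subjects


many_identifiers, many_subjects = audit()
print(resolve("RV", "polling"))
print(many_identifiers)
print(many_subjects)
```

Note that the identifier alone is not enough to resolve "RV"; the context has to be part of the recorded judgement, which is one reason I find it helpful to treat this form of inconsistency separately from the typographical kind.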