Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

September 22, 2012

Dancing With Dirty Data Thanks to SAP Visual Intelligence [Kinds of Dirty?]

Filed under: Identifiers,SAP,SAP Visual Intelligence — Patrick Durusau @ 2:19 pm

Dancing With Dirty Data Thanks to SAP Visual Intelligence by Timo Elliott.

From the post:

(graphic omitted)

Here’s my entry for the SAP Ultimate Data Geek Challenge, a contest designed to “show off your inner geek and let the rest of world know your data skills are second to none.” There have already been lots of great submissions with people using the new SAP Visual Intelligence data discovery product.

I thought I’d focus on one of the things I find most powerful: the ability to create visualizations quickly and easily even from real-life, messy data sources. Since it’s election season in the US, I thought I’d use some polling data on whether voters believe the country is “headed in the right direction.” There is lots of different polling data on this (and other topics) available at pollingreport.com.

Below you can see the data set I grabbed: as you can see, the polling date field is particularly messy, since it has extra letters (e.g. RV for “registered voter”), includes polls that were carried out over several days, and is not consistent (the month is not always included, sometimes spaces around the middle dash, sometimes not…).

Take a closer look at Timo’s definition of “dirty” data: “…polling date field is particularly messy, since it has extra letters (e.g. RV for “registered voter”), includes polls that were carried out over several days, and is not consistent….”

Sure, that’s “dirty” data all right, but only one form of dirty data. It is dirty data that arises from typographical inconsistency, the kind of inconsistency that prevents reliable automated processing.
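To make the typographical case concrete, here is a minimal Python sketch of cleaning a date field like the one Timo describes. The sample strings, the regular expression, and the function name `parse_poll_dates` are my own assumptions for illustration; they are not taken from his data set or from SAP Visual Intelligence.

```python
import re
from datetime import date

# Hypothetical raw values in the style the quoted post describes
# (trailing letters, multi-day ranges, optional second month, uneven spacing).
RAW_DATES = ["9/13-16/12 RV", "8/31 - 9/3/12", "7/9-12/12"]

PATTERN = re.compile(
    r"^\s*(?P<m1>\d{1,2})/(?P<d1>\d{1,2})"   # start month/day
    r"\s*-\s*"                                # dash, with or without spaces
    r"(?:(?P<m2>\d{1,2})/)?(?P<d2>\d{1,2})"   # end day, end month optional
    r"/(?P<yy>\d{2})"                         # two-digit year
    r"(?:\s+(?P<tag>[A-Z]+))?\s*$"            # optional suffix such as RV
)

def parse_poll_dates(raw):
    """Return (start, end, tag) or None if the value does not fit the pattern."""
    m = PATTERN.match(raw)
    if m is None:
        return None
    year = 2000 + int(m["yy"])                # assumes 21st-century polls
    start_month = int(m["m1"])
    end_month = int(m["m2"]) if m["m2"] else start_month
    try:
        start = date(year, start_month, int(m["d1"]))
        end = date(year, end_month, int(m["d2"]))
        if start > end:
            # Range crosses New Year: the year written belongs to the end date.
            start = start.replace(year=year - 1)
    except ValueError:
        return None                           # e.g. month 13 or day 32
    return start, end, m["tag"]

for raw in RAW_DATES:
    print(raw, "->", parse_poll_dates(raw))
```

Note that even this small cleaner embeds decisions: that a missing end month means "same month," and that the trailing letters are a tag rather than part of the date.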

Another form of dirty data arises from identifier inconsistency. That is, two or more identifiers are used for the same subject, and/or the same identifier is used for different subjects.
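A rough Python sketch of both halves of that definition follows. The (identifier, subject) pairs are invented for illustration, and the function name `find_inconsistencies` is hypothetical. The sketch also assumes someone has already decided which subject each identifier refers to, which is the hard part.

```python
from collections import defaultdict

# Hypothetical (identifier, subject) assertions, made up for this example.
PAIRS = [
    ("RV", "registered voters"),
    ("Reg. voters", "registered voters"),
    ("RV", "recreational vehicle"),
    ("LV", "likely voters"),
]

def find_inconsistencies(pairs):
    """Return (synonyms, homonyms) found in the identifier/subject pairs."""
    ids_by_subject = defaultdict(set)
    subjects_by_id = defaultdict(set)
    for identifier, subject in pairs:
        ids_by_subject[subject].add(identifier)
        subjects_by_id[identifier].add(subject)
    # Two or more identifiers used for the same subject.
    synonyms = {s: ids for s, ids in ids_by_subject.items() if len(ids) > 1}
    # The same identifier used for different subjects.
    homonyms = {i: subs for i, subs in subjects_by_id.items() if len(subs) > 1}
    return synonyms, homonyms

synonyms, homonyms = find_inconsistencies(PAIRS)
print("same subject, several identifiers:", synonyms)
print("same identifier, several subjects:", homonyms)
```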

I take the second form, identifier inconsistency, to be distinct from typographical inconsistency. The two can turn out to overlap, but conceptually I find it helpful to distinguish them.

Resolution of either form of inconsistency requires judgement about the reference being made by the identifiers.

Question: If you are resolving typographical inconsistency, do you keep a map of the resolution? If not, why not?

Question: Same questions for identifier inconsistency.
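For what it is worth, "keeping a map of the resolution" need not be elaborate. Here is a minimal Python sketch, assuming a hypothetical resolver that only tidies the spacing around the dash in a date range; the names `tidy_dash` and `resolve_with_map` are mine, not from any product.

```python
import re

def tidy_dash(value):
    """Hypothetical resolver: trim the value and remove spaces around dashes."""
    return re.sub(r"\s*-\s*", "-", value.strip())

def resolve_with_map(values, resolver):
    """Apply resolver to each value and record original -> resolved."""
    resolution_map = {}
    resolved = []
    for original in values:
        result = resolver(original)
        resolution_map[original] = result   # the record of the judgement made
        resolved.append(result)
    return resolved, resolution_map

# Made-up sample values for illustration.
cleaned, mapping = resolve_with_map(["8/31 - 9/3/12", "7/9-12/12 "], tidy_dash)
for original, result in mapping.items():
    print(repr(original), "->", repr(result))
```

The map costs one dictionary, and it lets a later reader see, question, or reverse each resolution instead of inheriting only the cleaned values.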

The Ultimate Data Geek Challenge

Filed under: Challenges,SAP,SAP Visual Intelligence — Patrick Durusau @ 1:59 pm

The Ultimate Data Geek Challenge by Nic Smith.

From the post:

Are You the Ultimate Data Geek?

The time has come to show off your inner geek and let the rest of world know your data skills are second to none.

We’re excited to announce the Ultimate Data Geek Challenge. Grab your data and share your visual creation in a video, screen capture, or blog post on the SCN. Once you enter, you’ll have a chance to be crowned the Ultimate Data Geek.

How Do I Enter?

It’s easy – just four simple steps:

Important note: Challenge entries will be accepted up until November 30, 2012, at 11:59 p.m. Pacific.

There are videos and other materials to help you learn SAP Visual Intelligence.

Another tool to find subjects and data about subjects. I haven’t looked at SAP Visual Intelligence yet, so I would appreciate a shout if you have.

I first saw this at: Dancing With Dirty Data Thanks to SAP Visual Intelligence
