Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

October 25, 2012

Data Preparation: Know Your Records!

Filed under: Data,Data Quality,Semantics — Patrick Durusau @ 10:25 am

Data Preparation: Know Your Records! by Dean Abbott.

From the post:

Data preparation in data mining and predictive analytics (dare I also say Data Science?) rightfully focuses on how the fields in ones data should be represented so that modeling algorithms either will work properly or at least won’t be misled by the data. These data preprocessing steps may involve filling missing values, reigning in the effects of outliers, transforming fields so they better comply with algorithm assumptions, binning, and much more. In recent weeks I’ve been reminded how important it is to know your records. I’ve heard this described in many ways, four of which are:
the unit of analysis
the level of aggregation
what a record represents
unique description of a record

A bit further on Dean reminds us:

What isn’t always obvious is when our assumptions about the data result in unexpected results. What if we expect the unit of analysis to be customerID/Session but there are duplicates in the data? Or what if we had assumed customerID/Session data but it was in actuality customerID/Day data (where ones customers typically have one session per day, but could have a dozen)? (emphasis added)

Obvious once Dean says it, but how often do you question assumptions about data?

Do you know what impact incorrect assumptions about data will have on your operations?

If you investigate your assumptions about data, where do you record your observations?

Or will you repeat the investigation with every data dump from a particular source?

Describing data “in situ” could benefit you six months from now or your successor. (The data and or its fields would be treated as subjects in a topic map.)

No Comments

No comments yet.

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress