Sanity Checks

Being paranoid about data accuracy! by Kunal Jain.

Kunal knew a long meeting was in store after this exchange at its start:

Kunal: How many rows do you have in the data set?

Analyst 1: (After going through the data set) X rows

Kunal: How many rows do you expect?

Analyst 1 & 2: Blank look at their faces

Kunal: How many events / data points do you expect in the period / every month?

Analyst 1 & 2: …. (None of them had a clue)

The number of rows in the data set looked higher to me. The analysts had missed it clearly, because they did not benchmark it against business expectation (or did not have it in the first place). On digging deeper, we found that some events had multiple rows in the data sets and hence the higher number of rows.
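
The check Kunal describes, comparing the row count against a business expectation and then looking for events that occupy more than one row, can be sketched in a few lines of Python. This is only an illustration, not Kunal's own method: the function name `row_count_report`, the `event_id` column, the expected count of four, and the 10% tolerance are all assumptions made up for the example.

```python
from collections import Counter


def row_count_report(rows, key, expected_rows, tolerance=0.10):
    """Compare the actual row count with an expected count and list duplicated keys.

    `rows` is a list of dicts, `key` names the column that should identify
    one event, and `expected_rows` (must be > 0) is the business expectation.
    All of these names are hypothetical, chosen for this sketch.
    """
    actual = len(rows)
    # Relative deviation of the actual count from the expected count.
    deviation = (actual - expected_rows) / expected_rows
    # Count how many rows each key value occupies.
    counts = Counter(row[key] for row in rows)
    # Keep only the keys that appear on more than one row.
    duplicates = {k: n for k, n in counts.items() if n > 1}
    return {
        "actual_rows": actual,
        "expected_rows": expected_rows,
        "within_tolerance": abs(deviation) <= tolerance,
        "distinct_keys": len(counts),
        "duplicated_keys": duplicates,
    }


# Made-up data: four events are expected, but event 2 was recorded three times.
rows = [
    {"event_id": 1, "amount": 10},
    {"event_id": 2, "amount": 20},
    {"event_id": 2, "amount": 20},
    {"event_id": 2, "amount": 20},
    {"event_id": 3, "amount": 5},
    {"event_id": 4, "amount": 7},
]

report = row_count_report(rows, key="event_id", expected_rows=4)
print(report)
```

In this made-up case the report flags six rows against an expectation of four, and points to event 2 as the reason. The useful part is less the code than the habit it enforces: you cannot call the function without first writing down how many rows you expect.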

You have probably seen them before, but Kunal has seven sanity check rules that should be applied to every data set.

Unless, of course, the inability to answer simple questions about your data sets* is tolerated by your employer.

*Data sets become “yours” when you are asked to analyze them. Better to spot and report problems before they become evident in your results.
