Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

July 7, 2013

Nasty data corruption getting exponentially worse…

Filed under: BigData,Data Quality — Patrick Durusau @ 3:52 pm

Nasty data corruption getting exponentially worse with the size of your data by Vincent Granville.

From the post:

The issue with truly big data is that you will end up with field separators that are actually data values (text data). What are the chances of finding a double tab in a one GB file? Not that high. In a 100 TB file, the chance is very high. Now the question is: is it a big issue, or is it fine as long as less than 0.01% of the data is impacted? In some cases, once the glitch occurs, ALL the data after the glitch is corrupted, because it is not read correctly. This is especially true when a data value contains text that is identical to a row or field separator, such as CR / LF (carriage return / line feed). The problem gets worse when data is exported from UNIX or MAC to WINDOWS, or even from ACCESS to EXCEL.
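To see the failure mode concretely, here is a minimal sketch (the record and field layout are hypothetical) of how a stray tab inside a value silently shifts every field that follows it:

```python
# Hypothetical record in a tab-delimited file. The schema expects
# 4 fields, but one value happens to contain a tab character.
record = "1001\tSmith, John\tnotes with a\tstray tab\t2013-07-07"

fields = record.split("\t")
print(len(fields))  # 5 fields where the schema expects 4
print(fields)       # the notes value is now split across two columns
```

Every field after the embedded separator lands in the wrong column, which is why corruption can propagate through the rest of a row, or with an embedded CR/LF, through the rest of the file.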

Vincent has a number of suggestions for checking data.
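One check in that spirit is a field-count scan: read the file line by line and flag any row whose field count disagrees with the header's. A row with too many fields suggests an embedded separator; a row with too few often marks where a value containing a CR/LF was split in two. This is a minimal sketch, assuming a tab-delimited file with one record per line; the path and delimiter are placeholders, not anything from Vincent's post:

```python
import sys

def scan_delimited(path, sep="\t"):
    """Flag rows whose field count differs from the header's.

    Assumes one record per line; a value with an embedded newline
    will surface as two short rows, which is exactly what we want
    to catch.
    """
    with open(path, encoding="utf-8", errors="replace", newline="") as f:
        header = f.readline().rstrip("\r\n")
        expected = header.count(sep) + 1
        for lineno, line in enumerate(f, start=2):
            found = line.rstrip("\r\n").count(sep) + 1
            if found != expected:
                print(f"line {lineno}: expected {expected} fields, found {found}")

if __name__ == "__main__":
    scan_delimited(sys.argv[1])
```

A scan like this costs one sequential pass, so it scales to files where eyeballing the data is hopeless.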

What would you add to his list?
