Archive for the ‘Purge’ Category

Dedupe, Merge, and Purge: the Art of Normalization

Friday, October 28th, 2011

Dedupe, Merge, and Purge: the Art of Normalization by Tyler Bell and Leo Polovets.

From the description:

Big Noise always accompanies Big Data, especially when extracting entities from the tangle of duplicate, partial, fragmented and heterogeneous information we call the Internet. The ~17m physical businesses in the US, for example, are found on over 1 billion webpages and endpoints across 5 million domains and applications. Organizing such a disparate collection of pages into a canonical set of things requires a combination of distributed data processing and human-based domain knowledge. This presentation stresses the importance of entity resolution within a business context and provides real-world examples and pragmatic insight into the process of canonicalization.

I like the Big Noise line. That may have some traction. Certainly will be the case that when users started having contact with unfiltered big data, they are likely to be as annoyed as they are with web searching.

The answer to their questions is likely “out there” but it lies just beyond their grasp. Failing they won’t blame the Big Noise or their own lack of skill but the inoffensive (and ineffective) tools at hand. Guess it is a good thing search engines are free save for advertising.

PS: The slides have a number of blanks. I have written to the authors, well, to their company, asking for corrected slides to be posted.