Lars Marius Garshol walks through finding duplicate records in data sets.
As Lars notes, there are commercial products for the same task, but I think this is a useful exercise.
It isn’t hard to imagine creating test data sets with a variety of conditions to underscore lessons about detecting duplicate records.
I suspect such training data may already be available.
Will have to see what I can find and post about it.
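To make the idea concrete, here is a minimal Python sketch of what such an exercise could look like: a handful of made-up records with injected near-duplicates, flagged with a crude string-similarity comparison. The records, the similarity measure, and the 0.85 cut-off are all illustrative and not taken from Lars's post.

```python
# Minimal sketch: a tiny made-up test set with injected near-duplicates,
# flagged by a crude string-similarity comparison. Records, the similarity
# measure, and the 0.85 threshold are all illustrative.
from difflib import SequenceMatcher
from itertools import combinations

records = [
    {"id": 1, "name": "John Smith",   "city": "Oslo"},
    {"id": 2, "name": "Jon Smith",    "city": "Oslo"},    # dropped letter
    {"id": 3, "name": "Mary Jones",   "city": "Bergen"},
    {"id": 4, "name": "Mary Jonse",   "city": "Bergen"},  # transposed letters
    {"id": 5, "name": "Ada Lovelace", "city": "London"},
]

def similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1]; real tools use better comparators."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

THRESHOLD = 0.85  # arbitrary cut-off chosen for this toy data

for left, right in combinations(records, 2):
    score = similarity(left["name"], right["name"])
    if score >= THRESHOLD:
        print(f"possible duplicate: {left['id']} / {right['id']} (score {score:.2f})")
```

Real tools use better comparators and blocking so they don't compare every pair, but the shape of the exercise is the same.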
*****
PS: Lars is the primary editor of the TMDM and is working on TMCL and several other parts of the topic maps standard.
The question of training data is an interesting one. It’s rare to find published data that’s anywhere near as dirty as the data real users produce. I can’t think offhand of any data set that has lots of duplicates in it, but like you say I’m sure they exist.
In any case, you could use this mechanism to bridge any two data sets. Even basic geographic databases. You could match countries by their names, continents, areas (fuzzily, obviously), etc etc etc
Comment by Lars Marius Garshol — February 11, 2011 @ 2:49 pm
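As a rough sketch of the kind of cross-data-set bridging Lars describes, the toy example below matches two invented country tables on fuzzy name similarity plus an area tolerance. The tables, the 0.5 name cut-off, and the 5% area tolerance are all made up for illustration.

```python
# Minimal sketch: bridge two invented "country" tables by fuzzy name match
# plus an approximate area comparison. The data, the 0.5 name cut-off, and
# the 5% area tolerance are all made up for illustration.
from difflib import SequenceMatcher

set_a = [
    {"name": "United States of America", "area_km2": 9_833_520},
    {"name": "Cote d'Ivoire",            "area_km2": 322_463},
    {"name": "Norway",                   "area_km2": 385_207},
]
set_b = [
    {"name": "United States",     "area_km2": 9_826_675},
    {"name": "Côte d’Ivoire",     "area_km2": 322_460},
    {"name": "Kingdom of Norway", "area_km2": 385_000},
]

def name_score(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def area_close(a: int, b: int, tolerance: float = 0.05) -> bool:
    """True when the two areas differ by less than `tolerance`, relatively."""
    return abs(a - b) / max(a, b) < tolerance

for rec_a in set_a:
    # Take the set_b record with the best name score, then sanity-check the
    # pairing with the fuzzy area comparison.
    best = max(set_b, key=lambda rec_b: name_score(rec_a["name"], rec_b["name"]))
    score = name_score(rec_a["name"], best["name"])
    if score > 0.5 and area_close(rec_a["area_km2"], best["area_km2"]):
        print(f"{rec_a['name']!r} ~ {best['name']!r} (name score {score:.2f})")
```

Continents or other fields would slot in as further comparators in the same way.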
Perhaps I should sponsor a “dirty data hunt” with public balloting on the reported “dirty data” sets. That could actually be pretty funny.
What do you think? What sort of prize? Gift certificate to some custom brewery?
How frequently? I suspect that, with all the government and other data sets appearing on the WWW, monthly would not be too often.
What about user data with identity stuff removed? Should that qualify?
Comment by Patrick Durusau — February 11, 2011 @ 5:51 pm