Concord: A Tool That Automates the Construction of Record Linkage Systems by Christopher Dozier, Hugo Molina Salgado, Merine Thomas, Sriharsha Veeramachaneni, 2010.
From the webpage:
Concord is a system provided by Thomson Reuters R&D to enable the rapid creation of record resolution systems (RRS). Concord allows software developers to interactively configure a RRS by specifying match feature functions, master record retrieval blocking functions, and unsupervised machine learning methods tuned to a specific resolution problem. Based on a developer’s configuration process, the Concord system creates a Java based RRS that generates training data, learns a matching model and resolves record information contained in files of the same types used for training and configuration.
A nice way to start off the week! Deeply interesting paper and a new name for record linkage.
Several features of Concord that merit your attention (among many):
A choice of basic comparison operations with the ability to extend seems like a good design to me. No sense overwhelming users with all the general comparison operators, to say nothing of the domain specific ones.
The blocking functions, which operate just as you suspect, narrows the potential set of records for matching down, is also appealing. Sometimes you may be better at saying what doesn’t match than what does. This gives you two bites at a successful match.
Surrogate learning, although I have located the paper cited on this subject and will be covering it in another post.
I have written to ThomsonReuters inquiring about availability of Concord, its ability to interchange mapping settings between instances of Concord or beyond. Will update when I hear back from them.