Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

April 14, 2011

Deduplication

Filed under: Duplicates,Lucene,Record Linkage — Patrick Durusau @ 7:25 am


Lars Marius Garshol's slides from an internal Bouvet conference on deduplication of data.

And there is DUplicate KillEr (DUKE), the deduplication tool Lars has been working on.

As Lars points out, people have been here before.

I am not sure I share Lars’ assessment of the current state of record linkage software.

Consider, for example, FRIL – Fine-Grained Record Integration and Linkage Tool, which is described as:

FRIL is FREE open source tool that enables fast and easy record linkage. The tool extends traditional record linkage tools with a richer set of parameters. Users may systematically and iteratively explore the optimal combination of parameter values to enhance linking performance and accuracy.
Key features of FRIL include:

  • Rich set of user-tunable parameters
  • Advanced features of schema/data reconciliation
  • User-tunable search methods (e.g. sorted neighborhood method, blocking method, nested loop join)
  • Transparent support for multi-core systems
  • Support for parameters configuration
  • Dynamic analysis of parameters
  • And many, many more…
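For readers who have not met the three search methods named in the feature list, here is a minimal sketch of how each one chooses which record pairs to compare. This is not FRIL's code or API (I haven't used FRIL). The function names, the sample records, and the choice of keys are all my own assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample data, invented for this sketch.
records = [
    {"name": "Anna Berg", "city": "Oslo"},     # index 0
    {"name": "John Smith", "city": "Bergen"},  # index 1
    {"name": "Ana Berg", "city": "Oslo"},      # index 2
    {"name": "Jon Smith", "city": "Bergen"},   # index 3
]


def nested_loop_pairs(records):
    """Nested loop join: every record is compared with every other record."""
    return set(combinations(range(len(records)), 2))


def blocking_pairs(records, block_key):
    """Blocking: only records that share the same block key are compared."""
    blocks = defaultdict(list)
    for i, rec in enumerate(records):
        blocks[block_key(rec)].append(i)
    pairs = set()
    for ids in blocks.values():
        # ids are in ascending order, so each pair is (smaller, larger).
        pairs.update(combinations(ids, 2))
    return pairs


def sorted_neighborhood_pairs(records, sort_key, window=2):
    """Sorted neighborhood: sort by a key, then compare each record only
    with the next (window - 1) records in sorted order."""
    order = sorted(range(len(records)), key=lambda i: sort_key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1:pos + window]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


all_pairs = nested_loop_pairs(records)
city_pairs = blocking_pairs(records, lambda r: r["city"])
name_pairs = sorted_neighborhood_pairs(records, lambda r: r["name"].lower())
```

Each function returns only candidate pairs of record indexes. A real tool would then score every candidate pair with some similarity measure. The trade-off is that the nested loop misses nothing but grows quadratically, while blocking and sorted neighborhood compare far fewer pairs at the risk of never looking at a true duplicate that landed in another block or too far away in the sort order.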

I haven’t used FRIL but do note that it has documentation, videos, etc. for user instruction.

I have reservations about record linkage in general, but those are concerns about re-use of semantic mappings and not record linkage per se.

3 Comments

  1. I found FRIL and thought it interesting enough to put the user manual on my Kindle. It’s still in my reading queue, though. It’s possible we could have used FRIL, but by the time I found it, work on Duke had already started.

    As I note in the presentation, the number of names for deduplication is a problem. I searched for “identity resolution” tools initially. I only stumbled across the record linkage concept later.

    Anyway, I will study FRIL carefully.

    Comment by larsga@garshol.priv.no — April 14, 2011 @ 2:22 pm

  2. I need to dig out a paper I did with Newcomb on record linkage. I don’t think we found all the names but it was 20+ as I recall.

    Including two full mathematical models for it, developed completely independently.

    Talk about needing a topic map! 😉

    Comment by Patrick Durusau — April 14, 2011 @ 2:48 pm

  3. Amusingly, the field of record linkage apparently originated with a Dr. Newcombe. 🙂

    And, yes, it is sort of ironic that the concept of deduplication should be plagued with so many duplicate names.

    Comment by larsga@garshol.priv.no — April 15, 2011 @ 12:33 pm
