Lars Marius Garshol on Duke 0.1:
Duke is a fast and flexible deduplication (also known as entity resolution or record linkage) engine written in Java on top of Lucene. As of 2011-04-07 it can process 1,000,000 records in 11 minutes on a standard laptop in a single thread.
Version 0.1 has been released. It consists of a command-line tool which can read CSV, JDBC, SPARQL, and NTriples data. There is also an API for incremental processing and for storing the results in a relational database.
The GettingStarted page explains how to get started and has links to further documentation. A blog post describes the basic approach taken to match records; it does not cover the Lucene-based lookup, but describes an early, slow O(n^2) prototype that compares every record with every other record. A presentation describes the ideas behind the engine and the intended architecture.
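To make the O(n^2) cost concrete, here is a minimal sketch of all-pairs matching. It is not Duke's code (Duke is written in Java; Python is used here only for brevity), and it is not the method from the blog post. The function names, the sample records, the 0.85 threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Average string similarity (0.0 to 1.0) over the fields both records share.

    Hypothetical stand-in for a real record comparator.
    """
    shared = [key for key in a if key in b]
    if not shared:
        return 0.0
    ratios = [
        SequenceMatcher(None, a[key].lower(), b[key].lower()).ratio()
        for key in shared
    ]
    return sum(ratios) / len(ratios)


def find_duplicates(records, threshold=0.85):
    """Naive matching: compare every pair once, n * (n - 1) / 2 comparisons."""
    matches = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        if similarity(a, b) >= threshold:
            matches.append((i, j))
    return matches


def pair_count(n):
    """Number of comparisons the naive approach needs for n records."""
    return n * (n - 1) // 2


# Made-up sample data, purely illustrative.
records = [
    {"name": "Acme Corporation", "city": "Oslo"},
    {"name": "ACME Corporation", "city": "Oslo"},
    {"name": "Globex", "city": "Bergen"},
]
duplicates = find_duplicates(records)  # [(0, 1)]
```

For 1,000,000 records, `pair_count` gives 499,999,500,000 comparisons, which is why an all-pairs prototype is slow and why an index-based lookup that retrieves only plausible candidates for each record matters at that scale.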
If you have questions, please contact the developer, Lars Marius Garshol, larsga at garshol.priv.no.
I will look around for sample data files.