From the project page:
Duke is a fast and flexible deduplication (or entity resolution, or record linkage) engine written in Java on top of Lucene. The latest version is 1.0 (see ReleaseNotes).
- High performance.
- Highly configurable.
- Support for CSV, JDBC, SPARQL, and NTriples DataSources.
- Many built-in comparators.
- Plug in your own data sources, comparators, and cleaners.
- Command-line client for getting started.
- API for embedding into any kind of application.
- Support for batch processing and continuous processing.
- Can maintain database of links found via JNDI/JDBC.
- Can run in multiple threads.
The GettingStarted page explains how to get started and has links to further documentation. This blog post describes the basic approach taken to match records. It does not deal with the Lucene-based lookup, but describes an early, slow O(n^2) prototype. This early presentation describes the ideas behind the engine and the intended architecture; a later and more up-to-date presentation has more practical detail and examples. There's also the ExamplesOfUse page, which lists real examples of using Duke, complete with data and configurations.
Excellent news on the data deduplication front!
And for topic map authors as well (see the examples).
Kudos to Lars Marius Garshol!
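
For readers wondering what the "slow O(n^2) prototype" mentioned in the project description amounts to, here is a minimal sketch of naive pairwise matching. It is not Duke's code and does not use Duke's API: the class and method names are hypothetical, Levenshtein edit distance stands in for whatever comparators the real engine uses, and the 0.85 threshold is an arbitrary assumption. It only illustrates why comparing every record against every other record gets slow, which is the cost a Lucene-based lookup is meant to avoid.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names come from Duke.
class NaivePairwiseDedup {

    // Levenshtein edit distance, computed with two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity in [0, 1]; 1.0 means the strings are identical.
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0;
        }
        return 1.0 - (double) editDistance(a, b) / longest;
    }

    // A very small "cleaner": trim whitespace and lowercase.
    static String normalize(String s) {
        return s.trim().toLowerCase();
    }

    // Compares every record with every later record: n*(n-1)/2 comparisons.
    // Returns index pairs whose similarity reaches the (assumed) threshold.
    static List<int[]> findDuplicates(List<String> records, double threshold) {
        List<int[]> matches = new ArrayList<>();
        for (int i = 0; i < records.size(); i++) {
            for (int j = i + 1; j < records.size(); j++) {
                double score = similarity(normalize(records.get(i)),
                                          normalize(records.get(j)));
                if (score >= threshold) {
                    matches.add(new int[] {i, j});
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("Acme Corp", "acme corp.", "Globex");
        for (int[] pair : findDuplicates(records, 0.85)) {
            System.out.println("Possible duplicate: " + records.get(pair[0])
                    + " <-> " + records.get(pair[1]));
        }
    }
}
```

With 1,000 records the nested loop makes roughly half a million comparisons, and with a million records roughly half a trillion, so an index that narrows the candidate set before comparing is what makes this practical at scale.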