A new release of Duke, the deduplication software written in Java on top of Lucene by Lars Marius Garshol.
From the release notes:
This version of Duke introduces:
- Added JNDI data source for connecting to databases via JNDI (thanks to FMitzlaff).
- In-memory data source added (thanks to FMitzlaff).
- Record linkage mode now more flexible: can implement different strategies for choosing optimal links (with FMitzlaff).
- Record linkage API refactored slightly to be more flexible (with FMitzlaff).
- Added utilities for building equivalence classes from Duke output.
- Made the XML config loader more robust.
- Added a special cleaner for English person names.
- Fixed bug in NumericComparator (issue 66).
- Uses own Lucene query parser to avoid issues with search strings.
- Upgraded to Lucene 3.5.0.
- Added many more tests.
- Many small bug fixes to core, NTriples reader, etc.
BTW, the documentation is online only: http://code.google.com/p/duke/wiki/GettingStarted.
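For readers wondering what "building equivalence classes from Duke output" involves: a deduplication run yields pairwise links ("record A matches record B"), and grouping them into equivalence classes means merging every chain of links into one set. The release notes do not describe how Duke's utilities do this, so the following is only a minimal sketch using a union-find structure. The class `EquivalenceClasses` and its methods `link`, `find`, and `classes` are hypothetical names for illustration, not part of Duke's API.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, NOT Duke's API: groups record IDs into
// equivalence classes from pairwise "these two records match" links.
class EquivalenceClasses {
    // Maps each record ID to its parent in the union-find forest.
    private final Map<String, String> parent = new HashMap<>();

    // Returns the representative of id's class, registering id if unseen.
    public String find(String id) {
        parent.putIfAbsent(id, id);
        String root = id;
        while (!root.equals(parent.get(root))) {
            root = parent.get(root);
        }
        // Path compression: point every node on the path at the root.
        String current = id;
        while (!current.equals(root)) {
            String next = parent.get(current);
            parent.put(current, root);
            current = next;
        }
        return root;
    }

    // Records that a and b were linked, merging their classes.
    public void link(String a, String b) {
        String rootA = find(a);
        String rootB = find(b);
        if (!rootA.equals(rootB)) {
            parent.put(rootA, rootB);
        }
    }

    // Returns all classes, each as a list of record IDs.
    public Collection<List<String>> classes() {
        Map<String, List<String>> byRoot = new HashMap<>();
        // Iterate over a copy of the keys, since find() updates the map.
        for (String id : new ArrayList<>(parent.keySet())) {
            byRoot.computeIfAbsent(find(id), k -> new ArrayList<>()).add(id);
        }
        return byRoot.values();
    }

    public static void main(String[] args) {
        EquivalenceClasses ec = new EquivalenceClasses();
        ec.link("a", "b");
        ec.link("b", "c"); // a, b and c end up in one class
        ec.link("x", "y");
        System.out.println(ec.classes().size()); // prints 2
    }
}
```

In this sketch the record IDs are plain strings; with real Duke output you would feed in whatever identifiers your match listener reports for each linked pair.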