Archive for the ‘Duke’ Category

Duke 1.0 Release!

Monday, March 4th, 2013

Duke 1.0 Release!

From the project page:

Duke is a fast and flexible deduplication (or entity resolution, or record linkage) engine written in Java on top of Lucene. The latest version is 1.0 (see ReleaseNotes).

Features

  • High performance.
  • Highly configurable.
  • Support for CSV, JDBC, SPARQL, and NTriples DataSources.
  • Many built-in comparators.
  • Plug in your own data sources, comparators, and cleaners.
  • Command-line client for getting started.
  • API for embedding into any kind of application.
  • Support for batch processing and continuous processing.
  • Can maintain database of links found via JNDI/JDBC.
  • Can run in multiple threads.

The GettingStarted page explains how to get started and has links to further documentation. This blog post describes the basic approach taken to match records. It does not deal with the Lucene-based lookup, but describes an early, slow O(n^2) prototype. This early presentation describes the ideas behind the engine and the intended architecture; a later and more up to date presentation has more practical detail and examples. There's also the ExamplesOfUse page, which lists real examples of using Duke, complete with data and configurations.

Excellent news on the data depulication front!

And for topic map authors as well (see the examples).

Kudos to Lars Marius Garshol!

Duke 0.1 Release

Thursday, May 19th, 2011

Duke 0.1 Release

Lars Marius Garshol on Duke 0.1:

Duke is a fast and flexible deduplication (or entity resolution, or record linkage) engine written in Java on top of Lucene. At the moment (2011-04-07) it can process 1,000,000 records in 11 minutes on a standard laptop in a single thread.

Version 0.1 has been released, consisting of a command-line tool which can read CSV, JDBC, SPARQL, and NTriples data. There is also an API for programming incremental processing and storing the result of processing in a relational database.

The GettingStarted page explains how to get started and has links to further documentation. This blog post describes the basic approach taken to match records. It does not deal with the Lucene-based lookup, but describes an early, slow O(n^2) prototype. This presentation describes the ideas behind the engine and the intended architecture.

If you have questions, please contact the developer, Lars Marius Garshol, larsga at garshol.priv.no.

I will look around for sample data files.