Lars Marius Garshol on Duke 0.1:
Duke is a fast and flexible deduplication (or entity resolution, or record linkage) engine written in Java on top of Lucene. At the moment (2011-04-07) it can process 1,000,000 records in 11 minutes on a standard laptop in a single thread.
Version 0.1 has been released, consisting of a command-line tool which can read CSV, JDBC, SPARQL, and NTriples data. There is also an API for programming incremental processing and storing the result of processing in a relational database.
The GettingStarted page explains how to get started and links to further documentation. A blog post describes the basic approach taken to match records; it does not cover the Lucene-based lookup, but describes an early, slow O(n^2) prototype. A presentation describes the ideas behind the engine and the intended architecture.
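The announcement does not show the O(n^2) prototype itself. To give a feel for what comparing every record against every other record involves, here is a minimal Java sketch. It is not Duke's code or API: the class name, the simple field-equality similarity measure, and the 0.8 threshold are all assumptions made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Duke.
final class NaivePairwiseDedup {

    // Fraction of the fields present in both records whose normalized values are equal.
    static double similarity(Map<String, String> a, Map<String, String> b) {
        int shared = 0;
        int equal = 0;
        for (Map.Entry<String, String> e : a.entrySet()) {
            String other = b.get(e.getKey());
            if (other == null) {
                continue; // field missing in b: ignore it
            }
            shared++;
            if (normalize(e.getValue()).equals(normalize(other))) {
                equal++;
            }
        }
        return shared == 0 ? 0.0 : (double) equal / shared;
    }

    // Trim whitespace and lower-case so trivial variations still match.
    static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }

    // Compare every pair (i, j) with i < j: n(n-1)/2 comparisons, hence O(n^2).
    static List<int[]> findDuplicates(List<Map<String, String>> records, double threshold) {
        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < records.size(); i++) {
            for (int j = i + 1; j < records.size(); j++) {
                if (similarity(records.get(i), records.get(j)) >= threshold) {
                    pairs.add(new int[] {i, j});
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<Map<String, String>> records = List.of(
            Map.of("name", "Acme AS", "city", "Oslo"),
            Map.of("name", " acme as", "city", "OSLO"),
            Map.of("name", "Other Ltd", "city", "Bergen"));
        for (int[] p : findDuplicates(records, 0.8)) {
            System.out.println("possible duplicate: " + p[0] + " and " + p[1]);
        }
    }
}
```

With 1,000,000 records the nested loop would make 499,999,500,000 comparisons, which suggests why an index-based lookup such as the Lucene one mentioned above is needed to narrow down the candidates first.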
If you have questions, please contact the developer, Lars Marius Garshol, larsga at garshol.priv.no.
I will look around for sample data files.
Sample data files would be deeply appreciated. I’ve found one example (the million song dataset), but it’s 280 GB, and so doesn’t fit on my hard drive … 🙁
Comment by larsga@garshol.priv.no — May 20, 2011 @ 2:50 am
Will have to look around. Most of the data sets that I have seen are in medical care and so have privacy concerns. But I am sure there are others.
I am familiar with the million song dataset. More on that anon.
Comment by Patrick Durusau — May 20, 2011 @ 3:07 pm
[…] Marius asked about some test data files for his Duke 0.1 […]
Pingback by Integrated Public Use Microdata Series (IPUMS-USA) « Another Word For It — May 20, 2011 @ 4:08 pm