Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

February 16, 2014

Duke 1.2 Released!

Filed under: Duke,Entity Resolution,Record Linkage — Patrick Durusau @ 8:34 pm

Lars Marius Garshol has released Duke 1.2!

From the homepage:

Duke is a fast and flexible deduplication (or entity resolution, or record linkage) engine written in Java on top of Lucene. The latest version is 1.2 (see ReleaseNotes).

Duke can find duplicate customer records, or other kinds of records in your database. Or you can use it to connect records in one data set with other records representing the same thing in another data set. Duke has sophisticated comparators that can handle spelling differences, numbers, geopositions, and more. Using a probabilistic model, Duke can handle noisy data with good accuracy.

Features

  • High performance.
  • Highly configurable.
  • Support for CSV, JDBC, SPARQL, and NTriples.
  • Many built-in comparators.
  • Plug in your own data sources, comparators, and cleaners.
  • Genetic algorithm for automatically tuning configurations.
  • Command-line client for getting started.
  • API for embedding into any kind of application.
  • Support for batch processing and continuous processing.
  • Can maintain database of links found via JNDI/JDBC.
  • Can run in multiple threads.

The GettingStarted page explains how to get started and has links to further documentation. The examples of use page lists real examples of using Duke, complete with data and configurations. This presentation has more on the big picture and background.

Excellent!

Until you know which two or more records are talking about the same subject, it’s very difficult to know what to map together.
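
To make the idea of comparators feeding a probabilistic model a bit more concrete, here is a minimal, self-contained Java sketch. It does not use Duke's API. The class name MatchSketch, the per-field low/high probabilities, and the linear mapping from similarity to probability are all assumptions made for illustration. The sketch scores each field with an edit-distance similarity, turns that into a probability, and folds the per-field probabilities together with a naive Bayes-style update.

```java
// Illustrative only: NOT Duke's API. All names and numbers are hypothetical.
final class MatchSketch {

    // Classic Levenshtein edit distance, computed with two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity in [0, 1]: 1.0 for identical strings, 0.0 for nothing in common.
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0;
        }
        return 1.0 - (double) editDistance(a, b) / longest;
    }

    // Assumed mapping: interpolate linearly between the field's low and high
    // probabilities. Other mappings are possible.
    static double toProbability(double sim, double low, double high) {
        return low + sim * (high - low);
    }

    // Naive Bayes-style combination of two independent probability estimates.
    static double combine(double p1, double p2) {
        return (p1 * p2) / (p1 * p2 + (1.0 - p1) * (1.0 - p2));
    }

    // Overall probability that two records describe the same thing,
    // starting from a neutral prior of 0.5.
    static double matchProbability(String[] r1, String[] r2,
                                   double[] low, double[] high) {
        double p = 0.5;
        for (int i = 0; i < r1.length; i++) {
            double sim = similarity(r1[i], r2[i]);
            p = combine(p, toProbability(sim, low[i], high[i]));
        }
        return p;
    }

    public static void main(String[] args) {
        // Fields: name, city. The low/high values are made up for the example.
        double[] low = {0.3, 0.4};
        double[] high = {0.9, 0.8};
        String[] a = {"Jon Smith", "Oslo"};
        String[] b = {"John Smith", "Oslo"};
        System.out.println("match probability: "
                + matchProbability(a, b, low, high));
    }
}
```

A pair of records would then be treated as the same subject when the combined probability clears some chosen threshold. Picking that threshold and the per-field probabilities is the kind of tuning the feature list above says Duke can automate with a genetic algorithm.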
