Archive for the ‘Duke’ Category

Identifiers vs. Identifications?

Monday, June 1st, 2015

One problem with topic map rhetoric has been its focus on identifiers (the flat ones):

identifier2

rather than saying topic maps manage subject identifications, that is, making explicit what is represented by an expectant identifier:

identifier-pregnant

For processing purposes it is handy to map between identifiers, to query identifiers, access by identifiers, to mention only a few tasks, and all of them are machine facing.

However efficient it may be to use flat identifiers (even by humans), having access to a bundle of properties thought to identify a subject is useful as well.

Topic maps already capture identifiers but their syntaxes need to be extended to support the capturing of subject identifications along with identifiers.

Years of reading have gone into this realization about identifiers and their relationship to identifications, but I would be remiss if I didn’t call out the work of Lars Marius Garshol on Duke.

From the GitHub page:

Duke is a fast and flexible deduplication (or entity resolution, or record linkage) engine written in Java on top of Lucene. The latest version is 1.2 (see ReleaseNotes).

Duke can find duplicate customer records, or other kinds of records in your database. Or you can use it to connect records in one data set with other records representing the same thing in another data set. Duke has sophisticated comparators that can handle spelling differences, numbers, geopositions, and more. Using a probabilistic model Duke can handle noisy data with good accuracy.

In an early post on Duke Lars observes:


The basic idea is almost ridiculously simple: you pick a set of properties which provide clues for establishing identity. To compare two records you compare the properties pairwise and for each you conclude from the evidence in that property alone the probability that the two records represent the same real-world thing. Bayesian inference is then used to turn the set of probabilities from all the properties into a single probability for that pair of records. If the result is above a threshold you define, then you consider them duplicates.

Bayesian identity resolution
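To make the quoted idea concrete, here is a minimal sketch of folding per-property probabilities into a single probability for a pair of records. It assumes the properties are independent and starts from a neutral prior of 0.5; the function names, the example probabilities, and the 0.8 threshold are illustrative assumptions, not Duke's actual code or defaults.

```python
from functools import reduce


def combine(p1: float, p2: float) -> float:
    """Combine two match probabilities with Bayes' rule, assuming independence.

    Note: raises ZeroDivisionError if one input is exactly 1.0 and the other 0.0.
    """
    numerator = p1 * p2
    return numerator / (numerator + (1.0 - p1) * (1.0 - p2))


def match_probability(property_probs):
    """Fold the per-property probabilities into one probability for the pair."""
    # 0.5 is the neutral starting point: no evidence either way.
    return reduce(combine, property_probs, 0.5)


def is_duplicate(property_probs, threshold=0.8):
    """Treat the pair as duplicates when the combined probability beats the threshold."""
    return match_probability(property_probs) > threshold


# Hypothetical evidence for one pair of records:
# names look alike (0.9), cities match (0.6), phone numbers differ (0.3).
probs = [0.9, 0.6, 0.3]
print(round(match_probability(probs), 3))  # 0.853
print(is_duplicate(probs))                 # True
```

A probability of 0.5 leaves the running result unchanged, values above 0.5 push it toward a match, and values below 0.5 push it away.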

Only two quibbles with Lars on that passage:

I would delete “same real-world thing” and substitute, “any subject you want to talk about.”

I would point out that Bayesian inference is only one means of determining whether two or more sets of properties represent the same subject. Defining sets of matching properties comes to mind, as does inferencing based on relationships (associations). “Ask Steve” is another.
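As an illustration of the non-Bayesian alternative of defining sets of matching properties, the sketch below treats two records as the same subject when they agree on every property in at least one declared set. The record layout, property names, and key sets are hypothetical, chosen only for the example.

```python
def same_subject(a: dict, b: dict, key_sets) -> bool:
    """True if both records carry, and agree on, every property in some key set."""
    for keys in key_sets:
        # Skip empty key sets; they would otherwise match any two records.
        if keys and all(k in a and k in b and a[k] == b[k] for k in keys):
            return True
    return False


# Hypothetical identification rules: an ISBN alone identifies a book,
# and so does the combination of title and author.
KEY_SETS = [("isbn",), ("title", "author")]

rec1 = {"isbn": "0-00-000000-0", "title": "Topic Maps", "author": "J. Doe"}
rec2 = {"title": "Topic Maps", "author": "J. Doe", "year": 2008}
rec3 = {"title": "Topic Maps", "author": "A. Nother"}

print(same_subject(rec1, rec2, KEY_SETS))  # True: title and author agree
print(same_subject(rec1, rec3, KEY_SETS))  # False: no key set fully agrees
```

Recording the key sets as data rather than code is one way to keep such rules inspectable and extensible.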

But, I have never heard a truer statement from Lars than:

The basic idea is almost ridiculously simple: you pick a set of properties which provide clues for establishing identity.

Many questions remain: How to provide for collections of sets “of properties which provide clues for establishing identity”? How to make those collections extensible? How to provide for constraints on such sets? Where to record “matching” (read “merging”) rules? What other advantages can be offered?

In answering those questions, I think we need to keep in mind that identifiers and identifications lie along a continuum that runs from where we “know” what is meant by an identifier to where we ourselves need a full identification to know what is being discussed. A useful answer won’t be one or the other, but a pairing that suits a particular circumstance and use case.

Jellyfish

Saturday, September 6th, 2014

Jellyfish by James Turk and Michael Stephens.

From the webpage:

Jellyfish is a python library for doing approximate and phonetic matching of strings.

String comparison:

  • Levenshtein Distance
  • Damerau-Levenshtein Distance
  • Jaro Distance
  • Jaro-Winkler Distance
  • Match Rating Approach Comparison
  • Hamming Distance

Phonetic encoding:

  • American Soundex
  • Metaphone
  • NYSIIS (New York State Identification and Intelligence System)
  • Match Rating Codex
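For readers unfamiliar with the first measure on the list, here is a standard-library-only sketch of Levenshtein distance, plus one common way to turn it into a 0 to 1 similarity. Jellyfish provides its own implementation (`jellyfish.levenshtein_distance`); the functions below are written for illustration and are not taken from the library.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalize the distance by the longer length (an assumed, common convention)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("garshol", "garshoel"))  # 1
```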

You might want to consider the string matching offered by Duke (written on top of Lucene):

String comparators

  • Levenshtein
  • WeightedLevenshtein
  • JaroWinkler
  • QGramComparator

Simple comparators

  • ExactComparator
  • DifferentComparator

Specialized comparators

  • GeopositionComparator
  • NumericComparator
  • PersonNameComparator

Phonetic comparators

  • SoundexComparator
  • MetaphoneComparator
  • NorphoneComparator

Token set comparators

  • DiceCoefficientComparator
  • JaccardIndexComparator
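The token set comparators are named after two well-known set measures. The sketch below computes the plain Jaccard index and Dice coefficient over lowercased, whitespace-separated tokens; that tokenization and exact-match treatment of tokens are assumptions made for the example, and Duke's own comparators may differ in detail.

```python
def tokens(text: str) -> set:
    """Split on whitespace and lowercase; a deliberately simple tokenizer."""
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    """Shared tokens divided by all distinct tokens."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty values are treated as identical
    return len(ta & tb) / len(ta | tb)


def dice(a: str, b: str) -> float:
    """Twice the shared tokens divided by the sum of both token counts."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return 2 * len(ta & tb) / (len(ta) + len(tb))


# Token order does not matter, which helps with "Last First" vs "First Last".
print(dice("Lars Marius Garshol", "Garshol Lars"))  # 0.8
print(round(jaccard("Lars Marius Garshol", "Garshol Lars"), 3))  # 0.667
```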

Enjoy!

Duke 1.2 Released!

Sunday, February 16th, 2014

Lars Marius Garshol has released Duke 1.2!

From the homepage:

Duke is a fast and flexible deduplication (or entity resolution, or record linkage) engine written in Java on top of Lucene. The latest version is 1.2 (see ReleaseNotes).

Duke can find duplicate customer records, or other kinds of records in your database. Or you can use it to connect records in one data set with other records representing the same thing in another data set. Duke has sophisticated comparators that can handle spelling differences, numbers, geopositions, and more. Using a probabilistic model Duke can handle noisy data with good accuracy.

Features

  • High performance.
  • Highly configurable.
  • Support for CSV, JDBC, SPARQL, and NTriples.
  • Many built-in comparators.
  • Plug in your own data sources, comparators, and cleaners.
  • Genetic algorithm for automatically tuning configurations.
  • Command-line client for getting started.
  • API for embedding into any kind of application.
  • Support for batch processing and continuous processing.
  • Can maintain database of links found via JNDI/JDBC.
  • Can run in multiple threads.

The GettingStarted page explains how to get started and has links to further documentation. The examples of use page lists real examples of using Duke, complete with data and configurations. This presentation has more the big picture and background.

Excellent!

Until you know which two or more records are talking about the same subject, it’s very difficult to know what to map together.

elasticsearch-entity-resolution

Tuesday, December 24th, 2013

elasticsearch-entity-resolution

From the webpage:

This project is an interactive entity resolution plugin for Elasticsearch based on Duke. Basically, it uses Bayesian probabilities to compute probability. You can pretty much use it an interactive deduplication engine.

To understand basics, go to Duke project documentation.

A list of available comparators is available here.

Interesting pairing of Duke (entity resolution/record linkage software by Lars Marius Garshol) with ElasticSearch.

Strings and user search behavior can only take an indexing engine so far. This is a step in the right direction.

A step more likely to be followed under an Apache License than under its current LGPLv3.

Active learning, almost black magic

Tuesday, October 22nd, 2013

Active learning, almost black magic by Lars Marius Garshol.

From the post:

I’ve written Duke, an engine for figuring out which records represent the same thing. It works fine, but people find it difficult to configure correctly, which is not so strange. Getting the configurations right requires estimating probabilities and choosing between comparators like Levenshtein, Jaro-Winkler, and Dice coefficient. Can we get the computer to do something people cannot? It sounds like black magic, but it’s actually pretty simple.

I implemented a genetic algorithm that can set up a good configuration automatically. The genetic algorithm works by making lots of configurations, then removing the worst and making more of the best. The configurations that are kept are tweaked randomly, and the process is repeated over and over again. It’s dead simple, but it works fine. The problem is: how is the algorithm to know which configurations are the best? The obvious solution is to have test data that tells you which records should be linked, and which ones should not be linked.

But that leaves us with a bootstrapping problem. If you can create a set of test data big enough for this to work, and find all the correct links in that set, then you’re fine. But how do you find all the links? You can use Duke, but if you can set up Duke well enough to do that you don’t need the genetic algorithm. Can you do it in other ways? Maybe, but that’s hard work, quite possibly harder than just battling through the difficulties and creating a configuration.

So, what to do? For a year or so I was stuck here. I had something that worked, but it wasn’t really useful to anyone.

Then I came across a paper where Axel Ngonga described how to solve this problem with active learning. Basically, the idea is to pick some record pairs that perhaps should be linked, and ask the user whether they should be linked or not. There’s an enormous number of pairs we could ask the user about, but most of these pairs provide very little information. The trick is to select those pairs which teach the algorithm the most.
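One way to read "select those pairs which teach the algorithm the most" is to ask about the pairs on which the current candidate configurations disagree most. The sketch below assumes that heuristic; each "configuration" is reduced to a single similarity threshold, and all names and scores are hypothetical rather than taken from Duke or the paper.

```python
def threshold_classifier(threshold):
    """A toy configuration: link a pair when its score reaches the threshold."""
    def classify(score):
        return score >= threshold
    return classify


def disagreement(score, committee):
    """Size of the minority vote; 0 means every configuration agrees."""
    yes_votes = sum(1 for classify in committee if classify(score))
    return min(yes_votes, len(committee) - yes_votes)


def most_informative(pair_scores, committee, k=1):
    """Return the k pairs whose answers would settle the most disagreement."""
    ranked = sorted(pair_scores,
                    key=lambda pair: disagreement(pair_scores[pair], committee),
                    reverse=True)
    return ranked[:k]


# Four hypothetical configurations and three candidate record pairs.
committee = [threshold_classifier(t) for t in (0.3, 0.5, 0.6, 0.8)]
pair_scores = {("r1", "r2"): 0.95, ("r3", "r4"): 0.55, ("r5", "r6"): 0.10}

print(most_informative(pair_scores, committee))  # [('r3', 'r4')]
```

The obvious match and the obvious non-match tell the algorithm nothing new, so the borderline pair is the one worth a user's time.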

This is great stuff.

Particularly since I have a training problem that lacks a training set. 😉

Looking forward to trying this on “real-world problems” as Lars says.

Elasticsearch Entity Resolution

Thursday, September 12th, 2013

elasticsearch-entity-resolution by Yann Barraud.

From the webpage:

This project is an interactive entity resolution plugin for Elasticsearch based on Duke. Basically, it uses Bayesian probabilities to compute probability. You can pretty much use it an interactive deduplication engine.

It is usable as is, though cleaners are not yet implemented.

To understand basics, go to Duke project documentation.

A list of available comparators is available here.

Interactive deduplication? Now that sounds very useful for topic map authoring.

Appropriate that I saw this in a Tweet by Duke’s author, Lars Marius Garshol.

Duke 1.0 Release!

Monday, March 4th, 2013

Duke 1.0 Release!

From the project page:

Duke is a fast and flexible deduplication (or entity resolution, or record linkage) engine written in Java on top of Lucene. The latest version is 1.0 (see ReleaseNotes).

Features

  • High performance.
  • Highly configurable.
  • Support for CSV, JDBC, SPARQL, and NTriples DataSources.
  • Many built-in comparators.
  • Plug in your own data sources, comparators, and cleaners.
  • Command-line client for getting started.
  • API for embedding into any kind of application.
  • Support for batch processing and continuous processing.
  • Can maintain database of links found via JNDI/JDBC.
  • Can run in multiple threads.

The GettingStarted page explains how to get started and has links to further documentation. This blog post describes the basic approach taken to match records. It does not deal with the Lucene-based lookup, but describes an early, slow O(n^2) prototype. This early presentation describes the ideas behind the engine and the intended architecture; a later and more up to date presentation has more practical detail and examples. There's also the ExamplesOfUse page, which lists real examples of using Duke, complete with data and configurations.
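The difference between an O(n^2) prototype and an index-based lookup can be sketched in a few lines. The example below uses a plain in-memory inverted index as a stand-in for Lucene: only records that share at least one token become candidate pairs. The records and the tokenization are hypothetical, and this is not how Duke itself queries Lucene.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs(records):
    """The naive approach: compare every record with every other record."""
    return set(combinations(sorted(records), 2))


def candidate_pairs(records):
    """Index tokens, then pair up only the records that share a token."""
    index = defaultdict(set)
    for record_id, text in records.items():
        for token in text.lower().split():
            index[token].add(record_id)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


records = {
    1: "Acme Corp Oslo",
    2: "ACME Corporation Oslo",
    3: "Globex Bergen",
    4: "Globex AS Bergen",
    5: "Initech Trondheim",
}

print(len(all_pairs(records)))            # 10
print(sorted(candidate_pairs(records)))   # [(1, 2), (3, 4)]
```

Only the candidate pairs then need the more expensive property-by-property comparison.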

Excellent news on the data deduplication front!

And for topic map authors as well (see the examples).

Kudos to Lars Marius Garshol!

Duke 0.1 Release

Thursday, May 19th, 2011

Duke 0.1 Release

Lars Marius Garshol on Duke 0.1:

Duke is a fast and flexible deduplication (or entity resolution, or record linkage) engine written in Java on top of Lucene. At the moment (2011-04-07) it can process 1,000,000 records in 11 minutes on a standard laptop in a single thread.

Version 0.1 has been released, consisting of a command-line tool which can read CSV, JDBC, SPARQL, and NTriples data. There is also an API for programming incremental processing and storing the result of processing in a relational database.

The GettingStarted page explains how to get started and has links to further documentation. This blog post describes the basic approach taken to match records. It does not deal with the Lucene-based lookup, but describes an early, slow O(n^2) prototype. This presentation describes the ideas behind the engine and the intended architecture.

If you have questions, please contact the developer, Lars Marius Garshol, larsga at garshol.priv.no.

I will look around for sample data files.