Identifiers vs. Identifications?

One problem with topic map rhetoric has been its focus on identifiers (the flat ones):

[Image: identifier2]

rather than saying topic maps are managing subject identifications, that is, making explicit what is represented by an expectant identifier:

[Image: identifier-pregnant]

For processing purposes it is handy to map between identifiers, to query identifiers, and to access by identifiers, to mention only a few tasks, all of them machine facing.

However efficient flat identifiers may be to use (even for humans), having access to a bundle of properties thought to identify a subject is useful as well.

Topic maps already capture identifiers but their syntaxes need to be extended to support the capturing of subject identifications along with identifiers.
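To make the pairing concrete, here is a minimal sketch of a flat identifier carried together with the bundle of properties that identifies its subject. It is not any existing topic map syntax: the `Subject` class, the `register` and `describe` helpers, and the example.org URL are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Subject:
    # The flat, machine-facing identifier (hypothetical URL form).
    identifier: str
    # The identification: properties thought to identify the subject.
    identification: dict = field(default_factory=dict)


# Index by identifier so machine-facing lookups stay cheap.
by_identifier = {}


def register(subject):
    """Store a subject under its flat identifier."""
    by_identifier[subject.identifier] = subject


def describe(identifier):
    """Return the property bundle behind a flat identifier."""
    return by_identifier[identifier].identification


register(Subject(
    identifier="http://example.org/id/duke",  # assumed, illustrative only
    identification={
        "name": "Duke",
        "kind": "deduplication engine",
        "written in": "Java",
    },
))
```

Machines can keep mapping and querying on the identifier alone, while a reader who does not "know" the identifier can call `describe` to see what it is meant to represent.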

Years of reading have gone into the realization about identifiers and their relationship to identifications, but I would be remiss if I didn’t call out the work of Lars Marius Garshol on Duke.

From the GitHub page:

Duke is a fast and flexible deduplication (or entity resolution, or record linkage) engine written in Java on top of Lucene. The latest version is 1.2 (see ReleaseNotes).

Duke can find duplicate customer records, or other kinds of records in your database. Or you can use it to connect records in one data set with other records representing the same thing in another data set. Duke has sophisticated comparators that can handle spelling differences, numbers, geopositions, and more. Using a probabilistic model Duke can handle noisy data with good accuracy.

In an early post on Duke Lars observes:


The basic idea is almost ridiculously simple: you pick a set of properties which provide clues for establishing identity. To compare two records you compare the properties pairwise and for each you conclude from the evidence in that property alone the probability that the two records represent the same real-world thing. Bayesian inference is then used to turn the set of probabilities from all the properties into a single probability for that pair of records. If the result is above a threshold you define, then you consider them duplicates.

Bayesian identity resolution
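The steps in that passage can be sketched in a few lines. This is not Duke's code or API, only one common way of combining per-property probabilities (a naive Bayes combination that assumes independent properties and an even prior). The `CLUES` table, its probabilities, and the `THRESHOLD` value are made-up assumptions, and plain equality stands in for Duke's more sophisticated comparators.

```python
from functools import reduce

# Hypothetical evidence per property:
# (probability of same subject if values agree, probability if they differ).
CLUES = {
    "name": (0.9, 0.3),
    "email": (0.95, 0.4),
    "city": (0.6, 0.45),
}
THRESHOLD = 0.8  # assumed cut-off; you would tune this for your data


def combine(p1, p2):
    """Combine two probabilities, assuming independent evidence."""
    numerator = p1 * p2
    denominator = numerator + (1 - p1) * (1 - p2)
    # Contradictory certainties (1.0 vs 0.0) give 0/0; fall back to "unknown".
    return 0.5 if denominator == 0 else numerator / denominator


def match_probability(a, b):
    """Compare two records property by property, then combine the results."""
    probs = []
    for prop, (high, low) in CLUES.items():
        if prop in a and prop in b:
            probs.append(high if a[prop] == b[prop] else low)
    # 0.5 is the neutral starting point: combine(0.5, p) returns p.
    return reduce(combine, probs, 0.5)


def is_duplicate(a, b):
    """Records above the threshold are treated as the same subject."""
    return match_probability(a, b) > THRESHOLD
```

Two records that agree on name and email but differ on city still land well above the threshold here, because the weak evidence from the city is outweighed by the strong evidence from the other two properties.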

Only two quibbles with Lars on that passage:

I would delete “same real-world thing” and substitute, “any subject you want to talk about.”

I would point out that Bayesian inference is only one means of determining if two or more sets of properties represent the same subject. Defining sets of matching properties comes to mind, as does inferencing based on relationships (associations). “Ask Steve” is another.
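As an illustration of the "sets of matching properties" alternative, here is a minimal sketch in which two property bundles represent the same subject whenever every property in some declared set agrees. The rule sets and the property names (`email`, `name`, `birth_date`) are hypothetical, not drawn from any topic map standard.

```python
# Hypothetical rules: each set lists properties that must ALL agree.
MATCH_RULES = [
    {"email"},
    {"name", "birth_date"},
]


def same_subject(a, b, rules=MATCH_RULES):
    """Return True if any rule's properties are present and equal in both."""
    for rule in rules:
        if all(p in a and p in b and a[p] == b[p] for p in rule):
            return True
    return False
```

Unlike the probabilistic approach, there is no threshold to tune: the rules either fire or they do not, which makes the matching (read "merging") decision easy to explain and to record.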

But, I have never heard a truer statement from Lars than:

The basic idea is almost ridiculously simple: you pick a set of properties which provide clues for establishing identity.

Many questions remain: How to provide for collections of sets “of properties which provide clues for establishing identity”? How to make those collections extensible? How to provide for constraints on such sets? Where to record “matching” (read “merging”) rules? What other advantages can be offered?

In answering those questions, I think we need to keep in mind that identifiers and identifications lie along a continuum that runs from where we “know” what is meant by an identifier to where we ourselves need a full identification to know what is being discussed. A useful answer won’t be one or the other, but a pairing that suits a particular circumstance and use case.
