Using Graph Structure Record Linkage on Irish Census Data with Neo4j by Brian Underwood.
From the post:
For just over a year I’ve been obsessed on-and-off with a project ever since I stayed in the town of Skibbereen, Ireland. Taking data from the 1901 and 1911 Irish censuses, I hoped I would be able to find a way to reliably link resident records from the two together to identify the same residents.
Since then I’ve learned a bit about master data management and record linkage and so I thought I would give it another stab.
Here I’d like to talk about how I’ve been matching records based on the local data space around objects to improve my record linkage scoring.
…
An interesting issue, and one with currency for intelligence agencies slurping up digital debris at every opportunity. So you have trillions of records. Which ones have you been able to reliably match up?
From a topic map perspective, I could not help but notice that in the 1901 census, the categories for Marriage were:
- Married
- Widower
- Widow
- Not Married
Whereas the 1911 census records:
- Married
- Widower
- Widow
- Single
As you know, one of the steps in record linkage is normalization of the data headings and values before you apply the standard techniques to link records together.
In traditional record linkage, the shift from “not married” to “single” is lost in the normalization.
That may not be meaningful for your use case, but it could be important for someone studying shifts in marital relationship language. Or shifts in religious, ethnic, or racist language.
Or for that matter, shifts in the names of database column headers and/or tables. (Like anyone thinks those are stable.)
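One way to avoid losing that shift is to keep the original value alongside the normalized one. What follows is only a minimal sketch in Python, not anything from Brian's post: the mapping table, the canonical codes, and the names `NormalizedValue` and `normalize_marital_status` are all assumptions made up for illustration.

```python
from dataclasses import dataclass

# Hypothetical mapping from census wording to one canonical code.
# The codes on the right are invented for this sketch.
MARITAL_STATUS = {
    "married": "MARRIED",
    "widower": "WIDOWER",
    "widow": "WIDOW",
    "not married": "UNMARRIED",  # 1901 wording
    "single": "UNMARRIED",       # 1911 wording
}


@dataclass(frozen=True)
class NormalizedValue:
    canonical: str    # value used when linking records
    original: str     # value exactly as the census recorded it
    census_year: int  # which census the original wording came from


def normalize_marital_status(raw: str, census_year: int) -> NormalizedValue:
    """Normalize a marital status without discarding the source wording."""
    key = " ".join(raw.split()).lower()  # collapse whitespace, ignore case
    canonical = MARITAL_STATUS.get(key, "UNKNOWN")
    return NormalizedValue(canonical, raw, census_year)


status_1901 = normalize_marital_status("Not Married", 1901)
status_1911 = normalize_marital_status("Single", 1911)

# The two records agree for linkage purposes...
print(status_1901.canonical == status_1911.canonical)
# ...while the change in wording is still there to be studied.
print(status_1901.original, "->", status_1911.original)
```

The linkage step only needs to look at `canonical`. Anyone interested in the change of language between the two censuses can still query `original` and `census_year`.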
Pay close attention to how Brian models similarity candidates.
Once you move beyond string-equivalent identifiers (as in the TMDM, the Topic Maps Data Model), you are going to be facing the same issues.
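To make the contrast with string equivalence concrete, here is a rough Python sketch of scoring two household records by what surrounds them rather than by an identifier. To be clear, this is not Brian's model. It assumes Jaccard similarity over the first names in a household as a stand-in for "the local data space around objects", and the function names and sample names are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 0.0  # two empty sets carry no evidence either way
    return len(a & b) / len(a | b)


def household_similarity(names_1901: list, names_1911: list) -> float:
    """Score two households by the overlap of their members' first names."""
    first = {name.strip().lower() for name in names_1901}
    second = {name.strip().lower() for name in names_1911}
    return jaccard(first, second)


# Invented sample data: two names in common, four distinct names overall.
score = household_similarity(
    ["Mary", "John", "Ellen"],
    ["mary", "john", "Patrick"],
)
print(score)
```

A score like this is a candidate for review, not a verdict. Where to set the threshold, and what else to weigh alongside it, is exactly the modeling question.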