Efficient Multidimensional Blocking for Link Discovery without losing Recall
Jack Park did due diligence on the SILK materials before I did and forwarded a link to this paper.
Abstract:
Over the last three years, an increasing number of data providers have started to publish structured data according to the Linked Data principles on the Web. The resulting Web of Data currently consists of over 28 billion RDF triples. As the Web of Data grows, there is an increasing need for link discovery tools which scale to very large datasets. In record linkage, many partitioning methods have been proposed which substantially reduce the number of required entity comparisons. Unfortunately, most of these methods either lead to a decrease in recall or only work on metric spaces. We propose a novel blocking method called Multi-Block which uses a multidimensional index in which similar objects are located near each other. In each dimension the entities are indexed by a different property increasing the efficiency of the index significantly. In addition, it guarantees that no false dismissals can occur. Our approach works on complex link specifications which aggregate several different similarity measures. MultiBlock has been implemented as part of the Silk Link Discovery Framework. The evaluation shows a speedup factor of several 100 for large datasets compared to the full evaluation without losing recall.
From deeper in the paper:
If the similarity between two entities exceeds a threshold $\theta$, a link between these two entities is generated. $sim$ is computed by evaluating a link specification $s$ (in record linkage typically called linkage decision rule [23]) which specifies the conditions two entities must fulfill in order to be interlinked.
If I am reading this paper correctly, there isn’t a requirement (as in record linkage) that we normalized the data to a common format before writing the rule for comparisons. That in and of itself is a major boon. To say nothing of the other contributions of this paper.