SERIMI – Resource Description Similarity, RDF Instance Matching and Interlinking
From the website:
The interlinking of datasets published in the Linked Data Cloud is a challenging problem and a key factor for the success of the Semantic Web. Manual rule-based methods are the most effective solution to the problem, but they require skilled human data publishers to go through a laborious, error-prone and time-consuming process of manually describing rules that map instances between two datasets. Thus, an automatic approach to solving this problem is more than welcome. We propose a novel interlinking method, SERIMI, for solving this problem automatically. SERIMI matches instances between a source and a target dataset without prior knowledge of the data, domain or schema of these datasets. Experiments conducted with benchmark collections demonstrate that our approach considerably outperforms state-of-the-art automatic approaches for solving the interlinking problem on the Linked Data Cloud.
From the Results section:
The poor performance of SERIMI on the Restaurant1-Restaurant2 pair is mainly due to missing alignments in the reference set. The poor performance on the Person21-Person22 pair is due to the nature of the data. These datasets were built by adding spelling mistakes to the property and literal values of their original datasets. Also, only instances of the class Person were retrieved into the pseudo-homonym sets during the interlinking process.
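Not SERIMI itself, but a minimal sketch of why injected spelling mistakes hurt purely string-based candidate selection (plain Python; the URIs, labels and threshold are hypothetical): once typos push the similarity of the true match below whatever cut-off is used, the link is simply lost.

```python
from difflib import SequenceMatcher

def label_similarity(a: str, b: str) -> float:
    """Normalized similarity between two literal values (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical source instances (clean labels) and target instances
# (labels with injected typos), loosely in the spirit of Person21-Person22.
source = {"person/1": "Jonathan Smith", "person/2": "Maria Gonzalez"}
target = {"person/a": "Jonathn Smmith", "person/b": "Marai Gonzalz", "person/c": "John Smythe"}

THRESHOLD = 0.85  # arbitrary cut-off, purely for illustration

for s_uri, s_label in source.items():
    # Rank target instances by label similarity and keep the best candidate.
    best_uri, best_score = max(
        ((t_uri, label_similarity(s_label, t_label)) for t_uri, t_label in target.items()),
        key=lambda pair: pair[1],
    )
    verdict = "match" if best_score >= THRESHOLD else "no confident match"
    print(f"{s_uri} -> {best_uri} ({best_score:.2f}): {verdict}")
```

SERIMI's actual selection is more involved (it works on resource description similarity over pseudo-homonym sets, as described above), so treat this only as a back-of-the-envelope illustration of the dirty-data issue.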
Impressive work overall, but isn’t dirty data really the test? Just about any process can succeed with clean data.
Or is that really the weakness of the Semantic Web? That it requires clean data?
A really interesting find! I’ve downloaded and queued the paper for reading.
I agree that if it requires clean data to work, there aren’t going to be many datasets it can be applied to, but there might still be useful techniques in the paper that can be borrowed.
Comment by larsga@garshol.priv.no — August 26, 2011 @ 4:55 am
Hard to say how “clean” the data has to be; a better term might have been “close,” as in not too semantically distant.
But I suspect that is always going to be the case. That is, machines can be trained to handle the easy cases, but if the data is really “dirty,” then it will require the human touch.
Comment by Patrick Durusau — August 26, 2011 @ 8:32 am