SERIMI…. (Have you washed your data?)

Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

August 25, 2011

Filed under: Linked Data,LOD,RDF,Similarity — Patrick Durusau @ 7:04 pm

SERIMI – Resource Description Similarity, RDF Instance Matching and Interlinking

From the website:

The interlinking of datasets published in the Linked Data Cloud is a challenging problem and a key factor for the success of the Semantic Web. Manual rule-based methods are the most effective solution for the problem, but they require skilled human data publishers going through a laborious, error prone and time-consuming process for manually describing rules mapping instances between two datasets. Thus, an automatic approach for solving this problem is more than welcome. We propose a novel interlinking method, SERIMI, for solving this problem automatically. SERIMI matches instances between a source and a target datasets, without prior knowledge of the data, domain or schema of these datasets. Experiments conducted with benchmark collections demonstrate that our approach considerably outperforms state-of-the-art automatic approaches for solving the interlinking problem on the Linked Data Cloud.

SERIMI-TECH-REPORT-v2.pdf

From the Results section:

The poor performance of SERIMI in the Restaurant1-Restaurant2 is mainly due to missing alignment in the reference set. The poor performance in the Person21-Person22 pair is due to the nature of the data. These datasets were built by adding spelling mistakes to the properties and literal values of their original datasets. Also only instances of class Person were retrieved into the pseudo-homonym sets during the interlinking process.

Impressive work overall but isn’t dirty data really the test? Just about any process can succeed with clean data.

Or is that really the weakness of the Semantic Web? That it requires clean data?
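
To make the clean-versus-dirty distinction concrete, here is a minimal sketch of label matching between two tiny datasets. It is not SERIMI's algorithm; it only illustrates why injected spelling mistakes break exact matching and how a generic string-similarity measure (Python's standard `difflib.SequenceMatcher`) can sometimes recover the link. The names `similarity` and `best_match`, the sample labels, and the 0.7 threshold are all hypothetical choices made for this illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(label, candidates, threshold=0.7):
    """Return the most similar candidate, or None if none reaches the threshold.

    The threshold is an arbitrary illustrative value, not one from the paper.
    """
    scored = [(similarity(label, c), c) for c in candidates]
    score, candidate = max(scored, key=lambda pair: pair[0])
    return candidate if score >= threshold else None


# Hypothetical target dataset labels.
target = ["John Smith", "Mary Jones", "Peter Brown"]

clean = "John Smith"   # label as it appears in the target
dirty = "Jonh Smiht"   # same label with injected spelling mistakes

exact_clean = clean in target            # exact matching works on clean data
exact_dirty = dirty in target            # and fails on the misspelled label
fuzzy_dirty = best_match(dirty, target)  # similarity matching may still link it

print(exact_clean, exact_dirty, fuzzy_dirty)
```

A fixed threshold like this is the crude part: set it too low and unrelated labels get linked, too high and the misspelled ones are missed, which is roughly where the human touch comes in.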

2 Comments

A really interesting find! I’ve downloaded and queued the paper for reading.

I agree that if it requires clean data to work, there aren’t really going to be many datasets that it can be applied to, but there might still be useful techniques in the paper that can be borrowed.

Comment by larsga@garshol.priv.no — August 26, 2011 @ 4:55 am

Hard to say how “clean” the data has to be; a better term would have been “close,” as in not too semantically distant.

But I suspect that is always going to be the case. That is, machines can be trained to do the easy cases, but if the data is really “dirty,” then it will require the human touch.

Comment by Patrick Durusau — August 26, 2011 @ 8:32 am
