Improving Schema Matching with Linked Data by Ahmad Assaf, Eldad Louw, Aline Senart, Corentin Follenfant, Raphaël Troncy, and David Trastour.
Abstract:
With today’s public data sets containing billions of data items, more and more companies are looking to integrate external data with their traditional enterprise data to improve business intelligence analysis. These distributed data sources however exhibit heterogeneous data formats and terminologies and may contain noisy data. In this paper, we present a novel framework that enables business users to semi-automatically perform data integration on potentially noisy tabular data. This framework offers an extension to Google Refine with novel schema matching algorithms leveraging Freebase rich types. First experiments show that using Linked Data to map cell values with instances and column headers with types improves significantly the quality of the matching results and therefore should lead to more informed decisions.
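As I read it, the core move is simple enough: reconcile each cell value against Linked Data instances, collect the rich types those instances carry, and let the column vote on a type for its header. A minimal sketch of that idea follows, assuming a reconcile() lookup you supply yourself; the function below is a stand-in, not an API from the paper or from Freebase.

```python
from collections import Counter

def reconcile(cell_value):
    """Stand-in for a Linked Data reconciliation call (Freebase in the paper).
    Should return candidate (type, score) pairs for the instance that best
    matches cell_value. Hypothetical: plug in whatever service you have."""
    raise NotImplementedError

def guess_column_type(cells, min_support=0.5):
    """Vote on a rich type for a column header by reconciling its cell values."""
    votes = Counter()
    for cell in cells:
        for candidate_type, score in reconcile(cell):
            votes[candidate_type] += score
    if not votes:
        return None
    best_type, best_score = votes.most_common(1)[0]
    # Accept the type only if it carries at least min_support of the total vote.
    return best_type if best_score / sum(votes.values()) >= min_support else None
```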
Personally I don’t find mapping Airport -> Airport Code all that convincing a demonstration.
The other problem I have is: what happens after a user “accepts” a mapping?
Now what?
I can contribute my expertise to mappings between diverse schemas all day, even public ones.
What happens to all that human effort?
It is what I call the “knowledge toilet” approach to information retrieval/integration.
Software runs (I can’t count the number of times integration software has been run on Citeseer. Can you?) and a user corrects the results as best they are able.
Now what?
Oh, yeah, the next user or group of users does it all over again.
Why?
Because the user before them flushed the knowledge toilet.
The information has been mapped, possibly even hand-corrected by one or more users. Then it is just tossed away.
That has to seem wrong at some very fundamental level, whatever semantic technology you choose to use.
I’m open to suggestions.
How do we stop flushing the knowledge toilet?
Isn’t this the same sort of problem version control systems were designed to solve?
Comment by marijane — May 15, 2012 @ 5:29 pm
Interesting suggestion.
My impression (sans any research; that will be tomorrow) is that version control was designed to allow for revision control, branching, etc., under fairly well-defined situations.
It was not designed to capture semantic decision making such as you would have while pursuing a search: what you consider to be the same, what you consider different, and why, that sort of thing.
No search interface that I know of (leaving a large margin to simply be wrong) captures that sort of semantic detail.
Appreciate the suggestion. I suspect there is a good bit to be uncovered in the history of versioning.
Thanks!
Patrick
Comment by Patrick Durusau — May 15, 2012 @ 7:33 pm
I guess I was thinking specifically about version control for the mappings you spoke of in the post. I assumed that they’re being stored somewhere, so why not store changes over time as well? But perhaps that isn’t an appropriate assumption to make.
Comment by marijane — May 16, 2012 @ 4:07 pm
Oh, well, two issues:
1) The usual mapping of schemas gives you the result, not the starting points or why you got the result. That is, if I have two different database schemas and map them together, where do I put the two starting schemas, or a record of why they were merged the way they were?
Not that it isn’t possible to put that sort of process under version control (which I think is your question), but I don’t think that is common practice.
2) Even if you stored the two schemas and even the reasons for merging them, a version control system is designed to track changes, not to create a map of the changes. That is, it can say which six lines changed and/or where they moved, but that doesn’t automatically tie to an explanation. You can enter comments, and careful users do, but the mapping would be made by the user and not assisted by the software.
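To make that concrete, here is a rough sketch of the artifact I would want stored before version control even enters the picture: the two starting schemas, the correspondences, and the reasons, all in one record. The field names are mine and purely hypothetical, not from the paper or any existing tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Correspondence:
    source_field: str        # e.g. "Airport"
    target_field: str        # e.g. "Airport Code"
    rationale: str           # why a human (or the matcher) decided these are the same
    decided_by: str          # user name, or the name of the matching algorithm
    confidence: float = 1.0  # 1.0 for a hand-accepted mapping

@dataclass
class MappingRecord:
    source_schema: dict      # first starting schema, kept verbatim
    target_schema: dict      # second starting schema, kept verbatim
    correspondences: list = field(default_factory=list)
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

# Serialize a MappingRecord (to JSON, say) and commit it; later corrections
# become new commits, so the human decisions are never flushed away.
```

Diffs over a file like that are still line diffs, as I said, but at least the starting schemas and the reasoning survive instead of getting flushed.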
Good idea though. Still trying to take a “look see” at versioning software.
Comment by Patrick Durusau — May 16, 2012 @ 6:50 pm