Using Similarity-based Operations for Resolving Data-Level Conflicts (2003)
Abstract:
Dealing with discrepancies in data is still a big challenge in data integration systems. The problem occurs both during eliminating duplicates from semantic overlapping sources as well as during combining complementary data from different sources. Though using SQL operations like grouping and join seems to be a viable way, they fail if the attribute values of the potential duplicates or related tuples are not equal but only similar by certain criteria. As a solution to this problem, we present in this paper similarity-based variants of grouping and join operators. The extended grouping operator produces groups of similar tuples, the extended join combines tuples satisfying a given similarity condition. We describe the semantics of these operators, discuss efficient implementations for the edit distance similarity and present evaluation results. Finally, we give examples how the operators can be used in given application scenarios.
No, the title of the post is not a mistake.
The authors of this paper, in 2003, conclude:
In this paper we presented database operators for finding related data and identifying duplicates based on user-specific similarity criteria. The main application area of our work is the integration of heterogeneous data where the likelihood of occurrence of data objects representing related or the same real-world objects though containing discrepant values is rather high. Intended as an extended grouping operation and by combining it with aggregation functions for merging/reconciling groups of conflicting values our grouping operator fits well into the relational algebra framework and the SQL query processing model. In a similar way, an extended join operator takes similarity predicates used for both operators into consideration. These operators can be utilized in ad-hoc queries as part of more complex data integration and cleaning tasks.
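To make the two operators concrete, here is a minimal Python sketch of similarity-based join and grouping over edit distance. It is not the authors' implementation: the function names, the dict-based rows, the nested-loop strategy, and the choice to form groups by transitive closure of the "within max_dist edits" relation are all assumptions made for illustration.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute (or match)
        prev = curr
    return prev[-1]


def similarity_join(left, right, key, max_dist):
    """Naive nested-loop join: pair rows whose `key` values are within max_dist edits."""
    return [(l, r) for l in left for r in right
            if edit_distance(l[key], r[key]) <= max_dist]


def similarity_group(rows, key, max_dist):
    """Group rows connected (directly or transitively) by similar `key` values."""
    parent = list(range(len(rows)))  # union-find forest over row indexes

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Hypothetical brute-force comparison of every pair of rows.
    for i, j in combinations(range(len(rows)), 2):
        if edit_distance(rows[i][key], rows[j][key]) <= max_dist:
            parent[find(i)] = find(j)

    groups = {}
    for i, row in enumerate(rows):
        groups.setdefault(find(i), []).append(row)
    return list(groups.values())


if __name__ == "__main__":
    people = [{"name": "Jon Smith"}, {"name": "John Smith"}, {"name": "Mary Jones"}]
    for group in similarity_group(people, "name", max_dist=2):
        print([row["name"] for row in group])
```

An aggregation step for reconciling each group (for example, keeping the longest spelling of a name) would then run over the lists that `similarity_group` returns, much as an aggregate function runs over an ordinary GROUP BY. The pairwise comparison is quadratic in the number of rows, which is the cost that efficient implementations aim to reduce.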
In addition to a theoretical background, the authors illustrate an implementation of their techniques, using Oracle 8i. (Oracle 11g is the current version.)
Don’t despair! 😉
That still leaves a lot to be done, including:
- Interchange between relational database stores
- Semantic integration in non-relational database stores
- Interchange in mixed relational/non-relational environments
- Identifying bases for semantic integration in particular data sets (the tough nut)
- Others? (your comments can extend this list)
The good news for topic maps is that Oracle has some name recognition in IT contexts. 😉
There is a world of difference between a CIO saying to the CEO:
“That was a great presentation about how we can use our data more effectively with topic maps and some software. What did he say the name was?”
and,
“That was a great presentation about using our Oracle database more effectively!”
Yes?
Big iron for your practice of topic maps. A present for your holiday tradition.
Aside to Matt O’Donnell. Yes, I am going to be covering actual examples of using these operators for topic map purposes.
Right now I am sifting through a 400-document collection on “multi-dimensional indexing,” which is where I discovered this article. Remind me to look at other databases/indexers with similar operators.