Data Integration: The Relational Logic Approach by Michael Genesereth of Stanford University.
Abstract:
Data integration is a critical problem in our increasingly interconnected but inevitably heterogeneous world. There are numerous data sources available in organizational databases and on public information systems like the World Wide Web. Not surprisingly, the sources often use different vocabularies and different data structures, being created, as they are, by different people, at different times, for different purposes.
The goal of data integration is to provide programmatic and human users with integrated access to multiple, heterogeneous data sources, giving each user the illusion of a single, homogeneous database designed for his or her specific need. The good news is that, in many cases, the data integration process can be automated.
This book is an introduction to the problem of data integration and a rigorous account of one of the leading approaches to solving this problem, viz., the relational logic approach. Relational logic provides a theoretical framework for discussing data integration. Moreover, in many important cases, it provides algorithms for solving the problem in a computationally practical way. In many respects, relational logic does for data integration what relational algebra did for database theory several decades ago. A companion web site provides interactive demonstrations of the algorithms.
Interactive edition with working examples: http://logic.stanford.edu/dataintegration/. (As near as I can tell, it contains the entire text, although it is referred to as the “companion” website.)
When the author said Datalog, I thought of Lars Marius:
In our examples here and throughout the book, we encode relationships between and among schemas as rules in a language called Datalog. In many cases, the rules are expressed in a simple version of Datalog called Basic Datalog; in other cases, rules are written in more elaborate versions, viz., Functional Datalog and Disjunctive Datalog. In the following paragraphs, we look at Basic Datalog first, then Functional Datalog, and finally Disjunctive Datalog. The presentation here is casual; formal details are given in Chapter 2.
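To make the idea of "rules between schemas" concrete, here is a minimal sketch of my own, not taken from the book. It assumes two hypothetical sources: one with father and mother relations, the other with a child_of relation whose arguments run in the opposite order. Three rules map both into a single parent relation in a made-up master schema, and a fourth derives grandparent from it. The relation names, the data, and the naive bottom-up evaluator are all assumptions for illustration; the book's Datalog syntax and algorithms may well differ.

```python
# Facts and atoms are tuples: (relation_name, arg1, arg2, ...).
# Convention assumed here: strings starting with an uppercase letter are
# variables; everything else (relation names, constants) is lowercase.

def is_var(term):
    return isinstance(term, str) and term[:1].isupper()

def match(atom, fact, env):
    """Extend the bindings in env so that atom equals fact, or return None."""
    if len(atom) != len(fact) or atom[0] != fact[0]:
        return None
    env = dict(env)  # copy, so a failed match never corrupts the caller's env
    for term, value in zip(atom[1:], fact[1:]):
        if is_var(term):
            if env.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return env

def evaluate(rules, facts):
    """Naive bottom-up evaluation: apply all rules until nothing new appears."""
    known = set(facts)
    while True:
        derived = set()
        for head, body in rules:
            envs = [{}]
            for atom in body:
                next_envs = []
                for env in envs:
                    for fact in known:
                        extended = match(atom, fact, env)
                        if extended is not None:
                            next_envs.append(extended)
                envs = next_envs
            for env in envs:
                args = tuple(env[t] if is_var(t) else t for t in head[1:])
                derived.add((head[0],) + args)
        if derived <= known:
            return known
        known |= derived

# Hypothetical source A: father(Parent, Child), mother(Parent, Child).
SOURCE_A = {("father", "art", "bob"), ("mother", "ann", "bob")}
# Hypothetical source B: child_of(Child, Parent), note the reversed order.
SOURCE_B = {("child_of", "cal", "bob")}

# Each rule is (head, body), read as "head holds if every body atom holds".
RULES = [
    (("parent", "X", "Y"), [("father", "X", "Y")]),
    (("parent", "X", "Y"), [("mother", "X", "Y")]),
    (("parent", "X", "Y"), [("child_of", "Y", "X")]),
    (("grandparent", "X", "Z"), [("parent", "X", "Y"), ("parent", "Y", "Z")]),
]

integrated = evaluate(RULES, SOURCE_A | SOURCE_B)
grandparents = sorted(f for f in integrated if f[0] == "grandparent")
print(grandparents)
# [('grandparent', 'ann', 'cal'), ('grandparent', 'art', 'cal')]
```

Neither source knows about grandparents, and neither alone contains both generations; the answer only appears once the rules put both sources behind one vocabulary. A real engine would evaluate rules far more efficiently than this loop does.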
The bottom line is that the author advocates a master schema approach, but you should read the book for yourself. It makes a number of good points about data integration issues and the limitations of various techniques. Plus you may learn some Datalog along the way!