A vendor “truth” document from Ontotext. Not that being from a vendor is a bad thing, but you should always consider the source of a document when evaluating its claims.
Quite naturally I jumped to: “6. Data Integration & Identity Resolution: Identifying the same entity across disparate data sources.”
With so many different databases and systems existing inside any single organization, how do companies integrate all of their data? How do they recognize that an entity in one database is the same entity in a completely separate database?
Resolving identities across disparate sources can be tricky. First, they need to be identified and then linked.
To do this effectively, you need two things. Earlier, we mentioned that through the use of text analysis, the same entity spelled differently can be recognized. Once this happens, the references to entities need to be stored correctly in the triplestore. The triplestore needs to support predicates that can declare two different Universal Resource Indicators (URIs) as one in the same. By doing this, you can align the same real-world entity used in different data sources. The most standard and powerful predicate used to establish mappings between multiple URIs of a single object is owl:sameAs. In turn, this allows you to very easily merge information from multiple sources including linked open data or proprietary sources. The ability to recognize entities across multiple sources holds great promise helping to manage your data more effectively and pinpointing connections in your data that may be masked by slightly different entity references. Merging this information produces more accurate results, a clearer picture of how entities are related to one another and the ability to improve the speed with which your organization operates.
In case you are unfamiliar with owl:sameAS, here is an example from OWL Web Ontology Language Reference
<rdf:Description rdf:about="#William_Jefferson_Clinton">: <owl:sameAs rdf:resource="#BillClinton"/> </rdf:Description>
The owl:sameAs in this case is opaque because there is no way to express why an author thought #William_Jefferson_Clinton and #BillClinton were about the same subject. You could argue that any prostitute in Columbia would recognize that mapping so let’s try a harder case.
<rdf:Description rdf:about="#United States of America">: <owl:sameAs rdf:resource="#الولايات المتحدة الأمريكية"/> </rdf:Description>
Less confident than you were about the first one?
The problem with owl:sameAs is its opaqueness. You don’t know why an author used owl:sameAs. You don’t know what property or properties they saw that caused them to use one of the various understandings of owl:sameAs.
Without knowing those properties, accepting any owl:sameAs mapping is buying a pig in a poke. Not a proposition that interests me. You?
I first saw this in a tweet by graphityhq.