The (Real) Semantic Web Requires Machine Learning by John O’Neil.
From the longer quote below:
…different people will almost inevitably create knowledge encodings that can’t easily be compared, because they use different — sometimes subtly, maddeningly different — basic definitions and concepts. Another difficult problem is to decide when entity names refer to the “same” real-world thing. Even worse, if the entity names are defined in two separate places, when and how should they be merged?
And the same is true for relationships between entities. Full stop.
The author thinks statistical analysis will be able to distinguish both entities and the relationships between them, and I am sure that will be true to some degree.
I would characterize that as a topic map authoring aid, but it would also be possible to simply accept the statistical results.
It is refreshing to see someone recognize that the “semantic web” is the one created by users, not one dictated by other authorities.
From the post:
We think about the semantic web in two complementary (and equivalent) ways. It can be viewed as:
- A large set of subject-verb-object triples, where the verb is a relation and the subject and object are entities
OR
- A large graph or network, where the nodes of the graph are entities and the graph’s directed edges or arrows are the relations between nodes.
As a reminder, entities are proper names, like people, places, companies, and so on. Relations are meaningful events, outcomes or states, like BORN-IN, WORKS-FOR, MARRIED-TO, and so on. Each entity (like “John O’Neil”, “Attivio” or “Newton, MA”) has a type (like “PERSON”, “COMPANY” or “LOCATION”) and each relation is constrained to only accept certain types of entities. For example, WORKS-FOR may require a PERSON as the subject and a COMPANY as the object.
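To make the two views concrete, here is a minimal Python sketch of my own; the signature table, the BORN-IN typing, and the function names are my illustration, not from the post.

```python
# A minimal sketch of the triple/graph duality and typed relations
# described above. Names and structures are illustrative.
from collections import defaultdict

# Each relation constrains the entity types it accepts,
# e.g. WORKS-FOR requires a PERSON subject and a COMPANY object.
RELATION_SIGNATURES = {
    "WORKS-FOR": ("PERSON", "COMPANY"),
    "BORN-IN": ("PERSON", "LOCATION"),      # assumed signature
    "MARRIED-TO": ("PERSON", "PERSON"),
}

ENTITY_TYPES = {
    "John O'Neil": "PERSON",
    "Attivio": "COMPANY",
    "Newton, MA": "LOCATION",
}

def add_triple(triples, subject, verb, obj):
    """Add a subject-verb-object triple, enforcing the relation's type signature."""
    subj_type, obj_type = RELATION_SIGNATURES[verb]
    if ENTITY_TYPES[subject] != subj_type or ENTITY_TYPES[obj] != obj_type:
        raise TypeError(f"{verb} requires ({subj_type}, {obj_type})")
    triples.append((subject, verb, obj))

def as_graph(triples):
    """The same triples viewed as a directed graph: nodes are entities,
    labeled edges are relations."""
    edges = defaultdict(list)
    for subject, verb, obj in triples:
        edges[subject].append((verb, obj))
    return edges

triples = []
add_triple(triples, "John O'Neil", "WORKS-FOR", "Attivio")
add_triple(triples, "John O'Neil", "BORN-IN", "Newton, MA")
print(as_graph(triples))
```

The same list of triples serves both views: `add_triple` enforces the relation’s type constraints at insertion, and `as_graph` reads the list back as a directed, labeled graph.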
How semantic web information is organized and transmitted is described by a blizzard of technical standards and XML namespaces. Once you escape from that, the basic goals of the semantic web are (1) to allow a lot of useful information about the world to be simply expressed, in a way that (2) allows computers to do useful things with it.
Almost immediately, some problems crop up. As generations of artificial intelligence researchers have learned, it can be really difficult to encode real-world knowledge into predicate logic, which is more-or-less what the semantic web is. The same AI researchers also learned that different people will almost inevitably create knowledge encodings that can’t easily be compared, because they use different — sometimes subtly, maddeningly different — basic definitions and concepts. Another difficult problem is to decide when entity names refer to the “same” real-world thing. Even worse, if the entity names are defined in two separate places, when and how should they be merged? For example, do an Internet search for “John O’Neil”, and try to decide which of the results refer to how many different people. Believe me, all the results are not for the same person.
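The “John O’Neil” search example can be made concrete with a toy disambiguation heuristic; to be clear, the context features, the Jaccard measure, and the threshold below are my assumptions, not the author’s method.

```python
# Toy illustration of the name-collision problem: mentions sharing a
# surface name are merged only when their contexts overlap enough.
# The features, similarity measure, and threshold are all assumptions.

def jaccard(a: set, b: set) -> float:
    """Jaccard overlap between two sets of context features."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical search results: same name, different context features.
mentions = [
    ("John O'Neil", {"attivio", "search", "machine learning"}),
    ("John O'Neil", {"attivio", "semantic web", "search"}),
    ("John O'Neil", {"baseball", "sportswriter"}),
]

MERGE_THRESHOLD = 0.3  # assumption; real systems tune this

clusters = []  # each cluster: (name, merged feature set)
for name, features in mentions:
    for cluster in clusters:
        if cluster[0] == name and jaccard(cluster[1], features) >= MERGE_THRESHOLD:
            cluster[1].update(features)
            break
    else:
        clusters.append((name, set(features)))

print(f"{len(clusters)} distinct people found")  # 2, not 1 or 3
```

Even this toy version shows the core difficulty: the right answer (“2 distinct people”) depends entirely on a threshold that no one can set correctly for every pair of names.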
As for relations, it’s difficult to tell when they really mean the same thing across different knowledge encodings. No matter how careful you are, if you want to use relations to infer new facts, you have few resources to check to see if the combined information is valid.
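To see why combined inferences are fragile, consider a toy inference rule of my own construction; the LOCATED-IN relation and the WORKS-IN conclusion are hypothetical, not from the post.

```python
# Toy inference by relation composition. LOCATED-IN and the WORKS-IN
# rule are hypothetical; the point is that the inferred fact is only
# valid if both sources mean the same thing by their relation names.

facts_site_a = [("John O'Neil", "WORKS-FOR", "Attivio")]
facts_site_b = [("Attivio", "LOCATED-IN", "Newton, MA")]

def infer_works_in(facts):
    """If X WORKS-FOR Y and Y LOCATED-IN Z, tentatively conclude X WORKS-IN Z."""
    inferred = []
    for subj, verb, obj in facts:
        if verb != "WORKS-FOR":
            continue
        for subj2, verb2, obj2 in facts:
            if verb2 == "LOCATED-IN" and subj2 == obj:
                inferred.append((subj, "WORKS-IN", obj2))
    return inferred

# Merging the two sites' facts silently assumes their relation names
# are compatible -- exactly the check the post says we lack.
print(infer_works_in(facts_site_a + facts_site_b))
```

The inference is mechanical; the hard part is the unstated premise that site A’s WORKS-FOR and site B’s LOCATED-IN carry the meanings the rule assumes.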
So, when each web site can define its own entities and relations, independently of any other web site, how do you reconcile entities and relations defined by different people?
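The post’s answer, and the reason for its title, is statistical. One way to make that concrete is to propose relation merges by comparing the entity pairs each relation actually connects; the overlap measure and threshold below are my choices, a sketch rather than the post’s method.

```python
# Minimal sketch of statistical relation alignment: two relation names
# are merge candidates when they connect largely the same entity pairs.
# The overlap measure and threshold are assumptions.
from collections import defaultdict

def pairs_by_relation(triples):
    """Map each relation name to the set of (subject, object) pairs it connects."""
    index = defaultdict(set)
    for subj, verb, obj in triples:
        index[verb].add((subj, obj))
    return index

def alignment_candidates(triples_a, triples_b, threshold=0.5):
    """Propose relation merges when pair overlap (Jaccard) clears a threshold."""
    index_a, index_b = pairs_by_relation(triples_a), pairs_by_relation(triples_b)
    for rel_a, pairs_a in index_a.items():
        for rel_b, pairs_b in index_b.items():
            overlap = len(pairs_a & pairs_b) / len(pairs_a | pairs_b)
            if overlap >= threshold:
                yield rel_a, rel_b, overlap

site_a = [("John O'Neil", "WORKS-FOR", "Attivio"),
          ("Jane Smith", "WORKS-FOR", "Attivio")]
site_b = [("John O'Neil", "EMPLOYED-BY", "Attivio"),
          ("Jane Smith", "EMPLOYED-BY", "Attivio")]

for rel_a, rel_b, score in alignment_candidates(site_a, site_b):
    print(f"merge candidate: {rel_a} ~ {rel_b} ({score:.2f})")
```

A statistical pass like this is exactly what I would treat as a topic map authoring aid: it surfaces candidate merges for a human to confirm, though nothing stops you from accepting its output wholesale.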