Understanding Entity Search by Paul Bruemmer.
From the post:
Over the past two decades, the Internet, search engines, and Web users have had to deal with unstructured data, which is essentially any data that has not been organized or classified according to any sort of pre-defined data model. Thus, search engines were able to identify patterns within webpages (keywords) but were not really able to attach meaning to those pages.
Semantic Search provides a method for classifying the data by labeling each piece of information as an entity — this is referred to as structured data. Consider retail product data, which contains enormous amounts of unstructured information. Structured data enables retailers and manufacturers to provide extremely granular and accurate product data for search engines (machines/bots) to consume, understand, classify and link together as a string of verified information.
Semantic or entity search will optimize much more than just retail product data. Take a look at Schema.org’s schema types – these schemas represent the technical language required to create a structured Web of data (entities with unique identifiers) — and this becomes machine-readable. Machine-readable structured data is disambiguated and more reliable; it can be cross-verified when compared with other sources of linked entity data (unique identifiers) on the Web.
Interesting to see unstructured data defined as:
any data that has not been organized or classified according to any sort of pre-defined data model.
I suppose you can say that, but is that how any of us write?
We all write with specific entities in mind, entities that represent subjects we could identify with additional properties if required.
So it is more accurate to say that unstructured data can be defined as:
any data that has not been explicitly identified by one or more properties.
Well, that’s the trick, isn’t it? We look at an entity and see properties that a machine does not.
Explicit identification is a requirement. But on the other hand, a “unique” identifier is not.
That’s not just a topic map opinion but is in fact in play at the Global Biodiversity Information Facility (GBIF), which I posted about yesterday.
GBIF realizes that ongoing identifications are never going to converge on that happy state where every entity has only one unique reference, in part because an ongoing system has to account for all existing data as well as new data, which could have new identifiers.
There isn’t enough time or resources to find all prior means of identifying an entity and replace those with a new identifier. Rather than cutting the Gordian knot of multiple identifiers with a URI sword, GBIF understands multiple identifiers for an entity.
Robust entity search capabilities require the capturing of all identifiers for an entity. So no user is disadvantaged by the identification they know for an entity.
The properties of subjects represented by entities and their identifiers serve as the basis for mapping between identifiers.
None of which needs to be exposed to the user. All a user may see is that whatever identifier they have for an entity returns the correct entity, along with information that was recorded using other identifiers (if they look closely).
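To make that concrete, here is a minimal sketch of an index that keeps every identifier recorded for an entity and answers a lookup on any of them. Everything in it is hypothetical: the `Entity` and `EntityIndex` names, the sample identifiers and the sample records are mine, not GBIF's. It also takes a shortcut. Two identifiers are treated as naming the same subject when they appear together on one record, which stands in for the property-based mapping described above rather than implementing it.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """One subject, with every identifier known for it and the data recorded about it."""
    identifiers: set = field(default_factory=set)
    records: list = field(default_factory=list)


class EntityIndex:
    """Hypothetical index: any known identifier resolves to the same entity."""

    def __init__(self):
        self._by_identifier = {}

    def add(self, identifiers, record):
        """Store a record under one or more identifiers.

        Assumption: identifiers given together name the same subject, so any
        existing entities they already point to are merged into one.
        """
        identifiers = set(identifiers)

        # Collect the distinct existing entities these identifiers point to.
        merged = []
        for ident in identifiers:
            existing = self._by_identifier.get(ident)
            if existing is not None and all(existing is not e for e in merged):
                merged.append(existing)

        # Build one entity holding all identifiers and all earlier records.
        target = Entity()
        for entity in merged:
            target.identifiers |= entity.identifiers
            target.records.extend(entity.records)
        target.identifiers |= identifiers
        target.records.append(record)

        # Point every identifier, old and new, at the merged entity.
        for ident in target.identifiers:
            self._by_identifier[ident] = target
        return target

    def lookup(self, identifier):
        """Return the entity for an identifier, or None if it is unknown."""
        return self._by_identifier.get(identifier)


# Illustrative data only; the identifiers and records are made up.
index = EntityIndex()
index.add({"id:0001", "cougar"}, {"source": "dataset-a"})
index.add({"cougar", "mountain lion"}, {"source": "dataset-b"})

# A user who only knows "mountain lion" still gets both records.
found = index.lookup("mountain lion")
print(sorted(found.identifiers))
print([r["source"] for r in found.records])
```

The user never sees the merge. They hand over the one identifier they know and get back whatever was recorded under the others.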
What else should an interface disclose other than the result desired by the user?
PS: “Better Late Than Never,” refers to Steve Newcomb and Michel Biezunski’s promotion, since the 1990s, of the use of properties to identify the subject represented by entities. The W3C approach is to replace existing identifiers with a URI. How an opaque URI is better than an opaque string isn’t apparent to me.