According to my blogging software this is my 2,000th post!
During the search for content and ideas for this blog I have thought a lot about topic maps and how to explain them.
Or should I say, how to explain topic maps without inventing new terminologies or notations?
Topic maps deal with a familiar problem:
People use different words when talking about the same subject and the same word when talking about different subjects.
It happens in conversations, newspapers, magazines, movies, videos, TV/radio, texts and, alas, electronic data.
The confusion caused by using different words for the same subject and same word for different subjects is a source of humor. (What does “nothing” stand for in Shakespeare’s “Much Ado About Nothing”?)
In searching electronic data, that confusion causes us to miss some data we want to find (different word for the same subject) and to find some data we don’t want (same word but different subject).
When searching old newspaper archives this can be amusing and/or annoying.
Potential outcomes of failure elsewhere:
- medical literature: injury/death/liability
- financial records: civil/criminal liability
- patents: lost opportunities/infringement
- business records: civil/criminal liability
Solving the problem of different words for the same subject and the same word but different subjects is important.
But how?
Topic maps and other solutions have one thing in common:
They use words to solve the problem of different words for the same subject and the same word but different subjects.
Oops!
The usual battle cry is “if everyone uses my words, we can end semantic confusion, have meaningful interchange for commerce, research, cultural enlightenment and so on and so forth.”
I hate to be the bearer of bad news, but what about all the petabytes of data we already have on hand, with zettabytes of previous interpretations? With more being added every day and no universal solution in sight? (If you don’t like any of the current solutions, wait a few months and new proposals, schemas, vocabularies, etc., will surface. Or you can take the most popular approach and start your own.)
Proposals to deal with semantic confusion are also frozen in time and place. Unlike the human semantics they propose to sort out, they do not change and evolve.
We have to use the source of semantic difficulty, words, in crafting a solution and our solution has to evolve over time even as our semantics do.
That’s a tall order.
Part of the solution, if you want to call it that, is to recognize when the benefits of solving semantic confusion outweigh the cost of the solution. We don’t need to solve semantic confusion everywhere and anywhere it occurs. In some cases, perhaps rather large cases, it isn’t worth the effort.
That triage of semantic confusion allows us to concentrate on cases where the investment of time and effort is worthwhile. In searching for the Hilton Hotel in Paris I may get “hits” for someone with underwear control issues, but so what? Is that really a problem that needs a solution?
On the other hand, being able to resolve semantic confusion, such as the confusion that underlies different accounting systems for businesses, could give investors a clearer picture of the potential risks and benefits of particular investments. The same could be done for financial institutions, so that regulators can “look down” into regulated systems with some semantic coherence (without requiring identical systems).
Having chosen some semantic confusion to resolve, we then have to choose a method to resolve it.
One method, probably the most popular one, is the “use my (insert vocabulary)” method for resolving semantic confusion. It works and, for some cases, may be all that you need. Databases with gigabyte-size tables (and larger) operate quite well using this approach. It can become problematic after acquisitions, when migration to other database systems is required, and undocumented semantics can prove costly in many situations.
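As a rough sketch of that method (the vocabulary, terms and function below are invented for illustration, not drawn from any particular system), the “use my vocabulary” approach amounts to forcing every incoming term onto a single canonical term:

```python
# Toy illustration of the "use my vocabulary" approach: every incoming
# term is mapped onto one canonical term before storage or search.
# The vocabulary below is entirely hypothetical.
CANONICAL = {
    "myocardial infarction": "heart attack",
    "heart attack": "heart attack",
    "MI": "heart attack",          # same subject, different words
}

def normalize(term):
    """Map a term onto the controlled vocabulary, or fail loudly."""
    try:
        return CANONICAL[term.strip()]
    except KeyError:
        # Terms outside the vocabulary are the usual weak spot: they are
        # either rejected, as here, or stored with undocumented semantics.
        raise ValueError(f"term not in controlled vocabulary: {term!r}")

print(normalize("MI"))  # -> heart attack
```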
Semantic Web techniques, leaving aside the fanciful notion of unique identifiers, do offer the capability of recording additional properties about terms or rather the subjects that terms represent. Problematically though, they don’t offer the capacity to specify which properties are required to distinguish one term from another.
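For example, a minimal sketch with rdflib (assuming rdflib is installed; the URIs and properties are made up for illustration): the graph happily records properties about two identifiers for the same subject, but nothing in the data says which of those properties would settle whether the identifiers name the same subject.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDFS

EX = Namespace("http://example.org/")  # hypothetical namespace
g = Graph()

# Two identifiers that a human reader can see describe the same city.
g.add((EX.Paris, RDFS.label, Literal("Paris")))
g.add((EX.Paris, EX.country, Literal("France")))
g.add((EX.Lutetia, RDFS.label, Literal("Paris")))
g.add((EX.Lutetia, EX.country, Literal("France")))

# All of these properties are recorded, but no triple states that
# "agreement on label and country" (or anything else) is the test for
# whether EX.Paris and EX.Lutetia represent the same subject.
for s, p, o in g:
    print(s, p, o)
```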
No, I am not about to launch into a screed about why “my” system works better than all the others.
Recognition that all solutions are themselves built from words, and so subject to semantic ambiguity, is the most important lesson of the Topic Maps Reference Model (TMRM).
Keys (of key/value pairs) are pointers to subject representatives (proxies), and values may be such references as well. Other keys and/or values may point to other proxies that represent the same subjects, which replicates the current dilemma.
The second important lesson of the TMRM is the use of legends to define what key/value pairs occur in a subject representative (proxy) and how to determine when two or more proxies represent the same subject (subject identity).
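As a rough illustration of both lessons (not a conforming TMRM implementation; the keys, values and legend below are invented for the example), a proxy can be modeled as a set of key/value pairs and a legend as the rule naming which keys decide subject identity:

```python
# Rough sketch only: proxies as key/value pairs, a legend as the rule
# that says which keys decide subject identity. All names are invented.

def same_subject(a, b, identity_keys):
    """Legend rule: two proxies represent the same subject when they
    agree on every key the legend declares identity-bearing."""
    return all(k in a and k in b and a[k] == b[k] for k in identity_keys)

def merge(a, b):
    """Combine two proxies for the same subject, keeping every distinct
    value seen for each key (different words for one subject survive)."""
    merged = {}
    for proxy in (a, b):
        for key, value in proxy.items():
            merged.setdefault(key, set()).add(value)
    return merged

# Two proxies using different vocabularies for the same subject.
p1 = {"taxon": "Puma concolor", "common_name": "cougar"}
p2 = {"taxon": "Puma concolor", "common_name": "mountain lion"}

legend_identity_keys = {"taxon"}  # this legend: identity rides on "taxon"

if same_subject(p1, p2, legend_identity_keys):
    print(merge(p1, p2))
    # -> {'taxon': {'Puma concolor'},
    #     'common_name': {'cougar', 'mountain lion'}}
```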
Neither lesson ends semantic ambiguity, nor do they mandate any particular technology or methodology.
They do enable the creation and analysis of solutions, including legends, with an awareness they are all partial mappings, with costs and benefits.
I will continue this blog’s broad coverage of semantic issues, but over the next 1,000 posts I will make a particular effort to cover:
- Ex Parte Declaration of Legends for Data Sources (even using existing Linked Data where available)
- Suggestions for explicit subject identity mapping in open source data integration software
- Advances in graph algorithms
- Sample topic maps using existing and proposed legends
Other suggestions?