Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 10, 2010

Implementing the TMRM (Part 2)

Filed under: TMRM — Patrick Durusau @ 8:52 pm

I left off in Implementing the TMRM (Part 1) by saying that if the TMRM defined proxies for particular subjects, it would lack the generality needed to enable legends to be written between arbitrary existing systems.

The goal of the TMRM is not to be yet another semantic integration format but to enable users to speak meaningfully of the subjects their systems already represent and to know when the same subjects are being identified differently. The last thing we all need is another semantic integration format. Sorry, back to the main theme:

One reason why it isn’t possible to “implement” the TMRM is that it does not define any subject identity equivalence rules.

String matching for IRIs is one test for equivalence of subject identification, but not the only one. The TMRM places no restrictions on tests for subject equivalence, so any implementation will support only a subset of all possible subject equivalence tests. (Defining a subset of equivalence tests is what underlies the capacity for blind interchange of topic maps based on particular legends. More on that later.)

An implementation that compares IRIs, for example, would fail if a legend asked it to compare the equivalence of Feynman diagrams generated from detector output at the Large Hadron Collider. Equivalence of Feynman diagrams is a legitimate test for subject equivalence and well within the bounds of the TMRM.
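
To make that point concrete, here is a minimal sketch, in Python, of what a pluggable set of subject equivalence tests might look like. The names and the interface are hypothetical (the TMRM prescribes no such API), and the “diagram” test is only a naive stand-in for a structural comparison:

    from typing import Any, Callable, Dict

    # An equivalence test takes two identifying values and reports whether
    # they identify the same subject.
    EquivalenceTest = Callable[[Any, Any], bool]

    def iri_string_match(a: str, b: str) -> bool:
        # The familiar case: two IRIs identify the same subject iff the strings match.
        return a == b

    def diagram_match(a: dict, b: dict) -> bool:
        # Naive stand-in for a structural test (comparing diagrams rather than
        # strings): compare node and edge sets. A real test would be far richer.
        return (set(a.get("nodes", [])) == set(b.get("nodes", []))
                and set(map(tuple, a.get("edges", []))) == set(map(tuple, b.get("edges", []))))

    # The subset of equivalence tests this particular implementation happens to support.
    SUPPORTED_TESTS: Dict[str, EquivalenceTest] = {
        "iri": iri_string_match,
        "diagram": diagram_match,
    }

    def same_subject(kind: str, a: Any, b: Any) -> bool:
        # An implementation fails honestly when a legend asks for a test it lacks.
        if kind not in SUPPORTED_TESTS:
            raise NotImplementedError(f"no equivalence test for identification kind {kind!r}")
        return SUPPORTED_TESTS[kind](a, b)

On this toy reading, an IRI-only implementation is simply one whose table holds nothing but the "iri" entry; asked by a legend for the "diagram" test, it fails rather than silently giving a wrong answer.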

(It occurs to me that the real question to ask is why we don’t have more generalized legends with ranges of subject identity tests, rather the way XML parsers handle only part of the universe of markup documents but do quite well within that subset. Apologies for the interruption; that will be yet another post.)

The TMRM is designed to provide the common heuristic through which the representation of any subject can be discussed. However, it does not define a processing model, which is another reason why it isn’t possible to “implement” the TMRM, but more on that in Implementing the TMRM (Part 3).

7 Comments

  1. I enjoy the TMRM a lot. I remember that on my first read of the “Essentials” paper published in the proceedings of TMRA 2006, I wondered why your reference list was empty. Back then – I was still new to the Topic Maps community – I thought you and Steve were just lazy. Not much time passed before I had to revise that assumption, realizing that you were probably looking for the roots of this idea but were not able to locate any. I am still silently pursuing the goal of finding a hint of the TMRM in some older literature from various fields 🙂

    I like in particular the idea of explicating the subject identification method in a legend by means of a hopefully well-established helper proxy system. Yet I wonder whether URI comparison does not suffice, since the result of such a subject identification method can be used for URI construction. In the case of data silos this is pretty straightforward, and I am sure it is possible to map a Feynman diagram onto a string. With this idea in mind, I wonder whether the TMRM isn’t embeddable into other systems, including TMDM-based systems.

    Comment by Robert Cerny — March 11, 2010 @ 4:27 am

  2. I think a stronger point for you to make would be that all comparisons, by computers at any rate, are string comparisons. That some strings are also URIs is just a curiosity, although it is the basis for confusing identifiers with addresses of resources in RDF.

    The Semantic Web at 10 has finally recognized that mistake and, rather than correcting it, is working around it by creating 303 overhead traffic. Fixing the rather amateurish confusion of identifiers with addresses of resources would be cleaner and less burdensome to the Web as a whole. (You know that an error doesn’t become a mistake until you fail to correct it.)

    But even with the stronger question, the answer is no, for two reasons that I will expand on in other postings:

    1) The comparison by a computer is simply a mechanization of the judgment by some user that two or more proxies represent the same subject. How do we make that judgment transferable between users? Well, the legend declares the basis (a string, as far as your computer is concerned) on which two or more proxies will be compared. That is to say, the basis for identification is disclosed to other users, who can choose to follow (or not follow) that basis for identification.

    2) The more fundamental reason is that in every key/value pair, the key is a reference to a proxy which represents a subject. Unlike any other information system we were able to locate, the TMRM presumes that there is a representative for the subject of the key in the key/value pair.

    And that representative of a subject may be the locus of merging as well, so simply representing a present value as a string may not capture later merging that occurs at that key (or at any of the keys that its key/value pairs may involve).

    Another way to say it: subject identification is recursive and dynamic. (A short sketch of this structure appears after the comment thread.)

    Never fear! Every legend declares, quite arbitrarily and for reasons that seem best to it, where that recursion ends and what merging can occur. But any other legend can extend that recursion or merging, such as in cases where we wish to map between different data sources.

    While it is possible to represent any particular subject identification with a string, such a string cannot represent the inherently recursive and dynamic nature of subject identification. Perhaps it is best to say that yes, you could use such a string, but only at the cost of cutting off any further recursion and hence additional merging.

    That could be a great operational decision for some purposes, but it would be a very poor decision to tax all users of topic maps with. The purpose of the TMRM is to enable choices, not imprison users in our a priori choices.

    (In terms of prior art, it is the explicit recursion of subject identification, and the merging inherent in that recursion, that may be original to the TMRM. Every legend declares where it stands, but is subject to being extended by another legend. The other systems we have examined all pick equally arbitrary end points but then make them universal for all users. That is what we sought to avoid.)

    Comment by Patrick Durusau — March 11, 2010 @ 8:34 am

  3. This stuff is very helpful. Especially, I think, the response to Robert Cerny.

    Comment by Steve Newcomb — March 12, 2010 @ 10:18 am

  4. I agree that subject identification is recursive and dynamic by nature. As far as I understand the consequences, this leaves no other choice but late merging, since two proxies might identify the same subject under legend set A and different subjects under legend set B.

    I wonder if this is related to the unease I have when reusing foreign subject identifiers. This unease made me think of the following alternative late merging approach, which has the downside of not explicating the subject identification method but which, for that very reason, works on the TMDM. I distinguish between content topic maps and glue topic maps (a sketch after the comment thread illustrates the split). Content topic maps are topic maps with one and only one subject identifier per topic, assigned by the author of the topic map. This restriction is necessary in order to ensure that all statements indeed apply to the subject the author had in mind and thus can be trusted to mean what they say. Glue topic maps, on the other hand, are topic maps that contain only topics with at least two subject identifiers. As a matter of fact, they could alternatively be called collocation topic maps.

    Comment by Robert Cerny — March 14, 2010 @ 1:24 pm

  5. That works if and only if you presume a single author for a topic map and, for that matter, that the topic map is small enough for the author to be completely consistent in authoring it.

    To presume one author, or an author who is inhumanly consistent, seems like an unworkable restriction to me.

    Consider the case where researchers, all of whom “trust” each other, are collaborating on research on Holocaust archives. Rather than circulate subject identifiers, they simply create them as they do their research, including building associations.

    At periodic points they query the contributed topic maps and add identifiers where necessary to create merged topics, which automatically makes associations unknown to the other researchers appear on topics they created. (If you are guessing this is going to appear in more detail in a future post, right in one.)

    While I agree that untrusted content/legends must be treated with caution, I think limiting topics to one subject identifier (a restriction that the TMDM does not make; it allows sets of identifiers) is overly restrictive.

    I think this is going to be an active area of research for some time to come. For instance, what would a “security” legend look like?

    How much security should be on the topic map side and how much should be strictly implementation-defined? At what level does topic map security fit into a layered security system?

    Comment by Patrick Durusau — March 14, 2010 @ 2:53 pm

  6. Sorry for being unclear. I did not mean that this is the only workable alternative. The restrictions that I suggest are meant for special use cases. Of course you can live well with distributed semantic handshakes. IMHO there are three cases where it does not work well:

    1) If you take a high number of topic map authors who anarchically reuse subject identifiers and constantly merge topics, this would end in a Tower of Babel, because any wrong subject identity decision will propagate into the overall system, increasing the likelihood of further wrong decisions.

    2) If there is no strong shared context and communication is difficult. Subject indicators only work up to a certain point, since no subject indicator is actually about a single subject (no picture, no text). They do connect the subject in question with other subjects, and only because of our shared context are we currently able to identify the right subject as the subject in question. I think this is related to the recursive nature of subject identification. Correct me if I am wrong.

    3) If the cost of a wrong merge is high, e.g. in a hospital, where a life could be at stake. By the way, in Germany there are now strict subject identification procedures coming into place for when a surgery team starts its work.

    Lastly, I agree that even subject identifiers that are created by one person can become foreign to him or her.

    Comment by Robert Cerny — March 15, 2010 @ 12:54 am

  7. Good point on your #2, which I often overlook when describing subject indicators. They are not free from ambiguity, and that is an important point to remember.

    I am not certain that the ambiguity is related to the recursive nature of subject identification so much as it is a matter of the complexity of subject identification, which I think is slightly different.

    Complexity of subject identification is the necessity of a confluence of identity points, if you will, in order to make a “positive ID,” as they call it in crime novels. If, in the Frankfurt airport near the date for TMRA 2011, you see a slightly stooped male with glasses, wearing some sort of electrical device (a TENS unit), with long gray hair/beard, along with other features you recognize, you may say: “There is Patrick!”

    But in your topic map you only write down a subject identifier and ignore, as a practical matter, all the other identity points that trigger such a recognition. That is what I would call the complexity of subject identification.

    The recursive nature of subject identification is that each of those “identity points” is a subject in its own right and so may have multiple ways of being identified itself. For example, I am sure there are equivalent terms in multiple languages for “long gray hair/beard,” which are only two of the subjects mentioned above that could form a complex of subjects that might identify me.

    Each of those subjects is identified by other subjects, and so on. That is what I would term the recursive nature of subject identification.

    Where that causes trouble in most data systems is that we don’t know what subjects were being identified by their primitives, and so when we have to migrate or merge those data files (which happens eventually if the data is to be preserved) it is necessary to reconstruct what subjects were meant (time-consuming and costly). (Apologies for the length of my response, but you raise important questions.)

    Quickly:

    Your #1: If you presume both bad practices and bad system design, it would be hard to have any other result.

    Your #3: A point that medical record reform advocates in some countries need to take to heart. I think we are only at the beginning of exploring merging safeguards.

    Comment by Patrick Durusau — March 15, 2010 @ 4:52 am
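
For readers who want something more concrete, here is a minimal sketch, in Python, of the structure described in comment 2: a proxy is a set of key/value pairs, every key is itself a reference to a proxy, and a legend (not the data structure) supplies the sameness rule and decides where the recursion stops. All names are hypothetical and the rule shown is only a toy, not the TMRM’s own definitions:

    from typing import Dict

    class Proxy:
        # A proxy is a collection of key/value pairs. Every key is a proxy
        # reference, so identification is recursive by construction.
        def __init__(self, label: str):
            self.label = label
            self.pairs: Dict["Proxy", object] = {}

        def add(self, key: "Proxy", value: object) -> None:
            self.pairs[key] = value

    def toy_same_subject(a: Proxy, b: Proxy) -> bool:
        # A toy legend rule: two proxies represent the same subject if any pair
        # matches under a key with the same label, recursing when a value is
        # itself a proxy (cycles ignored for brevity). A real legend declares,
        # quite arbitrarily, where this recursion ends.
        if a is b:
            return True
        for ka, va in a.pairs.items():
            for kb, vb in b.pairs.items():
                if ka.label == kb.label:
                    if isinstance(va, Proxy) and isinstance(vb, Proxy):
                        if toy_same_subject(va, vb):
                            return True
                    elif va == vb:
                        return True
        return False

    def merge(a: Proxy, b: Proxy) -> Proxy:
        # Merging happens at the proxy judged to represent one subject. Because
        # keys are proxies too, later merging at a key proxy is not captured by
        # snapshotting values as plain strings, which is the point about
        # recursion made in comment 2.
        merged = Proxy(a.label)
        merged.pairs.update(a.pairs)
        merged.pairs.update(b.pairs)
        return merged

A different legend could swap toy_same_subject for a stricter or looser rule, or extend the recursion further, without changing the proxies themselves.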
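
In the same spirit, a hedged sketch of the content/glue split proposed in comment 4, again with hypothetical names: content topic maps carry exactly one subject identifier per topic, glue (collocation) topic maps carry at least two per topic, and merging is deferred until a particular glue map is applied:

    from typing import List, Set

    class Topic:
        def __init__(self, subject_identifiers: Set[str]):
            self.subject_identifiers = set(subject_identifiers)

    def is_content_map(topics: List[Topic]) -> bool:
        # One and only one subject identifier per topic, assigned by the author.
        return all(len(t.subject_identifiers) == 1 for t in topics)

    def is_glue_map(topics: List[Topic]) -> bool:
        # Only topics with at least two subject identifiers (collocation topics).
        return all(len(t.subject_identifiers) >= 2 for t in topics)

    def late_merge(content_topics: List[Topic], glue_topics: List[Topic]) -> List[List[Topic]]:
        # Late merging: group content topics whose identifiers a glue topic
        # declares to belong to the same subject. Applying a different glue map
        # (legend set B instead of A) can group the same content differently.
        groups: List[List[Topic]] = []
        for glue in glue_topics:
            group = [t for t in content_topics
                     if t.subject_identifiers & glue.subject_identifiers]
            if group:
                groups.append(group)
        return groups

Nothing here explicates the subject identification method; as the comment notes, that is the cost of staying within the TMDM.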
