Archive for March, 2010

One Billion Points of Failure!

Wednesday, March 31st, 2010

In No 303’s for Topic Maps? I mentioned that distinguishing between identifiers and addresses with 303’s has architectural implications.

The most obvious one is the additional traffic that 303 responses are going to add to the Web.

Another concern is voiced in the Cool URIs for the Semantic Web document when it says:

Content negotiation, with all its details, is fairly complex, but it is a powerful way of choosing the best variant for mixed-mode clients that can deal with HTML and RDF.

Great: more traffic, and it isn’t going to be easy to implement. What else could be wrong?

It is missing the one feature that made the Web a successful hypertext system when more complex systems failed. The localization of failure is missing from the Semantic Web.

If you follow a link and a 404 is returned, then what? Failure is localized because your document is still valuable. It can be processed just like before.

What if you need to know if a URL is an identifier for “people, products, places, ideas and concepts such as ontology classes”? If the 303 fails, you don’t get that information.
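To make the contrast concrete, here is a minimal sketch in Python (function and labels are illustrative names of my own, not any real library) of how a client might interpret the status codes discussed above. A 404 on an ordinary hyperlink leaves the linking document usable; a failed 303 takes the identifier/address distinction with it.

```python
# Hedged sketch of the httpRange-14 style interpretation of responses.
# The classification strings are mine; only the status-code semantics
# come from the W3C's 303 convention discussed above.

def classify(status):
    """Guess what a URI denotes from the HTTP status of a GET on it."""
    if status == 200:
        return "information resource"      # the URI addresses a document
    if status == 303:
        return "non-information resource"  # the URI identifies a concept
    return "unknown"                        # failure: the distinction is lost

# A 404 leaves your document intact and processable (localized failure).
# Any failure of the 303 dance leaves you unable to say what the URI is.
for code in (200, 303, 404):
    print(code, "->", classify(code))
```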

It is important enough information for the W3C to invent ways to fix the failure of RDF to distinguish between identifiers and resource addresses.

But the 303 fix puts you at the mercy of an unreliable network, unreliable software and unreliable users.

With triples relying on other triples, failure cascades. The system has one billion points of potential failure, the reported number of triples.

The Semantic Web only works if our admittedly imperfect systems, built and maintained by imperfect people, running over imperfect networks, don’t fail, maybe. I would rather take my chances with a technology that works for imperfect users, that would be us. The technology would be topic maps.


Tuesday, March 30th, 2010

Topic map fans will be glad to learn that ISO 13250-6 Topic Maps — Compact Syntax (known to the rest of us as CTM) has gone to FDIS!

The compact syntax is a shorthand syntax that is designed to make human authoring of topic maps easier and to be more readable than the XML syntax. It also adds features like templates to assist in authoring.
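For flavor, a small sketch of what CTM looks like, pieced together from the tutorials linked below. The identifiers are invented and the details are approximate; check the FDIS text for the authoritative syntax.

```
# a topic, typed, with a name and an occurrence; the block ends with a period
puccini isa composer;
    - "Giacomo Puccini";
    born: "1858-12-22" .

# an association between two topics, with role types
composed-by(composer: puccini, work: tosca)
```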

You will be amused/disappointed/dismayed if you search for “CTM Examples” (without the quotes) on popular search engines. “Hits” included:

We really need a lightweight way to bring subject identity to HTML pages. (More on that in a future post.)

Useful CTM links for topic map fans:

  • Compact Topic Maps Syntax Tutorial A presentation by Lars Heuer, the principal author of CTM.
  • A CTM Tutorial A blog from Lars Marius Garshol. Heuer reports that the %mergemap directive has changed since this tutorial but otherwise is accurate. (I do like the image of happy CTM users. No, I won’t describe it, you will have to go see for yourself.)

As usual, it would be really helpful for others to read and comment on it now rather than after the text is final. Comments can be sent to:, or posted to the CTM project page.

No 303’s for Topic Maps?

Monday, March 29th, 2010

I was puzzled that articles on 303’s, such as Cool URIs for the Semantic Web never mention topic maps. Then I remembered, topic maps don’t need 303’s!

Topic maps distinguish between URIs used as identifiers and URIs which are the addresses of resources.

Even if the Internet is down, a topic map can distinguish between an identifier and the address of a resource.

Topic maps use the URIs identified as identifiers to determine if topics are representing the same subjects.

Even if the Internet is down, a topic map can continue to use those identifiers for comparison purposes.

Topic maps use the URIs identified as subject locators to determine if topics are representing the same resource as a subject.

Even if the Internet is down, a topic map can continue to use those subject locators for comparison purposes.

You know what they say: Once is happenstance, twice is coincidence, three times is an engineering principle.

The engineering principle? And its consequences? Keep watching this space, I want to massage it a bit before posting.


Techies see: kill -9 ‘/dev/cat’ (Robert Barta, one of my co-editors).

Non-Techies/Techies see: Topic Maps Lab.

Spec readers: XTM (syntax), Topic Maps Data Model.

Not all there is to say about topic maps but you have to start somewhere.


Apologies! News on CTM (Compact Topic Map syntax) most likely tomorrow. Apologies for the delay.

TMCL Arrives!

Sunday, March 28th, 2010

The final (hopefully) version of TMCL has arrived!

Please take a look at the latest version, before 25 April 2010.

Comments can be sent to:, or posted to the TMCL project page.

News about CTM coming tomorrow! This could be a banner year for topic map standards!

Processing Topic Maps

Saturday, March 27th, 2010

Trond Pettersen’s posts on Web Application Development with Ontopia are a welcome relief from the initial presentation of topic maps that most of us have experienced.

While reading those posts it occurred to me that for public/static topic maps a topic map engine might be overkill. If I am sent a fully merged topic map and all I want to do is display it, shouldn’t I be able to store it in an SQL backend, export it to XML and then use Cocoon as a framework for delivery?

I think creating, manipulating and navigating information represented in topic maps should be viewed as separate components in a work flow for topic maps. For example, postings in a blog (to shamelessly steal Trond’s example case), could result in XTM fragments with no topic map “processing” other than production of the fragments.

Another component, dare we say a “topic map engine,” might obtain those fragments from a feed and integrate those into a topic map that is periodically exported to yet another component (and possibly other destinations) for display or other uses.

All of those activities could be centralized in one piece of software, as it is with Ontopia or apparently with DC-X, to name an open source and commercial product.

But there are dangers in consolidated approaches.

One danger is being locked into a particular framework and its limitations.

Another is the potential damage to one’s imagination when every task revolves around one view of the data. Different operations are performed to produce the data or upon it. How it arrives at a common model should be left to the imaginations of developers.

A lot of very clever people are concerned with authoring, merging and delivery of topic maps. Viewing those as separate tasks may lead them to different places than when all roads start in one place.

Concept Hierarchies and Topic Maps

Friday, March 26th, 2010

Concept hierarchies are easy to represent in topic maps and are fundamental to navigation of information resources. So much for the obvious.

Topic maps standards work and debates over arcane issues don’t prepare us to answer the user question: “Excuse me. What concept hierarchy should I use in my topic map?”

The typical response: “Whatever hierarchy you want. Completely unbounded.” That is about as helpful as a poke with a sharp stick.

You don’t want to give your users a copy of this article, but consider reading Deriving concept hierarchies from text by Sanderson and Croft as an introduction to deriving concept hierarchies from the user’s document collection.

Users (aka, paying customers) will appreciate your assistance in developing a hierarchy for their topic map, as opposed to the “well, that’s your problem” approach.

As the links for the authors show, this isn’t the latest word on deriving concept hierarchies. But, it is well written and is a useful starting place. For my part I want to run this backwards to its sources and forward to the latest techniques. More posts coming on this and other techniques that may be useful for building topic maps.

Topic Map News for 25 March 2010

Thursday, March 25th, 2010

WG 3 Meeting in Stockholm

WG 3 just concluded its meetings in Stockholm, Sweden. One of the main items on its agenda was the discussion of the requirements for TMQL (Topic Maps Query Language).

The slides will be available from the SC 34 repository but for those of you who simply can’t wait, TMQL language proposal – apart from Path Language.

Note that FLWR, XML Content and Topic Map Content (slide 27) are proposed to be left out of TMQL 1.0 in the interest of finishing TMQL.

Readers should review these slides and comment on the proposed development of TMQL.

2010 IEEE Intl. Conf. on Information Reuse and Integration

I hope to be at Balisage but if you can’t make it, you might want to consider the 2010 IEEE Intl. Conf. on Information Reuse and Integration, August 4-6, 2010.

The call for papers has been extended to 16 April 2010. Whether you submit a paper or just attend, it looks like a valuable experience for anyone interested in topic maps.

I am going to review prior proceedings of this conference to call out items that look especially relevant to topic maps.

There’s (Another) Name For That

Wednesday, March 24th, 2010

Semantic integration research could really benefit from semantic integration!

After years of using Steve Newcomb’s semantic impedance to describe identifying the same subject differently, I ran across (another) name for that subject: vocabulary mismatch.

“Mismatch” covers a multitude of reasons, conditions and sins.

I encountered the term reading Search Engines: Information Retrieval in Practice by W. Bruce Croft, Donald Metzler, and Trevor Strohman. More comments on this book to appear in future posts. For now, buy it!

A friend recently remarked that my posts cover a lot of territory. True but subject identity is a big topic.

The broader our reading/research, the better we will be able to assist users in developing solutions that work for them and their subjects.

It is always possible to narrow one’s research/reading for a particular project, but broader vistas await for those who seek them out.

Full-Text Search “Logic”

Tuesday, March 23rd, 2010

We justify full-text searching because users are unable to find a subject in an index.

Let’s see:

  • Users don’t know what terms an indexer used for a subject in an index.
  • Users search full-text not knowing what terms hundreds if not thousands of people used for a subject.

It may just be me but that sounds like the problem went from bad to worse.

There may be two separate but related saving graces to full-text searching:

  1. As I pointed out in Is 00.7% of Relevant Documents Enough? a user may get lucky and guess a popular term or terms for some subject.
  2. It is very unlikely that any user will enter a full-text search and get no results.

Some of the questions raised: Is a somewhat useful result more important than a better result? How to measure the distance between the two? How much effort is acceptable to users to obtain a better result?

If you know of any research along those lines please let me know about it.

My suspicion is that the gap between actual and user estimates of retrieval (Size Really Does Matter…) says something very fundamental about users. Something we need to account for in search engines and interfaces.

A Common Model

Monday, March 22nd, 2010

I had someone tell me today that topic map software can’t be written without a common model.

That came as news to me. All these years I had thought:

  • XML documents can have different DTDs/Schemas
  • SQL databases can have different schemas

Now I find out that XML and SQL software isn’t possible without a common model.

But XML and SQL software exists and continues to be written.

I wonder what their authors know that the common model advocates don’t?

Casual Users

Sunday, March 21st, 2010

Abraham Bernstein’s Google lecture Making the Semantic Web Accessible to the Casual User (2008) is quite good.

Relevant to topic mappers are his comments on structuration theory and how social structures both make signals meaningful as well as limit what meanings we will see. Topic maps can capture the meaning as seen by multiple parties as well as anyone who can see separate meanings as being attached to the same subject.

An “interactive” search interface tested by Bernstein and his group got the highest rating from users. Making users into collaborators in authoring topic maps, asking “Did you mean?” sorts of questions and capturing the results, might help capture unanticipated (by some authors) answers as well as increase user satisfaction.

Context Is A Multi-Splendored Thing

Saturday, March 20th, 2010

Sven’s Dude, where’s my context? illustrates an interesting point about topic maps that is easy to overlook. He proposes to create a topic map that maps co-occurrences of terms to a topic and then uses that information as part of a search process.
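As a hedged sketch of the co-occurrence idea in Python (the toy documents and the term handling are assumptions of mine, not Sven’s design): count which terms appear together in the same document, then use the counts to suggest context terms at search time.

```python
from collections import Counter
from itertools import combinations

# Toy "document collection"; a real system would tokenize properly.
docs = [
    "topic map proxy subject identity",
    "proxy server network cache",
    "topic map merging subject",
]

# Count unordered term pairs that co-occur within a document.
cooc = Counter()
for doc in docs:
    terms = sorted(set(doc.split()))
    for a, b in combinations(terms, 2):
        cooc[(a, b)] += 1

def context_terms(term, n=3):
    """Terms that most often co-occur with `term` in the collection."""
    scores = Counter()
    for (a, b), count in cooc.items():
        if a == term:
            scores[b] += count
        elif b == term:
            scores[a] += count
    return [t for t, _ in scores.most_common(n)]

print(context_terms("proxy"))
```

At search time the suggested context terms could disambiguate which “proxy” (ISO technical term vs. network component) a searcher means.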

Assume we had Sven’s topic map for a set of documents and we also had the results of a probe into the same set by coders who had sketched in some ways used to identify one or more subjects. Perhaps even the results of several probes into the set by different sets of coders. (Can anyone say, “different legal teams looking at the same document collection?”)

Each set of coders or team may be using different definitions of context to identify subjects. And, quite likely, they will be identifying the same subjects, albeit based on different contexts.

If team A discovers that the subject “Patrick Durusau” always uses the term “proxy,” as a technical term from an ISO standard, that information can inform all subsequent searches for that term. That is to say that as contexts are “established” for a document collection, subsequent searches can become more precise.

Expressed as a proposition: Topic maps enable cumulative exploration and mapping of information. (As opposed to where searches start at the beginning, again. You would think that would get tiresome.)

Contra Berman

Friday, March 19th, 2010

Sanford Berman has been a major figure in cataloging for decades. His Prejudices and Antipathies: A tract on the LC Subject Heads Concerning People might sound like a dull tract to be read by bored graduate students, but it’s not!

Published in 1971, this work criticizes Library of Congress subject headings. To give the flavor of Berman’s targets and his recommendations:

  • JEWISH QUESTION, “Remedy: Reconstructions are possible for many other inappropriate terms. Not, however, for this. It richly merits deletion.”
  • YELLOW PERIL, “Remedy: Cancel the head and ensure it does not re-appear even as a See referent to other forms.”
  • IDIOCY, IDIOT ASYLUMS, “Discard both ‘idiot’ forms completely….”

I don’t disagree with any of the changes that Berman recommends for current practice, but I do take issue with his recommendations for deletion.

Sanitizing our records will allow us to rest easy that we are beyond such categories as the “JEWISH QUESTION” or “YELLOW PERIL.” Except now we would say without any hesitation, the “MUSLIM QUESTION,” and “BROWN PERIL.” The latter when discussing immigration from Mexico on the Fox news network.

A well constructed topic map for subject headings should not hide our prior ignorance and prejudice from us. Lest we simply choose new victims in place of the old.

What The World Needs Now…

Thursday, March 18th, 2010

With apologies to Jackie DeShannon I would say: topic maps!

The music is a bit over the top but e-Discovery: Did You Know? makes the case for topic maps now!

My favorite line: “At our current rate of data expansion, by just 2011 there will be 2 zettabytes of ESI [Electronically Stored Information] (2 thousand exabytes), which is as many bytes of information as there are … STARS IN THE UNIVERSE.”

My takeaway — the amount of mappable territory continues to expand. We can each find our own way or, we can join forces to create and share maps to what we have discovered and places we have been.

As Steve Newcomb foresaw years ago, there is a real economic opportunity in building maps into information territories. That is, searchers can monetize their explorations of information resources as topic maps.

You can buy reports from Gartner but with a topic map of an information area, you can merge it with your data and reach your own conclusions.

A killer topic map application would pair itself with data exploration tools for easy creation of topic maps that can be refined as part of a topic map creation process. (The tedium of matching up obscure musicians might appeal to ministry of culture types but insights into stock/bond trading, cf. The Big Short, legal discovery, medical research, are more likely to attract important users (the paying kind).)

Authority Record (Another Way To Say PSI?)

Wednesday, March 17th, 2010

I think the Library of Congress has the best definition I have found for an “authority record”:

An authority record is a tool used by librarians to establish forms of names (for persons, places, meetings, and organizations), titles, and subjects used on bibliographic records. Authority records enable librarians to provide uniform access to materials in library catalogs and to provide clear identification of authors and subject headings. For example, works about “movies,” “motion pictures,” “cinema,” and “films” are all entered under the established subject heading “Motion pictures.”

Note that authority records help:

  • …provide uniform access to materials…
  • …provide clear identification…

If rather than “access” you said provide a basis for merging two topics together, I would swear you were talking about a PSI.

If you added that it provides a “clear identification” of a subject, then I would know you were talking about a PSI.

Well, except that PSIs are supposed to be resolvable URIs, etc.

Seems to me that we need to re-think the decision to privilege URIs as identifiers for subjects. Libraries around the world, to say nothing of professional organizations, have been creating authority records that act much as PSIs do for many subjects.

Do we really want to re-invent all of those authority records? (Not to mention all the mileage and good will we would gain from using existing sets of authority records.)

Size Really Does Matter…

Tuesday, March 16th, 2010

…when you are evaluating the effectiveness of full-text searching. Twenty-five years ago Blair and Maron, in An evaluation of retrieval effectiveness for a full-text document-retrieval system, established that size affects the predicted usefulness of full-text searching.

Blair and Maron used a then state of the art litigation support database containing 40,000 documents for a total of approximately 350,000 pages. Their results differ significantly from earlier, optimistic reports concerning full-text search retrieval. The earlier reports were based on sets of less than 750 documents.

The lawyers using the system thought they were obtaining, at a minimum, 75% of the relevant documents. The participants were astonished to learn they were recovering only 20% of the relevant documents.

One of the reasons cited by Blair and Maron merits quoting:

The belief in the predictability of words and phrases that may be used to discuss a particular subject is a difficult prejudice to overcome….Stated succinctly, it is impossibly difficult for users to predict the exact word, word combinations, and phrases that are used by all (or most) relevant documents and only (or primarily) by those documents….(emphasis in original, page 295)

That sounds to me like users using different ways to talk about the same subjects.

Topic maps won’t help users to predict the “exact word, word combinations, and phrases.” However, they can be used to record mappings into document collections that collect up the “exact word, word combinations, and phrases” used in relevant documents.

Topic maps can be used like the maps of early explorers that become more precise with each new expedition.
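A minimal sketch of such a recorded mapping in Python (the names are illustrative, not any topic map engine’s API; the variant terms echo the Library of Congress “Motion pictures” example elsewhere on this blog): record the words actually used for a subject in relevant documents, then expand the user’s query with all of them.

```python
# A topic-map-like record: subject -> the words used for it in
# relevant documents.  Subject keys and terms are assumptions of mine.
subject_terms = {
    "motion-pictures": {"movies", "motion pictures", "cinema", "films"},
}

def record(subject, term):
    """Each new 'expedition' into the collection can add terms."""
    subject_terms.setdefault(subject, set()).add(term)

def expand(query):
    """Search with every recorded variant for the query's subject."""
    for terms in subject_terms.values():
        if query in terms:
            return terms
    return {query}

print(sorted(expand("cinema")))
```

The user still guesses a term, but a lucky guess now retrieves documents that used any of the recorded variants.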

Kilroy Was Here

Monday, March 15th, 2010

Have you ever had one of those “Kilroy Was Here” sort of moments? You think that you are exploring some new idea, only to turn the corner and there you see: “Kilroy was here” in bright bold letters? Except that most of the time for me, it doesn’t read “Kilroy was here,” but rather “Librarians were here.”

I was reading Lois Mai Chan’s Cataloging and Classification: An Introduction when I ran across the concept of access points. Or in Chan’s words, “…the ways a given item may be retrieved.” (page 9) If you broaden that out to say the “…ways a given subject may be retrieved from a topic map…” then it sounds very much like useful information for anyone who wants to build a topic map.

Librarians have spent years researching, implementing, testing and improving ways of accessing information. I think the smart money is going to be on using that knowledge and experience in building topic maps. Look for me in the periodical shelves with library journals. I will try to post short notices of anything that looks particularly interesting. Suggestions more than welcome.

Semantic Diversity – The Default Case

Sunday, March 14th, 2010

While constructing the food analogy to semantic diversity, it occurred to me that semantic diversity is the default case.

Despite language suppression, advocates of universal languages, Esperanto, Loglan, and those who would police existing languages, L’Académie française, semantic diversity remains the default case.

There is semantic diversity in the methods to overcome semantic diversity. Even within particular approaches to overcoming semantic diversity. You can observe diversity in ontologies at Swoogle.

I think semantic diversity continues in part because we as human beings are creative, even when addressing issues like semantic diversity. It is part of who we are to be these bubbling fountains of semantic diversity as it were.

Shouldn’t the first question to anyone hawking the latest search widget be: “Can it search effectively using my terms?” Simple enough question.

The first startup that can answer that question in the affirmative will go far.

When The New Deal Was New

Saturday, March 13th, 2010

Robert Cerny mentioned looking for the historical antecedents of topic maps the other day. Subject identity and identification lie at the core of meaning, so it is no surprise that issues we associate with topic maps are woven into the fabric of history.

Collation, for example, appears in the history of Social Security, a program that was part of the New Deal in the United States.

I didn’t find this on my own (see below) but the ability to mechanically match records in two different sets, to see if they were related to each other, did not exist prior to Social Security. Jack Futterman recounts the invention of the collator:

…The machinery that we had to do the job in those days to keep records did not exist. I should rephrase that. There was no machinery that really could do the social security job before the Social Security organization came into existence….the collation, the ability to take two sets of records and do a matching to see whether they were appropriate or the same and related to one another and then to make, in effect, decisions as to whether to interfile one or to reject it was a facility that did not exist in the equipment up to that time….Oral Interview with Jack S. Futterman, January 23, 1974

Name collation used the Soundex algorithm, as Futterman recalls in other oral history remarks.

Name collation has evolved since then, see A comparison of personal name matching: Techniques and practical issues.

Topic mappers interested in expanding their toolkits for building topic maps or developing additional merging algorithms will find name collation a fruitful area for exploration.
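For the curious, here is a minimal sketch of American Soundex in Python. This is the common textbook formulation (including the rule that h and w do not separate equal codes); the Social Security Administration’s historical variant may have differed in details.

```python
def soundex(name):
    """American Soundex: first letter plus three digits, e.g. Robert -> R163."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    first = name[0].upper()
    digits = []
    prev = codes.get(name[0], "")      # the first letter's own code counts
    for ch in name[1:]:
        if ch in "hw":                  # h/w do not separate equal codes
            continue
        code = codes.get(ch, "")        # vowels get "" and reset prev
        if code and code != prev:
            digits.append(code)
        prev = code
    return (first + "".join(digits) + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))   # both R163: a collation match
```

Two name records “collate” when their codes are equal, exactly the kind of mechanical matching Futterman describes.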

(I first saw the information about the Social Security Administration while reading Edwin Black’s IBM and the Holocaust. Recommend reading the online excerpts with an eye towards how topic maps could have assisted with his and similar projects. There will be posts on aspects of this book in the future, please watch for them and consider posting your thoughts about using topic maps in such projects.)

Is There a Haptic Topic Map in Your Future?

Friday, March 12th, 2010

I ran across a short article today on improving access to maps using non-visual channels, gestures, tactile/haptic interaction and sound, The HaptiMap project aims to make maps accessible through touch, hearing and vision.

The HaptiMap project is sponsored by the EU Commission. There is a collection of recent papers.

One obvious relevance to topic maps is that HaptiMap is collocating information about the same locations from/to different non-visual channels. Hmmm, I have heard that before, but where? Oh, yeah, topic maps are about collocating information about the same subject. That would include information in different channels about the same subject.

A less obvious relevance is for determining when there are multiple representatives of the same subject. Comparing strings, which may or may not be meaningful to a user, is only one test of subject identity. Ever tried to identify the subject spoiled milk by sniffing it? Or a favorite artist or style of music by listening to it? Or a particular style of weave or fabric by touch? All of those sound like identification of subjects to me.

Imagine a map that presents representatives of subjects for merging based on non-symbolic clues experienced by the user. Rather than a music library organized by artist/song title, etc., a continuum that is navigated and merged only on the basis of sound. Or representations of subjects in a haptic map found in a VR environment. Or an augmented environment that uses a variety of channels to communicate information about a single subject.

You will have to attend TMRA 2010 (sponsored in part by …) to see if any haptic topic maps show up this year.

In Praise of Legends (and the TMDM in particular)

Thursday, March 11th, 2010

Legends enable topic maps to have different representations of the same subject. Standard legends, like the Topic Maps Data Model (TMDM), are what enable blind interchange of topic maps.

Legends do a number of things but among the more important, legends define the rules for the contents of subject representatives and the rules for comparing them. The TMDM defines three representatives for subjects: topics, associations and occurrences. It also defines how to compare those representatives to see if they represent the same subjects.

Just to pull one of those rules out, if two or more topics have an equal string in their [subject identifiers] property, the two topics are deemed to represent the same subject. (TMDM 5.3.5 Properties) The [subject identifiers] property is a set so a topic could have two or more different strings in that property to match other topics.
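The rule just quoted is easy to sketch in Python (the class is illustrative, not an actual TMDM engine API): topics merge when their [subject identifiers] sets share at least one string, and merging unions the sets so matches accumulate.

```python
class Topic:
    """Toy stand-in for a TMDM topic with a [subject identifiers] set."""
    def __init__(self, *subject_identifiers):
        self.subject_identifiers = set(subject_identifiers)

def same_subject(a, b):
    """TMDM 5.3.5: one equal string in the sets deems them the same subject."""
    return bool(a.subject_identifiers & b.subject_identifiers)

def merge(a, b):
    """Merging unions the identifier sets, enabling further matches."""
    return Topic(*(a.subject_identifiers | b.subject_identifiers))

t1 = Topic("http://example.org/id/puccini")
t2 = Topic("http://example.org/id/puccini",
           "http://example.net/composers#puccini")
print(same_subject(t1, t2))   # True: one shared identifier suffices
```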

It is the definition of a basis for comparison (see the TMDM for the other rules for comparing topics) of topics that enables the blind interchange of topic maps that follow the TMDM. That is to say that I can author a topic map in XTM (one syntax that follows the TMDM) and reasonably expect that other users will be able to successfully process it.

I am mindful of Robert Cerny’s recent comment on encoding strings as URLs but don’t think that covers the case where the identifications of the subjects are dynamic. That is to say that the strings themselves are composed of strings that are subject to change as additional items are merged into the topic map.

The best use case that comes to mind is that of the current concern in the United States over the non-sharing of intelligence data. You know, someone calls up and says their son is a terrorist and is coming to the United States to commit a terrorist act. That sort of intelligence. That isn’t passed on to anyone. At least anyone who cared enough to share it, I don’t know, with the airlines perhaps?

If I can author a subject identification that includes a previously overlooked source of information, say the parent of a potential terrorist, in addition to the usual categories (paid informants, current/former drug lords, etc.), then the lights aren’t simply blinking red; there is actual information in addition to the blinking lights.

I really should wait for Robert to make his own arguments but if you think of URLs as simply strings, without any need for resolution, you could compose a dynamic identification, freeze it into a URL, then pass it along to a TMDM based system. You don’t get any additional information but that would be one way to input such information into a TMDM based system. If you control the server you could provide a resolution back into the dynamic subject identification system. (Will have to think about that one.)

I think of it as the TMDM using sets of immutable strings for subject identification and one of the things the TMRM authorizes, but does not mandate, is the use of mutable strings as subject identifiers.

Implementing the TMRM (Part 2)

Wednesday, March 10th, 2010


I left off in Implementing the TMRM (Part 1) by saying that if the TMRM defined proxies for particular subjects, it would lack the generality needed to enable legends to be written between arbitrary existing systems.

The goal of the TMRM is not to be yet another semantic integration format but to enable users to speak meaningfully of the subjects their systems already represent and to know when the same subjects are being identified differently. The last thing we all need is another semantic integration format. Sorry, back to the main theme:

One reason why it isn’t possible to “implement” the TMRM is the lack of any subject identity equivalence rules.

String matching for IRIs is one test for equivalence of subject identification but not the only one. The TMRM places no restrictions on tests for subject equivalence so any implementation will only have a subset of all the possible subject equivalence tests. (Defining a subset of equivalence tests underlies the capacity for blind interchange of topic maps based on particular legends. More on that later.)

An implementation that compares IRIs for example, would fail if a legend asked it to compare the equivalence of Feynman diagrams generated from the detector output from the Large Hadron Collider. Equivalence of Feynman diagrams being a legitimate test for subject equivalence and well within the bounds of the TMRM.
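The point can be sketched as pluggable equivalence tests in Python (the proxies and test functions are illustrative assumptions of mine; the TMRM itself mandates none of this): IRI matching is just one test among any number a legend might disclose.

```python
# Toy proxies as dicts; a legend discloses which tests apply.

def iri_match(a, b):
    """One test: equal, non-empty IRI strings."""
    return a.get("iri") is not None and a.get("iri") == b.get("iri")

def name_match(a, b):
    """Another test: case-insensitive, non-empty name strings."""
    return a.get("name", "").lower() == b.get("name", "").lower() != ""

def equivalent(a, b, tests):
    """Proxies are deemed equivalent if any disclosed test says so."""
    return any(test(a, b) for test in tests)

p1 = {"iri": "http://example.org/lhc/event/42"}
p2 = {"name": "Event 42"}
p3 = {"name": "event 42"}
print(equivalent(p2, p3, [iri_match, name_match]))  # True via name_match
```

A Feynman-diagram comparison would simply be one more function in the `tests` list; an IRI-only implementation covers only a subset of admissible tests.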

(It occurs to me that the real question to ask is why we don’t have more generalized legends with ranges of subject identity tests. Sort of like how XML parsers only parse part of the universe of markup documents but do quite well within that subset. Apologies for the interruption, that will be yet another post.)

The TMRM is designed to provide the common heuristic through which the representation of any subject can be discussed. However, it does not define a processing model, which is another reason why it isn’t possible to “implement” the TMRM, but more on that in Implementing the TMRM (Part 3).

Schlepping From One Data Silo to Another (Part 1)

Tuesday, March 9th, 2010

Talking about data silos is popular. Particularly with a tone of indignation, about someone else’s data silo. But, like the weather, everyone talks about data silos, but nobody does anything about them. In fact, if you look closely, all solutions to data silos, are (drum roll please!), more data silos.

To be sure, some data silos are more common than others but every data format is a data silo to someone who doesn’t use that format. Take the detailed records from the Federal Election Commission (FEC) on candidates, parties and other committees as an example. Important stuff for US residents interested in who bought access to their local representative or senator.

The tutorial on how to convert the files to MS Access clues you in that the files are in fixed width fields, or as the tutorial puts it: “Notice that a columns’ start value is the previous columns’ start value plus its’ width value (except for the first column, which is always “1”).” That sounds real familiar.

But, we return to the download page where we read about how to handle overpunch characters. Overpunch characters? Oh, as in COBOL. Now that’s an old data silo.
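For readers who never had the pleasure: in the overpunch convention the last character of a numeric field carries both the final digit and the sign. The sketch below uses the common ASCII rendering (“{” for +0, “A”–“I” for +1 to +9, “}” for −0, “J”–“R” for −1 to −9); check the FEC documentation for the exact character set its files use:

```python
# ASCII overpunch tables: last character encodes final digit plus sign.
POSITIVE = {"{": 0, **{chr(ord("A") + i): i + 1 for i in range(9)}}
NEGATIVE = {"}": 0, **{chr(ord("J") + i): i + 1 for i in range(9)}}

def decode_overpunch(field):
    body, last = field[:-1], field[-1]
    if last in POSITIVE:
        return int(body + str(POSITIVE[last]))
    if last in NEGATIVE:
        return -int(body + str(NEGATIVE[last]))
    return int(field)  # plain unsigned digits, no overpunch

decode_overpunch("1234E")  # 12345
decode_overpunch("1234N")  # -12345
```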

The point being that for all the talk about data silos we never escape them. Data formats are data silos. Get over it.

What we can do is make it possible to view information in one data silo as though it were held by another data silo. And if you must switch from one data silo to another, the time, cost and uncertainty of the migration can be reduced. (to be continued)

Implementing the TMRM (Part 1)

Monday, March 8th, 2010

There are two short pieces on the Topic Maps Reference Model (TMRM) that are helpful to read before talking about “implementing” the TMRM. Both are by Robert Barta, one of the co-editors of the TMRM, A 5 min Introduction into TMRM and TMRM Exegesis: Proxies.

The TMRM defines an abstract structure that enables us to talk about proxies, the generic representatives of subjects. It does not define:

  1. Any rules for identifying subjects
  2. Any rules for comparing identifications of subjects
  3. Any rules for what happens if proxies represent the same subjects
  4. Any subjects for that matter

If that seems like a lot to not define, it was, and it took a while to get there.

The TMRM does not define any of those things, not because they are unnecessary, but because doing so would impair the ability of legends (the disclosures of all those things) to create views of information that merge diverse information resources.

Consider a recent call for help after the earthquake in Chile. Data was held by Google’s People Finder service but the request was to convert it into RDF, then do incremental dumps every hour.

So the data moves from one data silo to another data silo. As Ben Stein would say, “Wow.”

If we could identify the subjects, both structural and as represented, we could merge information about those subjects with information about the same subjects in any data silo, not just one in particular.

How is that for a business case? Pay to identify your subjects once versus paying that cost every time you move from one data silo to another one.

The generality of the TMRM is necessary to support the writing of a legend that identifies the subjects in more than one system and, more importantly, defines rules for when those systems are talking about the same subjects. (to be continued)

(BTW, using Robert Barta’s virtual topic map approach, hourly dumps/conversion would be unnecessary, unless there was some other reason for it. That is an approach that I hope continues in the next TMQL draft (see the current TMQL draft).)

ERIC – A Resource For Topic Maps Design and Research

Sunday, March 7th, 2010

ERIC – Education Resources Information Center offers free access to > 1.3 million bibliographic records on education related materials. Thousands of new records are added every month.

Education is communication and I can’t think of a better general category for topic maps than communication. There is no one-size-fits-all subject identification and no one way to communicate with all users. Clever use of resources found through ERIC may help us avoid old mistakes so that we can make new ones.

The thesaurus feature of ERIC is very topic-map-like. Entries are indexed under a set of uniform “descriptors,” so you can locate records by subject regardless of the terminology the author may have used. To be completely accurate, I should say that topic maps are very thesaurus-like.
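The descriptor mechanism can be sketched in a few lines. The descriptor and variant terms below are invented for illustration, not drawn from the actual ERIC thesaurus:

```python
# Hypothetical thesaurus: variant author terms map to one uniform descriptor.
THESAURUS = {
    "remote learning": "Distance Education",
    "online instruction": "Distance Education",
    "correspondence study": "Distance Education",
}

def descriptor_for(term):
    """Return the uniform descriptor for a term, or the term itself if unlisted."""
    return THESAURUS.get(term.lower(), term)

records = [
    {"title": "Doc 1", "author_terms": ["remote learning"]},
    {"title": "Doc 2", "author_terms": ["online instruction"]},
]

def find_by_descriptor(records, descriptor):
    """Locate records by subject, regardless of the author's terminology."""
    return [r["title"] for r in records
            if any(descriptor_for(t) == descriptor for t in r["author_terms"])]

find_by_descriptor(records, "Distance Education")  # ['Doc 1', 'Doc 2']
```

Both documents are retrieved under one descriptor even though their authors never used the same words, which is the collocation behavior a topic map generalizes.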

The ERIC data and thesaurus are freely available. Exploring the “triggers” that lead to assignment of “descriptors,” creating rules for merging “descriptors” with similar mechanisms in other data sets, or the advantages of associations would all make good topic map research projects. Education is a current topic of public concern and, dare I say, funding?

Subject Headings and the Semantic Web

Saturday, March 6th, 2010

One of the underlying (and false) presumptions of the Semantic Web is that users have a uniform understanding of the world, one that matches the understanding of ontology authors.

The failure of that presumption was demonstrated over a decade ago in rather remarkable research conducted by Karen Drabenstott (now Marley) on user understanding of Library of Congress subject headings.

Despite the use of Library of Congress subject headings for almost a century, no one before Drabenstott had asked the fundamental question: Does anyone understand Library of Congress subject headings? The study, Understanding Subject Headings in Library Catalogs, found that:

Overall percentages of correct meanings for subject headings in the original order of subdivisions were as follows: children, 32%, adults, 40%, reference 53%, and technical services librarians, 56%.

The conclusions one would draw from such a result are easy to anticipate but I will quote from the report:

The developers of new indexing systems especially systems aimed at organizing the World-Wide Web should include children, adults, librarians, and even subject-matter experts in the establishment of new terms and changes to existing ones. Perhaps there should be separate indexing systems for children, adults, librarians, and subject-matter experts. With a click of a button, users could choose the indexing system that works for them in terms of their understanding of the subject matter and the indexing system’s terminology.

Hmmm, users “…choose the indexing system that works for them…,” what a remarkable concept. Topic maps anyone?

An Early Example of Collocation

Friday, March 5th, 2010

An early example of collocation is the Rosetta Stone. It records a decree in 196 BCE by Ptolemy V granting a tax amnesty to temple priests.

The stele has the decree written in Egyptian (two scripts, hieroglyphic and Demotic) and Classical Greek.

The collocation of different translations of the same decree on the Rosetta Stone raises interesting questions about identification of subjects as well as how to process such identifications.

This decree of Ptolemy V could be identified as the decree on the Rosetta Stone. Or, it could be identified by reciting the entire text. There are multiple ways to identify any subject. That some means of identification are more common than others, should not blind us to alternative methods for subject identification. Or to the differences that a means of identification may make for processing.

Since each text/identifier was independent of the others, each reader was free to identify the subject without reference to the other identifiers. (Shades of parallel processing?)

Another processing issue to notice is that by reciting the text of the decree on the Rosetta Stone, it was not necessary for readers to “dereference” an identifier in order to understand what subject was being identified.

Topic maps are a recent development in a long history of honoring semantic diversity.

Defusing A Combinatorial Explosion?

Thursday, March 4th, 2010

One of the oldest canards in the “map to a common identifier/model” game is the allegation of a lurking combinatorial explosion.

It goes something like this: If you have identifiers A, B, and C, for a single subject, there are mappings to and from each identifier, hence:

  • A -> B
  • A -> C
  • B -> A
  • B -> C
  • C -> A
  • C -> B

Since no identifier maps to itself, the number of mappings is given by N * (N-1).

To avoid the overhead of tracking an ever-increasing number of mappings between identifiers, the “cure” is to map all identifiers for a subject to a single identifier.

If something doesn’t feel right, congratulations! You’re right, something isn’t right. As Sam Hunting observed when I forwarded a draft of this post to him, if the mapping argument were true, it would not be possible to construct a thesaurus.

But it is possible to construct a thesaurus. So how do we reconcile the observed mappings with the existence of thesauri? The trick is that a thesaurus has only implicit mappings between identifiers for a subject. That assumption is glossed over in the combinatorial explosion argument. If the mappings between identifiers are left implicit, the potential combinatorial explosion is defused. (You could also say the identifiers are collocated with each other, a term I will return to in other posts.)

Multiple identifiers for a subject lead to convenience and ease of use, not combinatorial explosions.
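For concreteness, here is a tiny sketch (names mine) contrasting explicit pairwise mappings, which grow as N * (N-1), with thesaurus-style collocation, which just groups the identifiers and leaves the pairwise mappings implicit:

```python
from itertools import permutations

identifiers = ["A", "B", "C"]

# Explicit approach: store every directed mapping between distinct identifiers.
pairwise = list(permutations(identifiers, 2))
# len(pairwise) == N * (N-1) == 6 for N == 3

# Thesaurus-style approach: one entry collocating all identifiers for a subject.
entry = {"subject": frozenset(identifiers)}

def same_subject(a, b, entry):
    """Any two identifiers in the entry are implicitly mapped to each other."""
    return a in entry["subject"] and b in entry["subject"]

same_subject("A", "C", entry)  # True, with no A -> C mapping stored anywhere
```

Adding a fourth identifier to the explicit approach adds six new mappings; adding it to the collocated entry adds one member.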

Semantics Irrelevant to Communication?

Wednesday, March 3rd, 2010

C. E. Shannon in A Mathematical Theory of Communication (Bell System Technical Journal, 1948) says:

“The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point. Frequently the messages have meaning; that is they refer to or are correlated according to some system with certain physical or conceptual entities. These semantic aspects of communication are irrelevant to the engineering problem.” (emphasis added)

Avoidance of the “semantic aspects of communication” remains the most popular position. Think about it. What are the common responses to heterogeneous data (semantic “noise”)?

  1. Let’s use my semantics, (or, lacking the power to insist),
  2. Let’s use a common semantics.

Both are homogenization of semantically heterogeneous messages. A “McDonald’s” version as opposed to having choices ranging from Thai to Southern Barbeque (BBQ, Bar-B-Q, Bar-B-Que). Not only is there information loss, the results are bland and uninteresting.

Semantic homogenization is not the answer. Semantic homogenization is the question. The answer is NO.

Skillful Semantic Users?

Tuesday, March 2nd, 2010

I recently discovered one reason for my unease with semantic this and that technologies, including topic map interfaces. A friend mentioned to me that he wanted users to do more than enter subject names in their topic map interface. “Users need to also enter….”

The idea of users busily populating a semantic space is an attractive one, but it hasn’t been borne out in practice. So I don’t think my friend’s interface is going to prove to be useful, but why?

Then I got to thinking: how many indexers or librarians do I know? The sort of people whose combined talents brought us the Readers’ Guide to Periodical Literature and useful back-of-the-book indexes. Due to my work in computer standards I know a lot of smart people, but very few of them strike me as also being good at indexing or cataloging.

Any semantic solution, RDFa, RDF/OWL, SUMO, Topic Maps, etc., will fail from an authoring standpoint due to a lack of skill. No technology can magically make users competent at the indexing or cataloging skills required to enable access by others.

Semantic interface writers need to recognize that most users are simply consumers of information created by others. I would not be surprised if the ratio of producers to consumers is close to the ratio of contributors to consumers in open source projects.