Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

April 12, 2010

Topic Maps and the “Vocabulary Problem”

To situate topic maps in a traditional area of IR (information retrieval), try the “vocabulary problem.”

Furnas describes the “vocabulary problem” as follows:

Many functions of most large systems depend on users typing in the right words. New or intermittent users often use the wrong words and fail to get the actions or information they want. This is the vocabulary problem. It is a troublesome impediment in computer interactions both simple (file access and command entry) and complex (database query and natural language dialog).

In what follows we report evidence on the extent of the vocabulary problem, and propose both a diagnosis and a cure. The fundamental observation is that people use a surprisingly great variety of words to refer to the same thing. In fact, the data show that no single access word, however well chosen, can be expected to cover more than a small proportion of user’s attempts. Designers have almost always underestimated the problem and, by assigning far too few alternate entries to databases or services, created an unnecessary barrier to effective use. Simulations and direct experimental tests of several alternative solutions show that rich, probabilistically weighted indexes or alias lists can improve success rates by factors of three to five.

The Vocabulary Problem in Human-System Communication (1987)

Substitute topic maps for probabilistically weighted indexes or alias lists. (Techniques we are going to talk about in connection with topic maps authoring.)

Three to five times greater success is an incentive to use topic maps.
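
To make the alias-list idea from the Furnas quote concrete, here is a minimal Python sketch. The index contents, the weights, and the `lookup` function are all hypothetical, invented for illustration; they come neither from the paper nor from any topic map software.

```python
# Hypothetical alias index: each access word maps to candidate targets,
# weighted by how often users are assumed to mean that target by that word.
ALIAS_INDEX = {
    "delete": {"remove_file": 0.7, "cut_text": 0.3},
    "remove": {"remove_file": 0.9, "cut_text": 0.1},
    "erase": {"remove_file": 0.6, "cut_text": 0.4},
    "cut": {"cut_text": 1.0},
}


def lookup(word):
    """Return candidate targets for a user's word, most likely first."""
    candidates = ALIAS_INDEX.get(word.lower().strip(), {})
    return sorted(candidates, key=candidates.get, reverse=True)


best_guesses = lookup("Erase")   # several words reach the same target
no_match = lookup("zap")         # a word nobody anticipated finds nothing
```

The point of the sketch is only that every extra alias is one more way in. A word missing from the index still fails, which is the vocabulary problem in miniature.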

Marketing Department Summary

Customers can’t buy what they can’t find. Topic maps help customers find purchases, which increases sales. (Be sure to track sales results before and after adopting topic maps, so marketing can’t successfully claim the increases are due to their efforts.)

April 11, 2010

Texts and Topic Maps

Filed under: Subject Identity,Topic Maps — Patrick Durusau @ 8:50 pm

Topic maps are composed of representatives of subjects, that is representatives of:

anything whatsoever, regardless of whether it exists or has any other specific characteristics, about which anything whatsoever may be asserted by any means whatsoever (TMDM, 3.14)

Every text is composed of representatives of subjects as well.

Does that make every text a topic map? The answer to that is “no” but why?

Comparing a Text and a Topic Map:

Property                                         | Text | Topic Map
Subject Representatives                          | yes  | yes
Explicit Rules for Identification/Representation | no   | yes
Explicit Rules for Merging                       | no   | yes

I waver between saying that the explicit rules for Identification/Representation are sufficient by themselves and adding explicit rules for Merging. Certainly the rules for merging presume the first but without rules for merging, the rules for identification/representation are nugatory.

Following both sets of rules does not necessarily result in merging all the subject representatives for the same subject. The most any topic map application can claim is that a set of rules for identification/representation has been followed by a particular map and that specified rules for merging have been applied.

Whether a topic map has in fact properly “merged” all the subject representatives is a judgment only a human reader can make, alongside whatever texts they happen to be reading.

PS: Merging means that a single representative for a subject results, containing all the different identifications for that subject and any properties of that subject.
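
A minimal Python sketch of that definition follows. The `Representative` class, the `merge` function, and the rule that a shared identifier means the same subject are assumptions made for illustration. A real topic map states its own identification and merging rules, and this is not the TMDM procedure.

```python
from dataclasses import dataclass, field


@dataclass
class Representative:
    """A stand-in for a subject: its identifications plus its properties."""
    identifiers: set = field(default_factory=set)
    properties: dict = field(default_factory=dict)


def merge(reps):
    """Merge representatives that share at least one identifier.

    Assumed rule for this sketch: a shared identifier means same subject.
    If two representatives disagree on a property, the later one wins.
    """
    merged = []
    for rep in reps:
        current = Representative(set(rep.identifiers), dict(rep.properties))
        remaining = []
        for other in merged:
            if current.identifiers & other.identifiers:
                # Same subject: fold the earlier representative into this one.
                current.identifiers |= other.identifiers
                current.properties = {**other.properties, **current.properties}
            else:
                remaining.append(other)
        merged = remaining + [current]
    return merged


twain = Representative({"Twain, Mark, 1835-1910"}, {"born": 1835})
clemens = Representative(
    {"Clemens, Samuel Langhorne, 1835-1910", "Twain, Mark, 1835-1910"},
    {"died": 1910},
)
result = merge([twain, clemens])  # one representative, both identifications
```

Note what the sketch cannot do: if no identifier is shared, nothing merges, however obvious the match is to a human reader.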

April 10, 2010

Interfaces and Topic Maps

Filed under: Search Interface,Searching — Patrick Durusau @ 8:11 pm

A copy of Search User Interfaces by Marti A. Hearst, Cambridge University Press, 2009, ISBN 978-0-521-11379-3, arrived in my mailbox today.

I am in the final stages of putting Part 2 of the ODF 1.2 standard together but I did peek inside long enough to find:

  1. The Design of Search User Interfaces
  2. The Evaluation of Search User Interfaces
  3. Models of the Information Seeking Process
  4. Query Specification
  5. Presentation of Search Results
  6. Query Reformulation
  7. Supporting the Search Process
  8. Integrating Navigation with Search
  9. Personalization with Search
  10. Information Visualization for Search Interfaces
  11. Information Visualization for Text Analysis
  12. Emerging Trends in Search Interfaces

There is a place for this volume and others like it on the shelves of every topic map interface designer.

I will be tracking the references in this volume so I can report on the latest work in the field.

Stay tuned for future updates as I work my way through this one. Promises to be a real interesting read.

(Update: The full text of this volume is freely available at: http://searchuserinterfaces.com/. I will post links to individual chapters in future commentary.)

April 9, 2010

TFM (To Find Me) Scoring

Filed under: LCSH,Subject Headings,Subject Identifiers,Subject Identity — Patrick Durusau @ 8:34 pm

The TFM (To Find Me) score for a topic map or other information resource depends upon the subject being identified.

Here is a portion of a record from the Library of Congress:

LC Control No.: 2001376890
Type of Material: Book (Print, Microform, Electronic, etc.)
Main Title: Medieval Slavic manuscripts and SGML : problems and
perspectives = Srednovekovni slavi·a·nski rukopisi i
SGML / [Anisava Miltenova, David Birnbaum, editors].
Parallel Title: Srednovekovni slavi·a·nski rukopisi i SGML
Published/Created: Sofii·a· : A.I. “Prof. Marin Drinov”, 2000.
Related Names: Miltenova, Anisava
Birnbaum, David J.
Description: 371 p. : ill. ; 24 cm.
ISBN: 9544307400
Subjects: ***omitted, will cover in another post***
LC Classification: Z115.5.C57 M43 2000
Language Code: eng bul
Other System No.: (OCoLC)ocm45819499
CALL NUMBER: Z115.5.C57 M43 2000

How many ways can you find this book?

  1. Main title: Medieval Slavic manuscripts and SGML : problems and perspectives
  2. Parallel Title: Srednovekovni slavi·a·nski rukopisi i SGML
  3. ISBN: 9544307400
  4. Other System No.: (OCoLC)ocm45819499

TFM score of 4. Four ways to find this book.

But why weren’t the following included?

  1. LC Control No.: 2001376890
  2. CALL NUMBER: Z115.5.C57 M43 2000

Which would have made the TFM score 6.

Depends on what subject you think is being identified.

If the subject is this book, as a publication, the TFM score remains at 4.

If the subject is a particular copy of this book, held by the Library of Congress, the TFM score goes to 6.
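
The dependence of the score on the subject can be sketched in a few lines of Python. The field names come from the record above. The labels "publication" and "copy", and the `tfm_score` function, are my own hypothetical shorthand for the two readings.

```python
# Each access point from the record, tagged with what it identifies.
# The tagging reflects the reading in this post, not a cataloging rule.
ACCESS_POINTS = {
    "Main Title": "publication",
    "Parallel Title": "publication",
    "ISBN": "publication",
    "Other System No.": "publication",
    "LC Control No.": "copy",
    "CALL NUMBER": "copy",
}

# A particular copy can be found by anything that finds the publication,
# plus the access points specific to that copy.
APPLICABLE = {
    "publication": {"publication"},
    "copy": {"publication", "copy"},
}


def tfm_score(subject):
    """Count the access points that lead to the given subject."""
    return sum(1 for kind in ACCESS_POINTS.values() if kind in APPLICABLE[subject])


publication_score = tfm_score("publication")  # 4
copy_score = tfm_score("copy")                # 6
```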

April 8, 2010

TFM (To Find Me) Mark Twain

Filed under: LCSH,Subject Identity,Topic Maps — Patrick Durusau @ 6:29 pm

My TFM (To Find Me) project for today is the Library of Congress catalog and the subject is “Mark Twain.” I started at: http://catalog.loc.gov, selected “Author Keyword,” and entered “Mark Twain.” Putting in the exact string is a TFM score of 1 but I had to start somewhere.

Results? 36 results in total: 6 personal names, 7 meeting names, and 23 corporate names. Since I am interested in the subject, the author “Mark Twain,” let’s look a bit closer at the returns. The returns include the number of “titles” for each listing, thus the first one is 1 title by “David, Mark Twain.”

  • 1 David, Mark Twain
  • 1 Nadir, Mark Twain, 1913-
  • 17 Twain, Mark.
  • 1438 Twain, Mark, 1835-1910
  • 1 Twain, Mark, 1835-1910 (Spirit)
  • Twain, Mark Mrs., 1845-1904

The fourth entry, “1438 Twain, Mark, 1835-1910” has a more info logo and if we follow that we find: “see also: Clemens, Samuel Langhorne, 1835-1910.” If we follow that, we get:

  • 9 Clemens, Samuel Langhorne, 1835-1910

There is a more info link with a pointer to “Twain, Mark, 1835-1910” at this result.

As it stands now, we have a TFM score of 2 on the subject of Mark Twain (Exact string, Mark Twain and Clemens, Samuel Langhorne). I am curious about the entry with 1438 titles since I am sure that Twain’s literary output was less than that number. Note that “A Connecticut Yankee in King Arthur’s Court” does not appear in the listing of the works by Twain in the third line item. Clearly something is amiss.

Localization and Topic Maps

Filed under: Localization — Patrick Durusau @ 2:26 pm

One obvious application of topic maps is assisting with issues of localization. Usually that means having an interface in different languages, displaying time/date/money in different ways and sometimes being sensitive to the layout preferred by a given culture.

Topic maps can do all of that and enable developers to interchange the information they have developed for localization.

It occurs to me that the years of research on localization of interfaces should be useful in developing localization of information.

I rather like that, the localization of information. Enabling users to find and use information as they understand it.

Having localized interfaces is important and I don’t want to take anything away from that goal. But, enabling users to find and use the information they need seems like a logical next step. One that topic maps can help developers take.

April 7, 2010

How Can I Find Thee? Let me count the ways…

Filed under: Subject Identity,Topic Maps — Patrick Durusau @ 7:12 pm

The number of ways users can find information has a direct impact on how many of them will actually find the information they need. I haven’t found any literature that suggests having fewer ways to find information improves retrieval. If you know of any, please post a link or reference to it.

The research on the higher number of ways to find information resulting in more users finding it has been around since the early 80’s, so say almost 30 years. I am curious how many information systems have taken those lessons to heart?

It won’t be a big part of any of my blogs for the next week or so but let me propose that you and I do an informal survey of information systems. Could be anything, a local website, the local library catalog, perhaps a government agency site, etc.

Pick some subject, one that interests you, then find that subject in an information system. Now, for the fun part. How many other ways can you find that information? Could include other words for it, other ways to access the same information, etc. Write each one down and then post to one of my blog posts that mention it: a link (if you like), the subject, and the To Find Me (TFM) score for that subject.

TFM is incremented by one for every way to find a subject in a particular information resource. At my website, you can find information on “topic maps,” and the same thing as “ISO 13250,” so the TFM score would be 2.

I will pick a subject as well and will post a short note every day about my experience on finding that subject and then trying to find other ways to find that subject. Happy hunting!

April 6, 2010

Building Multilingual Topic Maps

Filed under: Conferences,Heterogeneous Data,Semantic Diversity — Patrick Durusau @ 8:42 pm

The one article of faith shared by all topic map enthusiasts is: topic maps can express anything! But having said that, “when the rubber hits the road” (Americanism, means to become meaningful, action being taken) the question is how to build a topic map, particularly a multilingual one.

We are all familiar with the ability of topic maps to place a “scope” on a name so that its language can be indicated. But that is only one aspect of what is expected of a modern multilingual system.
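
For readers who have not seen scoped names, here is a rough Python sketch of the idea. The `Name` class, the three-letter language codes used as scopes, and the fallback rule are assumptions for illustration, not the syntax of any topic map standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Name:
    value: str
    scope: str  # the language in which this name is valid


# One topic (the city of Vienna) with names scoped by language.
VIENNA = [
    Name("Vienna", "eng"),
    Name("Wien", "deu"),
    Name("Vienne", "fra"),
]


def display_name(names, language, fallback="eng"):
    """Pick the name scoped to `language`, else the fallback language."""
    by_scope = {name.scope: name.value for name in names}
    return by_scope.get(language, by_scope.get(fallback))


def matches(names, query):
    """A query in any language finds the topic if it equals one of its names."""
    return query.casefold() in {name.value.casefold() for name in names}


german_label = display_name(VIENNA, "deu")    # "Wien"
italian_label = display_name(VIENNA, "ita")   # no Italian name, so "Vienna"
found = matches(VIENNA, "vienne")             # True
```

Scope handles display and lookup by name. It does nothing for stemming, transliteration, or query translation, which is where the multilingual retrieval literature comes in.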

Fortunately, topic map fans don’t have to re-invent multilingual information retrieval techniques!

Bookmark and use the resources found at the Cross Language Evaluation Forum. CLEF is sponsored by TrebleCLEF, an activity of the European Commission.

CLEF has almost a decade of annual proceedings and both sites offer collections of links to other multilingual resources. I am going to start mining those proceedings and other documents for suggestions and tips on constructing topic maps.

Suggestions, comments, tips, etc., that you have found useful would be appreciated.

(PS: I am sure all this is old hat to European topic map folks but realize there are, ahem, parts of the world where multilingualism isn’t valued. I suspect many of the same techniques will work for multiple identifications in single languages.)

April 5, 2010

Are You Designing a 10% Solution?

Filed under: Full-Text Search,Heterogeneous Data,Recall,Search Engines — Patrick Durusau @ 8:28 pm

The most common feature on webpages is the search box. It is supposed to help readers find information, products, services; in other words, help the reader or your cash flow.

How effective is text searching? How often will your reader use the same word as your content authors for some object, product, service? Survey says: 10 to 20%!*

So the next time you insert a search box on a webpage, you or your client may be missing 80 to 90% of the potential readers or customers. Ouch!

Unlike the imaginary world of universal and unique identifiers, the odds of users choosing the same words have been established by actual research.

The data sets were:

  • verbs used to describe text-editing operations
  • descriptions of common objects, similar to the PASSWORD™ game
  • superordinate category names for swap-and-sale listings
  • main-course cooking recipes

There are a number of interesting aspects to the study that I will cover in future posts but the article offers the following assessment of text searching:

We found that random pairs of people use the same word for an object only 10 to 20 percent of the time.
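
To see how a figure in that range can arise, here is a small worked example in Python. If users choose word i for an object with probability p_i, the chance that two independently chosen users pick the same word is the sum of the squared p_i. The word counts below are invented for illustration and are not data from the Furnas study.

```python
def pairwise_agreement(word_counts):
    """Probability that two independent users choose the same word."""
    total = sum(word_counts.values())
    return sum((n / total) ** 2 for n in word_counts.values())


# Hypothetical survey: 100 users name the operation that removes text.
counts = {
    "delete": 30,
    "remove": 20,
    "erase": 15,
    "cut": 15,
    "kill": 10,
    "zap": 10,
}

agreement = pairwise_agreement(counts)  # about 0.195
```

Even with a clear favorite word used by 30% of users, two people agree less than one time in five.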

This research is relevant to all information retrieval systems: online stores, library catalogs, whether you are searching simple text, RDF or even topic maps. Ask yourself or your users: Is a 10% success rate really enough?

(There are ways to improve that 10% score. More on those to follow.)

*Furnas, G. W., Landauer, T. K., Gomez, L. M., Dumais, S. T., (1983) “Statistical semantics: Analysis of the potential performance of keyword information access systems.” Bell System Technical Journal, 62, 1753-1806. Reprinted in: Thomas, J.C., and Schneider, M.L, eds. (1984) Human Factors in Computer Systems. Norwood, New Jersey: Ablex Publishing Corp., 187-242.

April 4, 2010

Redemption and Topic Maps

Filed under: Topic Maps — Patrick Durusau @ 7:51 pm

Easter Sunday seems like a good day to discuss the redeeming/salvation aspects of topic maps. It could be a very short post because in my view, topic maps offer us neither redemption nor salvation.

It has been a popular theme in Internet circles that better access to information will lead to better decision making. If we could just “see” things from other perspectives, we would not be bound by our parochial interests.

Topic maps offer the potential to transcend and preserve evidence of semantic barriers. Not only can we “see” semantic barriers but move beyond them. Heady stuff. But, better access to information will not necessarily make us better people.

Ask your local rabbi, priest, imam, or other religious leader. Their traditions have labored for centuries, if not millennia, to help us choose better conduct over other choices, with mixed results. They had access to all the information anyone needs to be a better person. But we have been unwilling to take the advice.

Now with topic maps or the Semantic Web or (your choice), we are going to wake up and say, “I can be a better person!” Hardly. I have every confidence we will continue to be selfish, vain, parochial, inconsistent and, at times, foolish.

Let’s not mistake topic maps, or any other tool, for a source of redemption or salvation. The path to redemption or salvation lies within us and the choices we make.

April 3, 2010

Source of Heterogeneous Data?

Filed under: Heterogeneous Data,Semantic Diversity — Patrick Durusau @ 7:19 pm

Topic maps are designed to deal with heterogeneous data. The question I have never heard asked (or answered) is: “Where does all this heterogeneous data come from?” Heterogeneous data is the topic of conversation in digital IT and pre-digital IT literature.

You would think that question would have been asked and answered. I went out looking for it, since email is slow today. (Holy Saturday 2010)

If I can find a time when there wasn’t any heterogeneous data, then someone may have commented, “look, there’s heterogeneous data.” I could then track the cause forward. Sounds simple enough.

I have a number of specialized works on languages of the Ancient Near East but it turns out the Unicode standard has the information we need.

Chapter 14, Archaic Scripts, has entries for both Egyptian hieroglyphics and Sumero-Akkadian. Both arose at about the same time, somewhere from the middle to near the end of the fourth millennium BCE. That’s recorded heterogeneous data, isn’t it?

For somewhere between 5,000 and 5,500 years we have had heterogeneous data. It appears to be universal, geographically speaking.

The source of heterogeneous data? That would be us. What we need is a solution that works with us and not against us. That would be topic maps.

April 2, 2010

Re-Inventing Natural Language

Filed under: Heterogeneous Data,Ontology,Semantic Diversity — Patrick Durusau @ 8:29 pm

What happens when users use ontologies? That is, when ontologies leave the rarefied air of campuses, turgid dissertations and the clutches of arm chair ontologists?

Would you believe that users simply take terms from ontologies and use them as they wish? In other words, after decades of research, ontologists have re-invented natural language! With all of its inconsistent usage, etc.

I would send a fruit basket if I had their address.

For the full details, take a look at: The perceived utility of standard ontologies in document management for specialized domains. From the conclusion:

…rather than being locked into conforming to the standard, users will be free to use all or small fragments of the ontology as best suits their purpose; that is, these communities will be able to very flexibly import ontologies and make selective use of ontology resources. Their selective use and the extra terms they add will provide useful feedback on how the external ontologies could be evolved. A new ontology will emerge as the result and this itself may become a new standard ontology.

I would amend the final two sentences to read: “Their selective use and the extra terms they add will provide useful feedback on how their language is evolving. A new language will emerge as the result and this may itself become a new standard language.”

Imagine, all that effort and we are back where we started. Users using language (terms from an ontology) to mean what they want it to mean and not what was meant by the ontology.

The arm chair ontologists have written down what they mean. Why don’t we ask ordinary users the same thing, and write that down?

April 1, 2010

Obama Whitehouse Adopts Topic Maps!

Filed under: Humor — Patrick Durusau @ 9:56 am

Rahm Emanuel, Chief of Staff for President Obama, said in an interview with Ann Coulter, that the Obama Whitehouse is adopting topic maps for all communication logs from the Whitehouse.

Communication logs from the Whitehouse monitor all incoming and outgoing telephone calls (land or cell), all Internet traffic, internal phone calls and other forms of communication (withheld on grounds of national security).

“Topic maps will enable interested citizens to reconstruct who spoke to who in what order on eventful days in the Whitehouse,” said Emanuel. When asked by Coulter, Emanuel admitted that nitpickers would seize upon some of the exclusions.

Exclusions include national security matters, calls to hookers, drug dealers, “cousins” from back East, fast food orders, leaks to the news media, and some other miscellaneous categories. No stranger to those categories herself, Coulter pressed for an example of what would be released.

Emanuel offered the following summary:

  • Michelle Obama: Calls to Library of Congress at 9 PM on Sunday nights (homework assignments): 20
  • Barack Obama: Pings to Internet time server on the average day: 100

“Barack and Michelle don’t use any of the exclusions,” said Emanuel. “The logs are mostly of Barack and Michelle’s communications, the exclusions wipe out almost all of the rest of the traffic.”

(After the interview ended, Emanuel grinned and said, “You know, they really are that straight.” He also asked Ann to autograph a copy of Guilty.)
