Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

April 5, 2010

Are You Designing a 10% Solution?

Filed under: Full-Text Search, Heterogeneous Data, Recall, Search Engines — Patrick Durusau @ 8:28 pm

The most common feature on webpages is the search box. It is supposed to help readers find information, products, or services; in other words, to help the reader, or your cash flow.

How effective is text searching? How often will your reader use the same word as your content authors for some object, product, or service? Survey says: 10 to 20%!*

So the next time you insert a search box on a webpage, you or your client may be missing 80 to 90% of the potential readers or customers. Ouch!

Unlike the imaginary world of universal and unique identifiers, the odds of users choosing the same words have been established by actual research.

The data sets were:

  • verbs used to describe text-editing operations
  • descriptions of common objects, similar to the PASSWORD™ game
  • superordinate category names for swap-and-sale listings
  • main-course cooking recipes

There are a number of interesting aspects to the study that I will cover in future posts, but the article offers the following assessment of text searching:

We found that random pairs of people use the same word for an object only 10 to 20 percent of the time.

This research is relevant to all information retrieval systems, from online stores to library catalogs, whether you are searching simple text, RDF, or even topic maps. Ask yourself or your users: is a 10% success rate really enough?
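To make that figure concrete, here is a toy simulation in Python (my own sketch, not the Furnas methodology; the objects and name frequencies are invented purely for illustration) of how often an author's indexed term and a searcher's query term for the same object coincide:

    import random

    # Toy model: each object can be named several ways; the frequencies
    # below are made up for illustration, not taken from the study.
    OBJECT_NAMES = {
        "photocopier": {"copier": 0.4, "photocopier": 0.3, "xerox": 0.2, "copy machine": 0.1},
        "sofa":        {"couch": 0.5, "sofa": 0.35, "settee": 0.1, "loveseat": 0.05},
        "sneakers":    {"sneakers": 0.4, "trainers": 0.3, "running shoes": 0.2, "tennis shoes": 0.1},
    }

    def pick_name(freqs):
        """Draw one name for an object according to its (made-up) frequencies."""
        names, weights = zip(*freqs.items())
        return random.choices(names, weights=weights)[0]

    def agreement_rate(trials=100_000):
        """How often do two random people use the same word for the same object?"""
        hits = 0
        for _ in range(trials):
            freqs = random.choice(list(OBJECT_NAMES.values()))
            # one draw for the content author, one for the searcher
            if pick_name(freqs) == pick_name(freqs):
                hits += 1
        return hits / trials

    print(f"Same word chosen by both: {agreement_rate():.0%} of the time")
    # With these invented frequencies agreement comes out around 30%;
    # with the longer, flatter naming distributions the study observed,
    # it falls into the 10-20% range quoted above.

The exact numbers do not matter; the point is that a single indexed term per object caps your hit rate at whatever the naming agreement happens to be.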

(There are ways to improve that 10% score. More on those to follow.)
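For the impatient, here is a minimal sketch of one common remedy, query-time alias expansion, in the spirit of the "unlimited aliasing" Furnas and colleagues studied. The synonym table is invented for illustration; in practice it would be built from query logs, a thesaurus, or aliases supplied by users over time:

    # Minimal sketch of query-time alias expansion (illustrative table only).
    ALIASES = {
        "couch": {"sofa", "settee", "loveseat"},
        "copier": {"photocopier", "xerox", "copy machine"},
    }

    def expand(query):
        """Return the query term plus every alias we know for it."""
        return {query} | ALIASES.get(query, set())

    # A page indexed only under "sofa" is now reachable by a search for "couch":
    print(expand("couch"))  # contains 'couch', 'sofa', 'settee', 'loveseat' (set order varies)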

*Furnas, G. W., Landauer, T. K., Gomez, L. M., and Dumais, S. T. (1983) “Statistical semantics: Analysis of the potential performance of keyword information access systems.” Bell System Technical Journal, 62, 1753-1806. Reprinted in: Thomas, J. C., and Schneider, M. L., eds. (1984) Human Factors in Computer Systems. Norwood, New Jersey: Ablex Publishing Corp., 187-242.

3 Comments

  1. All true for “classical” (read: old) fulltext engines. Modern machine learning techniques also cope with “semantically close” matches, given that there is enough text to analyse upfront.

    But, true, even then users are left guessing what the text corpus can offer, and that – by itself – is cruel. The alternative is to “show them what they will get” (e.g. http://kill.devc.at/node/317).

    What I do there is to give a semantic seed and THEN run the machine learning over that.

    Comment by Robert Barta — April 10, 2010 @ 2:19 am

  2. Good point! Although I am curious about the accuracy of “semantically close.” Or to put it differently, close in whose view?

    Moreover, as Blair and Maron (Size Really Does Matter…) point out, there are “matches” that do not appear to be “semantically close,” with authors using completely disjoint terminology for the same subject matter.

    Granting that has improved in some circles, but would you believe that modern legal discovery still depends upon the selection of keywords for Boolean searches? (Internally I am sure they do better, but this is in terms of the actual discovery process, where finding letters signed by the current pope, that sort of thing, would be important.)

    Having said all that, your application for browsing a text archive is very impressive. I think it would suit the way I look for information, but I don’t know how a more general population would react to it.

    I suppose the other question would be performance if the interface were over, say, the CiteSeer data set. Would I have to pick semantic dimension(s) for display so that the “field,” so to speak, would not be too cluttered to be useful?

    Comment by Patrick Durusau — April 10, 2010 @ 7:21 am

  3. […] The Furnas article from 1983 is the key to this series of posts. See the full citation in Are You Designing a 10% Solution?. […]

    Pingback by What Is Your TFM (To Find Me) Score? « Another Word For It — May 23, 2010 @ 7:17 pm
