Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

June 4, 2010

representing scientific discourse, or: why triples are not enough

Filed under: Classification,Indexing,Information Retrieval,Ontology,RDF,Semantic Web — Patrick Durusau @ 4:15 pm

representing scientific discourse, or: why triples are not enough by Anita de Waard, Disruptive Technologies Director (how is that for a cool title?), Elsevier Labs, merits a long look.

I won’t spoil the effect by trying to summarize the presentation.  It is only 23 slides long.

Read those slides carefully and then get yourself to: Rhetorical Document Structure Group HCLS IG W3C. Read, discuss, contribute.

PS: Based on this slide pack I am seriously thinking of getting a Twitter account so I can follow Anita. Not saying I will but am as tempted as I have ever been. This looks very interesting. Fertile ground for discussion of topic maps.

May 19, 2010

Context of Data?

Filed under: Context,Data Integration,Information Retrieval,Researchers — Patrick Durusau @ 6:02 am

Cristiana Bolchini and others in And What Can Context Do For Data? have started down an interesting path for exploration.

That all data exists in some context is an unremarkable observation, until one considers how rarely that context is explicitly stated or attached to the data, to say nothing of being used to filter or access that data.

Bolchini introduces the notion of a context dimension tree (CDT) which “models context in terms of a set of context dimensions, each capturing a different characteristic of the context.” (CACM, Nov. 2009, page 137) Note that dimensions can be decomposed into sub-trees for further analysis. Further operations combine these dimensions into the “context” of the data that is used to produce a particular view of the data.
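As a rough illustration (my sketch, not Bolchini's actual formalism), a context dimension tree can be modeled as dimensions whose chosen values combine into a context, with some values enabling further sub-dimensions:

```python
# A minimal sketch of a context dimension tree (CDT), assuming a
# simplified model: each dimension has a name and possible values,
# and a context is one value chosen per (enabled) dimension.
# Illustrative only; not Bolchini et al.'s actual formalism.

class Dimension:
    def __init__(self, name, values, subtrees=None):
        self.name = name
        self.values = values              # e.g. ["doctor", "patient"]
        self.subtrees = subtrees or {}    # value -> list of sub-Dimensions

def build_context(choices, dimensions):
    """Combine per-dimension choices into a context (a plain dict)."""
    context = {}
    for dim in dimensions:
        value = choices.get(dim.name)
        if value is not None:
            context[dim.name] = value
            # Recurse into the sub-dimensions this value enables
            context.update(build_context(choices, dim.subtrees.get(value, [])))
    return context

role = Dimension("role", ["doctor", "patient"],
                 subtrees={"doctor": [Dimension("specialty",
                                                ["cardiology", "oncology"])]})
time = Dimension("time", ["day", "night"])

ctx = build_context({"role": "doctor", "specialty": "cardiology",
                     "time": "night"}, [role, time])
print(ctx)  # {'role': 'doctor', 'specialty': 'cardiology', 'time': 'night'}
```

The resulting context dict is what a system would then use to select the view of the data appropriate to that context.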

Not quite what is meant by scope in topic maps but something a bit more nuanced and subtle. I would argue (no surprise) that the context of a subject is part and parcel of its identity. And how much of that context we choose to represent will vary from project to project.

Further reading:

Bolchini, C., Curino, C. A., Quintarelli, E., Tanca, L. and Schreiber, F. A. A data-oriented survey of context models. SIGMOD Record, 2007.

Bolchini, C., Quintarelli, E. and Rossato, R. Relational data tailoring through view composition. In Proc. Intl. Conf. on Conceptual Modeling (ER 2007). LNCS. Nov. 2007.

Context-ADDICT (it’s an acronym, I swear!) is the website for the project developing this line of research. Prototype software is available.

May 15, 2010

Semantic Indexing

Filed under: Authoring Topic Maps,Indexing,Information Retrieval,Semantics — Patrick Durusau @ 6:41 pm

Semantic indexing and searching using a Hopfield net

Automatic creation of thesauri as a means of dealing with the “vocabulary problem.”

Another topic map construction tool.

A bit dated (1997), but I will run this line of research forward and report back.

With explicit subject identity, machine generated thesauri could be reliably interchanged.

And improved upon by human users.
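As a crude sketch of the idea behind such automatic thesauri (the co-occurrence intuition, not the paper's actual Hopfield-net spreading activation), terms that frequently appear together become thesaurus candidates for one another:

```python
# A crude sketch of automatic thesaurus construction from term
# co-occurrence, the intuition underlying the Hopfield-net approach.
# Illustrative only; the paper's algorithm uses spreading activation
# over a trained network, not this direct counting.
from collections import Counter
from itertools import combinations

docs = [
    ["topic", "map", "subject", "identity"],
    ["topic", "map", "merging", "subject"],
    ["index", "subject", "retrieval"],
]

# Count how often each unordered pair of terms shares a document
cooc = Counter()
for doc in docs:
    for a, b in combinations(sorted(set(doc)), 2):
        cooc[(a, b)] += 1

def related(term, k=3):
    """Terms most often co-occurring with `term` -- thesaurus candidates."""
    scores = Counter()
    for (a, b), n in cooc.items():
        if a == term:
            scores[b] += n
        elif b == term:
            scores[a] += n
    return [t for t, _ in scores.most_common(k)]

print(related("subject"))  # e.g. ['map', 'topic', 'identity']
```

With explicit subject identity attached to entries like these, the generated thesaurus could be interchanged and merged with others.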

May 3, 2010

Search User Interfaces: Chapter 1 (Part 1)

Chapter 1, “The Design of Search User Interfaces,” of Hearst’s Search User Interfaces surveys searching and related issues from a user interface perspective.

I needed the reminders about the need for simplicity in search interfaces and the shift in search interface design. (sections 1.1 – 1.2) If you think you have a “simple” interface for your topic map, read those two sections. Then read them again.

Design principles for user interface design (sections 1.3 – 1.4) is a good overview, contrasting user-centered design with design where developers decide what users need. (Which one did you use?)

Feedback from search interfaces (section 1.5) ranges from two-dimensional representation of items as icons (against) to highlighting query terms, sorting, and query term suggestions (generally favorable).

Let’s work towards having interfaces that are as attractive to users as our topic map applications are good at semantic integration.

April 24, 2010

Explicit Semantic Analysis

Filed under: Classification,Data Integration,Information Retrieval,Semantics — Patrick Durusau @ 7:58 am

Explicit Semantic Analysis looks like another tool for the topic maps toolkit.

Not 100% accurate but close enough to give a topic map project involving a serious amount of text a running start.

Start with Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis by Evgeniy Gabrilovich and Shaul Markovitch.

There are 55 citations of this work (as of 2010-04-24), ranging from Geographic Information Retrieval and Beyond the Stars: Exploiting Free-Text User Reviews for Improving the Accuracy of Movie Recommendations (2009) to Explicit Versus Latent Concept Models for Cross-Language Information Retrieval.
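As a minimal sketch of the ESA idea, assuming a toy "concept space" where each concept (standing in for a Wikipedia article) is just a bag of words: a text becomes a weighted vector over concepts, and relatedness is the cosine of two such vectors.

```python
# A minimal sketch of Explicit Semantic Analysis (ESA) over a toy
# concept space.  In the real system, concepts are Wikipedia articles
# and weights come from TF-IDF; here, weight = simple word overlap.
import math

concepts = {
    "COMPUTER SCIENCE": ["algorithm", "computer", "program", "data"],
    "INDIA": ["india", "country", "delhi", "population"],
    "LANGUAGE": ["language", "grammar", "word", "speech"],
}

def esa_vector(text):
    """Weight of each concept = overlap between text and concept words."""
    words = text.lower().split()
    return {c: sum(w in cw for w in words) for c, cw in concepts.items()}

def cosine(u, v):
    dot = sum(u[c] * v[c] for c in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

a = esa_vector("a program is written in a programming language")
b = esa_vector("grammar describes the structure of a language")
print(cosine(a, b))  # ~0.707: both texts load on the LANGUAGE concept
```

The point is that relatedness is computed in the explicit concept space, not over latent factors, which is what makes the representation explainable.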

I encountered this line of work while reading Combining Concept Based and Text Based Indexes for CLIR by Philipp Sorg and Philipp Cimiano (slides) from the 2009 Cross Language Evaluation Forum. (For any search engines: CLIR = Cross-Language Information Retrieval.) The Cross Language Evaluation Forum link is a general one because the site does not expose direct links to its resources.

Quibble:

Evgeniy Gabrilovich and Shaul Markovitch say that:

We represent texts as a weighted mixture of a predetermined set of natural concepts, which are defined by humans themselves and can be easily explained. To achieve this aim, we use concepts defined by Wikipedia articles, e.g., COMPUTER SCIENCE, INDIA, or LANGUAGE.

and

The choice of encyclopedia articles as concepts is quite natural, as each article is focused on a single issue, which it discusses in detail.

Their use of “natural,” which I equate in academic writing to “…a miracle occurs…,” drew my attention. There are things we choose to treat as concepts or even subject representatives, but that hardly makes them “natural.” Most academic articles would claim (whether true or not) to be “…focused on a single issue, which it discusses in detail.”

Rather than “natural concepts,” call them what they are: the headers of Wikipedia texts. That is more accurate and lays the groundwork for investigating the nature and length of headers and their impact on semantic mapping and information retrieval.

April 22, 2010

A Missing Step?

I happened across a guide to study and writing research papers that I had as an undergraduate. Looking back over it, I noticed there is a step in the research process that is missing from search engines. Perhaps by design, perhaps not.

After choosing a topic, you did research in a variety of print resources to gather material for the paper. As you gathered it, you wrote down each piece of information on a note card, along with the full bibliographic information for its source.

When you were writing the paper, you did not consult the original sources but rather the sub-set of those sources captured on your note cards.

In group research projects, we exchanged note cards so that everyone had access to the same sub-set of materials that we had found.

Bibliographic software mimics the note-card-based process, but my question is: why is that capacity missing from search interfaces?

That seems to be a missing step.  I don’t know if it is missing by design, i.e., it is cheaper to let everyone look for the same information over and over, or if it is missing in anticipation of bibliographic software filling the gap.

Search interfaces need to offer ways for us to preserve and share our research results with others.

Topic maps would be a good way to offer that sort of capability.

April 15, 2010

What Is Your TFM (To Find Me) Score?

Filed under: Information Retrieval,Recall,Search Engines,Subject Identity — Patrick Durusau @ 10:54 am

I have talked about TFM (To Find Me) scores before. Take a look at How Can I Find Thee? Let me count the ways… for example.

So, you have looked at your OPAC, database, RDF datastore, topic map. What is your average TFM score?

What do you think it needs to be for 60 to 80% retrieval?

The Furnas article from 1983 is the key to this series of posts. See the full citation in Are You Designing a 10% Solution?.

Would you believe 15 ways to identify a subject? Or 15 aliases, to use the common terminology.

Say it slowly: 15 ways to identify a subject gets, on average, 60 to 80% retrieval. If you are in the range of 3 – 5 ways to identify a subject on your ecommerce site, you are leaving money on the table. Lots of money on the table.

Want to leave less money on the table? Use topic maps and try for 15 aliases for a subject or more.
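Computing an average TFM score is just counting the names you have recorded per subject. A quick sketch, with a hypothetical catalog for illustration:

```python
# Average "TFM" (To Find Me) score: the average number of names or
# identifiers (aliases) recorded per subject in a catalog.
# The catalog below is hypothetical, for illustration only.
catalog = {
    "laptop-x200": ["x200", "thinkpad x200", "lenovo x200", "notebook x200"],
    "usb-cable":   ["usb cable", "usb cord"],
    "hdmi-cable":  ["hdmi cable", "hdmi cord", "hdmi lead"],
}

tfm = sum(len(aliases) for aliases in catalog.values()) / len(catalog)
print(f"average TFM score: {tfm:.1f}")  # 3.0 here; Furnas suggests aiming near 15
```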

April 12, 2010

Topic Maps and the “Vocabulary Problem”

To situate topic maps in a traditional area of IR (information retrieval), try the “vocabulary problem.”

Furnas describes the “vocabulary problem” as follows:

Many functions of most large systems depend on users typing in the right words. New or intermittent users often use the wrong words and fail to get the actions or information they want. This is the vocabulary problem. It is a troublesome impediment in computer interactions both simple (file access and command entry) and complex (database query and natural language dialog).

In what follows we report evidence on the extent of the vocabulary problem, and propose both a diagnosis and a cure. The fundamental observation is that people use a surprisingly great variety of words to refer to the same thing. In fact, the data show that no single access word, however well chosen, can be expected to cover more than a small proportion of users’ attempts. Designers have almost always underestimated the problem and, by assigning far too few alternate entries to databases or services, created an unnecessary barrier to effective use. Simulations and direct experimental tests of several alternative solutions show that rich, probabilistically weighted indexes or alias lists can improve success rates by factors of three to five.

Furnas, G. W., Landauer, T. K., Gomez, L. M. and Dumais, S. T. The Vocabulary Problem in Human-System Communication. Communications of the ACM, 1987.

Substitute “topic maps” for “probabilistically weighted indexes or alias lists.” (Techniques we are going to talk about in connection with topic map authoring.)
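The "alias list" cure Furnas et al. propose can be sketched very simply: map the many terms users actually type to one canonical entry, instead of requiring the one "right" word. (Hypothetical vocabulary, for illustration.)

```python
# A sketch of an alias list: many user terms normalize to one
# canonical command.  Vocabulary is hypothetical, for illustration.
aliases = {
    "rm": "delete", "erase": "delete", "remove": "delete",
    "del": "delete", "kill": "delete",
}

COMMANDS = {"delete", "copy", "move"}

def lookup(term):
    term = aliases.get(term, term)   # normalize via the alias list
    return term if term in COMMANDS else None

user_attempts = ["rm", "erase", "remove", "del", "kill", "delete"]
hits = sum(lookup(t) is not None for t in user_attempts)
print(f"{hits}/{len(user_attempts)} attempts succeed with the alias list")
# Without the alias list, only "delete" itself would succeed (1 of 6).
```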

Three to five times greater success is an incentive to use topic maps.

Marketing Department Summary

Customers can’t buy what they can’t find. Topic maps help customers find purchases, which increases sales. (Be sure to track pre- and post-topic-map sales results, so marketing can’t successfully claim the increases are due to their efforts.)

March 16, 2010

Size Really Does Matter…

Filed under: Information Retrieval,Recall,Searching,Semantic Diversity — Patrick Durusau @ 7:20 pm

…when you are evaluating the effectiveness of full-text searching. Twenty-five years ago, Blair and Maron, in An evaluation of retrieval effectiveness for a full-text document-retrieval system, established that size affects the predicted usefulness of full-text searching.

Blair and Maron used a then state-of-the-art litigation-support database containing 40,000 documents, approximately 350,000 pages in total. Their results differed significantly from earlier, optimistic reports concerning full-text retrieval. The earlier reports were based on sets of fewer than 750 documents.

The lawyers using the system thought they were obtaining, at a minimum, 75% of the relevant documents. The participants were astonished to learn they were recovering only 20% of the relevant documents.

One of the reasons cited by Blair and Maron merits quoting:

The belief in the predictability of words and phrases that may be used to discuss a particular subject is a difficult prejudice to overcome….Stated succinctly, it is impossibly difficult for users to predict the exact word, word combinations, and phrases that are used by all (or most) relevant documents and only (or primarily) by those documents….(emphasis in original, page 295)

That sounds to me like users using different ways to talk about the same subjects.

Topic maps won’t help users predict the “exact word, word combinations, and phrases.” However, they can be used to record mappings into document collections that collect up the “exact word, word combinations, and phrases” used in relevant documents.

Topic maps can be used like the maps of early explorers, which became more precise with each new expedition.

March 1, 2010

Is 00.7% of Relevant Documents Enough?

Filed under: Information Retrieval,Searching — Tags: , , , — Patrick Durusau @ 9:04 am

Searching for implication, that is p implies q, I got:

  • “q whenever p” – 44,200 “hits” (00.7%)
  • “p is sufficient for q” – 385,000 “hits” (6%)
  • “p implies q” – 506,000 “hits” (8%)
  • “if p, then q” – 2,189,000 “hits” (36%)
  • “q if p” – 2,920,000 “hits” (48%)
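The percentages above are just each phrasing's share of the combined hit counts:

```python
# How the percentages above are computed: each phrasing's share of
# the combined hit counts for the five phrasings of "p implies q".
hits = {
    "q whenever p":          44_200,
    "p is sufficient for q": 385_000,
    "p implies q":           506_000,
    "if p, then q":          2_189_000,
    "q if p":                2_920_000,
}
total = sum(hits.values())
for phrase, n in hits.items():
    print(f"{phrase!r}: {100 * n / total:.1f}%")
# "q whenever p" comes out at roughly 0.7% of all hits.
```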

What if the search was for a “smoking gun” sort of document during legal discovery? Or searching for the latest treatment for a patient dying in ER? Or engineering literature to avoid what could be a fatal flaw in a part that will go into hundreds of airplanes? Hmmm, 00.7% results don’t look all that attractive.

It isn’t possible to know what percentage of relevant documents your query returned for a document set of any size. Your query might be the 48% query but it could also be the 00.7% query.

To make matters worse, the 00.7% figure could itself be optimistic: it assumes that those five queries, together, return *all* the relevant documents.

The problem is that different users identify the same subjects in different ways. Or use the same identifications for different subjects. Matters get worse the more users that produce documents that need to be searched.

Available options include:

  1. Create new identifiers and ignore previous ones
  2. Create new identifiers and map previous ones
  3. Map identifiers people already use

This blog will explore all three and why I prefer the last one.

