Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 18, 2011

Complex Indexing?

Filed under: Indexing,Subject Identity,Topic Maps — Patrick Durusau @ 6:52 pm

The post The Joy of Indexing made me think about the original use case for topic maps, the merging of indexes prepared by different authors.

Indexing that relies either on a token in the text (simple indexing) or on a contextual clue (the compound indexing mentioned in the Joy of Indexing post) falls short in terms of enabling the merging of indexes.

Why?

In my comments on the Joy of Indexing I mentioned that what we need is a subject indexing engine.

That is, an engine that indexes the subjects that appear in a text, not merely the manner of their appearance.

(Jack Park, topic map advocate and my friend, would say I am hand-waving at this point, so perhaps an example will help.)

Say that I have a text where I use the words George Washington.

That could be a reference to the first president of the United States or it could be a reference to George Washington rabbit (my wife is a children’s librarian).

A simple indexing engine could not distinguish one from the other.

A compound indexing engine might list one under Presidents and the other under Characters, but without more context in the example we don’t know for sure.

A complex indexing engine, that is, one that takes into account more than simply the token in the text, say by creating its entry from that token plus other attributes of the subject it represents, would not mistake a president for a rabbit or vice versa.
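
To make that difference concrete, here is a minimal sketch in Java of what each kind of engine might record for the same words. The keys, headings, document numbers, and offsets are purely illustrative, not taken from any actual index.

public class IndexKinds {
    // Illustrative posting: an index key plus where the token occurs.
    record Entry(String key, int docId, int offset) {}

    public static void main(String[] args) {
        // Simple indexing: the bare token is the key, so the president and the rabbit collide.
        Entry simple = new Entry("george washington", 1, 120);

        // Compound indexing: a contextual heading distinguishes them, but only
        // within this one index's choice of headings.
        Entry president = new Entry("Presidents > George Washington", 1, 120);
        Entry rabbit    = new Entry("Characters > George Washington", 2, 87);

        // Complex indexing: the key carries attributes of the subject itself,
        // not just the heading this particular index filed it under.
        Entry subject = new Entry("George Washington {class=Mammalia, order=Primate}", 1, 120);

        System.out.println(simple);
        System.out.println(president);
        System.out.println(rabbit);
        System.out.println(subject);
    }
}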

Take Lucene for example. For any word in a text, it records:

The position increment, start, and end offsets and payload are the only additional metadata associated with the token that is recorded in the index.

That pretty much puts the problem in a nutshell. If that is all the metadata we get, which isn’t much, the likelihood that we are going to do any reliable subject matching is pretty low.

Not to single Lucene out, I think all the search engines operate pretty much the same way.
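
For the curious, this is roughly how that per-token metadata surfaces in Lucene’s analysis API. A minimal sketch, assuming a recent Lucene release; the field name and sample text are my own.

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class TokenMetadata {
    public static void main(String[] args) throws Exception {
        try (StandardAnalyzer analyzer = new StandardAnalyzer();
             TokenStream ts = analyzer.tokenStream("body",
                     new StringReader("George Washington was the first president"))) {

            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            OffsetAttribute offsets = ts.addAttribute(OffsetAttribute.class);
            PositionIncrementAttribute posIncr = ts.addAttribute(PositionIncrementAttribute.class);

            ts.reset();
            while (ts.incrementToken()) {
                // The per-token metadata the quote describes: term text,
                // character offsets, and position increment (payloads are optional).
                System.out.printf("%-12s start=%d end=%d posIncr=%d%n",
                        term.toString(), offsets.startOffset(), offsets.endOffset(),
                        posIncr.getPositionIncrement());
            }
            ts.end();
        }
    }
}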

To return to our example: what if, while indexing, when we encounter George Washington, instead of the bare token we record, respectively:

George Washington – Class = Mammalia

George Washington – Class = Mammalia

Hmmm, that didn’t help much did it?

How about:

George Washington – Class = Mammalia Order = Primate

George Washington – Class = Mammalia Order = Lagomorpha

So now I can distinguish these two cases, but I can also ask for all instances of Class = Mammalia.
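
Sticking with Lucene purely for illustration, here is a minimal sketch of that idea, assuming a recent Lucene release: each George Washington becomes a document carrying declared class and order fields alongside the name as it appears in the text. The field names and helper method are mine, not anything Lucene or topic maps prescribe.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class SubjectIndex {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();

        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            writer.addDocument(subject("George Washington", "Mammalia", "Primate"));
            writer.addDocument(subject("George Washington", "Mammalia", "Lagomorpha"));
        }

        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);

            // The order field separates the president from the rabbit ...
            int rabbits = searcher.search(new TermQuery(new Term("order", "Lagomorpha")), 10).scoreDocs.length;

            // ... while the class field still lets us ask for every mammal at once.
            int mammals = searcher.search(new TermQuery(new Term("class", "Mammalia")), 10).scoreDocs.length;

            System.out.println("order = Lagomorpha: " + rabbits + " hit(s)");
            System.out.println("class = Mammalia:   " + mammals + " hit(s)");
        }
    }

    // One subject per document: the token as it appears plus declared attributes.
    static Document subject(String name, String taxClass, String order) {
        Document doc = new Document();
        doc.add(new TextField("name", name, Field.Store.YES));         // the words in the text
        doc.add(new StringField("class", taxClass, Field.Store.YES));  // attribute of the subject
        doc.add(new StringField("order", order, Field.Store.YES));     // attribute of the subject
        return doc;
    }
}

Using StringField for the class and order fields keeps them un-analyzed, so queries match the declared attribute values exactly rather than whatever the tokenizer makes of them.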

Of course the trick is that no automated system is likely to make that sort of judgement reliably, at least left to its own devices.

But it doesn’t have to does it?

Imagine that I am interested in U.S. history and want to prepare an index of the Continental Congress proceedings. I could simply create an index by tokens, but that will encounter all the problems we know come from merging indexes, or from searching across tokens as seen by such indexes. See Google for example.

But what if I indexed the Continental Congress proceedings using more complex tokens? Ones that had multiple properties that could be indexed for one subject and that could exist in relationship to other subjects?

That is, what if, for some body of material, I declared the subjects that would be identified and what would be known about them post-identification?

A declarative model of subject identity. (There are other, equally legitimate, models of identity, that I will be covering separately.)
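
As a rough sketch of what such a declaration might look like (every identifier, property, and relationship below is hypothetical, invented only to illustrate the idea):

import java.util.List;
import java.util.Map;

public class DeclaredSubjects {
    // A declared subject: an identifier, the properties that identify it,
    // and the relationships it can stand in to other declared subjects.
    record SubjectDecl(String id, Map<String, String> identifyingProperties, List<String> relatedTo) {}

    // Declared up front, before any indexing of the material begins (hypothetical values).
    static final List<SubjectDecl> DECLARED = List.of(
            new SubjectDecl("george-washington-president",
                    Map.of("class", "Mammalia", "order", "Primate", "role", "first U.S. president"),
                    List.of("continental-congress")),
            new SubjectDecl("george-washington-rabbit",
                    Map.of("class", "Mammalia", "order", "Lagomorpha", "role", "storybook character"),
                    List.of()));

    public static void main(String[] args) {
        DECLARED.forEach(System.out::println);
    }
}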

More on the declarative model anon.
