Size Really Does Matter… « Another Word For It

Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

March 16, 2010

Size Really Does Matter…

Filed under: Information Retrieval,Recall,Searching,Semantic Diversity — Patrick Durusau @ 7:20 pm

…when you are evaluating the effectiveness of full-text searching. Twenty-five years ago, Blair and Maron, in “An evaluation of retrieval effectiveness for a full-text document-retrieval system,” established that size affects the predicted usefulness of full-text searching.

Blair and Maron used a then state-of-the-art litigation support database containing 40,000 documents, for a total of approximately 350,000 pages. Their results differ significantly from earlier, optimistic reports concerning full-text search retrieval. The earlier reports were based on sets of fewer than 750 documents.

The lawyers using the system thought they were obtaining, at a minimum, 75% of the relevant documents. The participants were astonished to learn they were recovering only 20% of the relevant documents.
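
To make the gap concrete, here is a small sketch of how recall is computed. The counts are illustrative assumptions, not figures from the Blair and Maron study; only the 75% and 20% ratios come from the text above.

```python
def recall(relevant_retrieved: int, relevant_total: int) -> float:
    """Fraction of all relevant documents that a search actually returned."""
    if relevant_total <= 0:
        raise ValueError("relevant_total must be positive")
    return relevant_retrieved / relevant_total


# Hypothetical counts, chosen only to reproduce the ratios in the post.
relevant_total = 1000      # assumed number of relevant documents in the collection
believed_retrieved = 750   # what the searchers thought they had found
actually_retrieved = 200   # what they had in fact found

believed = recall(believed_retrieved, relevant_total)
actual = recall(actually_retrieved, relevant_total)

# Relevant documents that were never seen at all.
missed = relevant_total - actually_retrieved
```

Under these assumed counts, four out of every five relevant documents stay in the collection unread, which is the point of the study: the searchers had no way to notice what they were missing.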

One of the reasons cited by Blair and Maron merits quoting:

The belief in the predictability of words and phrases that may be used to discuss a particular subject is a difficult prejudice to overcome…. Stated succinctly, it is impossibly difficult for users to predict the exact word, word combinations, and phrases that are used by all (or most) relevant documents and only (or primarily) by those documents…. (emphasis in original, page 295)

That sounds to me like users using different ways to talk about the same subjects.

Topic maps won’t help users to predict the “exact word, word combinations, and phrases.” However, they can be used to record mappings into document collections that collect the “exact word, word combinations, and phrases” used in relevant documents.

Topic maps can be used like the maps of early explorers, which became more precise with each new expedition.
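
One way to picture such a mapping is a table that records, for each subject, the expressions actually found in relevant documents, so that later searches can use all of them. The sketch below is only an illustration in Python, not a topic map engine; the class name SubjectMap, the subject label, and the sample phrases and documents are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


class SubjectMap:
    """Hypothetical record of the expressions used to discuss each subject."""

    def __init__(self) -> None:
        self._expressions: Dict[str, Set[str]] = defaultdict(set)

    def record(self, subject: str, expression: str) -> None:
        """Note that `expression` was seen identifying `subject` in a relevant document."""
        self._expressions[subject].add(expression.lower())

    def expressions_for(self, subject: str) -> Set[str]:
        """Return a copy of the expressions recorded so far for `subject`."""
        return set(self._expressions.get(subject, set()))

    def search(self, subject: str, documents: Dict[str, str]) -> List[str]:
        """Return ids of documents containing any recorded expression for the subject."""
        wanted = self.expressions_for(subject)
        return sorted(
            doc_id
            for doc_id, text in documents.items()
            if any(expr in text.lower() for expr in wanted)
        )


# Invented sample data: two ways of referring to the same event, plus noise.
documents = {
    "memo-1": "The accident on the loading dock was reported Tuesday.",
    "memo-2": "Regarding the unfortunate situation last week...",
    "memo-3": "Quarterly budget figures are attached.",
}

subject_map = SubjectMap()
subject_map.record("dock-accident", "accident")
first_pass = subject_map.search("dock-accident", documents)

# A later "expedition" discovers another phrase used for the same subject.
subject_map.record("dock-accident", "unfortunate situation")
second_pass = subject_map.search("dock-accident", documents)
```

The first pass finds only memo-1; once the second phrase is recorded, the same subject also leads to memo-2. Nothing was predicted in advance: the map simply keeps what each reading of the collection turned up.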

12 Comments

Hei Patrick.

So you are basically longing for the use of topic maps for query expansion? If so, I’d totally dissent from that. I don’t think that making keyword search fatter is the solution to the problem of finding more relevant documents.

Instead, we can take advantage of the ontology knowledge that comes with every sanely modeled topic map. Let’s not search for more keywords, let’s search for concepts!

Cheers,
Sven

Comment by Sven — March 19, 2010 @ 6:14 am
Not advocating making keywords fatter.

Am advocating that we make subject identifications fatter.

Not the same thing.

The common confusion has been that keywords = subject identifications.

True, but only in particular contexts. If we can capture the context in which a keyword is used, that is, have a complex (as opposed to a simple keyword) identification of subjects, then we can find subjects as they are identified by users.
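
A rough sketch of what a complex identification could look like: a keyword paired with the context that disambiguates it. This is an assumption about one possible representation, not anything defined by the topic maps standards; the Identification class, the subject labels, and the sample sentences are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Identification:
    """Hypothetical complex identification: a keyword plus its disambiguating context."""

    subject: str
    keyword: str
    context: FrozenSet[str]

    def matches(self, text: str) -> bool:
        """True when the keyword and at least one context word both occur in the text."""
        words = set(text.lower().split())
        return self.keyword in words and bool(self.context & words)


def identify(text: str, identifications: List[Identification]) -> Optional[str]:
    """Return the first subject whose keyword and context both appear, else None."""
    for ident in identifications:
        if ident.matches(text):
            return ident.subject
    return None


# The same keyword identifies different subjects in different contexts.
identifications = [
    Identification("planet-mercury", "mercury", frozenset({"orbit", "planet", "sun"})),
    Identification("element-mercury", "mercury", frozenset({"thermometer", "toxic", "liquid"})),
]

planet = identify("mercury has the shortest orbit around the sun", identifications)
element = identify("the thermometer contained liquid mercury", identifications)
unknown = identify("mercury was mentioned without further detail", identifications)
```

With the keyword alone, all three sentences look the same. With context attached, the first two resolve to different subjects and the third is left unidentified rather than guessed.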

Otherwise, we are authoring “sane” topic maps, which means we can only find information we have input into our topic maps.

See my post What The World Needs Now for the amount of electronically stored information as of 2011. We are never going to convert everything into topic maps so we had better learn how to view it as topic maps, including how users actually identify their subjects.

PS: Ontologies are ok. But different people have different ontologies. In my view topic maps can accommodate those different ontologies. Or as Nietzsche once said: “I am not bigoted enough for a system and not even for my system.”

Comment by Patrick Durusau — March 19, 2010 @ 6:40 am
Supplemental:

Recall that subject identification = URI was a simplification introduced for the XML syntax.

The topic maps standard reads:

The optional subject identity (identity) attribute refers to one or more indications (‘subject descriptors’) of the identity of the subject (the organizing principle) of the topic link. All of the other topic characteristics specified by the topic link are regarded as elaborating, and in no way contradicting, the subject described by the subject descriptor(s), if any. There are no restrictions on the kinds of information that may be referenced by an identity attribute.
(13250:2000, 5.2.1 Topic Link Architectural Form)

That remains a valid statement about topic maps.

Comment by Patrick Durusau — March 19, 2010 @ 7:17 am
I totally see your point now. Context was the word that made it clear. I had some thoughts on how to realize this, but figured it would blast this comments section. So I posted them on my own blog: http://semantosoph.net/2010/3/19/dude-where-s-my-context.

Looking forward to hearing your comments on that.

Comment by Sven — March 19, 2010 @ 10:09 am
Patrick’s quotation from ISO 13250:2000 moves me to share some
reflections that some of his readers may find stimulating.

The bulk of the information-interchange power of ISO 13250:2000
Topic Maps was deliberately omitted from the XTM Specification,
from which today’s Topic Maps Data Model (TMDM) is derived.
Among other topic mapping pioneers, including Michel Biezunski,
I encouraged and assisted in that specialization. Additional
specializations were introduced by well-intentioned implementers
of XTM, who had investments to protect, investors to please, and
products to move into the marketplace. Today, of all the parts
of the current state of ISO 13250, only its Reference Model
(Part 5) has a frame of reference that extends outside TMDM’s
relatively confining perimeter.

It’s important to me that our thinking not be confined to TMDM,
and many of us are still fascinated by the original scope of
13250:2000: facilitating the amalgamation of master indexes of
diverse corpora from partial indexes that are ontologically and
taxonomically independent of each other. By contrast, to use
today’s TMDM-oriented tools, one must view all information
through the taxonomic lens of TMDM. Gratifying though TMDM’s
success is, that success, and the scope of TMDM itself, is
dwarfed by the scope of what we were attempting to accomplish
when we drafted 13250:2000.

Patrick’s quotation from 13250:2000, and particularly the
sentence:

“There are no restrictions on the kinds of information that
may be referenced by an identity attribute.”

reminds me how much 13250:2000 depended on normative references
to the ISO/IEC 10744:1997 (and :1992) “HyTime” standard to convey
its intent. All of the kinds of information for which HyTime
defined interchange syntaxes are included in that sweeping “no
restrictions” statement, and those kinds of information were
very prominent in the minds of 13250’s drafters and reviewers.
(That prominence was no coincidence!)

Here’s one of the 24 normative references to HyTime in
13250:2000:

“The definitions provided in […] ISO/IEC 10744:1997
(including Amendment 1) shall apply to this International
Standard.”

So, in order to glimpse the intended scope of 13250:2000 with
any accuracy, I think one really needs to know a thing or two
about HyTime.

HyTime pioneered the idea of formally standardizing a way to
interchange — and to exploit in unanticipated contexts —
strategies for positively identifying, and for regarding as
exactly the same for all purposes of linking, scheduling, etc.,
certain classes of subjects of conversation, the classes being:

* information components,

* information addresses,

* abstract extents in n-dimensional finite coordinate spaces,

* mappings among such spaces,

* semantic-bearing relationships,

* namespaces,

* and more.

In HyTime, all of these subjects are notionally represented not
by “topic information items” (or “topic links” in 13250:2000),
but instead by nodes in “Graph Representations of Property
Values” called “groves”.

The HyTime “grove” idea establishes a way to endow the
components of information objects (etc.) with identities, and
with addresses that leverage those identities in whatever way(s)
may be desired. Groves enable all information components to be
addressed without having first to add metadata to them. For
example, in a grove of an SGML document, a given element is
addressable regardless of whether it has an ID attribute, and
adding an ID attribute to it only makes it addressable in yet
another way. And elements are just one kind of addressable
information component; in an extreme (and usually absurd) case,
a grove could include a node that represents a given whitespace
character in an XML start tag.
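
The addressing point can be illustrated with an ordinary XML parse tree. This is only an analogy: Python's xml.etree.ElementTree is not a HyTime grove and involves no Property Set, and the sample document and the two helper functions are hypothetical. It does show an element being reachable with no ID attribute, and an ID adding a second route to the same node.

```python
import xml.etree.ElementTree as ET

# Invented sample document: only the first section carries an id attribute.
doc = ET.fromstring(
    "<report>"
    "<section id='intro'><para>First.</para><para>Second.</para></section>"
    "<section><para>Third.</para></section>"
    "</report>"
)


def address_by_position(root, path):
    """Follow a sequence of zero-based child indexes from the root to a node."""
    node = root
    for index in path:
        node = node[index]
    return node


def address_by_id(root, wanted):
    """Find the element whose id attribute equals `wanted`, or None."""
    for node in root.iter():
        if node.get("id") == wanted:
            return node
    return None


# The second section has no id attribute, yet it is still addressable.
second_section = address_by_position(doc, [1])
third_para = address_by_position(doc, [1, 0])

# The first section is addressable in two ways: by id and by position.
intro_by_id = address_by_id(doc, "intro")
intro_by_position = address_by_position(doc, [0])
```

Both routes to the first section arrive at the same node object, so the ID is one more address for it, not a precondition for having one.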

“To build a grove” means “to view the information as a graph of
nodes constructed in whatever formal and deterministic ways meet
the requirements of whatever the intended applications may be.”
Everything — every syntactic and/or semantic component of an
instance of SGML or any other notation — can be endowed with
identity and addressability, and therefore it can play a role in
any kind of hyperlink. In the grove paradigm, the identity of a
component can be defined or addressed in terms of the identity
of any other component, or even in terms of the identities of
all of the other components and their relationships to it. Or
in any other way. In grove-land, you get to choose (and even to
design, if you like) how the whole information object will be
viewable as a graph. According to HyTime, the way in which you
are choosing to view it is formally defined by an
interchangeable “Property Set” — documentation about, and
structural constraint specifications on, the view of the
information that you have chosen to use. For the most part, a
Property Set defines classes of nodes. In a grove that conforms
to a given Property Set, each instance of each class represents
an instance of some corresponding class of subjects. And in a
grove, every subject is either a piece of information, or a
semantic derived in a defined manner from one or more subjects
that are pieces of addressable information. An example of the
latter is a property defined in the “HyTime Property Set” whose
value is, in effect, a dictionary of the nodes that are
addressed by other nodes in some specified corpus.

Groves were the prototypes of Topic Maps, and the Topic Maps
idea is no more or less than a generalization of the Grove idea.
Every grove node represents a very specific, formally-identified
subject of conversation. Indeed, the only differences between
HyTime Groves and Topic Maps, as defined in the Topic Maps
Reference Model (TMRM), are constraints on groves that are
relaxed in topic maps:

(1) In HyTime Groves, there are only a few valid property value
types, whereas in TMRM, the types of the values of
properties are unconstrained.

(2) In HyTime Groves, no grove (and no node’s properties) can be
defined by more than one monolithic Property Set, while in
TMRM, a given single node can have instances of properties
whose classes were defined independently, with no
cooperation or mutual understanding among their definers.

(3) A HyTime Grove may or may not be acyclic, but it is always
hierarchical in that there is always a root node. There is
no such constraint on a Topic Map. A Topic Map may or may
not be hierarchical, and it may or may not have a root node.

(4) Every node in a HyTime Grove represents a subject which is
always some piece of information. In a TMRM Topic Map,
there is no such constraint on the subjects that nodes can
represent.

(5) Every node in a HyTime Grove is an instance of some node
class that is defined in the grove’s Property Set. By
contrast, there are no node classes in the topic map graphs
described in TMRM. Or, maybe it would be clearer to say
that in TMRM there is only one node class, “subject proxy”,
and that all subject proxies are instances of it. In
effect, of course, there are *subject* classes in Topic
Maps, and the class membership(s) of each subject are
revealed by the classes and values of the properties of the
corresponding topic (aka “subject proxy”).

Thus, all HyTime groves are easily seen as TMRM Topic Maps; any
remaining differences are merely terminological. HyTime groves
consist of nodes, the nodes have properties, the properties are
instances of user-defined classes of properties, the property
classes are disclosed (the legend is the Property Set), and
every node’s purpose is to serve as a proxy for a single subject
of conversation.
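
As a rough sketch of differences (2) and (5), and of how a grove node can be viewed as a subject proxy, consider the toy model below. It is an illustration under loud assumptions, not an implementation of HyTime or of the TMRM: the classes PropertySet, GroveNode and SubjectProxy, the function as_proxy, and the "sgml" and "catalog" vocabularies are all hypothetical names.

```python
from typing import Any, Dict, Set


class PropertySet:
    """Hypothetical stand-in for a Property Set: node classes and their allowed properties."""

    def __init__(self, name: str, classes: Dict[str, Set[str]]) -> None:
        self.name = name
        self.classes = classes


class GroveNode:
    """A node governed by exactly one property set and one node class."""

    def __init__(self, property_set: PropertySet, node_class: str, **properties: Any) -> None:
        if node_class not in property_set.classes:
            raise ValueError(f"{node_class!r} is not defined in {property_set.name!r}")
        unknown = set(properties) - property_set.classes[node_class]
        if unknown:
            raise ValueError(f"properties not allowed on {node_class!r}: {sorted(unknown)}")
        self.property_set = property_set
        self.node_class = node_class
        self.properties = dict(properties)


class SubjectProxy:
    """A proxy that is only a bag of properties, with no node class of its own."""

    def __init__(self, properties: Dict[str, Any]) -> None:
        self.properties = dict(properties)


def as_proxy(node: GroveNode) -> SubjectProxy:
    """View a grove node as a proxy, qualifying each key with its property set name."""
    prefix = node.property_set.name
    props = {f"{prefix}:{key}": value for key, value in node.properties.items()}
    props[f"{prefix}:class"] = node.node_class
    return SubjectProxy(props)


sgml = PropertySet("sgml", {"element": {"gi", "id"}})
node = GroveNode(sgml, "element", gi="topic")

proxy = as_proxy(node)
# Properties from an independently defined vocabulary can sit on the same proxy.
proxy.properties["catalog:shelf"] = "B-12"
```

In this toy model the grove node rejects any property its single property set does not define, while the proxy accepts properties from any number of vocabularies. The node class survives the conversion only as one more property value.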

With all that in mind, let’s return to those words in
13250:2000:

“There are no restrictions on the kinds of information that
may be referenced by an identity attribute.”

This was a conscious reference to all of the identity- and
addressability-endowment power of the HyTime “grove” paradigm,
among all the other possibilities. The intent was to allow
Topic Map authors the freedom to decide not only what their
subjects are, but also the subject-identity-invoking techniques
embodied in the information referenced by the “identity”
attributes of <topic>s. The referenced information could be,
for example, a node in a grove, and thus the entire
semantic-loading and subject-sameness apparatus of HyTime
Property Sets, Grove Plans, Architectural Forms, Scheduling,
Mapping, Activity Policy Tracking, and much more could be
brought to bear, using any combination of HyTime modules.

The very next paragraph of 13250:2000, immediately after
Patrick’s quote, says:

“NOTE 18 The information referenced by an identity attribute
may or may not take the form of a topic link in a topic map
document…”

Among other things, this note underlines the idea that the
referenced subject descriptor’s context is important in
understanding the identity of the subject being invoked by the
reference. If the referent is a topic, then what is being
referenced is the *subject represented by the topic* —
something that may not be knowable without understanding the
referenced topic’s context in its own topic map. This idea is
further clarified later in 13250:2000:

“Similarly, if the identity attribute references one or more
topic links, topic map processing applications must regard
the referencing topic link, and all the referenced topic
links, as having one and the same subject, and therefore they
may all be merged.”
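
The effect of that rule can be sketched as grouping topic links that are connected by identity references. The Python below is a minimal illustration using a union-find structure; it covers only references from one topic link to another, ignores every other merging rule in the standard, and the function name merge_groups and the topic ids are hypothetical.

```python
from typing import Dict, List, Set


def merge_groups(identity_refs: Dict[str, List[str]]) -> List[Set[str]]:
    """Group topic ids that must share a subject because one references another."""
    parent = {topic: topic for topic in identity_refs}
    for targets in identity_refs.values():
        for target in targets:
            parent.setdefault(target, target)

    def find(topic: str) -> str:
        """Return the representative of the group containing `topic`."""
        while parent[topic] != topic:
            parent[topic] = parent[parent[topic]]  # path halving
            topic = parent[topic]
        return topic

    # A reference from one topic link to another puts both in the same group.
    for topic, targets in identity_refs.items():
        for target in targets:
            parent[find(topic)] = find(target)

    groups: Dict[str, Set[str]] = {}
    for topic in parent:
        groups.setdefault(find(topic), set()).add(topic)
    return sorted(groups.values(), key=lambda group: sorted(group))


# Hypothetical topic ids; each list holds the topic links named by an identity attribute.
identity_refs = {
    "t1": ["t2"],
    "t2": [],
    "t3": ["t2"],
    "t4": [],
}
groups = merge_groups(identity_refs)
```

Here t1 and t3 both point at t2, so all three end up in one group with one subject, while t4 stays on its own. Note that t1 and t3 never reference each other directly; the sameness is transitive through t2.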

But what if the topic map author needed to refer not to the
subject of some <topic>, but rather to the syntactic SGML
element that is that <topic>? That is, what if that particular
instance of a <topic> element was supposed to be the *subject*
of the referring topic?

The answer to this question was not explicit in 13250:2000, but
it was implicit in the “no restrictions” formula and in the
normative references to HyTime. I, among others, assumed that
the identity attribute would refer not to the ID of the <topic>
(because, according to 13250:2000, that would always be a
reference to the <topic>’s subject, as we have just seen), but
instead to that <topic>’s corresponding grove node. This works
because the subject of the grove node that represents the
<topic> element is not the subject being represented by the
<topic>, but rather the <topic> element itself, considered as an
instance of an SGML syntactic construct. The two referents (a
<topic> element vs. a grove node whose subject is the same
<topic> element) have different semantics by virtue of their
different contexts.

In the context of an SGML grove, the subject of a node is always
an instance of an SGML syntactic construct, and it can’t be
anything else. In the context of a Topic Map, however, the
subject of a node can be anything at all, including but not
limited to an instance of an SGML syntactic construct. It could
be something wildly different from an instance of a syntactic
construct, such as Minnie Mouse’s high-heeled shoes. You may
not be able to tell what the subject of a <topic> actually is
without looking more deeply at the topic map in which it
appears, because the identity of the <topic>’s subject may
depend on the identities of the subjects of other <topic>s in
that map, just as the identity of the subject of a grove node
may depend on the identities of the rest of the information
components that have been node-ified in the grove.

Steve Newcomb
24 March 2010

Comment by Steve Newcomb — March 24, 2010 @ 9:21 am
Ugh. My posting about HyTime groves and Topic Maps, above, has been rendered unreadable by the posting process. In each of the 15 places where I wrote

(the symbol for less-than) topic (the symbol for greater-than)

…nothing appears. Hmmm. No warning from WordPress, either, just silent scrozzling that I wouldn’t have discovered if I had not checked the actual posting. Until this is resolved, you can see what I actually wrote at http://www.coolheads.com/SRNPUBS/groveProgenitorOfTMs.txt — including the original punctuation and spacing, which WordPress also silently changed, without yielding a significant loss of clarity.

Steve Newcomb

Comment by Steve Newcomb — March 24, 2010 @ 10:04 am
Thanks, Patrick, for fixing the problem.

Comment by Steve Newcomb — March 25, 2010 @ 10:17 am
[…] large portion of relevant documents will go unfound. As much as 80% of the relevant documents. See Size Really Does Matter… (A study of full text searching but the underlying problem is the same: “What term was […]

Pingback by Semantic Compression « Another Word For It — June 26, 2010 @ 12:55 pm
[…] really isn’t that hard to guess some of them. I blogged about Blair and Maron saying twenty-five years ago: Stated succinctly, it is impossibly difficult for users to predict […]

Pingback by Is search passé? « Another Word For It — August 30, 2010 @ 4:51 pm
[…] Blair and Maron and lawyers thinking they were getting 75% of the relevant resources (reality was 20%). What are […]

Pingback by LEGAL INFORMATION SYSTEMS & LEGAL INFORMATICS RESOURCES « Another Word For It — October 29, 2010 @ 5:59 am
[…] An evaluation of retrieval effectiveness for a full-text document-retrieval system (see Size Really Does Matter…, which was published in 1985. If more than twenty-five years later, some researchers are not yet […]

Pingback by Watson – Indexing – Human vs. Computer « Another Word For It — March 28, 2011 @ 10:07 am
[…] will remember in Size Really Does Matter… that Blair and Maron reported that lawyers over estimated their accuracy in document retrieval by […]

Pingback by Confidence Bias: Evidence from Crowdsourcing « Another Word For It — November 4, 2011 @ 6:10 pm
