Size Really Does Matter…

…when you are evaluating the effectiveness of full-text searching. Twenty-five years ago, Blair and Maron, in An evaluation of retrieval effectiveness for a full-text document-retrieval system, established that size affects the predicted usefulness of full-text searching.

Blair and Maron used a then state-of-the-art litigation support database containing 40,000 documents, for a total of approximately 350,000 pages. Their results differ significantly from earlier, optimistic reports concerning full-text search retrieval. The earlier reports were based on sets of fewer than 750 documents.

The lawyers using the system thought they were obtaining, at a minimum, 75% of the relevant documents. The participants were astonished to learn they were recovering only 20% of the relevant documents.

One of the reasons cited by Blair and Maron merits quoting:

The belief in the predictability of words and phrases that may be used to discuss a particular subject is a difficult prejudice to overcome….Stated succinctly, it is impossibly difficult for users to predict the exact word, word combinations, and phrases that are used by all (or most) relevant documents and only (or primarily) by those documents….(emphasis in original, page 295)

That sounds to me like users using different ways to talk about the same subjects.

Topic maps won’t help users to predict the “exact word, word combinations, and phrases.” However, they can be used to record mappings into document collections that collect the “exact word, word combinations, and phrases” used in relevant documents.
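As a rough illustration of that idea, the sketch below compares recall for a single guessed term against recall when the search uses phrases already recorded for a subject. Everything in it is hypothetical: the toy documents, the relevance labels, the subject name `dock-accident`, and the helpers `search` and `recall` are assumptions made for the example, not data or methods from Blair and Maron, and the plain dictionary only stands in for whatever a real topic map would record.

```python
# Hypothetical toy corpus: document id -> text. Not data from the study.
DOCS = {
    "d1": "The accident on the loading dock delayed shipment.",
    "d2": "After the unfortunate situation last week, shipments stopped.",
    "d3": "The incident report covers the event at the dock.",
    "d4": "Quarterly sales figures were strong.",
}

# Documents a reviewer judged relevant to the subject (assumed labels).
RELEVANT = {"d1", "d2", "d3"}

# A recorded mapping: one subject and the phrases found to identify it
# in relevant documents. New phrases can be appended as they are found.
SUBJECT_PHRASES = {
    "dock-accident": ["accident", "unfortunate situation", "incident"],
}


def search(phrases):
    """Return ids of documents containing any of the phrases (case-insensitive)."""
    found = set()
    for doc_id, text in DOCS.items():
        lowered = text.lower()
        if any(phrase.lower() in lowered for phrase in phrases):
            found.add(doc_id)
    return found


def recall(retrieved, relevant):
    """Fraction of the relevant documents that were actually retrieved."""
    return len(retrieved & relevant) / len(relevant)


guess = search(["accident"])                        # the user's single guess
mapped = search(SUBJECT_PHRASES["dock-accident"])   # phrases recorded so far

print(f"recall with one guessed term: {recall(guess, RELEVANT):.2f}")
print(f"recall with recorded phrases: {recall(mapped, RELEVANT):.2f}")
```

The user still cannot predict the phrases in advance. The mapping only helps once someone has recorded how a subject was actually identified in the collection.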

Topic maps can be used like the maps of early explorers, which became more precise with each new expedition.

12 Responses to “Size Really Does Matter…”

  1. Sven says:

    Hei Patrick.

    So you are basically longing for the use of topic maps for query expansion? If so, I’d totally dissent from that. I don’t think that making keyword search fatter is the solution to the problem of finding more relevant documents.

    Instead, we can take advantage of the ontology knowledge that comes with every sane modeled topic map. Don’t let us search for more keywords, let’s search for concepts!

    Cheers,
    Sven

  2. Patrick Durusau says:

    Not advocating making keywords fatter.

    Am advocating that we make subject identifications fatter.

    Not the same thing.

    The common confusion has been that keywords = subject identifications.

    True, but only in particular contexts. If we can capture the context in which a keyword is used, that is, have a complex (as opposed to simple keyword) identification of subjects, then we can find subjects as they are identified by users.

    Otherwise, we are authoring “sane” topic maps, which means we can only find information we have input into our topic maps.

    See my post What The World Needs Now for the amount of electronically stored information as of 2011. We are never going to convert everything into topic maps so we had better learn how to view it as topic maps, including how users actually identify their subjects.

    PS: Ontologies are ok. But different people have different ontologies. In my view topic maps can accommodate those different ontologies. Or as Nietzsche once said: “I am not bigoted enough for a system and not even for my system.”

  3. Patrick Durusau says:

    Supplemental:

    Recall that subject identification = URI was a simplification introduced for the XML syntax.

    The topic maps standard reads:

    The optional subject identity (identity) attribute refers to one or more indications (‘subject descriptors’) of the identity of the subject (the organizing principle) of the topic link. All of the other topic characteristics specified by the topic link are regarded as elaborating, and in no way contradicting, the subject described by the subject descriptor(s), if any. There are no restrictions on the kinds of information that may be referenced by an identity attribute.
    (13250:2000, 5.2.1 Topic Link Architectural Form)

    That remains a valid statement about topic maps.

  4. Sven says:

    I totally see your point now. Context was the word that made it clear. I had some thoughts on how to realize this, but figured it would blast this comments section. So I posted them on my own blog: http://semantosoph.net/2010/3/19/dude-where-s-my-context.

    Looking forward to hearing your comments on that.

  5. Patrick’s quotation from ISO 13250:2000 moves me to share some
    reflections that some of his readers may find stimulating.

    The bulk of the information-interchange power of ISO 13250:2000
    Topic Maps was deliberately omitted from the XTM Specification,
    from which today’s Topic Maps Data Model (TMDM) is derived.
    Among other topic mapping pioneers, including Michel Biezunski,
    I encouraged and assisted in that specialization. Additional
    specializations were introduced by well-intentioned implementers
    of XTM, who had investments to protect, investors to please, and
    products to move into the marketplace. Today, of all the parts
    of the current state of ISO 13250, only its Reference Model
    (Part 5) has a frame of reference that extends outside TMDM’s
    relatively confining perimeter.

    It’s important to me that our thinking not be confined to TMDM,
    and many of us are still fascinated by the original scope of
    13250:2000: facilitating the amalgamation of master indexes of
    diverse corpora from partial indexes that are ontologically and
    taxonomically independent of each other. By contrast, to use
    today’s TMDM-oriented tools, one must view all information
    through the taxonomic lens of TMDM. Gratifying though TMDM’s
    success is, that success, and the scope of TMDM itself, is
    dwarfed by the scope of what we were attempting to accomplish
    when we drafted 13250:2000.

    Patrick’s quotation from 13250:2000, and particularly the
    sentence:

    “There are no restrictions on the kinds of information that
    may be referenced by an identity attribute.”

    reminds me how much 13250:2000 depended on normative references
    to the ISO/IEC 10744:1997 (and :1992) “HyTime” standard to convey
    its intent. All of the kinds of information for which HyTime
    defined interchange syntaxes are included in that sweeping “no
    restrictions” statement, and those kinds of information were
    very prominent in the minds of 13250’s drafters and reviewers.
    (That prominence was no coincidence!)

    Here’s one of the 24 normative references to HyTime in
    13250:2000:

    “The definitions provided in […] ISO/IEC 10744:1997
    (including Amendment 1) shall apply to this International
    Standard.”

    So, in order to glimpse the intended scope of 13250:2000 with
    any accuracy, I think one really needs to know a thing or two
    about HyTime.

    HyTime pioneered the idea of formally standardizing a way to
    interchange — and to exploit in unanticipated contexts —
    strategies for positively identifying, and for regarding as
    exactly the same for all purposes of linking, scheduling, etc.,
    certain classes of subjects of conversation, the classes being:

    * information components,

    * information addresses,

    * abstract extents in n-dimensional finite coordinate spaces,

    * mappings among such spaces,

    * semantic-bearing relationships,

    * namespaces,

    * and more.

    In HyTime, all of these subjects are notionally represented not
    by “topic information items” (or “topic links” in 13250:2000),
    but instead by nodes in “Graph Representations of Property
    Values” called “groves”.

    The HyTime “grove” idea establishes a way to endow the
    components of information objects (etc.) with identities, and
    with addresses that leverage those identities in whatever way(s)
    may be desired. Groves enable all information components to be
    addressed without having first to add metadata to them. For
    example, in a grove of an SGML document, a given element is
    addressable regardless of whether it has an ID attribute, and
    adding an ID attribute to it only makes it addressable in yet
    another way. And elements are just one kind of addressable
    information component; in an extreme (and usually absurd) case,
    a grove could include a node that represents a given whitespace
    character in an XML start tag.

    “To build a grove” means “to view the information as a graph of
    nodes constructed in whatever formal and deterministic ways meet
    the requirements of whatever the intended applications may be.”
    Everything — every syntactic and/or semantic component of an
    instance of SGML or any other notation — can be endowed with
    identity and addressability, and therefore it can play a role in
    any kind of hyperlink. In the grove paradigm, the identity of a
    component can be defined or addressed in terms of the identity
    of any other component, or even in terms of the identities of
    all of the other components and their relationships to it. Or
    in any other way. In grove-land, you get to choose (and even to
    design, if you like) how the whole information object will be
    viewable as a graph. According to HyTime, the way in which you
    are choosing to view it is formally defined by an
    interchangeable “Property Set” — documentation about, and
    structural constraint specifications on, the view of the
    information that you have chosen to use. For the most part, a
    Property Set defines classes of nodes. In a grove that conforms
    to a given Property Set, each instance of each class represents
    an instance of some corresponding class of subjects. And in a
    grove, every subject is either a piece of information, or a
    semantic derived in a defined manner from one or more subjects
    that are pieces of addressable information. An example of the
    latter is a property defined in the “HyTime Property Set” whose
    value is, in effect, a dictionary of the nodes that are
    addressed by other nodes in some specified corpus.

    Groves were the prototypes of Topic Maps, and the Topic Maps
    idea is no more or less than a generalization of the Grove idea.
    Every grove node represents a very specific, formally-identified
    subject of conversation. Indeed, the only differences between
    HyTime Groves and Topic Maps, as defined in the Topic Maps
    Reference Model (TMRM) are constraints on groves that are
    relaxed in topic maps:

    (1) In HyTime Groves, there are only a few valid property value
    types, whereas in TMRM, the types of the values of
    properties are unconstrained.

    (2) In HyTime Groves, no grove (and no node’s properties) can be
    defined by more than one monolithic Property Set, while in
    TMRM, a given single node can have instances of properties
    whose classes were defined independently, with no
    cooperation or mutual understanding among their definers.

    (3) A HyTime Grove may or may not be acyclic, but it is always
    hierarchical in that there is always a root node. There is
    no such constraint on a Topic Map. A Topic Map may or may
    not be hierarchical, and it may or may not have a root node.

    (4) Every node in a HyTime Grove represents a subject which is
    always some piece of information. In a TMRM Topic Map,
    there is no such constraint on the subjects that nodes can
    represent.

    (5) Every node in a HyTime Grove is an instance of some node
    class that is defined in the grove’s Property Set. By
    contrast, there are no node classes in the topic map graphs
    described in TMRM. Or, maybe it would be clearer to say
    that in TMRM there is only one node class, “subject proxy”,
    and that all subject proxies are instances of it. In
    effect, of course, there are *subject* classes in Topic
    Maps, and the class membership(s) of each subject are
    revealed by the classes and values of the properties of the
    corresponding topic (aka “subject proxy”).

    Thus, all HyTime groves are easily seen as TMRM Topic Maps; any
    remaining differences are merely terminological. HyTime groves
    consist of nodes, the nodes have properties, the properties are
    instances of user-defined classes of properties, the property
    classes are disclosed (the legend is the Property Set), and
    every node’s purpose is to serve as a proxy for a single subject
    of conversation.

    With all that in mind, let’s return to those words in
    13250:2000:

    “There are no restrictions on the kinds of information that
    may be referenced by an identity attribute.”

    This was a conscious reference to all of the identity- and
    addressability-endowment power of the HyTime “grove” paradigm,
    among all the other possibilities. The intent was to allow
    Topic Map authors the freedom to decide not only what their
    subjects are, but also the subject-identity-invoking techniques
    embodied in the information referenced by the “identity”
    attributes of <topic>s. The referenced information could be,
    for example, a node in a grove, and thus the entire
    semantic-loading and subject-sameness apparatus of HyTime
    Property Sets, Grove Plans, Architectural Forms, Scheduling,
    Mapping, Activity Policy Tracking, and much more could be
    brought to bear, using any combination of HyTime modules.

    The very next paragraph of 13250:2000, immediately after
    Patrick’s quote, says:

    “NOTE 18 The information referenced by an identity attribute
    may or may not take the form of a topic link in a topic map
    document…”

    Among other things, this note underlines the idea that the
    referenced subject descriptor’s context is important in
    understanding the identity of the subject being invoked by the
    reference. If the referent is a topic, then what is being
    referenced is the *subject represented by the topic* —
    something that may not be knowable without understanding the
    referenced topic’s context in its own topic map. This idea is
    further clarified later in 13250:2000:

    “Similarly, if the identity attribute references one or more
    topic links, topic map processing applications must regard
    the referencing topic link, and all the referenced topic
    links, as having one and the same subject, and therefore they
    may all be merged.”

    But what if the topic map author needed to refer not to the
    subject of some <topic>, but rather to the syntactic SGML
    element that is that <topic>? That is, what if that particular
    instance of a <topic> element was supposed to be the *subject*
    of the referring topic?

    The answer to this question was not explicit in 13250:2000, but
    it was implicit in the “no restrictions” formula and in the
    normative references to HyTime. I, among others, assumed that
    the identity attribute would refer not to the ID of the <topic>
    (because, according to 13250:2000, that would always be a
    reference to the <topic>’s subject, as we have just seen), but
    instead to that <topic>’s corresponding grove node. This works
    because the subject of the grove node that represents the
    <topic> element is not the subject being represented by the
    <topic>, but rather the <topic> element itself, considered as an
    instance of an SGML syntactic construct. The two referents (a
    <topic> element vs. a grove node whose subject is the same
    <topic> element) have different semantics by virtue of their
    different contexts.

    In the context of an SGML grove, the subject of a node is always
    an instance of an SGML syntactic construct, and it can’t be
    anything else. In the context of a Topic Map, however, the
    subject of a node can be anything at all, including but not
    limited to an instance of an SGML syntactic construct. It could
    be something wildly different from an instance of a syntactic
    construct, such as Minnie Mouse’s high-heeled shoes. You may
    not be able to tell what the subject of a <topic> actually is
    without looking more deeply at the topic map in which it
    appears, because the identity of the <topic>’s subject may
    depend on the identities of the subjects of other <topic>s in
    that map, just as the identity of the subject of a grove node
    may depend on the identities of the rest of the information
    components that have been node-ified in the grove.

    Steve Newcomb
    24 March 2010

  6. Ugh. My posting about HyTime groves and Topic Maps, above, has been rendered unreadable by the posting process. In each of the 15 places where I wrote

    (the symbol for less-than) topic (the symbol for greater-than)

    …nothing appears. Hmmm. No warning from WordPress, either, just silent scrozzling that I wouldn’t have discovered if I had not checked the actual posting. Until this is resolved, you can see what I actually wrote at http://www.coolheads.com/SRNPUBS/groveProgenitorOfTMs.txt — including the original punctuation and spacing, which WordPress also silently changed, without yielding a significant loss of clarity.

    Steve Newcomb

  7. Thanks, Patrick, for fixing the problem.

  8. […] large portion of relevant documents will go unfound. As much as 80% of the relevant documents. See Size Really Does Matter… (A study of full text searching but the underlying problem is the same: “What term was […]

  9. […] really isn’t that hard to guess some of them. I blogged about Blair and Maron saying twenty-five years ago: Stated succinctly, it is impossibly difficult for users to predict […]

  10. […] Blair and Maron and lawyers thinking they were getting 75% of the relevant resources (reality was 20%). What are […]

  11. […] An evaluation of retrieval effectiveness for a full-text document-retrieval system (see Size Really Does Matter…, which was published in 1985. If more than twenty-five years later, some researchers are not yet […]

  12. […] will remember in Size Really Does Matter… that Blair and Maron reported that lawyers over estimated their accuracy in document retrieval by […]