Search and Exogenous Complexity
Stephen Arnold writes:
I am now using the phrase “exogenous complexity” to describe systems, methods, processes, and procedures which are likely to fail due to outside factors. This initial post focuses on indexing, but I will extend the concept to other content-centric applications in the future. Disagree with me? Use the comments section of this blog, please.
What is an outside factor?
Let’s think about value-adding indexing, content enrichment, or metatagging. The idea is that unstructured text contains entities, facts, bound phrases, and other identifiable items. A keyword search system is mostly blind to the meaning of a number in the form nnn nn nnnn, which in the United States is the pattern for a Social Security Number. There are similar patterns for Federal Express tracking numbers, financial identifiers, and other types of sequences. The idea is that a system will recognize these strings and tag them appropriately; for example:
nnn nn nnnn Social Security Number
Thus, a query for Social Security Numbers will return strings of digits matching the pattern. The same logic can be applied to certain entities and, with the help of a knowledge base, Bayesian numerical recipes, and other techniques such as synonym expansion, determine that a query for “Obama residence” will return “White House,” or that a query for the White House will return links to the Obama residence.
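To make the tagging idea concrete, here is a minimal sketch of pattern-based content enrichment. This is a hypothetical illustration, not any vendor’s implementation: a regular expression recognizes digit strings shaped like a U.S. Social Security Number and attaches a tag a search system could index.

```python
import re

# Digit strings in the shape nnn-nn-nnnn (or space-separated), the
# U.S. Social Security Number pattern described above.
SSN_PATTERN = re.compile(r"\b\d{3}[- ]\d{2}[- ]\d{4}\b")

def tag_ssns(text):
    """Return (tag, matched_string) pairs for SSN-shaped substrings."""
    return [("SocialSecurityNumber", m.group(0))
            for m in SSN_PATTERN.finditer(text)]

print(tag_ssns("Applicant 123-45-6789 filed on 2020-01-02."))
```

Note that the date 2020-01-02 does not fit the nnn nn nnnn shape, so only the SSN-shaped string is tagged; real enrichment pipelines layer many such recognizers plus knowledge-base lookups on top of this basic pattern matching.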
I am not sure the inside/outside division is helpful.
For example, Arnold starts with the issue:
First, there is the issue of humans who use language in unexpected or what some poets call “fresh” or “metaphoric” methods. English is synthetic in that any string of sounds can be used in quite unexpected ways.
True, but recall the overloading of owl:sameAs, which is entirely within a semantic system.
I mention that to make the point that while inside/outside may be a useful informal metaphor, semantics are with us always, even in systems where one or more parties think the semantics are “obvious” or “defined.” Maybe; it depends on whom you ask.
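The owl:sameAs overloading problem can be made concrete. A reasoner that takes owl:sameAs literally merges everything it links into one equivalence class, so a loosely asserted link collapses distinct things. A minimal sketch in plain Python (hypothetical identifiers, no RDF library):

```python
# Naive reasoner: treat owl:sameAs as strict identity and merge
# everything linked by it into a single equivalence class.
def equivalence_classes(same_as_pairs):
    classes = {}  # item -> set of items identified with it
    for a, b in same_as_pairs:
        merged = classes.get(a, {a}) | classes.get(b, {b})
        for item in merged:
            classes[item] = merged
    return classes

# Loosely asserted "sameAs" links (the overloading problem): the
# person, the office, and the building are related, not identical.
pairs = [("ex:Barack_Obama", "ex:PresidentOfTheUS"),
         ("ex:PresidentOfTheUS", "ex:White_House")]
classes = equivalence_classes(pairs)
# The naive merge now treats the person and the building as one thing.
print(sorted(classes["ex:Barack_Obama"]))
```

The failure here is entirely inside the semantic system: the links, the identity semantics, and the merge all happen within the system’s own formalism, which is the point about the inside/outside division.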
The second issue is:
Second, there is the quite real problem of figuring out the meaning of short, mostly context free snippets of text.
But isn’t that an inside problem too? Search engines vacuum up content from a variety of contexts, not preserving the context that would make the “snippets of text” meaningful. Snippets of text have very different meanings in comp.compilers than in alt.religion.scientology. It is the searcher’s choice to treat both as a single pile of text.
His third point is:
Third, there is the issue of people and companies desperate for a solution or desperate for revenue. The coin has two sides. Individuals who are looking for a silver bullet find vendors who promise not just one silver bullet but an ammunition belt stuffed with the rounds. The buyers and vendors act out a digital kabuki.
But isn’t this an issue of design and requirements, which are “inside” issues as well?
No system can meet a requirement for universal semantic resolution with little or no human involvement. The questions are: How much better information do you need? How much are you willing to pay for it? There is no free lunch when it comes to semantics, ever. That includes the semantics of the systems we use and the information to which they are applied.
The requirements for any search system must address both “inside” and “outside” issues and semantics.
(Apologies for the length but semantic complexity is one of my pet topics.)