Archive for the ‘Ambiguity’ Category

AlphaZero: Mastering Unambiguous, Low-Dimensional Data

Wednesday, December 6th, 2017

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm by David Silver, et al.

Abstract:

The game of chess is the most widely-studied domain in the history of artificial intelligence. The strongest programs are based on a combination of sophisticated search techniques, domain-specific adaptations, and handcrafted evaluation functions that have been refined by human experts over several decades. In contrast, the AlphaGo Zero program recently achieved superhuman performance in the game of Go, by tabula rasa reinforcement learning from games of self-play. In this paper, we generalise this approach into a single AlphaZero algorithm that can achieve, tabula rasa, superhuman performance in many challenging domains. Starting from random play, and given no domain knowledge except the game rules, AlphaZero achieved within 24 hours a superhuman level of play in the games of chess and shogi (Japanese chess) as well as Go, and convincingly defeated a world-champion program in each case.

The achievements by the AlphaZero team and their algorithm merit joyous celebration.

Joyous celebration, while recognizing that AlphaZero masters unambiguous, low-dimensional data governed by deterministic rules that define the outcomes for any state, more quickly and completely than any human.

Chess, Shogi and Go appear complex to humans due to the large number of potential outcomes. But every outcome is the result of the application of deterministic rules to unambiguous, low-dimensional data. Something that AlphaZero excels at doing.

What hasn’t been shown is equivalent performance on ambiguous, high-dimensional data, governed by partially (if that) known rules, for a limited set of sub-cases. For those cases, well, you need a human being.

That’s not to take anything away from the AlphaZero team, but to recognize the strengths of AlphaZero and to avoid its application where it is weak.

Interpretation Under Ambiguity [First Cut Search Results]

Sunday, February 7th, 2016

Interpretation Under Ambiguity by Peter Norvig.

From the paper:

Introduction

This paper is concerned with the problem of semantic and pragmatic interpretation of sentences. We start with a standard strategy for interpretation, and show how problems relating to ambiguity can confound this strategy, leading us to a more complex strategy. We start with the simplest of strategies:

Strategy 1: Apply syntactic rules to the sentence to derive a parse tree, then apply semantic rules to get a translation into some logical form, and finally do a pragmatic interpretation to arrive at the final meaning.

Although this strategy completely ignores ambiguity, and is intended as a sort of strawman, it is in fact a commonly held approach. For example, it is approximately the strategy assumed by Montague grammar, where `pragmatic interpretation’ is replaced by `model theoretic interpretation.’ The problem with this strategy is that ambiguity can strike at the lexical, syntactic, semantic, or pragmatic level, introducing multiple interpretations. The obvious way to counter this problem is as follows:

Strategy 2: Apply syntactic rules to the sentence to derive a set of parse trees, then apply semantic rules to get a set of translations in some logical form, discarding any inconsistent formulae. Finally compute pragmatic interpretation scores for each possibility, to arrive at the `best’ interpretation (i.e. `most consistent’ or `most likely’ in the given context).

In this framework, the lexicon, grammar, and semantic and pragmatic interpretation rules determine a mapping between sentences and meanings. A string with exactly one interpretation is unambiguous, one with no interpretation is anomalous, and one with multiple interpretations is ambiguous. To enumerate the possible parses and logical forms of a sentence is the proper job of a linguist; to then choose from the possibilities the one “correct” or “intended” meaning of an utterance is an exercise in pragmatics or Artificial Intelligence.

One major problem with Strategy 2 is that it ignores the difference between sentences that seem truly ambiguous to the listener, and those that are only found to be ambiguous after careful analysis by the linguist. For example, each of (1-3) is technically ambiguous (with could signal the instrument or accompanier case, and port could be a harbor or the left side of a ship), but only (3) would be seen as ambiguous in a neutral context.

(1) I saw the woman with long blond hair.
(2) I drank a glass of port.
(3) I saw her duck.

Lotfi Zadeh (personal communication) has suggested that ambiguity is a matter of degree. He assumes each interpretation has a likelihood score attached to it. A sentence with a large gap between the highest and second ranked interpretation has low ambiguity; one with nearly-equal ranked interpretations has high ambiguity; and in general the degree of ambiguity is inversely proportional to the sharpness of the drop-off in ranking. So, in (1) and (2) above, the degree of ambiguity is below some threshold, and thus is not noticed. In (3), on the other hand, there are two similarly ranked interpretations, and the ambiguity is perceived as such. Many researchers, from Hockett (1954) to Jackendoff (1987), have suggested that the interpretation of sentences like (3) is similar to the perception of visual illusions such as the Necker cube or the vase/faces or duck/rabbit illusion. In other words, it is possible to shift back and forth between alternate interpretations, but it is not possible to perceive both at once. This leads us to Strategy 3:

Strategy 3: Do syntactic, semantic, and pragmatic interpretation as in Strategy 2. Discard the low-ranking interpretations, according to some threshold function. If there is more than one interpretation remaining, alternate between them.

Strategy 3 treats ambiguity seriously, but it leaves at least four problems untreated. One problem is the practicality of enumerating all possible parses and interpretations. A second is how syntactic and lexical preferences can lead the reader to an unlikely interpretation. Third, we can change our mind about the meaning of a sentence-“at first I thought it meant this, but now I see it means that.” Finally, our affectual reaction to ambiguity is variable. Ambiguity can go unnoticed, or be humorous, confusing, or perfectly harmonious. By `harmonious,’ I mean that several interpretations can be accepted simultaneously, as opposed to the case where one interpretation is selected. These problems will be addressed in the following sections.
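The filtering step of Strategy 3, and Zadeh's idea that ambiguity is a matter of degree, can be sketched in a few lines of Python. This is only an illustration, not Norvig's method: the `Interpretation` class, the likelihood scores, the threshold rule, and the ratio used as the "degree of ambiguity" are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interpretation:
    reading: str       # a paraphrase of one candidate meaning
    likelihood: float  # hypothetical score; higher means more plausible


def surviving_interpretations(candidates, threshold=0.5):
    """Keep readings that score at least `threshold` times the best score."""
    ranked = sorted(candidates, key=lambda c: c.likelihood, reverse=True)
    if not ranked:
        return []  # no interpretation at all: the sentence is anomalous
    best = ranked[0].likelihood
    return [c for c in ranked if c.likelihood >= threshold * best]


def degree_of_ambiguity(candidates):
    """Second-best score divided by best score (an assumed measure).

    Near 0 means a sharp drop-off; near 1 means a near tie.
    """
    scores = sorted((c.likelihood for c in candidates), reverse=True)
    if len(scores) < 2 or scores[0] == 0:
        return 0.0
    return scores[1] / scores[0]


# Invented scores for sentences (2) and (3) from the quoted passage.
port = [
    Interpretation("fortified wine", 0.97),
    Interpretation("harbor", 0.02),
    Interpretation("left side of a ship", 0.01),
]
duck = [
    Interpretation("the duck that belongs to her", 0.52),
    Interpretation("her act of ducking", 0.48),
]

port_survivors = surviving_interpretations(port)
duck_survivors = surviving_interpretations(duck)
```

With these made-up scores, only one reading of "port" survives the threshold, while both readings of "her duck" remain and a Strategy 3 reader would alternate between them.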

Apologies for the long introduction quote but I want to entice you to read Norvig’s essay in full and if you have the time, the references that he cites.

It’s the literature you will have to master to use search engines and develop indexing strategies.

At least for one approach to search and indexing.

That within a language there is enough commonality for automated indexing or searching to be useful has been proven over and over again by Internet search engines.

But at the same time, the first twenty or so results typically leave you wondering what interpretation the search engine put on your words.

As I said, Peter’s approach is useful, at least for a first cut at search results.

The problem is that the first cut has become the norm for “success” of search results.

That works if I want to pay lawyers, doctors, teachers and others to find the same results as others have found before (past tense).

That cost doesn’t appear as a line item in any budget but repetitive “finding” of the same information over and over again is certainly a cost to any enterprise.

First cut on semantic interpretation, follow Norvig.

Saving re-finding costs, and the cost of not-finding, requires something more robust than one model to find words and in the search darkness bind them to particular meanings.

PS: See Peter@norvig.com for an extensive set of resources, papers, presentations, etc.

I first saw this in a tweet by James Fuller.

Fifty Words for Databases

Saturday, March 7th, 2015

Fifty Words for Databases by Phil Factor

From the post:

Almost every human endeavour seems simple from a distance: even database deployment. Reality always comes as a shock, because the closer you get to any real task, the more you come to appreciate the skills that are necessary to accomplish it.

One of the big surprises I have when I attend developer conferences is to be told by experts how easy it is to take a database from development and turn it into a production system, and then implement the processes that allow it to be upgraded safely. Occasionally, I’ve been so puzzled that I’ve drawn the speakers to one side after the presentation to ask them for the details of how to do it so effortlessly, mentioning a few of the tricky aspects I’ve hit. Invariably, it soon becomes apparent from their answers that their experience, from which they’ve extrapolated, is of databases the size of a spreadsheet with no complicated interdependencies, compliance issues, security complications, high-availability mechanisms, agent tasks, alerting systems, complex partitioning, queuing, replication, downstream analysis dependencies and so on about which you, the readers, know more than I. At the vast international enterprise where I once worked in IT, we had a coded insult for such people: ‘They’ve catalogued their CD collection in a database’. Unfair, unkind, but even a huge well-used ‘Big Data’ database dealing in social media is a tame and docile creature compared with a heavily-used OLTP trading system where any downtime or bug means figures for losses where you have to count the trailing zeros. The former has unique problems, of course, but the two types of database are so different.

I wonder if the problem is one of language. Just as the English have fifty ways of describing rainfall, and the Inuit have many ways of describing pack ice, it is about time that we created the language for a variety of databases from a mild drizzle (‘It is a soft morning to be sure’) to a cloud-burst. Until anyone pontificating about the database lifecycle can give their audience an indication of the type of database they’re referring to, we will continue to suffer the sort of misunderstandings that so frustrate the development process. Though I’m totally convinced that the development culture should cross-pollinate far more with the science of IT operations, it will need more than a DevOps group-hug; it will require a change in the technical language so that it can accurately describe the rich variety of databases in operational use and their widely-varying requirements. The current friction is surely due more to misunderstandings on both sides, because it is so difficult to communicate these requirements. Any suggestions for suitable descriptive words for types of database? (emphasis added)

If you have “descriptive words” to suggest to Phil, comment on his post.

With the realization that your “descriptive words” may be different from my “descriptive words” for the same database or mean a different database altogether or have nothing to do with databases at all (when viewed by others).

Yes, I have been thinking about identifiers, again, and will start off the coming week with a new series of posts on subject identification. I hope to include a proposal for a metric of subject identification.

…ambiguous phrases in research papers…

Sunday, November 23rd, 2014

When scientists use ambiguous phrases in research papers… And what they might actually mean 😉

This graphic was posted to Twitter by Jan Lentzos.

[Image: scientists phrases]

This sort of thing makes the rounds every now and again. From the number of retweets of Jan’s post, it never fails to amuse.

Enjoy!

Data visualization: ambiguity as a fellow traveler

Wednesday, July 10th, 2013

Data visualization: ambiguity as a fellow traveler by Vivien Marx. (Nature Methods 10, 613–615 (2013) doi:10.1038/nmeth.2530)

From the article:

Data from an experiment may appear rock solid. Upon further examination, the data may morph into something much less firm. A knee-jerk reaction to this conundrum may be to try and hide uncertain scientific results, which are unloved fellow travelers of science. After all, words can afford ambiguity, but with visuals, “we are damned to be concrete,” says Bang Wong, who is the creative director of the Broad Institute of MIT and Harvard. The alternative is to face the ambiguity head-on through visual means.

Color or color gradients in heat maps, for example, often show degrees of data uncertainty and are, at their core, visual and statistical expressions. “Talking about uncertainty is talking about statistics,” says Martin Krzywinski, whose daily task is data visualization at the Genome Sciences Centre at the British Columbia Cancer Agency.

Statistically driven displays such as box plots can work for displaying uncertainty, but most visualizations use more ad hoc methods such as transparency or blur. Error bars are also an option, but it is difficult to convey information clearly with them, he says. “It’s likely that if something as simple as error bars is misunderstood, anything more complex will be too,” Krzywinski says.

I don’t hear “ambiguity” and “uncertainty” as the same thing.

The duck/rabbit image you will remember from Sperberg-McQueen’s presentations is ambiguous, but not uncertain.

[Image: duck rabbit]

Granted, visualizing “uncertainty” is a difficult task, but let’s not compound the task by confusing it with ambiguity.

The uncertainty issue in this article echoes Steve Pepper’s concern over binary choices for type under the current TMDM. Either a topic, for example, is of a particular type or not. There isn’t any room for uncertainty.

The article has a number of suggestions on visualizing uncertainty that I think you may find helpful.

I first saw this at: Visualizing uncertainty still unsolved problem by Nathan Yau.

Open Source Natural Language Spell-Checker [Disambiguation at the point of origin.]

Friday, October 26th, 2012

Automattic Open Sources Natural Language Spell-Checker After the Deadline by Jolie O’Dell.

I am sure the original headline made sense to its author, but I wonder how a natural language processor would react to it?

My reaction, being innocent of any prior knowledge of the actors or the software, was: What deadline? I read it as a report of a missed deadline.

It is almost a “who’s on first” type headline. The software’s name is “After the Deadline.”

That confusion resolved, I read:

Matt Mullenweg has just announced on his blog that WordPress parent company Automattic is open sourcing After the Deadline, a natural-language spell-checking plugin for WordPress and TinyMCE that was only recently ushered into the Automattic fold.

Scarcely seven weeks after its acquisition was announced, After the Deadline’s core technology is being released under the GPL. Moreover, writes Mullenweg, “There’s also a new jQuery API that makes it easy to integrate with any text area.”

Interested parties can check out this demo or read the tech overview and grab the source code here.

I can use spelling/grammar suggestions. Particularly since I make the same mistakes over and over again.

Does that also mean I talk about the same subjects/entities over and over again? Or at least a limited range of subjects/entities?

Imagine a user configurable subject/entity “checker” that annotated recognized subjects/entities with an <a> element. Enabling the user to accept/reject the annotation.

Disambiguation at the point of origin.

The title of the original article could become:

“<a href="http://automattic.com/">Automattic</a> Open Sources Natural Language Spell-Checker <a href="http://www.afterthedeadline.com/">After the Deadline</a>”

Seems less ambiguous to me.

Certainly less ambiguous to a search engine.

You?
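A minimal sketch of such a subject/entity “checker” follows, assuming a user-maintained table of names and identifying URLs. The table `KNOWN_SUBJECTS`, the functions `propose_annotations` and `annotate`, and the `accept` callback that stands in for the user's accept/reject choice are all hypothetical; nothing here reflects how After the Deadline itself works.

```python
import html
import re

# Hypothetical user-maintained table: surface form -> identifying URL.
KNOWN_SUBJECTS = {
    "Automattic": "http://automattic.com/",
    "After the Deadline": "http://www.afterthedeadline.com/",
}


def propose_annotations(text, subjects=KNOWN_SUBJECTS):
    """Return sorted (start, end, name, url) tuples for recognized subjects."""
    proposals = []
    taken = []
    # Try longer names first so a short name cannot split a longer one.
    for name in sorted(subjects, key=len, reverse=True):
        for match in re.finditer(re.escape(name), text):
            start, end = match.span()
            # Skip matches that overlap a span already claimed.
            if any(start < t_end and t_start < end for t_start, t_end in taken):
                continue
            taken.append((start, end))
            proposals.append((start, end, name, subjects[name]))
    return sorted(proposals)


def annotate(text, accept=lambda name, url: True, subjects=KNOWN_SUBJECTS):
    """Wrap each accepted subject in an <a> element; leave rejected ones as text."""
    out = []
    cursor = 0
    for start, end, name, url in propose_annotations(text, subjects):
        out.append(html.escape(text[cursor:start]))
        if accept(name, url):
            out.append('<a href="%s">%s</a>'
                       % (html.escape(url, quote=True), html.escape(name)))
        else:
            out.append(html.escape(name))
        cursor = end
    out.append(html.escape(text[cursor:]))
    return "".join(out)


headline = ("Automattic Open Sources Natural Language Spell-Checker "
            "After the Deadline")
annotated = annotate(headline)
only_product = annotate(headline, accept=lambda name, url: name != "Automattic")
```

In an editor plugin, `accept` would be replaced by a prompt to the author, so the person who knows what was meant makes the call at the moment of writing.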

Yu and Robinson on The Ambiguity of “Open Government”

Saturday, August 11th, 2012

Yu and Robinson on The Ambiguity of “Open Government”

Legal Informatics calls our attention to the use of ambiguity to blunt, at least in one view, the potency of the phrase “open government.”

Whatever your politics, it is a reminder that for good or ill, semantics originate with us.

Topic maps are one tool to map those semantics, to remove (or enhance) ambiguity.

Prostitutes Appeal to Pope: Text Analytics applied to Search

Sunday, April 29th, 2012

Prostitutes Appeal to Pope: Text Analytics applied to Search by Tony Russell-Rose.

It is hard for me to visit Tony’s site and not come away with several posts he has written that I want to mention. Today was no different.

Here is a sampling of what Tony talks about in this post:

Consider the following newspaper headlines, all of which appeared unambiguous to the original writer:

  • DRUNK GETS NINE YEARS IN VIOLIN CASE
  • PROSTITUTES APPEAL TO POPE
  • STOLEN PAINTING FOUND BY TREE
  • RED TAPE HOLDS UP NEW BRIDGE
  • DEER KILL 300,000
  • RESIDENTS CAN DROP OFF TREES
  • INCLUDE CHILDREN WHEN BAKING COOKIES
  • MINERS REFUSE TO WORK AFTER DEATH

Although humorous, they illustrate much of the ambiguity in natural language, and just how much pragmatic and linguistic knowledge must be employed by NLP tools to function accurately.

A very informative and highly amusing post.

What better way to start the week?

Enjoy!

Draft (polysemy and ambiguity)

Sunday, January 22nd, 2012

Draft by Mark Liberman

From the post:

In a series of Language Log posts, Geoff Pullum has called attention to the prevalence of polysemy and ambiguity:

The people who think clarity involves lack of ambiguity, so we have to strive to eliminate all multiple meanings and should never let a word develop a new sense… they simply don’t get it about how language works, do they?

Languages love multiple meanings. They lust after them. They roll around in them like a dog in fresh grass.

The other day, as I was reading a discussion in our comments about whether English draftable does or doesn’t refer to the same concept as Finnish asevelvollisuus (“obligation to serve in the military”), I happened to be sitting in a current of uncomfortably cold air. So of course I wondered how the English word draft came to refer to military conscription as well as air flow. And a few seconds of thought brought to mind several other senses of the noun draft and its associated verb. I figured that this must represent a confusion of several originally separate words. But then I looked it up.

If you like language and have an appreciation for polysemy and ambiguity, you will enjoy this post a lot.

Semantic Tech the Key to Finding Meaning in the Media

Friday, January 20th, 2012

Semantic Tech the Key to Finding Meaning in the Media by Chris Lamb.

Chris starts off well enough:

News volume has moved from infoscarcity to infobesity. For the last hundred years, news in print was delivered in a container, called a newspaper, periodically, typically every twenty-four hours. The container constrained the product. The biggest constraints of the old paradigm were periodic delivery and limitations of column inches.

Now information continually bursts through our Google Readers, our cell phones, our tablets, display screens in elevators and grocery stores. Do we really need to read all 88,731 articles on the Bernie Madoff trial? Probably not. And that’s the dilemma for news organizations.

In the old metaphor, column-inches was the constraint. In the new metaphor, reader attention span becomes the constraint.

But then quickly starts to fade:

Disambiguation is a technique to uniquely identify named entities: people, cities, and subjects. Disambiguation can identify that one article is about George Herbert Walker Bush, the 41st President of the US, and another article is about George Walker Bush, number 43. Similarly, the technology can distinguish between Lincoln Continental, the car, and Lincoln, Nebraska, the town. As part of the metadata, many tagging engines that disambiguate return unique identifiers called Uniform Resource Identifiers (URI). A URI is a pointer into a database.

If tagging creates machine readable assets, disambiguation is the connective tissue between these assets. Leveraging tagging and disambiguation technologies, applications can now connect content with very disparate origins. Today’s article on George W. Bush can be automatically linked to an article he wrote when he owned the Texas Ranger’s baseball team. Similarly the online bio of Bill Gates can be automatically tied to his online New Mexico arrest record in April 1975.

Apparently he didn’t read the paper The communicative function of ambiguity in language.

The problem with disambiguation is that you and I may well set up a system to disambiguate named entities differently. To be sure, we will get some of them the same, but the question becomes which ones? Is 80% of them the same enough?

Depends on the application, doesn’t it? What if we are looking for a terrorist who may have fissionable material? Does 80% look good enough?

Ironic. Disambiguation is subject to the same ambiguity it set out to solve.
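To make the “80% the same” question concrete, here is a small sketch that compares the output of two independently built disambiguation systems. The mentions, the `ex:` identifiers, and the function names are invented for illustration; real tagging engines return their own identifiers.

```python
def agreement_rate(mine, yours):
    """Share of mentions resolved by both systems that got the same identifier."""
    shared = set(mine) & set(yours)
    if not shared:
        return 0.0
    same = sum(1 for mention in shared if mine[mention] == yours[mention])
    return same / len(shared)


def disagreements(mine, yours):
    """Mentions that both systems resolved, but to different identifiers."""
    shared = set(mine) & set(yours)
    return sorted(m for m in shared if mine[m] != yours[m])


# Hypothetical results: mention -> identifier chosen by each system.
mine = {
    "George Bush": "ex:george-h-w-bush",
    "Lincoln": "ex:lincoln-nebraska",
    "Bill Gates": "ex:bill-gates",
    "Texas Rangers": "ex:texas-rangers-baseball",
    "Jaguar": "ex:jaguar-cat",
}
yours = {
    "George Bush": "ex:george-w-bush",
    "Lincoln": "ex:lincoln-nebraska",
    "Bill Gates": "ex:bill-gates",
    "Texas Rangers": "ex:texas-rangers-baseball",
    "Jaguar": "ex:jaguar-cat",
}

rate = agreement_rate(mine, yours)
differing = disagreements(mine, yours)
```

The rate alone says nothing about which mentions differ, which is the part that matters: here the systems agree on four of five mentions and split on the two Presidents.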

PS: URIs aren’t necessarily pointers into databases.

The communicative function of ambiguity in language

Friday, January 20th, 2012

The communicative function of ambiguity in language by Steven T. Piantadosi, Harry Tily and Edward Gibson. (Cognition, 2011) (PDF file)

Abstract:

We present a general information-theoretic argument that all efficient communication systems will be ambiguous, assuming that context is informative about meaning. We also argue that ambiguity allows for greater ease of processing by permitting efficient linguistic units to be re-used. We test predictions of this theory in English, German, and Dutch. Our results and theoretical analysis suggest that ambiguity is a functional property of language that allows for greater communicative efficiency. This provides theoretical and empirical arguments against recent suggestions that core features of linguistic systems are not designed for communication.

This is a must read paper if you are interested in ambiguity and similar issues.

At page 289, the authors report:

These findings suggest that ambiguity is not enough of a problem to real-world communication that speakers would make much effort to avoid it. This may well be because actual language in context provides other information that resolves the ambiguities most of the time.

I don’t know if our communication systems are efficient or not but I think the phrase “in context” is covering up a very important point.

Our communication systems came about in very high-bandwidth circumstances. We were in the immediate presence of a person speaking. With all the context that provides.

Even if we accept an origin of language of say 200,000 years ago, written language, which provides the basis for communication without the presence of another person, emerges only in the last five or six thousand years. Just to keep it simple, 5 thousand years would be 2.5% of the entire history of language.

So for 97.5% of the history of language, it has been used in a high bandwidth situation. No wonder it has yet to adapt to narrow bandwidth situations.

If writing puts us into a narrow bandwidth situation, and into ambiguity with it, where does that leave our computers?

In Defense of Ambiguity

Wednesday, October 5th, 2011

In Defense of Ambiguity by Patrick J. Hayes and Harry A. Halpin.

Abstract:

URIs, a universal identification scheme, are different from human names insofar as they can provide the ability to reliably access the thing identified. URIs also can function to reference a non-accessible thing in a similar manner to how names function in natural language. There are two distinctly different relationships between names and things: access and reference. To confuse the two relations leads to underlying problems with Web architecture. Reference is by nature ambiguous in any language. So any attempts by Web architecture to make reference completely unambiguous will fail on the Web. Despite popular belief otherwise, making further ontological distinctions often leads to more ambiguity, not less. Contrary to appeals to Kripke for some sort of eternal and unique identification, reference on the Web uses descriptions and therefore there is no unambiguous resolution of reference. On the Web, what is needed is not just a simple redirection, but a uniform and logically consistent manner of associating descriptions with URIs that can be done in a number of practical ways that should be made consistent.

Highly readable critique with passages such as:

There are two distinct relationships between names and things: reference and access. The architecture of the Web determines access, but has no direct influence on reference. Identifiers like URIs can be considered types of names. It is important to distinguish these two possible different relationships between a name and a thing.

1. accesses, meaning that the name provides a causal pathway to the thing, perhaps mediated by the Web.

2. refers to, meaning that the name is being used to mention the thing.

Current practice in Web Architecture uses “identifies” to mean both or either of these, apparently in the belief that they are synonyms. They are not, and to think of them as being the same is to be profoundly confused. For example, when uttering the name “Eiffel Tower” one does not in any way get magically transported to the Eiffel Tower. One can talk about it, have beliefs, plan a trip there, and otherwise have intentions about the Eiffel Tower, but the name has no causal path to the Eiffel Tower itself. In contrast, the URI http://www.tour-eiffel.fr/ offers us access to a group of Web pages via an HTTP-compliant agent. A great deal of the muddle Web architecture finds itself in can be directly traced to this confusion between access and reference.

The solution proffered by Hayes and Halpin:

Regardless of the details, the use of any technology in Web architecture to distinguish between access and reference, including our proposed ex:refersTo and ex:describedBy, does nothing more than allow the author of a URI to explain how they would like the URI to be used.

For those interested in previous recognitions of this distinction, see <resourceRef> and <subjectIndicatorRef> in XTM 1.0.

ORCID (Open Researcher & Contributor ID)

Saturday, September 24th, 2011

ORCID (Open Researcher & Contributor ID)

From the About page:

ORCID, Inc. is a non-profit organization dedicated to solving the name ambiguity problem in scholarly research and brings together the leaders of the most influential universities, funding organizations, societies, publishers and corporations from around the globe. The ideal solution is to establish a registry that is adopted and embraced as the de facto standard by the whole of the community. A resolution to the systemic name ambiguity problem, by means of assigning unique identifiers linkable to an individual’s research output, will enhance the scientific discovery process and improve the efficiency of funding and collaboration. The organization is managed by a fourteen member Board of Directors.

ORCID’s principles will guide the initiative as it grows and operates. The principles confirm our commitment to open access, global communication, and researcher privacy.

Accurate identification of researchers and their work is one of the pillars for the transition from science to e-Science, wherein scholarly publications can be mined to spot links and ideas hidden in the ever-growing volume of scholarly literature. A disambiguated set of authors will allow new services and benefits to be built for the research community by all stakeholders in scholarly communication: from commercial actors to non-profit organizations, from governments to universities.

Thomson Reuters and Nature Publishing Group convened the first Name Identifier Summit in Cambridge, MA in November 2009, where a cross-section of the research community explored approaches to address name ambiguity. The ORCID initiative officially launched as a non-profit organization in August 2010 and is moving ahead with broad stakeholder participation (view participant gallery). As ORCID develops, we plan to engage researchers and other community members directly via social media and other activity. Participation from all stakeholders at all levels is essential to fulfilling the Initiative’s mission.

I am not altogether certain that elimination of ambiguity in identification will enable “…min[ing] to spot links and ideas hidden in the ever-growing volume of scholarly literature.” Or should I say there is no demonstrated connection between unambiguous identification of researchers and such gains?

True enough, the claim is made but I thought science was based on evidence, not simply making claims.

And, like most researchers, I have discovered unexpected riches when mistaking one researcher’s name for another’s. Reducing ambiguity in identification will reduce the incidence of, well, ambiguity in identification.

Jack Park forwarded this link to me.

The Language Problem: Jaguars & The Turing Test

Saturday, September 10th, 2011

The Language Problem: Jaguars & The Turing Test by Gord Hotchkiss.

The post begins innocently enough:

“I love Jaguars!”

When I ask you to understand that sentence, I’m requiring you to take on a pretty significant undertaking, although you do it hundreds of times each day without really thinking about it.

The problem comes with the ambiguity of words.

If you appreciate discussions of language, meaning and the shortfalls of our computing companions, you will really like this article and the promised following posts.

Not to mention bringing into sharp relief the issues that topic map authors (or indexers) face when trying to specify a subject that will be recognized and used by N unknown users.

I suppose that is really the tricky part, or at least part of it: the communication channel for an index or topic map is only one way. There is no opportunity for the author to correct a reading or mis-reading. All that lies with the user/reader alone.

That’s What She Said: Double Entendre Identification

Sunday, May 1st, 2011

That’s What She Said: Double Entendre Identification by Chloé Kiddon and Yuriy Brun.

Abstract:

Humor identification is a hard natural language understanding problem. We identify a subproblem — the “that’s what she said” problem—with two distinguishing characteristics: (1) use of nouns that are euphemisms for sexually explicit nouns and (2) structure common in the erotic domain. We address this problem in a classification approach that includes features that model those two characteristics. Experiments on web data demonstrate that our approach improves precision by 12% over baseline techniques that use only word-based features.

A highly entertaining paper that examines a particular type of double entendre, which is itself a particular type of metaphor.

The authors note:

A “that’s what she said” (TWSS) joke is a type of double entendre. A double entendre, or adianoeta, is an expression that can be understood in two different ways: an innocuous, straightforward way, given the context, and a risqué way that indirectly alludes to a different, indecent context. To our knowledge, related research has not studied the task of identifying double entendres in text or speech. The task is complex and would require both deep semantic and cultural understanding to recognize the vast array of double entendres. We focus on a subtask of double entendre identification: TWSS recognition. We say a sentence is a TWSS if it is funny to follow that sentence with “that’s what she said”. (emphasis added)

It would be interesting to see a crowd-sourced topic map project on double entendre.

BTW, strictly for non-office enjoyment, see: TWSS, a site that collects TWSS stories.

Semantic Ambiguity and Perceived Ambiguity

Wednesday, December 1st, 2010

Semantic Ambiguity and Perceived Ambiguity by Massimo Poesio.

Abstract:

I explore some of the issues that arise when trying to establish a connection between the underspecification hypothesis pursued in the NLP literature and work on ambiguity in semantics and in the psychological literature. A theory of underspecification is developed `from the first principles’, i.e., starting from a definition of what it means for a sentence to be semantically ambiguous and from what we know about the way humans deal with ambiguity. An underspecified language is specified as the translation language of a grammar covering sentences that display three classes of semantic ambiguity: lexical ambiguity, scopal ambiguity, and referential ambiguity. The expressions of this language denote sets of senses. A formalization of defeasible reasoning with underspecified representations is presented, based on Default Logic. Some issues to be confronted by such a formalization are discussed.

Practice is grounded on actual experience (“the burnt hand learns best”) and on understanding the nature of the task and applying that understanding. Neither is really complete without the other.

Poesio’s paper makes for good mental exercise and hopefully deeper insight into the difficulties that surround ambiguity and its reduction.

Measuring the meaning of words in contexts:…

Sunday, November 21st, 2010

Measuring the meaning of words in contexts: An automated analysis of controversies about ‘Monarch butterflies,’ ‘Frankenfoods,’ and ‘stem cells’ Author(s): Loet Leydesdorff and Iina Hellsten Keywords: co-words, metaphors, diaphors, context, meaning

Abstract:

Co-words have been considered as carriers of meaning across different domains in studies of science, technology, and society. Words and co-words, however, obtain meaning in sentences, and sentences obtain meaning in their contexts of use. At the science/society interface, words can be expected to have different meanings: the codes of communication that provide meaning to words differ on the varying sides of the interface. Furthermore, meanings and interfaces may change over time. Given this structuring of meaning across interfaces and over time, we distinguish between metaphors and diaphors as reflexive mechanisms that facilitate the translation between contexts. Our empirical focus is on three recent scientific controversies: Monarch butterflies, Frankenfoods, and stem-cell therapies. This study explores new avenues that relate the study of co-word analysis in context with the sociological quest for the analysis and processing of meaning.

Excellent article on shifts of word meaning over time. Reports sufficient detail on methodology that interested readers will be able to duplicate or extend the research reported here.

Questions:

  1. Annotated bibliography of research citing this paper.
  2. Design a study of the shifting meaning of 2 or 3 terms. What texts would you select? (3-5 pages, with citations)
  3. Perform a study of shifting meaning of terms in library science. (Project)

A Direct Mapping of Relational Data to RDF

Thursday, November 18th, 2010

A Direct Mapping of Relational Data to RDF

A major step towards putting relational data “on the web.”

Identifying what that data means and providing a basis for reconciling it with other data remains to be addressed.

URIs and Identity

Thursday, November 18th, 2010

If I read Halpin and others correctly, URIs identify the subjects they identify, except when they identify some other subject and it isn’t possible to know which of any number of subjects is being identified.

That is what I (and others) take as “ambiguity.”

Some readers have taken my comments on URIs to be critical of RDF, which wasn’t my intent.

What I object to is the sentiment that everyone should use only URIs and then cherry pick any RDF graph that may result for identity purposes.

For example, in a family tree, there may be an entry: John Smith.

For which we can create: http://myfamilytree.smith.com/john_smith

That may resolve to an RDF graph but what properties in that graph identify a particular John Smith?

A “uniform” syntax for that “identifier” isn’t helpful if we all reach various conclusions about what properties in the graph to use for identification.

Or if we have different tests to evaluate the values of those properties.

Even with an RDF graph and rules for which properties to evaluate, we may still have ambiguity.

But rules for evaluation of RDF graphs for identity lessen the ambiguity.

All within the context, format, data model of RDF.

It does detract from URIs as identifiers but URIs as identifiers are no more viable than any single token as an identifier.

Sets of key/value pairs, which are made up of tokens, have the potential to lessen ambiguity, but not banish it.
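A minimal sketch of that point in Python. The two records, their property names and values, and the `same_subject` rule are all hypothetical, invented for illustration; none of them comes from an RDF vocabulary or library.

```python
# Hypothetical records for two "John Smith" entries; every key and value
# here is invented for illustration.
john_a = {"name": "John Smith", "born": "1850", "birthplace": "Leeds"}
john_b = {"name": "John Smith", "born": "1850", "birthplace": "York"}


def same_subject(a, b, keys):
    """Treat two records as the same subject if they agree on every key in `keys`.

    `keys` is the identity rule: which properties to compare. A key missing
    from either record counts as a disagreement.
    """
    return all(k in a and k in b and a[k] == b[k] for k in keys)


# A single token (the name) cannot tell the two entries apart.
by_name = same_subject(john_a, john_b, ["name"])

# Readers applying different rules reach different conclusions.
by_name_and_birth = same_subject(john_a, john_b, ["name", "born"])
by_all = same_subject(john_a, john_b, ["name", "born", "birthplace"])

print(by_name, by_name_and_birth, by_all)
```

More keys narrow the match, but only for readers who agree on which keys to compare and how to compare their values.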

Reducing Ambiguity, LOD, Ookaboo, TMRM

Tuesday, November 16th, 2010

While reading Resource Identity and Semantic Extensions: Making Sense of Ambiguity and In Defense of Ambiguity it occurred to me that reducing ambiguity has a hidden assumption.

That hidden assumption is the intended audience for whom I wish to reduce ambiguity.

For example, Ookaboo solves the problem of multiple vocabularies for its intended audience thusly:

Our strategy for dealing with multiple subject terminologies is to what we call a reference set, which in this case is

http://ookaboo.com/o/pictures/topic/2021903/Central_Air_Force_Museum#it
http://dbpedia.org/resource/Central_Air_Force_Museum
http://rdf.freebase.com/ns/m.0g_2bv

If we want to assert foaf:depicts we assert foaf:depicts against all of these. The idea is that not all clients are going to have the inferencing capabilities that I wish they’d have, so I’m trying to assert terms in the most “core” databases of the LOD cloud.

In a case like this we may have YAGO, OpenCyc, UMBEL and other terms available. Relationships like this are expressed as

<:Whatever> <ontology2:aka>
<http://mpii.de/yago/resource/Central_Air_Force_Museum> .

<ontology2:aka>, not dereferencable yet, means (roughly) that “some people use term X to refer to substantially the same thing as term Y.” It’s my own answer to the <owl:sameAs> problem and deliberately leaves the exact semantics to the reader. (It’s a lossy expression of the data structures that I use for entity management)
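The reference set strategy quoted above can be sketched in a few lines of Python. This is only an illustration of the idea, not Ookaboo's implementation: the three identifiers are the ones quoted, the `foaf:depicts` IRI is the standard FOAF one, and the image URL and function names are hypothetical.

```python
# The three identifiers are the reference set quoted above.
REFERENCE_SET = [
    "http://ookaboo.com/o/pictures/topic/2021903/Central_Air_Force_Museum#it",
    "http://dbpedia.org/resource/Central_Air_Force_Museum",
    "http://rdf.freebase.com/ns/m.0g_2bv",
]
FOAF_DEPICTS = "http://xmlns.com/foaf/0.1/depicts"


def depicts_triples(image, reference_set):
    """Return one (subject, predicate, object) triple per identifier in the set."""
    return [(image, FOAF_DEPICTS, topic) for topic in reference_set]


def to_ntriples(triples):
    """Serialize triples whose terms are all IRIs as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


image = "http://example.org/images/museum.jpg"  # hypothetical image URL
triples = depicts_triples(image, REFERENCE_SET)
print(to_ntriples(triples))
```

A client that recognizes any one of the three identifiers can use the assertion without doing any inferencing of its own.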

This is very like a TMRM solution since it gathers different identifications together, in hopes that at least one will be understood by a reader.

This is very unlike a TMRM solution because it has no legend to say how to compare these “values,” much less their “key.”

The lack of a legend makes integration in legal, technical, medical or intelligence applications, ah, difficult.

Still, it is encouraging to see the better Linked Data applications moving in the direction of the TMRM.

Ambiguity and Linked Data URIs

Monday, November 8th, 2010

I like the proposal by Ian Davis to avoid the 303 cloud while trying to fix the mistake of confusing identifiers with addresses in an address space.

Linked data URIs are already known to be subject to the same issues of ambiguity as any other naming convention.

All naming conventions are subject to ambiguity and “expanded” naming conventions, such as a list of properties in a topic map, may make the ambiguity a bit more manageable.

That depends on a presumption that if more information is added and a user advised of it, the risk of ambiguity will be reduced.

But the user needs to be able to use the additional information. What if the additional information is to distinguish two concepts in calculus and the reader is innocent of even basic algebra?

That is to say, ambiguity can be overcome only in particular contexts.

But overcoming ambiguity in a particular context may be enough. Such as:

  • Interchange between intelligence agencies
  • Interchange between audited entities and their auditors (GAO, SEC, Federal Reserve (or their foreign equivalents))
  • Interchange between manufacturers and distributors

None of those are “golden age of seamless knowledge sharing and universal democratization of decision making” applications, or even “scheduling tennis matches” applications.

They are applications that can reduce incremental costs, improve overall efficiency and perhaps contribute to achievement of organizational goals.

Perhaps that is enough.

On Classifying Drifting Concepts in P2P Networks

Saturday, November 6th, 2010

On Classifying Drifting Concepts in P2P Networks Authors: Hock Hee Ang, Vivekanand Gopalkrishnan, Wee Keong Ng and Steven Hoi Keywords: Concept drift, classification, peer-to-peer (P2P) networks, distributed classification

Abstract:

Concept drift is a common challenge for many real-world data mining and knowledge discovery applications. Most of the existing studies for concept drift are based on centralized settings, and are often hard to adapt in a distributed computing environment. In this paper, we investigate a new research problem, P2P concept drift detection, which aims to effectively classify drifting concepts in P2P networks. We propose a novel P2P learning framework for concept drift classification, which includes both reactive and proactive approaches to classify the drifting concepts in a distributed manner. Our empirical study shows that the proposed technique is able to effectively detect the drifting concepts and improve the classification performance.

The authors define the problem as:

Concept drift refers to the learning problem where the target concept to be predicted, changes over time in some unforeseen behaviors. It is commonly found in many dynamic environments, such as data streams, P2P systems, etc. Real-world examples include network intrusion detection, spam detection, fraud detection, epidemiological, and climate or demographic data, etc.

The authors may well have been the first to formulate this problem among mechanical peers, but any humanist could have pointed out examples of concept drift between people, both in the literature and in real life.

Questions:

  1. What are the implications of concept drift for Linked Data? (3-5 pages, no citations)
  2. What are the implications of concept drift for static ontologies? (3-5 pages, no citations)
  3. Is concept development (over time) another form of concept drift? (3-5 pages, citations, illustrations, presentation)

*****
PS: Finding this paper is an illustration of ambiguity leading to serendipitous discovery. I searched for one of the authors instead of the exact title of another paper. While scanning the search results I found this paper.

Ambiguity and Serendipity

Friday, November 5th, 2010

There was an email discussion recently where ambiguity was discussed as something to be avoided.

It occurred to me, if there were no ambiguity, there would be no serendipity.

Think about the last time you searched for a particular paper. If you remembered enough to go directly to it, you did not see any similar or closely resembling papers along the way.

Now imagine every information request you make results in exactly what you were searching for.

What a terribly dull search experience that would be!

Topic maps can produce the circumstances where serendipity occurs because a subject can be identified any number of ways. Quite possibly several that you are unaware of. And seeing those other ways may spark a memory of another paper, perhaps another line of thought, etc.

I think my list of “other names” for record linkage now exceeds 25 and I really need to cast those into a topic map fragment along with citations to the places they can be found.
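As a rough sketch of how that might look, here is a Python fragment in which one subject carries several names and a search on any one of them surfaces the rest. The names are a few common synonyms for record linkage, nowhere near the full list; the dictionary layout and function names are hypothetical and not a topic map API.

```python
# One subject, several names. The index structure is a hypothetical
# illustration, not a topic map data model.
SUBJECTS = {
    "record-linkage": {
        "record linkage",
        "entity resolution",
        "deduplication",
        "data matching",
        "merge/purge",
    },
}


def build_name_index(subjects):
    """Map every lowercased name to the subject it identifies."""
    index = {}
    for subject, names in subjects.items():
        for name in names:
            index[name.lower()] = subject
    return index


def other_names(query, subjects, index):
    """Return the names, other than the query, of the subject the query identifies."""
    subject = index.get(query.lower())
    if subject is None:
        return set()
    return {n for n in subjects[subject] if n.lower() != query.lower()}


index = build_name_index(SUBJECTS)
found = other_names("Entity Resolution", SUBJECTS, index)
print(sorted(found))
```

The reader who arrives knowing only one name leaves with the others, which is where the chance of a serendipitous find comes from.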

I don’t think of topic maps as a means to avoid ambiguity but rather as a means to make ambiguity a manageable part of an information seeking experience.