Archive for the ‘Semantic Inconsistency’ Category

Interpretation Under Ambiguity [First Cut Search Results]

Sunday, February 7th, 2016

Interpretation Under Ambiguity by Peter Norvig.

From the paper:

Introduction

This paper is concerned with the problem of semantic and pragmatic interpretation of sentences. We start with a standard strategy for interpretation, and show how problems relating to ambiguity can confound this strategy, leading us to a more complex strategy. We start with the simplest of strategies:

Strategy 1: Apply syntactic rules to the sentence to derive a parse tree, then apply semantic rules to get a translation into some logical form, and finally do a pragmatic interpretation to arrive at the final meaning.

Although this strategy completely ignores ambiguity, and is intended as a sort of strawman, it is in fact a commonly held approach. For example, it is approximately the strategy assumed by Montague grammar, where `pragmatic interpretation’ is replaced by `model theoretic interpretation.’ The problem with this strategy is that ambiguity can strike at the lexical, syntactic, semantic, or pragmatic level, introducing multiple interpretations. The obvious way to counter this problem is as follows:

Strategy 2: Apply syntactic rules to the sentence to derive a set of parse trees, then apply semantic rules to get a set of translations in some logical form, discarding any inconsistent formulae. Finally compute pragmatic interpretation scores for each possibility, to arrive at the `best’ interpretation (i.e. `most consistent’ or `most likely’ in the given context).

In this framework, the lexicon, grammar, and semantic and pragmatic interpretation rules determine a mapping between sentences and meanings. A string with exactly one interpretation is unambiguous, one with no interpretation is anomalous, and one with multiple interpretations is ambiguous. To enumerate the possible parses and logical forms of a sentence is the proper job of a linguist; to then choose from the possibilities the one “correct” or “intended” meaning of an utterance is an exercise in pragmatics or Artificial Intelligence.

One major problem with Strategy 2 is that it ignores the difference between sentences that seem truly ambiguous to the listener, and those that are only found to be ambiguous after careful analysis by the linguist. For example, each of (1-3) is technically ambiguous (with could signal the instrument or accompanier case, and port could be a harbor or the left side of a ship), but only (3) would be seen as ambiguous in a neutral context.

(1) I saw the woman with long blond hair.
(2) I drank a glass of port.
(3) I saw her duck.

Lotfi Zadeh (personal communication) has suggested that ambiguity is a matter of degree. He assumes each interpretation has a likelihood score attached to it. A sentence with a large gap between the highest and second ranked interpretation has low ambiguity; one with nearly-equal ranked interpretations has high ambiguity; and in general the degree of ambiguity is inversely proportional to the sharpness of the drop-off in ranking. So, in (1) and (2) above, the degree of ambiguity is below some threshold, and thus is not noticed. In (3), on the other hand, there are two similarly ranked interpretations, and the ambiguity is perceived as such. Many researchers, from Hockett (1954) to Jackendoff (1987), have suggested that the interpretation of sentences like (3) is similar to the perception of visual illusions such as the Necker cube or the vase/faces or duck/rabbit illusion. In other words, it is possible to shift back and forth between alternate interpretations, but it is not possible to perceive both at once. This leads us to Strategy 3:

Strategy 3: Do syntactic, semantic, and pragmatic interpretation as in Strategy 2. Discard the low-ranking interpretations, according to some threshold function. If there is more than one interpretation remaining, alternate between them.

Strategy 3 treats ambiguity seriously, but it leaves at least four problems untreated. One problem is the practicality of enumerating all possible parses and interpretations. A second is how syntactic and lexical preferences can lead the reader to an unlikely interpretation. Third, we can change our mind about the meaning of a sentence-“at first I thought it meant this, but now I see it means that.” Finally, our affectual reaction to ambiguity is variable. Ambiguity can go unnoticed, or be humorous, confusing, or perfectly harmonious. By `harmonious,’ I mean that several interpretations can be accepted simultaneously, as opposed to the case where one interpretation is selected. These problems will be addressed in the following sections.

Apologies for the long introduction quote but I want to entice you to read Norvig’s essay in full and if you have the time, the references that he cites.

It’s the literature you will have to master to use search engines and develop indexing strategies.

At least for one approach to search and indexing.

That within a language there is enough commonality for automated indexing or searching to be useful has been proven over and over again by Internet search engines.

But at the same time, the first twenty or so results typically leave you wondering what interpretation the search engine put on your words.

As I said, Peter’s approach is useful, at least for a first cut at search results.

The problem is that the first cut has become the norm for “success” of search results.

That works if I want to pay lawyers, doctors, teachers and others to find the same results as others have found before (past tense).

That cost doesn’t appear as a line item in any budget but repetitive “finding” of the same information over and over again is certainly a cost to any enterprise.

First cut on semantic interpretation, follow Norvig.

Saving re-finding costs and the cost of not-finding, requires something more robust than a one model to find words and in the search darkness bind them to particular meanings.

PS: See Peter@norvig.com for an extensive set of resources, papers, presentations, etc.

I first saw this in a tweet by James Fuller.

Kidnapping Caitlynn (47 AKAs – Is There a Topic Map in the House?)

Thursday, December 10th, 2015

Kidnapping Caitlynn in 10 minutes long, but has accumulated forty-seven (47 AKAs).

Imagine the search difficulty in finding reviews under all forty-eight (48) titles.

Even better, imagine your search request was for something that really mattered.

Like known terrorists crossing national borders using their real names and passports.

Intelligence services aren’t doing all that hot even with string to string matches.

Perhaps that explains their inability to consider more sophisticated doctrines of identity.

If you can’t do string to string, more complex notions will grind your system to a halt.

Maybe intelligence agencies need new contractors. You think?

IoT: The New Tower of Babel

Thursday, December 10th, 2015

640-babel

Luke Anderson‘s post at Clickhole, titled: Humanity Could Totally Pull Off The Tower Of Babel At This Point, was a strong reminder of the Internet of Things (IoT).

See what you think:

If you went to Sunday school, you know the story: After the Biblical flood, the people of earth came together to build the mighty Tower of Babel. Speaking with one language and working tirelessly, they built a tower so tall that God Himself felt threatened by it. So, He fractured their language so that they couldn’t understand each other, construction ceased, and mankind spread out across the ancient world.

We’ve come a long way in the few millennia since then, and at this point, humanity could totally pull off the Tower of Babel.

Just look at the feats of human engineering we’ve accomplished since then: the Great Wall; the Golden Gate Bridge; the Burj Khalifa. And don’t even get me started on the International Space Station. Building a single tall building? It’d be a piece of cake.

Think about it. Right off the bat, we’d be able to communicate with each other, no problem. Besides most of the world speaking either English, Spanish, and/or Chinese by now, we’ve got translators, Rosetta Stone, Duolingo, the whole nine yards. Hell, IKEA instructions don’t even have words and we have no problem putting their stuff together. I can see how a guy working next to you suddenly speaking Arabic would throw you for a loop a few centuries ago. But now, I bet we could be topping off the tower and storming heaven in the time it took people of the past to say “Hey, how ya doing?”

Compare this Internet of Things statement from the Masters of Contracts that Yield No Useful Result:


IoT implementation, at its core, is the integration of dozens and up to tens of thousands of devices seamlessly communicating with each other, exchanging information and commands, and revealing insights. However, when devices have different usage scenarios and operating requirements that aren’t compatible with other devices, the system can break down. The ability to integrate different elements or nodes within broader systems, or bringing data together to drive insights and improve operations, becomes more complicated and costly. When this occurs, IoT can’t reach its potential, and rather than an Internet of everything, you see siloed Internets of some things.

The first, in case you can’t tell from it being posted at Clickhole, was meant as sarcasm or humor.

The second was deadly serious from folks who would put a permanent siphon on your bank account. Whether their services are cost effective or not is up to you to judge.

The Tower of Babel is a statement about semantics and the human condition. It should come as no surprise that we all prefer our language over that of others, whether those are natural or programming languages. Moreover, judging from code reuse, to say nothing of the publishing market, we prefer our restatements of the material, despite equally useful statements by others.

How else would you explain the proliferation of MS Excel books? 😉 One really good one is more than enough. Ditto for Bible translations.

Creating new languages to “fix” semantic diversity just adds another partially adopted language to the welter of languages that need to be integrated.

The better option, at least from my point of view, is to create mappings between languages, mappings that are based on key/value pairs to enable others to build upon, contract or expand those mappings.

It simply isn’t possible to foresee every use case or language that needs semantic integration but if we perform such semantic integration as returns ROI for us, then we can leave the next extension or contraction of that mapping to the next person with a different ROI.

It’s heady stuff to think we can cure the problem represented by the legendary Tower of Babel, but there is a name for that. It’s called hubris and it never leads to a good end.

The Encyclopedia of Life v2:…

Saturday, May 10th, 2014

The Encyclopedia of Life v2: Providing Global Access to Knowledge About Life on Earth by Cynthia S. Parr, et al. (Biodiversity Data Journal 2: e1079 (29 Apr 2014) doi: 10.3897/BDJ.2.e1079)

Abstract:

The Encyclopedia of Life (EOL, http://eol.org) aims to provide unprecedented global access to a broad range of information about life on Earth. It currently contains 3.5 million distinct pages for taxa and provides content for 1.3 million of those pages. The content is primarily contributed by EOL content partners (providers) that have a more limited geographic, taxonomic or topical scope. EOL aggregates these data and automatically integrates them based on associated scientific names and other classification information. EOL also provides interfaces for curation and direct content addition. All materials in EOL are either in the public domain or licensed under a Creative Commons license. In addition to the web interface, EOL is also accessible through an Application Programming Interface.

In this paper, we review recent developments added for Version 2 of the web site and subsequent releases through Version 2.2, which have made EOL more engaging, personal, accessible and internationalizable. We outline the core features and technical architecture of the system. We summarize milestones achieved so far by EOL to present results of the current system implementation and establish benchmarks upon which to judge future improvements.

We have shown that it is possible to successfully integrate large amounts of descriptive biodiversity data from diverse sources into a robust, standards-based, dynamic, and scalable infrastructure. Increasing global participation and the emergence of EOL-powered applications demonstrate that EOL is becoming a significant resource for anyone interested in biological diversity.

This section on the organization of the taxonomy for the Encyclopedia of Life v2 seems particularly relevant:

Resource documents made available by content partners define the text and multimedia being provided as well as the taxa to which the content refers, the associations between content and taxa, and the associations among taxa (i.e. taxonomies). Expert taxonomists often disagree about the best classification for a given group of organisms, and there is no universal taxonomy for partners to adhere to (Patterson et al. 2008, Rotman et al. 2012a, Yoon and Rose 2001). As an aggregator, EOL accepts all taxonomic viewpoints from partners and attempts to assign them to existing Taxon Pages, or create new Taxon Pages when necessary. A reconciliation algorithm uses incoming taxon information, previously indexed data, and assertions from our curators to determine the best aggregation strategy. (links omitted)

Integration of information without agreement on a single view of the information. (Have we heard this before?)

If you think of the taxon pages as proxies, it is easier to see the topic map aspects of this project.

Where Does the Data Go?

Monday, December 23rd, 2013

Where Does the Data Go?

A brief editorial on The Availability of Research Data Declines Rapidly with Article Age by Timothy H. Vines, et.al., which reads in part:

A group of researchers in Canada examined 516 articles published between 1991 and 2011, and “found that availability of the data was strongly affected by article age.” For instance, the team reports that the odds of finding a working email address associated with a paper decreased by 7 percent each year and that the odds of an extant dataset decreased by 17 percent each year since publication. Some data was technically available, the researchers note, but stored on floppy disk or on zip drives that many researchers no longer have the hardware to access.

The one of highlights of the article (which appears in Current Biology) reads:

Broken e-mails and obsolete storage devices were the main obstacles to data sharing

Curious because I would have ventured that semantic drift over twenty (20) years would have been a major factor as well.

Then I read the paper and discovered:

To avoid potential confounding effects of data type and different research community practices, we focused on recovering data from articles containing morphological data from plants or animals that made use of a discriminant function analysis (DFA). [Under Results, the online edition has no page numbers]

The authors appeared to have dodged the semantic bullet by the selection of data and their non-reporting of difficulties, if any, in using the data (19.5%) that was shared by the original authors.

Preservation of data is a major concern for researchers but I would urge that the semantics of data be preserved as well.

Imagine that feeling when you “ls -l” a directory and recognize only some of the file names writ large. Writ very large.

Detecting Semantic Overlap and Discovering Precedents…

Monday, July 8th, 2013

Detecting Semantic Overlap and Discovering Precedents in the Biodiversity Research Literature by Graeme Hirst, Nadia Talenty, and Sara Scharfz.

Abstract:

Scientific literature on biodiversity is longevous, but even when legacy publications are available online, researchers often fail to search it adequately or effectively for prior publications; consequently, new research may replicate, or fail to adequately take into account, previously published research. The mechanisms of the Semantic Web and methods developed in contemporary research in natural language processing could be used, in the near-term future, as the basis for a precedent-finding system that would take the text of an author’s early draft (or a submitted manuscript) and find potentially related ideas in published work. Methods would include text-similarity metrics that take different terminologies, synonymy, paraphrase, discourse relations, and structure of argumentation into account.

Footnote one (1) of the paper gives an idea of the problem the authors face:

Natural history scientists work in fragmented, highly distributed and parochial communities, each with domain specific requirements and methodologies [Scoble 2008]. Their output is heterogeneous, high volume and typically of low impact, but with a citation half-life that may run into centuries” (Smith et al. 2009). “The cited half-life of publications in taxonomy is longer than in any other scientific discipline, and the decay rate is longer than in any scientific discipline” (Moritz 2005). Unfortunately, we have been unable to identify the study that is the basis for Moritz’s remark.

The paper explores in detail issues that have daunted various search techniques, when the material is available in electronic format at all.

The authors make a general proposal for addressing these issues, with mention of the Semantic Web but omit from their plan:

The other omission is semantic interpretation into a logical form, represented in XML, that draws on ontologies in the style of the original Berners-Lee, Hendler, and Lassila (2001) proposal for the Semantic Web. The problem with logical-form representation is that it implies a degree of precision in meaning that is not appropriate for the kind of matching we are proposing here. This is not to say that logical forms would be useless. On the contrary, they are employed by some approaches to paraphrase and textual entailment (section 4.1 above) and hence might appear in the system if only for that reason; but even so, they would form only one component of a broader and somewhat looser kind of semantic representation.

That’s the problem with the Semantic Web in a nutshell:

The problem with logical-form representation is that it implies a degree of precision in meaning that is not appropriate for the kind of matching we are proposing here.

What if I want to be logically precise sometimes but not others?

What if I want to be more precise in some places and less precise in others?

What if I want to have different degrees or types of imprecision?

With topic maps the question is: How im/precise do you want to be?

NSA…Verizon…Obama…
Connecting the Dots. Or not.

Friday, June 7th, 2013

Why Verizon?

The first question that came to mind when the Guardian broke the NSA-Verizon news.

Here’s why I ask:

Verizon market share

(source: http://www.statista.com/statistics/199359/market-share-of-wireless-carriers-in-the-us-by-subscriptions/)

Verizon over 2011-2012 had only 34% of the cell phone market.

Unless terrorists prefer Verizon for ideological reasons, why Verizon?

Choosing only Verizon means the NSA is missing 66% of potential terrorist cell traffic.

That sounds like a bad plan.

What other reason could there be for picking Verizon?

Consider some other known players:

President Barack Obama, candidate for President of the United States, 2012.

“Bundlers” who gathered donations for Barack Obama:

Min Max Name City State Employer
$200,000 $500,000 Hill, David Silver Spring MD Verizon Communications
$200,000 $500,000 Brown, Kathryn Oakton VA Verizon Communications
$50,000 $100,000 Milch, Randal Bethesda MD Verizon Communications

(Source: OpenSecrets.org – 2012 Presidential – Bundlers)

BTW, the Max category means more money may have been given, but that is the top reporting category.

I have informally “identified” the bundlers as follows:

  • Kathryn C. Brown

    Kathryn C. Brown is senior vice president – Public Policy Development and Corporate Responsibility. She has been with the company since June 2002. She is responsible for policy development and issues management, public policy messaging, strategic alliances and public affairs programs, including Verizon Reads.

    Ms. Brown is also responsible for federal, state and international public policy development and international government relations for Verizon. In that role she develops public policy positions and is responsible for project management on emerging domestic and international issues. She also manages relations with think tanks as well as consumer, industry and trade groups important to the public policy process.

  • David A. Hill, Bloomberg Business Week reports: David A. Hill serves as Director of Verizon Maryland Inc.

    LinkedIn profile reports David A. Hill worked for Verizon, VP & General Counsel (2000 – 2006), Associate General Counsel (March 2006 – 2009), Vice President & Associate General Counsel (March 2009 – September 2011) “Served as a liaison between Verizon and the Obama Administration”

  • Randal S. Milch Executive Vice President – Public Policy and General Counsel

What is Verizon making for each data delivery? Is this cash for cash given?

If someone gave your more than $1 million (how much more is unknown), would you talk to them about such a court order?

If you read the “secret” court order, you will notice it was signed on April 23, 2013.

There isn’t a Kathryn C. Brown in Oakton in the White House visitor’s log, but I did find this record, where a “Kathryn C. Brown” made an appointment at the Whitehouse and was seen two (2) days later on the 17th of January 2013.

BROWN,KATHRYN,C,U69535,,VA,,,,,1/15/13 0:00,1/17/13 9:30,1/17/13 23:59,,176,CM,WIN,1/15/13 11:27,CM,,POTUS/FLOTUS,WH,State Floo,MCNAMARALAWDER,CLAUDIA,,,04/26/2013

I don’t have all the dots connected because I am lacking some unknown # of the players, internal Verizon communications, Verizon accounting records showing government payments, but it is enough to make you wonder about the purpose of the “secret” court order.

Was it a serious attempt at gathering data for national security reasons?

Or was it gathering data as a pretext for payments to Verizon or other contractors?

My vote goes for “pretext for payments.”

I say that because using data from different sources has always been hard.

In fact, about 60 to 80% of the time of a data analyst is spent “cleaning up data” for further processing.

The phrase “cleaning up data” is the colloquial form of “semantic impedance.”

Semantic impedance happens when the same people are known by different names in different data sets or different people are known by the same names in the same or different data sets.

Remember Kathryn Brown, of Oakton, VA? One of the Obama bundlers. Let’s use her as an example of “semantic impedance.”

The FEC has a record for Kathryn Brown of Oakton, VA.

But a search engine found:

Kathryn C. Brown

Same person? Or different?

I found another Kathryn Brown at Facebook:

And an image of Facebook Kathryn Brown:

Kathryn Brown, Facebook

And a photo from a vacation she took:

Bangkok

Not to mention the Kathryn Brown that I found at Twitter.

Kathryn Brown, Twitter

That’s only four (4) data sources and I have at least four (4) different Kathryn Browns.

Across the United States, a quick search shows 227,000 Kathryn Browns.

Remember that is just a personal name. What about different forms of addresses? Or names of employers? Or job descriptions? Or simple errors, like the 20% error rate in credit report records.

Take all the phones, plus names, addresses, employers, job descriptions, errors + other data and multiply that times 311.6 million Americans.

Can that problem be solved with petabytes of data and teraflops of processing?

Not a chance.

Remember that my identification of Kathryn “bundler” Brown with the Kathryn C. Brown of Verison was a human judgement, not an automatic rule. Nor would a computer think to check the White House visitor logs to see if another, possibly the same Kathryn C. Brown visited the White House before the secret order was signed.

Human judgement is required because all the data that the NSA has been collecting is “dirty” data, from one perspective or other. Either is is truly “dirty” in the sense of having errors or it is “dirty” in the sense it doesn’t play well with other data.

The Orwellian fearists can stop huffing and puffing about the coming eclipse of civil liberties. Those passed from view a short time after 9/11 with the passage of the Patriot Act.

That wasn’t the fault of ineffectual NSA data collection. American voters bear responsibility for the loss of civil liberties not voting leadership into office that would repeal the Patriot Act.

Ineffectual NSA data collection impedes the development of techniques that for a sanely scoped data collection effort could make a difference.

A sane scope for preventing terrorist attacks could be starting with a set of known or suspected terrorist phone numbers. Using all phone data (not just from Obama contributors), only numbers contacting or being contacted by those numbers would be subject to further analysis.

Using that much smaller set of phone numbers as identifiers, we could then collect other data, such as names and addresses associated with that smaller set of phone numbers. That doesn’t make the data any cleaner but it does give us a starting point for mapping “dirty” data sets into our starter set.

The next step would be create mappings from other data sets. If we say why we have created a mapping, others can evaluate the accuracy of our mappings.

Those tasks would require computer assistance, but they ultimately would be matters of human judgement.

Examples of such judgements exist, say for example in Palantir product line. If you watch Palantir Gotham being used to model biological relationships, take note of the results that were tagged by another analyst. And how the presenter tags additional material that becomes available to other researchers.

Computer assisted? Yes. Computer driven? No.

To be fair, human judgement is also involved in ineffectual NSA data collection efforts.

But it is human judgement that rewards sycophants and supporters, not serving the public interest.

FuzzyLaw [FuzzyDBA, FuzzyRDF, FuzzySW?]

Monday, May 20th, 2013

FuzzyLaw

From the webpage:

(…)

FuzzyLaw has gathered explanations of legal terms from members of the public in order to get a sense of what the ‘person on the street’ has in mind when they think of a legal term. By making lay-people’s explanations of legal terms available to interpreters, police and other legal professionals, we hope to stimulate debate and learning about word meaning, public understanding of law and the nature of explanation.

The explanations gathered in FuzzyLaw are unusual in that they are provided by members of the public. These people, all aged over 18, regard themselves as ‘native speakers’, ‘first language speakers’ and ‘mother tongue’ speakers of English and have lived in England and/or Wales for 10 years or more. We might therefore expect that they will understand English legal terminology as well as any member of the public might. No one who has contributed has ever worked in the criminal law system or as an interpreter or translator. They therefore bring no special expertise to the task of explanation, beyond whatever their daily life has provided.

We have gathered explanations for 37 words in total. You can see a sample of these explanations on FuzzyLaw. The sample of explanations is regularly updated. You can also read responses to the terms and the explanations from mainly interpreters, police officers and academics. You are warmly invited to add your own responses and join in the discussion of each and every word. Check back regularly to see how discussions develop and consider bookmarking the site for future visits. The site also contains commentaries on interesting phenomena which have emerged through the site. You can respond to the commentaries too on that page, contributing to the developing research project.

(…)

Have you ever wondered that the ‘person on the street’ thinks about relational databases, RDF or the Semantic Web?

Those are the folks who are being pushed content based on interpretations not their own making.

Here’s a work experiment for you:

  1. Take ten search terms from your local query log.
  2. At each department staff meeting, distribute sheets with the words, requesting everyone to define the terms in their own words. No wrong answers.
  3. Tally up the definitions per department and across the company.
  4. Comments anyone?

I first saw this at: FuzzyLaw: Collection of lay citizens’ understandings of legal terminology.

Hadoop Adds Red Hat [More Hadoop Silos Coming]

Friday, February 22nd, 2013

Red Hat Unveils Big Data and Open Hybrid Cloud Direction

From the post:

Red Hat, Inc. (NYSE: RHT), the world’s leading provider of open source solutions, today announced its big data direction and solutions to satisfy enterprise requirements for highly reliable, scalable, and manageable solutions to effectively run their big data analytics workloads. In addition, Red Hat announced that the company will contribute its Red Hat Storage Hadoop plug-in to the ApacheTM Hadoop® open community to transform Red Hat Storage into a fully-supported, Hadoop-compatible file system for big data environments, and that Red Hat is building a robust network of ecosystem and enterprise integration partners to deliver comprehensive big data solutions to enterprise customers. This is another example of Red Hat’s strategic commitment to big data customers and its continuing efforts to provide them with enterprise solutions through community-driven innovation.

The more Hadoop grows, the more Hadoop silos will as well.

You will need Hadoop and semantic skills to wire Hadoop silos together.

Re-wire with topic maps to avoid re-wiring the same Hadoop silos over and over again.

I first saw this at Red Hat reveal big data plans, open sources HDFS replacement by Elliot Bentley.

Hadoop silos need integration…

Thursday, February 21st, 2013

Hadoop silos need integration, manage all data as asset, say experts by Brian McKenna.

From the post:

Big data hype has caused infantile disorders in corporate organisations over the past year. Hadoop silos, an excess of experimentation, and an exaggeration of the importance of data scientists are among the teething problems of big data, according to experts, who suggest organisations should manage all data as an asset.

Steve Shelton, head of data services at consultancy Detica, part of BAE Systems, said Hadoop silos have become part of the enterprise IT landscape, both in the private and public sectors. “People focused on this new thing called big data and tried to isolate it [in 2011 and 2012],” he said.

The focus has been too concentrated on non-traditional data types, and that has been driven by the suppliers. The business value of data is more effectively understood when you look at it all together, big or otherwise, he said.

Have big data technologies been a distraction? “I think it has been an evolutionary learning step, but businesses are stepping back now. When it comes to information governance, you have to look at data across the patch,” said Shelton.

He said Detica had seen complaints about Hadoop silos, and these were created by people going through a proof-of-concept phase, setting up a Hadoop cluster quickly and building a team. But a Hadoop platform involves extra costs on top, in terms of managing it and integrating it into your existing business processes.

“It’s not been a waste of time and money, it is just a stage. And it is not an insurmountable challenge. The next step is to integrate those silos, but the thinking is immature relative to the technology itself,” said Shelton.

I take this as encouraging news for topic maps.

Semantically diverse data has been stores in semantically diverse datastores. Data, which if integrated, could provide business value.

Again.

There will always be a market for topic maps because people can’t stop creating semantically diverse data and data stores.

How’s that for long term market security?

No matter what data or data storage technology arises, semantic inconsistency will be with us always.

Content-Based Image Retrieval at the End of the Early Years

Tuesday, January 22nd, 2013

Content-Based Image Retrieval at the End of the Early Years by Arnold W.M. Smeulders, Marcel Worring, Simone Santini, Amarnath Gupta, and Ramesh Jain. (Smeulders, A.W.M.; Worring, M.; Santini, S.; Gupta, A.; Jain, R.; , “Content-based image retrieval at the end of the early years,” Pattern Analysis and Machine Intelligence, IEEE Transactions on , vol.22, no.12, pp.1349-1380, Dec 2000
doi: 10.1109/34.895972)

Abstract:

Presents a review of 200 references in content-based image retrieval. The paper starts with discussing the working conditions of content-based retrieval: patterns of use, types of pictures, the role of semantics, and the sensory gap. Subsequent sections discuss computational steps for image retrieval systems. Step one of the review is image processing for retrieval sorted by color, texture, and local geometry. Features for retrieval are discussed next, sorted by: accumulative and global features, salient points, object and shape features, signs, and structural combinations thereof. Similarity of pictures and objects in pictures is reviewed for each of the feature types, in close connection to the types and means of feedback the user of the systems is capable of giving by interaction. We briefly discuss aspects of system engineering: databases, system architecture, and evaluation. In the concluding section, we present our view on: the driving force of the field, the heritage from computer vision, the influence on computer vision, the role of similarity and of interaction, the need for databases, the problem of evaluation, and the role of the semantic gap.

Excellent survey article from 2000 (not 2002 as per the Ostermann paper).

I think you will appreciate the treatment of the “semantic gap,” both in terms of its description as well as ways to address it.

If you are using annotated images in your topic map application, definitely a must read.

User evaluation of automatically generated keywords and toponyms… [of semantic gaps]

Tuesday, January 22nd, 2013

User evaluation of automatically generated keywords and toponyms for geo-referenced images by Frank O. Ostermann, Martin Tomko, Ross Purves. (Ostermann, F. O., Tomko, M. and Purves, R. (2013), User evaluation of automatically generated keywords and toponyms for geo-referenced images. J. Am. Soc. Inf. Sci.. doi: 10.1002/asi.22738)

Abstract:

This article presents the results of a user evaluation of automatically generated concept keywords and place names (toponyms) for geo-referenced images. Automatically annotating images is becoming indispensable for effective information retrieval, since the number of geo-referenced images available online is growing, yet many images are insufficiently tagged or captioned to be efficiently searchable by standard information retrieval procedures. The Tripod project developed original methods for automatically annotating geo-referenced images by generating representations of the likely visible footprint of a geo-referenced image, and using this footprint to query spatial databases and web resources. These queries return raw lists of potential keywords and toponyms, which are subsequently filtered and ranked. This article reports on user experiments designed to evaluate the quality of the generated annotations. The experiments combined quantitative and qualitative approaches: To retrieve a large number of responses, participants rated the annotations in standardized online questionnaires that showed an image and its corresponding keywords. In addition, several focus groups provided rich qualitative information in open discussions. The results of the evaluation show that currently the annotation method performs better on rural images than on urban ones. Further, for each image at least one suitable keyword could be generated. The integration of heterogeneous data sources resulted in some images having a high level of noise in the form of obviously wrong or spurious keywords. The article discusses the evaluation itself and methods to improve the automatic generation of annotations.

An echo of Steve Newcomb’s semantic impedance appears at:

Despite many advances since Smeulders et al.’s (2002) classic paper that set out challenges in content-based image retrieval, the quality of both nonspecialist text-based and content-based image retrieval still appears to lag behind the quality of specialist text retrieval, and the semantic gap, identified by Smeulders et al. as a fundamental issue in content-based image retrieval, remains to be bridged. Smeulders defined the semantic gap as

the lack of coincidence between the information that one can extract from the visual data and the interpretation that the same data have for a user in a given situation. (p. 1353)

In fact, text-based systems that attempt to index images based on text thought to be relevant to an image, for example, by using image captions, tags, or text found near an image in a document, suffer from an identical problem. Since text is being used as a proxy by an individual in annotating image content, those querying a system may or may not have similar worldviews or conceptualizations as the annotator. (emphasis added)

That last sentence could have come out of a topic map book.

Curious what you make of the author’s claim that spatial locations provide an “external context” that bridges the “semantic gap?”

If we all use the same map of spatial locations, are you surprised by the lack of a “semantic gap?”

Appropriating IT: Glue Steps [Gluing Subject Representatives Together?]

Tuesday, October 9th, 2012

Appropriating IT: Glue Steps by Tony Hirst.

Over the years, I’ve been fortunate enough to have been gifted some very evocative, and powerful, ideas that immediately appealed to me when I first heard them and that I’ve been able to draw on, reuse and repurpose over and over again. One such example is “glue logic”, introduced to me by my original OU PhD supervisor George Kiss. The idea of glue logic is to provide a means by which two digital electronic circuits (two “logic” circuits) that don’t share a common interface can be “glued” together.

(diagrams and other material omitted)

This idea is powerful enough in its own right, but there was a second bit to it that made it really remarkable: the circuitry typically used to create the glue logic was a device known as a Field Programmable Gate Array, or FPGA. This is a type of digital circuit whose logical function can be configured, or programmed. That is, I can take my “shapeless” FPGA, and programme it so that it physically implements a particular digital circuit. Just think about that for a moment… You probably have a vague idea that the same computer can be reprogrammed to do particular things, using some vaguely mysterious and magical thing called software, instructions that computer processors follow in order to do incredible things. With an FPGA, the software actually changes the hardware: there is no processor that “runs a programme”; when you programme an FPGA, you change its hardware. FPGAs are, literally, programmable chips. (If you imagine digital circuits to be like bits of plastic, an FPGA is like polymorph.)

The notion of glue logic has stuck with me for two reasons, I think: firstly, because of what it made possible, the idea of flexibly creating an interface between two otherwise incompatible components; secondly, because of the way in which it could be achieved – using a flexible, repurposable, reprogrammable device – one that you could easily reprogramme if the mapping from one device to another wasn’t quite working properly.

If instead of “don’t share a common interface” you read “semantic diversity” and in place of Field Programmable Gate Array, or FPGA, you read “legend,” to “creat[e] an interface between two otherwise incompatible [subject representatives],” you would think Tony’s post was about the topic maps reference model.

Well, this post is and Tony’s is very close.

Particularly the part about being a “reprogrammable device.”

I can tell you: “black” = “schwarz,” but without more, you won’t be able to rely on or extend that statement.

For that, you need a “reprogrammable device” and some basis on which to do the reprogramming.

Legends anyone?

A Good Example of Semantic Inconsistency [C-Suite Appropriate]

Tuesday, October 9th, 2012

A Good Example of Semantic Inconsistency by David Loshin.

You can guide users through the intellectual minefield of Frege, Peirce, Russell, Carnap, Sowa and others to illustrate the need for topic maps, with stunning (as in daunting) graphics.

Or, you can use David’s story:

I was at an event a few weeks back talking about data governance, and a number of the attendees were from technology or software companies. I used the term “semantic inconsistency” and one of the attendees asked me to provide an example of what I meant.

Since we had been discussing customers, I thought about it for a second and then asked him what his definition was of a customer. He said that a customer was someone who had paid the company money for one of their products. I then asked if anyone in the audience was on the support team, and one person raised his hand. I asked him for a definition, and he said that a customer is someone to whom they provide support.

I then posed this scenario: the company issued a 30-day evaluation license to a prospect with full support privileges. Since the prospect had not paid any money for the product, according to the first definition that individual was not a customer. However, since that individual was provided full support privileges, according to the second definition that individual was a customer.

Within each silo, the associated definition is sound, but the underlying data sets are not compatible. An attempt to extract the two customer lists and merge them together into a single list will lead to inconsistent results. This may be even worse if separate agreements dictate how long a purchaser is granted full support privileges – this may lead to many inconsistencies across those two data sets.

Illustrating “semantic inconsistency,” one story at a time.

What’s your 250 – 300 word semantic inconsistency story?

PS: David also points to webinar that will be of interest. Visit his post.