## Archive for the ‘Semantic Diversity’ Category

### Interpretation Under Ambiguity [First Cut Search Results]

Sunday, February 7th, 2016

Interpretation Under Ambiguity by Peter Norvig.

From the paper:

Introduction

This paper is concerned with the problem of semantic and pragmatic interpretation of sentences. We start with a standard strategy for interpretation, and show how problems relating to ambiguity can confound this strategy, leading us to a more complex strategy. We start with the simplest of strategies:

 Strategy 1: Apply syntactic rules to the sentence to derive a parse tree, then apply semantic rules to get a translation into some logical form, and finally do a pragmatic interpretation to arrive at the final meaning.

Although this strategy completely ignores ambiguity, and is intended as a sort of strawman, it is in fact a commonly held approach. For example, it is approximately the strategy assumed by Montague grammar, where pragmatic interpretation’ is replaced by model theoretic interpretation.’ The problem with this strategy is that ambiguity can strike at the lexical, syntactic, semantic, or pragmatic level, introducing multiple interpretations. The obvious way to counter this problem is as follows:

 Strategy 2: Apply syntactic rules to the sentence to derive a set of parse trees, then apply semantic rules to get a set of translations in some logical form, discarding any inconsistent formulae. Finally compute pragmatic interpretation scores for each possibility, to arrive at the best’ interpretation (i.e. most consistent’ or most likely’ in the given context).

In this framework, the lexicon, grammar, and semantic and pragmatic interpretation rules determine a mapping between sentences and meanings. A string with exactly one interpretation is unambiguous, one with no interpretation is anomalous, and one with multiple interpretations is ambiguous. To enumerate the possible parses and logical forms of a sentence is the proper job of a linguist; to then choose from the possibilities the one “correct” or “intended” meaning of an utterance is an exercise in pragmatics or Artificial Intelligence.

One major problem with Strategy 2 is that it ignores the difference between sentences that seem truly ambiguous to the listener, and those that are only found to be ambiguous after careful analysis by the linguist. For example, each of (1-3) is technically ambiguous (with could signal the instrument or accompanier case, and port could be a harbor or the left side of a ship), but only (3) would be seen as ambiguous in a neutral context.

(1) I saw the woman with long blond hair.
(2) I drank a glass of port.
(3) I saw her duck.

Lotfi Zadeh (personal communication) has suggested that ambiguity is a matter of degree. He assumes each interpretation has a likelihood score attached to it. A sentence with a large gap between the highest and second ranked interpretation has low ambiguity; one with nearly-equal ranked interpretations has high ambiguity; and in general the degree of ambiguity is inversely proportional to the sharpness of the drop-off in ranking. So, in (1) and (2) above, the degree of ambiguity is below some threshold, and thus is not noticed. In (3), on the other hand, there are two similarly ranked interpretations, and the ambiguity is perceived as such. Many researchers, from Hockett (1954) to Jackendoff (1987), have suggested that the interpretation of sentences like (3) is similar to the perception of visual illusions such as the Necker cube or the vase/faces or duck/rabbit illusion. In other words, it is possible to shift back and forth between alternate interpretations, but it is not possible to perceive both at once. This leads us to Strategy 3:

 Strategy 3: Do syntactic, semantic, and pragmatic interpretation as in Strategy 2. Discard the low-ranking interpretations, according to some threshold function. If there is more than one interpretation remaining, alternate between them.

Strategy 3 treats ambiguity seriously, but it leaves at least four problems untreated. One problem is the practicality of enumerating all possible parses and interpretations. A second is how syntactic and lexical preferences can lead the reader to an unlikely interpretation. Third, we can change our mind about the meaning of a sentence-“at first I thought it meant this, but now I see it means that.” Finally, our affectual reaction to ambiguity is variable. Ambiguity can go unnoticed, or be humorous, confusing, or perfectly harmonious. By harmonious,’ I mean that several interpretations can be accepted simultaneously, as opposed to the case where one interpretation is selected. These problems will be addressed in the following sections.

Apologies for the long introduction quote but I want to entice you to read Norvig’s essay in full and if you have the time, the references that he cites.

It’s the literature you will have to master to use search engines and develop indexing strategies.

At least for one approach to search and indexing.

That within a language there is enough commonality for automated indexing or searching to be useful has been proven over and over again by Internet search engines.

But at the same time, the first twenty or so results typically leave you wondering what interpretation the search engine put on your words.

As I said, Peter’s approach is useful, at least for a first cut at search results.

The problem is that the first cut has become the norm for “success” of search results.

That works if I want to pay lawyers, doctors, teachers and others to find the same results as others have found before (past tense).

That cost doesn’t appear as a line item in any budget but repetitive “finding” of the same information over and over again is certainly a cost to any enterprise.

First cut on semantic interpretation, follow Norvig.

Saving re-finding costs and the cost of not-finding, requires something more robust than a one model to find words and in the search darkness bind them to particular meanings.

PS: See Peter@norvig.com for an extensive set of resources, papers, presentations, etc.

I first saw this in a tweet by James Fuller.

### Kidnapping Caitlynn (47 AKAs – Is There a Topic Map in the House?)

Thursday, December 10th, 2015

Kidnapping Caitlynn in 10 minutes long, but has accumulated forty-seven (47 AKAs).

Imagine the search difficulty in finding reviews under all forty-eight (48) titles.

Even better, imagine your search request was for something that really mattered.

Like known terrorists crossing national borders using their real names and passports.

Intelligence services aren’t doing all that hot even with string to string matches.

Perhaps that explains their inability to consider more sophisticated doctrines of identity.

If you can’t do string to string, more complex notions will grind your system to a halt.

Maybe intelligence agencies need new contractors. You think?

### IoT: The New Tower of Babel

Thursday, December 10th, 2015

Luke Anderson‘s post at Clickhole, titled: Humanity Could Totally Pull Off The Tower Of Babel At This Point, was a strong reminder of the Internet of Things (IoT).

See what you think:

If you went to Sunday school, you know the story: After the Biblical flood, the people of earth came together to build the mighty Tower of Babel. Speaking with one language and working tirelessly, they built a tower so tall that God Himself felt threatened by it. So, He fractured their language so that they couldn’t understand each other, construction ceased, and mankind spread out across the ancient world.

We’ve come a long way in the few millennia since then, and at this point, humanity could totally pull off the Tower of Babel.

Just look at the feats of human engineering we’ve accomplished since then: the Great Wall; the Golden Gate Bridge; the Burj Khalifa. And don’t even get me started on the International Space Station. Building a single tall building? It’d be a piece of cake.

Think about it. Right off the bat, we’d be able to communicate with each other, no problem. Besides most of the world speaking either English, Spanish, and/or Chinese by now, we’ve got translators, Rosetta Stone, Duolingo, the whole nine yards. Hell, IKEA instructions don’t even have words and we have no problem putting their stuff together. I can see how a guy working next to you suddenly speaking Arabic would throw you for a loop a few centuries ago. But now, I bet we could be topping off the tower and storming heaven in the time it took people of the past to say “Hey, how ya doing?”

Compare this Internet of Things statement from the Masters of Contracts that Yield No Useful Result:

IoT implementation, at its core, is the integration of dozens and up to tens of thousands of devices seamlessly communicating with each other, exchanging information and commands, and revealing insights. However, when devices have different usage scenarios and operating requirements that aren’t compatible with other devices, the system can break down. The ability to integrate different elements or nodes within broader systems, or bringing data together to drive insights and improve operations, becomes more complicated and costly. When this occurs, IoT can’t reach its potential, and rather than an Internet of everything, you see siloed Internets of some things.

The first, in case you can’t tell from it being posted at Clickhole, was meant as sarcasm or humor.

The second was deadly serious from folks who would put a permanent siphon on your bank account. Whether their services are cost effective or not is up to you to judge.

The Tower of Babel is a statement about semantics and the human condition. It should come as no surprise that we all prefer our language over that of others, whether those are natural or programming languages. Moreover, judging from code reuse, to say nothing of the publishing market, we prefer our restatements of the material, despite equally useful statements by others.

How else would you explain the proliferation of MS Excel books? 😉 One really good one is more than enough. Ditto for Bible translations.

Creating new languages to “fix” semantic diversity just adds another partially adopted language to the welter of languages that need to be integrated.

The better option, at least from my point of view, is to create mappings between languages, mappings that are based on key/value pairs to enable others to build upon, contract or expand those mappings.

It simply isn’t possible to foresee every use case or language that needs semantic integration but if we perform such semantic integration as returns ROI for us, then we can leave the next extension or contraction of that mapping to the next person with a different ROI.

It’s heady stuff to think we can cure the problem represented by the legendary Tower of Babel, but there is a name for that. It’s called hubris and it never leads to a good end.

### why I try to teach writing when I am supposed to be teaching art history

Wednesday, December 9th, 2015

why I try to teach writing when I am supposed to be teaching art history

From the post:

My daughter asked me yesterday what I had taught for so long the day before, and I told her, “the history of photography” and “writing.” She found this rather funny, since she, as a second-grader, has lately perfected the art of handwriting, so why would I be teaching it — still — to grown ups? I told her it wasn’t really how to write so much as how to put the ideas together — how to take a lot of information and say something with it to somebody else. How to express an idea in an organised way that lets somebody know what and why you think something. So, it turns out, what we call writing is never really just writing at all. It is expressing something in the hopes of becoming less alone. Of finding a voice, yes, but also in finding an ear to hear that voice, and an ear with a mouth that can speak back. It is about learning to enter into a conversation that becomes frozen in letters, yes, but also flexible in the form of its call and response: a magic trick that has the potential power of magnifying each voice, at times in conflict, but also in collusion, and of building those voices into the choir that can be called community. I realise that there was a time before I could write, and also a time when, like my daughter, writing consisted simply of the magic of transforming a line from my pen into words that could lift off the page no different than how I had set them down. But it feels like the me that is me has always been writing, as long as I can remember. It is this voice, however far it reaches or does not reach, that has been me and will continue to be me as long as I live and, in the strange way of words, enter into history. Someday, somebody will write historiographies in which they will talk about me, and I will consist not of this body that I inhabit, but the words that I string onto a page.

This is not to say that I write for the sake of immortality, so much as its opposite: the potential for a tiny bit of immortality is the by product of my attempt to be alive, in its fullest sense. To make a mark, to piss in the corners of life as it were, although hopefully in a slightly more sophisticated and usually less smelly way. Writing is, to me, the greatest output for the least investment: by writing, I gain a voice in the world which, like the flap of a butterfly’s wing, has the potential to grow on its own, outside of me, beyond me. My conviction that I should write is not so much because I think I’m better or have more of a right to speak than anybody else, but because I’m equally not convinced that anybody, no matter what their position of authority, is better or has more of an authorisation to write than me.

Writing is the greatest power that I can ever have. It is also an intimate passion, an orgy, between the many who write and the many who read, excitedly communicating with each other. For this reason it is not a power that I wish only for myself, for that would be no more interesting than the echo chamber of my own head. I love the power that is in others to write, the liberty they grant me to enter into their heads and hear their voices. I love our power to chime together, across time and space. I love the strange ability to enter into conversations with ghosts, as well as argue with, and just as often befriend, people I may never meet and people I hope to have a chance to meet. Even when we disagree, reading what people have written and taking it seriously feels like a deep form of respect to other human beings, to their right to think freely. It is this power of voices, of the many being able of their own accord to formulate a chorus, that appeals to the idealist deep within my superficially cynical self. To my mind, democracy can only emerge through this chorus: a cacophanous chorus that has the power to express as well as respect the diversity within itself.

A deep essay on writing that I recommend you read in full.

There is a line that hints at a reason for semantic diversity data science and the lack of code reuse in programming.

My conviction that I should write is not so much because I think I’m better or have more of a right to speak than anybody else, but because I’m equally not convinced that anybody, no matter what their position of authority, is better or has more of an authorisation to write than me.

Beyond the question of authority, whose writing do you understand better or more intuitively, yours or the writing or code of someone else? (At least assuming not too much time has gone by since you wrote it.)

The vast majority of use are more comfortable with our own prose or code, even though it required the effort to transpose prose or code written by others into our re-telling.

Being more aware of the nearly universal casting of prose/code to be our own, should help us acknowledge the moral debts to others and to point back to the sources of our prose/code.

I first saw this in a tweet by Atabey Kaygun.

### Rare Find: Honest General Speaks Publicly About IS (ISIL, ISIS)

Monday, December 29th, 2014

In Battle to Defang ISIS, U.S. Targets Its Psychology by Eric Schmitt.

From the post:

Maj. Gen. Michael K. Nagata, commander of American Special Operations forces in the Middle East, sought help this summer in solving an urgent problem for the American military: What makes the Islamic State so dangerous?

Trying to decipher this complex enemy — a hybrid terrorist organization and a conventional army — is such a conundrum that General Nagata assembled an unofficial brain trust outside the traditional realms of expertise within the Pentagon, State Department and intelligence agencies, in search of fresh ideas and inspiration. Business professors, for example, are examining the Islamic State’s marketing and branding strategies.

“We do not understand the movement, and until we do, we are not going to defeat it,” he said, according to the confidential minutes of a conference call he held with the experts. “We have not defeated the idea. We do not even understand the idea.” (emphasis added)

An honest member of the any administration in Washington is so unusual that I wanted to draw your attention to Maj. General Michael K. Nagata.

His problem, as you will quickly recognize, is one of a diversity of semantics. What is heard one way by a Western audience is heard completely differently by an audience with a different tradition.

The general may not think of it as “progress,” but getting Washington policy makers to acknowledge that there is a legitimate semantic gap between Western policy makers and IS is a huge first step. It can’t be grudging or half-hearted. Western policy makers have to acknowledge that there are honest views of the world that are different from their own. IS isn’t practicing dishonest, deception, perversely refusing to acknowledge the truth of Western statements, etc. Members of IS have an honest but different semantic view of the world.

If the good general can get policy makers to take that step, then and only then can the discussion of what that “other” semantic is and how to map it into terms comprehensible to Western policy makers can begin. If that step isn’t taken, then the resources necessary to explore and map that “other” semantic are never going to be allocated. And even if allocated, the results will never figure into policy making with regard to IS.

Failing on any of those three points: failing to concede the legitimacy of the IS semantic, failing to allocate resources to explore and understand the IS semantic, failing to incorporate an understanding of the IS semantic into policy making, is going to result in a failure to “defeat” IS, if that remains a goal after understanding its semantic.

Need an example? Consider the Viet-Nam war, in which approximately 58,220 Americans died and millions of Vietnamese, Laotions and Cambodians died, not counting long term injuries among all of the aforementioned. In case you have not heard, the United States lost the Vietnam War.

The reasons for that loss are wide and varied but let me suggest two semantic differences that may have played a role in that defeat. First, the Vietnamese have a long term view of repelling foreign invaders. Consider that Vietnam was occupied by the Chinese from 111 BCE until 938 CE, a period of more than one thousand (1,000) years. American war planners had a war semantic of planning for the next presidential election, not a winning strategy for a foe with a semantic that was two hundred and fifty (250) times longer.

The other semantic difference (among many others) was the understanding of “democracy,” which is usually heralded by American policy makers as a grand prize resulting from American involvement. In Vietnam, however, the villages and hamlets already had what some would consider democracy for centuries. (Beyond Hanoi: Local Government in Vietnam) Different semantic for “democracy” to be sure but one that was left unexplored in the haste to import a U.S. semantic of the concept.

Fighting a war where you don’t understand the semantics in play for the “other” side is risky business.

General Nagata has taken the first step towards such an understanding by admitting that he and his advisors don’t understand the semantics of IS. The next step should be to find someone who does. May I suggest talking to members of IS under informal meeting arrangements? Such that diplomatic protocols and news reporting doesn’t interfere with honest conversations? I suspect IS members are as ignorant of U.S. semantics as U.S. planners are of IS semantics so there would be some benefit for all concerned.

Such meetings would yield more accurate understandings than U.S. born analysts who live in upper middle-class Western enclaves and attempt to project themselves into foreign cultures. The understanding derived from such meetings could well contradict current U.S. policy assessments and objectives. Whether any administration has the political will to act upon assessments that aren’t the product of a shared post-Enlightenment semantic remains to be seen. But such a assessments must be obtained first to answer that question.

Would topic maps help in such an endeavor? Perhaps, perhaps not. The most critical aspect of such a project would be conceding for all purposes, the legitimacy of the “other” semantic, where “other” depends on what side you are on. That is a topic map “state of mind” as it were, where all semantics are treated equally and not any one as more legitimate than any other.

PS: A litmus test for Major General Michael K. Nagata to use in assembling a team to attempt to understand IS semantics: Have each applicant write their description of the 9/11 hijackers in thirty (30) words or less. Any applicant who uses any variant of coward, extremist, terrorist, fanatic, etc. should be wished well and sent on their way. Not a judgement on their fitness for other tasks but they are not going to be able to bridge the semantic gap between current U.S. thinking and that of IS.

The CIA has a report on some of the gaps but I don’t know if it will be easier for General Nagata to ask the CIA for a copy or to just find a copy on the Internet. It illustrates, for example, why the American strategy of killing IS leadership is non-productive if not counter-productive.

If you have the means, please forward this post to General Nagata’s attention. I wasn’t able to easily find a direct means of contacting him.

### Yet More “Hive” Confusion

Wednesday, December 10th, 2014

From the post:

A few months ago we told you about a new tool from The New York Times that allowed readers to help identify ads inside the paper’s massive archive. Madison, as it was called, was the first iteration on a new crowdsourcing tool from The New York Times R&D Lab that would make it easier to break down specific tasks and get users to help an organization get at the data they need.

Today the R&D Lab is opening up the platform that powers the whole thing. Hive is an open-source framework that lets anyone build their own crowdsourcing project. The code responsible for Hive is now available on GitHub. With Hive, a developer can create assignments for users, define what they need to do, and keep track of their progress in helping to solve problems.

Not all that long ago, I penned: Avoiding “Hive” Confusion, which pointed out the possible confusion between Apache Hive and High-performance Integrated Virtual Environment (HIVE), in mid to late October, 2014. Now, barely two months later we have another “Hive” in the information technology field.

I have no idea how many “hives” there are inside or outside of IT but as of today, I can name at least three (3).

Have you ever thought that semantic confusion is part and parcel of the human condition? Can be allowed for, can be compensated for, but can never be eliminated.

### CIDOC Conceptual Reference Model

Saturday, February 22nd, 2014

From the “Definition of the CIDOC Conceptual Reference Model:”

This document is the formal definition of the CIDOC Conceptual Reference Model (“CRM”), a formal ontology intended to facilitate the integration, mediation and interchange of heterogeneous cultural heritage information. The CRM is the culmination of more than a decade of standards development work by the International Committee for Documentation (CIDOC) of the International Council of Museums (ICOM). Work on the CRM itself began in 1996 under the auspices of the ICOM-CIDOC Documentation Standards Working Group. Since 2000, development of the CRM has been officially delegated by ICOM-CIDOC to the CIDOC CRM Special Interest Group, which collaborates with the ISO working group ISO/TC46/SC4/WG9 to bring the CRM to the form and status of an International Standard.

Objectives of the CIDOC CRM

The primary role of the CRM is to enable information exchange and integration between heterogeneous sources of cultural heritage information. It aims at providing the semantic definitions and clarifications needed to transform disparate, localised information sources into a coherent global resource, be it with in a larger institution, in intranets or on the Internet. Its perspective is supra-institutional and abstracted from any specific local context. This goal determines the constructs and level of detail of the CRM.

More specifically, it defines and is restricted to the underlying semantics of database schemata and document structures used in cultural heritage and museum documentation in terms of a formal ontology. It does not define any of the terminology appearing typically as data in the respective data structures; however it foresees the characteristic relationships for its use. It does not aim at proposing what cultural institutions should document. Rather it explains the logic of what they actually currently document, and thereby enables semantic interoperability.

It intends to provide a model of the intellectual structure of cultural documentation in logical terms. As such, it is not optimised for implementation-specific storage and processing aspects. Implementations may lead to solutions where elements and links between relevant elements of our conceptualizations are no longer explicit in a database or other structured storage system. For instance the birth event that connects elements such as father, mother, birth date, birth place may not appear in the database, in order to save storage space or response time of the system. The CRM allows us to explain how such apparently disparate entities are intellectually interconnected, and how the ability of the database to answer certain intellectual questions is affected by the omission of such elements and links.

The CRM aims to support the following specific functionalities:

• Inform developers of information systems as a guide to good practice in conceptual modelling, in order to effectively structure and relate information assets of cultural documentation.
• Serve as a common language for domain experts and IT developers to formulate requirements and to agree on system functionalities with respect to the correct handling of cultural contents.
• To serve as a formal language for the identification of common information contents in different data formats; in particular to support the implementation of automatic data transformation algorithms from local to global data structures without loss of meaning. The latter being useful for data exchange, data migration from legacy systems, data information integration and mediation of heterogeneous sources.
• To support associative queries against integrated resources by providing a global model of the basic classes and their associations to formulate such queries.
• It is further believed, that advanced natural language algorithms and case-specific heuristics can take significant advantage of the CRM to resolve free text information into a formal logical form, if that is regarded beneficial. The CRM is however not thought to be a means to replace scholarly text, rich in meaning, by logical forms, but only a means to identify related data.

(emphasis in original)

Apologies for the long quote but this covers a number of important topic map issues.

For example:

For instance the birth event that connects elements such as father, mother, birth date, birth place may not appear in the database, in order to save storage space or response time of the system. The CRM allows us to explain how such apparently disparate entities are intellectually interconnected, and how the ability of the database to answer certain intellectual questions is affected by the omission of such elements and links.

In topic map terms I would say that the database omits a topic to represent “birth event” and therefore there is no role player for an association with the various role players. What subjects will have representatives in a topic map is always a concern for topic map authors.

Helpfully, CIDOC explicitly separates the semantics it documents from data structures.

Because the CRM’s primary role is the meaningful integration of information in an Open World, it aims to be monotonic in the sense of Domain Theory. That is, the existing CRM constructs and the deductions made from them must always remain valid and well-formed, even as new constructs are added by extensions to the CRM.

Which restricts integration using CRM to systems where CRM is the primary basis for integration, as opposed to be one way to integrate several data sets.

That may not seem important in “web time,” where 3 months equals 1 Internet year. But when you think of integrating data and integration practices as they evolve over decades if not centuries, the limitations of monotonic choices come to the fore.

To take one practical discussion under way, how to handle warning about radioactive waste, which must endure anywhere from 10,000 to 1,000,000 years? A far simpler task than preserving semantics over centuries.

If you think that is easy, remember that lots of people saw the pyramids of Egypt being built. But it was such common knowledge, that no one thought to write it down.

Preservation of semantics is a daunting task.

CIDOC merits a slow read by anyone interested in modeling, semantics, vocabularies, and preservation.

PS: CIDOC: Conceptual Reference Model as a Word file.

### Detecting Semantic Overlap and Discovering Precedents…

Monday, July 8th, 2013

Detecting Semantic Overlap and Discovering Precedents in the Biodiversity Research Literature by Graeme Hirst, Nadia Talenty, and Sara Scharfz.

Abstract:

Scientiﬁc literature on biodiversity is longevous, but even when legacy publications are available online, researchers often fail to search it adequately or effectively for prior publications; consequently, new research may replicate, or fail to adequately take into account, previously published research. The mechanisms of the Semantic Web and methods developed in contemporary research in natural language processing could be used, in the near-term future, as the basis for a precedent-ﬁnding system that would take the text of an author’s early draft (or a submitted manuscript) and ﬁnd potentially related ideas in published work. Methods would include text-similarity metrics that take different terminologies, synonymy, paraphrase, discourse relations, and structure of argumentation into account.

Footnote one (1) of the paper gives an idea of the problem the authors face:

Natural history scientists work in fragmented, highly distributed and parochial communities, each with domain speciﬁc requirements and methodologies [Scoble 2008]. Their output is heterogeneous, high volume and typically of low impact, but with a citation half-life that may run into centuries” (Smith et al. 2009). “The cited half-life of publications in taxonomy is longer than in any other scientiﬁc discipline, and the decay rate is longer than in any scientiﬁc discipline” (Moritz 2005). Unfortunately, we have been unable to identify the study that is the basis for Moritz’s remark.

The paper explores in detail issues that have daunted various search techniques, when the material is available in electronic format at all.

The authors make a general proposal for addressing these issues, with mention of the Semantic Web but omit from their plan:

The other omission is semantic interpretation into a logical form, represented in XML, that draws on ontologies in the style of the original Berners-Lee, Hendler, and Lassila (2001) proposal for the Semantic Web. The problem with logical-form representation is that it implies a degree of precision in meaning that is not appropriate for the kind of matching we are proposing here. This is not to say that logical forms would be useless. On the contrary, they are employed by some approaches to paraphrase and textual entailment (section 4.1 above) and hence might appear in the system if only for that reason; but even so, they would form only one component of a broader and somewhat looser kind of semantic representation.

That’s the problem with the Semantic Web in a nutshell:

The problem with logical-form representation is that it implies a degree of precision in meaning that is not appropriate for the kind of matching we are proposing here.

What if I want to be logically precise sometimes but not others?

What if I want to be more precise in some places and less precise in others?

What if I want to have different degrees or types of imprecision?

With topic maps the question is: How im/precise do you want to be?

### “tweet” enters the Oxford English Dictionary

Monday, June 17th, 2013

A heads up for the June 2013 OED release

From the post:

The shorter a word is, generally speaking, the more complex it is lexicographically. Short words are likely to be of Germanic origin, and so to derive from the earliest bedrock of English words; they have probably survived long enough in the language to spawn many new sub-senses; they are almost certain to have generated many fixed compounds and phrases often taking the word into undreamt-of semantic areas; and last but not least they have typically formed the basis of secondary derivative words which in turn develop a life of their own.

All of these conditions apply to the three central words in the current batch of revised entries: hand, head, and heart. Each one of these dates in English from the earliest times and forms part of a bridge back to the Germanic inheritance of English. The revised and updated range contains 2,875 defined items, supported by 22,116 illustrative quotations.

(…)

The noun and verb tweet (in the social-networking sense) has just been added to the OED. This breaks at least one OED rule, namely that a new word needs to be current for ten years before consideration for inclusion. But it seems to be catching on.

Dictionaries, particularly ones like the OED, should be all the evidence needed to prove semantic diversity is everywhere.

But I don’t think anyone really contests that point.

Disagreement arises when others refuse to abandon their incorrect understanding of terms and to adhere to the meanings intended by a speaker.

A speaker understands themselves perfectly and so expects their audience to put for the effort to do the same.

No surprise that we have so many silos, since we have personal, family, group and enterprise silos.

What is surprising is that we communicate as well as we do, despite the many layers of silos.

### Semantic Diversity – Special Characters

Monday, June 17th, 2013

neo4j/cypher/Lucene: Dealing with special characters by Mark Needham.

Mark outlines how to handle “special characters” in Lucene (indexer for Neo4j), only to find that an escape character for a Lucene query is also a special character for Cypher, which itself must be escaped.

There is a chart in Mastering Regular Expressions by Jeffrey E F Friedl of “special” characters but that doesn’t cover all the internal parsing choices software.

Over the last sixty plus years there has been little progress towards a common set of “special” characters in computer science.

Handling of “special” characters lies at the heart of accessing data and all programs have code to account for them.

With no common agreement on “special” characters, what reason would you offer to expect convergence elsewhere?

### NSA…Verizon…Obama… Connecting the Dots. Or not.

Friday, June 7th, 2013

Why Verizon?

The first question that came to mind when the Guardian broke the NSA-Verizon news.

Here’s why I ask:

Verizon over 2011-2012 had only 34% of the cell phone market.

Unless terrorists prefer Verizon for ideological reasons, why Verizon?

Choosing only Verizon means the NSA is missing 66% of potential terrorist cell traffic.

That sounds like a bad plan.

What other reason could there be for picking Verizon?

Consider some other known players:

President Barack Obama, candidate for President of the United States, 2012.

“Bundlers” who gathered donations for Barack Obama:

 Min Max Name City State Employer $200,000$500,000 Hill, David Silver Spring MD Verizon Communications $200,000$500,000 Brown, Kathryn Oakton VA Verizon Communications $50,000$100,000 Milch, Randal Bethesda MD Verizon Communications

BTW, the Max category means more money may have been given, but that is the top reporting category.

I have informally “identified” the bundlers as follows:

• Kathryn C. Brown

Kathryn C. Brown is senior vice president – Public Policy Development and Corporate Responsibility. She has been with the company since June 2002. She is responsible for policy development and issues management, public policy messaging, strategic alliances and public affairs programs, including Verizon Reads.

Ms. Brown is also responsible for federal, state and international public policy development and international government relations for Verizon. In that role she develops public policy positions and is responsible for project management on emerging domestic and international issues. She also manages relations with think tanks as well as consumer, industry and trade groups important to the public policy process.

• David A. Hill, Bloomberg Business Week reports: David A. Hill serves as Director of Verizon Maryland Inc.

LinkedIn profile reports David A. Hill worked for Verizon, VP & General Counsel (2000 – 2006), Associate General Counsel (March 2006 – 2009), Vice President & Associate General Counsel (March 2009 – September 2011) “Served as a liaison between Verizon and the Obama Administration”

• Randal S. Milch Executive Vice President – Public Policy and General Counsel

What is Verizon making for each data delivery? Is this cash for cash given?

If someone gave your more than \$1 million (how much more is unknown), would you talk to them about such a court order?

If you read the “secret” court order, you will notice it was signed on April 23, 2013.

There isn’t a Kathryn C. Brown in Oakton in the White House visitor’s log, but I did find this record, where a “Kathryn C. Brown” made an appointment at the Whitehouse and was seen two (2) days later on the 17th of January 2013.

BROWN,KATHRYN,C,U69535,,VA,,,,,1/15/13 0:00,1/17/13 9:30,1/17/13 23:59,,176,CM,WIN,1/15/13 11:27,CM,,POTUS/FLOTUS,WH,State Floo,MCNAMARALAWDER,CLAUDIA,,,04/26/2013

I don’t have all the dots connected because I am lacking some unknown # of the players, internal Verizon communications, Verizon accounting records showing government payments, but it is enough to make you wonder about the purpose of the “secret” court order.

Was it a serious attempt at gathering data for national security reasons?

Or was it gathering data as a pretext for payments to Verizon or other contractors?

My vote goes for “pretext for payments.”

I say that because using data from different sources has always been hard.

In fact, about 60 to 80% of the time of a data analyst is spent “cleaning up data” for further processing.

The phrase “cleaning up data” is the colloquial form of “semantic impedance.”

Semantic impedance happens when the same people are known by different names in different data sets or different people are known by the same names in the same or different data sets.

Remember Kathryn Brown, of Oakton, VA? One of the Obama bundlers. Let’s use her as an example of “semantic impedance.”

The FEC has a record for Kathryn Brown of Oakton, VA.

But a search engine found:

Kathryn C. Brown

Same person? Or different?

I found another Kathryn Brown at Facebook:

And an image of Facebook Kathryn Brown:

And a photo from a vacation she took:

Not to mention the Kathryn Brown that I found at Twitter.

That’s only four (4) data sources and I have at least four (4) different Kathryn Browns.

Across the United States, a quick search shows 227,000 Kathryn Browns.

Remember that is just a personal name. What about different forms of addresses? Or names of employers? Or job descriptions? Or simple errors, like the 20% error rate in credit report records.

Take all the phones, plus names, addresses, employers, job descriptions, errors + other data and multiply that times 311.6 million Americans.

Can that problem be solved with petabytes of data and teraflops of processing?

Not a chance.

Remember that my identification of Kathryn “bundler” Brown with the Kathryn C. Brown of Verison was a human judgement, not an automatic rule. Nor would a computer think to check the White House visitor logs to see if another, possibly the same Kathryn C. Brown visited the White House before the secret order was signed.

Human judgement is required because all the data that the NSA has been collecting is “dirty” data, from one perspective or other. Either is is truly “dirty” in the sense of having errors or it is “dirty” in the sense it doesn’t play well with other data.

The Orwellian fearists can stop huffing and puffing about the coming eclipse of civil liberties. Those passed from view a short time after 9/11 with the passage of the Patriot Act.

That wasn’t the fault of ineffectual NSA data collection. American voters bear responsibility for the loss of civil liberties not voting leadership into office that would repeal the Patriot Act.

Ineffectual NSA data collection impedes the development of techniques that for a sanely scoped data collection effort could make a difference.

A sane scope for preventing terrorist attacks could be starting with a set of known or suspected terrorist phone numbers. Using all phone data (not just from Obama contributors), only numbers contacting or being contacted by those numbers would be subject to further analysis.

Using that much smaller set of phone numbers as identifiers, we could then collect other data, such as names and addresses associated with that smaller set of phone numbers. That doesn’t make the data any cleaner but it does give us a starting point for mapping “dirty” data sets into our starter set.

The next step would be create mappings from other data sets. If we say why we have created a mapping, others can evaluate the accuracy of our mappings.

Those tasks would require computer assistance, but they ultimately would be matters of human judgement.

Examples of such judgements exist, say for example in Palantir product line. If you watch Palantir Gotham being used to model biological relationships, take note of the results that were tagged by another analyst. And how the presenter tags additional material that becomes available to other researchers.

Computer assisted? Yes. Computer driven? No.

To be fair, human judgement is also involved in ineffectual NSA data collection efforts.

But it is human judgement that rewards sycophants and supporters, not serving the public interest.

### re3data.org

Thursday, May 30th, 2013

re3data.org

From the post:

An increasing number of universities and research organisations are starting to build research data repositories to allow permanent access in a trustworthy environment to data sets resulting from research at their institutions. Due to varying disciplinary requirements, the landscape of research data repositories is very heterogeneous. This makes it difficult for researchers, funding bodies, publishers, and scholarly institutions to select an appropriate repository for storage of research data or to search for data.

The re3data.org registry allows the easy identification of appropriate research data repositories, both for data producers and users. The registry covers research data repositories from all academic disciplines. Information icons display the principal attributes of a repository, allowing users to identify the functionalities and qualities of a data repository. These attributes can be used for multi-faceted searches, for instance to find a repository for geoscience data using a Creative Commons licence.

By April 2013, 338 research data repositories were indexed in re3data.org. 171 of these are described by a comprehensive vocabulary, which was developed by involving the data repository community (http://doi.org/kv3).

The re3data.org search at can be found at: http://www.re3data.org
The information icons are explained at: http://www.re3data.org/faq

Does this sound like any of these?:

DataOne

The Dataverse Network Project

IOGDS: International Open Government Dataset Search

PivotPaths: a Fluid Exploration of Interlinked Information Collections

Quandl [> 2 million financial/economic datasets]

Just to name five (5) that came to mind right off hand?

Addressing the heterogeneous nature of data repositories by creating another, semantically different data repository, seems like a non-solution to me.

What would be useful would be to create a mapping of this “new” classification, which I assume works for some group of users, against the existing classifications.

That would allow users of the “new” classification to access data in existing repositories, without having to learn their classification systems.

The heterogeneous nature of information is never vanquished but we can incorporate it into our systems.

### From data to analysis:… [Data Integration For a Purpose]

Friday, May 24th, 2013

From data to analysis: linking NWChem and Avogadro with the syntax and semantics of Chemical Markup Language by Wibe A de Jong, Andrew M Walker and Marcus D Hanwell. (Journal of Cheminformatics 2013, 5:25 doi:10.1186/1758-2946-5-25)

Abstract:

Background

Multidisciplinary integrated research requires the ability to couple the diverse sets of data obtained from a range of complex experiments and computer simulations. Integrating data requires semantically rich information. In this paper an end-to-end use of semantically rich data in computational chemistry is demonstrated utilizing the Chemical Markup Language (CML) framework. Semantically rich data is generated by the NWChem computational chemistry software with the FoX library and utilized by the Avogadro molecular editor for analysis and visualization.

Results

The NWChem computational chemistry software has been modified and coupled to the FoX library to write CML compliant XML data files. The FoX library was expanded to represent the lexical input files and molecular orbitals used by the computational chemistry software. Draft dictionary entries and a format for molecular orbitals within CML CompChem were developed. The Avogadro application was extended to read in CML data, and display molecular geometry and electronic structure in the GUI allowing for an end-to-end solution where Avogadro can create input structures, generate input files, NWChem can run the calculation and Avogadro can then read in and analyse the CML output produced. The developments outlined in this paper will be made available in future releases of NWChem, FoX, and Avogadro.

Conclusions

The production of CML compliant XML files for computational chemistry software such as NWChem can be accomplished relatively easily using the FoX library. The CML data can be read in by a newly developed reader in Avogadro and analysed or visualized in various ways. A community-based effort is needed to further develop the CML CompChem convention and dictionary. This will enable the long-term goal of allowing a researcher to run simple “Google-style” searches of chemistry and physics and have the results of computational calculations returned in a comprehensible form alongside articles from the published literature.

Aside from its obvious importance for cheminformatics, I think there is another lesson in this article.

Integration of data required “…semantically rich information…, but just as importantly, integration was not a goal in and of itself.

Integration was only part of a workflow that had other goals.

No doubt some topic maps are useful as end products of integrated data, but what of cases where integration is part of a workflow?

Think of the non-reusable data integration mappings that are offered by many enterprise integration packages.

### FuzzyLaw [FuzzyDBA, FuzzyRDF, FuzzySW?]

Monday, May 20th, 2013

FuzzyLaw

From the webpage:

(…)

FuzzyLaw has gathered explanations of legal terms from members of the public in order to get a sense of what the ‘person on the street’ has in mind when they think of a legal term. By making lay-people’s explanations of legal terms available to interpreters, police and other legal professionals, we hope to stimulate debate and learning about word meaning, public understanding of law and the nature of explanation.

The explanations gathered in FuzzyLaw are unusual in that they are provided by members of the public. These people, all aged over 18, regard themselves as ‘native speakers’, ‘first language speakers’ and ‘mother tongue’ speakers of English and have lived in England and/or Wales for 10 years or more. We might therefore expect that they will understand English legal terminology as well as any member of the public might. No one who has contributed has ever worked in the criminal law system or as an interpreter or translator. They therefore bring no special expertise to the task of explanation, beyond whatever their daily life has provided.

We have gathered explanations for 37 words in total. You can see a sample of these explanations on FuzzyLaw. The sample of explanations is regularly updated. You can also read responses to the terms and the explanations from mainly interpreters, police officers and academics. You are warmly invited to add your own responses and join in the discussion of each and every word. Check back regularly to see how discussions develop and consider bookmarking the site for future visits. The site also contains commentaries on interesting phenomena which have emerged through the site. You can respond to the commentaries too on that page, contributing to the developing research project.

(…)

Have you ever wondered that the ‘person on the street’ thinks about relational databases, RDF or the Semantic Web?

Those are the folks who are being pushed content based on interpretations not their own making.

Here’s a work experiment for you:

1. Take ten search terms from your local query log.
2. At each department staff meeting, distribute sheets with the words, requesting everyone to define the terms in their own words. No wrong answers.
3. Tally up the definitions per department and across the company.

I first saw this at: FuzzyLaw: Collection of lay citizens’ understandings of legal terminology.

### FuturICT:… [No Semantics?]

Monday, May 6th, 2013

FuturICT: Participatory Computing for Our Complex World

FuturICT is a visionary project that will deliver new science and technology to explore, understand and manage our connected world. This will inspire new information and communication technologies (ICT) that are socially adaptive and socially interactive, supporting collective awareness.

Revealing the hidden laws and processes underlying our complex, global, socially interactive systems constitutes one of the most pressing scientific challenges of the 21st Century. Integrating complexity science with ICT and the social sciences, will allow us to design novel robust, trustworthy and adaptive technologies based on socially inspired paradigms. Data from a variety of sources will help us to develop models of techno-socioeconomic systems. In turn, insights from these models will inspire a new generation of socially adaptive, self-organised ICT systems. This will create a paradigm shift and facilitate a symbiotic co-evolution of ICT and society. In response to the European Commission’s call for a ‘Big Science’ project, FuturICT will build a largescale, pan European, integrated programme of research which will extend for 10 years and beyond.

Did you know that the term “semantic” appears only twice in the FuturICT Project Outline? And both times as in the “semantic web?”

Not a word of how models, data sources, paradigms, etc., with different semantics are going to be wedded into a coherent whole.

View it as an opportunity to deliver FuturlCT results using topic maps beyond this project.

### Have you used Lua for MapReduce?

Wednesday, May 1st, 2013

Have you used Lua for MapReduce?

From the post:

Lua as a cross platform programming language has been popularly used in games and embedded systems. However, due to its excellent use for configuration, it has found wider acceptance in other user cases as well.

Lua was inspired from SOL (Simple Object Language) and DEL(Data-Entry Language) and created by Roberto Ierusalimschy, Waldemar Celes, and Luiz Henrique de Figueiredo at the Pontifical Catholic University of Rio de Janeiro, Brazil. Roughly translated to ‘Moon’ in Portuguese, it has found many big takers like Adobe, Nginx, Wikipedia.

Another scripting language to use with MapReduce and Hadoop.

Have you ever noticed the Tower of Babel seems to follow human activity around?

First, it was building a tower to heaven – confuse the workforce.

Then it was other community efforts.

And many, many thens, later, it has arrived at MapReduce/Hadoop configuration languages.

Like a kaleidoscope, it just gets richer the more semantic diversity we add.

Do you wonder what the opposite of semantic diversity must look like?

Or if we are the cause, what would it mean to eliminate semantic diversity?

### LevelGraph [Graph Databases and Semantic Diversity]

Sunday, April 28th, 2013

LevelGraph

From the webpage:

LevelGraph is a Graph Database. Unlike many other graph database, LevelGraph is built on the uber-fast key-value store LevelDB through the powerful LevelUp library. You can use it inside your node.js application.

LevelGraph loosely follows the Hexastore approach as presente in the article: Hexastore: sextuple indexing for semantic web data management C Weiss, P Karras, A Bernstein – Proceedings of the VLDB Endowment, 2008. Following this approach, LevelGraph uses six indices for every triple, in order to access them as fast as it is possible.

The family of graph databases gains another member.

The growth of graph database offerings is evidence the effort to reduce semantic diversity is a fool’s errand.

It isn’t hard to find graph database projects, yet new ones appear on a regular basis.

With every project starting over with the basic issues of graph representation and algorithms.

The reasons for that diversity are likely as diverse as the diversity itself.

If the world has been diverse, remains diverse and evidence is it will continue to be diverse, what are the odds in fighting diversity?

That’s what I thought.

Topic maps, embracing diversity.

I first saw this in a tweet by Frank Denis.

### Collaborative annotation… [Human + Machine != Semantic Monotony]

Sunday, April 21st, 2013

Collaborative annotation for scientific data discovery and reuse by Kirk Borne. (Borne, K. (2013), Collaborative annotation for scientific data discovery and reuse. Bul. Am. Soc. Info. Sci. Tech., 39: 44–45. doi: 10.1002/bult.2013.1720390414)

Abstract:

Human classification alone, unable to handle the enormous quantity of project data, requires the support of automated machine-based strategies. In collaborative annotation, humans and machines work together, merging editorial strengths in semantics and pattern recognition with the machine strengths of scale and algorithmic power. Discovery informatics can be used to generate common data models, taxonomies and ontologies. A proposed project of massive scale, the Large Synoptic Survey Telescope (LSST) project, will systematically observe the southern sky over 10 years, collecting petabytes of data for analysis. The combined work of professional and citizen scientists will be needed to tag the discovered astronomical objects. The tag set will be generated through informatics and the collaborative annotation efforts of humans and machines. The LSST project will demonstrate the development and application of a classification scheme that supports search, curation and reuse of a digital repository.

A persuasive call to arms to develop “collaborative annotation:”

Humans and machines working together to produce the best possible classification label(s) is collaborative annotation. Collaborative annotation is a form of human computation [1]. Humans can see patterns and semantics (context, content and relationships) more quickly, accurately and meaningfully than machines. Human computation therefore applies to the problem of annotating, labeling and classifying voluminous data streams.

And more specifically for the Large Synoptic Survey Telescope (LSST):

The discovery potential of this data collection would be enormous, and its long-term value (through careful data management and curation) would thus require (for maximum scientific return) the participation of scientists and citizen scientists as well as science educators and their students in a collaborative knowledge mark-up (annotation and tagging) data environment. To meet this need, we envision a collaborative tagging system called AstroDAS (Astronomy Distributed Annotation System). AstroDAS is similar to existing science knowledge bases, such as BioDAS (Biology Distributed Annotation System, www.biodas.org).

As you might expect, semantic diversity is going to be present with “collaborative annotation.”

Semantic Monotony (aka Semantic Web) has failed for machines alone.

No question it will fail for humans + machines.

Are you ready to step up to the semantic diversity of collaborative annotation (humans + machines)?

### Semantic Search Over The Web (SSW 2013)

Monday, March 18th, 2013

3RD International Workshop onSemantic Search Over The Web (SSW 2013)

Dates:

Abstract Papers submission: May 31, 2013 – 15:00 (3:00 pm) EDT
(Short) Full Paper submission: June 7, 2013 – 15:00 (3:00 pm) EDT
Author notification: July 19, 2013
Camera-ready copy due: August 2, 2013
Workshop date: During VLDB (Aug 26 – Aug 30)

From the webpage:

We are witnessing a smooth evolution of the Web from a worldwide information space of linked documents to a global knowledge base, composed of semantically interconnected resources. To date, the correlated and semantically annotated data available on the web amounts to 25 billion RDF triples, interlinked by around 395 million RDF links. The continuous publishing and the integration of the plethora of semantic datasets from companies, government and public sector projects is leading to the creation of the so-called Web of Knowledge. Each semantic dataset contributes to extend the global knowledge and increases its reasoning capabilities. As a matter of facts, researchers are now looking with growing interest to semantic issues in this huge amount of correlated data available on the Web. Many progresses have been made in the field of semantic technologies, from formal models to repositories and reasoning engines. While the focus of many practitioners is on exploiting such semantic information to contribute to IR problems from a document centric point of view, we believe that such a vast, and constantly growing, amount of semantic data raises data management issues that must be faced in a dynamic, highly distributed and heterogeneous environment such as the Web.

The third edition of the International Workshop on Semantic Search over the Web (SSW) will discuss about data management issues related to the search over the web and the relationships with semantic web technologies, proposing new models, languages and applications.

The research issues can be summarized by the following problems:

• How can we model and efficiently access large amounts of semantic web data?
• How can we effectively retrieve information exploiting semantic web technologies?
• How can we employ semantic search in real world scenarios?

The SSW Workshop invites researchers, engineers, service developers to present their research and works in the field of data management for semantic search. Papers may deal with methods, models, case studies, practical experiences and technologies.

Apologies for the uncertainty of the workshop date. (There is confusion about the date on the workshop site, one place says the 26th, the other the 30th. Check before you make reservation/travel arrangements.)

I differ with the organizers on some issues but on the presence of: “…data management issues that must be faced in a dynamic, highly distributed and heterogeneous environment such as the Web,” there is no disagreement.

That’s the trick isn’t it? In any confined or small group setting, just about any consistent semantic solution will work.

The hurly-burly of a constant stream of half-heard, partially understood communications across distributed and heterogeneous systems tests the true mettle of semantic solutions.

Not a quest for perfect communication but “good enough.”

### MetaNetX.org…

Saturday, March 16th, 2013

MetaNetX.org: a website and repository for accessing, analysing and manipulating metabolic networks by Mathias Ganter, Thomas Bernard, Sébastien Moretti, Joerg Stelling and Marco Pagni. (Bioinformatics (2013) 29 (6): 815-816. doi: 10.1093/bioinformatics/btt036)

Abstract:

MetaNetX.org is a website for accessing, analysing and manipulating genome-scale metabolic networks (GSMs) as well as biochemical pathways. It consistently integrates data from various public resources and makes the data accessible in a standardized format using a common namespace. Currently, it provides access to hundreds of GSMs and pathways that can be interactively compared (two or more), analysed (e.g. detection of dead-end metabolites and reactions, flux balance analysis or simulation of reaction and gene knockouts), manipulated and exported. Users can also upload their own metabolic models, choose to automatically map them into the common namespace and subsequently make use of the website’s functionality.

The authors are addressing a familiar problem:

Genome-scale metabolic networks (GSMs) consist of compartmentalized reactions that consistently combine biochemical, genetic and genomic information. When also considering a biomass reaction and both uptake and secretion reactions, GSMs are often used to study genotype–phenotype relationships, to direct new discoveries and to identify targets in metabolic engineering (Karr et al., 2012). However, a major difficulty in GSM comparisons and reconstructions is to integrate data from different resources with different nomenclatures and conventions for both metabolites and reactions. Hence, GSM consolidation and comparison may be impossible without detailed biological knowledge and programming skills. (emphasis added)

For which they propose an uncommon solution:

MetaNetX.org is implemented as a user-friendly and self-explanatory website that handles all user requests dynamically (Fig. 1a). It allows a user to access a collection of hundreds of published models, browse and select subsets for comparison and analysis, upload or modify new models and export models in conjunction with their results. Its functionality is based on a common namespace defined by MNXref (Bernard et al., 2012). In particular, all repository or user uploaded models are automatically translated with or without compartments into the common namespace; small deviations from the original model are possible due to the automatic reconciliation steps implemented by Bernard et al. (2012). However, a user can choose not to translate his model but still make use of the website’s functionalities. Furthermore, it is possible to augment the given reaction set by user-defined reactions, for example, for model augmentation.

The bioinformatics community recognizes the intellectual poverty of lock step models.

Wonder when the intelligence community is going to have that “a ha” moment?

Friday, February 22nd, 2013

Red Hat Unveils Big Data and Open Hybrid Cloud Direction

From the post:

Red Hat, Inc. (NYSE: RHT), the world’s leading provider of open source solutions, today announced its big data direction and solutions to satisfy enterprise requirements for highly reliable, scalable, and manageable solutions to effectively run their big data analytics workloads. In addition, Red Hat announced that the company will contribute its Red Hat Storage Hadoop plug-in to the ApacheTM Hadoop® open community to transform Red Hat Storage into a fully-supported, Hadoop-compatible file system for big data environments, and that Red Hat is building a robust network of ecosystem and enterprise integration partners to deliver comprehensive big data solutions to enterprise customers. This is another example of Red Hat’s strategic commitment to big data customers and its continuing efforts to provide them with enterprise solutions through community-driven innovation.

The more Hadoop grows, the more Hadoop silos will as well.

You will need Hadoop and semantic skills to wire Hadoop silos together.

Re-wire with topic maps to avoid re-wiring the same Hadoop silos over and over again.

I first saw this at Red Hat reveal big data plans, open sources HDFS replacement by Elliot Bentley.

### Hadoop silos need integration…

Thursday, February 21st, 2013

Hadoop silos need integration, manage all data as asset, say experts by Brian McKenna.

From the post:

Big data hype has caused infantile disorders in corporate organisations over the past year. Hadoop silos, an excess of experimentation, and an exaggeration of the importance of data scientists are among the teething problems of big data, according to experts, who suggest organisations should manage all data as an asset.

Steve Shelton, head of data services at consultancy Detica, part of BAE Systems, said Hadoop silos have become part of the enterprise IT landscape, both in the private and public sectors. “People focused on this new thing called big data and tried to isolate it [in 2011 and 2012],” he said.

The focus has been too concentrated on non-traditional data types, and that has been driven by the suppliers. The business value of data is more effectively understood when you look at it all together, big or otherwise, he said.

Have big data technologies been a distraction? “I think it has been an evolutionary learning step, but businesses are stepping back now. When it comes to information governance, you have to look at data across the patch,” said Shelton.

He said Detica had seen complaints about Hadoop silos, and these were created by people going through a proof-of-concept phase, setting up a Hadoop cluster quickly and building a team. But a Hadoop platform involves extra costs on top, in terms of managing it and integrating it into your existing business processes.

“It’s not been a waste of time and money, it is just a stage. And it is not an insurmountable challenge. The next step is to integrate those silos, but the thinking is immature relative to the technology itself,” said Shelton.

I take this as encouraging news for topic maps.

Semantically diverse data has been stores in semantically diverse datastores. Data, which if integrated, could provide business value.

Again.

There will always be a market for topic maps because people can’t stop creating semantically diverse data and data stores.

How’s that for long term market security?

No matter what data or data storage technology arises, semantic inconsistency will be with us always.

### Saving the “Semantic” Web (part 4)

Wednesday, February 13th, 2013

Democracy vs. Aristocracy

Part of a recent comment on this series reads:

What should we have been doing instead of the semantic web? ISO Topic Maps? There is some great work in there, but has it been a better success?

That is an important question and I wanted to capture it outside of comments on a prior post.

Earlier in this series of posts I pointed out the success of HTML, especially when contrasted with Semantic Web proposals.

Let me hasten to add the same observation is true for ISO Topic Maps (HyTime or later versions).

The critical difference between HTML (the early and quite serviceable versions) and Semantic Web/Topic Maps is that the former democratizes communication and the latter fosters a technical aristocracy.

Every user who can type and some who hunt-n-peck, can author HTML and publish their content for others around the world to read, discuss, etc.

That is a very powerful and democratizing notion about content creation.

The previous guardians, gate keepers, insiders, and their familiars, who didn’t add anything of value to prior publications processes, are still reeling from the blow.

Even as old aristocracies crumble, new ones evolve.

Technical aristocracies for example. A phrase relevant to both the Semantic Web and ISO Topic Maps.

Having tasted freedom, the crowds aren’t as accepting of the lash/leash as they once were. Nor of the aristocracies who would wield them. Nor should they be.

Which make me wonder: Why the emphasis on creating dumbed down semantics for computers?

We already have billions of people who are far more competent semantically than computers.

Where are our efforts to enable them to transverse the different semantics of other users?

Such as the semantics of the aristocrats who have self-anointed themselves to labor on their behalf?

If you have guessed that I have little patience with aristocracies, you are right in one.

I came by that aversion honestly.

I practiced law in a civilian jurisdiction for a decade. A specialist language, law, can be more precise, but it also excludes others from participation. The same experience was true when I studied theology and ANE languages. A bit later, in markup technologies (then SGML/HyTime), the same lesson was repeated. What I do with ODF and topic maps are two more specialized languages.

Yet a reasonably intelligent person can discuss issues in any of those fields, if they can get past the language barriers aristocrats take so much comfort in maintaining.

My answer to what we should be doing is:

Looking for ways to enable people to traverse and enjoy the semantic diversity that accounts for the richness of the human experience.

PS: Computers have a role to play in that quest, but a subordinate one.

### Content-Based Image Retrieval at the End of the Early Years

Tuesday, January 22nd, 2013

Content-Based Image Retrieval at the End of the Early Years by Arnold W.M. Smeulders, Marcel Worring, Simone Santini, Amarnath Gupta, and Ramesh Jain. (Smeulders, A.W.M.; Worring, M.; Santini, S.; Gupta, A.; Jain, R.; , “Content-based image retrieval at the end of the early years,” Pattern Analysis and Machine Intelligence, IEEE Transactions on , vol.22, no.12, pp.1349-1380, Dec 2000
doi: 10.1109/34.895972)

Abstract:

Presents a review of 200 references in content-based image retrieval. The paper starts with discussing the working conditions of content-based retrieval: patterns of use, types of pictures, the role of semantics, and the sensory gap. Subsequent sections discuss computational steps for image retrieval systems. Step one of the review is image processing for retrieval sorted by color, texture, and local geometry. Features for retrieval are discussed next, sorted by: accumulative and global features, salient points, object and shape features, signs, and structural combinations thereof. Similarity of pictures and objects in pictures is reviewed for each of the feature types, in close connection to the types and means of feedback the user of the systems is capable of giving by interaction. We briefly discuss aspects of system engineering: databases, system architecture, and evaluation. In the concluding section, we present our view on: the driving force of the field, the heritage from computer vision, the influence on computer vision, the role of similarity and of interaction, the need for databases, the problem of evaluation, and the role of the semantic gap.

Excellent survey article from 2000 (not 2002 as per the Ostermann paper).

I think you will appreciate the treatment of the “semantic gap,” both in terms of its description as well as ways to address it.

If you are using annotated images in your topic map application, definitely a must read.

### User evaluation of automatically generated keywords and toponyms… [of semantic gaps]

Tuesday, January 22nd, 2013

User evaluation of automatically generated keywords and toponyms for geo-referenced images by Frank O. Ostermann, Martin Tomko, Ross Purves. (Ostermann, F. O., Tomko, M. and Purves, R. (2013), User evaluation of automatically generated keywords and toponyms for geo-referenced images. J. Am. Soc. Inf. Sci.. doi: 10.1002/asi.22738)

Abstract:

This article presents the results of a user evaluation of automatically generated concept keywords and place names (toponyms) for geo-referenced images. Automatically annotating images is becoming indispensable for effective information retrieval, since the number of geo-referenced images available online is growing, yet many images are insufficiently tagged or captioned to be efficiently searchable by standard information retrieval procedures. The Tripod project developed original methods for automatically annotating geo-referenced images by generating representations of the likely visible footprint of a geo-referenced image, and using this footprint to query spatial databases and web resources. These queries return raw lists of potential keywords and toponyms, which are subsequently filtered and ranked. This article reports on user experiments designed to evaluate the quality of the generated annotations. The experiments combined quantitative and qualitative approaches: To retrieve a large number of responses, participants rated the annotations in standardized online questionnaires that showed an image and its corresponding keywords. In addition, several focus groups provided rich qualitative information in open discussions. The results of the evaluation show that currently the annotation method performs better on rural images than on urban ones. Further, for each image at least one suitable keyword could be generated. The integration of heterogeneous data sources resulted in some images having a high level of noise in the form of obviously wrong or spurious keywords. The article discusses the evaluation itself and methods to improve the automatic generation of annotations.

An echo of Steve Newcomb’s semantic impedance appears at:

Despite many advances since Smeulders et al.’s (2002) classic paper that set out challenges in content-based image retrieval, the quality of both nonspecialist text-based and content-based image retrieval still appears to lag behind the quality of specialist text retrieval, and the semantic gap, identiﬁed by Smeulders et al. as a fundamental issue in content-based image retrieval, remains to be bridged. Smeulders deﬁned the semantic gap as

the lack of coincidence between the information that one can extract from the visual data and the interpretation that the same data have for a user in a given situation. (p. 1353)

In fact, text-based systems that attempt to index images based on text thought to be relevant to an image, for example, by using image captions, tags, or text found near an image in a document, suffer from an identical problem. Since text is being used as a proxy by an individual in annotating image content, those querying a system may or may not have similar worldviews or conceptualizations as the annotator. (emphasis added)

That last sentence could have come out of a topic map book.

Curious what you make of the author’s claim that spatial locations provide an “external context” that bridges the “semantic gap?”

If we all use the same map of spatial locations, are you surprised by the lack of a “semantic gap?”

### The Twitter of Babel: Mapping World Languages through Microblogging Platforms

Friday, December 21st, 2012

The Twitter of Babel: Mapping World Languages through Microblogging Platforms by Delia Mocanu, Andrea Baronchelli, Bruno Gonçalves, Nicola Perra, Alessandro Vespignani.

Abstract:

Large scale analysis and statistics of socio-technical systems that just a few short years ago would have required the use of consistent economic and human resources can nowadays be conveniently performed by mining the enormous amount of digital data produced by human activities. Although a characterization of several aspects of our societies is emerging from the data revolution, a number of questions concerning the reliability and the biases inherent to the big data “proxies” of social life are still open. Here, we survey worldwide linguistic indicators and trends through the analysis of a large-scale dataset of microblogging posts. We show that available data allow for the study of language geography at scales ranging from country-level aggregation to specific city neighborhoods. The high resolution and coverage of the data allows us to investigate different indicators such as the linguistic homogeneity of different countries, the touristic seasonal patterns within countries and the geographical distribution of different languages in multilingual regions. This work highlights the potential of geolocalized studies of open data sources to improve current analysis and develop indicators for major social phenomena in specific communities.

So, rather on the surface homogeneous languages, users can use their own natural, heterogeneous languages, which we can analyze as such?

Cool!

Semantic and linguistic heterogeneity has persisted from the original Tower of Babel until now.

The smart money will be riding on managing semantic and linguistic heterogeneity.

Other money can fund emptying the semantic ocean with a tea cup.

### Most developers don’t really know any computer language

Saturday, November 17th, 2012

Most developers don’t really know any computer language by Derek Jones.

From the post:

What does it mean to know a language? I can count to ten in half a dozen human languages, say please and thank you, tell people I’m English and a few other phrases that will probably help me get by; I don’t think anybody would claim that I knew any of these languages.

It is my experience that most developers’ knowledge of the programming languages they use is essentially template based; they know how to write a basic instances of the various language constructs such as loops, if-statements, assignments, etc and how to define identifiers to have a small handful of properties, and they know a bit about how to glue these together.

[the topic map part]

discussions with developers: individuals and development groups invariabily have their own terminology for programming language constructs (my use of terminology appearing in the language definition usually draws blank stares and I have to make a stab at guessing what the local terms mean and using them if I want to be listened to); asking about identifier scoping or type compatibility rules (assuming that either of the terms ‘scope’ or ‘type compatibility’ is understood) usually results in a vague description of specific instances (invariably the commonly encountered situations),

What?! Semantic diversity in computer languages? Or at least as they are understood by programmers?

😉

I don’t see the problem with appreciating semantic diversity for the richness it offers.

There are use cases where semantic diversity interferes with some other requirement. Such as in accounting systems that depend upon normalized data for auditing purposes.

While there are other use cases, such as the history of ideas that depend upon preservation of the trail of semantic diversity. As part of the narrative of such histories.

And there are cases that fall in between, where the benefits of diverse points of view must be weighted against the cost of creating and maintaining a mapping between diverse viewpoints.

All of those use cases recognize that semantic diversity is the starting point. That is semantic diversity is always with us and the real question is the cost of its control for some particular use case.

I don’t view: “My software works if all users abandon semantic diversity.” as a use case. It is a confession of defective software.

I first saw this in a tweet from Computer Science Fact.

### Taming Big Data Is Not a Technology Issue [Knuth Exercise Rating]

Friday, November 16th, 2012

Taming Big Data Is Not a Technology Issue by Bill Franks.

From the post:

One thing that has struck me recently is that most of the focus when discussing big data is upon the technologies involved. The consensus seems to be that the biggest challenge with big data is a technological one, yet I don’t believe this to be the case. Sure, there are challenges today for organizations using big data, but, I would like to submit to you that technology is not the biggest problem. In fact, technology may be one of the easiest problems to solve when it comes time to tame big data.

The fact is that there are tools and technologies out there that can handle virtually all of the big data needs of the vast majority of organizations. As of today, you can find products and solutions that do whatever you need to do with big data. Technology itself is not the problem.

Then, what are the issues? The real problems are with resource availability, skills, process change, politics, and culture. While the technologies to solve your problems may be out there just waiting for you to implement them, it isn’t quite that easy, is it? You have to get budget, you have to do an implementation, you have to get your people up to speed on how to use the tools, you have to get buy in from various stakeholders, and you have to push against a culture averse to change.

The technology is right there, but you are unable to effectively put it to work. It FEELS like a technology issue since technology is front and center. However, it is really the cultural, people, and political issues surrounding the technology that are the problem. Let me illustrate with an example.

A refreshing view at the drive to build technology to “solve” the big data problem.

Once terabytes of data are accessible as soon as entering the data stream, for real time, reactive analysis, with n-dimensional graphic representations as a matter of course, the “big data” problem will still be the “big data” problem.

The often cited “volume, velocity, variety” characterization of “big data” are surface issues that in one manner or another, can be addressed using technology. Now.

A deeper, more persistent problem is that users expect their data, big or small, to have semantics. Whether express or implied. That problem, along with the others cited by Franks, has no technological solution.

Because semantics originate with us and not with our machines.

By all means, we need to solve the technology issues around “big data,” but that only gives us a start towards working on the more difficult problems, problems that original with us.

A much harder “programming” exercise. I suspect on Knuth’s scale of exercises, an 80 or 90.

### People and Process > Prescription and Technology

Monday, October 15th, 2012

Factors that affect software systems development project outcomes: A survey of research by Laurie McLeod and Stephen G. MacDonell. ACM Computing Surveys (CSUR) Surveys Volume 43 Issue 4, October 2011 Article No. 24, DOI: 10.1145/1978802.1978803.

Abstract:

Determining the factors that have an influence on software systems development and deployment project outcomes has been the focus of extensive and ongoing research for more than 30 years. We provide here a survey of the research literature that has addressed this topic in the period 1996–2006, with a particular focus on empirical analyses. On the basis of this survey we present a new classification framework that represents an abstracted and synthesized view of the types of factors that have been asserted as influencing project outcomes.

As with most survey work, particularly ones that summarize 177 papers, this is a long article, some fifty-six pages.

Let me try to tempt you into reading it by quoting from Angelica de Antonio’s review of it (in Computing Reviews, Oct. 2012):

An interesting discussion about the very concept of project outcome precedes the survey of factors, and an even more interesting discussion follows it. The authors stress the importance of institutional context in which the development project takes place (an aspect almost neglected in early research) and the increasing evidence that people and process have a greater effect on project outcomes than technology. A final reflection on what projects still continue to fail—even if we seem to know the factors that lead to success—raises a question on the utility of prescriptive factor-based research and leads to considerations that could inspire future research. (emphasis added)

Before you run off to the library or download a copy of the survey, two thoughts to keep in mind:

First, if “people and process” are more important than technology, where should we place the emphasis in projects involving semantics?

Second, if “prescription” can’t cure project failure, what are its chances with semantic diversity?

Thoughts?