Archive for the ‘Semantic Diversity’ Category

Appropriating IT: Glue Steps [Gluing Subject Representatives Together?]

Tuesday, October 9th, 2012

Appropriating IT: Glue Steps by Tony Hirst.

Over the years, I’ve been fortunate enough to have been gifted some very evocative, and powerful, ideas that immediately appealed to me when I first heard them and that I’ve been able to draw on, reuse and repurpose over and over again. One such example is “glue logic”, introduced to me by my original OU PhD supervisor George Kiss. The idea of glue logic is to provide a means by which two digital electronic circuits (two “logic” circuits) that don’t share a common interface can be “glued” together.

(diagrams and other material omitted)

This idea is powerful enough in its own right, but there was a second bit to it that made it really remarkable: the circuitry typically used to create the glue logic was a device known as a Field Programmable Gate Array, or FPGA. This is a type of digital circuit whose logical function can be configured, or programmed. That is, I can take my “shapeless” FPGA, and programme it so that it physically implements a particular digital circuit. Just think about that for a moment… You probably have a vague idea that the same computer can be reprogrammed to do particular things, using some vaguely mysterious and magical thing called software, instructions that computer processors follow in order to do incredible things. With an FPGA, the software actually changes the hardware: there is no processor that “runs a programme”; when you programme an FPGA, you change its hardware. FPGAs are, literally, programmable chips. (If you imagine digital circuits to be like bits of plastic, an FPGA is like polymorph.)

The notion of glue logic has stuck with me for two reasons, I think: firstly, because of what it made possible, the idea of flexibly creating an interface between two otherwise incompatible components; secondly, because of the way in which it could be achieved – using a flexible, repurposable, reprogrammable device – one that you could easily reprogramme if the mapping from one device to another wasn’t quite working properly.

If instead of “don’t share a common interface” you read “semantic diversity” and in place of Field Programmable Gate Array, or FPGA, you read “legend,” to “creat[e] an interface between two otherwise incompatible [subject representatives],” you would think Tony’s post was about the topic maps reference model.

Well, this post is and Tony’s is very close.

Particularly the part about being a “reprogrammable device.”

I can tell you: “black” = “schwarz,” but without more, you won’t be able to rely on or extend that statement.

For that, you need a “reprogrammable device” and some basis on which to do the reprogramming.

Legends anyone?

A Good Example of Semantic Inconsistency [C-Suite Appropriate]

Tuesday, October 9th, 2012

A Good Example of Semantic Inconsistency by David Loshin.

You can guide users through the intellectual minefield of Frege, Peirce, Russell, Carnap, Sowa and others to illustrate the need for topic maps, with stunning (as in daunting) graphics.

Or, you can use David’s story:

I was at an event a few weeks back talking about data governance, and a number of the attendees were from technology or software companies. I used the term “semantic inconsistency” and one of the attendees asked me to provide an example of what I meant.

Since we had been discussing customers, I thought about it for a second and then asked him what his definition was of a customer. He said that a customer was someone who had paid the company money for one of their products. I then asked if anyone in the audience was on the support team, and one person raised his hand. I asked him for a definition, and he said that a customer is someone to whom they provide support.

I then posed this scenario: the company issued a 30-day evaluation license to a prospect with full support privileges. Since the prospect had not paid any money for the product, according to the first definition that individual was not a customer. However, since that individual was provided full support privileges, according to the second definition that individual was a customer.

Within each silo, the associated definition is sound, but the underlying data sets are not compatible. An attempt to extract the two customer lists and merge them together into a single list will lead to inconsistent results. This may be even worse if separate agreements dictate how long a purchaser is granted full support privileges – this may lead to many inconsistencies across those two data sets.
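David's scenario can be made concrete in a few lines. This is a minimal sketch with made-up records; the field names `paid` and `support` and the three accounts are assumptions for illustration, not anything from his post.

```python
# Hypothetical records; the field names are assumptions for illustration.
accounts = [
    {"name": "Acme", "paid": True, "support": True},
    {"name": "Birch", "paid": False, "support": True},   # 30-day evaluation license
    {"name": "Cedar", "paid": True, "support": False},   # support agreement expired
]

def sales_customers(records):
    """Sales definition: someone who has paid for a product."""
    return {r["name"] for r in records if r["paid"]}

def support_customers(records):
    """Support definition: someone to whom support is provided."""
    return {r["name"] for r in records if r["support"]}

sales = sales_customers(accounts)
support = support_customers(accounts)

agreed = sales & support      # customers under both definitions
disputed = sales ^ support    # customers under exactly one definition

print(sorted(agreed))
print(sorted(disputed))
```

Each function is sound within its own silo. The `disputed` set is where a merged "customer list" silently depends on whose definition won.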

Illustrating “semantic inconsistency,” one story at a time.

What’s your 250 – 300 word semantic inconsistency story?

PS: David also points to a webinar that will be of interest. Visit his post.

Argumentation 2012

Thursday, May 3rd, 2012

Argumentation 2012: International Conference on Alternative Methods of Argumentation in Law

07-09-2012 Full paper submission deadline

21-09-2012 Notice of acceptance deadline

12-10-2012 Paper camera-ready deadline

26-10-2012 Main event, Masaryk University in Brno, Czech Republic

From the listing of topics for papers, semantic diversity is going to run riot at this conference.

Checking around the website I was disappointed the papers from Argumentation 2011 are not online.

Semantically Diverse Christenings

Sunday, April 29th, 2012

Mark Liberman in Neutral Xi_b^star, Xi(b)^{*0}, Ξb*0, whatever at Language Log reports semantically diverse christenings of the same new subatomic particle.

I count eight or nine distinct names in Liberman’s report.

How many do you see?

This is just days after its discovery at CERN.

Largely in the scientific literature. (It will get far worse if you include non-technical literature. Is non-technical literature/discussion relevant?)

Question for science librarians:

How many names for this new subatomic particle will you use in searches?

Technology speedup graph

Sunday, April 8th, 2012

Technology speedup graph

Andrew Gelman posts an interesting graphic showing the adoption of various technologies from 1900 forward. See the post for the lineage on the graph and the details. Good graphic.

What caught my eye for topic maps was the rapid adoption of the Internet/WWW and the now well recognized failure of the Semantic Web.

You may feel like disputing my evaluation of the Semantic Web. Recall that agents were predicted to be roaming the Semantic Web by this point in Tim Berners-Lee’s first puff piece in Scientific American. After a few heady years of announcements that realization was just around the corner came the 21st century technology equivalent of the long retreat (think Napoleon).

Now the last gasp is Linked Data, where the “meaning” of URIs is to be determined on Mount W3C and then imposed on the rest of us.

Make no mistake, I think the WWW was a truly great technological achievement.

But the technological progress graph prompted me to wonder, yet again, how is the WWW different from the Semantic Web?

Not sure this is helpful but consider the level of agreement on semantics required by the WWW versus the Semantic Web.

For the WWW, there are a handful of RFCs that specify the treatment of syntax, that is, addresses and the composition of resources that you find at those addresses. Users may attach semantics to those resources, but none of those semantics are required for processing or delivery of the resources.

That is, for the WWW to succeed, all we need is agreement on the addressing and processing of resources and not at all on their semantics.

A resource can have a crazy quilt of semantics attached to it by users, diverse, inconsistent, contradictory, because its addressing and processing is independent of those semantics and those who would impose them.

Resources on the WWW certainly have semantics, but processing those resources doesn’t depend on our agreement on those semantics.

So, the semantic agreement of the WWW = ~ 0. (Leaving aside the certainly true contention that protocols have semantics.)

The semantic agreement required by the Semantic Web is “web scale agreement.” That is everyone who encounters a semantic has to either honor it or break that part of the Semantic Web.

Wait until after you watch the BBC News or Al Jazeera (English), الجزيرة.نت, before you suggest universal semantics are just around the corner.

Cry Me A River, But First Let’s Agree About What A River Is

Saturday, February 4th, 2012

Cry Me A River, But First Let’s Agree About What A River Is

The post starts off well enough:

How do you define a forest? How about deforestation? It sounds like it would be fairly easy to get agreement on those terms. But beyond the basics – that a definition for the first would reflect that a forest is a place with lots of trees and the second would reflect that it’s a place where there used to be lots of trees – it’s not so simple.

And that has consequences for everything from academic and scientific research to government programs. As explained by Krzysztof Janowicz, perfectly valid definitions for these and other geographic terms exist by the hundreds, in legal texts and government documents and elsewhere, and most of them don’t agree with each other. So, how can one draw good conclusions or make important decisions when the data informing those is all over the map, so to speak.


Having enough data isn’t the problem – there’s official data from the government, volunteer data, private organization data, and so on – but if you want to do a SPARQL query of it to discover all towns in the U.S., you’re going to wind up with results that include the places in Utah with populations of less than 5,000, and Los Angeles too – since California legally defines cities and towns as the same thing.

“So this clearly blows up your data, because your analysis is you thinking that you are looking at small rural places,” he says.

This Big Data challenge is not a new problem for the geographic-information sciences community. But it is one that’s getting even more complicated, given the tremendous influx of more and more data from more and more sources: Satellite data, rich data in the form of audio and video, smart sensor network data, volunteer location data from efforts like the Citizen Science Project and services like Facebook Places and Foursquare. “The heterogeneity of data is still increasing. Semantic web tools would help you if you had the ontologies but we don’t have them,” he says. People have been trying to build top-level global ontologies for a couple of decades, but that approach hasn’t yet paid off, he thinks. There needs to be more of a bottom-up take: “The biggest challenge from my perspective is coming up with the rules systems and ontologies from the data.”

All true, many of which objectors to the current Semantic Web approach have been saying for a very long time.

I am not sure about the line: “The heterogeneity of data is still increasing.”

In part because I don’t know of any reliable measure of heterogeneity by which a comparison could be made. True there is more data now than at some X point in the past, but that isn’t necessarily an indication of increased heterogeneity. But that is a minor point.

More serious is the “a miracle occurs” statement that follows:

How to do it, he thinks, is to make very small and really local ontologies directly mined with the help of data mining or machine learning techniques, and then interlink them and use new kinds of reasoning to see how to reason in the presence of inconsistencies. “That approach is local ontologies that arrive from real application needs,” he says. “So we need ontologies and semantic web reasoning to have neater data that is human and also machine readable. And more effective querying based on analogy or similarity reasoning to find data sets that are relevant to our work and exclude data that may use the same terms but has different ontological assumptions underlying it.”

Doesn’t that have the same feel as the original Semantic Web proposals that were going to eliminate semantic ambiguity from the top down? The very approach that is panned in this article?

And “new kinds of reasoning,” ones I assume have not been invented yet, are going “to reason in the presence of inconsistencies.” And excluding data that “…has different ontological assumptions underlying it.”

Since we are the source of ontological assumptions that underlie the use of terms, I am real curious about how those assumptions are going to become available to these to-be-invented reasoning techniques.

Oh, that’s right, we are all going to specify our ontological assumptions at the bottom to percolate up. Except that to be useful for machine reasoning, they will have to be as crude as the ones that were going to be imposed from the top down.

I wonder why the indeterminate nature of semantics continues to elude Semantic Web researchers. A snapshot of semantics today may be slightly incorrect tomorrow, probably incorrect in some respect in a month and almost surely incorrect in a year or more.

Take Saddam Hussein for example. One time friend and confidant of Donald Rumsfeld (there are pictures). But over time those semantics changed, largely because Hussein slipped the leash and was no longer a proper vassal to the US. Suddenly, the weapons of mass destruction, in part nerve gas we caused to be sold to him, became a concern. And so Hussein became an enemy of the US. Same person, same facts. Different semantics.

There are less dramatic examples but you get the idea.

We can capture even changing semantics but we need to decide what semantics we want to capture and at what cost? Perhaps that is a better way to frame my objection to most Semantic Web activities, they are not properly scoped. Yes?


Friday, January 27th, 2012


From the webpage:

Since Aryabhatta invented zero, Mathematicians such as John von Neuman have been in pursuit of efficient counting and architects have constantly built systems that computes counts quicker. In this age of social media, where 100s of 1000s events take place every second, we were inspired by twitter’s Rainbird project to develop distributed counting engine that can scale linearly.

Countandra is a hierarchical distributed counting engine on top of Cassandra (to increment/decrement hierarchical data) and Netty (HTTP Based Interface). It provides a complete http based interface to both posting events and getting queries. The syntax of a event posting is done in a FORMS compatible way. The result of the query is emitted in JSON to make it maniputable by browsers directly.


  • Geographically distributed counting.
  • Easy Http Based interface to insert counts.
  • Hierarchical counting such as
  • Retrieves counts, sums and square in near real time.
  • Simple Http queries provides desired output in JSON format
  • Queries can be sliced by period such as LASTHOUR,LASTYEAR and so on for MINUTELY,HOURLY,DAILY,MONTHLY values
  • Queries can be classified for anything in hierarchy such as com, com.mywebsite or
  • Open Source and Ready to Use!
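The hierarchical counting idea in the feature list can be sketched in a few lines. This is an in-memory illustration only, assuming dotted keys such as `com.example.blog` (hypothetical names); it is not Countandra's implementation, which sits on Cassandra and Netty.

```python
from collections import Counter

def prefixes(key):
    """Yield every ancestor of a dotted key, e.g. 'com', then 'com.example'."""
    parts = key.split(".")
    for i in range(1, len(parts) + 1):
        yield ".".join(parts[:i])

class HierarchicalCounter:
    def __init__(self):
        self.counts = Counter()

    def post(self, key, amount=1):
        # An event for a leaf key also counts toward each of its ancestors.
        for prefix in prefixes(key):
            self.counts[prefix] += amount

    def get(self, key):
        # Counter returns 0 for keys that were never posted.
        return self.counts[key]

counter = HierarchicalCounter()
counter.post("com.example.blog")
counter.post("com.example.shop", 2)
counter.post("com.other")

print(counter.get("com"), counter.get("com.example"), counter.get("org"))
```

Writing to every ancestor at post time makes a query for any level of the hierarchy a single lookup, which is one plausible reading of "queries can be classified for anything in hierarchy."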

Countandra illustrates that not every application need be a general purpose one. Countandra is designed to be a counting engine and to answer defined query types, nothing more.

There is a lesson there for semantic diversity solutions. It is better to attempt to solve part of the semantic diversity issue than to attempt a solution for everyone. At least partial solutions have a chance of being a benefit before being surpassed by changing technologies and semantics.

BTW, Countandra uses a Java long for time values, so in the words of the Unix Time Wikipedia entry:

In the negative direction, this goes back more than twenty times the age of the universe, and so suffices. In the positive direction, whether the approximately 293 billion representable years is truly sufficient depends on the ultimate fate of the universe, but it is certainly adequate for most practical purposes.

Rather than “suffices” and “most practical purposes” I would have said, “is adequate for present purposes” in both cases.
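The size of that range is easy to check. A rough calculation, assuming the signed 64-bit value counts seconds; the unit Countandra actually stores is not stated here, and Java's own `System.currentTimeMillis()` counts milliseconds, so that case is shown as well.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60   # Julian year, 31,557,600 seconds
MAX_LONG = 2**63 - 1                        # largest signed 64-bit value

# Assumption: the long holds whole seconds.
years_if_seconds = MAX_LONG / SECONDS_PER_YEAR

# Assumption: the long holds milliseconds instead.
years_if_millis = years_if_seconds / 1000

print(f"{years_if_seconds:.3e} years if counting seconds")
print(f"{years_if_millis:.3e} years if counting milliseconds")
```

Counting seconds gives a span on the order of hundreds of billions of years, in line with the quoted figure; counting milliseconds shrinks it to hundreds of millions. Either is "adequate for present purposes."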

Oil Drop Semantics?

Sunday, January 15th, 2012

Interconnection of Communities of Practice: A Web Platform for Knowledge Management and some related material made me think of the French “oil drop” counter-insurgency strategy.

With one important difference.

In a counter-insurgency context, the oil drop strategy is being used to further the goals of the counter-insurgency force. Whatever you think of those goals or the alleged benefits for the places covered by the oil drops, the fundamental benefit is to the counter-insurgency force.

In a semantic context, one that seeks to elicit the local semantics of a group, the goal is not the furtherance of an outside semantic, but the exposition of a local semantic with the goal of benefiting the group covered by the oil spot. As the oil drop spreads, those semantics may be combined with other oil drop semantics, but that is a cost and effort borne by the larger community seeking that benefit.

There are several immediate advantages to this approach with semantics.

First, the discussion of semantics at every level is taking place with the users of those semantics. You can hardly get closer to a useful answer than being able to ask the users of a semantic what was meant or for examples of usage. I don’t have a formalism for it but I would postulate that as the distance from users increases, the usefulness of what you capture about the semantics of those users decreases.

Ask the FBI about the Virtual Case File project. They didn’t ask users, or at least not enough of them, and flushed lots of cash. Lesson: Asking management, IT, etc., about the semantics of users is an utter waste of time. Really.

If you want to know the semantics of user group X, then ask group X. If you ask Y about X, you will get Y’s semantics about X. If that is what you want, fine, but if you want the semantics of group X, you have wasted your time and resources.

Second, asking the appropriate group of users for their semantics means that you can make explicit the ROI from making their semantics explicit. That is to say, if asked, the group will talk about semantics that are meaningful to them, ones that solve some task or issue that they encounter. Those may or may not be the semantics that interest you, but recall the issue is the group’s semantics, not yours.

The reason for the ROI question at the appropriate group level is so that the project is justified both to the group being asked to make the effort as well as those who must approve the resources for such a project. Answering that question up front helps get buy-in from group members and makes them realize this isn’t busy work but will have a positive benefit for them.

Third, such a bottom-up approach, whether you are using topic maps, RDF, etc. will mean that only the semantics that are important to users and justified by some positive benefit are being captured. Your semantics may not have the rigor of SUMO, for example, but they are a benefit to you. What other test would you apply?

Another way to think about geeks and repetitive tasks

Tuesday, January 10th, 2012

Another way to think about geeks and repetitive tasks

Jon Udell writes:

The other day Tim Bray tweeted a Google+ item entitled Geeks and repetitive tasks along with the comment: “Geeks win, eventually.”

…(material omitted)

In geek ideology the oppressors are pointy-haired bosses and clueless users. Geeks believe (correctly) that clueless users can’t imagine, never mind implement, automated improvements to repetitive manual chores. The chart divides the world into geeks and non-geeks, and it portrays software-assisted process improvement as a contest that geeks eventually win. This Manichean worldview is unhelpful.

I have no doubt that Jon’s conclusion:

Software-assisted automation of repetitive work isn’t an event, it’s a process. And if you see it as a contest with winners and losers you are, in my view, doing it wrong.

is the right one but I think it misses an important insight.

That “geeks” and their “oppressors” view the world with very different semantics. If neither one tries to communicate those semantics to the other, then software will continue to fail to meet the needs of its users. An unhappy picture for all concerned, geeks as well as their oppressors.

Being semantics, there is no “right” or “wrong” semantic.

True enough, the semantics of geeks works better with computers but if that fails to map in some meaningful way to the semantics of their oppressors, what’s the point?

Geeks can write highly efficient software for tasks but if the tasks aren’t something anyone is willing to pay for or even use, what’s the point?

Users and geeks need to both remember that communication is a two-way street. The only way for it to fail completely is for either side to stop trying to communicate with the other.

Have no doubt, I have experienced the annoyance of trying to convince a geek that just because they have written software a particular way, that has little to no bearing on some user request. (The case in point was a UI where the geek had decided on a “better” means of data entry. The users, who were going to be working with the data, thought otherwise. I heard the refrain, “…if they would just use it they would get used to it.” Of course, the geek had written the interface without asking the users first.)

To be fair, users have to be willing to understand there are limitations on what can be requested.

And users failing to complete written and detailed requirements for all aspects of a request is almost a guarantee that the software result isn’t going to satisfy anyone.

Written requirements are where semantic understandings, mis-understandings and clashes can be made visible, resolved (hopefully) and documented. Burdensome, annoying, non-productive in the view of geeks who want to get to coding, but absolutely necessary in any sane software development environment.

That is to say any software environment that is going to represent a happy (well, workable) marriage of the semantics of geeks and users.

Google; almost 50 functions & resources killed in 2011

Saturday, December 17th, 2011

Google; almost 50 functions & resources killed in 2011 by Phil Bradley.

Just in case you want to think of other potential projects over the holidays! 😉

For my topic maps class:

  1. Pick one function or resource
  2. Outline how semantic integration could support or enhance such a function or resource. (3-5 pages, no cites)
  3. Bonus points: What resources would you want to integrate for such a function or resource? (1-2 pages)

John Giannandrea on Freebase – A Rosetta Stone for Entities

Tuesday, November 15th, 2011

John Giannandrea on Freebase – A Rosetta Stone for Entities by Daniel Tunkelang.

From the post:

John started by introducing Freebase as a representation of structured objects corresponding to real-world entities and connected by a directed graph of relationships. In other words, a semantic web. While it isn’t quite web-scale, Freebase is a large and growing knowledge base consisting of 25 million entities and 500 million connections — and doubling annually. The core concept in Freebase is a type, and an entity can have many types. For example, Arnold Schwarzenegger is a politician and an actor. John emphasized the messiness of the real world. For example, most actors are people, but what about the dog who played Lassie? It’s important to support exceptions.

The main technical challenge for Freebase is reconciliation — that is, determining how similar a set of data is to existing Freebase topics. John pointed out how critical it is for Freebase to avoid duplication of content, since the utility of Freebase depends on unique nodes in its graph corresponding to unique objects in the world. Freebase obtains many of its entities by reconciling large, open-source knowledge bases — including Wikipedia, WordNet, Library of Congress Authorities, and metadata from the Stanford Library. Freebase uses a variety of tools to implement reconciliation, including Google Refine (formerly known as Freebase Gridworks) and Matchmaker, a tool for gathering human judgments. While reconciliation is a hard technical problem, it is made possible by making inferences across the web of relationships that link entities to one another.

John then presented Freebase as a Rosetta Stone for entities on the web. Since an entity is simply a collection of keys (one of which is its name), Freebase’s job is to reverse engineer the key-value store that is distributed among the entity’s web references, e.g., the structured databases backing web sites and encoding keys in URL parameters. He noted that Freebase itself is schema-less (it is a graph database), and that even the concept of a type is itself an entity (“Type type is the only type that is an instance of itself”). Google makes Freebase available through an API and the Metaweb Query Language (MQL).

(emphasis added)

<tedious-self-justification>…., entity is a collection of keys indeed! Key/value pairs I would say, with no presumptions about the structure of either one.</tedious-self-justification>

There is not now nor will there ever be agreement on the “unique objects in the world.” And why should that be a value? If we have the key/value pairs, we can each arrive at our own conclusions about whether certain “unique nodes” correspond to what we think of as “unique objects in the world.”
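To make the point concrete: a minimal sketch in which the same two key/value records are "one object" or "two objects" depending on the rule the reader brings. The records and the `same_subject` helper are hypothetical, not Freebase data or its API.

```python
# Two hypothetical records, held as plain key/value pairs.
record_a = {"name": "Arnold Schwarzenegger", "type": "actor", "born": "1947"}
record_b = {"name": "Arnold Schwarzenegger", "type": "politician", "born": "1947"}

def same_subject(a, b, keys):
    """Treat two records as one subject if they agree on every key in `keys`."""
    return all(k in a and k in b and a[k] == b[k] for k in keys)

# Each reader supplies their own rule for what counts as "the same object".
by_person = same_subject(record_a, record_b, keys=("name", "born"))
by_role = same_subject(record_a, record_b, keys=("name", "type"))

print(by_person, by_role)
```

Nothing in the pairs themselves settles which rule is right. The data supports both conclusions, and the choice stays with whoever is doing the merging.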

I suspect, but don’t know having never asked former President Bush II, that we disagree on the existence of any unique objects in the world and it is unlikely there is any evidence that would persuade either one of us to change.

Remember the Rosetta Stone had three (3) versions of the same inscription. It did not try to say one version was closer to the original than the others.

The Rosetta Stone is one of the earliest honorings of semantic diversity. Unlike systems that try to push only one common semantic or vision.

The Second International Workshop on Diversity in Document Retrieval (DDR-2012)

Tuesday, October 18th, 2011

The Second International Workshop on Diversity in Document Retrieval (DDR-2012)


When: Feb 12, 2012 – Feb 12, 2012
Where: Seattle WA, USA
Submission Deadline: Dec 5, 2011
Notification Due: Jan 10, 2012
Final Version Due: Jan 17, 2012

From the webpage:

In conjunction with WSDM 2012 – the 5th ACM International Conference on Web Search and Data Mining

When an ambiguous query is received, a sensible approach is for the information retrieval (IR) system to diversify the results retrieved for this query, in the hope that at least one of the interpretations of the query intent will satisfy the user. Diversity is an increasingly important topic, of interest to both academic researchers (such as participants in the TREC Web and Blog track diversity tasks), as well as to search engines professionals. In this workshop, we solicit submissions both on approaches and models for diversity, the evaluation of diverse search results, and on applications and presentation of diverse search results.


  • Modelling Diversity:
    • Implicit diversification approaches
    • Explicit diversification approaches
    • Query log mining for diversity
    • Learning-to-rank for diversification
    • Clustering of results for diversification
    • Query intent understanding
    • Query type classification
  • Modelling Risk:
    • Probability ranking principle
    • Risk Minimization frameworks and role diversity
  • Evaluation:
    • Test collections for diversity
    • Evaluating of diverse search results
    • Measuring the ambiguity of queries
    • Measuring query aspects importance
  • Applications:
    • Product & review diversification
    • Opinion and sentiment diversification
    • Diversifying Web crawling policy
    • Graph analysis for diversity
    • Summarisation
    • Legal precedents & patents
    • Diverse recommender systems
    • Diversifying in real-time & news search
    • Diversification in other verticals (image/video search etc.)
    • Presentation of diverse search results
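The basic idea of diversifying results for an ambiguous query can be sketched with a simple greedy heuristic: take the best-scoring result for each interpretation not yet covered before repeating an interpretation. This is my own illustration, assuming each result already carries an interpretation label and a score; it is not one of the workshop's methods, and the sample results are made up.

```python
def diversify(results, k):
    """Greedily pick k results, preferring interpretations not yet covered.

    `results` is a list of (title, interpretation, score) tuples.
    """
    remaining = sorted(results, key=lambda r: r[2], reverse=True)
    chosen, seen = [], set()
    while remaining and len(chosen) < k:
        # Best-scoring result whose interpretation is new, else the best left.
        pick = next((r for r in remaining if r[1] not in seen), remaining[0])
        chosen.append(pick)
        seen.add(pick[1])
        remaining.remove(pick)
    return chosen

# Hypothetical results for the ambiguous query "jaguar".
results = [
    ("Jaguar XF review", "car", 0.9),
    ("Jaguar dealers", "car", 0.8),
    ("Jaguar habitat", "animal", 0.7),
    ("Jaguar OS X 10.2", "software", 0.5),
]
top3 = diversify(results, 3)
for title, interpretation, score in top3:
    print(title, interpretation, score)
```

A plain sort by score would fill the top two slots with cars. The greedy pass trades a little score for the chance that at least one interpretation matches what the user meant.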

While typing this up, I remembered the “little search engine that could” post (Going Head to Head with Google (and winning)). Are we really condemned to have to manage unforeseeable complexity or is that a poor design choice we made for search engines?

After all, I am not really interested in the entire WWW. At least for this blog I am interested in probably less than 1/10 of 1% of the web (or less). So if I had a search engine for all the CS/Library/Informatics publications, blogs, subject domains relevant to data/information, I would pretty much be set. A big semantic field and one that is changing, but not anything like search everything that is connected (or not, for the DeepWeb) to the WWW.

I don’t have an answer for that but I think it is an issue that may enable management of semantic diversity. That is we get to declare the edge of the map. Yes, there are other things beyond the edge but we aren’t going to include them in this particular map.


Friday, October 7th, 2011


From the about documentation:

The Doppelganger service translates between IDs of entities in third party APIs. When you query Doppelganger with an entity ID, you’ll get back IDs of that same entity in other APIs. In addition, a persistent Uberblic ID serves as an anchor for your application that you can use for subsequent queries.

So why link APIs? is answered in a blog entry:

There is an ever-increasing amount of data available on the Web via APIs, waiting to be integrated by product developers. But actually integrating more than just one API into a product poses a problem to developers and their product managers: how do we make the data sources interoperable, both with one another and with our existing databases? Uberblic launches a service today to make that easy.

A location based product, for example, would aim to pull in information like checkins from Foursquare, reviews from Lonely Planet, concerts from LastFM and social connections from Facebook, and display all that along one place’s description. To do that, one would need to identify this particular place in all the APIs – identify the place’s ‘doppelgangers’, if you will. Uberblic does exactly that, mapping doppelgangers across APIs, as a web service. It’s like a dictionary for IDs, the Rosetta Stone of APIs. And geolocation is just the beginning.

Uberblic’s doppelganger engine links data across a variety of data APIs. By matching equivalent records, the engine connects an entity graph that spans APIs and data services. This entity graph provides rich contextual data for product developers, and Uberblic’s APIs serve as a switchboard and broker between data sources.

See the full post at:

Useful. But as you have already noticed, no associations, no types, no way to map to other identifiers.

Not that a topic map could not use Uberblic data if available, just that it is not all that is possible.
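A minimal sketch of what such an ID translation amounts to, assuming a simple in-memory table. The anchor IDs, API names and external IDs are all made up, and `doppelgangers` is a hypothetical helper, not Uberblic's actual API.

```python
# Hypothetical equivalence table: one anchor ID per entity, with the IDs
# that entity carries in each third-party API. All values are made up.
ENTITIES = {
    "anchor:1": {"foursquare": "fs-123", "lastfm": "lfm-9", "facebook": "fb-77"},
    "anchor:2": {"foursquare": "fs-456", "facebook": "fb-88"},
}

# Reverse index from (api, external id) to the anchor ID.
INDEX = {
    (api, ext_id): anchor
    for anchor, ids in ENTITIES.items()
    for api, ext_id in ids.items()
}

def doppelgangers(api, ext_id):
    """Return the anchor ID and the IDs of the same entity in other APIs."""
    anchor = INDEX.get((api, ext_id))
    if anchor is None:
        return None
    others = {a: i for a, i in ENTITIES[anchor].items() if a != api}
    return {"anchor": anchor, "ids": others}

print(doppelgangers("foursquare", "fs-123"))
```

The table records that IDs are equivalent and nothing else: no type for the entity, no associations between entities, and no stated basis for the equivalence. That is the gap a topic map could fill.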

Artificial Intelligence Resources

Sunday, September 25th, 2011

Artificial Intelligence Resources

A collection of collections of resources on artificial intelligence. Useful but also illustrates a style of information delivery that has advantages over “search style foraging” and disadvantages as well.

Its biggest advantage over “search style foraging” is that it presents a manageable listing of resources and not several thousand links. Even very dedicated researchers are unlikely to follow more than a few hundred links, and even if you did, some of the material would be outdated by the time you reached it.

Another advantage is that one hopes (I haven’t tried all the links) that the resources have been vetted to some degree, with the superficial and purely advertising sites being filtered out. Results are more “hit” than “miss,” which with search results can be a very mixed bag.

But a manageable list is just that, manageable. The very link you need may have missed the cut-off point. Had to stop somewhere.

And you can’t know the author’s criteria for the listing. Their definition of “algorithm” may be broader or narrower than your own.

In the days of professional indexes, researchers learned a sense for the categories used by indexing services. At least that was a smaller set than the vocabulary range of every author.

How would you use topic maps to bridge the gap between those two solutions?

Automatic transcription of 17th century English text in Contemporary English with NooJ: Method and Evaluation

Sunday, September 25th, 2011

Automatic transcription of 17th century English text in Contemporary English with NooJ: Method and Evaluation by Odile Piton (SAMM), Slim Mesfar (RIADI), and Hélène Pignot (SAMM).


Since 2006 we have undertaken to describe the differences between 17th century English and contemporary English thanks to NLP software. Studying a corpus spanning the whole century (tales of English travellers in the Ottoman Empire in the 17th century, Mary Astell’s essay A Serious Proposal to the Ladies and other literary texts) has enabled us to highlight various lexical, morphological or grammatical singularities. Thanks to the NooJ linguistic platform, we created dictionaries indexing the lexical variants and their transcription in CE. The latter is often the result of the validation of forms recognized dynamically by morphological graphs. We also built syntactical graphs aimed at transcribing certain archaic forms in contemporary English. Our previous research implied a succession of elementary steps alternating textual analysis and result validation. We managed to provide examples of transcriptions, but we have not created a global tool for automatic transcription. Therefore we need to focus on the results we have obtained so far, study the conditions for creating such a tool, and analyze possible difficulties. In this paper, we will be discussing the technical and linguistic aspects we have not yet covered in our previous work. We are using the results of previous research and proposing a transcription method for words or sequences identified as archaic.

Everyone working on search engines needs to print a copy of this article and read it at least once a month.

Seriously, the senses of both words and grammar evolve over centuries and even more quickly. What seem like correct search results from as recently as the 1950’s may be quite incorrect.

For example (I don’t have the episode reference, perhaps someone can supply it) there was an “I Love Lucy” episode where Lucy says on the phone to Ricky that some visitor (at home) is “making love to her,” which meant nothing more than sweet talk. Not sexual intercourse.

I leave it to your imagination how large the semantic gap may be between English texts and originals composed in another language, culture and historical context, between 2,000 and 6,000 years ago. Flattening the complexities of ancient texts to bumper sticker snippets does a disservice to them and to ourselves.

GDB for the Data Driven Age (STI Summit Position Paper)

Saturday, July 30th, 2011

GDB for the Data Driven Age (STI Summit Position Paper) by Orri Erling.

From the post:

The Semantic Technology Institute (STI) is organizing a meeting around the questions of making semantic technology deliver on its promise. We were asked to present a position paper (reproduced below). This is another recap of our position on making graph databasing come of age. While the database technology matters are getting tackled, we are drawing closer to the question of deciding actually what kind of inference will be needed close to the data. My personal wish is to use this summit for clarifying exactly what is needed from the database in order to extract value from the data explosion. We have a good idea of what to do with queries but what is the exact requirement for transformation and alignment of schema and identifiers? What is the actual use case of inference, OWL or other, in this? It is time to get very concrete in terms of applications. We expect a mixed requirement but it is time to look closely at the details.

Interesting post that includes the following observation:

Real-world problems are however harder than just bundling properties, classes, or instances into sets of interchangeable equivalents, which is all we have mentioned thus far. There are differences of modeling (“address as many columns in customer table” vs. “address normalized away under a contact entity”), normalization (“first name” and “last name” as one or more properties; national conventions on person names; tags as comma-separated in a string or as a one-to-many), incomplete data (one customer table has family income bracket, the other does not), diversity in units of measurement (Imperial vs. metric), variability in the definition of units (seven different things all called blood pressure), variability in unit conversions (currency exchange rates), to name a few. What a world!

Yes, quite.

Worth a very close read.
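A small sketch may help show how quickly those differences pile up. The two record layouts below are hypothetical, loosely following the examples in the quote: a name held as one property or two, tags as a comma-separated string or a list, Imperial or metric units, and a field that only one source has. None of it comes from the position paper.

```python
INCH_TO_CM = 2.54


def normalize_a(row):
    """Source A: one 'name' field, height in inches, tags in one string."""
    # Splitting on the first space is a naive assumption; it fails for
    # many national naming conventions, which is part of the point.
    first, _, last = row["name"].partition(" ")
    return {
        "first_name": first,
        "last_name": last,
        "height_cm": round(row["height_in"] * INCH_TO_CM, 1),
        "tags": {tag.strip() for tag in row["tags"].split(",")},
        "income_bracket": row.get("income_bracket"),
    }


def normalize_b(row):
    """Source B: split name, metric height, tags as a list, no income data."""
    return {
        "first_name": row["first_name"],
        "last_name": row["last_name"],
        "height_cm": row["height_cm"],
        "tags": set(row["tags"]),
        "income_bracket": row.get("income_bracket"),  # None: not recorded
    }


a = normalize_a({"name": "Ada Lovelace", "height_in": 65,
                 "tags": "math, poetry", "income_bracket": "C"})
b = normalize_b({"first_name": "Ada", "last_name": "Lovelace",
                 "height_cm": 165.1, "tags": ["poetry", "math"]})

# Fields on which the two normalized records still disagree.
differences = sorted(key for key in a if a[key] != b[key])
```

Even after normalization the records disagree on `income_bracket`, because one source simply never recorded it. No amount of equivalence mapping supplies data that is not there.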

The Lisp Curse

Tuesday, April 19th, 2011

The Lisp Curse by Rudolf Winestock begins:

This essay is yet another attempt to reconcile the power of the Lisp programming language with the inability of the Lisp community to reproduce their pre-AI Winter achievements. Without doubt, Lisp has been an influential source of ideas even during its time of retreat. That fact, plus the brilliance of the different Lisp Machine architectures, and the current Lisp renaissance after more than a decade in the wilderness demonstrate that Lisp partisans must have some justification for their smugness. Nevertheless, they have not been able to translate the power of Lisp into a movement with overpowering momentum.

In this essay, I argue that Lisp’s expressive power is actually a cause of its lack of momentum.

Read the essay, then come back here. I’ll wait.

… … … …

OK, good read, yes?

At first blush, I thought about HyTime and its expressiveness. Or of topic maps. Could there be a parallel?

But non-Lisp software projects proliferate.

Let’s use SourceForge for examples.

Total projects for the database category – 906.

How many were written using Lisp?

Lisp 1

Compared to:

Java 282
C++ 106
PHP 298
Total: 686

That may not be fair.

Databases may not attract AI/Lisp programmers.

What about artificial intelligence?

Lisp 8
Scheme 3
Total: 11

Compared to:

Java 115
C++ 111
C 42
Total: 268

Does that mean that Java, C++ and C are too expressive?

Or that their expressiveness has retarded their progress in some way?

Or is some other factor responsible for the proliferation of projects?

And a proliferation of semantics.

Correction: I corrected -> and made it a hyperlink. Fortunately sourceforge silently redirects my mistake in entering the domain name in a browser.


Saturday, February 12th, 2011

Managing and Reasoning in the Presence of Inconsistency

The International Journal of Semantic Computing describes this Call for Papers as follows:

Inconsistency is ubiquitous in the real world, in human behaviors, and in the computing systems we build. Inconsistency manifests itself in a plethora of phenomena at different level in the depth of knowledge, ranging from data, information, knowledge, meta-knowledge, to expertise. Data inconsistency arises when patterns in data do not conform to an established range, distribution or interpretation. The exponentially growing volumes of data stemming from almost all types of data being created in digital form, a proliferation of sensors and sensor networks, and other sources such as social networks, complex computer simulations, space explorations, and high-resolution imagery and video, have made data inconsistency an inevitability. Information inconsistency occurs when meanings of the same data values become conflicting or when the same attribute for an entity has different data values. Knowledge inconsistency happens when propositions of either declarative or procedural beliefs, in either explicit or tacit form, yield antagonistic outcomes for the same circumstance. Inconsistency can also emerge from meta-knowledge or from expertise. How to manage and reason in the presence of inconsistency in computing systems is a very important issue in semantic computing, social computing, and other data-rich or knowledge-rich computing paradigms. It requires that we understand the causes and circumstances of inconsistency, establish proper metrics for inconsistency, adopt formalisms to represent inconsistency, develop ways to recognize and analyze different types of inconsistency, and devise mechanisms and methodologies to manage and handle inconsistency.

Refreshing in that inconsistency is recognized as an omnipresent and everlasting fact of our environments. Including computing environments.

The phrase, “…establish proper metrics for inconsistency,…” betrays a world view that we can stand outside of our inconsistencies and those of others.

For all the useful work that will appear in this volume (and others like it), there is no place to stand outside of our environments and their inconsistencies.

Important Dates
Submission deadline: May 20, 2011
Review result notification: July 20, 2011
Revision due: August 20, 2011
Final version due: August 31, 2011
Tentative date of publication: September, 2011 (Vol.5, No.3)

ISO Concepts Database

Saturday, February 5th, 2011

ISO Concepts Database

The ISO Concepts Database has appeared online and will give us a window into semantic diversity at ISO.

As soon as the site stops crashing, I will be posting a report about the term subject. There are thirteen different definitions for that term.

Communicating Across the Academic Divide – Post

Friday, January 14th, 2011

Communicating Across the Academic Divide

Myra H. Strober writes:

However, while doing research for my new book, Interdisciplinary Conversations: Challenging Habits of Thought, I found an even more fundamental barrier to interdisciplinary work: Talking across disciplines is as difficult as talking to someone from another culture. Differences in language are the least of the problems; translations may be tedious and not entirely accurate, but they are relatively easy to accomplish. What is much more difficult is coming to understand and accept the way colleagues from different disciplines think—their assumptions and their methods of discerning, evaluating, and reporting “truth”—their disciplinary cultures and habits of mind.

I rather like the line: Talking across disciplines is as difficult as talking to someone from another culture.

That is the problem in a nutshell, isn’t it?

What most solution proposers fail to recognize is that solutions to the problem are cultural artifacts themselves.

There is no place to stand outside of culture.

So we are always trying to talk to people from other cultures. Constantly.

Even as we try to solve the problem of talking to people from other cultures.

Realizing that does not make talking across cultures any easier.

It may help us realize that the equivalent of talking louder isn’t likely to help in talking across cultural divides.

One of the reasons why I like topic maps is that it is possible, although not easy, to capture subject identifications from different cultures.

How well a topic map does that depends on the skill of its author and those contributing information to the map.

International Workshop on Semantic Technologies for Information-Integrated Collaboration (STIIC 2011)

Wednesday, January 12th, 2011

International Workshop on Semantic Technologies for Information-Integrated Collaboration (STIIC 2011) as part of the 2011 International Conference on Collaboration Technologies and Systems (CTS 2011), May 23 – 27, 2011, The Sheraton University City Hotel, Philadelphia, Pennsylvania, USA.

From the announcement:

Information-integrated collaboration networks have become an important part of today’s complex enterprise systems – this becomes obvious if we consider, as a prominent example, the high dynamics of network-centric systems, which need to react to changes at the level of their information and communication space by providing flexible mechanisms to manage a wide variety of information resources, heterogeneous, decentralized, and constantly evolving. Semantic technologies promise to deliver innovative and effective solutions to this problem, facilitating the realization of information integration mechanisms that allow collaboration systems to provide the added value they are expected to.

Two fundamental problems are inherent to the design of integrated collaboration solutions: (i) semantic inaccessibility, caused by the failure to explicitly specify the semantic content of the information contained within the subsystems that must share information in order to collaborate effectively; and (ii) logical disconnectedness: caused by the failure to explicitly represent constraints between the information managed by the different collaborating subsystems.

Mainstream EAI technologies deal with information and information management tasks at the syntactic level. Data protocols and standards that are used to facilitate seamless information exchange and ‘plug and play’ interoperability do not take into account the meaning of the underlying information and the view of the individual stakeholders on the information exchanged. What is lacking are mechanisms that have the ability to capture, store, and manage the meaning of the data and artifacts that need to be shared for collaborative problem solving, decision support, planning, and execution.

Important Dates:

Paper submissions: January 24, 2011

Acceptance notification: February 11, 2011

Camera ready papers and registration due: March 1, 2011

Conference dates: May 23 – 27, 2011

I rather like the line:

What is lacking are mechanisms that have the ability to capture, store, and manage the meaning of the data and artifacts that need to be shared for collaborative problem solving, decision support, planning, and execution.

Sorta says it all, doesn’t it?

Building Concept Structures/Concept Trails

Thursday, December 2nd, 2010

Automatically Building Concept Structures and Displaying Concept Trails for the Use in Brainstorming Sessions and Content Management Systems Authors: Christian Biemann, Karsten Böhm, Gerhard Heyer and Ronny Melz


The automated creation and the visualization of concept structures become more important as the number of relevant information continues to grow dramatically. Especially information and knowledge intensive tasks are relying heavily on accessing the relevant information or knowledge at the right time. Moreover the capturing of relevant facts and good ideas should be focused on as early as possible in the knowledge creation process.

In this paper we introduce a technology to support knowledge structuring processes already at the time of their creation by building up concept structures in real time. Our focus was set on the design of a minimal invasive system, which ideally requires no human interaction and thus gives the maximum freedom to the participants of a knowledge creation or exchange processes. The initial prototype concentrates on the capturing of spoken language to support meetings of human experts, but can be easily adapted for the use in Internet communities that have to rely on knowledge exchange using electronic communication channel.

I don’t share the authors’ confidence that corpus linguistics is going to provide the level of accuracy expected.

But, I find the notion of a dynamic semantic map that grows, changes and evolves during a discussion to be intriguing.

This article was published in 2006 so I will follow up to see what later results have been reported.

The effect of audience design on labeling, organizing, and finding shared files (unexpected result – see below)

Tuesday, October 19th, 2010

The effect of audience design on labeling, organizing, and finding shared files Authors: Emilee Rader Keywords: audience design, common ground, file labeling and organizing, group information management


In an online experiment, I apply theory from psychology and communications to find out whether group information management tasks are governed by the same communication processes as conversation. This paper describes results that replicate previous research, and expand our knowledge about audience design and packaging for future reuse when communication is mediated by a co-constructed artifact like a file-and-folder hierarchy. Results indicate that it is easier for information consumers to search for files in hierarchies created by information producers who imagine their intended audience to be someone similar to them, independent of whether the producer and consumer actually share common ground. This research helps us better understand packaging choices made by information producers, and the direct implications of those choices for other users of group information systems.

Examples from the paper:

  • A scientist needs to locate procedures and results from an experiment conducted by another researcher in his lab.
  • A student learning the open-source, command-line statistical computing environment R needs to find out how to calculate the mode of her dataset.
  • A new member of a design team needs to review requirements analysis activities that took place before he joined the team.
  • An intelligence analyst needs to consult information collected by other agencies to assess a potential threat.

Do any of those sound familiar?

Unexpected result:

In general, Consumers performed best (fewest clicks to find the target file) when the Producer created a hierarchy for an Imagined Audience from the same community, regardless of the community the Consumer community. Consumers had the most difficulty when searching in hierarchies created by a Producer for a dissimilar Imagined Audience.

In other words, imagining an audience is a bad strategy. Create a hierarchy that works for you. (And with a topic map you could let others create hierarchies that work for them.)

(Apologies for the length of this post but unexpected interface results merit the space.)

IEEE Computer Society Technical Committee on Semantic Computing (TCSEM)

Sunday, October 17th, 2010

The IEEE Computer Society Technical Committee on Semantic Computing (TCSEM)

addresses the derivation and matching of the semantics of computational content to that of naturally expressed user intentions in order to retrieve, manage, manipulate or even create content, where “content” may be anything including video, audio, text, software, hardware, network, process, etc.

Being organized by Phillip C-Y Sheu (UC Irvine), Phone: +1 949 824 2660. Volunteers are needed for both organizational and technical committees.

This is a good way to meet people, make a positive contribution, and have a lot of fun.

Satrap: Data and Network Heterogeneity Aware P2P Data-Mining

Monday, October 11th, 2010

Satrap: Data and Network Heterogeneity Aware P2P Data-Mining Authors: Hock Hee Ang, Vivekanand Gopalkrishnan, Anwitaman Datta, Wee Keong Ng, Steven C. H. Hoi Keywords: Distributed classification, P2P network, cascade SVM


Distributed classification aims to build an accurate classifier by learning from distributed data while reducing computation and communication cost. A P2P network where numerous users come together to share resources like data content, bandwidth, storage space and CPU resources is an excellent platform for distributed classification. However, two important aspects of the learning environment have often been overlooked by other works, viz., 1) location of the peers which results in variable communication cost and 2) heterogeneity of the peers’ data which can help reduce redundant communication. In this paper, we examine the properties of network and data heterogeneity and propose a simple yet efficient P2P classification approach that minimizes expensive inter-region communication while achieving good generalization performance. Experimental results demonstrate the feasibility and effectiveness of the proposed solution.

Among the other claims for Satrap:

  • achieves the best accuracy-to-communication cost ratio given that data exchange is performed to improve global accuracy.
  • allows users to control the trade-off between accuracy and communication cost with the user-specified parameters.

I find these two the most interesting.

In part because semantic integration, whether explicit or not, is always a question of cost ratio and tradeoffs.

It would be refreshing to see papers that say which kinds of semantic integration would be too costly with method X or are not possible with method Y.

The Matching Web (semantics supplied by users)?

Thursday, September 2nd, 2010

Why do we call it “The Semantic Web”? The web is nothing but a collection of electronic files. Where is the “semantic” in those files? (Even with linking, same question.)

Where was the “semantic” in Egyptian hieroglyphic texts? They had one semantic in the view of Horapollo, Athanasius Kircher, and others. They have a different semantic in the view of later researchers such as Jean-François Champollion (see the Wikipedia article Egyptian Hieroglyphics).

Same text, different semantics. With the later ones viewed as being “correct.” Yet it would be essential to record the hieroglyphic semantics of Kircher to understand the discussions of his contemporaries and those who relied on his work. One text, multiple semantics.

All our search, reasoning, etc., engines can do is to mechanically apply patterns and return content to us. The returned content has no known “semantic” until we assign it one. Different users may supply different semantics to the same content.

Perhaps a better name would be “The Matching Web (semantics supplied by users)”.*


*Then we could focus on managing the semantics supplied by users. A different task than the one underway in the “Semantic Web” at present.

Don’t Scrap It, Wrap It! A Wrapper Architecture for Legacy Data Sources (1997)

Wednesday, July 28th, 2010

Don’t Scrap It, Wrap It! A Wrapper Architecture for Legacy Data Sources (1997) by Mary Tork Roth isn’t the latest word on wrappers but is well written. (Longer version: A Wrapper Architecture for Legacy Data Sources (1997).)

The wrapper idea is a good one, although Roth uses it in the context of a unified schema, which is then queried. With a topic map, you could query on the basis of any of the underlying schemas and get the data from all the underlying data sources.

That result is possible because a topic map has one representative for a subject and can have any number of sources for information about that single subject.

I haven’t done a user survey but suspect most users would prefer to search for/access data using familiar schemas rather than new “unified” schemas.
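A minimal sketch of querying through any of the underlying schemas follows. It is not Roth's wrapper architecture and not a topic map engine; the schema names, field names and the shared `subject_id` are hypothetical. In particular, it assumes the hard part, knowing that two records are about the same subject, has already been done.

```python
from collections import defaultdict

# Hypothetical: each legacy schema's own field name for the same property.
FIELD_MAP = {
    "hr":      {"emp_name": "name", "dept": "department"},
    "payroll": {"full_name": "name", "cost_center": "department"},
}


def build_proxies(rows):
    """Merge (schema, subject_id, record) rows into one proxy per subject."""
    proxies = defaultdict(lambda: defaultdict(set))
    for schema, subject_id, record in rows:
        for field, value in record.items():
            prop = FIELD_MAP[schema].get(field)
            if prop is not None:
                proxies[subject_id][prop].add(value)
    return proxies


def query(proxies, schema, field, value):
    """Query with one schema's field name; match values from every source."""
    prop = FIELD_MAP[schema][field]
    return sorted(
        subject_id
        for subject_id, props in proxies.items()
        if value in props.get(prop, ())
    )


rows = [
    ("hr", "e1", {"emp_name": "A. Jones", "dept": "R&D"}),
    ("payroll", "e1", {"full_name": "Alice Jones", "cost_center": "4100"}),
    ("payroll", "e2", {"full_name": "Bob Lee", "cost_center": "4100"}),
]
proxies = build_proxies(rows)

# An HR user asks in HR terms and still finds values held only by payroll.
by_dept = query(proxies, "hr", "dept", "4100")
by_name = query(proxies, "payroll", "full_name", "A. Jones")
```

The user never sees a new "unified" schema. The mapping sits behind the familiar field names, and each proxy keeps every value from every source rather than choosing one.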

When Federated Search Bites (Jeff Jonas)

Monday, July 26th, 2010

When Federated Search Bites by Jeff Jonas is a bit of a rant but makes a number of telling points.

I think topic maps qualify as federated fetch to use Jonas’ terminology.

Not surprising since I think of topic maps as navigational overlays (where navigation includes subject sameness) and not as a data storage format.

But there is a lot of interest in topic map software that stores data locally.

Both approaches work and have different advantages. Has anyone outlined how you would choose between those two approaches?


Monday, July 26th, 2010

OneSource describes itself as:

OneSource is an evolving data analysis and exploration tool used internally by the USAF Air Force Command and Control Integration Center (AFC2IC) Vocabulary Services Team, and provided at no additional cost to the greater Department of Defense (DoD) community. It empowers its users with a consistent view of syntactical, lexical, and semantic data vocabularies through a community-driven web environment, directly supporting the DoD Net-Centric Data Strategy of visible, understandable, and accessible data assets.

Video guides to the site:

OneSource includes 158 vocabularies of interest to the greater U.S. Department of Defense (DoD) community. (My first post to answer Lars Heuer’s question “…where is the money?”)

Following posts will explore OneSource and what we can learn from each other.

Lost In Translation – Article

Sunday, July 25th, 2010

Lost In Translation is a summary of recent research on language and its impact on our thinking by Lera Boroditsky (Professor of psychology at Stanford University and editor in chief of Frontiers in Cultural Psychology).

Read the article for the details but concepts such as causality, space and others aren’t as fixed as you may have thought.

Another teaser:

It turns out that if you change how people talk, that changes how they think. If people learn another language, they inadvertently also learn a new way of looking at the world. When bilingual people switch from one language to another, they start thinking differently, too.

Topic maps show different ways to identify the same subject. Put enough alternative identifications together and you will learn to think in another language.

Question: Should topic maps come with the following warning?

Caution: Topic Map – You May Start Thinking Differently