Archive for the ‘Semantic Diversity’ Category

FuturICT:… [No Semantics?]

Monday, May 6th, 2013

FuturICT: Participatory Computing for Our Complex World

From the FuturICT FET Flagship Project Summary:

FuturICT is a visionary project that will deliver new science and technology to explore, understand and manage our connected world. This will inspire new information and communication technologies (ICT) that are socially adaptive and socially interactive, supporting collective awareness.

Revealing the hidden laws and processes underlying our complex, global, socially interactive systems constitutes one of the most pressing scientific challenges of the 21st Century. Integrating complexity science with ICT and the social sciences, will allow us to design novel robust, trustworthy and adaptive technologies based on socially inspired paradigms. Data from a variety of sources will help us to develop models of techno-socioeconomic systems. In turn, insights from these models will inspire a new generation of socially adaptive, self-organised ICT systems. This will create a paradigm shift and facilitate a symbiotic co-evolution of ICT and society. In response to the European Commission’s call for a ‘Big Science’ project, FuturICT will build a largescale, pan European, integrated programme of research which will extend for 10 years and beyond.

Did you know that the term “semantic” appears only twice in the FuturICT Project Outline? And both times as in the “semantic web?”

Not a word of how models, data sources, paradigms, etc., with different semantics are going to be wedded into a coherent whole.

View it as an opportunity to deliver FuturICT results using topic maps beyond this project.

Have you used Lua for MapReduce?

Wednesday, May 1st, 2013

Have you used Lua for MapReduce?

From the post:

Lua as a cross platform programming language has been popularly used in games and embedded systems. However, due to its excellent use for configuration, it has found wider acceptance in other user cases as well.

Lua was inspired from SOL (Simple Object Language) and DEL(Data-Entry Language) and created by Roberto Ierusalimschy, Waldemar Celes, and Luiz Henrique de Figueiredo at the Pontifical Catholic University of Rio de Janeiro, Brazil. Roughly translated to ‘Moon’ in Portuguese, it has found many big takers like Adobe, Nginx, Wikipedia.

Another scripting language to use with MapReduce and Hadoop.

Have you ever noticed the Tower of Babel seems to follow human activity around?

First, it was building a tower to heaven – confuse the workforce.

Then it was other community efforts.

And many, many thens later, it has arrived at MapReduce/Hadoop configuration languages.

Like a kaleidoscope, it just gets richer the more semantic diversity we add.

Do you wonder what the opposite of semantic diversity must look like?

Or if we are the cause, what would it mean to eliminate semantic diversity?

LevelGraph [Graph Databases and Semantic Diversity]

Sunday, April 28th, 2013

LevelGraph

From the webpage:

LevelGraph is a Graph Database. Unlike many other graph database, LevelGraph is built on the uber-fast key-value store LevelDB through the powerful LevelUp library. You can use it inside your node.js application.

LevelGraph loosely follows the Hexastore approach as presente in the article: Hexastore: sextuple indexing for semantic web data management C Weiss, P Karras, A Bernstein – Proceedings of the VLDB Endowment, 2008. Following this approach, LevelGraph uses six indices for every triple, in order to access them as fast as it is possible.
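To make the six-index idea concrete, here is a minimal sketch in Python. It is not LevelGraph's code or API (LevelGraph is a node.js library); the class name `TinyHexastore` and its `put`/`get` methods are my own inventions for illustration. Sorted in-memory lists stand in for the ordered keys of a store like LevelDB, and values are assumed to be strings. Each triple is written once per ordering of subject, predicate and object, so any combination of bound positions becomes a prefix scan on one of the six indexes.

```python
from bisect import bisect_left, insort
from itertools import permutations


class TinyHexastore:
    """Toy in-memory triple store that keeps six sorted indexes (illustrative only)."""

    # The six orderings of subject (s), predicate (p) and object (o):
    # spo, sop, pso, pos, osp, ops.
    ORDERS = ["".join(p) for p in permutations("spo")]

    def __init__(self):
        # Each sorted list stands in for an ordered key range in a key-value store.
        self.indexes = {order: [] for order in self.ORDERS}
        self._seen = set()

    def put(self, s, p, o):
        """Store one triple under all six orderings (duplicates are ignored)."""
        if (s, p, o) in self._seen:
            return
        self._seen.add((s, p, o))
        triple = {"s": s, "p": p, "o": o}
        for order in self.ORDERS:
            insort(self.indexes[order], tuple(triple[k] for k in order))

    def get(self, s=None, p=None, o=None):
        """Return matching triples by prefix-scanning the best-suited index."""
        bound = {k: v for k, v in (("s", s), ("p", p), ("o", o)) if v is not None}
        n = len(bound)
        # Pick an ordering whose first n positions are exactly the bound ones.
        order = next(x for x in self.ORDERS if set(x[:n]) == set(bound))
        prefix = tuple(bound[k] for k in order[:n])
        index = self.indexes[order]
        results = []
        i = bisect_left(index, prefix)
        while i < len(index) and index[i][:n] == prefix:
            row = dict(zip(order, index[i]))
            results.append((row["s"], row["p"], row["o"]))
            i += 1
        return sorted(results)


db = TinyHexastore()
db.put("alice", "knows", "bob")
db.put("alice", "knows", "carol")
db.put("bob", "knows", "carol")
who_knows_carol = db.get(p="knows", o="carol")
```

The trade-off is the usual one: six writes per triple buy a single range scan for any query pattern.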

The family of graph databases gains another member.

The growth of graph database offerings is evidence the effort to reduce semantic diversity is a fool’s errand.

It isn’t hard to find graph database projects, yet new ones appear on a regular basis.

With every project starting over with the basic issues of graph representation and algorithms.

The reasons for that diversity are likely as diverse as the diversity itself.

If the world has been diverse, remains diverse, and the evidence is that it will continue to be diverse, what are the odds of winning a fight against diversity?

That’s what I thought.

Topic maps, embracing diversity.

I first saw this in a tweet by Frank Denis.

Collaborative annotation… [Human + Machine != Semantic Monotony]

Sunday, April 21st, 2013

Collaborative annotation for scientific data discovery and reuse by Kirk Borne. (Borne, K. (2013), Collaborative annotation for scientific data discovery and reuse. Bul. Am. Soc. Info. Sci. Tech., 39: 44–45. doi: 10.1002/bult.2013.1720390414)

Abstract:

Human classification alone, unable to handle the enormous quantity of project data, requires the support of automated machine-based strategies. In collaborative annotation, humans and machines work together, merging editorial strengths in semantics and pattern recognition with the machine strengths of scale and algorithmic power. Discovery informatics can be used to generate common data models, taxonomies and ontologies. A proposed project of massive scale, the Large Synoptic Survey Telescope (LSST) project, will systematically observe the southern sky over 10 years, collecting petabytes of data for analysis. The combined work of professional and citizen scientists will be needed to tag the discovered astronomical objects. The tag set will be generated through informatics and the collaborative annotation efforts of humans and machines. The LSST project will demonstrate the development and application of a classification scheme that supports search, curation and reuse of a digital repository.

A persuasive call to arms to develop “collaborative annotation:”

Humans and machines working together to produce the best possible classification label(s) is collaborative annotation. Collaborative annotation is a form of human computation [1]. Humans can see patterns and semantics (context, content and relationships) more quickly, accurately and meaningfully than machines. Human computation therefore applies to the problem of annotating, labeling and classifying voluminous data streams.

And more specifically for the Large Synoptic Survey Telescope (LSST):

The discovery potential of this data collection would be enormous, and its long-term value (through careful data management and curation) would thus require (for maximum scientific return) the participation of scientists and citizen scientists as well as science educators and their students in a collaborative knowledge mark-up (annotation and tagging) data environment. To meet this need, we envision a collaborative tagging system called AstroDAS (Astronomy Distributed Annotation System). AstroDAS is similar to existing science knowledge bases, such as BioDAS (Biology Distributed Annotation System, www.biodas.org).

As you might expect, semantic diversity is going to be present with “collaborative annotation.”

Semantic Monotony (aka Semantic Web) has failed for machines alone.

No question it will fail for humans + machines.

Are you ready to step up to the semantic diversity of collaborative annotation (humans + machines)?

Semantic Search Over The Web (SSW 2013)

Monday, March 18th, 2013

3rd International Workshop on Semantic Search Over The Web (SSW 2013)

Dates:

Abstract Papers submission: May 31, 2013 – 15:00 (3:00 pm) EDT
(Short) Full Paper submission: June 7, 2013 – 15:00 (3:00 pm) EDT
Author notification: July 19, 2013
Camera-ready copy due: August 2, 2013
Workshop date: During VLDB (Aug 26 – Aug 30)

From the webpage:

We are witnessing a smooth evolution of the Web from a worldwide information space of linked documents to a global knowledge base, composed of semantically interconnected resources. To date, the correlated and semantically annotated data available on the web amounts to 25 billion RDF triples, interlinked by around 395 million RDF links. The continuous publishing and the integration of the plethora of semantic datasets from companies, government and public sector projects is leading to the creation of the so-called Web of Knowledge. Each semantic dataset contributes to extend the global knowledge and increases its reasoning capabilities. As a matter of facts, researchers are now looking with growing interest to semantic issues in this huge amount of correlated data available on the Web. Many progresses have been made in the field of semantic technologies, from formal models to repositories and reasoning engines. While the focus of many practitioners is on exploiting such semantic information to contribute to IR problems from a document centric point of view, we believe that such a vast, and constantly growing, amount of semantic data raises data management issues that must be faced in a dynamic, highly distributed and heterogeneous environment such as the Web.

The third edition of the International Workshop on Semantic Search over the Web (SSW) will discuss about data management issues related to the search over the web and the relationships with semantic web technologies, proposing new models, languages and applications.

The research issues can be summarized by the following problems:

  • How can we model and efficiently access large amounts of semantic web data?
  • How can we effectively retrieve information exploiting semantic web technologies?
  • How can we employ semantic search in real world scenarios?

The SSW Workshop invites researchers, engineers, service developers to present their research and works in the field of data management for semantic search. Papers may deal with methods, models, case studies, practical experiences and technologies.

Apologies for the uncertainty of the workshop date. (There is confusion about the date on the workshop site: one place says the 26th, the other the 30th. Check before you make reservations or travel arrangements.)

I differ with the organizers on some issues but on the presence of: “…data management issues that must be faced in a dynamic, highly distributed and heterogeneous environment such as the Web,” there is no disagreement.

That’s the trick, isn’t it? In any confined or small group setting, just about any consistent semantic solution will work.

The hurly-burly of a constant stream of half-heard, partially understood communications across distributed and heterogeneous systems tests the true mettle of semantic solutions.

Not a quest for perfect communication but “good enough.”

MetaNetX.org…

Saturday, March 16th, 2013

MetaNetX.org: a website and repository for accessing, analysing and manipulating metabolic networks by Mathias Ganter, Thomas Bernard, Sébastien Moretti, Joerg Stelling and Marco Pagni. (Bioinformatics (2013) 29 (6): 815-816. doi: 10.1093/bioinformatics/btt036)

Abstract:

MetaNetX.org is a website for accessing, analysing and manipulating genome-scale metabolic networks (GSMs) as well as biochemical pathways. It consistently integrates data from various public resources and makes the data accessible in a standardized format using a common namespace. Currently, it provides access to hundreds of GSMs and pathways that can be interactively compared (two or more), analysed (e.g. detection of dead-end metabolites and reactions, flux balance analysis or simulation of reaction and gene knockouts), manipulated and exported. Users can also upload their own metabolic models, choose to automatically map them into the common namespace and subsequently make use of the website’s functionality.

http://metanetx.org.

The authors are addressing a familiar problem:

Genome-scale metabolic networks (GSMs) consist of compartmentalized reactions that consistently combine biochemical, genetic and genomic information. When also considering a biomass reaction and both uptake and secretion reactions, GSMs are often used to study genotype–phenotype relationships, to direct new discoveries and to identify targets in metabolic engineering (Karr et al., 2012). However, a major difficulty in GSM comparisons and reconstructions is to integrate data from different resources with different nomenclatures and conventions for both metabolites and reactions. Hence, GSM consolidation and comparison may be impossible without detailed biological knowledge and programming skills. (emphasis added)

For which they propose an uncommon solution:

MetaNetX.org is implemented as a user-friendly and self-explanatory website that handles all user requests dynamically (Fig. 1a). It allows a user to access a collection of hundreds of published models, browse and select subsets for comparison and analysis, upload or modify new models and export models in conjunction with their results. Its functionality is based on a common namespace defined by MNXref (Bernard et al., 2012). In particular, all repository or user uploaded models are automatically translated with or without compartments into the common namespace; small deviations from the original model are possible due to the automatic reconciliation steps implemented by Bernard et al. (2012). However, a user can choose not to translate his model but still make use of the website’s functionalities. Furthermore, it is possible to augment the given reaction set by user-defined reactions, for example, for model augmentation.
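A rough sketch of what "translation into a common namespace" can look like, in Python. This is not MetaNetX or MNXref code: the identifiers (`dbA:glc`, `dbB:C0001`, `common:M1`, and so on), the `XREF` table and the `translate_model` function are all made up for illustration. Each reaction is a dictionary of metabolite identifier to stoichiometric coefficient. Identifiers with no known mapping are kept as they are and reported, which loosely mirrors the option of leaving a model untranslated.

```python
# Hypothetical cross-reference table: source identifier -> common identifier.
XREF = {
    "dbA:glc": "common:M1",
    "dbB:C0001": "common:M1",
    "dbA:g6p": "common:M2",
    "dbB:C0002": "common:M2",
}


def translate_model(reactions, xref):
    """Rewrite metabolite ids into the common namespace.

    Returns the translated reactions and the set of ids that had no mapping.
    """
    translated, unmapped = [], set()
    for reaction in reactions:
        new_reaction = {}
        for metabolite, coefficient in reaction.items():
            common = xref.get(metabolite)
            if common is None:
                unmapped.add(metabolite)
                common = metabolite  # keep the original id rather than guess
            # Two source ids may collapse into one common id, so sum coefficients.
            new_reaction[common] = new_reaction.get(common, 0) + coefficient
        translated.append(new_reaction)
    return translated, unmapped


# Two toy models that describe the same reaction with different nomenclatures.
model_a = [{"dbA:glc": -1, "dbA:g6p": 1}]
model_b = [
    {"dbB:C0001": -1, "dbB:C0002": 1},
    {"dbB:C0002": -1, "dbB:C0099": 1},
]

translated_a, unmapped_a = translate_model(model_a, XREF)
translated_b, unmapped_b = translate_model(model_b, XREF)

# Once both models share a namespace, comparison is a set intersection.
shared = {frozenset(r.items()) for r in translated_a} & {
    frozenset(r.items()) for r in translated_b
}
```

Note where the semantic work lives: in the cross-reference table, not in the code. Someone had to decide that `dbA:glc` and `dbB:C0001` name the same subject.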

The bioinformatics community recognizes the intellectual poverty of lock step models.

Wonder when the intelligence community is going to have that “a ha” moment?

Hadoop Adds Red Hat [More Hadoop Silos Coming]

Friday, February 22nd, 2013

Red Hat Unveils Big Data and Open Hybrid Cloud Direction

From the post:

Red Hat, Inc. (NYSE: RHT), the world’s leading provider of open source solutions, today announced its big data direction and solutions to satisfy enterprise requirements for highly reliable, scalable, and manageable solutions to effectively run their big data analytics workloads. In addition, Red Hat announced that the company will contribute its Red Hat Storage Hadoop plug-in to the Apache™ Hadoop® open community to transform Red Hat Storage into a fully-supported, Hadoop-compatible file system for big data environments, and that Red Hat is building a robust network of ecosystem and enterprise integration partners to deliver comprehensive big data solutions to enterprise customers. This is another example of Red Hat’s strategic commitment to big data customers and its continuing efforts to provide them with enterprise solutions through community-driven innovation.

The more Hadoop grows, the more Hadoop silos will as well.

You will need Hadoop and semantic skills to wire Hadoop silos together.

Re-wire with topic maps to avoid re-wiring the same Hadoop silos over and over again.

I first saw this at Red Hat reveal big data plans, open sources HDFS replacement by Elliot Bentley.

Hadoop silos need integration…

Thursday, February 21st, 2013

Hadoop silos need integration, manage all data as asset, say experts by Brian McKenna.

From the post:

Big data hype has caused infantile disorders in corporate organisations over the past year. Hadoop silos, an excess of experimentation, and an exaggeration of the importance of data scientists are among the teething problems of big data, according to experts, who suggest organisations should manage all data as an asset.

Steve Shelton, head of data services at consultancy Detica, part of BAE Systems, said Hadoop silos have become part of the enterprise IT landscape, both in the private and public sectors. “People focused on this new thing called big data and tried to isolate it [in 2011 and 2012],” he said.

The focus has been too concentrated on non-traditional data types, and that has been driven by the suppliers. The business value of data is more effectively understood when you look at it all together, big or otherwise, he said.

Have big data technologies been a distraction? “I think it has been an evolutionary learning step, but businesses are stepping back now. When it comes to information governance, you have to look at data across the patch,” said Shelton.

He said Detica had seen complaints about Hadoop silos, and these were created by people going through a proof-of-concept phase, setting up a Hadoop cluster quickly and building a team. But a Hadoop platform involves extra costs on top, in terms of managing it and integrating it into your existing business processes.

“It’s not been a waste of time and money, it is just a stage. And it is not an insurmountable challenge. The next step is to integrate those silos, but the thinking is immature relative to the technology itself,” said Shelton.

I take this as encouraging news for topic maps.

Semantically diverse data has been stored in semantically diverse datastores. Data which, if integrated, could provide business value.

Again.

There will always be a market for topic maps because people can’t stop creating semantically diverse data and data stores.

How’s that for long term market security?

No matter what data or data storage technology arises, semantic inconsistency will be with us always.

Saving the “Semantic” Web (part 4)

Wednesday, February 13th, 2013

Democracy vs. Aristocracy

Part of a recent comment on this series reads:

What should we have been doing instead of the semantic web? ISO Topic Maps? There is some great work in there, but has it been a better success?

That is an important question and I wanted to capture it outside of comments on a prior post.

Earlier in this series of posts I pointed out the success of HTML, especially when contrasted with Semantic Web proposals.

Let me hasten to add the same observation is true for ISO Topic Maps (HyTime or later versions).

The critical difference between HTML (the early and quite serviceable versions) and Semantic Web/Topic Maps is that the former democratizes communication and the latter fosters a technical aristocracy.

Every user who can type and some who hunt-n-peck, can author HTML and publish their content for others around the world to read, discuss, etc.

That is a very powerful and democratizing notion about content creation.

The previous guardians, gate keepers, insiders, and their familiars, who didn’t add anything of value to prior publications processes, are still reeling from the blow.

Even as old aristocracies crumble, new ones evolve.

Technical aristocracies for example. A phrase relevant to both the Semantic Web and ISO Topic Maps.

Having tasted freedom, the crowds aren’t as accepting of the lash/leash as they once were. Nor of the aristocracies who would wield them. Nor should they be.

Which makes me wonder: Why the emphasis on creating dumbed down semantics for computers?

We already have billions of people who are far more competent semantically than computers.

Where are our efforts to enable them to traverse the different semantics of other users?

Such as the semantics of the aristocrats who have self-anointed themselves to labor on their behalf?

If you have guessed that I have little patience with aristocracies, you are right in one.

I came by that aversion honestly.

I practiced law in a civilian jurisdiction for a decade. A specialist language, law, can be more precise, but it also excludes others from participation. The same experience was true when I studied theology and ANE languages. A bit later, in markup technologies (then SGML/HyTime), the same lesson was repeated. What I do with ODF and topic maps are two more specialized languages.

Yet a reasonably intelligent person can discuss issues in any of those fields, if they can get past the language barriers aristocrats take so much comfort in maintaining.

My answer to what we should be doing is:

Looking for ways to enable people to traverse and enjoy the semantic diversity that accounts for the richness of the human experience.

PS: Computers have a role to play in that quest, but a subordinate one.


Content-Based Image Retrieval at the End of the Early Years

Tuesday, January 22nd, 2013

Content-Based Image Retrieval at the End of the Early Years by Arnold W.M. Smeulders, Marcel Worring, Simone Santini, Amarnath Gupta, and Ramesh Jain. (Smeulders, A.W.M.; Worring, M.; Santini, S.; Gupta, A.; Jain, R., “Content-based image retrieval at the end of the early years,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 12, pp. 1349-1380, Dec 2000. doi: 10.1109/34.895972)

Abstract:

Presents a review of 200 references in content-based image retrieval. The paper starts with discussing the working conditions of content-based retrieval: patterns of use, types of pictures, the role of semantics, and the sensory gap. Subsequent sections discuss computational steps for image retrieval systems. Step one of the review is image processing for retrieval sorted by color, texture, and local geometry. Features for retrieval are discussed next, sorted by: accumulative and global features, salient points, object and shape features, signs, and structural combinations thereof. Similarity of pictures and objects in pictures is reviewed for each of the feature types, in close connection to the types and means of feedback the user of the systems is capable of giving by interaction. We briefly discuss aspects of system engineering: databases, system architecture, and evaluation. In the concluding section, we present our view on: the driving force of the field, the heritage from computer vision, the influence on computer vision, the role of similarity and of interaction, the need for databases, the problem of evaluation, and the role of the semantic gap.

Excellent survey article from 2000 (not 2002 as per the Ostermann paper).

I think you will appreciate the treatment of the “semantic gap,” both in terms of its description as well as ways to address it.

If you are using annotated images in your topic map application, definitely a must read.

User evaluation of automatically generated keywords and toponyms… [of semantic gaps]

Tuesday, January 22nd, 2013

User evaluation of automatically generated keywords and toponyms for geo-referenced images by Frank O. Ostermann, Martin Tomko, Ross Purves. (Ostermann, F. O., Tomko, M. and Purves, R. (2013), User evaluation of automatically generated keywords and toponyms for geo-referenced images. J. Am. Soc. Inf. Sci.. doi: 10.1002/asi.22738)

Abstract:

This article presents the results of a user evaluation of automatically generated concept keywords and place names (toponyms) for geo-referenced images. Automatically annotating images is becoming indispensable for effective information retrieval, since the number of geo-referenced images available online is growing, yet many images are insufficiently tagged or captioned to be efficiently searchable by standard information retrieval procedures. The Tripod project developed original methods for automatically annotating geo-referenced images by generating representations of the likely visible footprint of a geo-referenced image, and using this footprint to query spatial databases and web resources. These queries return raw lists of potential keywords and toponyms, which are subsequently filtered and ranked. This article reports on user experiments designed to evaluate the quality of the generated annotations. The experiments combined quantitative and qualitative approaches: To retrieve a large number of responses, participants rated the annotations in standardized online questionnaires that showed an image and its corresponding keywords. In addition, several focus groups provided rich qualitative information in open discussions. The results of the evaluation show that currently the annotation method performs better on rural images than on urban ones. Further, for each image at least one suitable keyword could be generated. The integration of heterogeneous data sources resulted in some images having a high level of noise in the form of obviously wrong or spurious keywords. The article discusses the evaluation itself and methods to improve the automatic generation of annotations.

An echo of Steve Newcomb’s semantic impedance appears at:

Despite many advances since Smeulders et al.’s (2002) classic paper that set out challenges in content-based image retrieval, the quality of both nonspecialist text-based and content-based image retrieval still appears to lag behind the quality of specialist text retrieval, and the semantic gap, identified by Smeulders et al. as a fundamental issue in content-based image retrieval, remains to be bridged. Smeulders defined the semantic gap as

the lack of coincidence between the information that one can extract from the visual data and the interpretation that the same data have for a user in a given situation. (p. 1353)

In fact, text-based systems that attempt to index images based on text thought to be relevant to an image, for example, by using image captions, tags, or text found near an image in a document, suffer from an identical problem. Since text is being used as a proxy by an individual in annotating image content, those querying a system may or may not have similar worldviews or conceptualizations as the annotator. (emphasis added)

That last sentence could have come out of a topic map book.

Curious what you make of the authors’ claim that spatial locations provide an “external context” that bridges the “semantic gap?”

If we all use the same map of spatial locations, are you surprised by the lack of a “semantic gap?”

The Twitter of Babel: Mapping World Languages through Microblogging Platforms

Friday, December 21st, 2012

The Twitter of Babel: Mapping World Languages through Microblogging Platforms by Delia Mocanu, Andrea Baronchelli, Bruno Gonçalves, Nicola Perra, Alessandro Vespignani.

Abstract:

Large scale analysis and statistics of socio-technical systems that just a few short years ago would have required the use of consistent economic and human resources can nowadays be conveniently performed by mining the enormous amount of digital data produced by human activities. Although a characterization of several aspects of our societies is emerging from the data revolution, a number of questions concerning the reliability and the biases inherent to the big data “proxies” of social life are still open. Here, we survey worldwide linguistic indicators and trends through the analysis of a large-scale dataset of microblogging posts. We show that available data allow for the study of language geography at scales ranging from country-level aggregation to specific city neighborhoods. The high resolution and coverage of the data allows us to investigate different indicators such as the linguistic homogeneity of different countries, the touristic seasonal patterns within countries and the geographical distribution of different languages in multilingual regions. This work highlights the potential of geolocalized studies of open data sources to improve current analysis and develop indicators for major social phenomena in specific communities.

So, rather than being held to languages that are homogeneous on the surface, users can use their own natural, heterogeneous languages, which we can analyze as such?

Cool!

Semantic and linguistic heterogeneity has persisted from the original Tower of Babel until now.

The smart money will be riding on managing semantic and linguistic heterogeneity.

Other money can fund emptying the semantic ocean with a tea cup.

Most developers don’t really know any computer language

Saturday, November 17th, 2012

Most developers don’t really know any computer language by Derek Jones.

From the post:

What does it mean to know a language? I can count to ten in half a dozen human languages, say please and thank you, tell people I’m English and a few other phrases that will probably help me get by; I don’t think anybody would claim that I knew any of these languages.

It is my experience that most developers’ knowledge of the programming languages they use is essentially template based; they know how to write a basic instances of the various language constructs such as loops, if-statements, assignments, etc and how to define identifiers to have a small handful of properties, and they know a bit about how to glue these together.


[the topic map part]

discussions with developers: individuals and development groups invariabily have their own terminology for programming language constructs (my use of terminology appearing in the language definition usually draws blank stares and I have to make a stab at guessing what the local terms mean and using them if I want to be listened to); asking about identifier scoping or type compatibility rules (assuming that either of the terms ‘scope’ or ‘type compatibility’ is understood) usually results in a vague description of specific instances (invariably the commonly encountered situations),

What?! Semantic diversity in computer languages? Or at least as they are understood by programmers?

;-)

I don’t see the problem with appreciating semantic diversity for the richness it offers.

There are use cases where semantic diversity interferes with some other requirement. Such as in accounting systems that depend upon normalized data for auditing purposes.

There are other use cases, such as the history of ideas, that depend upon preservation of the trail of semantic diversity as part of the narrative of such histories.

And there are cases that fall in between, where the benefits of diverse points of view must be weighed against the cost of creating and maintaining a mapping between diverse viewpoints.

All of those use cases recognize that semantic diversity is the starting point. That is, semantic diversity is always with us, and the real question is the cost of its control for some particular use case.

I don’t view: “My software works if all users abandon semantic diversity.” as a use case. It is a confession of defective software.

I first saw this in a tweet from Computer Science Fact.

Taming Big Data Is Not a Technology Issue [Knuth Exercise Rating]

Friday, November 16th, 2012

Taming Big Data Is Not a Technology Issue by Bill Franks.

From the post:

One thing that has struck me recently is that most of the focus when discussing big data is upon the technologies involved. The consensus seems to be that the biggest challenge with big data is a technological one, yet I don’t believe this to be the case. Sure, there are challenges today for organizations using big data, but, I would like to submit to you that technology is not the biggest problem. In fact, technology may be one of the easiest problems to solve when it comes time to tame big data.

The fact is that there are tools and technologies out there that can handle virtually all of the big data needs of the vast majority of organizations. As of today, you can find products and solutions that do whatever you need to do with big data. Technology itself is not the problem.

Then, what are the issues? The real problems are with resource availability, skills, process change, politics, and culture. While the technologies to solve your problems may be out there just waiting for you to implement them, it isn’t quite that easy, is it? You have to get budget, you have to do an implementation, you have to get your people up to speed on how to use the tools, you have to get buy in from various stakeholders, and you have to push against a culture averse to change.

The technology is right there, but you are unable to effectively put it to work. It FEELS like a technology issue since technology is front and center. However, it is really the cultural, people, and political issues surrounding the technology that are the problem. Let me illustrate with an example.

A refreshing view of the drive to build technology to “solve” the big data problem.

Once terabytes of data are accessible as soon as they enter the data stream, for real time, reactive analysis, with n-dimensional graphic representations as a matter of course, the “big data” problem will still be the “big data” problem.

The often cited “volume, velocity, variety” characteristics of “big data” are surface issues that, in one manner or another, can be addressed using technology. Now.

A deeper, more persistent problem is that users expect their data, big or small, to have semantics. Whether express or implied. That problem, along with the others cited by Franks, has no technological solution.

Because semantics originate with us and not with our machines.

By all means, we need to solve the technology issues around “big data,” but that only gives us a start towards working on the more difficult problems, problems that originate with us.

A much harder “programming” exercise. I suspect on Knuth’s scale of exercises, an 80 or 90.

People and Process > Prescription and Technology

Monday, October 15th, 2012

Factors that affect software systems development project outcomes: A survey of research by Laurie McLeod and Stephen G. MacDonell. ACM Computing Surveys (CSUR), Volume 43, Issue 4, October 2011, Article No. 24, DOI: 10.1145/1978802.1978803.

Abstract:

Determining the factors that have an influence on software systems development and deployment project outcomes has been the focus of extensive and ongoing research for more than 30 years. We provide here a survey of the research literature that has addressed this topic in the period 1996–2006, with a particular focus on empirical analyses. On the basis of this survey we present a new classification framework that represents an abstracted and synthesized view of the types of factors that have been asserted as influencing project outcomes.

As with most surveys, particularly ones that summarize 177 papers, this is a long article, some fifty-six pages.

Let me try to tempt you into reading it by quoting from Angelica de Antonio’s review of it (in Computing Reviews, Oct. 2012):

An interesting discussion about the very concept of project outcome precedes the survey of factors, and an even more interesting discussion follows it. The authors stress the importance of institutional context in which the development project takes place (an aspect almost neglected in early research) and the increasing evidence that people and process have a greater effect on project outcomes than technology. A final reflection on what projects still continue to fail—even if we seem to know the factors that lead to success—raises a question on the utility of prescriptive factor-based research and leads to considerations that could inspire future research. (emphasis added)

Before you run off to the library or download a copy of the survey, two thoughts to keep in mind:

First, if “people and process” are more important than technology, where should we place the emphasis in projects involving semantics?

Second, if “prescription” can’t cure project failure, what are its chances with semantic diversity?

Thoughts?

Appropriating IT: Glue Steps [Gluing Subject Representatives Together?]

Tuesday, October 9th, 2012

Appropriating IT: Glue Steps by Tony Hirst.

Over the years, I’ve been fortunate enough to have been gifted some very evocative, and powerful, ideas that immediately appealed to me when I first heard them and that I’ve been able to draw on, reuse and repurpose over and over again. One such example is “glue logic”, introduced to me by my original OU PhD supervisor George Kiss. The idea of glue logic is to provide a means by which two digital electronic circuits (two “logic” circuits) that don’t share a common interface can be “glued” together.

(diagrams and other material omitted)

This idea is powerful enough in its own right, but there was a second bit to it that made it really remarkable: the circuitry typically used to create the glue logic was a device known as a Field Programmable Gate Array, or FPGA. This is a type of digital circuit whose logical function can be configured, or programmed. That is, I can take my “shapeless” FPGA, and programme it so that it physically implements a particular digital circuit. Just think about that for a moment… You probably have a vague idea that the same computer can be reprogrammed to do particular things, using some vaguely mysterious and magical thing called software, instructions that computer processors follow in order to do incredible things. With an FPGA, the software actually changes the hardware: there is no processor that “runs a programme”; when you programme an FPGA, you change its hardware. FPGAs are, literally, programmable chips. (If you imagine digital circuits to be like bits of plastic, an FPGA is like polymorph.)

The notion of glue logic has stuck with me for two reasons, I think: firstly, because of what it made possible, the idea of flexibly creating an interface between two otherwise incompatible components; secondly, because of the way in which it could be achieved – using a flexible, repurposable, reprogrammable device – one that you could easily reprogramme if the mapping from one device to another wasn’t quite working properly.

If instead of “don’t share a common interface” you read “semantic diversity” and in place of Field Programmable Gate Array, or FPGA, you read “legend,” to “creat[e] an interface between two otherwise incompatible [subject representatives],” you would think Tony’s post was about the topic maps reference model.

Well, this post is and Tony’s is very close.

Particularly the part about being a “reprogrammable device.”

I can tell you: “black” = “schwarz,” but without more, you won’t be able to rely on or extend that statement.

For that, you need a “reprogrammable device” and some basis on which to do the reprogramming.

Legends anyone?

A Good Example of Semantic Inconsistency [C-Suite Appropriate]

Tuesday, October 9th, 2012

A Good Example of Semantic Inconsistency by David Loshin.

You can guide users through the intellectual minefield of Frege, Peirce, Russell, Carnap, Sowa and others to illustrate the need for topic maps, with stunning (as in daunting) graphics.

Or, you can use David’s story:

I was at an event a few weeks back talking about data governance, and a number of the attendees were from technology or software companies. I used the term “semantic inconsistency” and one of the attendees asked me to provide an example of what I meant.

Since we had been discussing customers, I thought about it for a second and then asked him what his definition was of a customer. He said that a customer was someone who had paid the company money for one of their products. I then asked if anyone in the audience was on the support team, and one person raised his hand. I asked him for a definition, and he said that a customer is someone to whom they provide support.

I then posed this scenario: the company issued a 30-day evaluation license to a prospect with full support privileges. Since the prospect had not paid any money for the product, according to the first definition that individual was not a customer. However, since that individual was provided full support privileges, according to the second definition that individual was a customer.

Within each silo, the associated definition is sound, but the underlying data sets are not compatible. An attempt to extract the two customer lists and merge them together into a single list will lead to inconsistent results. This may be even worse if separate agreements dictate how long a purchaser is granted full support privileges – this may lead to many inconsistencies across those two data sets.
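David’s scenario can be sketched in a few lines of Python. This is only an illustration: the `Account` record, its two flags and the account names are hypothetical stand-ins for whatever the sales and support systems actually store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Account:
    name: str
    has_paid: bool      # the first attendee's test for "customer"
    has_support: bool   # the support team's test for "customer"


def sales_customers(accounts):
    """Customers under the first definition: someone who has paid money."""
    return {a.name for a in accounts if a.has_paid}


def support_customers(accounts):
    """Customers under the second definition: someone who receives support."""
    return {a.name for a in accounts if a.has_support}


# Hypothetical data: one paying account and one 30-day evaluation prospect.
accounts = [
    Account("Acme Corp", has_paid=True, has_support=True),
    Account("Trial Prospect", has_paid=False, has_support=True),
]

sales = sales_customers(accounts)
support = support_customers(accounts)

# Names that count as "customers" under exactly one of the two definitions.
disputed = sales ^ support
print(disputed)
```

Each function is sound inside its own silo. The symmetric difference is where a merged customer list starts to give inconsistent answers.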

Illustrating “semantic inconsistency,” one story at a time.

What’s your 250 – 300 word semantic inconsistency story?

PS: David also points to a webinar that will be of interest. Visit his post.

Argumentation 2012

Thursday, May 3rd, 2012

Argumentation 2012: International Conference on Alternative Methods of Argumentation in Law


07-09-2012 Full paper submission deadline

21-09-2012 Notice of acceptance deadline

12-10-2012 Paper camera-ready deadline

26-10-2012 Main event, Masaryk University in Brno, Czech Republic

From the listing of topics for papers, semantic diversity is going to run riot at this conference.

Checking around the website I was disappointed the papers from Argumentation 2011 are not online.

Semantically Diverse Christenings

Sunday, April 29th, 2012

Mark Liberman in Neutral Xi_b^star, Xi(b)^{*0}, Ξb*0, whatever at Language Log reports semantically diverse christenings of the same new subatomic particle.

I count eight or nine distinct names in Liberman’s report.

How many do you see?

This is just days after its discovery at CERN.

Largely in the scientific literature. (It will get far worse if you include non-technical literature. Is non-technical literature/discussion relevant?)

Question for science librarians:

How many names for this new subatomic particle will you use in searches?

Technology speedup graph

Sunday, April 8th, 2012

Technology speedup graph

Andrew Gelman posts an interesting graphic showing the adoption of various technologies from 1900 forward. See the post for the lineage on the graph and the details. Good graphic.

What caught my eye for topic maps was the rapid adoption of the Internet/WWW and the now well recognized failure of the Semantic Web.

You may feel like disputing my evaluation of the Semantic Web. Recall that agents were predicted to be roaming the Semantic Web by this point in Tim Berners-Lee’s first puff piece in Scientific American. After a few heady years of announcements that realization was just around the corner came the 21st century technology equivalent of the long retreat (think Napoleon).

Now the last gasp is Linked Data, where the “meaning” of URIs is to be determined on Mount W3C and then imposed on the rest of us.

Make no mistake, I think the WWW was a truly great technological achievement.

But the technological progress graph prompted me to wonder, yet again, how is the WWW different from the Semantic Web?

Not sure this is helpful but consider the level of agreement on semantics required by the WWW versus the Semantic Web.

For the WWW, there are a handful of RFCs that specify the treatment of syntax. That is, addresses and the composition of the resources that you find at those addresses. Users may attach semantics to those resources, but none of those semantics are required for processing or delivery of the resources.

That is, for the WWW to succeed, all we need is agreement on the addressing and processing of resources and not at all on their semantics.

A resource can have a crazy quilt of semantics attached to it by users, diverse, inconsistent, contradictory, because its addressing and processing is independent of those semantics and those who would impose them.

Resources on the WWW certainly have semantics, but processing those resources doesn’t depend on our agreement on those semantics.

So, the semantic agreement of the WWW = ~ 0. (Leaving aside the certainly true contention that protocols have semantics.)

The semantic agreement required by the Semantic Web is “web scale agreement.” That is everyone who encounters a semantic has to either honor it or break that part of the Semantic Web.

Wait until after you watch the BBC News or Al Jazeera (English), الجزيرة.نت, before you suggest universal semantics are just around the corner.

Cry Me A River, But First Let’s Agree About What A River Is

Saturday, February 4th, 2012

Cry Me A River, But First Let’s Agree About What A River Is

The post starts off well enough:

How do you define a forest? How about deforestation? It sounds like it would be fairly easy to get agreement on those terms. But beyond the basics – that a definition for the first would reflect that a forest is a place with lots of trees and the second would reflect that it’s a place where there used to be lots of trees – it’s not so simple.

And that has consequences for everything from academic and scientific research to government programs. As explained by Krzysztof Janowicz, perfectly valid definitions for these and other geographic terms exist by the hundreds, in legal texts and government documents and elsewhere, and most of them don’t agree with each other. So, how can one draw good conclusions or make important decisions when the data informing those is all over the map, so to speak.

….

Having enough data isn’t the problem – there’s official data from the government, volunteer data, private organization data, and so on – but if you want to do a SPARQL query of it to discover all towns in the U.S., you’re going to wind up with results that include the places in Utah with populations of less than 5,000, and Los Angeles too – since California legally defines cities and towns as the same thing.

“So this clearly blows up your data, because your analysis is you thinking that you are looking at small rural places,” he says.

This Big Data challenge is not a new problem for the geographic-information sciences community. But it is one that’s getting even more complicated, given the tremendous influx of more and more data from more and more sources: Satellite data, rich data in the form of audio and video, smart sensor network data, volunteer location data from efforts like the Citizen Science Project and services like Facebook Places and Foursquare. “The heterogeneity of data is still increasing. Semantic web tools would help you if you had the ontologies but we don’t have them,” he says. People have been trying to build top-level global ontologies for a couple of decades, but that approach hasn’t yet paid off, he thinks. There needs to be more of a bottom-up take: “The biggest challenge from my perspective is coming up with the rules systems and ontologies from the data.”

All true, and much of it is what objectors to the current Semantic Web approach have been saying for a very long time.

I am not sure about the line: “The heterogeneity of data is still increasing.”

In part because I don’t know of any reliable measure of heterogeneity by which a comparison could be made. True there is more data now than at some X point in the past, but that isn’t necessarily an indication of increased heterogeneity. But that is a minor point.

More serious is the “a miracle occurs” statement that follows:

How to do it, he thinks, is to make very small and really local ontologies directly mined with the help of data mining or machine learning techniques, and then interlink them and use new kinds of reasoning to see how to reason in the presence of inconsistencies. “That approach is local ontologies that arrive from real application needs,” he says. “So we need ontologies and semantic web reasoning to have neater data that is human and also machine readable. And more effective querying based on analogy or similarity reasoning to find data sets that are relevant to our work and exclude data that may use the same terms but has different ontological assumptions underlying it.”

Doesn’t that have the same feel as the original Semantic Web proposals that were going to eliminate semantic ambiguity from the top down? The very approach that is panned in this article?

And “new kinds of reasoning,” ones I assume have not been invented yet, are going “to reason in the presence of inconsistencies.” And excluding data that “…has different ontological assumptions underlying it.”

Since we are the source of the ontological assumptions that underlie the use of terms, I am really curious about how those assumptions are going to become available to these to-be-invented reasoning techniques.

Oh, that’s right, we are all going to specify our ontological assumptions at the bottom to percolate up. Except that to be useful for machine reasoning, they will have to be as crude as the ones that were going to be imposed from the top down.

I wonder why the indeterminate nature of semantics continues to elude Semantic Web researchers. A snapshot of semantics today may be slightly incorrect tomorrow, probably incorrect in some respect in a month and almost surely incorrect in a year or more.

Take Saddam Hussein for example. One time friend and confidant of Donald Rumsfeld (there are pictures). But over time those semantics changed, largely because Hussein slipped the leash and was no longer a proper vassal to the US. Suddenly, the weapons of mass destruction, in part nerve gas we caused to be sold to him, became a concern. And so Hussein became an enemy of the US. Same person, same facts. Different semantics.

There are less dramatic examples but you get the idea.

We can capture even changing semantics, but we need to decide what semantics we want to capture and at what cost. Perhaps that is a better way to frame my objection to most Semantic Web activities: they are not properly scoped. Yes?

Countandra

Friday, January 27th, 2012

Countandra

From the webpage:

Since Aryabhatta invented zero, Mathematicians such as John von Neuman have been in pursuit of efficient counting and architects have constantly built systems that computes counts quicker. In this age of social media, where 100s of 1000s events take place every second, we were inspired by twitter’s Rainbird project to develop distributed counting engine that can scale linearly.

Countandra is a hierarchical distributed counting engine on top of Cassandra (to increment/decrement hierarchical data) and Netty (HTTP Based Interface). It provides a complete http based interface to both posting events and getting queries. The syntax of a event posting is done in a FORMS compatible way. The result of the query is emitted in JSON to make it maniputable by browsers directly.

Features

  • Geographically distributed counting.
  • Easy Http Based interface to insert counts.
  • Hierarchical counting such as com.mywebsite.music.
  • Retrieves counts, sums and square in near real time.
  • Simple Http queries provides desired output in JSON format
  • Queries can be sliced by period such as LASTHOUR,LASTYEAR and so on for MINUTELY,HOURLY,DAILY,MONTHLY values
  • Queries can be classified for anything in hierarchy such as com, com.mywebsite or com.mywebsite.music
  • Open Source and Ready to Use!
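To make “hierarchical counting” concrete, here is a minimal in-memory sketch in Python. It is not Countandra’s implementation or API (that runs on Cassandra behind an HTTP interface). The `increment` helper and the `com.mywebsite.video` key are assumptions made up for the illustration.

```python
from collections import Counter


def increment(counts, key, amount=1):
    """Add `amount` to `key` and to every ancestor of `key` in the dotted hierarchy."""
    parts = key.split(".")
    for depth in range(1, len(parts) + 1):
        counts[".".join(parts[:depth])] += amount


counts = Counter()
increment(counts, "com.mywebsite.music")
increment(counts, "com.mywebsite.music")
increment(counts, "com.mywebsite.video")  # hypothetical sibling category

print(counts["com"])                  # 3
print(counts["com.mywebsite"])        # 3
print(counts["com.mywebsite.music"])  # 2
```

One event updates every level of the hierarchy, which is why a query can be answered for `com`, `com.mywebsite` or `com.mywebsite.music` without rescanning events.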

Countandra illustrates that not every application need be a general purpose one. Countandra is designed to be a counting engine and to answer defined query types, nothing more.

There is a lesson there for semantic diversity solutions. It is better to attempt to solve part of the semantic diversity issue than to attempt a solution for everyone. At least partial solutions have a chance of being a benefit before being surpassed by changing technologies and semantics.

BTW, Countandra uses a Java long for time values, so in the words of the Unix Time Wikipedia entry:

In the negative direction, this goes back more than twenty times the age of the universe, and so suffices. In the positive direction, whether the approximately 293 billion representable years is truly sufficient depends on the ultimate fate of the universe, but it is certainly adequate for most practical purposes.

Rather than “suffices” and “most practical purposes” I would have said, “is adequate for present purposes” in both cases.
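The arithmetic behind the quoted figure is easy to check. The sketch below assumes a signed 64-bit value and a mean Gregorian year. Whether Countandra stores seconds or milliseconds in that long is not stated here, so both cases are shown.

```python
SECONDS_PER_YEAR = 365.2425 * 24 * 60 * 60  # mean Gregorian year, in seconds

# Largest value of a signed 64-bit integer (the range of a Java long).
max_long = 2**63 - 1

# Assumption 1: the long counts seconds, as in the Wikipedia passage.
years_if_seconds = max_long / SECONDS_PER_YEAR

# Assumption 2: the long counts milliseconds.
years_if_millis = max_long / 1000 / SECONDS_PER_YEAR

print(f"{years_if_seconds:.3e} years if the long counts seconds")
print(f"{years_if_millis:.3e} years if the long counts milliseconds")
```

Counting seconds gives a figure on the order of 290 billion years, in line with the quote. Counting milliseconds shrinks it by a factor of a thousand, which is still adequate for present purposes.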

Oil Drop Semantics?

Sunday, January 15th, 2012

Interconnection of Communities of Practice: A Web Platform for Knowledge Management and some related material made me think of the French “oil drop” counter-insurgency strategy.

With one important difference.

In a counter-insurgency context, the oil drop strategy is being used to further the goals of the counter-insurgency force. Whatever you think of those goals or the alleged benefits for the places covered by the oil drops, the fundamental benefit is to the counter-insurgency force.

In a semantic context, one that seeks to elicit the local semantics of a group, the goal is not the furtherance of an outside semantic, but the exposition of a local semantic with the goal of benefiting the group covered by the oil drop. As the oil drop spreads, those semantics may be combined with other oil drop semantics, but that is a cost and effort borne by the larger community seeking that benefit.

There are several immediate advantages to this approach with semantics.

First, the discussion of semantics at every level is taking place with the users of those semantics. You can hardly get closer to a useful answer than being able to ask the users of a semantic what was meant or for examples of usage. I don’t have a formalism for it but I would postulate that as the distance from users increases, the usefulness of what you learn about the semantics of those users decreases.

Ask the FBI about the Virtual Case File project. They didn’t ask users, or at least not enough of them, and flushed lots of cash. Lesson: Asking management, IT, etc., about the semantics of users is an utter waste of time. Really.

If you want to know the semantics of user group X, then ask group X. If you ask Y about X, you will get Y’s semantics about X. If that is what you want, fine, but if you want the semantics of group X, you have wasted your time and resources.

Second, asking the appropriate group of users for their semantics means that you can make explicit the ROI from making their semantics explicit. That is to say, if asked, the group will ask about semantics that are meaningful to them, semantics that solve some task or issue that they encounter. Those may or may not be the semantics that interest you, but recall the issue is the group’s semantics, not yours.

The reason for the ROI question at the appropriate group level is so that the project is justified both to the group being asked to make the effort as well as those who must approve the resources for such a project. Answering that question up front helps get buy-in from group members and makes them realize this isn’t busy work but will have a positive benefit for them.

Third, such a bottom-up approach, whether you are using topic maps, RDF, etc. will mean that only the semantics that are important to users and justified by some positive benefit are being captured. Your semantics may not have the rigor of SUMO, for example, but they are a benefit to you. What other test would you apply?

Another way to think about geeks and repetitive tasks

Tuesday, January 10th, 2012

Another way to think about geeks and repetitive tasks

Jon Udell writes:

The other day Tim Bray tweeted a Google+ item entitled Geeks and repetitive tasks along with the comment: “Geeks win, eventually.”

…(material omitted)

In geek ideology the oppressors are pointy-haired bosses and clueless users. Geeks believe (correctly) that clueless users can’t imagine, never mind implement, automated improvements to repetitive manual chores. The chart divides the world into geeks and non-geeks, and it portrays software-assisted process improvement as a contest that geeks eventually win. This Manichean worldview is unhelpful.

I have no doubt that Jon’s conclusion:

Software-assisted automation of repetitive work isn’t an event, it’s a process. And if you see it as a contest with winners and losers you are, in my view, doing it wrong.

is the right one but I think it misses an important insight.

That “geeks” and their “oppressors” view the world with very different semantics. If neither one tries to communicate those semantics to the other, then software will continue to fail to meet the needs of its users. An unhappy picture for all concerned, geeks as well as their oppressors.

Being semantics, there is no “right” or “wrong” semantic.

True enough, the semantics of geeks works better with computers but if that fails to map in some meaningful way to the semantics of their oppressors, what’s the point?

Geeks can write highly efficient software for tasks but if the tasks aren’t something anyone is willing to pay for or even use, what’s the point?

Users and geeks need to both remember that communication is a two-way street. The only way for it to fail completely is for either side to stop trying to communicate with the other.

Have no doubt, I have experienced the annoyance of trying to convince a geek that just because they have written software a particular way, that has little to no bearing on some user request. (The case in point was a UI where the geek had decided on a “better” means of data entry. The users, who were going to be working with the data, thought otherwise. I heard the refrain, “…if they would just use it they would get used to it.” Of course, the geek had written the interface without asking the users first.)

To be fair, users have to be willing to understand there are limitations on what can be requested.

And users failing to complete written and detailed requirements for all aspects of a request is almost a guarantee that the software result isn’t going to satisfy anyone.

Written requirements are where semantic understandings, mis-understandings and clashes can be made visible, resolved (hopefully) and documented. Burdensome, annoying, non-productive in the view of geeks who want to get to coding, but absolutely necessary in any sane software development environment.

That is to say any software environment that is going to represent a happy (well, workable) marriage of the semantics of geeks and users.

Google; almost 50 functions & resources killed in 2011

Saturday, December 17th, 2011

Google; almost 50 functions & resources killed in 2011 by Phil Bradley.

Just in case you want to think of other potential projects over the holidays! ;-)

For my topic maps class:

  1. Pick one function or resource
  2. Outline how semantic integration could support or enhance such a function or resource. (3-5 pages, no cites)
  3. Bonus points: What resources would you want to integrate for such a function or resource? (1-2 pages)

John Giannandrea on Freebase – A Rosetta Stone for Entities

Tuesday, November 15th, 2011

John Giannandrea on Freebase – A Rosetta Stone for Entities by Daniel Tunkelang.

From the post:

John started by introducing Freebase as a representation of structured objects corresponding to real-world entities and connected by a directed graph of relationships. In other words, a semantic web. While it isn’t quite web-scale, Freebase is a large and growing knowledge base consisting of 25 million entities and 500 million connections — and doubling annually. The core concept in Freebase is a type, and an entity can have many types. For example, Arnold Schwarzenegger is a politician and an actor. John emphasized the messiness of the real world. For example, most actors are people, but what about the dog who played Lassie? It’s important to support exceptions.

The main technical challenge for Freebase is reconciliation — that is, determining how similar a set of data is to existing Freebase topics. John pointed out how critical it is for Freebase to avoid duplication of content, since the utility of Freebase depends on unique nodes in its graph corresponding to unique objects in the world. Freebase obtains many of its entities by reconciling large, open-source knowledge bases — including Wikipedia, WordNet, Library of Congress Authorities, and metadata from the Stanford Library. Freebase uses a variety of tools to implement reconciliation, including Google Refine (formerly known as Freebase Gridworks) and Matchmaker, a tool for gathering human judgments. While reconciliation is a hard technical problem, it is made possible by making inferences across the web of relationships that link entities to one another.

John then presented Freebase as a Rosetta Stone for entities on the web. Since an entity is simply a collection of keys (one of which is its name), Freebase’s job is to reverse engineer the key-value store that is distributed among the entity’s web references, e.g., the structured databases backing web sites and encoding keys in URL parameters. He noted that Freebase itself is schema-less (it is a graph database), and that even the concept of a type is itself an entity (“Type type is the only type that is an instance of itself”). Google makes Freebase available through an API and the Metaweb Query Language (MQL).

(emphasis added)

<tedious-self-justification>…., entity is a collection of keys indeed! Key/value pairs I would say, with no presumptions about the structure of either one.</tedious-self-justification>

There is not now nor will there ever be agreement on the “unique objects in the world.” And why should that be a value? If we have the key/value pairs, we can each arrive at our own conclusions about whether certain “unique nodes” correspond to what we think of as “unique objects in the world.”
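A small Python sketch of that point: if entities are nothing more than key/value pairs, the rule for deciding “same object” can be supplied by each reader rather than fixed in the data. The `same_subject` helper and the `id` values are hypothetical, not Freebase data or its API.

```python
def same_subject(a, b, identifying_keys):
    """Treat two entities as the same subject if they agree on any identifying key."""
    return any(
        key in a and key in b and a[key] == b[key]
        for key in identifying_keys
    )


# Two hypothetical entities, each just a collection of key/value pairs.
politician = {"name": "Arnold Schwarzenegger", "type": "politician", "id": "example-1"}
actor = {"name": "Arnold Schwarzenegger", "type": "actor", "id": "example-2"}

# One reader identifies subjects by name, another by type.
by_name = same_subject(politician, actor, ["name"])
by_type = same_subject(politician, actor, ["type"])

print(by_name, by_type)
```

Same key/value pairs, two different answers about whether there is one “unique object” or two. Which answer is right depends on who is asking and why.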

I suspect, but don’t know having never asked former President Bush II, that we disagree on the existence of any unique objects in the world and it is unlikely there is any evidence that would persuade either one of us to change.

Remember the Rosetta Stone had three (3) versions of the same inscription. It did not try to say one version was closer to the original than the others.

The Rosetta Stone is one of the earliest honorings of semantic diversity. Unlike systems that try to push only one common semantic or vision.

The Second International Workshop on Diversity in Document Retrieval (DDR-2012)

Tuesday, October 18th, 2011

The Second International Workshop on Diversity in Document Retrieval (DDR-2012)

Dates:

When: Feb 12, 2012
Where: Seattle, WA, USA
Submission Deadline: Dec 5, 2011
Notification Due: Jan 10, 2012
Final Version Due: Jan 17, 2012

From the webpage:

In conjunction with WSDM 2012 – the 5th ACM International Conference on Web Search and Data Mining

Overview
=======
When an ambiguous query is received, a sensible approach is for the information retrieval (IR) system to diversify the results retrieved for this query, in the hope that at least one of the interpretations of the query intent will satisfy the user. Diversity is an increasingly important topic, of interest to both academic researchers (such as participants in the TREC Web and Blog track diversity tasks), as well as to search engines professionals. In this workshop, we solicit submissions both on approaches and models for diversity, the evaluation of diverse search results, and on applications and presentation of diverse search results.

Topics:

  • Modelling Diversity:
    • Implicit diversification approaches
    • Explicit diversification approaches
    • Query log mining for diversity
    • Learning-to-rank for diversification
    • Clustering of results for diversification
    • Query intent understanding
    • Query type classification
  • Modelling Risk:
    • Probability ranking principle
    • Risk Minimization frameworks and role diversity
  • Evaluation:
    • Test collections for diversity
    • Evaluating of diverse search results
    • Measuring the ambiguity of queries
    • Measuring query aspects importance
  • Applications:
    • Product & review diversification
    • Opinion and sentiment diversification
    • Diversifying Web crawling policy
    • Graph analysis for diversity
    • Summarisation
    • Legal precedents & patents
    • Diverse recommender systems
    • Diversifying in real-time & news search
    • Diversification in other verticals (image/video search etc.)
    • Presentation of diverse search results

While typing this up, I remembered the “little search engine that could” post (Going Head to Head with Google (and winning)). Are we really condemned to have to manage unforeseeable complexity or is that a poor design choice we made for search engines?

After all, I am not really interested in the entire WWW. At least for this blog I am interested in probably less than 1/10 of 1% of the web (or less). So if I had a search engine for all the CS/Library/Informatics publications, blogs, subject domains relevant to data/information, I would pretty much be set. A big semantic field and one that is changing, but not anything like search everything that is connected (or not, for the DeepWeb) to the WWW.

I don’t have an answer for that but I think it is an issue that may enable management of semantic diversity. That is we get to declare the edge of the map. Yes, there are other things beyond the edge but we aren’t going to include them in this particular map.

Uberblic

Friday, October 7th, 2011

Uberblic

From the about documentation:

The Doppelganger service translates between IDs of entities in third party APIs. When you query Doppelganger with an entity ID, you’ll get back IDs of that same entity in other APIs. In addition, a persistent Uberblic ID serves as an anchor for your application that you can use for subsequent queries.

So why link APIs? is answered in a blog entry:

There is an ever-increasing amount of data available on the Web via APIs, waiting to be integrated by product developers. But actually integrating more than just one API into a product poses a problem to developers and their product managers: how do we make the data sources interoperable, both with one another and with our existing databases? Uberblic launches a service today to make that easy.

A location based product, for example, would aim to pull in information like checkins from Foursquare, reviews from Lonely Planet, concerts from LastFM and social connections from Facebook, and display all that along one place’s description. To do that, one would need to identify this particular place in all the APIs – identify the place’s ‘doppelgangers’, if you will. Uberblic does exactly that, mapping doppelgangers across APIs, as a web service. It’s like a dictionary for IDs, the Rosetta Stone of APIs. And geolocation is just the beginning.

Uberblic’s doppelganger engine links data across a variety of data APIs. By matching equivalent records, the engine connects an entity graph that spans APIs and data services. This entity graph provides rich contextual data for product developers, and Uberblic’s APIs serve as a switchboard and broker between data sources.

See the full post at: http://uberblic.com/2011/08/one-api-to-link-them-all/
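The ID translation described in the quoted passages can be pictured as a crosswalk table keyed by (source, ID) pairs. What follows is a toy in-memory sketch of that idea, not Uberblic's actual API or data model: the class name, method names, anchor format, and sample IDs are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Optional, Tuple


class IdCrosswalk:
    """Toy table linking (source, local_id) pairs to one persistent anchor ID."""

    def __init__(self) -> None:
        # (source, local_id) -> anchor ID
        self._anchor_of: Dict[Tuple[str, str], str] = {}
        # anchor ID -> {source: local_id}
        self._ids_of: Dict[str, Dict[str, str]] = defaultdict(dict)

    def link(self, anchor: str, source: str, local_id: str) -> None:
        """Record that local_id in source names the entity behind anchor."""
        self._anchor_of[(source, local_id)] = anchor
        self._ids_of[anchor][source] = local_id

    def translate(
        self, source: str, local_id: str
    ) -> Optional[Tuple[str, Dict[str, str]]]:
        """Return (anchor, IDs of the same entity in other sources), or None."""
        anchor = self._anchor_of.get((source, local_id))
        if anchor is None:
            return None
        others = {s: i for s, i in self._ids_of[anchor].items() if s != source}
        return anchor, others


# Hypothetical sample data: one place known to two APIs.
crosswalk = IdCrosswalk()
crosswalk.link("place:0001", "foursquare", "fsq-123")
crosswalk.link("place:0001", "lastfm", "venue-987")

result = crosswalk.translate("foursquare", "fsq-123")
```

Note that the table records only "same entity" links between identifiers. It carries no types and no associations between entities.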

Useful. But as you have already noticed, no associations, no types, no way to map to other identifiers.

Not that a topic map could not use Uberblic data if available, just that it is not all that is possible.

Artificial Intelligence Resources

Sunday, September 25th, 2011

Artificial Intelligence Resources

A collection of collections of resources on artificial intelligence. Useful but also illustrates a style of information delivery that has advantages over “search style foraging” and disadvantages as well.

Its biggest advantage over “search style foraging” is that it presents a manageable listing of resources and not several thousand links. Even very dedicated researchers are unlikely to follow more than a few hundred links, and even if they did, some of the material would be outdated by the time they reached it.

Another advantage is that one hopes (I haven’t tried all the links) that the resources have been vetted to some degree, with the superficial and purely advertising sites being filtered out. Results are more “hit” than “miss,” which with search results can be a very mixed bag.

But a manageable list is just that, manageable. The very link you need may have missed the cut-off point. The author had to stop somewhere.

And you can’t know the author’s criteria for the listing. Their definition of “algorithm” may be broader or narrower than your own.

In the days of professional indexes, researchers learned a sense for the categories used by indexing services. At least that was a smaller set than the vocabulary range of every author.

How would you use topic maps to bridge the gap between those two solutions?