Archive for February, 2011

Data Engine Roundup

Wednesday, February 23rd, 2011

Data Engine Roundup

Matthew Hurst provides a quick listing of data engines (including his own, which merits a close look).

AI Mashup Challenge 2011

Wednesday, February 23rd, 2011

AI Mashup Challenge 2011

Due date: 1 April 2011

From the website:

The AI mashup challenge accepts and awards mashups that use AI technology, including but not restricted to machine learning and data mining, machine vision, natural language processing, reasoning, ontologies and the semantic web.
Imagine for example:

  • Information extraction or automatic text summarization to create a task-oriented overview mashup for mobile devices.
  • Semantic Web technology and data sources adapting to user and task-specific configurations.
  • Semantic background knowledge (such as ontologies, WordNet or Cyc) to improve search and content combination.
  • Machine translation for mashups that cross language borders.
  • Machine vision technology for novel ways of aggregating images, for instance mixing real and virtual environments.
  • Intelligent agents taking over simple household planning tasks.
  • Text-to-speech technology creating a voice mashup with intelligent and emotional intonation.
  • The display of Pub Med articles on a map based on geographic entity detection referring to diseases or health centers.

The emphasis is not on providing and consuming semantic markup, but rather on using intelligence to mashup these resources in a more powerful way.

This looks like an opportunity for an application that assists users in explicit identification or confirmation of identification of subjects.

Rather than auto-correcting, human-correcting.

Assuming we can capture the corrections, wouldn’t that mean that our apps would incrementally get “smarter?” Rather than starting off from ground zero with each request? (True, a lot of analysis goes on with logs, etc. Why not just ask?)

Big Oil and Big Data

Wednesday, February 23rd, 2011

Big Oil and Big Data

Mike Betron, Marketing Director of Infoglide, says that it is becoming feasible to mine “big data” and to exploit “entity resolution.”

Those who want to exploit the availability of big data have another powerful tool at their disposal – entity resolution. The ability to search across multiple databases with disparate forms residing in different locations can tame large amounts of data very quickly, efficiently resolving multiple entities into one and finding hidden connections without human intervention in many application areas, including detecting financial fraud.

By exploiting advancing technologies like entity resolution, systems can give organizations a distinct competitive advantage over those who lag in technology adoption.

I have to quibble about the “…without human intervention…” part, although I am quite happy with augmented human supervision.

Well, that and the implication that entity resolution is a new technology. In various guises, entity resolution has been in use for decades in medical epidemiology, for example.
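For a feel of what the simplest sort of entity resolution looks like, here is a toy sketch that groups records from disparate sources by a normalized name key (real systems use far richer matching; the records below are made up for illustration):

```python
from collections import defaultdict

def normalize(name):
    """Crude normalization: lowercase, strip punctuation, sort tokens."""
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in name)
    return " ".join(sorted(cleaned.lower().split()))

def resolve(records):
    """Group (source, name) records that share a normalized name key."""
    entities = defaultdict(list)
    for source, name in records:
        entities[normalize(name)].append((source, name))
    return dict(entities)

records = [
    ("db1", "Smith, John"),
    ("db2", "John Smith"),
    ("db3", "Jane Doe"),
]
merged = resolve(records)
# "Smith, John" and "John Smith" resolve to one entity
```

Even this crude version shows why human supervision matters: two different John Smiths would be merged just as happily.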

Preventing subject identifications from languishing in reports, summaries, and the other information debris of a modern organization, so that documented and accessible organizational memories prosper and grow: now that would be something different. (It could also be called a topic map.)

Do You Tweet?

Wednesday, February 23rd, 2011

This is probably old news to most people who tweet, but I stumbled across a service that maintains a directory of hash tags.

Hash tags for which you can submit definitions.

I have entered a definition for #odf.

There already is a definition for #topicmaps.

When you tweet about topic maps related material, events, etc., please use #topicmaps.

I try to use that plus hashtags from other relevant areas in hopes people will follow the topic maps tag as something unfamiliar that may be of interest.

Hard to say if it will be effective but I suspect no less effective than some marketing strategies I have seen.

Update: Apparently this service has moved to a new address, and as far as I can tell, the hashtags for topicmaps and ODF have been lost.

Do note that the URL listed in the original tweet does not go to the tweet site but to “What the Trend? API.”

LingPipe Baseline for MITRE Name Matching Challenge

Tuesday, February 22nd, 2011

LingPipe Baseline for MITRE Name Matching Challenge.

Bob Carpenter walks through the use of LingPipe in connection with the MITRE Name Matching Challenge.

There are many complex issues in data mining but doing well on basic tasks is always a good starting place.
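As an illustration of how basic a name-matching baseline can be, here is character-bigram Jaccard similarity, which already tolerates spelling variation in names (a sketch of the general idea, not what Bob's LingPipe baseline actually does):

```python
def bigrams(s):
    """Character bigrams of a lowercased string."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(a, b):
    """Jaccard similarity of the two names' bigram sets."""
    A, B = bigrams(a), bigrams(b)
    return len(A & B) / len(A | B) if A | B else 0.0

# Transliteration variants of the same name still score well above zero.
score = jaccard("Muammar Gaddafi", "Moammar Qaddafi")
```

A threshold on such a score is about the cheapest matcher one can build; the challenge is in everything it gets wrong.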


Tuesday, February 22nd, 2011

Luke
From the website:

Lucene is an Open Source, mature and high-performance Java search engine. It is highly flexible, and scalable from hundreds to millions of documents.

Luke is a handy development and diagnostic tool, which accesses already existing Lucene indexes and allows you to display and modify their content in several ways:

  • browse by document number, or by term
  • view documents / copy to clipboard
  • retrieve a ranked list of most frequent terms
  • execute a search, and browse the results
  • analyze search results
  • selectively delete documents from the index
  • reconstruct the original document fields, edit them and re-insert to the index
  • optimize indexes
  • open indexes consisting of multiple parts, and located on Hadoop filesystem
  • and much more…
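Luke's “ranked list of most frequent terms” is easy to picture without an index at all (toy Python counting whitespace tokens, rather than reading an actual Lucene index):

```python
from collections import Counter

def top_terms(docs, n=3):
    """Rank terms across a toy 'index' by total frequency,
    much as Luke does for a real Lucene index."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts.most_common(n)

docs = ["salt is sodium chloride", "salt of the earth", "pass the salt"]
ranking = top_terms(docs)
```

Luke does this over the term dictionary of an existing index, which is what makes it such a handy diagnostic.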

Searching is interesting and I have several more search engines to report this week, but the real payoff is finding.

And recording the finding so that other users can benefit from it.

We could all develop our own maps of the London Underground, at the expense of repeating the effort of others.

Or, we can purchase a copy of the London Underground map.

Which one seems more cost effective for your organization?

LOTR Timeline Graph

Tuesday, February 22nd, 2011

LOTR Timeline Graph

Interesting take on visualizing the time line in the Lord of the Rings.

Based on the entertaining but fairly flat movie time lines you will find at XKCD.

A much more subtle work could be based on the books. 😉

Not that movie time lines have to be flat, but there is only so much that can be done in the usual movie-length presentation.

For LOTR enthusiasts:

  1. What other time lines or perspectives would you add?
  2. How would you represent deceptions concerning gender?
  3. What of Denethor’s “wrestling” with Sauron? How would you represent that?
  4. How would you deal with cross-overs between time lines, such as time as experienced by the Ents versus others?
  5. Your suggested issues/problems…

Quantum GIS

Tuesday, February 22nd, 2011

Quantum GIS

From the website:

QGIS is a cross-platform (Linux, Windows, Mac) open source application with many common GIS features and functions. The major features include:

1. View and overlay vector and raster data in different formats and projections without conversion to an internal or common format.

Supported formats include:

  • spatially-enabled PostgreSQL tables using PostGIS and SpatiaLite,
  • most vector formats supported by the OGR library*, including ESRI shapefiles, MapInfo, SDTS and GML.
  • raster formats supported by the GDAL library*, such as digital elevation models, aerial photography or landsat imagery,
  • GRASS locations and mapsets,
  • online spatial data served as OGC-compliant WMS , WMS-C (Tile cache), WFS and WFS-T

2. Create maps and interactively explore spatial data with a friendly graphical user interface. The many helpful tools available in the GUI include:

  • on the fly projection,
  • print composer,
  • overview panel,
  • spatial bookmarks,
  • identify/select features,
  • edit/view/search attributes,
  • feature labeling,
  • vector diagram overlay
  • change vector and raster symbology,
  • add a graticule layer,
  • decorate your map with a north arrow, scale bar and copyright label,
  • save and restore projects

3. Create, edit and export spatial data using:

  • digitizing tools for GRASS and shapefile formats,
  • the georeferencer plugin,
  • GPS tools to import and export GPX format, convert other GPS formats to GPX, or down/upload directly to a GPS unit

4. Perform spatial analysis using the fTools plugin for Shapefiles or the integrated GRASS plugin, including:

  • map algebra,
  • terrain analysis,
  • hydrologic modeling,
  • network analysis,
  • and many others

5. Publish your map on the internet using the export to Mapfile capability (requires a webserver with UMN MapServer installed)

6. Adapt Quantum GIS to your special needs through the extensible plugin architecture.

I didn’t find this on my own. 😉 This and the TIGER data source were both mentioned in Paul Smith’s Mapping with Location Data presentation.

Data and manipulations you usually find have no explicit basis in subject identity but that is your opportunity to really shine.

Assuming you can discover some user need that can be met, or met better, with explicit subject identity.

Let’s try not to be like some vendors I could mention where a user’s problem has to fit the solution they are offering. I turned down an opportunity like that, some thirty years ago now, and see no reason to re-visit that decision.

At least in my view, any software solution has to fit my problem, not vice versa.

TIGER – Topologically Integrated Geographic Encoding and Referencing system

Tuesday, February 22nd, 2011

TIGER – Topologically Integrated Geographic Encoding and Referencing system

From the US Census Bureau.

From the website:

Latest TIGER/Line® Shapefile Release

  • TIGER/Line® Shapefiles are spatial extracts from the Census Bureau’s MAF/TIGER database, containing features such as roads, railroads, and rivers, as well as legal and statistical geographic areas.
  • They are made available to the public for no charge and are typically used to provide the digital map base for a Geographic Information System or for mapping software.
  • They are designed for use with geographic information system (GIS) software. The TIGER/Line® Shapefiles do not include demographic data, but they contain geographic entity codes that can be linked to the Census Bureau’s demographic data, available on American FactFinder.

2010 TIGER/Line® Shapefiles Main Page — Released on a rolling basis beginning November 30, 2010.


TIGER®-Related Products

Great source of geographic and other data.

Can use it for mashups or, you can push beyond mashups to creating topic maps.

For example, plotting all the crime in an area is a mashup.

Interesting I suppose for real estate agents pushing housing in better neighborhoods.

Having the crime reported in an area and the location of crimes committed by the same person (based on arrest reports) and known associates of that person, that is starting to sound like a topic map. Then add in real time observations and conversations of officers working the area.

Enhancing traditional law enforcement, the most effective way to prevent terrorism.

“Mapping with Location Data” by Paul Smith (February 2011)

Tuesday, February 22nd, 2011

“Mapping with Location Data” by Paul Smith (February 2011)

From the description:

With the recent announcement of Baltimore’s open-data initiative, “OpenBaltimore”, there’s been lots of buzz about what people can do with some of this city data. Enter Paul Smith, an expert on data and mapping. Paul will be talking about EveryBlock and how his company uses city data in their neighborhood maps, as well as showing off some cool map visualizations. He’ll also be providing some insight on how folks might be able to jump in and create their own maps based on their own location data.

Our speaker Paul Smith is co-founder and software engineer for EveryBlock, a “what’s going on in my neighborhood” website. He has been developing sites and software for the web since 1994. Originally from Maryland, he recently moved to Baltimore after more than a decade in Chicago, where he co-founded Friends of the Bloomingdale Trail and produced the Election Day Advent Calendar.

Great source of ideas for city data and use of the same.

Two ways where topic maps are a value-add:

1) The relationships between data sets are subjects that can be represented and additional information recorded about those relationships.

2) Identification of subjects can support reliable attachment of other data to the same subjects.

See an Error at the Washington Post? Now You Can Easily Report It

Tuesday, February 22nd, 2011

See an Error at the Washington Post? Now You Can Easily Report It

I’m not sure what would leave me less impressed but I can say this story doesn’t do much for me.

How about you?

The news is that every story will have a link to a form for reader feedback.

That’s better than current practice but here are a couple of things that might make more of a difference:

  • Where possible, embed links directly in stories for websites or other online resources that are mentioned.

    Why cite a report from an agency, commission, etc. that is online and not provide a link?

    Leaves me with the impression you want me to take your word for it.

  • Provide permalinks so users can create mappings to news stories to be used with their data.

I would say that the permalinks should contain explicit subject identity but that is expecting too much.

If we can link to it, we can add explicit subject identity.


Tuesday, February 22nd, 2011

elasticsearch
From the website:

So, we build a web site or an application and want to add search to it, and then it hits us: getting search working is hard. We want our search solution to be fast, we want a painless setup and a completely free search schema, we want to be able to index data simply using JSON over HTTP, we want our search server to be always available, we want to be able to start with one machine and scale to hundreds, we want real-time search, we want simple multi-tenancy, and we want a solution that is built for the cloud.

“This should be easier,” we declared, “and cool, bonsai cool.”

elasticsearch aims to solve all these problems and more. It is an Open Source (Apache 2), Distributed, RESTful, Search Engine built on top of Lucene.
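The “index data simply using JSON over HTTP” claim is easy to picture. A sketch in Python that only builds the request bodies (the paths and query shape follow elasticsearch’s documented JSON-over-HTTP style, but treat the index and type names here as assumptions; no live server is involved):

```python
import json

# Hypothetical document to index, as a JSON body for a PUT request.
doc = {"title": "Topic Maps and the Semantic Web", "tags": ["topicmaps", "semweb"]}
index_request = ("PUT", "/posts/post/1", json.dumps(doc))

# A term query against the same index, again as plain JSON over HTTP.
query = {"query": {"term": {"tags": "topicmaps"}}}
search_request = ("GET", "/posts/_search", json.dumps(query))
```

No schema declaration anywhere, which is exactly the “completely free search schema” the quote is after.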

Another contender in the space for search engines.

Do you have a favorite search engine? If so, what about it makes it your favorite?

Soylent: A Word Processor with a Crowd Inside

Monday, February 21st, 2011

Soylent: A Word Processor with a Crowd Inside

I know, I know, won’t even go there. As the librarians say: “Look it up!”

From the abstract:

This paper introduces architectural and interaction patterns for integrating crowdsourced human contributions directly into user interfaces. We focus on writing and editing, complex endeavors that span many levels of conceptual and pragmatic activity. Authoring tools offer help with pragmatics, but for higher-level help, writers commonly turn to other people. We thus present Soylent, a word processing interface that enables writers to call on Mechanical Turk workers to shorten, proofread, and otherwise edit parts of their documents on demand. To improve worker quality, we introduce the Find-Fix-Verify crowd programming pattern, which splits tasks into a series of generation and review stages. Evaluation studies demonstrate the feasibility of crowdsourced editing and investigate questions of reliability, cost, wait time, and work time for edits.
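The Find-Fix-Verify pattern can be sketched with stub “workers” standing in for Mechanical Turk crowds (the voting rules below are simplified assumptions, not the paper’s exact mechanics):

```python
def find_fix_verify(text, find_workers, fix_workers, verify_workers):
    """Sketch of Find-Fix-Verify: independent workers flag problem spans,
    propose fixes, then vote on which fix to keep."""
    # Find: keep only spans flagged by more than one worker.
    flagged = [w(text) for w in find_workers]
    spans = {s for s in flagged if flagged.count(s) > 1}
    # Fix: collect candidate rewrites for each agreed-upon span.
    candidates = {s: [w(s) for w in fix_workers] for s in spans}
    # Verify: keep the candidate with the most approval votes.
    result = text
    for span, fixes in candidates.items():
        votes = {f: sum(v(span, f) for v in verify_workers) for f in fixes}
        result = result.replace(span, max(votes, key=votes.get))
    return result

# Stub workers: two finders agree on "teh", fixers propose rewrites,
# verifiers approve fixes without stray spaces.
find_workers = [lambda t: "teh", lambda t: "teh"]
fix_workers = [lambda s: "the", lambda s: "th e"]
verify_workers = [lambda s, f: 1 if " " not in f else 0] * 2
result = find_fix_verify("teh cat sat", find_workers, fix_workers, verify_workers)
```

Splitting the stages this way is what keeps any single lazy or eager worker from derailing the result.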

When I first started reading the article, it seemed obvious to me that the Human Macro option could be useful for topic map authoring. At least if the tasks were sufficiently constrained.

I was startled to see that a 30% error rate for the “corrections” was considered a baseline, hence the necessity for correction/control mechanisms.

The authors acknowledge that the bottom line cost of out-sourcing may weigh against its use in commercial contexts.

Perhaps so, but I would run the same tests against published papers and books, to determine the error rate without an out-sourced correction loop.

I think the idea is basically sound, although for some topic maps it might be better to place qualification requirements on the outsourcing.

Introducing Vector Maps

Monday, February 21st, 2011

Introducing Vector Maps

From the post:

Modern distributed data stores such as CouchDB and Riak use variants of Multi-Version Concurrency Control to detect conflicting database updates and present these as multi-valued responses.

So, if I and my buddy Ola both update the same data record concurrently, the result may be that the data record now has multiple values – both mine and Ola’s – and it will be up to the eventual consumer of the data record to resolve the problem. The exact schemes used to manage the MVCC differs from system to system, but the effect is the same; the client is left with the turd to sort out.

This led me to an idea: trying to create a data structure which is by its very definition able to be merged, and then store such data in these kinds of databases. So, if you are handed two versions, there is a reconciliation function that will take those two records and “merge” them into one sound record, by some definition of “sound”.

Seems to me that reconciliation should not be limited to records differing based on time stamps. 😉

Will have to think about this one for while but it looks deeply similar to issues we are confronting in topic maps.
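One way to get a structure that is “by its very definition able to be merged” is to make every value a set and define merge as per-key union (a toy sketch; real systems like Riak use vector clocks and much richer types):

```python
def merge(a, b):
    """Merge two replicas of a map-of-sets: per key, union the value sets.
    Any two replicas merge into one 'sound' record without asking the
    client to pick a winner."""
    return {key: a.get(key, set()) | b.get(key, set())
            for key in a.keys() | b.keys()}

# Concurrent updates from me and Ola to the same record.
mine = {"phone": {"555-0100"}}
olas = {"phone": {"555-0199"}, "city": {"Oslo"}}
merged = merge(mine, olas)
```

Note that merge here is commutative and idempotent, which is exactly what makes the multi-valued responses from the database harmless.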

BTW, saw this first at Alex Popescu’s myNoSQL blog.

Still Building Memex & Topic Maps Part 1

Monday, February 21st, 2011

In Still Building the Memex, Stephen Davies writes:

We define a Personal Knowledge Base – or PKB – as an electronic tool through which an individual can express, capture, and later retrieve the personal knowledge he or she has acquired.

Personal: Like Bush’s memex, a PKB is intended for private use, and its contents are custom tailored to the individual. It contains trends, relationships, categories, and personal observations that its owner sees but which no one else may agree with. Many of the issues involved in PKB design are also relevant in collaborative settings, as when a homogeneous group of people is jointly building a shared knowledge base. In this case, the knowledge base could simply reflect the consensus view of all contributors; or, perhaps better, it could simultaneously store and present alternate views of its contents, so as to honor several participants who may organize it or view it differently. This can introduce another level of complexity.

I am not sure that having “ear buds” from an intellectual iPod like a PKB is a good idea.

The average reader is already well insulated from the inconvenience of information or opinions dissimilar from their own. Reflecting the “consensus view of all contributors,” is a symptom, and not a desirable one at that.

We have had the equivalents of PKBs over both Republican and Democratic administrations.

The consequences of PKBs, or Personal Knowledge Bases (weapons of mass destruction in Iraq, the collapse of the housing market, serious mis-steps in both foreign and domestic policy), are too well known to need elaboration.

The problem posed by a PKB is the simpler one, but what we need are PbKBs, Public Knowledge Bases.

Even though more complex, a PbKB has the potential to put all citizens on an even footing with regard to debates over policy choices.

It may be difficult to achieve in practice and only ever partially successful, but the result could hardly be worse than the echo chamber of a PKB.

(Still Building Memex & Topic Maps Part 2 – Beyond the Echo Chamber)

Sphinx – Open Source Search Server
Monday, February 21st, 2011

Sphinx – Open Source Search Server

Benjamin Bock mentioned Sphinx in a Twitter posting and so I had to go see what he was reading about.

The short version from the website:

Sphinx is an open source full text search server, designed from the ground up with performance, relevance (aka search quality), and integration simplicity in mind. It’s written in C++ and works on Linux (RedHat, Ubuntu, etc), Windows, MacOS, Solaris, FreeBSD, and a few other systems.

Sphinx lets you either batch index and search data stored in an SQL database, NoSQL storage, or just files quickly and easily — or index and search data on the fly, working with Sphinx pretty much as with a database server.

A variety of text processing features enable fine-tuning Sphinx for your particular application requirements, and a number of relevance functions ensures you can tweak search quality as well.

Searching via SphinxAPI is as simple as 3 lines of code, and querying via SphinxQL is even simpler, with search queries expressed in good old SQL.

For more specific details, see: About Sphinx.

Comments welcome.

Topic Maps and the Semantic Web

Monday, February 21st, 2011

As I pointed out in Topic Maps in < 5 Minutes, topic maps take it as given that:

  • salt
  • sodium chloride
  • NaCl

are all legitimate identifiers for the same subject.

That is called semantic diversity. (There are more ways to identify salt but let’s stick with those for the moment.)

Contrast that with the Semantic Web, that wants to have one identifier for salt and consequently, no semantic diversity.

You may ask yourself, what happens to all the previous identifications of salt, literally thousands of different ways to identify it?

Do we have to re-write all those identifiers across the vast sweep of human literature?

Or you may ask yourself, given the conditions that lead to semantic diversity still exist, how is future semantic diversity to be avoided?

Good questions.

Or for that matter, what about the changing and growing information structures where we are storing petabytes of data? What about semantic diversity there as well?

I don’t know.

Maybe we should ask Tim Berners-Lee, timbl @

PS: And this is a really easy subject to identify. Just think about democracy, human rights, justice, or any of the others of thousands of subjects, all of which are works in progress.

TF-IDF Weight Vectors With Lucene And Mahout

Monday, February 21st, 2011

How To Easily Build And Observe TF-IDF Weight Vectors With Lucene And Mahout

From the website:

You have a collection of text documents, and you want to build their TF-IDF weight vectors, probably before doing some clustering on the collection or other related tasks.

You would like to be able for instance to see what are the tokens with the biggest TF-IDF weights in any given document of the collection.

Lucene and Mahout can help you to do that almost in a snap.

Why is this important for topic maps?

Wikipedia reports:

The tf–idf weight (term frequency–inverse document frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document’s relevance given a user query.

Knowing the important terms in a document collection is one step towards a useful topic map. May not be definitive but it is a step in the right direction.
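The computation itself is small. A toy version of tf-idf in plain Python (one common weighting variant; Lucene and Mahout use their own formulas, so treat this as a sketch of the idea):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute tf-idf weights for a toy corpus of tokenized documents.
    tf = term count / doc length; idf = log(N / doc frequency)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency: count each term once per doc
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return weights

docs = [["salt", "water", "salt"], ["salt", "history"], ["water", "history"]]
w = tf_idf(docs)
```

In the first document, “salt” appears twice and so carries twice the weight of “water”, which is the “important terms” signal a topic map author could start from.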

Topic Maps in < 5 Minutes

Monday, February 21st, 2011

Get a cup of coffee and set your timer!


The best way to explain topic maps is to say why we want them in the first place.

I have a set of food/recipe/FDA documents.

Some use the term salt.

Some use the term sodium chloride.

Some use the term NaCl.

Walmart wants to find information on salt/sodium chloride/NaCl, no matter how it is identified.

Not a hypothetical use case: Walmart Launches Major Initiative to Make Food Healthier and Healthier Food More Affordable

Topic maps would enable Walmart to use any one identification for salt, and to retrieve all the information about it, recorded using any of its identifications.*
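That collation idea fits in a few lines of illustrative Python (a toy index, not any actual topic map software; the recorded “facts” are made up):

```python
class TopicMap:
    """Toy index where any identifier for a subject retrieves everything
    recorded under any of its identifiers."""
    def __init__(self):
        self._topic_of = {}  # identifier -> topic id
        self._info = {}      # topic id -> list of recorded facts

    def add_topic(self, identifiers):
        tid = len(self._info)
        self._info[tid] = []
        for ident in identifiers:
            self._topic_of[ident] = tid
        return tid

    def record(self, identifier, fact):
        self._info[self._topic_of[identifier]].append(fact)

    def lookup(self, identifier):
        return self._info[self._topic_of[identifier]]

tm = TopicMap()
tm.add_topic(["salt", "sodium chloride", "NaCl"])
tm.record("NaCl", "FDA sodium guidance")
tm.record("salt", "recipe: brine")
facts = tm.lookup("sodium chloride")  # any identifier finds both facts
```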


Topics represent subjects and can have multiple, independent identifications of the same subject.

That sounds cool!

Can I talk about relationships too? Like they do in the tabloids? 😉

Well, sure, but we call those associations.

Let’s keep with the salt theme.

What about: Mark Kurlansky wrote “Salt: A World History”.

How many subjects do you see?

Mark Kurlansky (subject) wrote “Salt: A World History” (subject).

Two right?

But Mark is probably in a lot of associations, both professional and personal. How do we keep those straight?

Topic maps identify another three subjects:

Mark Kurlansky (subject) (role – author) wrote (subject – written-by) “Salt: A World History” (subject) (role – work).

Mark is said to “play” the role of author in the association. “Salt: A World History” plays the role of a work in the association.

Means we can find all the places where Mark plays the author role, as opposed to husband role, speaker role, etc.

One more thing rounds out the typical topic map, occurrences.

Occurrences are used to write down places where we have seen a subject discussed.

That is, we say that Mark Kurlansky (subject) occurs at:

(How am I doing for time?)

That’s it.

To review:

  1. Topics represent subjects that can have multiple identifications.
  2. Use of any identifier should return information recorded using any identifiers for the same subject.
  3. Associations are relationships between subjects (role-players) playing roles.
  4. Occurrences are pointers to where subjects are discussed.
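To make the review concrete, here is how the Kurlansky association and an occurrence might be written down (pure illustration, not any topic map syntax; the URL is a placeholder):

```python
# One association: an association type plus role-players in named roles.
association = {
    "type": "written-by",
    "roles": {
        "author": "Mark Kurlansky",
        "work": "Salt: A World History",
    },
}

# Occurrences: pointers to where a subject is discussed.
occurrences = {
    "Mark Kurlansky": ["http://example.org/kurlansky-interview"],  # placeholder URL
}

def plays(assocs, person, role):
    """Find associations where a person plays a given role."""
    return [a for a in assocs if a["roles"].get(role) == person]

found = plays([association], "Mark Kurlansky", "author")
```

The roles are what let us ask for Mark-as-author without also dredging up Mark-as-husband or Mark-as-speaker.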


Choose any subject that you like to talk about. Now answer the following questions about your subject:

  1. Name at least three ways it is identified when people talk about it.
  2. Name at least one association and the roles in it for your subject.
  3. Name at least three occurrences (discussions) of your subject that you would like to find again.

Congratulations! Except for the syntax, you have just gathered all the information you need for your first topic map.

* The salt example is only one of literally hundreds of thousands of multiple-identifier issues that confront any consumer of information.

Walmart could use topic maps to collate information of interest to its 140 million customers every week and to deliver that as a service both to its customers as well as other commercial consumers of information.

A contrast to the U-Sort-It model of Google, which delivers dumpster loads of information, over and over again, for the same request.

PS: There are more complex issues and nuances of the syntaxes but this is the essence of topic maps.

Detecting Defense Fraud With Topic Maps

Sunday, February 20th, 2011

The New York Times reported Sunday, in Hiding Details of Dubious Deal, U.S. Invokes National Security, on extraordinary fraud in defense contracting.

Gen. Victor E. Renuart, Jr., of the Air Force, former commander of the Northern Command, is quoted as saying:

We’ve seen so many folks with a really great idea, who truly believe their technology is a breakthrough, but it turns out not to be.

OK, but the technology in question was alleged to detect messages in broadcasts from Al Jazeera. (I suppose something more than playing them backwards and hearing, “number 9, number 9, number 9, ….”)

The fact that some nut-job wraps up in a flag and proclaims they want to save American lives should not short-circuit routine sanity checks.

Here’s one where topic maps would be handy:

Build a topic map interface to the Internet Movie Database that has extracted all the technologies used in all the movies listed.

Give each of those technologies a set of attributes so a contracting officer can check them off while reading contract proposals.

For example, in this case:

  • hidden messages
  • TV broadcasts
  • by enemy

Which would return (along with possibly others): Independence Day.
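The contracting officer’s check could be as simple as a subset query over attribute sets (the movie attributes below are made up for illustration, not extracted from IMDb):

```python
# Hypothetical attribute sets extracted from movie plots.
movies = {
    "Independence Day": {"hidden messages", "TV broadcasts", "by enemy", "computer virus"},
    "The Manchurian Candidate": {"hidden messages", "brainwashing"},
}

def already_in_the_movies(claimed):
    """Return movies whose technology attributes cover the claimed technology."""
    return [title for title, attrs in movies.items() if claimed <= attrs]

hits = already_in_the_movies({"hidden messages", "TV broadcasts", "by enemy"})
```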

That should terminate the funding review with a referral to the U.S. Attorney for attempted fraud or the local district attorney for competency hearings.

New technologies are developed all the time but non-fraudulent proposals based upon them can be independently verified before funding.

If it works only for the inventor or in their lab, pass on the opportunity.

BTW, a topic map of who was being funded for what efforts could have made the Air Force aware that other departments had terminated funding with this applicant.

Caution: It is always possible to construct topic maps (or other analysis) in hindsight, that avoid problems that have already happened. That said, topic maps can be used to navigate existing information systems, providing a low impact/risk way to evaluate the utility of topic maps in addressing particular problems.

On Distributed Consistency — Part 1 (MongoDB)

Sunday, February 20th, 2011

On Distributed Consistency — Part 1 (MongoDB)

The first of a six part series on consistency in distributed databases.

From the website:

See also:

  • Part 2 – Eventual Consistency
  • Part 3 – Network Partitions
  • Part 4 – Multi Data Center
  • Part 5 – Multi Writer Eventual Consistency
  • Part 6 – Consistency Chart

For distributed databases, consistency models are a topic of huge importance. We’d like to delve a bit deeper on this topic with a series of articles, discussing subjects such as what model is right for a particular use case. Please jump in and help us in the comments.

Consistency is an issue that will confront distributed topic maps so best to start learning the options now.

Mio – Distributed Skip Graph based ordered KVS

Sunday, February 20th, 2011

Mio – Distributed Skip Graph based ordered KVS by Taro Minowa.

From the slide deck:

Mio is…

  • a distributed ordered KVS
  • memcached + range search
  • Skip Graph based
  • Written in Erlang
  • In alpha quality

On skip graphs, see James Aspnes:

Skip graphs, with Gauri Shah. ACM Transactions on Algorithms, 3(4):37, November 2007. An earlier version appeared in Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, January 2003, pp. 384–393. Available as: arXiv:cs.DS/0306043. [PDF]

Not going to be important in all cases but perhaps important to have in your tool kit.

It will be interesting to see as topic maps develop, the techniques that are successful in particular fields.

So those lessons can be applied in similar situations.

No patterns just yet but not excluding the possibility.

Riak Search

Sunday, February 20th, 2011

Riak Search

From the website:

Riak Search is a distributed, easily-scalable, failure-tolerant, real-time, full-text search engine built around Riak Core and tightly integrated with Riak KV.

Riak Search allows you to find and retrieve your Riak objects using the objects’ values. When a Riak KV bucket has been enabled for Search integration (by installing the Search pre-commit hook), any objects stored in that bucket are also indexed seamlessly in Riak Search.

The Riak Client API can then be used to perform Search queries that return a list of bucket/key pairs matching the query. Alternatively, the query results can be used as the input to a Riak map/reduce operation. Currently the PHP, Python, Ruby, and Erlang APIs support integration with Riak Search.

The indexing of XML data (it takes path/element name as key) is plausible enough. Made me wonder about a slightly different operation.

What if as part of the indexing operation, additional properties were added to the key?

Could be as simple as the DTD/Schema that defines the element or more complex information about the field.
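A sketch of that “properties on the key” idea (a hypothetical indexer in plain Python; the schema_hints argument stands in for whatever DTD/Schema information you have about each element):

```python
import xml.etree.ElementTree as ET

def index_entries(xml_text, schema_hints):
    """Walk an XML document and emit (path, value, properties) index
    entries, where properties carry extra information about the field."""
    root = ET.fromstring(xml_text)
    entries = []

    def walk(elem, path):
        p = f"{path}/{elem.tag}"
        if elem.text and elem.text.strip():
            props = {"type": schema_hints.get(elem.tag, "string")}
            entries.append((p, elem.text.strip(), props))
        for child in elem:
            walk(child, p)

    walk(root, "")
    return entries

xml_text = "<doc><title>Riak Search</title><year>2011</year></doc>"
entries = index_entries(xml_text, {"year": "integer"})
```

With the type riding along on the key, a later query (or merge) can treat 2011 as a number rather than a string, without going back to the schema.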

HSearch NoSQL Search Engine Built on HBase

Sunday, February 20th, 2011

HSearch NoSQL Search Engine Built on HBase

HSearch features include:

  • Multi-XML formats
  • Record and document level search access control
  • Continuous index updation
  • Parallel indexing using multi-machines
  • Embeddable inside application
  • A REST-ful Web service gateway that supports XML
  • Auto sharding
  • Auto replication

Original title and link: HSearch: NoSQL Search Engine Built on HBase (NoSQL databases © myNoSQL)

Another entry in the NoSQL arena.

I don’t recall, but was parallel querying discussed for TMQL?

AllegroMCODE: GPU-accelerated Cytoscape Plugin
TM Explorer?

Sunday, February 20th, 2011

AllegroMCODE: GPU-accelerated Cytoscape Plugin

From the website:

AllegroMCODE is a high-performance Cytoscape plugin to find clusters, or highly interconnected groups of nodes, in a huge complex network such as a protein interaction network or a social network in real time. AllegroMCODE finds the same clusters as the MCODE plugin does, but the analysis usually takes less than a second even for a large complex network. The plugin user interface of AllegroMCODE is based on MCODE and has additional features. AllegroMCODE is open source software and freely available under the LGPL.

Cluster has various meanings depending on the source of the network. For instance, a protein-protein interaction network is represented with proteins as nodes and interactions between proteins as edges. Clusters in the network can be considered protein complexes and functional modules, which can be identified as highly interconnected subgraphs. For social networks, people and their relationships are represented as nodes and edges, respectively. A cluster in such a network can be considered a community with strong inter-relationships among its members.

AllegroMCODE exploits our high-performance GPU computing architecture to make your analysis task faster than ever. The analysis task of the MCODE algorithm to find the clusters can be long for large complex networks, even though MCODE is a relatively fast method of clustering. AllegroMCODE provides our parallel algorithm implementation based on the original sequential MCODE algorithm. It can achieve two orders of magnitude speedup for the analysis of a large complex network by using the latest graphics card. You can also exploit the GPU acceleration without any special graphics hardware, since it provides seamless remote processing on our free GPU computing server.

You do not need to purchase any special GPU hardware or systems, nor worry about the tedious task of installing them. All you have to do is install the AllegroMCODE plugin module on your computer and create a free account on our server.

Simply awesome!

The ability to dynamically explore and configure topic maps will be priceless.

A greater gap than between hot-lead type and a modern word processor.

Will take weeks/months to fully explore but wanted to bring it to your attention.
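For a feel of what MCODE-style clustering does, here is a toy Python sketch. To be clear, this is not the MCODE algorithm itself, just a greedy cluster grower that admits a neighbor only while the subgraph stays dense; the graph, threshold, and function names are all invented for illustration.

```python
from collections import deque

def density(nodes, adj):
    """Edge density of the subgraph induced by `nodes`: actual edges
    divided by the maximum possible for that many nodes."""
    n = len(nodes)
    if n < 2:
        return 0.0
    edges = sum(1 for v in nodes for w in adj[v] if w in nodes) / 2
    return edges / (n * (n - 1) / 2)

def grow_cluster(seed, adj, min_density=0.7):
    """Greedily grow a cluster from `seed`, admitting a neighbor only
    if the enlarged subgraph stays above the density threshold."""
    cluster = {seed}
    frontier = deque(adj[seed])
    while frontier:
        v = frontier.popleft()
        if v in cluster:
            continue
        if density(cluster | {v}, adj) >= min_density:
            cluster.add(v)
            frontier.extend(adj[v])
    return cluster

# Toy graph: a tight triangle {a, b, c} loosely attached to d.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
cluster = grow_cluster("a", adj)  # the triangle; d's attachment is too sparse
```

The GPU win in AllegroMCODE comes from evaluating many such candidate expansions in parallel rather than one at a time.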

A thought on Hard vs Soft – Post
(nonIdentification vs. multiIdentification?)

Sunday, February 20th, 2011

A thought on Hard vs Soft by Dru Sellers starts off with:

With the move from RDBMS to NoSQL are we seeing the same shift that we saw when we moved from Hardware to Software. Are we seeing a shift from Harddata to Softdata? (emphasis in original)

See his post for the rest of the post and the replies.

Do topic maps address a similar hardIdentification vs. softIdentification?

By hardIdentification I mean a single identification.

But it goes further than that doesn’t it?

There isn’t even a single identification in most information systems.

Think about it. You and I both see the same column names and have different ideas of what they mean.

I remember reading in Doan’s dissertation (see Auditable Reconciliation) that a schema reconciliation project would have taken 12 person years but for the original authors being available.

We don’t have any idea what has been identified in most systems and no way to compare it to other “identifications.”

What is this? Write once, Wonder Many Times (WOWMT)?

So, topic maps really are a leap from nonIdentification to multiIdentification.
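To make that leap concrete, here is a hypothetical sketch (all identifiers, column names, and helpers invented) of what multiIdentification buys you: topics carrying several identifiers can merge across systems, where a single opaque column name could not.

```python
# "Hard" identification pins a subject to one opaque name (a column
# header). MultiIdentification lets a topic carry several identifiers,
# so independently authored topics can be recognized as the same subject.

def merge_topics(topics):
    """Union topics whose identifier sets overlap. A single pass, so
    transitive merges are not handled; fine for an illustration."""
    merged = []
    for topic in topics:
        ids = set(topic["identifiers"])
        for existing in merged:
            if ids & existing["identifiers"]:
                existing["identifiers"] |= ids
                existing["names"] |= set(topic["names"])
                break
        else:
            merged.append({"identifiers": ids, "names": set(topic["names"])})
    return merged

# Two systems describe the same column under different headers, but
# share one public identifier, so they merge into one subject.
topics = [
    {"identifiers": {"hr_db:EMP_NO", "http://example.org/id/employee-number"},
     "names": {"EMP_NO"}},
    {"identifiers": {"payroll:STAFF_ID", "http://example.org/id/employee-number"},
     "names": {"STAFF_ID"}},
]
merged = merge_topics(topics)  # one topic, both names preserved
```

Without the shared identifier, `EMP_NO` and `STAFF_ID` stay two unrelated strings, which is the nonIdentification status quo.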

No wonder it is such a hard sell!

People aren’t accustomed to counting the cost of nonIdentification, and here we are pitching the advantages of multiIdentification.

Pull two tables at random from your database and have a contest to see who outside the IT department can successfully identify what the column headers represent. No data, just the column headers.*

What other ways can we illustrate the issue of nonIdentification?

Interested in hearing your suggestions.

*I will be posting column headers from public data sets and asking you to guess their identifications.

BTW, some will argue that documentation exists for at least some of these columns.

True enough, but from a processing standpoint it may as well be on a one-way mission to Mars.

If the system doesn’t have access to it, it doesn’t exist. (full stop)

Gives you an idea of how impoverished our systems truly are.

IBM’s Watson (the computer, not IBM’s founder, who was also soulless) has been described as deaf and blind. Not only that, but it has no more information than it is given. It cannot ask for more. The life of a pocket calculator, if it had emotions, would be sad.

Cascalog: Clojure-based Query Language for Hadoop – Post

Saturday, February 19th, 2011

Cascalog: Clojure-based Query Language for Hadoop

From the post:

Cascalog, introduced in the linked article, is a query language for Hadoop featuring:

  • Simple – Functions, filters, and aggregators all use the same syntax. Joins are implicit and natural.
  • Expressive – Logical composition is very powerful, and you can run arbitrary Clojure code in your query with little effort.
  • Interactive – Run queries from the Clojure REPL.
  • Scalable – Cascalog queries run as a series of MapReduce jobs.
  • Query anything – Query HDFS data, database data, and/or local data by making use of Cascading’s “Tap” abstraction
  • Careful handling of null values – Null values can make life difficult. Cascalog has a feature called “non-nullable variables” that makes dealing with nulls painless.
  • First class interoperability with Cascading – Operations defined for Cascalog can be used in a Cascading flow and vice-versa
  • First class interoperability with Clojure – Can use regular Clojure functions as operations or filters, and since Cascalog is a Clojure DSL, you can use it in other Clojure code.

From Alex Popescu’s myNoSQL

There are a number of NoSQL query languages.

Which should be considered alongside TMQL4J in TMQL discussions.

ElephantDB
Saturday, February 19th, 2011

ElephantDB
From the website:

ElephantDB is a database that specializes in exporting key/value data from Hadoop. ElephantDB is composed of two components. The first is a library that is used in MapReduce jobs for creating an indexed key/value dataset that is stored on a distributed filesystem. The second component is a daemon that can download a subset of a dataset and serve it in a read-only, random-access fashion. A group of machines working together to serve a full dataset is called a ring.

Since the ElephantDB server doesn’t support random writes, it is almost laughably simple. Once the server loads up its subset of the data, it does very little. This leads to ElephantDB being rock-solid in production, since there are almost no moving parts.

ElephantDB server has a Thrift interface, so any language can make reads from it. The database itself is implemented in Clojure.

I rather like that, “…almost no moving parts.”

That has to pay real dividends over the id shuffle in some topic map implementations. Both in terms of processing overhead as well as in auditing.
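The ring idea can be sketched in a few lines of Python. This is a toy illustration of hash-partitioned, read-only serving, not ElephantDB's actual implementation; the partition count, class names, and sample data are all invented.

```python
import hashlib

NUM_PARTITIONS = 4

def partition(key):
    """Stable hash of the key onto one of NUM_PARTITIONS shards."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

class ShardServer:
    """One ring member: loads its subset of the dataset once at startup,
    then serves reads only. No writes means almost no moving parts."""
    def __init__(self, shard_id, dataset):
        self.shard_id = shard_id
        self.data = {k: v for k, v in dataset.items()
                     if partition(k) == shard_id}

    def get(self, key):
        return self.data.get(key)

dataset = {"user:1": "alice", "user:2": "bob", "user:3": "carol"}
ring = [ShardServer(i, dataset) for i in range(NUM_PARTITIONS)]

def lookup(key):
    """Route a read to the ring member that owns the key's partition."""
    return ring[partition(key)].get(key)
```

Because the partition function is stable, a client can route any read directly to the right server with no coordination, which is where the "rock-solid in production" claim comes from.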

Group Theoretical Methods and Machine Learning

Saturday, February 19th, 2011

Group Theory and Machine Learning

From the website:
The use of algebraic methods—specifically group theory, representation theory, and even some concepts from algebraic geometry—is an emerging new direction in machine learning. The purpose of this tutorial is to give an entertaining but informative introduction to the background to these developments and sketch some of the many possible applications, including multi-object tracking, learning rankings, and constructing translation and rotation invariant features for image recognition. The tutorial is intended to be palatable by a non-specialist audience with no prior background in abstract algebra.

Be forewarned, tough sledding if you are not already a machine learning sort of person.

But, since I don’t post what I haven’t watched, I did watch the entire video.

It suddenly got interesting just past 93:08, when Risi Kondor started talking about blobs on radar screens and associating information with them…. Wait, run that by me once again: …blobs on radar screens and associating information with them.

Oh, that is what I thought he said.

I suppose for fire control systems and the like as well as civilian applications.

I am so much of a text and information navigation person that I don’t often think about other applications for “pattern recognition” and the like.

With all the international traveling I used to do, being a blob on a radar screen got my attention!

Has applications in tracking animals in the wild and other tracking with sensor data.

Another illustration of why topic maps need an open-ended and extensible notion of subject identification.

What we think of as methods of subject identification may not be what others think of as methods of subject identification.

IOM-NAE Health Data Collegiate Challenge – 27 April 2011 Deadline

Saturday, February 19th, 2011

IOM-NAE Health Data Collegiate Challenge

From the website:

The IOM and the National Academy of Engineering (NAE) of the National Academies invite college and university students to participate in an exciting, new initiative to transform health data into innovative, effective new applications (apps) and tools that take on the nation’s pressing health issues. With reams of U.S. Department of Health and Human Services (HHS) data and other health data newly available as part of the Health Data Initiative (HDI), students have an unprecedented opportunity to create interactive apps and other tools that engage and empower people in ways that lead to better health. Working in interdisciplinary teams that meld technological skills with health knowledge, the IOM and NAE believe that college students can generate powerful new products—the next “viral app”— to improve health for communities and individuals.

Along with spreading this one on college campuses, we need to also point out the advantages of topic maps!