Archive for April, 2011

When Data Mining Goes Horribly Wrong

Saturday, April 30th, 2011

In When Data Mining Goes Horribly Wrong, Matthew Hurst brings us a cautionary tale about what can happen when “merging” decisions are made badly.

From the blog:

Consequently, when you see a details page – either on Google, Bing or some other search engine with a local search product – you are seeing information synthesized from multiple sources. Of course, these sources may differ in terms of their quality and, as a result, the values they provide for certain attributes.

When combining data from different sources, decisions have to be made as to firstly when to match (that is to say, assert that the data is about the same real world entity) and secondly how to merge (for example: should you take the phone number found in one source or another?).

This process – the conflation of data – is where you either succeed or fail.
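The two decisions Hurst names, when to match and how to merge, can be sketched in a few lines. This is only an illustration: the `Listing` record, the name-only matching rule and the per-source `quality` score are my assumptions, not anything a search engine is known to use.

```python
from dataclasses import dataclass


@dataclass
class Listing:
    source: str
    name: str
    phone: str
    quality: float  # hypothetical trust score for the source, 0.0 to 1.0


def normalize(name: str) -> str:
    """Lowercase and drop punctuation so trivially different names compare equal."""
    return "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace()).strip()


def same_entity(a: Listing, b: Listing) -> bool:
    """Match step: assert that two records describe the same real-world entity."""
    return normalize(a.name) == normalize(b.name)


def merge(a: Listing, b: Listing) -> Listing:
    """Merge step: take the name and phone from the higher-quality record."""
    best = a if a.quality >= b.quality else b
    return Listing(source=f"{a.source}+{b.source}", name=best.name,
                   phone=best.phone, quality=best.quality)


def conflate(listings: list[Listing]) -> list[Listing]:
    """Fold each listing into the first already-merged record it matches."""
    merged: list[Listing] = []
    for item in listings:
        for i, existing in enumerate(merged):
            if same_entity(existing, item):
                merged[i] = merge(existing, item)
                break
        else:
            merged.append(item)
    return merged
```

A matching rule this naive will happily merge two different businesses that share a name, which is exactly the kind of failure the post describes.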

Read Matthew’s post for encouraging signs that there is plenty of room for the use of topic maps.

What I find particularly amusing is that repairing the merge in this case does nothing to prevent the same error from happening again and again.

Not much of a repair if the problem continues to happen elsewhere.

CS 533: Natural Language Processing

Saturday, April 30th, 2011

CS 533: Natural Language Processing

Don’t be misled by the title! This isn’t just another NLP course.

From the Legal Informatics Blog:

Professor Dr. L. Thorne McCarty of the Rutgers University Department of Computer Science has posted lecture videos and other materials in connection with his recent graduate course on Natural Language Processing. The course uses examples from a judicial decision: Carter v. Exxon Company USA, 177 F.3d 197 (3d Cir. 1999).

From Professor McCarty’s post on LinkedIn:

To access most of the files, you will need the username: cs533 and the password: shrdlu. To access the videos, use the same password: shrdlu. Comments are welcome!

NLP and legal materials? Now there is a winning combination!

I must confess that years of practicing law and even now continuing to read legal materials in some areas may be influencing my estimate of the appeal of this course. 😉 I will watch the lectures and get back to you.

Saturday, April 30th, 2011

Collection of sites, blogs, resources, etc., for legal research.

I wasn’t surprised by its U.S.-centric offerings until I saw that it also offered fair coverage of Latin and South America.

What happened to European countries, the EU, Africa, Asia, etc.?

If anyone knows of legal resource collections for other countries or regions, I would like to report them here.

Natural Language Processing for the Working Programmer

Saturday, April 30th, 2011

Natural Language Processing for the Working Programmer

Daniël de Kok and Harm Brouwer have started a book on natural language processing using Haskell.

Functional programming meets NLP!

A work in progress so I am sure the authors would appreciate comments, suggestions, etc.

BTW, there is a blog, Try at Haskell in Linguistics, with posts working through the book.

Artificial Intelligence | Natural Language Processing – Videos

Saturday, April 30th, 2011

I first blogged about Christopher Manning’s Artificial Intelligence | Natural Language Processing back in February, 2011.

Videos of the lectures are now online and I thought that merited a separate update to that entry.

I have also edited that entry to point to the videos.


Bridging the Gulf:…

Saturday, April 30th, 2011

Bridging the Gulf: Communication and Information in Society, Technology, and Work

October 9-13, 2011, New Orleans, Louisiana

From the website:

The ASIST Annual Meeting is the main venue for disseminating research centered on advances in the information sciences and related applications of information technology.

ASIST 2011 builds on the success of the 2010 conference structure and will have the integrated program that is an ASIST strength. This will be achieved using the six reviewing tracks pioneered in 2010, each with its own committee of respected reviewers to ensure that the conference meets your high expectations for standards and quality. These reviewers, experts in their fields, will assist with a rigorous peer-review process.

Important Dates:

  1. Papers, Panels, Workshops & Tutorials
    • Deadline for submissions: May 31
    • Notification to authors: June 28
    • Final copy: July 15
  2. Posters, Demos & Videos:
    • Deadline for submissions: July 1
    • Notification to authors: July 20
    • Final copy: July 27

One of the premier technical conferences for librarians and information professionals in the United States.

The track listings are:

  • Track 1 – Information Behaviour
  • Track 2 – Knowledge Organization
  • Track 3 – Interactive Information & Design
  • Track 4 – Information and Knowledge Management
  • Track 5 – Information Use
  • Track 6 – Economic, Social, and Political Issues

A number of opportunities for topic map based presentations.

The conference being located in New Orleans is yet another reason to attend! The food, music, and street life have to be experienced to be believed. No description would be adequate.

BigGarbage In -> BigGarbage Out

Friday, April 29th, 2011

Taking Hadoop Mainstream

This is just one example of any number of articles that lament how hard Hadoop is to explain to non-technical users.

Apparently there is an anticipated flood of applications that will have Hadoop “under the hood” so to speak that are due out later this year.

While I don’t doubt it will be true that enormous amounts of data will be analyzed by those applications, without some underlying understanding of the data, will the results be meaningful?

Note that I said the data and not Hadoop.

Understanding Hadoop is just a technicality.

An important one, but whether one uses a cigar box with paper and pencil or the latest non-down cloud infrastructure with Hadoop, understanding the data and the operations to be performed upon it is far more important.

Processing large amounts of data will not be cheap and so the results of necessity will be seen as reliable. Yes? Or else we would not have spent all that money and you can see the answer to the problem is….

You can hear the future conversations as clearly as I can.

BigData simply means you have a big pile of data. (I forgo the use of the other term.)

Whether you can extract meaningful results depends on the same factors as before the advent of “BigData.”

The principal one being an understanding of the data and its limitations. Which means human analysis of the data set and its gathering.

Data sets (large or not) are typically generated or used by staff and capturing their insights into particular aspects of a data set can be easily done using a topic map.

A topic map can collect and coordinate multiple views on the use and limitations of data sets.

Subsequent users don’t discover too late that a particular data set is unreliable or limited in some unforeseen way.

Hadoop is an important emerging technology subject to the rule:

BigGarbage In -> BigGarbage Out.

Duolingo: The Next Chapter in Human Communication

Friday, April 29th, 2011

Duolingo: The Next Chapter in Human Communication

By one of the co-inventors of CAPTCHA and reCAPTCHA, Luis von Ahn, so his arguments should give us pause.

Luis wants to address the problem of translating the web into multiple languages.

Yes, you heard that right, translate the web into multiple languages.

Whatever you think now, watch the video and decide if you still feel the same way.

My question is how to adapt his techniques to subject identification?

Tamana – Release

Friday, April 29th, 2011

Tamana: A generic topic map browser.

A new release from the Topic Maps Lab.

I haven’t looked at it yet but plan to over the weekend.

Building Erlang Applications with Rebar

Friday, April 29th, 2011

Since Erlang underlies Riak, a NoSQL distributed key-value store, I thought this might be of interest:

Building Erlang Applications with Rebar

From the description:

Rebar is an Open Source project that provides a set of standardized build tools for OTP applications and releases. This talk will explore how to use Rebar in a typical development environment for everything from simple Erlang code to port drivers and embedded Erlang servers.

Talk Objectives: Introduce the major features and functionality that rebar provides to Erlang developers. Examine the architecture of rebar and discuss how it can be extended.

Target Audience: Any Erlang developer interested in using or extending rebar for their build system.

Additional references:

Rebar: Erlang Build Tool

Erlang App. Management with Rebar (another tutorial)

Rebar Mailing List

Horton: online query execution on large distributed graphs

Friday, April 29th, 2011

Horton: online query execution on large distributed graphs by Sameh Elnikety, Microsoft Research.

The presentation addresses three problems with large, distributed graphs:

  1. How to partition the graph
  2. How to query the graph
  3. How to update the graph

Investigates a graph query language, execution engine and optimizer, and concludes with initial results.

ARF Graph Layout

Friday, April 29th, 2011

ARF Graph Layout

From the website:

As Networks and their structure have become a major field of research a strong demand to visualize these networks has emerged. We address this challenge by formalizing the well established spring layout in terms of dynamic equations. We thus opening up the design space for new algorithms. Drawing from the knowledge of systems design, we derive a layout algorithm that remedies several drawbacks of the original spring layout. This new algorithm relies on the balancing of two antagonistic forces. We thus call it ARF for “attractive and repulsive forces”. It is, as we claim, particularly suited for dynamic layout of smaller networks. We back this claim with several application examples from on going complex systems research.

Reference and Response

Thursday, April 28th, 2011

Reference and Response by Louis deRosset. Australasian Journal of Philosophy, March 2011, Vol. 89, No. 1.

Before you skip this entry, realize that this article may shine light on why Linked Data works at all and quite possibly how to improve subject identification for Linked Data and topic maps as well.


A standard view of reference holds that a speaker’s use of a name refers to a certain thing in virtue of the speaker’s associating a condition with that use that singles the referent out. This view has been criticized by Saul Kripke as empirically inadequate. Recently, however, it has been argued that a version of the standard view, a response-based theory of reference, survives the charge of empirical inadequacy by allowing that associated conditions may be largely or even entirely implicit. This paper argues that response-based theories of reference are prey to a variant of the empirical inadequacy objection, because they are ill-suited to accommodate the successful use of proper names by pre-school children. Further, I argue that there is reason to believe that normal adults are, by and large, no different from children with respect to how the referents of their names are determined. I conclude that speakers typically refer positionally: the referent of a use of a proper name is typically determined by aspects of the speaker’s position, rather than by associated conditions present, however implicitly, in her psychology.

With apologies to the author, I would sum up his position (sorry) on referents as this: we use proper nouns to identify particular people because we have learned those references from others, that is, because of our position in a community of users of that referent.

That is to say that all the characteristics that we can recite when called upon to say why we have identified a particular person are much like logic that justifies, after the fact, mathematical theorems and insights. Mathematical theorems and insights being “seen” first and then “proved” as justification for others.

Interesting. Another reason why computers do so poorly at subject identification. Computers are asked to act as we imagine ourselves identifying subjects and not how we identify them in fact.

How does that help with Linked Data and topic maps?

First, I would extend the author’s argument to all referents.

Second, it reveals that the debate over URIs/URLs versus properties as the way to identify a subject is really a canard.

What is important, in terms of subject identification, is the origin of the identification.

For example, if “positionally” I am using .lg as used in Unix in a Nutshell, page 12-7, that is all you need to know to distinguish its reference from all the trash that a web search engine returns.

Adding up properties of the “Ligature mode” of Nroff/Troff isn’t going to get you any closer to the referent of .lg, because that isn’t how anyone who uses .lg in the same sense I do came to identify it.

The hot question is how to capture our positional identification of subjects.

Which would include when two or more references are for the same subject.

PS: I rather like deRosset’s conclusion:

Someone, long ago, was well-placed to refer to Cicero. Now, because of our de facto historical position, we are well-placed to refer to Cicero, even though we (or those of us without classical education) wouldn’t know Cicero from Seneca. We don’t need to be able to point to him, or apprehend some condition which singles him out (other, perhaps, than being Cicero). Possessing an appropriately-derived use of ‘Cicero’ suffices. According to the theory of evolution by natural selection, so long as we are appropriately situated (i.e., so long as our local environment is relevantly similar to our ancestors’), we benefit from our biological ancestors’ reproductive successes. Similarly, when we refer positionally, so long as we are appropriately situated, we benefit from our linguistic ancestors’ referential successes. In neither case do the conditions by which we benefit have to be present, even implicitly, in our psychology.

The Rise of Hadoop…

Thursday, April 28th, 2011

The Rise of Hadoop: How many Hadoop-related solutions exist?

Alex Popescu of myNoSQL enhances a CMSWire listing of fourteen (14) different Hadoop solutions by adding pointers to most of the solutions.

Thanks to Alex for that!

It always puzzles me when “content providers” refer to an online site or software that can be reached online, but don’t include a link.

Easy enough to search and including the link takes time, but only once. Every reader is saved time by the presence of a link.

PS: From the CMSWire article:

Hadoop is Hard

Hadoop is not the most intuitive and easy-to-use technology. Many of the recent startups that have emerged to challenge Cloudera’s dominance have the exclusive value proposition that they make it easier to get answers from the software by abstracting the functions to higher-level products. But none of the companies has found the magic solution to bring the learning curve to a reasonable level.

Do you think topic maps could assist in ….bring[ing] the learning curve to a reasonable level?

If so, how?

Scientific graphs Generators plugin

Thursday, April 28th, 2011

Scientific graphs Generators plugin

A new plugin for Gephi, described as:

Cezary Bartosiak and Rafał Kasprzyk just released the Complex Generators plugin, introducing many awaited scientific generators. These generators are extremely useful for scientists, as they help to simulate various real networks. They can test their models and algorithms on well-studied graph examples. For instance, the Watts-Strogatz generator creates networks as described by Duncan Watts in his Six Degrees book.

The plugin contains the following generators:

  • Balanced Tree
  • Barabasi Albert
  • Barabasi Albert Generalized
  • Barabasi Albert Simplified A
  • Barabasi Albert Simplified B
  • Erdos Renyi Gnm
  • Erdos Renyi Gnp
  • Kleinberg
  • Watts Strogatz Alpha
  • Watts Strogatz Beta
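For readers who have not met these generators, here is a minimal sketch of the Watts-Strogatz beta model in plain Python. It is not the plugin's code; the function name and parameters are my own, and it follows the usual textbook description: build a ring lattice, then rewire each edge with probability beta.

```python
import random


def watts_strogatz(n, k, beta, seed=None):
    """Return an undirected edge set for a Watts-Strogatz (beta model) graph.

    n: number of nodes; k: even number of ring neighbours per node;
    beta: probability of rewiring each lattice edge.
    """
    if k % 2 or not 0 < k < n:
        raise ValueError("k must be even and satisfy 0 < k < n")
    rng = random.Random(seed)

    # Step 1: ring lattice, each node joined to its k/2 nearest
    # neighbours on either side. Edges are stored as (low, high) pairs.
    edges = set()
    for u in range(n):
        for offset in range(1, k // 2 + 1):
            v = (u + offset) % n
            edges.add((min(u, v), max(u, v)))

    # Step 2: visit every lattice edge once and, with probability beta,
    # move its far end to a node that creates no self-loop or duplicate.
    for u in range(n):
        for offset in range(1, k // 2 + 1):
            if rng.random() >= beta:
                continue
            v = (u + offset) % n
            candidates = [w for w in range(n)
                          if w != u and (min(u, w), max(u, w)) not in edges]
            if candidates:
                w = rng.choice(candidates)
                edges.remove((min(u, v), max(u, v)))
                edges.add((min(u, w), max(u, w)))
    return edges
```

With beta at 0 you get the regular lattice back; as beta approaches 1 the result looks more and more like a random graph, and the edge count stays at n*k/2 throughout.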

NodeBox – Graphs

Thursday, April 28th, 2011

NodeBox – Graphs

Part of NodeBox, a Mac OS X application for visualization using Python.

The graph part is described as:

The NodeBox Graph library includes algorithms from NetworkX for betweenness centrality and eigenvector centrality, Connelly Barnes’ implementation of Dijkstra shortest paths (here) and the spring layout for JavaScript by Aslak Hellesoy and Dave Hoover (here). The goal of this library is visualization of small graphs (<200 elements), if you need something more robust we recommend using NetworkX.
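As a reminder of what the shortest-path piece does, here is a minimal Dijkstra sketch using only the standard library. It is not the Connelly Barnes implementation mentioned above; the adjacency-dict input format is an assumption chosen for brevity.

```python
import heapq


def dijkstra(graph, source):
    """Return the shortest distance from source to every reachable node.

    graph maps each node to a dict of {neighbour: non-negative edge weight}.
    """
    dist = {source: 0}
    queue = [(0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue  # stale entry: a shorter path to node was already found
        for neighbour, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbour, float("inf")):
                dist[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return dist
```

For graphs of the size NodeBox targets (under 200 elements) something this simple is plenty.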

PS: See my post: NetworkX Introduction… for an introduction to NetworkX.

Dataset linkage recommendation on the Web of Data

Thursday, April 28th, 2011

Dataset linkage recommendation on the Web of Data by Martijn van der Plaat (Master thesis).


We address the problem of, given a particular dataset, which candidate dataset(s) from the Web of Data have the highest chance of holding co-references, in order to increase the efficiency of coreference resolution. Currently, data publishers manually discover and select the right dataset to perform a co-reference resolution. However, in the near future the size of the Web of Data will be such that data publishers can no longer determine which datasets are candidate to map to. A solution for this problem is finding a method to automatically recommend a list of candidate datasets from the Web of Data and present this to the data publisher as an input for the mapping.

We proposed two solutions to perform the dataset linkage recommendation. The general idea behind our solutions is predicting the chance a particular dataset on the Web of Data holds co-references with respect to the dataset from the data publisher. This prediction is done by generating a profile for each dataset from the Web of Data. A profile is meta-data that represents the structure of a dataset, in terms of used vocabularies, class types, and property types. Subsequently, dataset profiles that correspond with the dataset profile from the data publisher, get a specific weight value. Datasets with the highest weight values have the highest chance of holding co-references.
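To make the profile idea concrete, here is a rough sketch of ranking candidate datasets by profile overlap. The abstract does not say how the weights are computed, so the Jaccard overlap, the equal weighting of the three facets and all of the names below are my assumptions, not the thesis's method.

```python
def profile_weight(candidate, target):
    """Average Jaccard overlap across vocabularies, class types and property types."""
    facets = ("vocabularies", "classes", "properties")
    scores = []
    for facet in facets:
        a = set(candidate.get(facet, ()))
        b = set(target.get(facet, ()))
        union = a | b
        scores.append(len(a & b) / len(union) if union else 0.0)
    return sum(scores) / len(facets)


def recommend(target, candidates, top=3):
    """Return the names of the candidate datasets with the highest weights."""
    ranked = sorted(candidates.items(),
                    key=lambda item: profile_weight(item[1], target),
                    reverse=True)
    return [name for name, _ in ranked[:top]]
```

Even in a toy like this, the ranking is only as good as the profiles fed into it.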

A useful exercise but what happens when data sets have inconsistent profiles from different sources?

And for all the drum banging, only a very tiny portion of all available datasets is part of Linked Data.

How do we evaluate the scalability of such a profiling technique?

Hard economic lessons for news

Wednesday, April 27th, 2011

Hard economic lessons for news

I saw this in the TechDirt Daily Email. Mike Masnick offered the following summary:

  • Tradition is not a business model. The past is no longer a reliable guide to future success.
  • “Should” is not a business model. You can say that people “should” pay for your product but they will only if they find value in it.
  • Virtue is not a business model. Just because you do good does not mean you deserve to be paid for it.
  • Business models are not made of entitlements and emotions. They are made of hard economics. Money has no heart.
  • Begging is not a business model. It’s lazy to think that foundations and contributions can solve news’ problems. There isn’t enough money there.
  • No one cares what you spent. Arguing that news costs a lot is irrelevant to the market.

One or more of these themes have been offered as justifications for semantic technologies, including topic maps.

I would add for semantic technologies:

  • Saving the world isn’t a business model. Try for something with more immediate and measurable results.
  • Cult like faith in linking isn’t a solution, it’s a delusion. Linking per se is not a value, successful results (by whatever means) are.
  • Sharing my data isn’t a goal. Sharing in someone else’s data is.

MaJorToM server v. 1.0.0 released

Wednesday, April 27th, 2011

MaJorToM server v. 1.0.0 released

From the post:

The MaJorToM server is a Spring application which provides a TMQL HTTP interface to query topic maps. The server is intended as back end for Topic Maps based applications.

Today the Topic Maps Lab has released the version 1.0.0 of the MaJorToM server. The server acts as TMQL endpoint for Topic Maps based applications. You can play around with an instance of the server here.

The sources of the MaJorToM server are available at Google code. In the installation instructions you will learn how you can build and deploy your own instances of the MaJorToM server. Once you have deployed the server for your own, you will have an administration interface and a TMQL interface. But the server is not only a TMQL endpoint for the data. It also provides full text search (based on Beru) and a SPARQL endpoint to the hosted topic maps.

With the next release Maiana will act as a frontend for topic maps delivered by remote MaJorToM server. Besides the Maiana integration the MaJorToM server is very well integrated with TM2O – the OData provider for Topic Maps.

Solr Result Grouping / Field Collapsing Improvements

Wednesday, April 27th, 2011

Solr Result Grouping / Field Collapsing Improvements by Yonik Seeley

From the post:

I previously introduced Solr’s Result Grouping, also called Field Collapsing, that limits the number of documents shown for each “group”, normally defined as the unique values in a field or function query.

Since then, there have been a number of bug fixes, performance improvements, and feature enhancements. You’ll need a recent nightly build of Solr 4.0-dev, or the newly released LucidWorks Enterprise v1.6, our commercial version of Solr.

A short but useful article on new grouping capabilities in Solr.
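The idea of limiting the documents shown per group is easy to picture outside Solr. The sketch below is plain Python, not Solr's API or implementation; the function name and the dict-shaped documents are assumptions for illustration.

```python
def group_results(docs, field, limit=1):
    """Collapse a ranked result list so each distinct field value keeps at most `limit` docs.

    docs is assumed to be sorted by relevance already; groups come back
    in the order their first document appeared.
    """
    groups = {}
    for doc in docs:
        bucket = groups.setdefault(doc.get(field), [])
        if len(bucket) < limit:
            bucket.append(doc)
    return groups
```

Calling `group_results(docs, "site", limit=2)` on a ranked list would keep the two best hits per site and drop the rest.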

What you do with results once they are grouped, which could include “merging,” is up to you.

Migration from a Commercial Search Platform (specifically FAST ESP) to Lucene/Solr

Wednesday, April 27th, 2011

Migration from a Commercial Search Platform (specifically FAST ESP) to Lucene/Solr posted by Mitchell Pronschinske.

From the post:

There are many reasons that an IT department with a large scale search installation would want to move from a proprietary platform to Lucene Solr. In the case of FAST Search, the company’s purchase by Microsoft and discontinuation of the Linux platform has created an urgency for FAST users. This presentation will compare Lucene/Solr to FAST ESP on a feature basis, and as applied to an enterprise search installation. We will further explore how various advanced features of commercial enterprise search platforms can be implemented as added functions for Lucene/Solr. Actual cases will be presented describing how to map the various functions between systems.

Excellent presentation if you need to make the case for Lucene/Solr, function by function.

Graph Theory

Wednesday, April 27th, 2011

Graph Theory by Reinhard Diestel.

With all the graph database offerings and papers you may feel the need to brush up on graph theory.

The student edition is quite attractively priced ($9.99).

Springer-Verlag, Heidelberg
Graduate Texts in Mathematics, Volume 173
ISBN 978-3-642-14278-9
July 2010 (2005, 2000, 1997)
451 pages; 125 figures

…Efficient Subgraph Matching on Huge Networks (or, > 1 billion edges < 1 second)

Tuesday, April 26th, 2011

A Budget-Based Algorithm for Efficient Subgraph Matching on Huge Networks by Matthias Bröcheler, Andrea Pugliese, V.S. Subrahmanian. (Presented at GDM 2011.)


As social network and RDF data grow dramatically in size to billions of edges, the ability to scalably answer queries posed over graph datasets becomes increasingly important. In this paper, we consider subgraph matching queries which are often posed to social networks and RDF databases — for such queries, we want to find all matching instances in a graph database. Past work on subgraph matching queries uses static cost models which can be very inaccurate due to long-tailed degree distributions commonly found in real world networks. We propose the BudgetMatch query answering algorithm. BudgetMatch costs and recosts query parts adaptively as it executes and learns more about the search space. We show that using this strategy, BudgetMatch can quickly answer complex subgraph queries on very large graph data. Specifically, on a real world social media data set consisting of 1.12 billion edges, we can answer complex subgraph queries in under one second and significantly outperform existing subgraph matching algorithms.

Built on top of Neo4j, BudgetMatch dynamically updates the budgets assigned to vertices.

Aggressive pruning gives some rather attractive results.

Key-Key-Value Stores for Efficiently Processing Graph Data in the Cloud

Tuesday, April 26th, 2011

Key-Key-Value Stores for Efficiently Processing Graph Data in the Cloud by Alexander G. Connor, Panos K. Chrysanthis, Alexandros Labrinidis.

More slides from GDM 2011.

Some of the slides don’t present well but you can get the gist of what is being said.

Develops an extension to the key-value store, the key-key-value store, which partitions data into sets of related keys.

Not enough detail. Will watch for the paper.

Data Beats Math

Tuesday, April 26th, 2011

Data Beats Math

A more recent post by Jeff Jonas.

Topic maps can capture observations, judgments, conclusions from human analysts.

Do those beat math as well?

It’s All About the Librarian! New Paradigms in Enterprise Discovery and Awareness – Post

Tuesday, April 26th, 2011

It’s All About the Librarian! New Paradigms in Enterprise Discovery and Awareness

This is a must read post by Jeff Jonas.

I won’t spoil your fun but Jeff defines terms such as:

  • Context-less Card Catalogs
  • Semantically Reconciled Directories
  • Semantically Reconciled and Relationship Aware Directories

and a number of others.

Looks very much like he is interested in the same issues as topic maps.

Take the time to read it and see what you think.

Inside Horizon: interactive analysis at cloud scale

Monday, April 25th, 2011

Inside Horizon: interactive analysis at cloud scale

From the website:

Late last year, we were honored to be invited to talk at Reflections|Projections, ACM@UIUC’s annual student-run computing conference. We decided to bring a talk about Horizon, our system for doing aggregate analysis and filtering across very large amounts of data. The video of the talk was posted a few weeks back on the conference website.

Horizon started as research project / technology demonstrator built as part of Palantir’s Hack Week – a periodic innovation sprint that our engineering team uses to build brand new ideas from whole cloth. It was then used by the Center For Public Integrity in their Who’s Behind The Subprime Meltdown report. We produced a short video on the subject, Beyond the Cloud: Project Horizon, released on our analysis blog. Subsequently, it was folded into our product offering, under the name Object Explorer.

In this hour-long talk, two of the engineers that built this technology tell the story of how Horizon came to be, how it works, and show a live demo of doing analysis on hundreds of millions of records in interactive time.

From the presentation:

Mission statement: Organize the world’s information and make it universally accessible and useful. -> Google’s statement

Which should say:

Organize the world’s [public] information and make it universally accessible and useful.

Palantir’s mission:

Organize the world’s [private] information and make it universally accessible and useful.

Closes on human-driven analysis.

A couple of points:

The demo was of a pre-beta version even though the product version shipped several months prior to the presentation. What’s with that?

Long on general statements and short on any specifics.

Did mention this is a column-store solution. Appears to work well with very clean data, but then what solution doesn’t?

Good emphasis on user interface and interactive responses to queries.

I wonder if the emphasis on interactive responses creates unrealistic expectations among customers?

Or an emphasis on problems that can be solved or appear to be solvable, interactively?

Take my comments about intelligence community bias the other day, for example. You can measure and visualize tweets that originate in Tahrir Square, but if they are mostly from Western media, how meaningful is that?

GO Annotation Tools

Monday, April 25th, 2011

GO Annotation Tools

Sponsored by the Gene Ontology (GO) Consortium, which I covered here, but I thought the tools merited separate mention.

Many of these tools will be directly applicable to bioinformatics use of topic maps and/or will give you ideas for similar tools in other domains.

The igraph library

Monday, April 25th, 2011

The igraph library

From the website:

igraph is a free software package for creating and manipulating undirected and directed graphs. It includes implementations for classic graph theory problems like minimum spanning trees and network flow, and also implements algorithms for some recent network analysis methods, like community structure search.

The efficient implementation of igraph allows it to handle graphs with millions of vertices and edges. The rule of thumb is that if your graph fits into the physical memory then igraph can handle it.


  • igraph contains functions for generating regular and random graphs according to many algorithms and models from the network theory literature.
  • igraph provides routines for manipulating graphs, adding and removing edges and vertices.
  • You can assign numeric or textual attribute to the vertices or edges of the graph, like edge weights or textual vertex ids.
  • A rich set of functions calculating various structural properties, eg. betweenness, PageRank, k-cores, network motifs, etc. are also included.
  • Force based layout generators for small and large graphs
  • The R package and the Python module can visualize graphs many ways, in 2D and 3D, interactively or non-interactively.
  • igraph provides data types for implementing your own algorithm in C, R, Python or Ruby.
  • Community structure detection algorithms using many recently developed heuristics.
  • igraph can read and write many file formats, e.g., GraphML, GML or Pajek.
  • igraph contains efficient functions for deciding graph isomorphism and subgraph isomorphism
  • It also contains an implementation of the push/relabel algorithm for calculating maximum network flow, and this way minimum cuts, vertex and edge connectivity.
  • igraph is well documented both for users and developers.
  • igraph is open source and distributed under GNU GPL.