Archive for March, 2011

Wandora – New Release

Thursday, March 31st, 2011

Wandora – New Release

Latest new feature:

GATE (General Architecture for Text Engineering) is a mature and actively used software framework for computational tasks involving human language. GATE was developed at the University of Sheffield and is free, open source software. ANNIE (A Nearly-New Information Extraction System) is a GATE component used for information extraction: it extracts information from unstructured text. Wandora features a tool called GATE Annie that uses GATE and ANNIE to extract topics and associations from a given text, such as an occurrence. The tool is located in the Wandora application menu under File > Extract > Classification, and it is also available in the occurrence editor and the browser plugin.

GATE and ANNIE are included in the Wandora distribution package, and the embedded GATE Annie tool processes the given text locally.

See: for more details.

….object coreference on the semantic web (and a question)

Thursday, March 31st, 2011

A self-training approach for resolving object coreference on the semantic web by Wei Hu, Jianfeng Chen, and Yuzhong Qu, all of Nanjing University, Nanjing, China.


An object on the Semantic Web is likely to be denoted with multiple URIs by different parties. Object coreference resolution is to identify “equivalent” URIs that denote the same object. Driven by the Linking Open Data (LOD) initiative, millions of URIs have been explicitly linked with owl:sameAs statements, but potentially coreferent ones are still considerable. Existing approaches address the problem mainly from two directions: one is based upon equivalence inference mandated by OWL semantics, which finds semantically coreferent URIs but probably omits many potential ones; the other is via similarity computation between property-value pairs, which is not always accurate enough. In this paper, we propose a self-training approach for object coreference resolution on the Semantic Web, which leverages the two classes of approaches to bridge the gap between semantically coreferent URIs and potential candidates. For an object URI, we firstly establish a kernel that consists of semantically coreferent URIs based on owl:sameAs, (inverse) functional properties and (max-)cardinalities, and then extend such kernel iteratively in terms of discriminative property-value pairs in the descriptions of URIs. In particular, the discriminability is learnt with a statistical measurement, which not only exploits key characteristics for representing an object, but also takes into account the matchability between properties from pragmatics. In addition, frequent property combinations are mined to improve the accuracy of the resolution. We implement a scalable system and demonstrate that our approach achieves good precision and recall for resolving object coreference, on both benchmark and large-scale datasets.

Interesting work.

In particular the use of property-value pairs in the service of discovering similarity.

So, why are users limited to owl:sameAs?

If machines can discover property-value pairs that identify “objects,” then why not enable users to declare property-value pairs that identify the same “objects?”

Such declarations could be used by both machines and users.
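The kernel-building step the abstract describes, merging URIs linked by owl:sameAs and then extending the kernel, can be sketched with a union-find structure. Everything below is illustrative: the URIs, the property-value descriptions, and the "declared identifying pair" are hypothetical, chosen only to show how a user-declared pair could merge clusters exactly the way an owl:sameAs statement does:

```python
class UnionFind:
    """Disjoint-set structure for clustering coreferent URIs."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb

# owl:sameAs statements (hypothetical URIs, for illustration only).
same_as = [
    ("http://dbpedia.org/resource/Tchaikovsky", "http://example.org/composer/42"),
    ("http://example.org/composer/42", "http://musicbrainz.org/artist/tchaikovsky"),
]

uf = UnionFind()
for a, b in same_as:
    uf.union(a, b)

# A user-declared identifying property-value pair extends the kernel:
# any two URIs that share it are merged, just like an owl:sameAs link.
declared_pair = ("birthDate", "1840-05-07")
declared_keys = {}  # (property, value) -> representative URI
descriptions = {
    "http://viaf.org/viaf/xyz": [("birthDate", "1840-05-07"), ("surname", "Tchaikovsky")],
    "http://dbpedia.org/resource/Tchaikovsky": [("birthDate", "1840-05-07"), ("surname", "Tchaikovsky")],
}
for uri, pvs in descriptions.items():
    for pv in pvs:
        if pv == declared_pair:
            if pv in declared_keys:
                uf.union(uri, declared_keys[pv])
            else:
                declared_keys[pv] = uri

cluster = {u for u in uf.parent
           if uf.find(u) == uf.find("http://example.org/composer/42")}
print(len(cluster))  # -> 4: all four URIs end up in one coreference cluster
```

The point of the sketch: the machinery that consumes owl:sameAs is the same machinery that could consume user-declared identifying property-value pairs.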

The web of topics: discovering the topology of topic evolution in a corpus

Thursday, March 31st, 2011

The web of topics: discovering the topology of topic evolution in a corpus by Yookyung Jo, John E. Hopcroft, and, Carl Lagoze, Cornell University, Ithaca, NY, USA.


In this paper we study how to discover the evolution of topics over time in a time-stamped document collection. Our approach is uniquely designed to capture the rich topology of topic evolution inherent in the corpus. Instead of characterizing the evolving topics at fixed time points, we conceptually define a topic as a quantized unit of evolutionary change in content and discover topics with the time of their appearance in the corpus. Discovered topics are then connected to form a topic evolution graph using a measure derived from the underlying document network. Our approach allows inhomogeneous distribution of topics over time and does not impose any topological restriction in topic evolution graphs. We evaluate our algorithm on the ACM corpus. The topic evolution graphs obtained from the ACM corpus provide an effective and concrete summary of the corpus with remarkably rich topology that are congruent to our background knowledge. In a finer resolution, the graphs reveal concrete information about the corpus that were previously unknown to us, suggesting the utility of our approach as a navigational tool for the corpus.

The term topic is being used in this paper to mean a subject in topic map parlance.

From the paper:

Our work is built on the premise that the words relevant to a topic are distributed over documents such that the distribution is correlated with the underlying document network such as a citation network. Specifically, in our topic discovery methodology, in order to test if a multinomial word distribution derived from a document constitutes a new topic, the following heuristic is used. We check that the distribution is exclusively correlated to the document network by requiring it to be significantly present in other documents that are network neighbors of the given document while suppressing the nondiscriminative words using the background model.

Navigation of a corpus on the basis of such a process would indeed be rich, but it would be even richer were multiple ways to represent the same subjects mapped together.

It would also be interesting to see how the resulting graphs, which included only the document titles and abstracts, compared to graphs constructed using the entire documents.

Unified analysis of streaming news

Thursday, March 31st, 2011

Unified analysis of streaming news by Amr Ahmed, Qirong Ho, Jacob Eisenstein, and, Eric Xing Carnegie Mellon University, Pittsburgh, USA, and Alexander J. Smola and Choon Hui Teo of Yahoo! Research, Santa Clara, CA, USA.

News clustering, categorization and analysis are key components of any news portal. They require algorithms capable of dealing with dynamic data to cluster, interpret and to temporally aggregate news articles. These three tasks are often solved separately. In this paper we present a unified framework to group incoming news articles into temporary but tightly-focused storylines, to identify prevalent topics and key entities within these stories, and to reveal the temporal structure of stories as they evolve. We achieve this by building a hybrid clustering and topic model. To deal with the available wealth of data we build an efficient parallel inference algorithm by sequential Monte Carlo estimation. Time and memory costs are nearly constant in the length of the history, and the approach scales to hundreds of thousands of documents. We demonstrate the efficiency and accuracy on the publicly available TDT dataset and data of a major internet news site.

From the article:

Such an approach combines the strengths of clustering and topic models. We use topics to describe the content of each cluster, and then we draw articles from the associated story. This is a more natural fit for the actual process of how news is created: after an event occurs (the story), several journalists write articles addressing various aspects of the story. While their vocabulary and their view of the story may differ, they will by necessity agree on the key issues related to a story (at least in terms of their vocabulary). Hence, to analyze a stream of incoming news we need to infer a) which (possibly new) cluster could have generated the article and b) which topic mix describes the cluster best.

I single out that part of the paper to remark that the authors first say the vocabulary for a story may vary, and then in the next breath say that the vocabulary will agree on the key issues.

Given the success of their results, it may be that news reporting is more homogeneous in its vocabulary than other forms of writing?

Perhaps news compression, where duplicated content is suppressed but the “fact” of reportage is retained, could make an interesting topic map.

Topincs 5.4.0 Released!

Wednesday, March 30th, 2011

Topincs 5.4.0 Released!

From the release description:


This version allows files to be archived by client-side uploading or with the new server-side store command archive. This feature makes it possible to integrate images, documents and other file types into the ontology. If a file type supports it, thumbnails are created and integrated in the start page and in subject pages.

Other important changes are:

  • The style can now be customized for every store that runs under an installation.
  • The programming interface was extended to support subject identifiers and locators.
  • The commands for backup and restore were improved.
  • The keyboard help on form pages was made stressless.

Download Topincs 5.4.0

Manual, Installation, etc.

If you like keyboard short-cuts, then you will like Topincs. From the manual: “The worst enemy of speedy data entry is the mouse.” See what I mean?

I am not unsympathetic; I have yet to find a graphical XML editor that I like. Some are more tolerable than others, but like? Not yet.

CouchDB Tutorial: Starting to relax with CouchDB

Wednesday, March 30th, 2011

CouchDB Tutorial: Starting to relax with CouchDB

From Alex Popescu’s myNoSQL blog, a pointer to a useful tutorial on CouchDB.

CouchDB homepage

State of the LOD Cloud

Wednesday, March 30th, 2011

State of the LOD Cloud

A more complete resource than the one I referenced in The Linking Open Data cloud diagram.

I haven’t seen any movement towards solving any of the fundamental identity issues with the LOD cloud.

On the other hand, topic mappers can make use of these URIs as names and specify other data published with those URIs to form an actual identification.

One that is reliably interchangeable with others.

I think the emphasis on URIs being dereferenceable is telling.

No one says what happens after a URI is dereferenced, but that is to avoid admitting that a URI alone is insufficient as an identifier.

Playing with Gephi, Bio4j and Go

Wednesday, March 30th, 2011

Playing with Gephi, Bio4j and Go

From the blog:

It had already been some time without having some fun with Gephi so today I told myself: why not trying visualizing the whole Gene Ontology and seeing what happens?

First of all I had to generate the corresponding file in gexf format containing all the terms and relationships belonging to the ontology.

For that I did a small program which uses Bio4j for terms/relationships info retrieval and a couple of XML Gexf wrapper classes from the github project Era7BioinfoXML.

This looks like fun!

And a good way to look at an important data set, that could benefit from a topic map.

The Little MongoDB Book

Wednesday, March 30th, 2011

The Little MongoDB Book

From the webpage:

I’m happy to freely release The Little MongoDB Book; an ebook meant to help people get familiar with MongoDB and answer some of the more common questions they have.

Not complete but a useful short treatment.

Machine Learning

Wednesday, March 30th, 2011

Machine Learning

From the site:

This page documents all the machine learning algorithms present in the library. In particular, there are algorithms for performing classification, regression, clustering, anomaly detection, and feature ranking, as well as algorithms for doing more specialized computations.

A good tutorial and introduction to the general concepts used by most of the objects in this part of the library can be found in the svm example program. After reading this example another good one to consult would be the model selection example program. Finally, if you came here looking for a binary classification or regression tool then I would try the krr_trainer first as it is generally the easiest method to use.

The major design goal of this portion of the library is to provide a highly modular and simple architecture for dealing with kernel algorithms….

Update: Dlib – machine learning. Why I left out the library name I cannot say. Sorry!

The Catsters’ Category Theory Videos

Wednesday, March 30th, 2011

The Catsters’ Category Theory Videos

Courtesy of Edsko de Vries:

Eugenia Cheng and Simon Willerton of the University of Sheffield, a.k.a. The Catsters, have an excellent series of lectures on category theory on YouTube. The only thing missing is some overview of the lectures, which I have provided below. A graphical overview is available too.

I don’t recall ever seeing anyone lecture so excitedly about math. The lectures are not only enjoyable but the enthusiasm is strangely infectious.

Tchaikovsky by any other name

Tuesday, March 29th, 2011

My daughter, a musician and library school student, sent me a link to variations on spellings of Tchaikovsky, which I quote below, followed by some comments.

If you mean, what is the most common way of spelling the composer's name (which in Russian was Пётр Ильич Чайковский) in the English language, then that would be "Pyotr Ilyich Tchaikovsky". But the composer himself used "Tchaikovsky", "Tschaikovsky" and "Tschaikowsky" when writing in other languages, while "Chaykovskiy" would be a more literal transliteration.

Here are some other versions from the Library of Congress catalog:

  • Ciaikovsky, Piotr Ilic
  • Tschaikowsky, Peter Iljitch
  • Tchaikowsky, Peter Iljitch
  • Ciaikovsky, Pjotr Iljc
  • Cajkovskij, Petr Il'ic
  • Tsjaikovsky, Peter Iljitsj
  • Czajkowski, Piotr
  • Chaikovsky, P. I.
  • Csajkovszkij, Pjotr Iljics
  • Tsjaïkovskiej, Pjotr Iljietsj
  • Tjajkovskij, Pjotr Ilitj
  • Čaikovskis, P.
  • Chaĭkovskiĭ, Petr Il'ich
  • Tchaikovski, Piotr
  • Tchaikovski, Piotr Ilyitch
  • Chaĭkovskiĭ, Petr
  • Tchaikovsky, Peter
  • Tchaïkovsky, Piotr Ilitch
  • Tschaikowsky, Pjotr Iljitsch
  • Tschajkowskij, Pjotr Iljitsch
  • Tchaïkovski, P. I.
  • Ciaikovskij, Piotr
  • Ciaikovskji, Piotr Ilijich
  • Tschaikowski, Peter Illic
  • Tjajkovskij, Peter
  • Chaĭkovski, P'otr Ilich
  • Tschaikousky
  • Tschaijkowskij, P. I.
  • Tschaikowsky, P. I.
  • Chaĭkovski, Piotr Ilich
  • Tchaikovsky, Pyotr Ilyich
  • Čajkovskij, Pëtr Ilič
  • Tschaikovsky, Peter Ilyich
  • Tchaikofsky, Peter Ilyitch
  • Tciaikowski, P.
  • Tchaïkovski, Petr Ilitch
  • Ciaikovski, Peter Ilic
  • Tschaikowski, Pjotr
  • Tchaikowsky, Pyotr
  • Tchaikovskij, Piotr Ilic

You can see the original post at:

An impressive list, but it doesn’t begin to touch the ways Tchaikovsky has been indexed in Russian libraries, to say nothing of transliterations of his name in libraries around the world in other languages.

Or how his name has appeared in the literature.

You could search Google Books using all 40 variations listed above, plus variations in other languages.

As could everyone following you.

Or, some enterprising soul could create a topic map that responded with all the actual entries for Tchaikovsky the composer, whichever variation of his name that you used in any language.

Like an index, a topic map is a labor-saving device for the user, because the winnowing of false hits, the addition of resources under variant search terms, and the creation of multiple paths (think spellings) that lead to the same materials have already happened.
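A minimal sketch of that winnowing, using a handful of the catalog variants above as names on a single topic. The topic id and data layout here are my own invention for illustration, not the output of any topic map software:

```python
# Map every catalog variant to a single topic, so any spelling finds
# the same composer. A small sample from the list above:
variants = [
    "Tchaikovsky, Pyotr Ilyich",
    "Tschaikowsky, Peter Iljitch",
    "Ciaikovsky, Piotr Ilic",
    "Czajkowski, Piotr",
    "Chaikovsky, P. I.",
]

topic = {
    "id": "tchaikovsky-composer",  # hypothetical identifier
    "names": set(variants),
}

# Reverse index from any variant (case-insensitive) to the topic id.
name_index = {v.lower(): topic["id"] for v in topic["names"]}

def lookup(name):
    """Return the topic id for any known spelling, or None."""
    return name_index.get(name.lower())

print(lookup("CZAJKOWSKI, PIOTR"))   # -> 'tchaikovsky-composer'
print(lookup("Beethoven, Ludwig"))   # -> None
```

Forty variants or four thousand, the user asks once and the map has already done the forty searches.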

Of course, creation of a topic map, or paying for the use of one created by others, is a line item in the budget.

In a way that paying staff to stare at screen after screen of mind-numbing and quite possibly irrelevant “hits” is not.

We need to find a way to make the same case for topic maps that is made for indexes: that they are labor-saving, important devices.

Contrary to popular belief, SQL and noSQL are really just two sides of the same coin

Tuesday, March 29th, 2011

Contrary to popular belief, SQL and noSQL are really just two sides of the same coin

From the article:

In this article we present a mathematical data model for the most common noSQL databases—namely, key/value relationships—and demonstrate that this data model is the mathematical dual of SQL’s relational data model of foreign-/primary-key relationships. Following established mathematical nomenclature, we refer to the dual of SQL as coSQL. We also show how a single generalization of the relational algebra over sets—namely, monads and monad comprehensions—forms the basis of a common query language for both SQL and noSQL. Despite common wisdom, SQL and coSQL are not diabolically opposed, but instead deeply connected via beautiful mathematical theory.

Just as Codd’s discovery of relational algebra as a formal basis for SQL shifted the database industry from a monopolistically competitive market to an oligopoly and thus propelled a billion-dollar industry around SQL and foreign-/primary-key stores, we believe that our categorical data-model formalization model and monadic query language will allow the same economic growth to occur for coSQL key-value stores.

Considering the authors’ claim that the current SQL oligopoly is worth $32 billion and still growing in double digits, color me interested!


Since they are talking about query languages, maybe the TMQL editors should take a look as well.

Phoebus: Erlang-based Implementation of Google’s Pregel

Tuesday, March 29th, 2011

Phoebus: Erlang-based Implementation of Google’s Pregel

From Alex Popescu’s myNoSQL a report on another parallel graph database engine.

You can also see the source code at the project site.

The project site points to Pregel: a system for large-scale graph processing (2009), a one-page summary about Pregel, but you may find Pregel: a system for large-scale graph processing (2010), at eleven (11) pages of interesting detail, more helpful.

BTW, the following two citations are actually the same paper, literally:

@inproceedings{pregel-spaa09,
author = {Malewicz, Grzegorz and Austern, Matthew H. and Bik, Aart J.C. and Dehnert, James C. and Horn, Ilan and Leiser, Naty and Czajkowski, Grzegorz},
title = {Pregel: a system for large-scale graph processing},
booktitle = {Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures},
series = {SPAA '09},
year = {2009},
isbn = {978-1-60558-606-9},
location = {Calgary, AB, Canada},
pages = {48--48},
numpages = {1},
url = {},
doi = {},
acmid = {1584010},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {distributed computing, graph algorithms},
}


@inproceedings{pregel-podc09,
author = {Malewicz, Grzegorz and Austern, Matthew H. and Bik, Aart J.C. and Dehnert, James C. and Horn, Ilan and Leiser, Naty and Czajkowski, Grzegorz},
title = {Pregel: a system for large-scale graph processing},
booktitle = {Proceedings of the 28th ACM symposium on Principles of distributed computing},
series = {PODC '09},
year = {2009},
isbn = {978-1-60558-396-9},
location = {Calgary, AB, Canada},
pages = {6--6},
numpages = {1},
url = {},
doi = {},
acmid = {1582723},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {distributed computing, graph algorithms},
}

Different DOIs, different citation statistics, same text.

Just so you don’t include both in a bibliography.
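A bibliography tool could catch this kind of duplicate by keying on a normalized title and author string rather than the DOI. A rough sketch (the DOI strings below are placeholders, not the real identifiers):

```python
import re

def norm(s):
    """Lowercase and collapse non-alphanumerics so trivial formatting
    differences do not defeat the comparison."""
    return re.sub(r"[^a-z0-9]+", " ", s.lower()).strip()

entries = [
    {"title": "Pregel: a system for large-scale graph processing",
     "authors": "Malewicz, Grzegorz and Austern, Matthew H.",
     "doi": "doi-placeholder-spaa09"},   # SPAA '09 version
    {"title": "Pregel: A System for Large-Scale Graph Processing",
     "authors": "Malewicz, Grzegorz and Austern, Matthew H.",
     "doi": "doi-placeholder-podc09"},   # PODC '09 version
]

seen = set()
deduped = []
for e in entries:
    key = (norm(e["title"]), norm(e["authors"]))
    if key not in seen:       # first occurrence wins
        seen.add(key)
        deduped.append(e)

print(len(deduped))  # -> 1: both DOIs resolve to the same logical paper
```

Deduplicating on the DOI alone would have kept both, which is exactly the trap above.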

Reverted Indexing

Tuesday, March 29th, 2011

Reverted Indexing

From the website:

Traditional interactive information retrieval systems function by creating inverted lists, or term indexes. For every term in the vocabulary, a list is created that contains the documents in which that term occurs and its frequency within each document. Retrieval algorithms then use these term frequencies alongside other collection statistics to identify matching documents for a query.

Term-based search, however, is just one example of interactive information seeking. Other examples include offering suggestions of documents similar to ones already found, or identifying effective query expansion terms that the user might wish to use. More generally, these fall into several categories: query term suggestion, relevance feedback, and pseudo-relevance feedback.

We can combine the inverted index with the notion of retrievability to create an efficient query expansion algorithm that is useful for a number of applications, such as query expansion and relevance (and pseudo-relevance) feedback. We call this kind of index a reverted index because rather than mapping terms onto documents, it maps document ids onto queries that retrieved the associated documents.
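The inversion the quote describes is easy to sketch: build an ordinary inverted index, run a set of basis queries against it, and then record, per document id, which queries retrieved it. The toy documents and queries below are my own, not from the paper:

```python
from collections import defaultdict

docs = {
    "d1": "apache lucene fuzzy search",
    "d2": "lucene index search engine",
    "d3": "topic maps merging subjects",
}

# Standard inverted index: term -> documents containing it.
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        inverted[term].add(doc_id)

# Reverted index: map each document id to the queries that retrieved it.
queries = {"q_lucene": ["lucene"], "q_search": ["search"], "q_maps": ["maps"]}

reverted = defaultdict(set)
for q_id, terms in queries.items():
    retrieved = set.union(*(inverted[t] for t in terms))
    for doc_id in retrieved:
        reverted[doc_id].add(q_id)

# Documents are now "described" by the queries that find them, which is
# the raw material for expansion and relevance feedback.
print(sorted(reverted["d1"]))  # -> ['q_lucene', 'q_search']
print(sorted(reverted["d3"]))  # -> ['q_maps']
```

In the real system the basis queries and their retrieval scores come from the engine itself; the dictionary flip is the idea.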

As to its performance:

….the short answer is that our query expansion technique outperforms PL2 and Bose-Einstein algorithms (as implemented in Terrier) by 15-20% on several TREC collections. This is just a first stab at implementing and evaluating this indexing, but we are quite excited by the results.

An interesting example of innovative thinking about indexing.

With a useful result.

MongoDB Manual

Tuesday, March 29th, 2011

MongoDB Manual

More of a placeholder for myself than anything else.

I am going to create a page of links to the documentation for all the popular DB projects.

MongoDB with Style

Tuesday, March 29th, 2011

MongoDB with Style

One of the more amusing introductions to use of MongoDB.

Lowering Barriers to Contributions

Tuesday, March 29th, 2011

Lowering Barriers to Contributions

Specifically about Erlang open source projects but possible lessons for topic map (and other) projects in general.

The theory is the easier we make it for people to contribute, the more people will contribute.

Hard to say that will happen in any particular case but I don’t see a downside.

An Introduction to Data Mining

Monday, March 28th, 2011

An Introduction to Data Mining by Dr. Saed Sayad

A very interesting map of data mining, the nodes of which lead to short articles on particular topics.

It is a useful resource for reviewing material on data mining, either as part of a course or for self-study.

While not part of the map, don’t miss the Further Readings link in the bottom left-hand corner.

Good Math, Bad Math – Category Theory

Monday, March 28th, 2011

Good Math, Bad Math – Category Theory

A series of posts on category theory.

Watson – Indexing – Human vs. Computer

Monday, March 28th, 2011

In The importance of theories of knowledge: Indexing and information retrieval as an example [1], Birger Hjørland reviews a deeply flawed study by Lykke and Eslau, Using Thesauri in Enterprise Settings: Indexing or Query Expansion? [2], which concludes in part:

As human indexing is costly, it could be useful and productive to use the human indexer to assign other types of metadata such as contextual metadata, and leave the subject indexing to the computer. (Lykke and Eslau, p. 94)

Hjørland outlines a number of methodological shortcomings of the study which I won’t repeat here.

I would add to the concerns voiced by Hjørland the paper’s failure to account for known indexing issues, such as those encountered in Blair and Maron’s An evaluation of retrieval effectiveness for a full-text document-retrieval system (see Size Really Does Matter…), which was published in 1985. If, more than twenty-five years later, some researchers are not yet aware of the complexities of indexing, one despairs of making genuine progress.

The Text REtrieval Conference (TREC) routinely discusses the complexities of indexing so it isn’t simply a matter of historical (I suppose 25 years qualifies as “historical” in a CS context) literature.

Lykke and Eslau don’t provide enough information to evaluate their findings but it appears they may have proven that it is possible for people to index so poorly that a computer search gives a better result.

Is that a Watson moment?

1. Hjørland, B. (2011). The importance of theories of knowledge: Indexing and information retrieval as an example. Journal of the American Society for Information Science & Technology, 62(1), 72-77.

2. Lykke, M., and Eslau, A.G. (2010). Using thesauri in enterprise settings: Indexing or query expansion? In B. Larsen, J.W. Schneider & F. Åström (Eds.), The Janus faced scholar: A festschrift in honor of Peter Ingwersen (pp. 87-97). Copenhagen: Royal School of Library and Information Science. (Special volume of the ISSI e-newsletter, Vol. 06-S, June 2010). Retrieved March 25, 2011, from

Do the Schimmy…

Monday, March 28th, 2011

I first encountered the reference to the Do the Schimmy… posts at Alex Popescu’s myNoSQL site under Efficient Large-Scale Graph Analysis with Hadoop.

An excellent pair of articles on the use (and improvement of) Hadoop for graph processing.

Do the Schimmy: Efficient Large-Scale Graph Analysis with Hadoop

Question: What do PageRank, the Kevin Bacon game, and DNA sequencing all have in common?

As you might know, PageRank is one of the many features Google uses for computing the importance of a webpage based on the other pages that link to it. The intuition is that pages linked from many important pages are themselves important. In the Kevin Bacon game, we try to find the shortest path from Kevin Bacon to your favorite movie star based on who they were costars with. For example, there is a 2 hop path from Kevin Bacon to Jason Lee: Kevin Bacon starred in A Few Good Men with Tom Cruise, who also starred in Vanilla Sky with Jason Lee. In the case of DNA sequencing, we compute the full genome sequence of a person (~3 billion nucleotides) from many short DNA fragments (~100 nucleotides) by constructing and searching the genome assembly graph. The assembly graph connects fragments with the same or similar sequences, and thus long paths of a particular form can spell out entire genomes.

The common aspect for these and countless other important problems, including those in defense & intelligence, recommendation systems & machine learning, social networking analysis, and business intelligence, is the need to analyze enormous graphs: the Web consists of trillions of interconnected pages, IMDB has millions of movies and movie stars, and sequencing a single human genome requires searching for paths between billions of short DNA fragments. At this scale, searching or analyzing a graph on a single machine would be time-consuming at best and totally impossible at worst, especially when the graph cannot possibly be stored in memory on a single computer.
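The Kevin Bacon example in the quote is just breadth-first search over the costar graph. A toy sketch, single machine and in memory, which is exactly the regime the articles say breaks down at web scale:

```python
from collections import deque

# A toy costar graph built from the example in the quoted post.
costars = {
    "Kevin Bacon": ["Tom Cruise"],
    "Tom Cruise": ["Kevin Bacon", "Jason Lee"],
    "Jason Lee": ["Tom Cruise"],
}

def shortest_hops(graph, start, goal):
    """Breadth-first search: number of hops between two actors, or -1."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nbr in graph.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return -1

print(shortest_hops(costars, "Kevin Bacon", "Jason Lee"))  # -> 2
```

The Hadoop/MapReduce versions in the articles distribute the frontier expansion across machines; the traversal logic is the same.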

Do the Schimmy: Efficient Large-Scale Graph Analysis with Hadoop, Part 2

In part 1, we looked at how extremely large graphs can be represented and analyzed in Hadoop/MapReduce. Here in part 2 we will examine this design in more depth to identify inefficiencies, and present some simple solutions that can be applied to many Hadoop/MapReduce graph algorithms. The speedup using these techniques is substantial: as a prototypical example, we were able to reduce the running time of PageRank on a webgraph with 50.2 million vertices and 1.4 billion edges by as much as 69% on a small 20-core Hadoop cluster at the University of Maryland (full details available here). We expect that similar levels of improvement will carry over to many of the other problems we discussed before (the Kevin Bacon game, and DNA sequence assembly in particular).

Authoring Topic Maps Interfaces

Sunday, March 27th, 2011

In a discussion about authoring interfaces today I had cause to mention the use of styles to enable conversion of documents to SGML/XML.

This was prior to the major word processing formats converting to XML. Yes, there was a dark time with binary formats but I will leave that for another day.

As I recall, the use of styles, if done consistently, was a useful solution for how to reliably convert from binary formats to SGML/XML.

There was only one problem.

It was difficult if not impossible to get users to reliably use styles in their documents.

Which caused all sorts of havoc with the conversion process.
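The styles-based conversion worked roughly like this sketch: a table from style names to element names, with a fallback for paragraphs whose authors ignored the styles, which is exactly where the havoc came from. The style and element names here are hypothetical:

```python
# Hypothetical style-to-element table; real converters mapped word
# processor paragraph styles to SGML/XML elements in much this way.
STYLE_TO_ELEMENT = {
    "Heading 1": "title",
    "Heading 2": "section-title",
    "Body Text": "para",
    "Quote": "blockquote",
}

def convert(paragraphs):
    """Each paragraph is (style_name, text). Unknown or missing styles
    fall back silently to a generic element, losing structure."""
    out = []
    for style, text in paragraphs:
        element = STYLE_TO_ELEMENT.get(style, "para")  # silent fallback
        out.append(f"<{element}>{text}</{element}>")
    return "\n".join(out)

doc = [
    ("Heading 1", "Authoring Interfaces"),
    ("Body Text", "Styles make conversion reliable..."),
    ("Normal", "...but only if authors actually use them."),  # wrong style
]
print(convert(doc))
```

Every paragraph styled "Normal" instead of "Heading 1" came out as just another paragraph, and the document structure had to be repaired by hand.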

I don’t recall seeing any actual studies on users failing to use styles correctly but it was common knowledge at the time.

Does anyone have pointers to literature on the consistent use of styles by users?

I mention that recollection as a starting point for discussion of different levels of topic map authoring interfaces.

That is, users’ willingness to do something consistently is appallingly low.

So we need to design mechanisms to compensate for their lack of consistency (to use a nice term for it).

Rather than expecting me to somehow mark that my use of the term “topic,” when followed immediately by “map,” is not a “topic” in the same sense as Latent Dirichlet Allocation (LDA), the interface should be able to make that distinction on its own.

And when I am writing a blog post on Latent Dirichlet Allocation (LDA), the interface should ask when I use the term “topic” (not followed immediately by “map”) do I mean “topic” in the sense of 13250-2 or do I mean “topic” in the sense of Latent Dirichlet Allocation (LDA)? My response is simply yes/no.

It really has to be that simple.

More complex authoring interfaces should be available, but creating systems that operate in the background of our day-to-day activities, silently gathering up topics, associations and occurrences, would go a long way toward solving some of the adoption problems for topic maps.

We have had spell-check for years.

Why not subject-check? (I will have to think about that part. Could be interesting. Images for people/places/things? We would be asking the person most likely to know, the author.)

Lucene’s FuzzyQuery is 100 times faster in 4.0 (and a topic map tale)

Sunday, March 27th, 2011

Lucene’s FuzzyQuery is 100 times faster in 4.0

I first saw this post mentioned in a tweet by Lars Marius Garshol.

From the post:

There are many exciting improvements in Lucene’s eventual 4.0 (trunk) release, but the awesome speedup to FuzzyQuery really stands out, not only from its incredible gains but also because of the amazing behind-the-scenes story of how it all came to be.

FuzzyQuery matches terms “close” to a specified base term: you specify an allowed maximum edit distance, and any terms within that edit distance from the base term (and, then, the docs containing those terms) are matched.

The QueryParser syntax is term~ or term~N, where N is the maximum allowed number of edits (for older releases N was a confusing float between 0.0 and 1.0, which translates to an equivalent max edit distance through a tricky formula).

FuzzyQuery is great for matching proper names: I can search for mcandless~1 and it will match mccandless (insert c), mcandles (remove s), mkandless (replace c with k) and a great many other “close” terms. With max edit distance 2 you can have up to 2 insertions, deletions or substitutions. The score for each match is based on the edit distance of that term; so an exact match is scored highest; edit distance 1, lower; etc.

Prior to 4.0, FuzzyQuery took the simple yet horribly costly brute force approach: it visits every single unique term in the index, computes the edit distance for it, and accepts the term (and its documents) if the edit distance is low enough.
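The brute-force approach described in the quote is easy to sketch: a dynamic-programming Levenshtein distance computed against every unique term in the index. This is an illustration of the pre-4.0 behavior, not Lucene's actual code:

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# The pre-4.0 brute force: scan every unique term in the index.
index_terms = ["mccandless", "mcandles", "mkandless", "lucene", "fuzzy"]

def fuzzy_query(base, max_edits):
    return [t for t in index_terms if edit_distance(base, t) <= max_edits]

print(fuzzy_query("mcandless", 1))  # -> ['mccandless', 'mcandles', 'mkandless']
```

The cost is one full edit-distance computation per unique term, which is exactly what the Schulz-Mihov automaton construction in 4.0 avoids.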

The story is a good one and demonstrates the need for topic maps in computer science.

The authors used “Googling” to find an implementation by Jean-Philippe Barrette-LaPierre of an algorithm in a paper by Klaus Schulz and Stoyan Mihov that enabled this increase in performance.

That’s one way to do it, but it leaves it hit or miss whether other researchers will find the same implementation.

Moreover, once that connection has been made, associating the implementation with the algorithm/paper, it should be preserved for subsequent searchers.

As well as pointing to the implementation of this algorithm in Lucene, or other implementations, or even other accounts by the same authors, such as the 2004 publication in Computational Linguistics of Fast Approximate Search in Large Dictionaries.

Sounds like a topic map to me. The question is how to make ad hoc authoring of a topic map practical?


Category Theory for the Java Programmer

Sunday, March 27th, 2011

Category Theory for the Java Programmer

From the post:

There are several good introductions to category theory, each written for a different audience. However, I have never seen one aimed at someone trained as a programmer rather than as a computer scientist or as a mathematician. There are programming languages that have been designed with category theory in mind, such as Haskell, OCaml, and others; however, they are not typically taught in undergraduate programming courses. Java, on the other hand, is often used as an introductory language; while it was not designed with category theory in mind, there is a lot of category theory that passes over directly.

I’ll start with a sentence that says exactly what the relation is of category theory to Java programming; however, it’s loaded with category theory jargon, so I’ll need to explain each part.

A collection of Java interfaces is the free³ cartesian⁴ category² with equalizers⁵ on the interface⁶ objects¹ and the built-in⁷ objects.
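The claim is easier to see with a concrete fragment. A hedged sketch (mine, not from the post): treating Java types as the objects of a category and single-method interfaces as its morphisms, identity and composition satisfy the category laws by construction.

```java
// Sketch: Java types as objects, functions between them as morphisms.
public class JavaCategory {
    interface Morphism<A, B> { B apply(A a); }

    // Identity morphism for any object (type) A.
    static <A> Morphism<A, A> identity() {
        return a -> a;
    }

    // Composition g . f : first f, then g.
    // Associativity and the identity laws hold by construction.
    static <A, B, C> Morphism<A, C> compose(Morphism<A, B> f, Morphism<B, C> g) {
        return a -> g.apply(f.apply(a));
    }
}
```

The post goes further (products, equalizers); this only illustrates the basic category structure that Java's type system carries.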


Copy-on-write B-tree finally beaten.

Sunday, March 27th, 2011

Copy-on-write B-tree finally beaten by Andy Twigg, Andrew Byde, Grzegorz Miłoś, Tim Moreton, John Wilkes and Tom Wilkie.


A classic versioned data structure in storage and computer science is the copy-on-write (CoW) B-tree – it underlies many of today’s file systems and databases, including WAFL, ZFS, Btrfs and more. Unfortunately, it doesn’t inherit the B-tree’s optimality properties; it has poor space utilization, cannot offer fast updates, and relies on random IO to scale. Yet, nothing better has been developed since. We describe the ‘stratified B-tree’, which beats the CoW B-tree in every way. In particular, it is the first versioned dictionary to achieve optimal tradeoffs between space, query and update performance. Therefore, we believe there is no longer a good reason to use CoW B-trees for versioned data stores.
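For readers unfamiliar with the technique under discussion, a minimal sketch of copy-on-write versioning (a binary search tree standing in for a B-tree, so an illustration rather than any of the systems named above): each update copies the nodes on the path from root to the change, so old roots keep exposing old versions intact, which is also where the poor space utilization comes from.

```java
// Sketch: copy-on-write (path-copying) insert on an immutable BST.
// Each insert copies O(depth) nodes; unchanged subtrees are shared.
public class CowTree {
    static final class Node {
        final int key;
        final Node left, right;
        Node(int key, Node left, Node right) {
            this.key = key; this.left = left; this.right = right;
        }
    }

    // Returns a new root; the tree rooted at 'n' is untouched (old version).
    static Node insert(Node n, int key) {
        if (n == null) return new Node(key, null, null);
        if (key < n.key) return new Node(n.key, insert(n.left, key), n.right);
        if (key > n.key) return new Node(n.key, n.left, insert(n.right, key));
        return n; // key already present; share the whole subtree
    }

    static boolean contains(Node n, int key) {
        if (n == null) return false;
        if (key == n.key) return true;
        return contains(key < n.key ? n.left : n.right, key);
    }
}
```

Holding on to each successive root gives you every version of the tree, at the cost of the copied paths, which is the space/update tradeoff the stratified B-tree claims to improve on.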

I was browsing a CS blog aggregator when I ran across this. Looked like it would be interesting for anyone writing a versioned data store for a topic map application.

A more detailed account appears as: A. Byde and A. Twigg. Optimal query/update tradeoffs in versioned dictionaries. ArXiv e-prints, March 2011.

The Copy-on-write B-tree finally beaten paper has been updated: See:

Ontology Driven Implementation of Semantic Services for the Enterprise Environment (ODISSEE) Workshop

Sunday, March 27th, 2011

Ontology Driven Implementation of Semantic Services for the Enterprise Environment (ODISSEE) Workshop

April 12-13, 2011 · 8:30 a.m. – 4:30 p.m.

From the website:

Alion Science and Technology and the National Center for Ontological Research (NCOR, University at Buffalo) will host a two-day “Ontology Driven Implementation of Semantic Services for the Enterprise Environment (ODISSEE)” Workshop. ODISSEE aims to foster awareness of and collaboration between disparate information-sharing efforts across the US Government. The workshop will feature individual presentations on information-sharing development, as well as panel sessions on ontology and data vocabulary. This workshop supports the Joint Planning and Development Office (JPDO) information sharing initiatives. Information sharing is at the heart of the transformation from the current state of the National Airspace System (NAS) to NextGen capabilities in 2025 in areas such as unmanned aircraft systems, integrated surveillance and weather.


  • Identify and catalogue the various semantic technology efforts across the Federal government.
  • Identify, evaluate, and catalogue standard information-exchange models, such as Universal Core (UCore) and National Information Exchange Model (NIEM) and semantic models of common domains, including time, geography, and events.
  • Explore the use of ontologies to enable information exchanges within a service-oriented architecture (SOA), improve discoverability of services, and align disparate data standards and message models.
  • Coordinate ontology development across diverse Communities of Interest (COIs) to ensure extensibility, interoperability, and reusability.

You guessed from the title this was a government-based workshop. Yes? 😉

Looks like a good opportunity to at least meet some of the players in this activity space.

Topic maps certainly qualify as an information-exchange model so that could be one starting point for conversation.


Neo4j 1.3 “Abisko Lampa” M05 – Preparing for arrival

Saturday, March 26th, 2011

Neo4j 1.3 “Abisko Lampa” M05 – Preparing for arrival

Milestone 5 for Neo4j 1.3.

Start of a reference manual and release of a high availability Neo4j cluster are the highlights of this milestone.

If you have comments or concerns, now would be the time to voice them.

Topic Modeling Browser (LDA)

Saturday, March 26th, 2011

Topic Modeling Browser (LDA)

From a post by David Blei:

allison chaney has created the “topic model visualization engine,” which can be used to create browsers of document collections based on a topic model. i think this will become a very useful tool for us. the code is on google code:
as an example, here is a browser built from a 50-topic model fit to 100K articles from wikipedia:
allison describes how she built the browser in the README for her code:
finally, to check out the code and build your own browser, see here:

Take a look.

As I have mentioned before, LDA could be a good exploration tool for document collections, preparatory to building a topic map.


Saturday, March 26th, 2011


From the post:

Ontopia’s developer team is committed to switch from Ant to Maven as build and project management tool for the Ontopia code base. Making this switch has been ongoing work since 2009. This blog post serves as a summary of the work that has been done so far and the work that still needs to be done.

Near the end of the post, you will find:

You can help us by building Ontopia with Maven yourself and either trying out the distribution or the new artifacts as dependencies in other projects. Issues you find can be reported on the Ontopia issue tracker. Keep in mind however that this branch is quite old and might not contain fixes already committed to the trunk.
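Trying the new artifacts as dependencies amounts to adding a snippet like the following to a project's `pom.xml`. The coordinates and version here are my assumptions for illustration; confirm the actual ones against the Ontopia documentation:

```xml
<!-- Hypothetical coordinates; check the Ontopia project for the real ones -->
<dependency>
  <groupId>net.ontopia</groupId>
  <artifactId>ontopia-engine</artifactId>
  <version>5.1.0</version>
</dependency>
```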

So, you can have topic map software while learning or practicing your skill with Maven.

Sounds like a win-win situation to me.