Archive for September, 2011

jQuery UI Autocompletion with Elasticsearch Backend

Friday, September 30th, 2011

jQuery UI Autocompletion with Elasticsearch Backend by Gerhard Hipfinger.

From the post:

I recently discovered that Elasticsearch is an incredibly easy search engine solution for JSON documents. As we heavily use CouchDB in our product development, Elasticsearch and CouchDB are a perfect match. Even more so since Elasticsearch comes with a great out-of-the-box connection for CouchDB! So the next step is to use Elasticsearch as the backend for a jQuery UI autocompletion field.

Your users may like some form of autocompletion in their topic map interface. If nothing else, it is amusing to see how far autocompletion suggestions can land from what the user is looking for.
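A sketch of the backend half of such a field: jQuery UI's autocomplete widget sends the typed term to a URL, and the handler turns it into a query body for Elasticsearch's `_search` endpoint. The `title` field and the use of a simple `prefix` query are my assumptions, not details from Hipfinger's post.

```python
import json

def autocomplete_body(term, field="title", size=10):
    """Build the JSON body a handler would POST to Elasticsearch's
    _search endpoint to answer one autocomplete request.
    Field name and query type are illustrative assumptions."""
    return {
        "size": size,
        "query": {"prefix": {field: term.lower()}},
    }

# jQuery UI sends ?term=Couch; we answer with the matching titles
body = autocomplete_body("Couch")
print(json.dumps(body))
```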

BTW, read the comments on this post.

ElasticSearch: Beyond Full Text Search

Friday, September 30th, 2011

ElasticSearch: Beyond Full Text Search by Karel Minařík.

If you aren’t into hard core searching already, this is a nice introduction to the area. Would like to see the presentation that went with the slides but even the slides alone should be useful.

Notes on using the neo4j-scala package, Part 1

Friday, September 30th, 2011

Notes on using the neo4j-scala package, Part 1 by Sebastian Benthall.

From the post:

Encouraged by the reception of last week’s hacking notes, I’ve decided to keep experimenting with Neo4j and Scala. Taking Michael Hunger’s advice, I’m looking into the neo4j-scala package. My goal is to port my earlier toy program to this library to take advantage of more Scala language features.

These are my notes from stumbling through it. I’m halfway through.

Let’s encourage Sebastian some more!

DBpedia Spotlight v0.5 – Shedding Light on the Web of Documents

Friday, September 30th, 2011

DBpedia Spotlight v0.5 – Shedding Light on the Web of Documents by Pablo Mendes (email announcement)

We are happy to announce the release of DBpedia Spotlight v0.5 – Shedding Light on the Web of Documents.

DBpedia Spotlight is a tool for annotating mentions of DBpedia entities and concepts in text, providing a solution for linking unstructured information sources to the Linked Open Data cloud through DBpedia. The DBpedia Spotlight architecture is composed of the following modules:

  • Web application, a demonstration client (HTML/Javascript UI) that allows users to enter/paste text into a Web browser and visualize the resulting annotated text.
  • Web Service, a RESTful Web API that exposes the functionality of annotating and/or disambiguating resources in text. The service returns XML, JSON or XHTML+RDFa.
  • Annotation Java / Scala API, exposing the underlying logic that performs the annotation/disambiguation.
  • Indexing Java / Scala API, executing the data processing necessary to enable the annotation/disambiguation algorithms used.

In this release we have provided many enhancements to the Web Service and installation process, as well as to the spotting, candidate selection, disambiguation and annotation stages. More details on the enhancements are provided below.

The new version is deployed at:

Instructions on how to use the Web Service are available at:

We invite your comments on the new version before we deploy it on our production server. We will keep it on the “dev” server until October 6th, when we will finally make the switch to the production server at and

If you are a user of DBpedia Spotlight, please join for announcements and other discussions.

Warning: I think they are serious about the requirement of Firefox 6.0.2 and Chromium 12.0.

I tried it on an older version of Firefox on Ubuntu and got no results at all. Will upgrade Firefox but only in my directory.

Essential Elements of Data Mining

Friday, September 30th, 2011

Essential Elements of Data Mining by Keith McCormick

From the post:

This is my attempt to clarify what Data Mining is and what it isn’t. According to Wikipedia, “In philosophy, essentialism is the view that, for any specific kind of entity, there is a set of characteristics or properties all of which any entity of that kind must possess.” I do not seek the Platonic form of Data Mining, but I do seek clarity where it is often lacking. There is much confusion surrounding how Data Mining is distinct from related areas like Statistics and Business Intelligence. My primary goal is to clarify the characteristics that a project must have to be a Data Mining project. By implication, Statistical Analysis (hypothesis testing), Business Intelligence reporting, Exploratory Data Analysis, etc., do not have all of these defining properties. They are highly valuable, but have their own unique characteristics. I have come up with ten. It is quite appropriate to emphasize the first and the last. They are the bookends of the list, and they capture the heart of the matter.

Comments? Characteristics you would add or take away?

How important is it to have a definition? Recall that creeds are created to separate sheep from goats, wheat from chaff. Are “essential characteristics” any different from a creed? If so, how?

Four Levels of Data Integration (Charteris White Paper)

Friday, September 30th, 2011

Four Levels of Data Integration (Charteris White Paper)

From the post:

Application Integration is the biggest cost driver of corporate IT. While it has been popular to emphasise the business process integration aspects of EAI, it remains true that data integration is a huge part of the problem, responsible for much of the cost of EAI. You cannot begin to do process integration without some data integration.

Data integration is an N-squared problem. If you have N different systems or sources of data to integrate, you may need to build as many as N(N-1) different data exchange interfaces between them – near enough to N². For large companies, where N may run into the hundreds, and N² may be more than 100,000, this looks an impossible problem.

In practice, the figures are not quite that huge. In our experience, a typical system may interface to between 5 and 30 other systems – so the total number of interfaces is between 5N and 30N. Even this makes a prohibitive number of data interfaces to build and maintain. Many IT managers quietly admit that they just cannot maintain the necessary number of data interfaces, because the cost would be prohibitive. Then business users are forced to live with un-integrated, inconsistent data and fragmented processes, at great cost to the business.

The bad news is that N just got bigger. New commercial imperatives, the rise of e-commerce, XML and web services require companies of all sizes to integrate data and processes with their business partners’ data and processes. If you make an unsolved problem bigger, it generally remains unsolved.
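The arithmetic in the excerpt is easy to sanity-check:

```python
def pairwise_interfaces(n):
    """One interface in each direction per pair of systems: n * (n - 1)."""
    return n * (n - 1)

def per_system_interfaces(n, low=5, high=30):
    """The paper's more realistic estimate: each system
    interfaces with between 5 and 30 others."""
    return low * n, high * n

print(pairwise_interfaces(320))    # 102080 -- "more than 100,000"
print(per_system_interfaces(320))  # (1600, 9600) -- still prohibitive
```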

I was searching for N-squared references when I encountered this paper. You can see what I think is the topic map answer to the N-squared problem at: Semantic Integration: N-Squared to N+1 (and decentralized).

Extending tolog

Friday, September 30th, 2011

Extending tolog by Lars Marius Garshol.


This paper describes a number of extensions that might be made to the tolog query language for topic maps in order to make it fulfill all the requirements for the ISO-standardized TMQL (Topic Map Query Language). First, the lessons to be learned from the considerable body of research into the Datalog query language are considered. Finally, a number of different extensions to the existing tolog query language are considered and evaluated.

This paper extends and improves on earlier work on tolog, first described in [Garshol01].

As you can see from some recent posts here, Datalog research continues!

Science Manual for Judges Updated

Friday, September 30th, 2011

Science Manual for Judges Updated by Evan Koblentz

From the post:

A new guidebook for judges and legal professionals, the Reference Manual on Scientific Evidence, became available Wednesday, replacing the previous edition published in 2000.

The 1,038-page manual explains scientific concepts, shows how evidence can be manipulated, and cites judicial decisions. It’s free for downloading and online reading, and there is a $79.95 paperback version.

“The new manual was developed in collaboration with the Federal Judicial Center, which produced the previous editions, and was rigorously peer-reviewed in accordance with the procedures of the National Research Council,” both organizations explained in a public statement. “The reference manual is intended to assist judges with the management of cases involving complex scientific and technical evidence; it is not intended, however, to instruct judges on what evidence should be admissible.”

Although intended for judges, the manual is useful for anyone in the legal community, its authors stated. It contains an introduction by U.S. Supreme Court Associate Justice Stephen Breyer and new chapters about forensic science, mental health, and neuroscience — but not computer science.

“We entertained a chapter on computer technology and unfortunately the language of the chapter was too complex, at least as evaluated by our committee, and we thought that by the time the chapter came in, it was really too late to engage another author in the development of the chapter,” said committee co-chair Jerome Kassirer, professor at the Tufts University School of Medicine, in a public conference to announce the manual.

Well, that leaves a gaping hole for someone to plug for judges and other legal professionals.

Someone needs to run reading-level software on the current text, and to cycle a reading-level checker over new prose as it is written.

To illustrate the usefulness of topic maps, add in references to cases where some aspect of computer technology has been discussed or is the subject of litigation. Particularly where terminology has changed. Include illustrations that a judge can use to demonstrate their understanding of the technology.

Free access to all members of the judiciary and their staffs to a dynamically updated publication (no dead tree models) that presents summaries of any changes (don’t have to hunt for them). All others by subscription.

Advice regarding future directions for Protégé

Friday, September 30th, 2011

Advice regarding future directions for Protégé

Mark Musen, Principal Investigator, The Protégé Project, posted the following request to the protege-users mailing list:

I am writing to seek your advice regarding future directions for the Protégé Project. As you know, all the work that we perform on the Protégé suite of tools is supported by external funding, nearly all from federal research grants. We currently are seeking additional grant support to migrate some of the features that are available in Protégé Version 3 to Protégé Version 4. We believe that this migration is important, as only Protégé 4 supports the full OWL 2 standard, and we appreciate that many members of our user community are asking to use certain capabilities currently unique to Protégé 3 with OWL 2 ontologies in Protégé 4.

To help the Protégé team in setting priorities, and to help us make the case to our potential funders that enhancement of Protégé 4 is warranted, we’d be grateful if you could please fill out the brief survey at the following URL:

It will not take more than a few minutes for you to give us feedback that will be influential in setting our future goals. If we can document strong community support for implementing certain Protégé 3 features in Protégé 4, then we will be in a much stronger position to make the case to our funders to initiate the required work.

The entire Protégé team is looking forward to your opinions. Please be sure to forward this message to colleagues who use Protégé who may not subscribe to these mailing lists so that we can obtain as much feedback as possible.

Many thanks for your help and support.

Please participate in this survey (there are only 7 questions, one of which is optional) and ask others to participate as well.

Semantic Integration: N-Squared to N+1 (and decentralized)

Friday, September 30th, 2011

Data Integration: The Relational Logic Approach pays homage to what is called the N-squared problem. The premise of N-squared for data integration is that every distinct identification must be mapped to every other distinct identification. Here is a graphic of the N-squared problem.

Two usual responses, depending upon the proposed solution.

First, get thee to a master schema (probably the most common). That is map every distinct data source to a common schema and all clients have to interact with that one schema. Case closed. Except data sources come and go, as do clients so there is maintenance overhead. Maintenance can take time to agree on updates.

Second, no system integrates every other possible source of data, so the fear of N-squared is greatly exaggerated. Not unlike the sudden rush for “big data” solutions whether the client has “big data” or not. Who would want to admit to having “medium” or even “small” data?

The third response is that of topic maps. The assumption that every identification must map to every other identification means things get ugly in a hurry. But topic maps question that very premise of the N-squared problem.

Here is an illustration of how five separate topic maps, with five different identifications of a popular comic book character (Superman), can be combined and yet avoid the N-Squared problem. In fact, topic maps offer an N+1 solution to the problem.

Each snippet, written in Compact Topic Map (CTM) syntax represents a separate topic map.

en-superman ;
- "Superman" ;
- altname: "Clark Kent" .


de-superman ;
- "Superman" ;
- birthname: "Kal-El" .


fr-superman ;
- "Superman" ;
birthplace: "Krypton" .


it-superman ;
- "Superman" ;
- altname: "Man of Steel" .


eo-superman ;
- "Superman" ;
- altname: "Clark Joseph Kent" .

Copied into a common file, superman-N-squared.ctm, nothing happens. That’s because they all have different subject identifiers. What if I add to the file/topic map, the following topic:

superman ; ; ; ; .

This results in the file superman-N-squared-solution.ctm.


Or an author need know only one other identifier. So long as any group of authors uses at least one common identifier between any two maps, their separate topic maps will merge. (Ordering of the merges may be an issue.)

Another way to say that is that the trigger for merging of identifications is decentralized.

Which gives you a lot more eyes on the data, potential subjects and relationships between subjects.
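The decentralized merge rule can be sketched in a few lines. The identifiers below are hypothetical stand-ins (one Wikipedia-style URL per language edition); the point is only that the five topics share nothing until a sixth topic lists an identifier from each.

```python
def merge_topics(topics):
    """Merge topics transitively: any two topics sharing a subject
    identifier represent the same subject (TMDM-style merging)."""
    merged = []
    for ids in topics:
        ids = set(ids)
        keep = []
        for group in merged:
            if group & ids:
                ids |= group        # absorb the overlapping group
            else:
                keep.append(group)
        keep.append(ids)
        merged = keep
    return merged

# hypothetical identifiers, one per language edition
five_maps = [
    {"http://en.wikipedia.org/wiki/Superman"},
    {"http://de.wikipedia.org/wiki/Superman"},
    {"http://fr.wikipedia.org/wiki/Superman"},
    {"http://it.wikipedia.org/wiki/Superman"},
    {"http://eo.wikipedia.org/wiki/Superman"},
]
print(len(merge_topics(five_maps)))          # 5 -- nothing merges

# the "+1" topic lists all five identifiers
hub = set().union(*five_maps)
print(len(merge_topics(five_maps + [hub])))  # 1 -- everything merges
```

N+1 in miniature: one added topic, not N² pairwise mappings, collapses all five representatives into one.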

PS: Did you know that the English and German versions give Superman’s cover name as “Clark Kent,” while the French, Italian and Esperanto versions give his cover name as “Clark Joseph Kent?”

PPS: The files are both here,

Beyond the Triple Count

Thursday, September 29th, 2011

Beyond the Triple Count by Leigh Dodds.

From the post:

I’ve felt for a while now that the Linked Data community has an unhealthy fascination on triple counts, i.e. on the size of individual datasets.

This was quite natural in the boot-strapping phase of Linked Data in which we were primarily focused on communicating how much data was being gathered. But we’re now beyond that phase and need to start considering a more nuanced discussion around published data.

If you’re a triple store vendor then you definitely want to talk about the volume of data your store can hold. After all, potential users or customers are going to be very interested in how much data could be indexed in your product. Even so, no-one seriously takes a headline figure at face value. As users we’re much more interested in a variety of other factors. For example how long does it take to load my data? Or, how well does a store perform with my usage profile, taking into account my hardware investment? Etc. This is why we have benchmarks, so we can take into account additional factors and more easily compare stores across different environments.

But there’s not nearly enough attention paid to other factors when evaluating a dataset. A triple count alone tells us nothing. They’re not even a good indicator of the number of useful “facts” in a dataset.

Watch Leigh’s presentation (embedded with his post) and read the post.

I think his final paragraph sets the goal for a wide variety of approaches, however we might disagree about how to best get there! 😉

Very much worth your time to read and ponder.

Bayes’ Rule and App Search

Thursday, September 29th, 2011

Bayes’ Rule and App Search by Paul Christiano.

From the post:

In order to provide relevant search results, Quixey needs to integrate many different sources of evidence — not only each app’s name and description, but content from all over the web that refers to specific apps. Aggregating huge quantities of information into a single judgment is a notoriously difficult problem, and modern machine learning offers many approaches.

When you need to incorporate just a few pieces of information, there’s a mathematical version of “brute force” that works quite well, based on Bayes’ Rule:
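A toy version of that brute force, with invented numbers: treat each evidence source as a pair of likelihoods and update the prior once per source.

```python
def posterior(prior, likelihood_rel, likelihood_irrel):
    """Bayes' Rule: P(relevant | evidence) for one piece of evidence."""
    num = prior * likelihood_rel
    return num / (num + (1 - prior) * likelihood_irrel)

# P(app is relevant) = 0.1; evidence "query term appears in description"
# is seen in 80% of relevant apps, 20% of irrelevant ones (invented numbers)
p = posterior(0.1, 0.8, 0.2)
print(round(p, 3))   # 0.308

# naive-Bayes style: fold in a second, independent piece of evidence
p2 = posterior(p, 0.7, 0.3)
print(round(p2, 3))  # 0.509
```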

Very smooth explanation of Bayes’ Rule if you need to get your feet wet!

MongoDB Monitoring Service (MMS)

Thursday, September 29th, 2011

MongoDB Monitoring Service (MMS)

From the post:

Today we’re pleased to release the MongoDB Monitoring Service (MMS) to the public for free. MMS is a SaaS based tool that monitors your MongoDB cluster and makes it easy for you to see what’s going on in a production deployment.

One of the most frequently asked questions we get from users getting ready to deploy MongoDB is “What should I be monitoring in production?” Our engineers have spent a lot of time working with many of the world’s largest MongoDB deployments and based on this experience, MMS represents our current “best practices” monitoring for any deployment.

MMS is free. Anybody can sign up for MMS, download the agent, and start visualizing performance data in minutes.

If you’re a commercial support customer, MMS makes our support even better. 10gen engineers can access your MMS data, enabling them to skip the tedious back and forth information gathering that can go on during triaging of issues.

I don’t have access to a MongoDB cluster (at least not at the moment). Comments on MMS are most welcome.

Given the feature/performance race in NoSQL solutions, should be interesting to see what monitoring solutions appear for other NoSQL offerings!

Or for SQL offerings as well!

Why your product sucks

Thursday, September 29th, 2011

Why your product sucks by Mike Pumphrey.

It isn’t often I stop listening to the Kinks for a software presentation, much less a recorded one. The title made me curious enough to spend six (6) minutes on it (total length).

My summary of the presentation:

Do you want to be righteous and make users work to use your software or do you want to be ubiquitous? Your choice.

Balisage 2012 Dates!

Thursday, September 29th, 2011

Mark your calendars!

August – Montreal – St. Catherine’s Blvd. – Markup – Balisage

What more need I say?

Oh, the dates:

August 6, 2012 – Pre-conference Symposium
August 7–10, 2012 – Balisage: The Markup Conference

Start lobbying now for travel funds and conference fees!

(Start writing your topic map paper as well.)

Indexed Nearest Neighbour Search in PostGIS

Thursday, September 29th, 2011

Indexed Nearest Neighbour Search in PostGIS

From the post:

An always popular question on the PostGIS users mailing list has been “how do I find the N nearest things to this point?”.

To date, the answer has generally been quite convoluted, since PostGIS supports bounding box index searches, and in order to get the N nearest things you need a box large enough to capture at least N things. Which means you need to know how big to make your search box, which is not possible in general.

PostgreSQL has the ability to return ordered information where an index exists, but the ability has been restricted to B-Tree indexes until recently. Thanks to one of our clients, we were able to directly fund PostgreSQL developers Oleg Bartunov and Teodor Sigaev in adding the ability to return sorted results from a GiST index. And since PostGIS indexes use GiST, that means that now we can also return sorted results from our indexes.

This feature (the PostGIS side of it) was funded by Vizzuality, and hopefully it comes in useful in their CartoDB work.

You will need PostgreSQL 9.1 and the PostGIS source code from the repository, but this is what a nearest neighbour search looks like:
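A nearest neighbour query of the kind the post describes looks roughly like this. The table (`geonames`), geometry column (`geom`) and target point are placeholders of mine; `<->` is the KNN distance operator the GiST index can now serve directly.

```python
def knn_sql(table, geom_col, lon, lat, limit=10):
    """Build a PostGIS KNN query: ORDER BY the <-> distance operator,
    which PostgreSQL 9.1+ can satisfy straight from the GiST index."""
    return (
        f"SELECT * FROM {table} "
        f"ORDER BY {geom_col} <-> ST_SetSRID(ST_MakePoint({lon}, {lat}), 4326) "
        f"LIMIT {limit};"
    )

print(knn_sql("geonames", "geom", -90, 40))
```

No bounding box to guess at: the index itself returns rows in distance order, and LIMIT stops after the N nearest.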

PostgreSQL? Isn’t that SQL? 🙂

Indexed nearest neighbour search is a question of results, not ideology.

Better targeting through technology.

Subject Normalization

Thursday, September 29th, 2011

Another way to explain topic maps is in terms of database normalization, except that I would call it subject normalization. That is, every subject that is explicitly represented in the topic map appears once and only once, with relations to other subjects recast to point to this single representative and all properties of the subject gathered to that one place.

One obvious advantage is that the shipping and accounting departments, for example, both have access to updated information for a customer as soon as it is entered by the other. And although they may gather different information about a customer, that information can be (it doesn’t have to be) available to both of them.

Unlike database normalization, subject normalization in topic maps does not require rewriting of database tables, which can cause data access problems. Subject normalization (merging) occurs automatically, based on the presence of properties defined by the Topic Maps Data Model (TMDM).

And unlike owl:sameAs, subject normalization in topic maps does not require knowledge of the “other” subject representative. That is, I can insert an identifier that I know has been used for a subject, without knowing whether it has been used in this topic map, and topics representing that subject will automatically merge (or be normalized).

Subject normalization, in terms of the TMDM, reduces the redundancy of information items. That is true enough, but it is not the primary experience users have of subject normalization. How many copies of a subject representative (information items) a system holds is of little concern to an end-user.

What does concern end-users is getting the most complete and up-to-date information on a subject, however that is accomplished.

Topic maps accomplish that goal by empowering users to add identifiers to subject representatives that result in subject normalization. It doesn’t get any easier than that.
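A minimal sketch of that gathering step, with invented identifiers and field names: records that share any identifier collapse into one representative, which carries every property either department entered.

```python
def normalize(records):
    """Single-pass sketch of subject normalization: collapse records
    sharing an identifier into one representative, gathering all
    properties onto it."""
    subjects = []
    for rec in records:
        ids = set(rec["ids"])
        props = {k: v for k, v in rec.items() if k != "ids"}
        for subj in subjects:
            if subj["ids"] & ids:
                subj["ids"] |= ids   # adding an identifier triggers the merge
                subj.update(props)
                break
        else:
            subjects.append({"ids": ids, **props})
    return subjects

# shipping and accounting know the customer by different keys, but the
# second record carries one shared identifier (field names invented)
records = [
    {"ids": ["cust:42"], "ship_to": "10 Main St"},
    {"ids": ["cust:42", "acct:A-7"], "balance": 125.00},
]
merged = normalize(records)
print(len(merged), sorted(merged[0]))
```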

Hadoop for Archiving Email

Thursday, September 29th, 2011

Hadoop for Archiving Email by Sunil Sitaula.

When I saw the title of this post I started wondering if the NSA was having trouble with archiving all my email. 😉

From the post:

This post will explore a specific use case for Apache Hadoop, one that is not commonly recognized, but is gaining interest behind the scenes. It has to do with converting, storing, and searching email messages using the Hadoop platform for archival purposes.

Most of us in IT/Datacenters know the challenges behind storing years of corporate mailboxes and providing an interface for users to search them as necessary. The sheer volume of messages, the content structure and its complexity, the migration processes, and the need to provide timely search results stand out as key points that must be addressed before embarking on an actual implementation. For example, in some organizations all email messages are stored in production servers; others just create a backup dump and store them in tapes; and some organizations have proper archival processes that include search features. Regardless of the situation, it is essential to be able to store and search emails because of the critical information they hold as well as for legal compliance, investigation, etc. That said, let’s look at how Hadoop could help make this process somewhat simple, cost effective, manageable, and scalable.

The post concludes:

In this post I have described the conversion of email files into sequence files and their storage using HDFS. I have looked at how to search through them to output results. Given the “simply add a node” scalability feature of Hadoop, it is very straightforward to add more storage as well as search capacity. Furthermore, given that Hadoop clusters are built using commodity hardware, that the software itself is open source, and that the framework makes it simple to implement specific use cases, the overall solution is very cost effective compared to a number of existing software products that provide similar capabilities. The search portion of the solution, however, is very rudimentary. In part 2, I will look at using Lucene/Solr for indexing and searching in a more standard and robust way.
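The conversion step in part 1 reduces to a mapping from one email file to one (key, value) record ready for a SequenceFile. Using the file path as key, and subject plus body as the searchable value, is my assumption for the sketch, not a detail from Sitaula's post.

```python
from email import message_from_string

def to_record(path, raw):
    """One email file in, one (key, value) pair out: key = file path,
    value = text to index. Key/value choices are illustrative."""
    msg = message_from_string(raw)
    return path, f"{msg['Subject']}\n{msg.get_payload()}"

raw = "Subject: quarterly report\n\nNumbers attached."
key, value = to_record("mail/inbox/0001.eml", raw)
print(key)
print(value)
```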

Read part one and get ready for part 2!

And start thinking about what indexing/search capabilities you are going to want.

Update: Hadoop for Archiving Email – Part 2

Data Integration: The Relational Logic Approach

Thursday, September 29th, 2011

Data Integration: The Relational Logic Approach by Michael Genesereth of Stanford University.


Data integration is a critical problem in our increasingly interconnected but inevitably heterogeneous world. There are numerous data sources available in organizational databases and on public information systems like the World Wide Web. Not surprisingly, the sources often use different vocabularies and different data structures, being created, as they are, by different people, at different times, for different purposes.

The goal of data integration is to provide programmatic and human users with integrated access to multiple, heterogeneous data sources, giving each user the illusion of a single, homogeneous database designed for his or her specific need. The good news is that, in many cases, the data integration process can be automated.

This book is an introduction to the problem of data integration and a rigorous account of one of the leading approaches to solving this problem, viz., the relational logic approach. Relational logic provides a theoretical framework for discussing data integration. Moreover, in many important cases, it provides algorithms for solving the problem in a computationally practical way. In many respects, relational logic does for data integration what relational algebra did for database theory several decades ago. A companion web site provides interactive demonstrations of the algorithms.

Interactive edition with working examples: (As near as I can tell, the entire text. Although referred to as the “companion” website.)

When the author said Datalog, I thought of Lars Marius:

In our examples here and throughout the book, we encode relationships between and among schemas as rules in a language called Datalog. In many cases, the rules are expressed in a simple version of Datalog called Basic Datalog; in other cases, rules are written in more elaborate versions, viz., Functional Datalog and Disjunctive Datalog. In the following paragraphs, we look at Basic Datalog first, then Functional Datalog, and finally Disjunctive Datalog. The presentation here is casual; formal details are given in Chapter 2.

Bottom line: the author advocates a master schema approach, but you should read the book for yourself. It makes a number of good points about data integration issues and the limitations of various techniques. Plus you may learn some Datalog along the way!

Human Computation: Core Research Questions and State of the Art

Thursday, September 29th, 2011

Human Computation: Core Research Questions and State of the Art by Luis von Ahn and Edith Law. (> 300 slide tutorial) See also: Human Computation by Edith Law and Luis von Ahn.

Abstract from the book:

Human computation is a new and evolving research area that centers around harnessing human intelligence to solve computational problems that are beyond the scope of existing Artificial Intelligence (AI) algorithms. With the growth of the Web, human computation systems can now leverage the abilities of an unprecedented number of people via the Web to perform complex computation. There are various genres of human computation applications that exist today. Games with a purpose (e.g., the ESP Game) specifically target online gamers who generate useful data (e.g., image tags) while playing an enjoyable game. Crowdsourcing marketplaces (e.g., Amazon Mechanical Turk) are human computation systems that coordinate workers to perform tasks in exchange for monetary rewards. In identity verification tasks, users perform computation in order to gain access to some online content; an example is reCAPTCHA, which leverages millions of users who solve CAPTCHAs every day to correct words in books that optical character recognition (OCR) programs fail to recognize with certainty.

This book is aimed at achieving four goals: (1) defining human computation as a research area; (2) providing a comprehensive review of existing work; (3) drawing connections to a wide variety of disciplines, including AI, Machine Learning, HCI, Mechanism/Market Design and Psychology, and capturing their unique perspectives on the core research questions in human computation; and (4) suggesting promising research directions for the future.

You may also want to see Luis von Ahn in a Google Techtalk video from about five years ago:

July 26, 2006. Luis von Ahn is an assistant professor in the Computer Science Department at Carnegie Mellon University, where he also received his Ph.D. in 2005. Previously, Luis obtained a B.S. in mathematics from Duke University in 2000. He is the recipient of a Microsoft Research Fellowship.

ABSTRACT: Tasks like image recognition are trivial for humans, but continue to challenge even the most sophisticated computer programs. This talk introduces a paradigm for utilizing human processing power to solve problems that computers cannot yet solve. Traditional approaches to solving such problems focus on improving software. I advocate a novel approach: constructively channel human brainpower using computer games. For example, the ESP Game, described in this talk, is an enjoyable online game — many people play over 40 hours a week — and when people play, they help label images on the Web with descriptive keywords. These keywords can be used to significantly improve the accuracy of image search. People play the game not because they want to help, but because they enjoy it. I describe other examples of “games with a purpose”: Peekaboom, which helps determine the location of objects in images, and Verbosity, which collects common-sense knowledge. I also explain a general approach for constructing games with a purpose.

A rapidly developing and exciting area of research. Perhaps your next topic map may be authored or maintained by a combination of entities.

Sorting by function value in Solr

Wednesday, September 28th, 2011

Sorting by function value in Solr by Rafał Kuć.

From the post:

In Solr 3.1 and later we have a very interesting functionality, which enables us to sort by function value. What does that give us? Actually, a few interesting possibilities.

Let’s start

The first example that comes to mind, perhaps because of a project I worked on some time ago, is sorting on the basis of distance between two geographical points. Until now, implementing such functionality required changes to Solr (for example, LocalSolr or LocalLucene). Using Solr 3.1 and later, you can sort your search results using the value returned by a defined function. For example, in Solr we have the dist function, which calculates the distance between two points. One variant of the function accepts five parameters: the algorithm and two pairs of points. If, using this function, we would like to sort our search results in ascending order of distance from the point with latitude and longitude 0.0, we should add the following sort parameter to the Solr query:
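Assembled into a query string, a sort parameter of that shape would look roughly like this. The field names `point_x` and `point_y` are placeholders of mine; `dist(2, ...)` selects the Euclidean (power 2) algorithm, per the five-parameter variant the post describes.

```python
from urllib.parse import urlencode

# Sort results by Euclidean distance (power = 2) between the point
# stored in fields point_x / point_y and the origin (0, 0).
params = {
    "q": "*:*",
    "sort": "dist(2,point_x,point_y,0,0) asc",
}
print("/select?" + urlencode(params))
```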

See the post and think about how you would use this with Solr, or even how you might offer this to your users in other contexts.
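As a sketch of what such a query might look like in practice: the snippet below builds a Solr request URL that sorts ascending by the dist function's return value. The core URL and the lat/lon field names are my own assumptions, not from the post; adjust them to your schema.

```python
from urllib.parse import urlencode

# Hypothetical core URL and field names (lat, lon) -- adjust to your schema.
params = {
    "q": "*:*",
    # dist(2, lat, lon, 0, 0): Euclidean (power 2) distance between each
    # document's lat/lon values and the point (0, 0); sort ascending.
    "sort": "dist(2,lat,lon,0,0) asc",
    "fl": "id,lat,lon",
}
url = "http://localhost:8983/solr/select?" + urlencode(params)
print(url)
```

The same sort expression works for any function query, not just dist, which is what makes the feature more general than the old LocalSolr approach.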

Solr and LucidWorks Enterprise: When to use each

Wednesday, September 28th, 2011

Solr and LucidWorks Enterprise: When to use each

From the post:

If LucidWorks Enterprise is built on Solr, how do you know which one to use when for your own circumstances? This article describes the difference between using straight Solr, using the LucidWorks Enterprise user interface, and using LucidWorks Enterprise’s ReST API for accomplishing various common tasks so you can see which fits your situation at a given moment.

In today’s world, building the perfect product is a lot like trying to repair a set of train tracks while the train is barreling down on you. The world just keeps moving, with great ideas and new possibilities tempting you every day. And to make things worse, innovation doesn’t just show its face for you; it regularly visits your competitors as well.

That’s why you use open source software in the first place. You have smart people; does it make sense to have them building search functionality when Apache Solr already provides it? Of course not. You’d rather rely on the solid functionality that’s already been built by the community of Solr developers, and let your people spend their time building innovation into your own products. It’s simply a more efficient use of resources.

But what if you need search-related functionality that’s not available in straight Solr? In some cases, you may be able to fill those holes and lighten your load with LucidWorks Enterprise. Built on Solr, LucidWorks Enterprise starts by simplifying the day-to-day use tasks involved in using Solr, and then moves on to adding additional features that can help free up your development team for work on your own applications. But how do you know which path would be right for you?

Since I posted the LucidWorks 2.0 announcement yesterday, I thought this might be helpful in evaluating it. I did not see a date on it, but it looks current enough.

Traditional Entity Extraction’s Six Weaknesses

Wednesday, September 28th, 2011

Traditional Entity Extraction’s Six Weaknesses

From the post:

Most university programming courses ignore entity extraction. Some professors talk about the challenges of identifying people, places, things, events, Social Security Numbers and leave the rest to the students. Other professors may have an assignment related to parsing text and detecting anomalies or bound phrases. But most of those emerging with a degree in computer science consign the challenge of entity extraction to the Miscellaneous file.

Entity extraction means processing text to identify, tag, and properly account for those elements that are the names of persons, numbers, organizations, locations, and expressions such as a telephone number, among other items. An entity can consist of a single word like Cher or a bound sequence of words like White House. The challenge of figuring out names is a tough one for several reasons. Many names exist in richly varied forms. You can find interesting naming conventions in street addresses in Madrid, Spain, and for the owner of a falafel shop in Tripoli.

Entities, as information retrieval experts have learned since the first DARPA conference on the subject in 1987, are quite important to certain types of content analysis. Digital Reasoning has been working for more than 11 years on entity extraction and related content processing problems. Entity oriented analytics have become a very important issue these days as companies deal with too much data, the need to understand the meaning and not just the statistics of the data, and finally to understand entities in context – critical to understanding code terms, etc.

I want to highlight the six weaknesses of traditional entity extraction and highlight Digital Reasoning’s patented, fully automated method. Let’s look at the weaknesses.

For my library class: No, I am not endorsing this product, and yes, it is a promotional piece. You are going to encounter those as librarians throughout your careers. And you are going to need to be able to ask questions that focus on the information needs of your library and its patrons, not on what the software is said to do well.

Read the full piece and visit the product’s website. What would you ask? Why? What more information do you think you would need?
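To see why the post's “richly varied forms” problem is hard, here is a deliberately naive extractor of my own (a toy sketch, not Digital Reasoning's method) that tags maximal runs of capitalized words. It promptly splits “Luis von Ahn” on the lowercase “von”:

```python
import re

def naive_entities(text):
    """Toy entity spotter: tag maximal runs of capitalized words.

    Illustrates two classic weaknesses: it splits names containing
    lowercase particles ("Luis von Ahn"), and it would happily tag
    any sentence-initial capitalized word as an entity.
    """
    pattern = r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b"
    return re.findall(pattern, text)

sample = "Luis von Ahn teaches at Carnegie Mellon University in Pittsburgh."
print(naive_entities(sample))
# -> ['Luis', 'Ahn', 'Carnegie Mellon University', 'Pittsburgh']
```

Questions about how a product handles exactly these cases (particles, acronyms, transliterated names) are the kind a librarian evaluating such software should ask.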

Getting Started with Amazon Elastic MapReduce

Wednesday, September 28th, 2011

Getting Started with Amazon Elastic MapReduce

I happened across this video yesterday. I recommend that you watch it at least two or three times, if not more.

Not that any of you need to learn how to run a Python word count script with Amazon Elastic MapReduce.

Rather, this is a very effective presentation that does not get sidetracked by rat holes and edge cases. It has an agenda and doesn’t deviate from it.

A lot of lessons can be learned from this video for presentations at conferences or even to clients.
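For reference, the kind of word-count job the video runs can be sketched in plain Python in the Hadoop-streaming style (a mapper that emits (word, 1) pairs and a reducer that sums them); the AWS-specific job setup is omitted, and the sample lines are my own:

```python
from collections import Counter

def map_words(lines):
    """Mapper: emit (word, 1) for every whitespace-separated token."""
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1

def reduce_counts(pairs):
    """Reducer: sum the counts for each word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

lines = ["to be or not to be", "that is the question"]
counts = reduce_counts(map_words(lines))
print(counts["to"], counts["be"])  # -> 2 2
```

On Elastic MapReduce the two functions would run as separate streaming steps over S3 input, but the mapper/reducer contract is the same.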

Dimensions to use to compare NoSQL data stores

Wednesday, September 28th, 2011

Dimensions to use to compare NoSQL data stores by Huan Liu.

From the post:

You have decided to use a NoSQL data store in favor of a DBMS store, possibly due to scaling reasons. But, there are so many NoSQL stores out there, which one should you choose? Part of the NoSQL movement is the acknowledgment that there are tradeoffs, and the various NoSQL projects have pursued different tradeoff points in the design space. Understanding the tradeoffs they have made, and figuring out which one fits your application better is a major undertaking.

Obviously, choosing the right data store is a much bigger topic than can be covered in a single blog post. There are also many resources comparing the various NoSQL data stores, e.g., here, so there is no point repeating them. Instead, in this post, I will highlight the dimensions you should use when comparing the various data stores.

Useful information to have on hand when discussing NoSQL data stores.

Thoora is Your Robot Buddy for Exploring Web Topics

Wednesday, September 28th, 2011

Thoora is Your Robot Buddy for Exploring Web Topics by Jon Mitchell. (on ReadWriteWeb)

From the post:

With a Web full of stuff, discovery is a hard problem. Search engines were the first tools on the scene, but their rankings still have a hard time identifying relevance the same way a human user would. These days, social networks are the substitute for content discovery, and even the major search engines are using your social signals to determine what’s relevant for you. But the obvious problem with social search is that if your friends haven’t discovered it yet, it’s not on your radar.

At some point, someone in the social graph has to discover something for the first time. With so much new content getting churned out all the time, a Web surfer looking for something original could use some algorithmic help. A new app called Thoora, which launched its public beta last week, uses the power of machine learning to help users uncover new content on topics that interest them.

You create a topic, Thoora suggests keywords, you choose some (and can declare them to be equivalent), and the results are shared with others by default.

Users who create “good” topics can develop followings.

Although topics can be shared, the article does not mention sharing keywords.

Seems like a missed opportunity to crowd-source keywords from multiple “good” authors on the same topic to improve the results. That is, you supply five or six keywords for topic A and I come along and suggest some additional keywords for topic A, perhaps from a topic I already have. It would require “acceptance” by the first user, but that should not be hard.

I was amused to read in the Thoora FAQ:

Finally, Google News has no social component. Thoora was created so that topics could be shared and followed, because your topics – once painted with your expert brush – are super-valuable to others and ripe for sharing.

Sharing keywords is far more powerful than sharing topics. We have all had the experience of searching for something when a companion suggests a different word and we find the object of our search. Sharing in Thoora now is like following tweets: useful, but not all that it could be.
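A minimal sketch of the acceptance workflow suggested above (the Topic class and its method names are entirely hypothetical; Thoora exposes no such API):

```python
class Topic:
    """Hypothetical topic with owner-approved, crowd-sourced keywords."""

    def __init__(self, owner, keywords):
        self.owner = owner
        self.keywords = set(keywords)
        self.pending = {}  # suggester -> set of proposed keywords

    def suggest(self, suggester, keywords):
        """Another user proposes additional keywords for this topic."""
        self.pending.setdefault(suggester, set()).update(keywords)

    def accept(self, suggester):
        """The owner merges a suggester's keywords into the topic."""
        self.keywords |= self.pending.pop(suggester, set())

a = Topic("alice", ["solr", "lucene"])
a.suggest("bob", ["faceting", "lucene"])
a.accept("bob")
print(sorted(a.keywords))  # -> ['faceting', 'lucene', 'solr']
```

The point of the sketch is just that the merge is a set union gated on the owner's approval, which is why it "should not be hard" to build.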

If you decide to use Thoora, I would appreciate your views and comments.

Practical Foundations for Programming Languages

Wednesday, September 28th, 2011

Practical Foundations for Programming Languages (pdf) by Robert Harper, Carnegie Mellon University.

From Chapter 1, page 3:

Programming languages are languages, a means of expressing computations in a form comprehensible to both people and machines. The syntax of a language specifies the means by which various sorts of phrases (expressions, commands, declarations, and so forth) may be combined to form programs. But what sort of thing are these phrases? What is a program made of?

The informal concept of syntax may be seen to involve several distinct concepts. The surface, or concrete, syntax is concerned with how phrases are entered and displayed on a computer. The surface syntax is usually thought of as given by strings of characters from some alphabet (say, ASCII or Unicode). The structural, or abstract, syntax is concerned with the structure of phrases, specifically how they are composed from other phrases. At this level a phrase is a tree, called an abstract syntax tree, whose nodes are operators that combine several phrases to form another phrase. The binding structure of syntax is concerned with the introduction and use of identifiers: how they are declared, and how declared identifiers are to be used. At this level phrases are abstract binding trees, which enrich abstract syntax trees with the concepts of binding and scope.

In this chapter we prepare the ground for all of our later work by defining precisely what are strings, abstract syntax trees, and abstract binding trees. The definitions are a bit technical, but are fundamentally quite simple and intuitive. It is probably best to skim this chapter on first reading, returning to it only as the need arises.

I am always amused when authors counsel readers to “skim” an early chapter and return to it when in need. That works for the author, who already knows the material in the first chapter cold, but works less well, in my experience, for the reader. How will I know that some future need could be satisfied by re-reading the first chapter? The first chapter is only nine (9) pages out of five hundred and seventy (570), so my suggestion would be to get it out of the way with a close reading.

From the preface:

This is a working draft of a book on the foundations of programming languages. The central organizing principle of the book is that programming language features may be seen as manifestations of an underlying type structure that governs its syntax and semantics. The emphasis, therefore, is on the concept of type, which codifies and organizes the computational universe in much the same way that the concept of set may be seen as an organizing principle for the mathematical universe. The purpose of this book is to explain this remark.

I think it is the view that “the concept of type…codifies and organizes the computational universe” that I find attractive. That being the case, we are free to construct computational universes that best fit our purposes, as opposed to fitting our purposes to particular computational universes.
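To make the abstract-syntax-tree idea from the quoted chapter concrete, here is a minimal expression language sketched in Python (my own illustration; Harper's book uses mathematical notation, not code). The Let node shows the binding structure the chapter mentions: it declares an identifier and gives it scope over the body:

```python
from dataclasses import dataclass

@dataclass
class Num:        # numeric literal
    value: int

@dataclass
class Var:        # use of a declared identifier
    name: str

@dataclass
class Add:        # operator node combining two phrases
    left: object
    right: object

@dataclass
class Let:        # binds `name` to `bound` within `body` (binding + scope)
    name: str
    bound: object
    body: object

def evaluate(node, env=None):
    """Evaluate an expression tree under an environment of bindings."""
    env = env or {}
    if isinstance(node, Num):
        return node.value
    if isinstance(node, Var):
        return env[node.name]
    if isinstance(node, Add):
        return evaluate(node.left, env) + evaluate(node.right, env)
    if isinstance(node, Let):
        return evaluate(node.body, {**env, node.name: evaluate(node.bound, env)})
    raise TypeError(f"unknown node: {node!r}")

# Concrete syntax "let x = 1 in x + 2" as an abstract (binding) tree:
tree = Let("x", Num(1), Add(Var("x"), Num(2)))
print(evaluate(tree))  # -> 3
```

Each dataclass is one "sort of phrase"; the tree, not the character string, is what the semantics operates on.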

Update: August 6, 2012 – First edition completed, see: There and Back Again

Is Precision the Enemy of Serendipity?

Wednesday, September 28th, 2011

I was reading claims of increased precision by software X the other day. I have probably mentioned this before (and it wasn’t original then, nor is it now): precision seems to me to be the enemy of serendipity.

For example, when I was an undergraduate, the library would display all the recent issues of journals on long angled shelves. So it was possible to walk along looking at the new issues in a variety of areas with ease. As a political science major I could have gone directly to journals on political science. But I would have missed the Review of Metaphysics and/or the Journal of the History of Ideas, both of which are rich sources of ideas relevant to topic maps (and information systems more generally).

But precision about the information available (a departmental page that links only to electronic versions of journals relevant to the “discipline”) reduces the opportunity to recognize relevant literature outside the confines of that discipline.

True, I still browse a lot, otherwise I would not notice titles like: k-means Approach to the Karhunen-Loève Transform (aka PCA – Principal Component Analysis). I knew that k-means was a form of clustering that could help with gathering members of collective topics together, but quite honestly I did not recognize the Karhunen-Loève Transform. I know it as PCA – Principal Component Analysis, which is why I inserted that in my blog title, to help others recognize the technique.

Of course the problem is that sometimes I really want precision, perhaps because I am rushed to finish a job or need to find a reference for a standard, etc. In those cases I don’t have time to wade through a lot of search results, and I appreciate whatever (little) precision I can wring out of a search engine.

Whether I want more precision or more serendipity varies on a day to day basis for me. How about you?

The Free Law Reporter – Open Access to the Law and Beyond

Wednesday, September 28th, 2011

The Free Law Reporter – Open Access to the Law and Beyond

From the post:

Like many projects, the Free Law Reporter (FLR) started out as way to scratch an itch for ourselves. As a publisher of legal education materials and developer of legal education resources, CALI finds itself doing things with the text of the law all the time. Our open casebook project, eLangdell, is the most obvious example.

The theme of the 2006 Conference for Law School Computing was “Rip, Mix, Learn” and first introduced the idea of open access casebooks and what later became the eLangdell project. At the keynote talk I laid out a path to open access electronic casebooks using public domain case law as a starting point. On the ebook front, I was a couple of years early.

The basic idea was that casebooks were made up of cases (mostly) and that it was a fairly obvious idea to give the full text of cases to law faculty so that they could write their own casebooks and deliver them to their students electronically via the Web or as PDF files. This was before the Amazon Kindle and Apple iPad legitimized the ebook marketplace.

The Free Law Reporter is currently working on a Solr-based application to handle searching of all the case law they publish.

It has always seemed to me that the law is one of those areas that just cries out for the use of topic maps. The main problem is finding territory that isn’t already mostly occupied by current solutions, such as linking law to case files (done), linking depositions together, with firms to do the encoding/indexing (done), or linking work to the billing department (which probably came first).

Sharing data/legal analysis? Across systems? That might be of interest in public interest or large class action suits.