Poor man’s “entity” extraction with Solr by Erik Hatcher.

From the post:

My work at LucidWorks primarily involves helping customers build their desired solutions. Recently, more than one customer has inquired about doing “entity extraction”. Entity extraction, as defined on Wikipedia, “seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.” When drilling down into the specifics of the requirements from our customers, it turns out that many of them have straightforward solutions using built-in (Solr 4.x) components, such as:

  • Acronyms as facets
  • Key words or phrases, from a fixed list, as facets
  • Lat/long mentions as geospatial points

This article will describe and demonstrate how to do these, and as a bonus we’ll also extract URLs found in text. Let’s start with an example input and the corresponding output that all of the described techniques provide.

If you have been thinking about experimenting with Solr, Erik touches on some of its features by example.
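
To get a feel for what the extractions listed above amount to, here is a minimal sketch in plain Python. It is not Solr configuration and not Erik's implementation; the regular expressions, the `KEYWORDS` list, and the `extract` function are all assumptions made for illustration, and a real setup would do this work inside Solr's analysis and update chain.

```python
import re

# Hypothetical fixed keyword list; a real deployment would supply its own.
KEYWORDS = ["entity extraction", "solr", "facet"]

# Assumed patterns, deliberately simple:
ACRONYM_RE = re.compile(r"\b[A-Z]{2,}\b")                       # runs of 2+ capitals
LATLON_RE = re.compile(r"(-?\d{1,2}\.\d+)\s*,\s*(-?\d{1,3}\.\d+)")  # "lat,lon" decimals
URL_RE = re.compile(r"https?://[^\s<>\"]+")                     # http/https URLs


def extract(text):
    """Return acronyms, known keywords, lat/long points, and URLs found in text."""
    lowered = text.lower()
    return {
        # Unique acronyms, sorted so the output is stable.
        "acronyms": sorted(set(ACRONYM_RE.findall(text))),
        # Case-insensitive substring match against the fixed list.
        "keywords": [k for k in KEYWORDS if k in lowered],
        # Each match becomes a (latitude, longitude) pair of floats.
        "points": [(float(lat), float(lon)) for lat, lon in LATLON_RE.findall(text)],
        # Trim sentence punctuation that trails a URL.
        "urls": [u.rstrip(".,;:!?)") for u in URL_RE.findall(text)],
    }


sample = ("The CIA and NASA met at 38.8977,-77.0365 to discuss Solr; "
          "see http://example.com/page.")
result = extract(sample)
print(result)
```

Each key in the result corresponds to one of the bullets in the quoted list, plus the bonus URL extraction. The keyword check is a plain substring test, so "facet" would also match "facets"; whether that is desirable depends on the use case.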

