Archive for the ‘OpenNLP’ Category

Use Cases for Taming Text, 2nd ed.

Saturday, January 25th, 2014

Use Cases for Taming Text, 2nd ed. by Grant Ingersoll.

From the post:

Drew Farris, Tom Morton and I are currently working on the 2nd Edition of Taming Text ( for first ed.) and are soliciting interested parties who would be willing to contribute to a chapter on practical use cases (i.e. you have something in production and are willing to write about it) for search with Solr, NLP using OpenNLP or Stanford NLP and machine learning using Mahout, OpenNLP or MALLET — ideally you are using combinations of 2 or more of these to solve your problems. We are especially interested in large scale use cases in eCommerce, Advertising, social media analytics, fraud, etc.

The writing process is fairly straightforward. A section roughly equates to somewhere between 3 – 10 pages, including diagrams/pictures. After writing, there will be some feedback from editors and us, but otherwise the process is fairly simple.

In order to participate, you must have permission from your company to write on the topic. You would not need to divulge any proprietary information, but we would want enough information for our readers to gain a high-level understanding of your use case. In exchange for your participation, you will have your name and company published on that section of the book as well as in the acknowledgments section. If you have a copy of Lucene in Action or Mahout In Action, it would be similar to the use case sections in those books.


I am guessing the second edition isn’t going to take as long as the first. ­čśë

Couldn’t be in better company as far as co-authors.

See the post for the contact details.

Apache OpenNLP 1.5.2-incubating

Tuesday, November 29th, 2011

From the announcement of the release of Apache OpenNLP 1.5.2-incubating:

The Apache OpenNLP team is pleased to announce the release of version 1.5.2-incubating of Apache OpenNLP.

The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution.

The OpenNLP 1.5.2-incubating binary and source distributions are available for download from our download page:

The OpenNLP library is distributed by Maven Central as well. See the Maven Dependency page for more details:

This release contains a couple of new features, improvements and bug fixes. The maxent trainer can now run in multiple threads to utilize multi-core CPUs, configurable feature generation was added to the name finder, the perceptron trainer was refactored and improved, machine learners can now be configured with much more options via a parameter file, evaluators can print out detailed evaluation information.

Additionally the release contains the following noteworthy changes:

  • Improved the white space handling in the Sentence Detector and its training code
  • Added more cross validator command line tools
  • Command line handling code has been refactored
  • Fixed problems with the new build
  • Now uses fast token class feature generation code by default
  • Added support for BioNLP/NLPBA 2004 shared task data
  • Removal of old and deprecated code
  • Dictionary case sensitivity support is now done properly
  • Support for OSGi

For a complete list of fixed bugs and improvements please see the RELEASE_NOTES file included in the distribution.

Using Lucene and Cascalog for Fast Text Processing at Scale

Monday, November 7th, 2011

Using Lucene and Cascalog for Fast Text Processing at Scale

From the post:

Here at Yieldbot we do a lot of text processing of analytics data. In order to accomplish this in a reasonable amount of time, we use Cascalog, a data processing and querying library for Hadoop; written in Clojure. Since Cascalog is Clojure, you can develop and test queries right inside of the Clojure REPL. This allows you to iteratively develop processing workflows with extreme speed. Because Cascalog queries are just Clojure code, you can access everything Clojure has to offer, without having to implement any domain specific APIs or interfaces for custom processing functions. When combined with Clojure’s awesome Java Interop, you can do quite complex things very simply and succinctly.

Many great Java libraries already exist for text processing, e.g., Lucene, OpenNLP, LingPipe, Stanford NLP. Using Cascalog allows you take advantage of these existing libraries with very little effort, leading to much shorter development cycles.

By way of example, I will show how easy it is to combine Lucene and Cascalog to do some (simple) text processing. You can find the entire code used in the examples over on Github.  

The world of text exploration just gets better all the time!