Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

January 13, 2013

Taming Text is released!

Filed under: Text Analytics,Text Mining — Patrick Durusau @ 8:09 pm

Taming Text is released! by Mike McCandless.

From the post:

There’s a new exciting book just published from Manning, with the catchy title Taming Text, by Grant S. Ingersoll (fellow Apache Lucene committer), Thomas S. Morton, and Andrew L. Farris.

I enjoyed the (e-)book: it does a good job covering a truly immense topic that could easily have taken several books. Text processing has become vital for businesses to remain competitive in this digital age, with the amount of online unstructured content growing exponentially with time. Yet, text is also a messy and therefore challenging science: the complexities and nuances of human language don’t follow a few simple, easily codified rules and are still not fully understood today.

The book describes search techniques, including tokenization, indexing, suggest and spell correction. It also covers fuzzy string matching, named entity extraction (people, places, things), clustering, classification, tagging, and a question answering system (think Jeopardy). These topics are challenging!

N-gram processing (both character and word ngrams) is featured prominently, which makes sense as it is a surprisingly effective technique for a number of applications. The book includes helpful real-world code samples showing how to process text using modern open-source tools including OpenNLP, Tika, Lucene, Solr and Mahout.
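As a quick illustration of what character and word n-grams are, here is a minimal sketch in plain Python. It is not taken from the book, whose samples use OpenNLP, Tika, Lucene, Solr and Mahout. The function names `char_ngrams` and `word_ngrams` are made up for this example, and splitting words on whitespace is a simplifying assumption, not real tokenization.

```python
def char_ngrams(text, n):
    """Return every contiguous run of n characters in text."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A string of length L has L - n + 1 character n-grams (none if n > L).
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return every contiguous run of n words, joined by single spaces."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Assumption: whitespace splitting stands in for a real tokenizer.
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


if __name__ == "__main__":
    print(char_ngrams("taming", 3))
    print(word_ngrams("taming text is released", 2))
```

Character n-grams like these are the sort of unit that fuzzy matching and spell correction can compare, while word n-grams capture short phrases.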

You can see:

Table of Contents.

Sample chapter 1

Sample chapter 8

Source code (98 MB)

Or you can do as I did: grab the source code and order the eBook (PDF) version of Taming Text.

More comments to follow!
