Taming Text is released! by Mike McCandless.
From the post:
There’s an exciting new book just published by Manning, with the catchy title Taming Text, by Grant S. Ingersoll (fellow Apache Lucene committer), Thomas S. Morton, and Andrew L. Farris.
I enjoyed the (e-)book: it does a good job covering a truly immense topic that could easily have taken several books. Text processing has become vital for businesses to remain competitive in this digital age, with the amount of online unstructured content growing exponentially with time. Yet, text is also a messy and therefore challenging science: the complexities and nuances of human language don’t follow a few simple, easily codified rules and are still not fully understood today.
The book describes search techniques, including tokenization, indexing, suggest, and spell correction. It also covers fuzzy string matching, named entity extraction (people, places, things), clustering, classification, tagging, and a question answering system (think Jeopardy). These topics are challenging!
N-gram processing (both character and word ngrams) is featured prominently, which makes sense as it is a surprisingly effective technique for a number of applications. The book includes helpful real-world code samples showing how to process text using modern open-source tools including OpenNLP, Tika, Lucene, Solr and Mahout.
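For readers new to the idea, here is a rough illustration of character and word n-grams. This is a minimal Python sketch and not code from the book (whose samples use the tools listed above); the function names `char_ngrams`, `word_ngrams`, and `jaccard` are hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
def char_ngrams(text, n):
    """Return every contiguous run of n characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return every contiguous run of n lowercased words in text.

    Assumption: splitting on whitespace is a good enough tokenizer
    for illustration purposes.
    """
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def jaccard(a, b):
    """Overlap of two n-gram collections: |A & B| / |A | B|."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # two empty collections are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


print(char_ngrams("text", 2))                     # ['te', 'ex', 'xt']
print(word_ngrams("Taming Text is released", 2))  # ['taming text', 'text is', 'is released']

# Character trigrams give a cheap fuzzy match between a word and a misspelling.
print(jaccard(char_ngrams("taming", 3), char_ngrams("tamign", 3)))
```

Comparing sets of character n-grams is one simple way to score how close a misspelled word is to a candidate correction, which hints at why the technique turns up in so many text applications.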
You can do as I did: grab the source code and order the eBook (PDF) version of Taming Text.
More comments to follow!