Entity Discovery using Mahout CollocDriver by Sujit Pal.
From the post:
I spent most of last week trying out various approaches to extract “interesting” phrases from a collection of articles. The objective was to identify candidate concepts that could be added to our taxonomy. There are various approaches, ranging from simple NGram frequencies, to algorithms such as RAKE (Rapid Automatic Keyword Extraction), to rescoring NGrams using Log Likelihood or Chi-squared measures. In this post, I describe how I used Mahout’s CollocDriver (which uses the Log Likelihood measure) to find interesting phrases from a small corpus of about 200 articles.
The articles were in various formats (PDF, DOC, HTML), and I used Apache Tika to parse them into text (yes, I finally found the opportunity to learn Tika :-)). Tika provides parsers for many common formats, so all we have to do was to hook them up to produce text from the various file formats. Here is my code:
Think of this as winnowing the chaff that your human experts would otherwise read.
A possible next step would be to decorate the candidate “interesting” phrases with additional information before being viewed by your expert(s).