Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

March 25, 2013

Implementing the RAKE Algorithm with NLTK

Filed under: Authoring Topic Maps,Natural Language Processing,NLTK,Python — Patrick Durusau @ 3:09 pm

Implementing the RAKE Algorithm with NLTK by Sujit Pal.

From the post:

The Rapid Automatic Keyword Extraction (RAKE) algorithm extracts keywords from text, by identifying runs of non-stopwords and then scoring these phrases across the document. It requires no training, the only input is a list of stop words for a given language, and a tokenizer that splits the text into sentences and sentences into words.

The RAKE algorithm is described in the book Text Mining Applications and Theory by Michael W Berry (free PDF). There is a (relatively) well-known Python implementation and somewhat less well-known Java implementation.

I started looking for something along these lines because I needed to parse a block of text before vectorizing it and using the resulting features as input to a predictive model. Vectorizing text is quite easy with Scikit-Learn as shown in its Text Processing Tutorial. What I was trying to do was to cut down the noise by extracting keywords from the input text and passing a concatenation of the keywords into the vectorizer. It didn’t improve results by much in my cross-validation tests, however, so I ended up not using it. But keyword extraction can have other uses, so I decided to explore it a bit more.

I had started off using the Python implementation directly from my application code (by importing it as a module). I soon noticed that it was doing a lot of extra work because it was implemented in pure Python. I was using NLTK anyway for other stuff in this application, so it made sense to convert it to also use NLTK so I could hand off some of the work to NLTK’s built-in functions. So here is another RAKE implementation, this time using Python and NLTK.

Reminds me of the “statistically insignificant phrases” at Amazon. Or was that “statistically improbable phrases?”

If you search on “statistically improbable phrases,” you get twenty (20) “hits” under books at Amazon.com.

Could be a handy tool to quickly extract candidates for topics in a topic map.

No Comments

No comments yet.

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress