Archive for the ‘NLTK’ Category

New Natural Language Processing and NLTK Videos

Saturday, May 2nd, 2015

Natural Language Processing With Python and NLTK p.1 Tokenizing words and Sentences and Stop Words – Natural Language Processing With Python and NLTK p.2 by Harrison Kinsley.

From part 1:

Natural Language Processing is the task we give computers to read and understand (process) written text (natural language). By far, the most popular toolkit or API to do natural language processing is the Natural Language Toolkit for the Python programming language.

The NLTK module comes packed full of everything from trained algorithms to identify parts of speech to unsupervised machine learning algorithms to help you train your own machine to understand a specific bit of text.

NLTK also comes with a large corpora of data sets containing things like chat logs, movie reviews, journals, and much more!

Bottom line, if you’re going to be doing natural language processing, you should definitely look into NLTK!

Playlist link:…

sample code:

Use the Playlist link:… link as I am sure more videos will be appearing in the near future.


Thoughts on Software Development Python NLTK/Neo4j:…

Saturday, January 10th, 2015

Python NLTK/Neo4j: Analysing the transcripts of How I Met Your Mother by Mark Needham.

From the post:

After reading Emil’s blog post about dark data a few weeks ago I became intrigued about trying to find some structure in free text data and I thought How I met your mother’s transcripts would be a good place to start.

I found a website which has the transcripts for all the episodes and then having manually downloaded the two pages which listed all the episodes, wrote a script to grab each of the transcripts so I could use them on my machine.

Interesting intermarriage between NLTK and Neo4j. Perhaps even more so if NLTK were used to extract information from dialogue outside of fictional worlds and Neo4j was used to model dialogue roles, etc., as well as relationships and events outside of the dialogue.

Congressional hearings (in the U.S., same type of proceedings outside the U.S.) would make an interesting target for analysis using NLTK and Neo4j.

Python 3 Text Processing with NLTK 3 Cookbook

Friday, November 28th, 2014

Python 3 Text Processing with NLTK 3 Cookbook by Jacobs Perkins.

From the post:

After many weekend writing sessions, the 2nd edition of the NLTK Cookbook, updated for NLTK 3 and Python 3, is available at Amazon and Packt. Code for the book is on github at nltk3-cookbook. Here’s some details on the changes & updates in the 2nd edition:

First off, all the code in the book is for Python 3 and NLTK 3. Most of it should work for Python 2, but not all of it. And NLTK 3 has made many backwards incompatible changes since version 2.0.4. One of the nice things about Python 3 is that it’s unicode all the way. No more issues with ASCII versus unicode strings. However, you do have to deal with byte strings in a few cases. Another interesting change is that hash randomization is on by default, which means that if you don’t set the PYTHONHASHSEED environment variable, training accuracy can change slightly on each run, because the iteration order of dictionaries is no longer consistent by default.

It’s never too late to update your wish list! 😉


Dive Into NLTK

Saturday, November 1st, 2014

Dive Into NLTK Part I: Getting Started with NLTK

From the post:

NLTK is the most famous Python Natural Language Processing Toolkit, here I will give a detail tutorial about NLTK. This is the first article in a series where I will write everything about NLTK with Python, especially about text mining and text analysis online.

This is the first article in the series “Dive Into NLTK”, here is an index of all the articles in the series that have been published to date:

Part I: Getting Started with NLTK (this article)
Part II: Sentence Tokenize and Word Tokenize
Part III: Part-Of-Speech Tagging and POS Tagger
Part IV: Stemming and Lemmatization
Part V: Using Stanford Text Analysis Tools in Python
Part VI: Add Stanford Word Segmenter Interface for Python NLTK
Part VII: A Preliminary Study on Text Classification

Kudos for the refreshed index at the start of each post. Ease of navigation is a plus!

Have you considered subjecting your “usual” reading to NLTK? That is rather than analyzing a large corpus, what about the next CS article you are meaning to read?

The most I have done so far is to build concordances for standard drafts, mostly to catch bad keyword usage and misspelling. There is a lot more that could be done. Suggestions?

Enjoy this series!

Deep dive into understanding human language with Python

Saturday, September 13th, 2014

Deep dive into understanding human language with Python by Alyona Medelyan.


Whenever your data is text and you need to analyze it, you are likely to need Natural Language Processing algorithms that help make sense of human language. They will help you answer questions like: Who is the author of this text? What is his or her attitude? What is it about? What facts does it mention? Do I have similar texts like this one already? Where does it belong to?

This tutorial will cover several open-source Natural Language Processing Python libraries such as NLTK, Gensim and TextBlob, show you how they work and how you can use them effectively.

Level: Intermediate (knowledge of basic Python language features is assumed)

Pre-requisites: a Python environment with NLTK, Gensim and TextBlob already installed. Please make sure to run and install movie_reviews and stopwords (under Corpora), as well as POS model (under Models).

Code examples, data and slides from Alyona’s NLP tutorial at KiwiPyCon 2014.

Introduction to NLTK, Gensim and TextBlob.

Not enough to make you dangerous but enough to get you interested in natural language processing.

NLTK 3.0 Is Out!

Sunday, September 7th, 2014

NLTK 3.0

The online book has been updated:

Porting your code to NLTK 3.0


NLTK 3.0 Beta!

Wednesday, July 23rd, 2014

NLTK 3.0 Beta!

The official name is nltk 3.0.0b1 but I thought 3.0 beta rolls off the tongue better. 😉

Interface changes.

Grab the latest, contribute bug reports, etc.

Visualizing Philosophers And Scientists

Tuesday, July 1st, 2014

Visualizing Philosophers And Scientists By The Words They Used With Python and d3.js by Sahand Saba.

From the post:

This is a rather short post on a little fun project I did a couple of weekends ago. The purpose was mostly to demonstrate how easy it is to process and visualize large amounts of data using Python and d3.js.

With the goal of visualizing the words that were most associated with a given scientist or philosopher, I downloaded a variety of science and philosophy books that are in the public domain (project Gutenberg, more specifically), and processed them using Python (scikit-learn and nltk), then used d3.js and d3.js cloud by Jason Davies ( to visualize the words most frequently used by the authors. To make it more interesting, only words that are somewhat unique to the author are displayed (i.e. if a word is used frequently by all authors then it is likely not that interesting and is dropped from the results). This can be easily achieved using the max_df parameter of the CountVectorizer class.

I pass by Copleston’s A History of Philosophy several times a day. It is a paperback edition from many years ago that I keep meaning to re-read.

At least for philosophers with enough surviving texts in machine readable format, perhaps Sahand’s post will provide the incentive to return to reading Copleston. A word cloud is one way to explore a text. Commentary, such as Copleston’s, is another.

What other tools would you use with philosophers and a commentary like Copleston?

I first saw this in a tweet by Christophe Viau.

Using NLTK for Named Entity Extraction

Sunday, April 27th, 2014

Using NLTK for Named Entity Extraction by Emily Daniels.

From the post:

Continuing on from the previous project, I was able to augment the functions that extract character names using NLTK’s named entity module and an example I found online, building my own custom stopwords list to run against the returned names to filter out frequently used words like “Come”, “Chapter”, and “Tell” which were caught by the named entity functions as potential characters but are in fact just terms in the story.

Whether you are trusting your software or using human proofing, named entity extraction is a key task in mining data.

Having extracted named entities, the harder task is uncovering relationships between them that may not be otherwise identified.

Challenging with the text of Oliver Twist but even more difficult when mining donation records and the Congressional record.

Saving Output of nltk Text.Concordance()

Friday, April 18th, 2014

Saving Output of NLTK Text.Concordance() by Kok Hua.

From the post:

In NLP, sometimes users would like to search for series of phrases that contain particular keyword in a passage or web page.

NLTK provides the function concordance() to locate and print series of phrases that contain the keyword. However, the function only print the output. The user is not able to save the results for further processing unless redirect the stdout.

Below function will emulate the concordance function and return the list of phrases for further processing. It uses the NLTK concordance Index which keeps track of the keyword index in the passage/text and retrieve the surrounding words.

Text mining is a very common part of topic map construction so tools that help with that task are always welcome.

To be honest, I am citing this because it may become part of several small tools for processing standards drafts. Concordance software is not rare but a full concordance of a document seems to frighten some proof readers.

The current thinking being if only the “important” terms are highlighted in context, that some proof readers will be more likely to use the work product.

The same principal applies to the authoring of topic maps as well.

NLTK-like Wordnet Interface in Scala

Wednesday, April 16th, 2014

NLTK-like Wordnet Interface in Scala by Sujit Pal.

From the post:

I recently figured out how to setup the Java WordNet Library (JWNL) for something I needed to do at work. Prior to this, I have been largely unsuccessful at figuring out how to access Wordnet from Java, unless you count my one attempt to use the Java Wordnet Interface (JWI) described here. I think there are two main reason for this. First, I just didn’t try hard enough, since I could get by before this without having to hook up Wordnet from Java. The second reason was the over-supply of libraries (JWNL, JWI, RiTa, JAWS, WS4j, etc), each of which annoyingly stops short of being full-featured in one or more significant ways.

The one Wordnet interface that I know that doesn’t suffer from missing features comes with the Natural Language ToolKit (NLTK) library (written in Python). I have used it in the past to access Wordnet for data pre-processing tasks. In this particular case, I needed to call it at runtime from within a Java application, so I finally bit the bullet and chose a library to integrate into my application – I chose JWNL based on seeing it being mentioned in the Taming Text book (and used in the code samples). I also used code snippets from Daniel Shiffman’s Wordnet page to learn about the JWNL API.

After I had successfully integrated JWNL, I figured it would be cool (and useful) if I could build an interface (in Scala) that looked like the NLTK Wordnet interface. Plus, this would also teach me how to use JWNL beyond the basic stuff I needed for my webapp. My list of functions were driven by the examples from the Wordnet section (2.5) from the NLTK book and the examples from the NLTK Wordnet Howto. My Scala class implements most of the functions mentioned on these two pages. The following session will give you an idea of the coverage – even though it looks a Python interactive session, it was generated by my JUnit test. I do render the Synset and Word (Lemma) objects using custom format() methods to preserve the illusion (and to make the output readable), but if you look carefully, you will notice the rendering of List() is Scala’s and not Python’s.

NLTK is amazing in its own right and creating a Scala interface will give you an excuse to learn Scala. That’s a win-win situation!

Analyzing PubMed Entries with Python and NLTK

Wednesday, February 19th, 2014

Analyzing PubMed Entries with Python and NLTK by Themos Kalafatis.

From the post:

I decided to take my first steps of learning Python with the following task : Retrieve all entries from PubMed and then analyze those entries using Python and the Text Mining library NLTK.

We assume that we are interested in learning more about a condition called Sudden Hearing Loss. Sudden Hearing Loss is considered a medical emergency and has several causes although usually it is idiopathic (a disease or condition the cause of which is not known or that arises spontaneously according to Wikipedia).

At the moment of writing, the PubMed Query for sudden hearing loss returns 2919 entries :

A great illustration of using NLTK but of the iterative nature of successful querying.

Some queries, quite simple ones, can and do succeed on the first attempt.

Themos demonstrates how to use NLTK to explore a data set where the first response isn’t all that helpful.

This is a starting idea for weekly exercises with NLTK. Exercises which emphasize different aspects of NLTK.

Extracting Insights – FBO.Gov

Tuesday, January 21st, 2014

Extracting Insights from FBO.Gov data – Part 1

Extracting Insights from FBO.Gov data – Part 2

Extracting Insights from FBO.Gov data – Part 3

Dave Fauth has written a great three part series on extracting “insights” from large amounts of data.

From the third post in the series:

Earlier this year, Sunlight foundation filed a lawsuit under the Freedom of Information Act. The lawsuit requested solication and award notices from In November, Sunlight received over a decade’s worth of information and posted the information on-line for public downloading. I want to say a big thanks to Ginger McCall and Kaitlin Devine for the work that went into making this data available.

In the first part of this series, I looked at the data and munged the data into a workable set. Once I had the data in a workable set, I created some heatmap charts of the data looking at agencies and who they awarded contracts to. In part two of this series, I created some bubble charts looking at awards by Agency and also the most popular Awardees.

In the third part of the series, I am going to look at awards by date and then displaying that information in a calendar view. Then we will look at the types of awards.

For the date analysis, we are going to use all of the data going back to 2000. We have six data files that we will join together, filter on the ‘Notice Type’ field, and then calculate the counts by date for the awards. The goal is to see when awards are being made.

The most compelling lesson from this series is that data doesn’t always easily give up its secrets.

If you make it to the end of the series, you will find the government, on occasion, does the right thing. I’ll admit it, I was very surprised. 😉

Unpublished Data (Meaning What?)

Sunday, January 5th, 2014

PLoS Biology Bigrams by Georg.

From the post:

Here I will use the Natural Language Toolkit and a recipe from Python Text Processing with NLTK 2.0 Cookbook to work out the most frequent bigrams in the PLoS Biology articles that I downloaded last year and have described in previous posts here and here.

The amusing twist in this blog post is that the most frequent bigram, after filtering out stopwords, is unpublished data.

Not a trivial data set, some 1,754 articles.

Do you see the flaw in saying that most articles in PLoS data use “unpublished” data?

First, without looking at the data, I would be asking for the number of bigrams for each of the top six bigrams. I suspect that “gene expression” is used frequently relative to the number of articles, but I can’t make that judgment with the information given.

Second, the other question you would need to ask is why an article used the bigram “unpublished data.”

If I were writing a paper about papers that used “unpublished data” or more generally about “unpublished data,” I would use the bigram a lot. That would not mean my article was based on “unpublished data.”

NLTK can point you to the articles but deeper analysis is going to require you.

Google’s Python Lessons are Awesome

Sunday, November 3rd, 2013

Google’s Python Lessons are Awesome by Hartley Brody.

From the post:

Whether you’re just starting to learn Python, or you’ve been working with it for awhile, take note.

The lovably geeky Nick Parlante — a Google employee and CS lecturer at Stanford — has written some awesomely succinct tutorials that not only tell you how you can use Python, but also how you should use Python. This makes them a fantastic resource, regardless of whether you’re just starting, or you’ve been working with Python for awhile.

The course also features six YouTube videos of Nick giving a lesson in front of some new Google employees. These make it feel like he’s actually there teaching you every feature and trick, and I’d highly recommend watching all of them as you go through the lessons. Some of the videos are longish (~50m) so this is something you want to do when you’re sitting down and focused.

And to really get your feet wet, there are also downloadable samples puzzles and challenges that go along with the lessons, so you can actually practice coding along with the Googlers in his class. They’re all pretty basic — most took me less than 5m — but they’re a great chance to practice what you’ve learned. Plus you get the satisfaction that comes with solving puzzles and successfully moving through the class.

I am studying the NLTK to get ready for a text analysis project. At least to be able to read along. This looks like a great resource to know about.

I also like the idea of samples, puzzles and challenges.

Not that samples, puzzles and challenges would put topic maps over the top but it would make instruction/self-learning more enjoyable.

Finding Parties Named in U.S. Law…

Friday, August 16th, 2013

Finding Parties Named in U.S. Law using Python and NLTK by Gary Sieling.

From the post:

U.S. Law periodically names specific institutions; historically it is possible for Congress to write a law naming an individual, although I think that has become less common. I expect the most common entities named in Federal Law to be groups like Congress. It turns out this is true, but the other most common entities are the law itself and bureaucratic functions like archivists.

To get at this information, we need to read the Code XML, and use a natural language processing library to get at the named groups.

NLTK is such an NLP library. It provides interesting features like sentence parsing, part of speech tagging, and named entity recognition. (If interested in the subject see my review of “Natural Language Processing with Python“, a book which covers this library in detail)

I would rather know who paid for particular laws but that requires information external to the Code XML data set. 😉

A very good exercise to become familiar with both NLTK and the Code XML data set.

NLTK 2.1 – Working with Text Corpora

Sunday, June 9th, 2013

NLTK 2.1 – Working with Text Corpora by Vsevolod Dyomkin.

From the post:

Let’s return to start of chapter 2 and explore the tools needed to easily and efficiently work with various linguistic resources.

What are the most used and useful corpora? This is a difficult question to answer because different problems will likely require specific annotations and often a specific corpus. There are even special conferences dedicated to corpus linguistics.

Here’s a list of the most well-known general-purpose corpora:

  • Brown Corpus – one of the first big corpora and the only one in the list really easily accessible – we’ve already worked with it in the first chapter
  • Penn Treebank – Treebank is a corpus of sentences annotated with their constituency parse trees so that they can be used to train and evaluate parsers
  • Reuters Corpus (not to be confused with the ApteMod version provided with NLTK)
  • British National Corpus (BNC) – a really huge corpus, but, unfortunately, not freely available

Another very useful resource which isn’t structured specifically as academic corpora mentioned above, but at the same time has other dimensions of useful connections and annotations is Wikipedia. And there’s being a lot of interesting linguistic research performed with it.

Besides there are two additional valuable language resources that can’t be classified as text corpora at all, but rather as language databases: WordNet and Wiktionary. We have already discussed CL-NLP interface to Wordnet. And we’ll touch working with Wiktionary in this part.

Vsevolod continues to recast the NLTK into Lisp.

Learning corpus processing along with Lisp. How can you lose?

Finding Significant Phrases in Tweets with NLTK

Sunday, May 12th, 2013

Finding Significant Phrases in Tweets with NLTK by Sujit Pal.

From the post:

Earlier this week, there was a question about finding significant phrases in text on the Natural Language Processing People (login required) group on LinkedIn. I suggested looking at this LingPipe tutorial. The idea is to find statistically significant word collocations, ie, those that occur more frequently than we can explain away as due to chance. I first became aware of this approach from the LLG Book, where two approaches are described – one based on Log-Likelihood Ratios (LLR) and one based on the Chi-Squared test of independence – the latter is used by LingPipe.

I had originally set out to actually provide an implementation for my suggestion (to answer a followup question). However, the Scipy Pydoc notes that the chi-squared test may be invalid when the number of observed or expected frequencies in each category are too small. Our algorithm compares just two observed and expected frequencies, so it probably qualifies. Hence I went with the LLR approach, even though it is slightly more involved.

The idea is to find, for each bigram pair, the likelihood that the components are dependent on each other versus the likelihood that they are not. For bigrams which have a positive LLR, we repeat the analysis by adding its neighbor word, and arrive at a list of trigrams with positive LLR, and so on, until we reach the N-gram level we think makes sense for the corpus. You can find an explanation of the math in one of my earlier posts, but you will probably find a better explanation in the LLG book.

For input data, I decided to use Twitter. I’m not that familiar with the Twitter API, but I’m taking the Introduction to Data Science course on Coursera, and the first assignment provided some code to pull data from the Twitter 1% feed, so I just reused that. I preprocess the feed so I am left with about 65k English tweets using the following code:

An interesting look “behind the glass” on n-grams.

I am using AntConc to generate n-grams for proofing standards prose.

But as a finished tool, AntConc doesn’t give you insight into the technical side of the process.

Inter-Document Similarity with Scikit-Learn and NLTK

Saturday, May 4th, 2013

Inter-Document Similarity with Scikit-Learn and NLTK by Sujit Pal.

From the post:

Someone recently asked me about using Python to calculate document similarity across text documents. The application had to do with cheating detection, ie, compare student transcripts and flag documents with (abnormally) high similarity for further investigation. For security reasons, I could not get access to actual student transcripts. But the basic idea was to convince ourselves that this approach is valid, and come up with a code template for doing this.

I have been playing quite a bit with NLTK lately, but for this work, I decided to use the Python ML Toolkit Scikit-Learn, which has pretty powerful text processing facilities. I did end up using NLTK for its cosine similarity function, but that was about it.

I decided to use the coffee-sugar-cocoa mini-corpus of 53 documents to test out the code – I first found this in Dr Manu Konchady’s TextMine project, and I have used it off and on. For convenience I have made it available at the github location for the sub-project.

Similarity measures are fairly well understood.

But they lack interesting data sets for testing code.

Here are some random suggestions:

  • Speeches by Republicans on Benghazi
  • Speeches by Democrats on Gun Control
  • TV reports on any particular disaster
  • News reports of sporting events
  • Dialogue from popular TV shows

With a five to ten second lag, perhaps streams of speech could be monitored for plagiarism or repetition and simply dropped.


Open Sentiment Analysis

Thursday, May 2nd, 2013

Open Sentiment Analysis by Pete Warden.

From the post:

Sentiment analysis is fiendishly hard to solve well, but easy to solve to a first approximation. I’ve been frustrated that there have been no easy free libraries that make the technology available to non-specialists like me. The problem isn’t with the code, there are some amazing libraries like NLTK out there, but everyone guards their training sets of word weights jealously. I was pleased to discover that SentiWordNet is now CC-BY-SA, but even better I found that Finn Årup has made a drop-dead simple list of words available under an Open Database License!

With that in hand, I added some basic tokenizing code and was able to implement a new text2sentiment API endpoint for the Data Science Toolkit:

BTW, while you are there, take a look at the Data Science Toolkit more generally.

Glad to hear about the open set of word weights.

Sentiment analysis with undisclosed word weights sounds iffy to me.

It’s like getting a list of rounded numbers but you don’t know the rounding factor.

Even worse with sentiment analysis because every rounding factor may be different.

NLTK 2.3 – Working with Wordnet

Friday, April 12th, 2013

NLTK 2.3 – Working with Wordnet by Vsevolod Dyomkin.

From the post:

I’m a little bit behind my schedule of implementing NLTK examples in Lisp with no posts on topic in March. It doesn’t mean that work on CL-NLP has stopped – I’ve just had an unexpected vacation and also worked on parts, related to writing programs for the excellent Natural Language Processing by Michael Collins Coursera course.

Today we’ll start looking at Chapter 2, but we’ll do it from the end, first exploring the topic of Wordnet.

Vsevolod more than makes up for his absence with his post on Wordnet.

As a sample, consider this graphic of the potential of Wordnet:

Wordnet schema

Pay particular attention to the coverage of similarity measures.


Implementing the RAKE Algorithm with NLTK

Monday, March 25th, 2013

Implementing the RAKE Algorithm with NLTK by Sujit Pal.

From the post:

The Rapid Automatic Keyword Extraction (RAKE) algorithm extracts keywords from text, by identifying runs of non-stopwords and then scoring these phrases across the document. It requires no training, the only input is a list of stop words for a given language, and a tokenizer that splits the text into sentences and sentences into words.

The RAKE algorithm is described in the book Text Mining Applications and Theory by Michael W Berry (free PDF). There is a (relatively) well-known Python implementation and somewhat less well-known Java implementation.

I started looking for something along these lines because I needed to parse a block of text before vectorizing it and using the resulting features as input to a predictive model. Vectorizing text is quite easy with Scikit-Learn as shown in its Text Processing Tutorial. What I was trying to do was to cut down the noise by extracting keywords from the input text and passing a concatenation of the keywords into the vectorizer. It didn’t improve results by much in my cross-validation tests, however, so I ended up not using it. But keyword extraction can have other uses, so I decided to explore it a bit more.

I had started off using the Python implementation directly from my application code (by importing it as a module). I soon noticed that it was doing a lot of extra work because it was implemented in pure Python. I was using NLTK anyway for other stuff in this application, so it made sense to convert it to also use NLTK so I could hand off some of the work to NLTK’s built-in functions. So here is another RAKE implementation, this time using Python and NLTK.

Reminds me of the “statistically insignificant phrases” at Amazon. Or was that “statistically improbable phrases?”

If you search on “statistically improbable phrases,” you get twenty (20) “hits” under books at

Could be a handy tool to quickly extract candidates for topics in a topic map.

NLTK 1.3 – Computing with Language: Simple Statistics

Wednesday, March 6th, 2013

NLTK 1.3 – Computing with Language: Simple Statistics by Vsevolod Dyomkin.

From the post:

Most of the remaining parts of the first chapter of NLTK book serve as an introduction to Python in the context of text processing. I won’t translate that to Lisp, because there’re much better resources explaining how to use Lisp properly. First and foremost I’d refer anyone interested to the appropriate chapters of Practical Common Lisp:

List Processing
Macros: Standard Control Constructs

It’s only worth noting that Lisp has a different notion of lists, than Python. Lisp’s lists are linked lists, while Python’s are essentially vectors. Lisp also has vectors as a separate data-structure, and it also has multidimensional arrays (something Python mostly lacks). And the set of Lisp’s list operations is somewhat different from Python’s. List is the default sequence data-structure, but you should understand its limitations and know, when to switch to vectors (when you will have a lot of elements and often access them at random). Also Lisp doesn’t provide Python-style syntactic sugar for slicing and dicing lists, although all the operations are there in the form of functions. The only thing which isn’t easily reproducible in Lisp is assigning to a slice:

Vsevolod continues his journey through chapter 1 of NLTK 1.3 focusing on the statistics (with examples).

NLTK 1.1 – Computing with Language: …

Monday, March 4th, 2013

NLTK 1.1 – Computing with Language: Texts and Words by Vsevolod Dyomkin.

From the post:

OK, let’s get started with the NLTK book. Its first chapter tries to impress the reader with how simple it is to accomplish some neat things with texts using it. Actually, the underlying algorithms that allow to achieve these results are mostly quite basic. We’ll discuss them in this post and the code for the first part of the chapter can be found in nltk/ch1-1.lisp.

A continuation of Natural Language Meta Processing with Lisp.

Who knows? You might decide that Lisp is a natural language. 😉

A Consumer Electronics Named Entity Recognizer using NLTK [Post-Authoring ER?]

Saturday, December 1st, 2012

A Consumer Electronics Named Entity Recognizer using NLTK by Sujit Pal.

From the post:

Some time back, I came across a question someone asked about possible approaches to building a Named Entity Recognizer (NER) for the Consumer Electronics (CE) industry on LinkedIn’s Natural Language Processing People group. I had just finished reading the NLTK Book and had some ideas, but I wanted to test my understanding, so I decided to build one. This post describes this effort.

The approach is actually quite portable and not tied to NLTK and Python, you could, for example, build a Java/Scala based NER using components from OpenNLP and Weka using this approach. But NLTK provides all the components you need in one single package, and I wanted to get familiar with it, so I ended up using NLTK and Python.

The idea is that you take some Consumer Electronics text, mark the chunks (words/phrases) you think should be Named Entities, then train a (binary) classifier on it. Each word in the training set, along with some features such as its Part of Speech (POS), Shape, etc is a training input to the classifier. If the word is part of a CE Named Entity (NE) chunk, then its trained class is True otherwise it is False. You then use this classifier to predict the class (CE NE or not) of words in (previously unseen) text from the Consumer Electronics domain.

Should help with mining data for “entities” (read “subjects” in the topic map sense) for addition to your topic map.

I did puzzle over the suggestion for improvement that reads:

Another idea is to not do reference resolution during tagging, but instead postponing this to a second stage following entity recognition. That way, the references will be localized to the text under analysis, thus reducing false positives.

Post-authoring reference resolution might benefit from that approach.

But, if references were resolved by authors during the creation of a text, such as the insertion of Wikipedia references for entities, a different result would be obtained.

In those cases, assuming the author of a text is identified, they can be associated with a particular set of reference resolutions.

First Steps with NLTK

Friday, October 26th, 2012

First Steps with NLTK by Sujit Pal.

From the post:

Most of what I know about NLP is as a byproduct of search, ie, find named entities in (medical) text and annotating them with concept IDs (ie node IDs in our taxonomy graph). My interest in NLP so far has been mostly as a user, like using OpenNLP to do POS tagging and chunking. I’ve been meaning to learn a bit more, and I did take the Stanford Natural Language Processing class from Coursera. It taught me a few things, but still not enough for me to actually see where a deeper knowledge would actually help me. Recently (over the past month and a half), I have been reading the NLTK Book and the NLTK Cookbook in an effort to learn more about NLTK, the Natural Language Toolkit for Python.

This is not the first time I’ve been through the NLTK book, but it is the first time I have tried working out all the examples and (some of) the exercises (available on GitHub here), and I feel I now understand the material a lot better than before. I also realize that there are parts of NLP that I can safely ignore at my (user) level, since they are not either that baked out yet or because their scope of applicability is rather narrow. In this post, I will describe what I learned, where NLTK shines, and what one can do with it.

You will find the structured listing of links into the NLTK PyDocs very useful.

Explore Python, machine learning, and the NLTK library

Wednesday, October 10th, 2012

Explore Python, machine learning, and the NLTK library by Chris Joakim (, Senior Software Engineer, Primedia Inc.

From the post:

The challenge: Use machine learning to categorize RSS feeds

I was recently given the assignment to create an RSS feed categorization subsystem for a client. The goal was to read dozens or even hundreds of RSS feeds and automatically categorize their many articles into one of dozens of predefined subject areas. The content, navigation, and search functionality of the client website would be driven by the results of this daily automated feed retrieval and categorization.

The client suggested using machine learning, perhaps with Apache Mahout and Hadoop, as she had recently read articles about those technologies. Her development team and ours, however, are fluent in Ruby rather than Java™ technology. This article describes the technical journey, learning process, and ultimate implementation of a solution.

If a wholly automated publication process leaves you feeling uneasy, imagine the same system that feeds content to subject matter experts for further processing.

Think of it as processing raw ore on the way to finding diamonds and then deciding which ones get polished.

GATE, NLTK: Basic components of Machine Learning (ML) System

Thursday, October 4th, 2012

GATE, NLTK: Basic components of Machine Learning (ML) System by Krishna Prasad.

From the post:

I am currently building a Machine Learning system. In this blog I want to captures the elements of a machine learning system.

My definition of a Machine Learning System is to take voice or text inputs from a user and provide relevant information. And over a period of time, learn the user behavior and provides him better information. Let us hold on to this comment and dissect apart each element.

In the below example, we will consider only text input. Let us also assume that the text input will be a freeflowing English text.

  • As a 1st step, when someone enters a freeflowing text, we need to understand what is the noun, what is the verb, what is the subject and what is the predicate. For doing this we need a Parts of Speech analyzer (POS), for example “I want a Phone”. One of the components of Natural Language Processing (NLP) is POS.
  • For associating relationship between a noun and a number, like “Phone greater than 20 dollers”, we need to run the sentence thru a rule engine. The terminology used for this is Semantic Rule Engine
  • The 3rd aspect is the Ontology, where in each noun needs to translate to a specific product or a place. For example, if someone says “I want a Bike” it should translate as “I want a Bicycle” and it should interpret that the company that manufacture a bicycle is BSA, or a Trac. We typically need to build a Product Ontology
  • Finally if you have buying pattern of a user and his friends in the system, we need a Recommendation Engine to give the user a proper recommendation

What would you add (or take away) to make the outlined system suitable as a topic map authoring assistant?

Feel free to add more specific requirements/capabilities.

I first saw this at DZone.

Calculating Word and N-Gram Statistics from the Gutenberg Corpus

Wednesday, April 11th, 2012

Calculating Word and N-Gram Statistics from the Gutenberg Corpus by Richard Marsden.

From the post:

Following on from the previous article about scanning text files for word statistics, I shall extend this to use real large corpora. First we shall use this script to create statistics for the entire Gutenberg English language corpus. Next I shall do the same with the entire English language Wikipedia.

A “get your feet wet” sort of exercise with the script included.

The Gutenberg project isn’t “big data” but it is more than your usual inbox.

Think of it as learning about the data set for application of more sophisticated algorithms.

NLTK Trees

Saturday, December 17th, 2011

NLTK Trees by Richard Marsden.

A number of NLTK functions work with Tree objects. For example, part of speech tagging and chunking classifiers, naturally return trees. Sentence manipulation functions also work with trees. Although Natural Language Processing with Python (Bird et al) includes a couple of pages about NLTK’s Tree module, coverage is generally sparse. The online documentation actually contains some good coverage although it is not always in the most logical location (e.g. the unit tests contain some very good documentation). This article is intended as a quick introduction, and the more informative documentation pages are listed under Further Reading.

A handy introduction to NLTK trees.