Natural Language Processing and Big Data: Using NLTK and Hadoop – Talk Overview by Benjamin Bengfort.
From the post:
My previous startup, Unbound Concepts, created a machine learning algorithm that determined the textual complexity (e.g. reading level) of children’s literature. Our approach started as a natural language processing problem — designed to pull out language features to train our algorithms, and then quickly became a big data problem when we realized how much literature we had to go through in order to come up with meaningful representations. We chose to combine NLTK and Hadoop to create our Big Data NLP architecture, and we learned some useful lessons along the way. This series of posts is based on a talk done at the April Data Science DC meetup.
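The NLTK-plus-Hadoop combination described above typically means writing Hadoop Streaming mappers in Python that emit per-token feature counts. As a rough illustration only (the post does not show code), here is a minimal word-count-style mapper sketch; a simple regex tokenizer stands in for NLTK's `word_tokenize` so the sketch runs without NLTK's corpus downloads:

```python
# Minimal Hadoop Streaming mapper sketch: emit (token, 1) pairs for
# downstream aggregation by a reducer. Illustrative only; in the
# architecture described, nltk.word_tokenize would replace tokenize().
import re
import sys

def tokenize(text):
    """Stand-in for an NLTK tokenizer: lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def mapper(lines):
    """Yield tab-separated key/value records as Hadoop Streaming expects."""
    for line in lines:
        for token in tokenize(line):
            yield "%s\t1" % token

if __name__ == "__main__":
    # Hadoop Streaming feeds input splits on stdin, line by line.
    for record in mapper(sys.stdin):
        print(record)
```

The reducer side would then sum the counts per token, giving the corpus-wide term frequencies that language-feature extraction starts from.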
Think of this post as the CliffsNotes of the talk and the upcoming series of posts, so you don’t have to read every word … but trust me, it’s worth it.
If you can’t wait for the future posts, Benjamin’s presentation from April is here. The slides are amusing but fairly sparse.
Looking forward to more posts in this series!
Big Data and Natural Language Processing – Part 1
The “Foo” of Big Data – Part 2
Python’s Natural Language Toolkit (NLTK) and Hadoop – Part 3