Data Sciencing by Numbers: A Walk-through for Basic Text Analysis by Jason Baldridge.
From the post:
My previous post “Titillating Titles: Uncoding SXSW Proposal Titles with Text Analytics” discusses a simple exploration I did into algorithmically rating SXSW titles, most of which I did while on a plane trip last week. What I did was pretty basic, and to demonstrate that, I’m following up that post with one that explicitly shows you how you can do it yourself, provided you have access to a Mac or Unix machine.
There are three main components to doing what I did for the blog post:
- Topic modeling code: the Mallet toolkit’s implementation of Latent Dirichlet Allocation
- Language modeling code: the BerkeleyLM Java package for training and using n-gram language models
- Unix command line tools for processing raw text files with standard tools and the topic modeling and language modeling code

I’ll assume you can use the Unix command line at least at a basic level, and I’ve packaged up the topic modeling and language modeling code in the Github repository maul to make it easy to try them out. To keep it really simple: you can download the Maul code and then follow the instructions in the Maul README. (By the way, by giving it the name “maul” I don’t want to convey that it is important or anything — it is just a name I gave the repository, which is just a wrapper around other people’s code.)
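As a taste of the kind of command-line text processing the walk-through relies on, here is a minimal sketch of the classic Unix word-frequency pipeline. It assumes a hypothetical file `titles.txt` containing one proposal title per line; the filename and the top-ten cutoff are illustrative, not from the original post.

```shell
# Lowercase, tokenize, and rank word frequencies using only standard tools.
# "titles.txt" is a hypothetical input file, one title per line.
tr 'A-Z' 'a-z' < titles.txt \
  | tr -sc 'a-z' '\n' \
  | sort \
  | uniq -c \
  | sort -rn \
  | head -10
```

The second `tr` turns every run of non-letter characters into a single newline, giving one token per line; `sort | uniq -c | sort -rn` then counts and ranks the tokens. This pattern works on any Mac or Unix machine with no extra software installed.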
…
Jason’s post should help get you started doing data exercises. It is up to you whether you continue those exercises and branch out to other data and new tools.
Like everything else, data exploration proficiency requires regular exercise.
Are you keeping a data exercise calendar?
I first saw this in a post by Jason Baldridge.