UNIX, Bi-Grams, Tri-Grams, and Topic Modeling by Greg Brown.
From the post:
I’ve built up a list of UNIX commands over the years for doing basic text analysis on written language. I’ve built this list from a number of sources (Jim Martin’s NLP class, StackOverflow, web searches), but haven’t seen it much in one place. With these commands I can analyze everything from log files to user poll responses.
Mostly this just comes down to how cool UNIX commands are (which you probably already know). But the magic is how you mix them together. Hopefully you find these recipes useful. I’m always looking for more so please drop into the comments to tell me what I’m missing.
For all of these examples I assume that you are analyzing a series of user responses with one response per line in a single file: data.txt. With a few cut and paste commands I often apply the same methods to CSV files and log files.
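The recipes themselves are in Greg’s post. To give a flavor of the kind of pipeline involved, here is a minimal sketch of an n-gram counter built from standard tools. It is not Greg’s exact recipe: the function name ngrams and the file sample.txt are made up for illustration, and the tokenization is an assumption (lowercase everything and treat any non-letter as a word break, so "don't" becomes "don" and "t"). N-grams never cross a line, which matches the one-response-per-line layout.

```shell
# ngrams N FILE: count word n-grams within each line of FILE.
# Hypothetical helper for illustration, not taken from the original post.
ngrams() {
  n="$1"; file="$2"
  # Lowercase, then turn every run of non-letters (except newlines) into one space.
  tr '[:upper:]' '[:lower:]' < "$file" |
    tr -cs '[:alpha:]\n' ' ' |
    # Emit each window of n consecutive words on its own line.
    awk -v n="$n" '{
      for (i = 1; i + n - 1 <= NF; i++) {
        gram = $i
        for (j = 1; j < n; j++) gram = gram " " $(i + j)
        print gram
      }
    }' |
    # Count duplicates and list the most frequent n-grams first.
    sort | uniq -c | sort -rn
}

# Tiny made-up sample so the sketch runs on its own.
printf 'The cat sat.\nThe cat ran!\n' > sample.txt
ngrams 2 sample.txt | head -n 3   # top entry is "the cat" with a count of 2
```

Calling ngrams 3 data.txt gives tri-grams for the same file, and any larger N works the same way. The sort | uniq -c | sort -rn tail is the usual idiom for turning a stream of items into a frequency table.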
My favorite comment on Greg’s post came from a reader who extended the tri-gram generator to build a hexagram!
If that sounds unreasonable, you haven’t read very many government reports. 😉
While you are at Greg’s blog, notice a number of useful posts on Elasticsearch.