Tokenising the visible English text of Common Crawl, by Mat Kelcey.
From the post:
Common Crawl is a publicly available 30 TB web crawl taken between September 2009 and September 2010. As a small project I decided to extract and tokenise the visible text of the web pages in this dataset. All the code to do this is on GitHub.
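The core of the task the quoted post describes, extracting the visible text of a page and splitting it into tokens, can be sketched in a few lines of stdlib Python. This is only an illustrative sketch, not Kelcey's actual code (which is on his GitHub); the class and function names here are invented, and the tokenisation scheme (lowercase alphabetic runs) is a deliberately crude placeholder.

```python
import re
from html.parser import HTMLParser

class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping <script> and <style> contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # >0 while inside a script/style element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)

def tokenise(html: str) -> list[str]:
    """Return lowercase alphabetic tokens from a page's visible text."""
    parser = VisibleTextParser()
    parser.feed(html)
    text = " ".join(parser.chunks)
    return re.findall(r"[a-z]+", text.lower())

page = ("<html><head><style>body{color:red}</style></head>"
        "<body><p>Hello, Common Crawl!</p>"
        "<script>var x = 1;</script></body></html>")
print(tokenise(page))  # → ['hello', 'common', 'crawl']
```

Running this over 30 TB of crawl data is, of course, where the "small project" stops being small: the original work distributed the job rather than parsing pages one at a time on a single machine.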
Well, 30 TB of data certainly sounds like a small project.
What small amount of data are you using for your next project?