running mahout collocations over common crawl text by Mat Kelcey.
From the post:
Common crawl is a publicly available 30TB web crawl taken between September 2009 and September 2010. As a small project I decided to extract and tokenise the visible text of the web pages in this dataset. All the code to do this is on github.
Can you answer Mat’s question about the incidence of Lithuanian pages? (Please post here.)