Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

March 6, 2012

running mahout collocations over common crawl text

Filed under: Common Crawl,Mahout — Patrick Durusau @ 8:09 pm

running mahout collocations over common crawl text by Mat Kelcey.

From the post:

Common crawl is a publicly available 30TB web crawl taken between September 2009 and September 2010. As a small project I decided to extract and tokenise the visible text of the web pages in this dataset. All the code to do this is on GitHub.

Can you answer Mat’s question about the incidence of Lithuanian pages? (Please post here.)

