Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

August 31, 2014

Web Data Commons Extraction Framework …

Filed under: Common Crawl,Software — Patrick Durusau @ 2:33 pm

Web Data Commons Extraction Framework for the Distributed Processing of CC Data by Robert Meusel.

Interested in a framework to process all the Common Crawl data?

From the post:

We used the extraction tool for example to extract a hyperlink graph covering over 3.5 billion pages and 126 billion hyperlinks from the 2012 CC corpus (over 100TB when uncompressed). Using our framework and 100 EC2 instances, the extraction took less than 12 hours and cost less than US$ 500. The extracted graph had a size of less than 100GB zipped.
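The framework's heavy lifting is distribution (WARC files fanned out across EC2 instances), but the unit of work it parallelizes is simple: pull the outgoing hyperlinks from one crawled page to build edges of the hyperlink graph. Here is a minimal, stdlib-only sketch of that per-page step; the `extract_links` helper and the sample page are illustrative, not part of the actual Web Data Commons codebase.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute hrefs from <a> tags on a single crawled page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL so every
                    # edge in the graph is an absolute URL.
                    self.links.append(urljoin(self.base_url, value))

def extract_links(page_url, html):
    """Return the hyperlink-graph edges (target URLs) found on one page."""
    parser = LinkExtractor(page_url)
    parser.feed(html)
    return parser.links

# One page yields two edges of the hyperlink graph:
edges = extract_links("http://example.org/a",
                      '<a href="/b">b</a> <a href="http://other.org/">o</a>')
print(edges)  # ['http://example.org/b', 'http://other.org/']
```

Run this over every HTML record in every WARC file of the corpus and you have the 126 billion edges the post describes; the framework's job is scheduling that loop across machines.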

NSA-level processing it's not, but then you are most likely looking for useful results, not data for the sake of filling up drives.
