Web Data Commons Extraction Framework for the Distributed Processing of CC Data

Web Data Commons Extraction Framework for the Distributed Processing of CC Data by Robert Meusel.

Interested in a framework to process all the Common Crawl data?

From the post:

We used the extraction tool, for example, to extract a hyperlink graph covering over 3.5 billion pages and 126 billion hyperlinks from the 2012 CC corpus (over 100TB when uncompressed). Using our framework and 100 EC2 instances, the extraction took less than 12 hours and cost less than US$500. The extracted graph was less than 100GB compressed.

NSA-level processing it's not, but then you are most likely looking for useful results, not data for the sake of filling up drives.
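To make the hyperlink-graph extraction concrete, here is a minimal sketch of the per-page step such a job would repeat across the corpus: parse one page's HTML and collect its outgoing links as graph edges. This is an illustrative example using Python's standard library, not the actual Web Data Commons framework code, and the function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collects href targets from <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

def extract_links(page_url, html):
    """Return the outgoing hyperlinks of one page (hypothetical helper)."""
    parser = LinkCollector(page_url)
    parser.feed(html)
    return parser.links

# One page in, a list of (implicit source -> target) edges out;
# run over every page in the corpus, this yields the hyperlink graph.
page = '<a href="/about">About</a> <a href="http://example.org/">Ext</a>'
print(extract_links("http://example.com/index.html", page))
```

In a distributed run, each worker would apply a step like this to its slice of the crawl archives and emit edges for later aggregation.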
