Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

November 21, 2014

October 2014 Crawl Archive Available

Filed under: Common Crawl,Data — Patrick Durusau @ 5:41 pm

October 2014 Crawl Archive Available by Stephen Merity.

From the post:

The crawl archive for October 2014 is now available! This crawl archive is over 254TB in size and contains 3.72 billion webpages. The files are located in the aws-publicdatasets bucket at /common-crawl/crawl-data/CC-MAIN-2014-42/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By simply adding either s3://aws-publicdatasets/ or https://aws-publicdatasets.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.
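The prefixing step described above can be sketched in a few lines of Python. This is only an illustration, not something from the original post: the function names `to_full_paths` and `read_listing` and the sample file name `example.warc.gz` are hypothetical, and it assumes a listing file has already been downloaded locally and holds one relative path per line.

```python
import gzip

# The two prefixes quoted in the announcement.
S3_PREFIX = "s3://aws-publicdatasets/"
HTTP_PREFIX = "https://aws-publicdatasets.s3.amazonaws.com/"


def to_full_paths(lines):
    """Return (s3_url, http_url) pairs for each non-empty relative path."""
    pairs = []
    for line in lines:
        rel = line.strip()
        if not rel:
            continue  # skip blank lines
        pairs.append((S3_PREFIX + rel, HTTP_PREFIX + rel))
    return pairs


def read_listing(path):
    """Read a locally saved gzipped listing file (hypothetical helper)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return to_full_paths(handle)


if __name__ == "__main__":
    # Hypothetical relative path; real listings contain the actual file names.
    sample = ["common-crawl/crawl-data/CC-MAIN-2014-42/example.warc.gz\n"]
    for s3_url, http_url in to_full_paths(sample):
        print(s3_url)
        print(http_url)
```

Calling `read_listing` on a downloaded listing file would give both forms of every path at once, so either an S3 client or a plain HTTP client can be used for the download.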

Thanks again to blekko for their ongoing donation of URLs for our crawl!

Just in time for weekend exploration! 😉

Enjoy!
