Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

July 17, 2014

April 2014 Crawl Data Available

Filed under: Common Crawl — Patrick Durusau @ 2:40 pm

April 2014 Crawl Data Available by Stephen Merity.

From the post:

The April crawl of 2014 is now available! The new dataset is over 183TB in size containing approximately 2.6 billion webpages. The new data is located in the aws-publicdatasets bucket at /common-crawl/crawl-data/CC-MAIN-2014-15/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By simply adding either s3://aws-publicdatasets/ or https://aws-publicdatasets.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

Thanks again to Blekko for their ongoing donation of URLs for our crawl!
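
The prefixing step described in the post can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two prefixes are the ones quoted above, but the function names and the sample path `example.warc.gz` are hypothetical, and the listing file is assumed to hold one relative path per line.

```python
import gzip

# Prefixes quoted in the Common Crawl announcement.
S3_PREFIX = "s3://aws-publicdatasets/"
HTTP_PREFIX = "https://aws-publicdatasets.s3.amazonaws.com/"


def expand_paths(lines):
    """Yield (s3_url, http_url) for each non-empty path line."""
    for line in lines:
        # Drop the newline and any leading slash so the prefix joins cleanly.
        path = line.strip().lstrip("/")
        if not path:
            continue
        yield S3_PREFIX + path, HTTP_PREFIX + path


def expand_listing(gz_filename):
    """Read a gzipped listing (assumed: one path per line) and expand every line."""
    with gzip.open(gz_filename, "rt", encoding="utf-8") as handle:
        return list(expand_paths(handle))


if __name__ == "__main__":
    # Hypothetical listing line, used only to show the shape of the output.
    sample = ["common-crawl/crawl-data/CC-MAIN-2014-15/example.warc.gz\n"]
    for s3_url, http_url in expand_paths(sample):
        print(s3_url)
        print(http_url)
```

Only the small listing files need to be downloaded for this; the 183TB of crawl data stays where it is, and the resulting URLs can be fetched one file at a time.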

Well, at 183TB, I don’t guess I am going to have a local copy. 😉

Enjoy!

1 Comment

  1. […] Just in case you have exhausted all the possibilities with the April Crawl Data. […]

    Pingback by July 2014 Crawl Data Available [Honeypot Detection] « Another Word For It — August 13, 2014 @ 3:26 pm
