April 2014 Crawl Data Available by Stephen Merity.
From the post:
The April crawl of 2014 is now available! The new dataset is over 183TB in size containing approximately 2.6 billion webpages. The new data is located in the aws-publicdatasets bucket at /common-crawl/crawl-data/CC-MAIN-2014-15/.
To assist with exploring and using the dataset, we’ve provided gzipped files that list:
- all segments (CC-MAIN-2014-15/segment.paths.gz)
- all WARC files (CC-MAIN-2014-15/warc.paths.gz)
- all WAT files (CC-MAIN-2014-15/wat.paths.gz)
- all WET files (CC-MAIN-2014-15/wet.paths.gz)
By simply adding either s3://aws-publicdatasets/ or https://aws-publicdatasets.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.
Thanks again to Blekko for their ongoing donation of URLs for our crawl!
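The prefixing step described in the post can be sketched in a few lines of Python. This is only an illustration, not anything Common Crawl ships: the function names are my own, and I am assuming each line of a listing is a key relative to the bucket root, which is what the post implies. The sample path in the usage note is made up.

```python
import gzip
import io

# Prefixes quoted in the post.
S3_PREFIX = "s3://aws-publicdatasets/"
HTTP_PREFIX = "https://aws-publicdatasets.s3.amazonaws.com/"


def read_path_listing(fileobj):
    """Yield text lines from a gzipped listing opened in binary mode.

    Hypothetical helper: fileobj could be a downloaded warc.paths.gz
    opened with open(name, "rb"), or any binary file-like object.
    """
    with gzip.open(fileobj, "rt", encoding="utf-8") as fh:
        for line in fh:
            yield line


def expand_paths(lines, prefix):
    """Prepend prefix to every non-empty line, dropping surrounding whitespace."""
    return [prefix + line.strip() for line in lines if line.strip()]
```

For example, with an in-memory listing holding one invented entry, `expand_paths(read_path_listing(io.BytesIO(gzip.compress(b"common-crawl/crawl-data/CC-MAIN-2014-15/segments/EXAMPLE/warc/example.warc.gz\n"))), HTTP_PREFIX)` returns a one-element list whose entry starts with `https://aws-publicdatasets.s3.amazonaws.com/common-crawl/`.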
Well, at 183TB, I don’t guess I am going to have a local copy. 😉
Enjoy!