Navigating the WARC File Format by Stephen Merity.
From the post:
CommonCrawl has recently switched to the Web ARChive (WARC) format. The WARC format allows for more efficient storage and processing of CommonCrawl’s free multi-billion page web archives, which can be hundreds of terabytes in size.
This document aims to give you an introduction to working with the new format, specifically the differences between:
- WARC files, which store the raw crawl data
- WAT files, which store computed metadata for the data stored in the WARC
- WET files, which store extracted plaintext from the data stored in the WARC
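To make the raw format concrete: a WARC file is a sequence of records, each consisting of a version line (`WARC/1.0`), a block of `Name: value` header fields, a blank line, and then the record payload. Below is a minimal sketch in plain Java that parses the header block of a single record. The class and method names (`WarcHeaderDemo`, `parseHeaders`) and the hand-written sample record are ours for illustration; they are not part of CommonCrawl's official Hadoop examples.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

public class WarcHeaderDemo {
    // Parse the header block of one WARC record: a version line
    // (e.g. "WARC/1.0") followed by "Name: value" lines, terminated
    // by a blank line. The payload that follows is not read here.
    static Map<String, String> parseHeaders(BufferedReader in) throws Exception {
        Map<String, String> headers = new LinkedHashMap<>();
        String line = in.readLine();          // version line, e.g. WARC/1.0
        headers.put("version", line);
        while ((line = in.readLine()) != null && !line.isEmpty()) {
            int colon = line.indexOf(':');
            headers.put(line.substring(0, colon).trim(),
                        line.substring(colon + 1).trim());
        }
        return headers;
    }

    public static void main(String[] args) throws Exception {
        // A tiny hand-written record standing in for real crawl data.
        String record =
            "WARC/1.0\r\n" +
            "WARC-Type: response\r\n" +
            "WARC-Target-URI: http://example.com/\r\n" +
            "Content-Length: 13\r\n" +
            "\r\n" +
            "Hello, crawl!";
        Map<String, String> h =
            parseHeaders(new BufferedReader(new StringReader(record)));
        System.out.println(h.get("WARC-Type") + " " + h.get("WARC-Target-URI"));
    }
}
```

In a real pipeline you would read gzipped records from disk (WAT and WET files use the same record framing, so the same header parsing applies); CommonCrawl's Java/Hadoop examples handle that end to end.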
If you want all the nitty-gritty details, the best source is the ISO standard, for which the final draft is available.
If you’re more interested in diving into code, we’ve provided three introductory examples in Java that use the Hadoop framework to process WARC, WAT and WET files.
If you aren’t already using Common Crawl data, you should be.
Fresh Data Available:
The latest dataset is from March 2014, contains approximately 2.8 billion webpages, and is located in Amazon Public Data Sets at /common-crawl/crawl-data/CC-MAIN-2014-10.
What are you going to look for in 2.8 billion webpages?