Building a Scalable Web Crawler with Hadoop
Ahad Rana of Common Crawl presents an architectural view of a web crawler based on Hadoop.
The crawl data itself is available from Common Crawl.
The architecture notes may still be useful if you decide to crawl a subset of the web or need to crawl “deep web” data within your organization.