Common Crawl’s Move to Nutch by Jordan Mendelson.
From the post:
Last year we transitioned from our custom crawler to the Apache Nutch crawler to run our 2013 crawls as part of our migration from our old data center to the cloud.
Our old crawler was highly tuned to our data center environment where every machine was identical with large amounts of memory, hard drives and fast networking.
We needed something that would allow us to do web-scale crawls of billions of webpages and would work in a cloud environment where we might run on heterogeneous machines with differing amounts of memory, CPU, and disk space depending on the price, plus VMs that might go up and down and varying levels of networking performance.
Before you hand-roll a custom web crawler, read this short but useful report on Common Crawl’s experience with Nutch.