The Anatomy of Search Technology: Crawling using Combinators by Greg Lindahl.
From the post:
This is the second guest post (part 1) of a series by Greg Lindahl, CTO of blekko, the spam free search engine. Previously, Greg was Founder and Distinguished Engineer at PathScale, at which he was the architect of the InfiniPath low-latency InfiniBand HCA, used to build tightly-coupled supercomputing clusters.
What’s so hard about crawling the web?
Web crawlers have been around as long as the Web has — and before the web, there were crawlers for gopher and ftp. You would think that 25 years of experience would render crawling a solved problem, but the vast growth of the web and new inventions in the technology of webspam and other unsavory content results in a constant supply of new challenges. The general difficulty of tightly-coupled parallel programming also rears its head, as the web has scaled from millions to 100s of billions of pages.
In part 2, you learn why you were supposed to pay attention to combinators in part 1.
Want to take a few minutes to refresh on part 1?
Crawler problems still exist, but you may have some new approaches to try.