The Anatomy of Search Technology: blekko’s NoSQL database by Greg Lindahl.
From the post:
This is a guest post by Greg Lindahl, CTO of blekko, the spam free search engine that had over 3.5 million unique visitors in March. Greg Lindahl was Founder and Distinguished Engineer at PathScale, at which he was the architect of the InfiniPath low-latency InfiniBand HCA, used to build tightly-coupled supercomputing clusters.
Imagine that you’re crazy enough to think about building a search engine. It’s a huge task: the minimum index size needed to answer most queries is a few billion webpages. Crawling and indexing a few billion webpages requires a cluster with several petabytes of usable disk — that’s several thousand 1 terabyte disks — and produces an index that’s about 100 terabytes in size.
Greg starts with the storage aspects of the blekko search engine before taking on crawling in part 2 of this series.
Pay special attention to the combinators. You will be glad you did.