Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

May 28, 2012

The Anatomy of Search Technology: Crawling using Combinators [blekko – part 2]

Filed under: blekko,Search Engines — Patrick Durusau @ 7:09 pm

The Anatomy of Search Technology: Crawling using Combinators by Greg Lindahl.

From the post:

This is the second guest post (part 1) of a series by Greg Lindahl, CTO of blekko, the spam free search engine. Previously, Greg was Founder and Distinguished Engineer at PathScale, at which he was the architect of the InfiniPath low-latency InfiniBand HCA, used to build tightly-coupled supercomputing clusters.

What’s so hard about crawling the web?

Web crawlers have been around as long as the Web has — and before the web, there were crawlers for gopher and ftp. You would think that 25 years of experience would render crawling a solved problem, but the vast growth of the web and new inventions in the technology of webspam and other unsavory content result in a constant supply of new challenges. The general difficulty of tightly-coupled parallel programming also rears its head, as the web has scaled from millions to 100s of billions of pages.

In part 2, you learn why you were supposed to pay attention to combinators in part 1.

Want to take a few minutes to refresh on part 1?

Crawler problems still exist, but you may have some new approaches to try.
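The core idea behind the combinators the series keeps pointing to is that writes are expressed as commutative, associative operations, so pending updates can be merged anywhere in the pipeline instead of doing a read-modify-write against storage. Here is a minimal sketch of that idea in Python; the class name and the link-counting scenario are my own illustration, not blekko's actual API:

```python
class AddCombinator:
    """A hypothetical 'add' combinator: an integer increment that can be
    merged with any other pending increment, in any order."""

    def __init__(self, value=0):
        self.value = value

    def combine(self, other):
        # Commutative and associative: combine(a, b) == combine(b, a),
        # so batches of writes can be collapsed before reaching disk.
        return AddCombinator(self.value + other.value)


def merge_pending_writes(writes):
    # Collapse a batch of pending writes into a single write.
    total = AddCombinator()
    for w in writes:
        total = total.combine(w)
    return total


# e.g. several crawler processes each counting inbound links to one URL:
pending = [AddCombinator(1), AddCombinator(3), AddCombinator(2)]
print(merge_pending_writes(pending).value)  # 6
```

Because order never matters, the merge can happen in the writer, in an intermediate node, or at the storage layer, which is what makes this style attractive for a loosely coordinated crawl.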

The Anatomy of Search Technology: blekko’s NoSQL database [part 1]

Filed under: blekko,Search Engines — Patrick Durusau @ 6:57 pm

The Anatomy of Search Technology: blekko’s NoSQL database by Greg Lindahl.

From the post:

This is a guest post by Greg Lindahl, CTO of blekko, the spam free search engine that had over 3.5 million unique visitors in March. Greg Lindahl was Founder and Distinguished Engineer at PathScale, at which he was the architect of the InfiniPath low-latency InfiniBand HCA, used to build tightly-coupled supercomputing clusters.

Imagine that you’re crazy enough to think about building a search engine. It’s a huge task: the minimum index size needed to answer most queries is a few billion webpages. Crawling and indexing a few billion webpages requires a cluster with several petabytes of usable disk — that’s several thousand 1 terabyte disks — and produces an index that’s about 100 terabytes in size.
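The quoted figures hang together on a quick back-of-envelope check. Reading "a few" and "several" as roughly 3 (my assumption, not a figure from the post):

```python
# Assumed values: 3 billion pages, 3 PB of raw disk, 100 TB of index.
pages = 3e9
raw_disk_tb = 3_000        # several petabytes ~= 3,000 TB, i.e. ~3,000 1 TB disks
index_tb = 100

raw_per_page_kb = raw_disk_tb * 1e9 / pages    # 1 TB = 1e9 KB
index_per_page_kb = index_tb * 1e9 / pages

print(f"~{raw_per_page_kb:.0f} KB of raw storage per page")   # ~1000 KB
print(f"~{index_per_page_kb:.0f} KB of index per page")       # ~33 KB
```

So the numbers imply about a megabyte of working storage per crawled page (raw copies, intermediate data, replication) boiling down to a few tens of kilobytes of index per page, which is consistent with the "several thousand 1 terabyte disks" claim.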

Greg starts with the storage aspects of the blekko search engine before taking on crawling in part 2 of this series.

Pay special attention to the combinators. You will be glad you did.
