Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

July 17, 2011

Building blocks of a scalable web crawler

Filed under: Indexing,NoSQL,Search Engines,Searching,SQL — Patrick Durusau @ 7:29 pm

Building blocks of a scalable web crawler Thesis by Marc Seeger. (2010)

Abstract:

The purpose of this thesis was the investigation and implementation of a good architecture for collecting, analysing and managing website data on a scale of millions of domains. The final project is able to automatically collect data about websites and analyse the content management system they are using.

To be able to do this efficiently, different possible storage back-ends were examined and a system was implemented that is able to gather and store data at a fast pace while still keeping it searchable.

This thesis is a collection of the lessons learned while working on the project, combined with the background knowledge that went into the architectural decisions. It presents an overview of the different infrastructure possibilities and general approaches, as well as explaining the choices that were made for the implemented system.
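The thesis evaluates real storage back-ends; purely as an illustrative sketch (all names here are hypothetical, not from the thesis), an append-only record list paired with an in-memory inverted index shows the basic trade-off it describes: writes stay cheap while the collected data remains searchable.

```python
from collections import defaultdict

class DomainStore:
    """Toy sketch: fast appends plus a keyword index so stored data stays searchable."""

    def __init__(self):
        self.records = []                # append-only (domain, text) records
        self.index = defaultdict(set)    # token -> positions in self.records

    def add(self, domain, text):
        # Appending is O(1); indexing touches only the new record's tokens.
        pos = len(self.records)
        self.records.append((domain, text))
        for token in text.lower().split():
            self.index[token].add(pos)

    def search(self, token):
        # Lookup is a dictionary hit, independent of total store size.
        return [self.records[i][0] for i in sorted(self.index.get(token.lower(), ()))]

store = DomainStore()
store.add("a.example", "Powered by WordPress")
store.add("b.example", "Static HTML pages")
```

A production system would of course persist both structures; the point is only that indexing at write time is what keeps millions of records instantly queryable.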

From the conclusion:

The implemented architecture has been recorded processing up to 100 domains per second on a single server. At the end of the project the system gathered information about approximately 100 million domains. The collected data can be searched instantly and the automated generation of statistics is visualized in the internal web interface.
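Sustaining around 100 domains per second on one machine is mostly a matter of overlapping many slow network requests. As a rough sketch (the fetch function and worker count are hypothetical stand-ins, not the thesis's code), a worker pool over a list of domains captures the idea:

```python
from concurrent.futures import ThreadPoolExecutor

def process_domain(domain):
    # Hypothetical stand-in: the real crawler would issue an HTTP request
    # and analyse the response to detect the content management system.
    return {"domain": domain, "cms": "unknown"}

def crawl(domains, workers=32):
    # While one worker waits on the network, others make progress,
    # so throughput scales with the worker count until I/O saturates.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_domain, domains))

results = crawl([f"site{i}.example" for i in range(100)])
```

The real architecture is considerably more involved (queueing, politeness, storage), but concurrency of this shape is the usual starting point.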

Most of your clients will have smaller information demands, but the lessons here will stand you in good stead with their systems too.
