Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

December 20, 2010

Building blocks of a scalable webcrawler

Filed under: Indexing,Search Engines,Webcrawler — Patrick Durusau @ 4:41 am

From Marc Seeger’s post about his thesis:

This thesis documents my experiences trying to handle over 100 million sets of data while keeping them searchable. All of that happens while collecting and analyzing about 100 new domains per second. It covers topics from the different Ruby VMs (JRuby, Rubinius, YARV, MRI) to different storage backends (Riak, Cassandra, MongoDB, Redis, CouchDB, Tokyo Cabinet, MySQL, Postgres, …) and the data structures that they use in the background.

Questions:

  1. What components would need to be added to make this a semantic crawling project? (3-5 pages, citations)
  2. What scalability issues would semantic crawling introduce? (3-5 pages, citations)
  3. Design a configurable, scalable, semantic crawler. (Project)
