Building blocks of a scalable webcrawler
From Marc Seeger’s post about his thesis:
This thesis documents my experiences trying to handle over 100 million sets of data while keeping them searchable. All of that happens while collecting and analyzing about 100 new domains per second. It covers topics from the different Ruby VMs (JRuby, Rubinius, YARV, MRI) to different storage backends (Riak, Cassandra, MongoDB, Redis, CouchDB, Tokyo Cabinet, MySQL, Postgres, …) and the data structures that they use in the background.
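The quote only lists the moving parts at a high level. As a way of making the "building blocks" concrete, here is a minimal Python sketch (not Seeger's code, which is Ruby-based) of a crawl loop written against a pluggable storage-backend interface, the abstraction that lets Riak, MongoDB, Postgres, etc. be swapped in. All names (`StorageBackend`, `InMemoryBackend`, `crawl`) and the example seed URL are illustrative assumptions, not from the thesis.

```python
# Illustrative building blocks: a frontier (URL queue), a fetcher, a link
# extractor, and a pluggable storage backend. The in-memory backend is a toy
# stand-in for the Riak/Cassandra/Postgres/... adapters the thesis compares.
from abc import ABC, abstractmethod
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class StorageBackend(ABC):
    """Interface the crawl loop codes against; swap implementations freely."""

    @abstractmethod
    def save(self, url: str, document: str) -> None: ...

    @abstractmethod
    def seen(self, url: str) -> bool: ...


class InMemoryBackend(StorageBackend):
    """Toy backend; a real deployment would persist to a database instead."""

    def __init__(self) -> None:
        self._docs: dict[str, str] = {}

    def save(self, url: str, document: str) -> None:
        self._docs[url] = document

    def seen(self, url: str) -> bool:
        return url in self._docs


class LinkExtractor(HTMLParser):
    """Collects href targets so they can be fed back into the frontier."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def crawl(seeds: list[str], backend: StorageBackend, limit: int = 10) -> None:
    """Breadth-first crawl: pop from the frontier, fetch, store, enqueue links."""
    frontier = deque(seeds)
    while frontier and limit > 0:
        url = frontier.popleft()
        if backend.seen(url):
            continue
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="replace")
        except Exception:
            continue  # skip unreachable or malformed pages
        backend.save(url, html)
        extractor = LinkExtractor(url)
        extractor.feed(html)
        frontier.extend(extractor.links)
        limit -= 1


if __name__ == "__main__":
    crawl(["https://example.com/"], InMemoryBackend(), limit=3)
```

Keeping the "have we seen this URL" check and document storage behind one interface is what makes backend comparisons like Seeger's possible without touching the crawl loop itself.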
Questions:
- What components would need to be added to make this a semantic crawling project? (3-5 pages, citations; one candidate component is sketched after this list)
- What scalability issues would semantic crawling introduce? (3-5 pages, citations)
- Design a configurable, scalable, semantic crawler. (Project)
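As a concrete starting point for the first question: one component a semantic crawling project typically adds is structured-data extraction from fetched pages. The sketch below (hypothetical names, Python rather than Ruby) pulls schema.org JSON-LD annotations out of raw HTML so they could be stored alongside the documents; RDFa or microdata parsing and an ontology-driven frontier policy would be further candidates.

```python
# Hypothetical semantic-extraction component: collects the contents of
# <script type="application/ld+json"> blocks, which commonly carry
# schema.org annotations, and parses them into Python objects.
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Gathers JSON-LD objects embedded in a page."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self.items: list[dict] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and ("type", "application/ld+json") in attrs:
            self._in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld:
            try:
                self.items.append(json.loads(data))
            except json.JSONDecodeError:
                pass  # ignore malformed annotations


def extract_semantics(html: str) -> list[dict]:
    """Return any JSON-LD objects embedded in the page markup."""
    extractor = JsonLdExtractor()
    extractor.feed(html)
    return extractor.items


if __name__ == "__main__":
    sample = (
        '<html><head><script type="application/ld+json">'
        '{"@context": "https://schema.org", "@type": "Organization",'
        ' "name": "Example"}</script></head><body></body></html>'
    )
    print(extract_semantics(sample))
```

In a full design, the extracted objects would feed a separate semantic store (e.g. a triple store) and could also inform crawl prioritization, which is where the scalability questions above come in.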