Archive for the ‘Webcrawler’ Category

Heritrix

Saturday, August 25th, 2012

Heritrix

From the wiki page:

This is the public wiki for the Heritrix archival crawler project.

Heritrix is the Internet Archive’s open-source, extensible, web-scale, archival-quality web crawler project.

Heritrix (sometimes spelled heretrix, or misspelled or mis-said as heratrix/heritix/heretix/heratix) is an archaic word for heiress (woman who inherits). Since our crawler seeks to collect and preserve the digital artifacts of our culture for the benefit of future researchers and generations, this name seemed apt.

All topical contributions to this wiki (corrections, proposals for new features, new FAQ items, etc.) are welcome! Register using the link near the top-right corner of this page.

Heritrix is useful as a tool for creating a customized search collection or as reference code for a web crawler project.

I first saw this at Pete Warden’s Five Short Links for 24 August 2012.

Building a Scalable Web Crawler with Hadoop

Friday, January 27th, 2012

Building a Scalable Web Crawler with Hadoop

Ahad Rana of Common Crawl presents an architectural view of a web crawler based on Hadoop.

You can access the data from Common Crawl.

But the architecture notes may be useful if you decide to crawl a sub-part of the web and/or you need to crawl “deep web” data in your organization.

YaCy Search Engine

Wednesday, December 7th, 2011

YaCy – Decentralized Web Search

Has anyone seen this?

From the homepage:

YaCy is a free search engine that anyone can use to build a search portal for their intranet or to help search the public internet. When contributing to the world-wide peer network, the scale of YaCy is limited only by the number of users in the world and can index billions of web pages. It is fully decentralized, all users of the search engine network are equal, the network does not store user search requests and it is not possible for anyone to censor the content of the shared index. We want to achieve freedom of information through a free, distributed web search which is powered by the world’s users.

Limited demo interface: http://search.yacy.net/

Interesting idea.

It would be more interesting if it used a language that permitted dynamic updating of software while it is running. Otherwise, you are going to have the YaCy search engine you installed and nothing more.

Reportedly Google improves its search algorithm many times every quarter. How many of those changes are ad-driven, Google doesn’t say.

The documentation for YaCy is slim at best, particularly on technical details. For example, it uses a NoSQL database. OK, a custom one or one of the standard ones? I could go on but it would not produce any answers. As I explore the software I will post what I find out about it.

Building blocks of a scalable webcrawler

Monday, December 20th, 2010

Building blocks of a scalable webcrawler

From Marc Seeger’s post about his thesis:

This thesis documents my experiences trying to handle over 100 million sets of data while keeping them searchable. All of that happens while collecting and analyzing about 100 new domains per second. It covers topics from the different Ruby VMs (JRuby, Rubinius, YARV, MRI) to different storage-backend (Riak, Cassandra, MongoDB, Redis, CouchDB, Tokyo Cabinet, MySQL, Postgres, …) and the data-structures that they use in the background.
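To make "building blocks" a little more concrete, here is a minimal sketch of one of them: a crawl frontier that de-duplicates URLs and records the domains it has seen. It is not taken from the thesis (which works in Ruby and at a far larger scale). The class name `Frontier` and its methods are hypothetical, everything is held in memory, and it assumes that dropping the URL fragment is enough normalization for illustration purposes.

```python
from collections import deque
from urllib.parse import urljoin, urlsplit


class Frontier:
    """Hypothetical in-memory crawl frontier: a FIFO queue plus a seen-set."""

    def __init__(self):
        self._queue = deque()   # URLs waiting to be fetched
        self._seen = set()      # every URL ever queued
        self.domains = set()    # host names discovered so far

    def add(self, url, base=None):
        """Queue a URL once; return True if it was new and crawlable."""
        absolute = urljoin(base, url) if base else url
        parts = urlsplit(absolute)
        # Only http(s) URLs with a host are considered crawlable here.
        if parts.scheme not in ("http", "https") or not parts.hostname:
            return False
        # Drop the fragment so page#a and page#b count as one page.
        normalized = parts._replace(fragment="").geturl()
        if normalized in self._seen:
            return False
        self._seen.add(normalized)
        self.domains.add(parts.hostname)
        self._queue.append(normalized)
        return True

    def next_url(self):
        """Return the next URL to fetch, or None when the queue is empty."""
        return self._queue.popleft() if self._queue else None
```

A real crawler at the scale described would need persistent storage for the seen-set, per-host politeness delays, and robots.txt handling, none of which this sketch attempts.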

Questions:

  1. What components would need to be added to make this a semantic crawling project? (3-5 pages, citations)
  2. What scalability issues would semantic crawling introduce? (3-5 pages, citations)
  3. Design a configurable, scalable, semantic crawler. (Project)