Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

October 20, 2013

Crawl Anywhere

Filed under: Search Engines,Search Interface,Solr,Webcrawler — Patrick Durusau @ 5:59 pm

Crawl Anywhere 4.0.0-release-candidate available

From the Overview:

What is Crawl Anywhere?

Crawl Anywhere allows you to build vertical search engines. Crawl Anywhere includes:

  • a Web Crawler with a powerful Web user interface
  • a document processing pipeline
  • a Solr indexer
  • a full featured and customizable search application

This diagram shows a typical use of all components.

Why was Crawl Anywhere created?

Crawl Anywhere was originally developed to index 5,400 web sites (more than 10,000,000 pages) in Apache Solr for the Hurisearch search engine: http://www.hurisearch.org/. During this project, various crawlers were evaluated (Heritrix, Nutch, …) but one key feature was missing: a user-friendly web interface to manage the Web sites to be crawled, each with its specific crawl rules. Mainly for this reason, we decided to develop our own Web crawler. Why did we choose the name "Crawl Anywhere"? The name may seem a little overstated, but crawling any source type (Web, database, CMS, …) is a real objective, and Crawl Anywhere was designed to make new source connectors easy to implement.
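
To make "Web sites to be crawled with their specific crawl rules" concrete, here is a minimal sketch of what per-site rules might look like. It is not Crawl Anywhere's configuration format or API: the SiteRules class, its fields, and the should_crawl function are hypothetical names invented for illustration, and example.org is a placeholder host.

```python
import re
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class SiteRules:
    """Hypothetical per-site crawl rules (not Crawl Anywhere's real schema)."""
    host: str
    max_depth: int = 3
    include: list = field(default_factory=list)  # regexes a path may match
    exclude: list = field(default_factory=list)  # regexes that reject a path


def should_crawl(url: str, depth: int, rules: SiteRules) -> bool:
    """Return True if the URL is allowed by this site's rules."""
    parsed = urlparse(url)
    if parsed.hostname != rules.host:
        return False  # stay on the configured site
    if depth > rules.max_depth:
        return False  # too many links away from the start page
    path = parsed.path or "/"
    if any(re.search(pattern, path) for pattern in rules.exclude):
        return False
    # An empty include list means "everything not excluded".
    return not rules.include or any(
        re.search(pattern, path) for pattern in rules.include
    )


# Assumed example: crawl only /reports/ pages, skip PDFs, two links deep.
rules = SiteRules(
    host="example.org",
    max_depth=2,
    include=[r"^/reports/"],
    exclude=[r"\.pdf$"],
)
```

Calling should_crawl("http://example.org/reports/2013.html", 1, rules) accepts the page, while the same path ending in .pdf, a page outside /reports/, or a page on another host is rejected. Keeping the rules as plain data per site is what makes a management interface like the one described above feasible.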

Can you create a better search corpus for some domain X than Google?

Less noise and trash?

More high quality content?

Cross referencing? (Not more like this but meaningful cross-references.)

There is only one way to find out!

Crawl Anywhere will help you with the technical side of creating a search corpus.

What it won’t help with is developing the strategy to build and maintain such a corpus.

Interested in how you go beyond creating a subject specific list of resources?

A list that leaves a reader to sort through the chaff. Time and time again.

Pointers, suggestions, comments?
