Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

August 9, 2016

ACHE Focused Crawler

Filed under: ElasticSearch,Record Linkage,Webcrawler — Patrick Durusau @ 4:51 pm

ACHE Focused Crawler

From the webpage:

ACHE is an implementation of a focused crawler. A focused crawler is a web crawler that collects Web pages that satisfy some specific property. ACHE differs from other crawlers in the sense that it includes page classifiers that allows it to distinguish between relevant and irrelevant pages in a given domain. The page classifier can be from a simple regular expression (that matches every page that contains a specific word, for example), to a sophisticated machine-learned classification model. ACHE also includes link classifiers, which allows it decide the best order in which the links should be downloaded in order to find the relevant content on the web as fast as possible, at the same time it doesn’t waste resources downloading irrelevant content.

ache-logo-400

The inclusion of machine learning (Weka) and robust indexing (ElasticSearch) means this will take more than a day or two to explore.

Certainly well suited to exploring all the web accessible resources on narrow enough topics.

I was thinking about doing a “9 Million Pages of Donald Trump,” (think Nine Billion Names of God) but a quick sanity check showed there are already more than 230 million such pages.

Perhaps by the election I could produce “9 Million Pages With Favorable Comments About Donald Trump.” Perhaps if I don’t dedupe the pages found by searching it would go that high.

Other topics for comprehensive web searching come to mind?

PS: The many names of record linkage come to mind. I think I have thirty (30) or so.

No Comments

No comments yet.

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress