Precise data extraction with Apache Nutch by Emir Dizdarevic.
From the post:
Nutch’s HtmlParser parses the whole page and returns parsed text, outlinks and additional metadata. Some parts of this are really useful, like the outlinks, but that’s basically it. The problem is that the parsed text is too general for the purpose of precise data extraction. Fortunately, the HtmlParser provides a mechanism (an extension point) for attaching an HtmlParserFilter to it.
We developed a plugin, consisting of HtmlParserFilter and IndexingFilter extensions, that provides a mechanism to fetch and index the desired data from a web page through the use of XPath 1.0. The plugin is called filter-xpath.
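For readers who haven’t written a Nutch parse filter, a minimal sketch of such an extension is shown below. It is not the filter-xpath source: the class, XPath expression and metadata field name are hypothetical, and it assumes the Nutch 1.x HtmlParseFilter interface (the extension point the post refers to as HtmlParserFilter).

```java
// Minimal sketch of a Nutch 1.x HtmlParseFilter that evaluates a hard-coded
// XPath 1.0 expression against the parsed DOM and stores the result in the
// parse metadata. Class name, expression and field name are hypothetical.
package org.example.nutch.xpath;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

public class XPathExtractFilter implements HtmlParseFilter {

  // Hypothetical rule; element-name case depends on the HTML parser backend.
  private static final String PRICE_XPATH = "//span[@class='price']/text()";

  private Configuration conf;

  @Override
  public ParseResult filter(Content content, ParseResult parseResult,
                            HTMLMetaTags metaTags, DocumentFragment doc) {
    try {
      XPath xpath = XPathFactory.newInstance().newXPath();
      String price = (String) xpath.evaluate(PRICE_XPATH, doc, XPathConstants.STRING);
      Parse parse = parseResult.get(content.getUrl());
      if (parse != null && price != null && !price.trim().isEmpty()) {
        // Stash the value in the parse metadata; an IndexingFilter can later
        // copy it into the NutchDocument as its own field.
        parse.getData().getParseMeta().add("price", price.trim());
      }
    } catch (Exception e) {
      // On XPath failure, return the ParseResult unchanged.
    }
    return parseResult;
  }

  @Override
  public void setConf(Configuration conf) { this.conf = conf; }

  @Override
  public Configuration getConf() { return conf; }
}
```

In a real plugin the class would also be declared in the plugin’s plugin.xml and enabled through the plugin.includes property, and a companion IndexingFilter would copy the extracted parse-metadata value into the NutchDocument as an indexable field, which matches the post’s description of filter-xpath as a pair of extensions.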
Using this plugin we are now able to extract the desired data from a web site with a known structure. Unfortunately, the plugin is an extension of the HtmlParserFilter extension point, which is tightly coupled to the HtmlParser, so the plugin won’t work without the HtmlParser. The HtmlParser generates its own metadata (host, site, url, content, title, cache and tstamp), which will be indexed too. One way to control this is to not include IndexingFilter plugins that depend on that metadata to generate the indexing data (the NutchDocument). The other way is to change the Solr index mappings in the solrindex-mapping.xml file (which maps NutchDocument fields to SolrInputDocument fields). That way we index only the fields we want.
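As a rough illustration of that second option, a trimmed-down mapping might look like the sketch below; the field names are illustrative, and the exact layout should be checked against the conf/solrindex-mapping.xml shipped with your Nutch version.

```xml
<!-- Sketch of a restricted solrindex-mapping.xml: it maps the listed
     NutchDocument fields onto SolrInputDocument fields. Per the post,
     trimming this mapping is one way to keep unwanted parser-generated
     fields out of the index. Field names here are examples only. -->
<mapping>
  <fields>
    <field dest="url" source="url"/>
    <field dest="title" source="title"/>
    <!-- hypothetical field produced by the XPath parse filter -->
    <field dest="price" source="price"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</mapping>
```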
The next problem arises when it comes to indexing. We want Nutch to fetch every page on the site, but we don’t want to index them all. If we use the UrlRegexFilter to control this, we will lose the indirect links, which we also want to index and add to our URL DB. To address this problem we developed another plugin, an extension of the IndexingFilter extension point, called index-omit. Using this plugin we are able to omit indexing for the pages we don’t need.
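To see how an IndexingFilter can drop a page from the index without stopping Nutch from fetching it and following its links, here is a minimal sketch (not the authors’ index-omit code). It assumes the Nutch 1.x IndexingFilter interface and a hypothetical URL rule; returning null from filter() excludes the document from indexing while the crawl itself is unaffected.

```java
// Minimal sketch of a Nutch 1.x IndexingFilter that omits pages from the
// index based on a hypothetical URL rule. Returning null excludes the
// document from indexing; the page was still fetched and its links followed.
package org.example.nutch.omit;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class OmitIndexingFilter implements IndexingFilter {

  // Hypothetical rule: only product detail pages reach the index.
  private static final String INDEXABLE_PATH = "/product/";

  private Configuration conf;

  @Override
  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    if (url.toString().contains(INDEXABLE_PATH)) {
      return doc;   // keep: the document continues through the indexing chain
    }
    return null;    // omit: drop this page from the index only
  }

  @Override
  public void setConf(Configuration conf) { this.conf = conf; }

  @Override
  public Configuration getConf() { return conf; }
}
```

The real index-omit plugin presumably reads its omit rules from configuration rather than hard-coding them, but the keep-or-return-null decision is the core of the technique.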
Great post on precision and data extraction.
And a lesson that indexing more isn’t the same thing as indexing smarter.