Archive for the ‘Nutch’ Category

Nutch 1.11 Release

Thursday, December 10th, 2015

Nutch 1.11 Release

From the homepage:

The Apache Nutch PMC are pleased to announce the immediate release of Apache Nutch v1.11, we advise all current users and developers of the 1.X series to upgrade to this release.

This release is the result of many months of work and around 100 issues addressed. For a complete overview of these issues please see the release report.

As usual in the 1.X series, release artifacts are made available as both source and binary and also available within Maven Central as a Maven dependency. The release is available from our DOWNLOADS PAGE.

I have played with Nutch but never really taken advantage of it as a day-to-day discovery tool.

I don’t need to boil the Internet ocean to cover well over 90 to 95% of all the content that interests me on a day to day basis.

Moreover, searching a limited part of the Internet would enable things like granular date sorting, rather than the coarse buckets of within a week, month or last year.

Not to mention I could abandon the never sufficiently damned page-rank sorting of search results. Maybe you need to look “busy” as you sort through search result cruft, time and again, but I have other tasks to fill my time.

Come to think of it, as I winnow through search results, I could annotate, tag, mark, choose your terminology, such that a subsequent search turns up my evaluation, ranking, preference among those items.

Try that with Google, Bing or other general search appliance.

This won’t be an end of 2015 project, mostly because I am trying to learn a print dictionary layout from the 19th century for representation in granular markup and other tasks are at hand.

However, in early 2016 I will grab the Nutch 1.11 release and see if I can put it into daily use. More on that in 2016.

BTW, what projects are you going to be pursuing in 2016?

Crawling With Nutch

Tuesday, May 27th, 2014

Crawling With Nutch by Elizabeth Haubert.

From the post:

Recently, I had a client using LucidWorks search engine who needed to integrate with the Nutch crawler. This sounds simple as both products have been around for a while and are officially integrated. Even better, there are some great “getting started in x minutes” tutorials already out there for both Nutch, Solr and LucidWorks. But there were a few gotchas that kept those tutorials from working for me out of the box. This blog post documents my process of getting Nutch up and running on a Ubuntu server.

I know exactly what Elizabeth means. I have yet to find a Nutch/Solr tutorial that isn’t incomplete in some way.

What is really amusing is to try to set up Tomcat 7, Solr and Nutch.

I need to write up that experience sometime fairly soon. But no promises if you vary from the releases I document.

Common Crawl’s Move to Nutch

Sunday, February 23rd, 2014

Common Crawl’s Move to Nutch by Jordan Mendelson.

From the post:

Last year we transitioned from our custom crawler to the Apache Nutch crawler to run our 2013 crawls as part of our migration from our old data center to the cloud.

Our old crawler was highly tuned to our data center environment where every machine was identical with large amounts of memory, hard drives and fast networking.

We needed something that would allow us to do web-scale crawls of billions of webpages and would work in a cloud environment where we might run on a heterogenous machines with differing amounts of memory, CPU and disk space depending on the price plus VMs that might go up and down and varying levels of networking performance.

Before you hand roll a custom web crawler, you should read this short but useful report on the Common Crawl experience with Nutch.

Winter 2013 Crawl Data Now Available

Saturday, January 11th, 2014

Winter 2013 Crawl Data Now Available by Lisa Green.

From the post:

The second crawl of 2013 is now available! In late November, we published the data from the first crawl of 2013 (see previous blog post for more detail on that dataset). The new dataset was collected at the end of 2013, contains approximately 2.3 billion webpages and is 148TB in size. The new data is located in the aws-publicdatasets at /common-crawl/crawl-data/CC-MAIN-2013-48/

In 2013, we made changes to our crawling and post-processing systems. As detailed in the previous blog post, we switched file formats to the international standard WARC and WAT files. We also began using Apache Nutch to crawl – stay tuned for an upcoming blog post on our use of Nutch. The new crawling method relies heavily on the generous data donations from blekko and we are extremely grateful for blekko’s ongoing support!

In 2014 we plan to crawl much more frequently and publish fresh datasets at least once a month.

Data to play with now and the promise of more to come! Can’t argue with that!

Learning more about Common Crawl’s use of Nutch will be fun as well.

175 Analytic and Data Science Web Sites

Monday, December 30th, 2013

175 Analytic and Data Science Web Sites by Vincent Granville.

From the post:

Following is a list (in alphabetical order) of top domains related to analytics, data science or big data, based on input from Data Science central members. These top domains were cited by at least 4 members. Some of them are pure data science web sites, while others are more general (but still tech-oriented) with strong emphasis on data issues at large, or regular data science content.

I created 175-DataSites-2013.txt from Vincent’s listing formatted as a Nutch seed text.

I would delete some of the entries prior to crawling.

For example,

Lots of interesting content, but if you are looking for data-centric resources, I would be more specific.
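One way to prune a seed list before crawling, as a minimal sketch. The file names and the three sample entries are made up for illustration (they are not taken from Vincent's list). The only thing assumed about Nutch is that a seed file is plain text with one URL per line.

```shell
#!/bin/sh
# Sketch: drop unwanted entries from a seed list before handing it to Nutch.
# seeds.txt and exclude.txt are hypothetical names; the URLs are placeholders.
work=$(mktemp -d)

# A stand-in seed list: one URL per line.
cat > "$work/seeds.txt" <<'EOF'
http://data.example.org/
http://news.example.com/
http://stats.example.net/
EOF

# Entries to delete, one fixed string per line.
cat > "$work/exclude.txt" <<'EOF'
http://news.example.com/
EOF

# -v inverts the match, -x matches whole lines only, -F treats the
# patterns as literal strings, -f reads the patterns from a file.
grep -v -x -F -f "$work/exclude.txt" "$work/seeds.txt" > "$work/seeds.pruned.txt"

kept=$(wc -l < "$work/seeds.pruned.txt" | tr -d ' ')
echo "kept $kept of 3 seed URLs"
cat "$work/seeds.pruned.txt"
```

Keeping the exclusions in their own file means the original list stays untouched and the pruning can be repeated when the list is refreshed.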

DeleteDuplicates based on crawlDB only [Nutch-656]

Thursday, November 14th, 2013

DeleteDuplicates based on crawlDB only [Nutch-656]

As of today, Nutch, well, the nightly build after tonight, will have the ability to delete duplicate URLs.

Step in the right direction!

Now if duplicates could be declared on more than duplicate URLs and relationships maintained across deletions. 😉

Integrating Nutch 1.7 with ElasticSearch

Saturday, October 26th, 2013

Integrating Nutch 1.7 with ElasticSearch

From the post:

With Nutch 1.7 the possibility for integrating with ElasticSearch became available. However setting up the integration turned out to be quite a treasure hunt for me. For anybody else wanting to achieve the same result without tearing out as much hair as I did please find some simple instructions on this page that hopefully will help you in getting Nutch to talk to ElasticSearch.

I’m assuming you have both Nutch and ElasticSearch running fine by which I mean that Nutch does it crawl, fetch, parse thing and ElasticSearch is doing its indexing and searching magic, however not yet together.

All of the work involved is in Nutch and you need to edit nutch-site.xml in the conf directory to get things going. First off you need to activate the elasticsearch indexer plugin by adding the following line to nutch-site.xml:

A post that will be much appreciated by anyone who wants to integrate Nutch with ElasticSearch.

A large number of software issues are matters of configuration, once you know the configuration.

The explorers who find those configurations and share them with others are underappreciated.

Apache Nutch v1.7 Released

Friday, June 28th, 2013

Apache Nutch v1.7 Released

Main new feature is a pluggable indexing architecture that supports both Apache Solr and ElasticSearch.


Nutch/ElasticSearch News!

Wednesday, June 19th, 2013

Apache Nutch-1527

To summarize: Elasticsearch indexer committed to the trunk of Apache Nutch in rev. 1494496.


Apache Nutch: Web-scale search engine toolkit

Wednesday, June 19th, 2013

Apache Nutch: Web-scale search engine toolkit by Andrzej Białecki.

From the description:

This slideset presents the Nutch search engine. A high-level architecture is described, as well as some challenges common in web-crawling and solutions implemented in Nutch. The presentation closes with a brief look into the Nutch future.

One of the best component-based descriptions of Nutch that I have ever seen.

Apache Nutch v1.6 and Apache 2.1 Releases

Monday, December 10th, 2012

Apache Nutch v1.6 Released

From the news:

The Apache Nutch PMC are extremely pleased to announce the release of Apache Nutch v1.6. This release includes over 20 bug fixes, the same in improvements, as well as new functionalities including a new HostNormalizer, the ability to dynamically set fetchInterval by MIME-type and functional enhancements to the Indexer API including the normalization of URL’s and the deletion of robots noIndex documents. Other notable improvements include the upgrade of key dependencies to Tika 1.2 and Automaton 1.11-8. Please see the list of changes or the release report made in this version for a full breakdown. The release is available here.

See the Nutch 1.x tutorial.

Apache Nutch v2.1 Released

From the news:

The Apache Nutch PMC are very pleased to announce the release of Apache Nutch v2.1. This release continues to provide Nutch users with a simplified Nutch distribution building on the 2.x development drive which is growing in popularity amongst the community. As well as addressing ~20 bugs this release also offers improved properties for better Solr configuration, upgrades to various Gora dependencies and the introduction of the option to build indexes in elastic search. Please see the list of changes made in this version for a full breakdown. The release is available here.

See the Nutch 2.x tutorial.

I haven’t done a detailed comparison but roughly, Nutch 1.x relies upon Solr for storage and Nutch 2.x relies upon Gora and HBase.

Surprised that isn’t in the FAQ.

Perhaps I will investigate further and offer a short summary of the differences.

Nutch 1.5/1.5.1 [Cloud Setup for Experiments?]

Sunday, July 15th, 2012

Before the release of Nutch 2.0, there was the release of Nutch 1.5 and 1.5.1.

From the 1.5 release note:

The 1.5 release of Nutch is now available. This release includes several improvements including upgrades of several major components including Tika 1.1 and Hadoop 1.0.0, improvements to LinkRank and WebGraph elements as well as a number of new plugins covering blacklisting, filtering and parsing to name a few. Please see the list of changes

[WRONG URL: the version “/1.5/” is missing from the path. It took me a while to notice the nature of the problem.]

made in this version for a full breakdown of the 50 odd improvements the release boasts. A full PMC release statement can be found below

Apache Nutch is an open source web-search software project. Stemming from Apache Lucene, it now builds on Apache Solr adding web-specifics, such as a crawler, a link-graph database and parsing support handled by Apache Tika for HTML and an array of other document formats. Nutch can run on a single machine, but gains a lot of its strength from running in a Hadoop cluster. The system can be enhanced (eg other document formats can be parsed) using a highly flexible, easily extensible and thoroughly maintained plugin infrastructure.

Nutch is available in source and binary form (zip and tar.gz) from the following download page:

And 1.5.1:

Nutch is available in source and binary form (zip and tar.gz) from the following download page:

Question: Would you put together some commodity boxes for local experimentation or would you spin up an installation in one of the clouds?

As hot as the summer promises to be near Atlanta, I am leaning towards the cloud route.

As I write that I can hear a close friend from the West Coast shouting “…trust, trust issues….” But I trust the local banking network, credit card, utilities, finance, police/fire, etc., with just as little reason as any of the “clouds.”

Not really even “trust,” I don’t even think about it. The credit card industry knows $X fraud is going to occur every year and it is a cost of liquid transactions. So they allow for it in their fees. They proceed in the face of known rates of fraud. How’s that for trust? 😉 Trusting fraud is going to happen.

Same will be true for the “clouds” and mechanisms will evolve to regulate the amount of exposure versus potential damage. I am going to be experimenting with non-client data so the worst exposure I have is loss of time. Perhaps some hard lessons learned on configuration/security. But hardly a reason to avoid the “clouds” and to incur the local hardware cost.

I was serious when I suggested governments should start requiring side by side comparison of hardware costs for local installs versus cloud services. I would call the major cloud services up and ask them for independent bids.

Would the “clouds” be less secure? Possibly, but I don’t think any of them allow Lady Gaga CDs on premises.

Apache Nutch v2.0 Release

Sunday, July 15th, 2012

Apache Nutch v2.0 Release

From the post:

The Apache Nutch PMC are very pleased to announce the release of Apache Nutch v2.0. This release offers users an edition focused on large scale crawling which builds on storage abstraction (via Apache Gora™) for big data stores such as Apache Accumulo™, Apache Avro™, Apache Cassandra™, Apache HBase™, HDFS™, an in memory data store and various high profile SQL stores. After some two years of development Nutch v2.0 also offers all of the mainstream Nutch functionality and it builds on Apache Solr™ adding web-specifics, such as a crawler, a link-graph database and parsing support handled by Apache Tika™ for HTML and an array of other document formats. Nutch v2.0 shadows the latest stable mainstream release (v1.5.X) based on Apache Hadoop™ and covers many use cases from small crawls on a single machine to large scale deployments on Hadoop clusters. Please see the list of changes made in this version for a full breakdown.

A full PMC release statement can be found below:

Nutch v2.0 is available in source (zip and tar.gz) from the following download page:

Apache Nutch 1.5 Released!

Friday, June 8th, 2012

Apache Nutch 1.5 Released!

From the homepage:

The 1.5 release of Nutch is now available. This release includes several improvements including upgrades of several major components including Tika 1.1 and Hadoop 1.0.0, improvements to LinkRank and WebGraph elements as well as a number of new plugins covering blacklisting, filtering and parsing to name a few. Please see the list of changes made in this version for a full breakdown of the 50 odd improvements the release boasts. The release is available here.

If you are looking for documentation, may I suggest the Nutch wiki?

Precise data extraction with Apache Nutch

Sunday, April 1st, 2012

Precise data extraction with Apache Nutch by Emir Dizdarevic.

From the post:

Nutch’s HtmlParser parses the whole page and returns parsed text, outlinks and additional meta data. Some parts of this are really useful like the outlinks but that’s basically it. The problem is that the parsed text is too general for the purpose of precise data extraction. Fortunately the HtmlParser provides us a mechanism (extension point) to attach an HtmlParserFilter to it.

We developed a plugin, which consists of HtmlParserFilter and IndexingFilter extensions, which provides a mechanism to fetch and index the desired data from a web page through use of XPath 1.0. The name of the plugin is filter-xpath plugin.

Using this plugin we are now able to extract the desired data from web site with known structure. Unfortunately the plugin is an extension of the HtmlParserFilter extension point which is hardly coupled to the HtmlParser, hence plugin won’t work without the HtmlParser. The HtmlParser generates its own metadata (host, site, url, content, title, cache and tstamp) which will be indexed too. One way to control this is by not including IndexFilter plugins which depend on the metadata data to generate the indexing data (NutchDocument). The other way is to change the SOLR index mappings in the solrindex-mapping.xml file (maps NutchDocument fields to SolrInputDocument field). That way we will index only the fields we want.

The next problem arises when it comes to indexing. We want that Nutch fetches every page on the site but we don’t want to index them all. If we use the UrlRegexFilter to control this we will lose the indirect links which we also want to index and add to our URL DB. To address this problem we developed another plugin which is an extension of the IndexingFilter extension point which is called index-omit plugin. Using this plugin we are able to omit indexing on the pages we don’t need.

Great post on precision and data extraction.

And a lesson that indexing more isn’t the same thing as indexing smarter.

Sentence based semantic similarity measure for blog-posts

Saturday, January 14th, 2012

Sentence based semantic similarity measure for blog-posts by Mehwish Aziz and Muhammad Rafi.


Blogs-Online digital diary like application on web 2.0 has opened new and easy way to voice opinion, thoughts, and like-dislike of every Internet user to the World. Blogosphere has no doubt the largest user-generated content repository full of knowledge. The potential of this knowledge is still to be explored. Knowledge discovery from this new genre is quite difficult and challenging as it is totally different from other popular genre of web-applications like World Wide Web (WWW). Blog-posts unlike web documents are small in size, thus lack in context and contain relaxed grammatical structures. Hence, standard text similarity measure fails to provide good results. In this paper, specialized requirements for comparing a pair of blog-posts is thoroughly investigated. Based on this we proposed a novel algorithm for sentence oriented semantic similarity measure of a pair of blog-posts. We applied this algorithm on a subset of political blogosphere of Pakistan, to cluster the blogs on different issues of political realm and to identify the influential bloggers.

I am not sure I agree that “relaxed grammatical structures” are peculiar to blog posts. 😉

A “measure” of similarity that I have not seen (would appreciate a citation if you have) is the listing of one blog by another in its “blogroll.” The theory is that blogs may cite blogs they disagree with, semantically and otherwise, in posts, but are unlikely to list blogs in their “blogroll” that they find disagreeable.

Nutch Tutorial: Supplemental III

Wednesday, December 14th, 2011

Apologies for the diversion in Nutch Tutorial: Supplemental II.

We left off last time with a safe way to extract the URLs from the RDF text without having to parse the XML and without having to expand the file onto the file system. And we produced a unique set of URLs.

We still need a random set of URLs; 1,000 was the amount mentioned in the Nutch Tutorial at Option 1.

Since we did not parse the RDF, we can’t use the subset option for

So, back to the Unix command line and our file with 3838759 lines in it, each with a unique URL.

Let’s do this a step at a time and we can pipe it all together below.

First, our file is: dmoz.urls.gz, so we expand it with gunzip:

gunzip dmoz.urls.gz

Results in dmoz.urls

Then we run the shuf command, which randomly shuffles the lines in the file:

shuf dmoz.urls > dmoz.shuf.urls

Remember the > operator redirects the results to a file.

Now the lines are in random order. But it is still the full set of URLs.

So we run the head command to take the first 1,000 lines off of our now randomly sorted file:

head -1000 dmoz.shuf.urls > dmoz.shuf.1000.urls

So now we have a file with 1,000 randomly chosen URLs from our DMOZ source file.

Here is how to do all that in one line:

gunzip -c dmoz.urls.gz | shuf | head -1000 > dmoz.shuf.1000.urls
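The same one-liner on a throwaway file, as a sketch. The 5,000-line stand-in for dmoz.urls.gz is generated on the spot with placeholder URLs, and the second pipeline assumes GNU shuf, whose -n option stops after the requested number of lines so head is not needed.

```shell
#!/bin/sh
work=$(mktemp -d)

# Stand-in for dmoz.urls.gz: 5,000 unique placeholder URLs, one per line.
seq 1 5000 | sed 's|^|http://example.org/page/|' | gzip > "$work/dmoz.urls.gz"

# Decompress to standard out, shuffle, keep the first 1,000 lines.
gunzip -c "$work/dmoz.urls.gz" | shuf | head -n 1000 > "$work/dmoz.shuf.1000.urls"

# GNU shuf can also limit its own output to N lines.
gunzip -c "$work/dmoz.urls.gz" | shuf -n 1000 > "$work/dmoz.shuf-n.1000.urls"

count_head=$(wc -l < "$work/dmoz.shuf.1000.urls" | tr -d ' ')
count_n=$(wc -l < "$work/dmoz.shuf-n.1000.urls" | tr -d ' ')
# The input lines are unique, so the sample should have no repeats either.
unique_head=$(sort -u "$work/dmoz.shuf.1000.urls" | wc -l | tr -d ' ')

echo "shuf | head: $count_head lines ($unique_head unique); shuf -n: $count_n lines"
```

Either form leaves the compressed source file untouched, which is the point of using gunzip -c.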

BTW, in case you are worried about whether your set is random enough that we are not all hitting the same servers with our test installations, don’t be.

I ran shuf twice in a row on my set of URLs and then ran diff, which reported the first 100 lines were in a completely different order.
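A rough way to repeat that check, sketched on a generated file of placeholder URLs rather than the real DMOZ set. It uses cmp instead of diff only because a yes/no answer is enough here.

```shell
#!/bin/sh
work=$(mktemp -d)

# Stand-in URL list: 1,000 unique placeholder lines.
seq 1 1000 | sed 's|^|http://example.org/page/|' > "$work/urls"

# Two independent shuffles of the same input.
shuf "$work/urls" > "$work/shuf.1"
shuf "$work/urls" > "$work/shuf.2"

# cmp -s prints nothing and exits non-zero when the files differ.
if cmp -s "$work/shuf.1" "$work/shuf.2"; then
    order="same"
else
    order="different"
fi

# Sorting both shuffles should give identical files: same lines, new order.
sort "$work/shuf.1" > "$work/sorted.1"
sort "$work/shuf.2" > "$work/sorted.2"
if cmp -s "$work/sorted.1" "$work/sorted.2"; then
    content="same"
else
    content="different"
fi

echo "order: $order, content: $content"
```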

BTW, to check yourself on the extracted set of 1,000 URLs, run the following:

wc -l dmoz.shuf.1000.urls

Result should be 1000.

The wc command prints newline, word and byte counts. With the -l option, it prints only the newline count.

In case you don’t have the shuf command on your system, I would try:

sort -R dmoz.urls > dmoz.sort.urls

as a substitute for shuf dmoz.urls > dmoz.shuf.urls
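One caveat and one more fallback. GNU sort -R sorts by a random hash of each line, so identical lines end up next to each other. That is harmless here because every URL is unique. Where neither shuf nor sort -R is available, a decorate, sort, undecorate pipeline does the job. The sketch below runs on generated placeholder URLs and assumes the lines contain no tab characters.

```shell
#!/bin/sh
work=$(mktemp -d)

# Stand-in URL list: 200 unique placeholder lines.
seq 1 200 | sed 's|^|http://example.org/page/|' > "$work/urls"

# Decorate each line with a random number, sort on that number,
# then cut the number off again (tab is the default cut delimiter).
awk 'BEGIN { srand() } { printf "%.12f\t%s\n", rand(), $0 }' "$work/urls" \
    | sort -k1,1n \
    | cut -f 2- > "$work/urls.random"

lines=$(wc -l < "$work/urls.random" | tr -d ' ')

# Same set of lines before and after? Compare the sorted versions.
sort "$work/urls" > "$work/before.sorted"
sort "$work/urls.random" > "$work/after.sorted"
if cmp -s "$work/before.sorted" "$work/after.sorted"; then
    same_set="yes"
else
    same_set="no"
fi

echo "$lines lines, same set of lines: $same_set"
```

Note that srand() with no argument seeds from the time of day, so two runs within the same second can produce the same order.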

Hilary Mason (source of the sort suggestion) has collected more ways to extract one line (not exactly our requirement, but you can be creative) at: How to get a random line from a file in bash.

I am still having difficulties with one way to use Nutch/Solr so we will cover the “other” path, the working one, tomorrow. It looks like a bug between versions and I haven’t found the correct java class to copy over at this point. Not like a tutorial to mention that sort of thing. 😉

Nutch Tutorial: Supplemental II

Monday, December 12th, 2011

This continues Nutch Tutorial: Supplemental.

I am getting a consistent error from:

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5

and have posted to the solr list, although my post has yet to appear on the list. More on that in my next post.

I wanted to take a quick detour into 3.2 Using Individual Commands for Whole-Web Crawling as it has some problematic advice in it.

First, Open Directory Project data can be downloaded from How to Get ODP Data. (Always nice to have a link, I think they call them hyperlinks.)

Second, as of last week, the content.rdf.u8.gz file is 295,831,712 bytes. Something about the size of the file should warn you that running gunzip on it is a very bad idea.

A better one?

Run: gunzip -l (the switch is lowercase “l” as in Larry) which delivers the following information:

compressed size: size of the compressed file
uncompressed size: size of the uncompressed file
ratio: compression ratio (0.0% if unknown)
uncompressed_name: name of the uncompressed file

Or, in this case:

gunzip -l content.rdf.u8.gz
compressed   uncompressed  ratio  uncompressed_name
 295831712     1975769200  85.0%  content.rdf.u8

Yeah, that 1 under uncompressed is in the GB column. So just a tad shy of 2 GB of data.

Not everyone who is keeping up with search technology has a couple of spare GBs of drive space lying around, although it is becoming more common.
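A sketch of checking before expanding: read the uncompressed size from gzip -l and compare it with the free space on the target file system. The two helper functions are hypothetical names, and the demo runs on a tiny generated file rather than content.rdf.u8.gz. Some gzip versions take the size from a 32-bit field in the file trailer, so treat the figure with suspicion for files of 4 GiB or more.

```shell
#!/bin/sh
# Hypothetical helper: uncompressed size in bytes, as reported by gzip -l.
# gzip -l prints a header line, then: compressed uncompressed ratio name
uncompressed_bytes() {
    gzip -l "$1" | awk 'NR == 2 { print $2 }'
}

# Hypothetical helper: free bytes on the file system holding a directory.
# df -P gives POSIX-format output; -k reports 1024-byte blocks.
free_bytes() {
    df -Pk "$1" | awk 'NR == 2 { printf "%.0f\n", $4 * 1024 }'
}

work=$(mktemp -d)
printf 'hello world\n' > "$work/sample.txt"
gzip "$work/sample.txt"

need=$(uncompressed_bytes "$work/sample.txt.gz")
have=$(free_bytes "$work")

# Let awk compare, so large byte counts do not depend on shell integer width.
verdict=$(awk -v n="$need" -v h="$have" \
    'BEGIN { if (n <= h) print "fits"; else print "too big" }')

echo "needs $need bytes, $have free: $verdict"
```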

What got my attention was the lack of documentation of the file size or potential problems such a download could cause casual experimenters.

But, if we are going to work with this file we have to do so without decompressing it.

Since the tutorial only extracts URLs, I am taking that as the initial requirement although we will talk about more sophisticated requirements in just a bit.

On a *nix system it is possible to pass the output of one command to another command through what are called pipes. My thinking in this case was to use the decompress command not to expand the file on disk but to stream the decompressed content to another command that would extract the URLs. After I got that part working, I sorted and then deduped the URL set.

Here is the command with step [step] numbers that you should remove before trying to run it:

[1]gunzip -c content.rdf.u8.gz [2]| [3]grep -o 'http://[^"]*' [2]| [4]sort [2]| [5]uniq [6]> [7]dmoz.urls

  1. gunzip -c content.rdf.u8.gz – With the -c switch, gunzip does not change the original file but streams the uncompressed content of the file to standard out. This is our starting point for dealing with files too large to expand.
  2. | – This is the pipe character that moves the output of one command to be used by another. The shell command in this case has three (3) pipe commands.
  3. grep -o 'http://[^"]*' – With the -o switch, grep will print only the “matched” parts of a matched line (grep normally prints the entire line), with each part on a different line. The 'http://[^"]*' is a regular expression looking for parts that start with http:// and proceed to match any character other than the double quote mark. When the double quote mark is reached, the match is complete and that part prints. Note the use of “*”, which allows any number of characters up to the closing double quote. The entire expression is surrounded with single quote characters (') because it contains a double quote character.
  4. sort – The result of #3 is piped into #4, where it is sorted. The sort was necessary because of the next command in the pipe.
  5. uniq – The sorted result is delivered to the uniq command which deletes any duplicate URLs. A requirement for the uniq command is that the duplicates be located next to each other, hence the sort command.
  6. > Is the command to write the results of the uniq command to a file.
  7. dmoz.urls Is the file name for the results.
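The seven steps above, without the bracketed numbers, can be tried on a small stand-in file. This is a sketch: the RDF-like sample lines are invented for the demo and are not a faithful copy of the DMOZ format.

```shell
#!/bin/sh
work=$(mktemp -d)

# Stand-in for content.rdf.u8.gz: a few RDF-like lines, one URL repeated.
cat > "$work/sample.rdf" <<'EOF'
<Topic r:id="Top/Example">
  <link r:resource="http://example.org/b"/>
  <link r:resource="http://example.org/a"/>
</Topic>
<ExternalPage about="http://example.org/a">
</ExternalPage>
EOF
gzip "$work/sample.rdf"

# Stream, extract, sort, dedupe, write. Straight quotes matter: the shell
# does not treat typographic quotes as quoting characters.
gunzip -c "$work/sample.rdf.gz" | grep -o 'http://[^"]*' | sort | uniq > "$work/dmoz.urls"

# sort -u folds the sort and uniq steps into one command.
gunzip -c "$work/sample.rdf.gz" | grep -o 'http://[^"]*' | sort -u > "$work/dmoz.urls.alt"

url_count=$(wc -l < "$work/dmoz.urls" | tr -d ' ')
echo "$url_count unique URLs:"
cat "$work/dmoz.urls"
```

The sort -u variant saves one process in the pipe. The longer form keeps each step visible, which is why it suits a walkthrough.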

The results were as follows:

  • dmoz.urls = 130,279,429 – Remember the projected expansion of the original was 1,975,769,200 or 1,845,489,771 larger.
  • dmoz.urls.gz = 27,832,013 – The original was 295,831,712 or 267,999,699 larger.
  • unique urls – 3,838,759 (I have no way to compare to the original)
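The two differences quoted above are easy to re-check with shell arithmetic. A small sketch using only the byte counts from the list:

```shell
#!/bin/sh
# Byte counts quoted in the results list.
expanded=1975769200     # projected size of content.rdf.u8
urls=130279429          # dmoz.urls
original_gz=295831712   # content.rdf.u8.gz
urls_gz=27832013        # dmoz.urls.gz

saved_plain=$((expanded - urls))
saved_gz=$((original_gz - urls_gz))

echo "dmoz.urls is $saved_plain bytes smaller than the expanded original"
echo "dmoz.urls.gz is $saved_gz bytes smaller than the compressed original"
```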

Note that it wasn’t necessary to process the RDF in order to extract a set of URLs for seeding a search engine.

Murray Altheim made several very good suggestions with regard to Java libraries and code for this task. Those don’t appear here but will appear in a command line tool for this dataset that allows the user to choose categories of websites to be extracted for seeding a search engine.

All that is preparatory to a command line tool for creating a topic map from a selected part of this data set and then enhancing it with the results of the use of a search engine.

Apologies for getting off track on the Nutch tutorial. There are some issues that remain to be answered, typos and the like, which I will take up in the next post on this subject.

Nutch Tutorial: Supplemental

Thursday, December 8th, 2011

If you have tried following the Nutch Tutorial you have probably encountered one or more problems getting Nutch to run. I have installed Nutch on an Ubuntu 10.10 system and suggest the following additions or modifications to the instructions you find there, as of 8 December 2011.

Steps 1 and 2, perform as written.

Step 3, at least the first part has serious issues.

The example:

<value>My Nutch Spider</value>

is misleading.

The content of <value></value> cannot contain spaces.

Therefore, <value>Patrick Durusau Nutch Spider</value> is wrong and produces:

Fetcher: No agents listed in ‘’ property.

Which then results in a Java exception that kills the process.

If I enter, <value>pdurusau</value>, the process continues to the next error in step 3.

Correct instruction:

3. Crawl your first website:

In $HOME/nutch-1.4/conf, open the nutch-site.xml file, which has the following content the first time you open the file:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

</configuration>

Inside the configuration element you will insert:


Which will result in a nutch-site.xml file that reads:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->




Save the nutch-site.xml file and we can move onto the next setup error.
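What the edited file might look like, written out with a shell here-document. This is a sketch under assumptions: http.agent.name is the property name the Nutch tutorial uses for the agent string, pdurusau is the value that worked above, and the file goes to a scratch directory rather than to $HOME/nutch-1.4/conf.

```shell
#!/bin/sh
# Scratch directory standing in for $HOME/nutch-1.4/conf.
conf_dir=$(mktemp -d)

# The quoted 'EOF' keeps the shell from expanding anything in the XML.
cat > "$conf_dir/nutch-site.xml" <<'EOF'
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <!-- Assumed property name, as used in the Nutch tutorial. -->
    <name>http.agent.name</name>
    <!-- A value without spaces, per the discussion above. -->
    <value>pdurusau</value>
  </property>
</configuration>
EOF

# Pull the agent value back out as a quick sanity check.
agent=$(sed -n 's|.*<value>\(.*\)</value>.*|\1|p' "$conf_dir/nutch-site.xml")
echo "agent name in nutch-site.xml: $agent"
```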

Step 3 next says:

mkdir -p urls

You are already in the conf directory so that is where you create urls, yes?

No! That results in:

Input path does not exist: [$HOME]/nutch-1.4/runtime/local/urls

So you must create the urls directory under [$HOME]/nutch-1.4/runtime/local by:

cd $HOME/nutch-1.4/runtime/local

mkdir urls

Don’t worry about the “-p” option on mkdir. It is an option that allows you to…, well, run man mkdir at a *nix command prompt if you are really interested. It would take more space than I need to spend here to explain it clearly.
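For the curious, a short sketch of what -p changes. The scratch directory stands in for $HOME/nutch-1.4, so nothing here touches a real Nutch install.

```shell
#!/bin/sh
# Scratch directory standing in for $HOME/nutch-1.4.
base=$(mktemp -d)

# Without -p, mkdir fails when a parent directory is missing.
if mkdir "$base/runtime/local/urls" 2>/dev/null; then
    plain="created"
else
    plain="failed"
fi

# With -p, missing parents (runtime, runtime/local) are created on the way.
mkdir -p "$base/runtime/local/urls"

# With -p, repeating the command on an existing directory is not an error.
if mkdir -p "$base/runtime/local/urls"; then
    again="ok"
else
    again="failed"
fi

echo "plain mkdir: $plain, mkdir -p twice: $again"
```

In a built nutch-1.4 tree runtime/local already exists, which is why plain mkdir urls is enough there.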

The nutch-default.xml file, located under [$HOME]/nutch-1.4/conf/, sets a number of default settings for Nutch. You should not edit this file but copy properties of interest to [$HOME]/nutch-1.4/conf/nutch-site.xml to create settings that override the default settings in nutch-default.xml.

If you look at nutch-default.xml or the rest of the Nutch Tutorial at the Apache site, you may be saying to yourself, but, but…, there are a lot of other settings and possible pitfalls.

Yes, yes that’s true.

I am setting up Nutch straight from the instructions as given, to encounter the same ambiguities users fresh to it will encounter. My plan is to use Nutch with Solr (and other search engines as well) to explore data for my blog as well as to develop information for creating topic maps.

So, I will be covering the minimal set of options we need to get Nutch/Solr up and running but then expanding on other options as we need them.

Will pick up with corrections/suggestions tomorrow on the rest of the tutorial.

Suggestions and/or corrections/expansions are welcome!

PS: On days when I am correcting/editing user guides, expect fewer posts overall.