Archive for the ‘Nutch’ Category

Apache Nutch v1.6 and Apache Nutch v2.1 Releases

Monday, December 10th, 2012

Apache Nutch v1.6 Released

From the news:

The Apache Nutch PMC are extremely pleased to announce the release of Apache Nutch v1.6. This release includes over 20 bug fixes, the same in improvements, as well as new functionalities including a new HostNormalizer, the ability to dynamically set fetchInterval by MIME-type and functional enhancements to the Indexer API including the normalization of URLs and the deletion of robots noIndex documents. Other notable improvements include the upgrade of key dependencies to Tika 1.2 and Automaton 1.11-8. Please see the list of changes or the release report made in this version for a full breakdown. The release is available here.

See the Nutch 1.x tutorial.

Apache Nutch v2.1 Released

From the news:

The Apache Nutch PMC are very pleased to announce the release of Apache Nutch v2.1. This release continues to provide Nutch users with a simplified Nutch distribution building on the 2.x development drive which is growing in popularity amongst the community. As well as addressing ~20 bugs this release also offers improved properties for better Solr configuration, upgrades to various Gora dependencies and the introduction of the option to build indexes in elastic search. Please see the list of changes made in this version for a full breakdown. The release is available here.

See the Nutch 2.x tutorial.

I haven’t done a detailed comparison but roughly, Nutch 1.x relies upon Solr for storage and Nutch 2.x relies upon Gora and HBase.

Surprised that isn’t in the FAQ.

Perhaps I will investigate further and offer a short summary of the differences.

Nutch 1.5/1.5.1 [Cloud Setup for Experiments?]

Sunday, July 15th, 2012

Before the release of Nutch 2.0, there was the release of Nutch 1.5 and 1.5.1.

From the 1.5 release note:

The 1.5 release of Nutch is now available. This release includes several improvements including upgrades of several major components including Tika 1.1 and Hadoop 1.0.0, improvements to LinkRank and WebGraph elements as well as a number of new plugins covering blacklisting, filtering and parsing to name a few. Please see the list of changes

http://www.apache.org/dist/nutch/CHANGES-1.5.txt

[WRONG URL - Should be: http://www.apache.org/dist/nutch/1.5/CHANGES-1.5.txt (the version, "/1.5/", is missing from the path; it took me a while to notice the nature of the problem.)]

made in this version for a full breakdown of the 50 odd improvements the release boasts. A full PMC release statement can be found below

http://nutch.apache.org/#07+June+2012+-+Apache+Nutch+1.5+Released

Apache Nutch is an open source web-search software project. Stemming from Apache Lucene, it now builds on Apache Solr adding web-specifics, such as a crawler, a link-graph database and parsing support handled by Apache Tika for HTML and an array of other document formats. Nutch can run on a single machine, but gains a lot of its strength from running in a Hadoop cluster. The system can be enhanced (e.g., other document formats can be parsed) using a highly flexible, easily extensible and thoroughly maintained plugin infrastructure.

Nutch is available in source and binary form (zip and tar.gz) from the following download page: http://www.apache.org/dyn/closer.cgi/nutch/

And 1.5.1:

http://www.apache.org/dist/nutch/1.5.1/CHANGES.txt

Nutch is available in source and binary form (zip and tar.gz) from the following download page: http://www.apache.org/dyn/closer.cgi/nutch/

Question: Would you put together some commodity boxes for local experimentation or would you spin up an installation in one of the clouds?

As hot as the summer promises to be near Atlanta, I am leaning towards the cloud route.

As I write that I can hear a close friend from the West Coast shouting “…trust, trust issues….” But I trust the local banking network, credit card, utilities, finance, police/fire, etc., with just as little reason as any of the “clouds.”

Not really even “trust,” I don’t even think about it. The credit card industry knows $X fraud is going to occur every year and it is a cost of liquid transactions. So they allow for it in their fees. They proceed in the face of known rates of fraud. How’s that for trust? ;-) Trusting fraud is going to happen.

Same will be true for the “clouds” and mechanisms will evolve to regulate the amount of exposure versus potential damage. I am going to be experimenting with non-client data so the worst exposure I have is loss of time. Perhaps some hard lessons learned on configuration/security. But hardly a reason to avoid the “clouds” and to incur the local hardware cost.

I was serious when I suggested governments should start requiring side by side comparison of hardware costs for local installs versus cloud services. I would call the major cloud services up and ask them for independent bids.

Would the “clouds” be less secure? Possibly, but I don’t think any of them allow Lady Gaga CDs on premises.

Apache Nutch v2.0 Release

Sunday, July 15th, 2012

Apache Nutch v2.0 Release

From the post:

The Apache Nutch PMC are very pleased to announce the release of Apache Nutch v2.0. This release offers users an edition focused on large scale crawling which builds on storage abstraction (via Apache Gora™) for big data stores such as Apache Accumulo™, Apache Avro™, Apache Cassandra™, Apache HBase™, HDFS™, an in memory data store and various high profile SQL stores. After some two years of development Nutch v2.0 also offers all of the mainstream Nutch functionality and it builds on Apache Solr™ adding web-specifics, such as a crawler, a link-graph database and parsing support handled by Apache Tika™ for HTML and an array of other document formats. Nutch v2.0 shadows the latest stable mainstream release (v1.5.X) based on Apache Hadoop™ and covers many use cases from small crawls on a single machine to large scale deployments on Hadoop clusters. Please see the list of changes

http://www.apache.org/dist/nutch/2.0/CHANGES.txt made in this version for a full breakdown.

A full PMC release statement can be found below:

http://nutch.apache.org/#07+July+2012+-+Apache+Nutch+v2.0+Released

Nutch v2.0 is available in source (zip and tar.gz) from the following download page: http://www.apache.org/dyn/closer.cgi/nutch/2.0

Apache Nutch 1.5 Released!

Friday, June 8th, 2012

Apache Nutch 1.5 Released!

From the homepage:

The 1.5 release of Nutch is now available. This release includes several improvements including upgrades of several major components including Tika 1.1 and Hadoop 1.0.0, improvements to LinkRank and WebGraph elements as well as a number of new plugins covering blacklisting, filtering and parsing to name a few. Please see the list of changes made in this version for a full breakdown of the 50 odd improvements the release boasts. The release is available here.

If you are looking for documentation, may I suggest the Nutch wiki?

Precise data extraction with Apache Nutch

Sunday, April 1st, 2012

Precise data extraction with Apache Nutch By Emir Dizdarevic.

From the post:

Nutch’s HtmlParser parses the whole page and returns parsed text, outlinks and additional meta data. Some parts of this are really useful like the outlinks but that’s basically it. The problem is that the parsed text is too general for the purpose of precise data extraction. Fortunately the HtmlParser provides us a mechanism (extension point) to attach an HtmlParserFilter to it.

We developed a plugin, which consists of HtmlParserFilter and IndexingFilter extensions, which provides a mechanism to fetch and index the desired data from a web page through use of XPath 1.0. The name of the plugin is filter-xpath plugin.

Using this plugin we are now able to extract the desired data from web site with known structure. Unfortunately the plugin is an extension of the HtmlParserFilter extension point which is hardly coupled to the HtmlParser, hence plugin won’t work without the HtmlParser. The HtmlParser generates its own metadata (host, site, url, content, title, cache and tstamp) which will be indexed too. One way to control this is by not including IndexFilter plugins which depend on the metadata data to generate the indexing data (NutchDocument). The other way is to change the SOLR index mappings in the solrindex-mapping.xml file (maps NutchDocument fields to SolrInputDocument field). That way we will index only the fields we want.

The next problem arises when it comes to indexing. We want that Nutch fetches every page on the site but we don’t want to index them all. If we use the UrlRegexFilter to control this we will lose the indirect links which we also want to index and add to our URL DB. To address this problem we developed another plugin which is an extension of the IndexingFilter extension point which is called index-omit plugin. Using this plugin we are able to omit indexing on the pages we don’t need.

Great post on precision and data extraction.

And a lesson that indexing more isn’t the same thing as indexing smarter.

Sentence based semantic similarity measure for blog-posts

Saturday, January 14th, 2012

Sentence based semantic similarity measure for blog-posts by Mehwish Aziz and Muhammad Rafi.

Abstract:

Blogs-Online digital diary like application on web 2.0 has opened new and easy way to voice opinion, thoughts, and like-dislike of every Internet user to the World. Blogosphere has no doubt the largest user-generated content repository full of knowledge. The potential of this knowledge is still to be explored. Knowledge discovery from this new genre is quite difficult and challenging as it is totally different from other popular genre of web-applications like World Wide Web (WWW). Blog-posts unlike web documents are small in size, thus lack in context and contain relaxed grammatical structures. Hence, standard text similarity measure fails to provide good results. In this paper, specialized requirements for comparing a pair of blog-posts is thoroughly investigated. Based on this we proposed a novel algorithm for sentence oriented semantic similarity measure of a pair of blog-posts. We applied this algorithm on a subset of political blogosphere of Pakistan, to cluster the blogs on different issues of political realm and to identify the influential bloggers.

I am not sure I agree that “relaxed grammatical structures” are peculiar to blog posts. ;-)

A “measure” of similarity that I have not seen (would appreciate a citation if you have) is the listing of one blog by another in its “blogroll.” The theory is that blogs may cite blogs they disagree with, semantically and otherwise, in posts but are unlikely to list blogs in their “blogroll” that they find disagreeable.

Nutch Tutorial: Supplemental III

Wednesday, December 14th, 2011

Apologies for the diversion in Nutch Tutorial: Supplemental II.

We left off last time with a safe way to extract the URLs from the RDF text without having to parse the XML and without having to expand the file onto the file system. And we produced a unique set of URLs.

We still need a random set of URLs; 1,000 was the amount mentioned in the Nutch Tutorial at Option 1.

Since we did not parse the RDF, we can’t use the subset option for org.apache.nutch.tools.DmozParser.

So, back to the Unix command line and our file with 3838759 lines in it, each with a unique URL.

Let’s do this a step at a time and we can pipe it all together below.

First, our file is: dmoz.urls.gz, so we expand it with gunzip:

gunzip dmoz.urls.gz

Results in dmoz.urls

Then we run the shuf command, which randomly shuffles the lines in the file:

shuf dmoz.urls > dmoz.shuf.urls

Remember the > operator redirects the results to another file.

Now the lines are in random order. But it is still the full set of URLs.

So we run the head command to take the first 1,000 lines off of our now randomly sorted file:

head -1000 dmoz.shuf.urls > dmoz.shuf.1000.urls

So now we have a file with 1,000 randomly chosen URLs from our DMOZ source file.

Here is how to do all that in one line:

gunzip -c dmoz.urls.gz | shuf | head -1000 > dmoz.shuf.1000.urls

BTW, in case you are worried about the randomness of your set (so that many of us are not hitting the same servers with our test installations), don’t be.

I ran shuf twice in a row on my set of URLs and then ran diff, which reported the first 100 lines were in a completely different order.

BTW, to check yourself on the extracted set of 1,000 URLs, run the following:

wc -l dmoz.shuf.1000.urls

Result should be 1000.

The wc command prints newline, word and byte counts. With the -l option, it prints only the newline count.

In case you don’t have the shuf command on your system, I would try:

sort -R dmoz.urls > dmoz.sort.urls

as a substitute for shuf dmoz.urls > dmoz.shuf.urls
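The choice between shuf and sort -R can be made by the script itself. What follows is only a sketch under stated assumptions: sample.urls.gz is a small stand-in file built on the spot (not the DMOZ data), the sample size is 10 rather than 1,000, and sort -R is assumed to be available wherever shuf is not.

```shell
#!/bin/sh
# Sketch: draw N random lines from a gzipped URL list, using shuf when it
# exists and falling back to sort -R otherwise.
# sample.urls.gz and the numbers below are stand-ins, not the DMOZ data.

workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a small stand-in for dmoz.urls.gz: 50 unique URLs, one per line.
i=1
while [ "$i" -le 50 ]; do
    echo "http://example.com/page$i"
    i=$((i + 1))
done | gzip -c > sample.urls.gz

n=10   # the tutorial asks for 1,000; 10 keeps the demo small

if command -v shuf >/dev/null 2>&1; then
    gunzip -c sample.urls.gz | shuf | head -n "$n" > sample.random.urls
else
    # sort -R groups identical lines together, which is harmless here
    # because every URL in the list is already unique.
    gunzip -c sample.urls.gz | sort -R | head -n "$n" > sample.random.urls
fi

# Report how many lines ended up in the sample.
wc -l < sample.random.urls
```

For the real data, swap sample.urls.gz for dmoz.urls.gz and set n to 1000.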

Hillary Mason (source of the sort suggestion) has collected more ways to extract one line (not exactly our requirement, but you can be creative) at: How to get a random line from a file in bash.

I am still having difficulties with one way to use Nutch/Solr so we will cover the “other” path, the working one, tomorrow. It looks like a bug between versions and I haven’t found the correct java class to copy over at this point. Not like a tutorial to mention that sort of thing. ;-)

Nutch Tutorial: Supplemental II

Monday, December 12th, 2011

This continues Nutch Tutorial: Supplemental.

I am getting a consistent error from:

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5

and have posted to the solr list, although my post has yet to appear on the list. More on that in my next post.

I wanted to take a quick detour into 3.2 Using Individual Commands for Whole-Web Crawling as it has some problematic advice in it.

First, Open Directory Project data can be downloaded from How to Get ODP Data. (Always nice to have a link, I think they call them hyperlinks.)

Second, as of last week, the content.rdf.u8.gz file is 295831712 bytes. Something about the file should warn you that gunzip on this file is a very bad idea.

A better one?

Run: gunzip -l (the switch is lowercase “l” as in Larry) which delivers the following information:

compressed size: size of the compressed file
uncompressed size: size of the uncompressed file
ratio: compression ratio (0.0% if unknown)
uncompressed_name: name of the uncompressed file

Or, in this case:

gunzip -l content.rdf.u8.gz
compressed uncompressed ratio uncompressed_name
295831712 1975769200 85.0% content.rdf.u8

Yeah, that 1 under uncompressed is in the GB column (the sizes are in bytes). So just a tad shy of 2 GB of data.

Not everyone who is keeping up with search technology has a couple of spare GBs of drive space lying around, although it is becoming more common.
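If you want to check the size of the expanded content without ever writing it to disk, one option is to stream it through wc. A minimal sketch, assuming a small stand-in file (sample.gz, built on the spot) rather than the real download:

```shell
#!/bin/sh
# Sketch: inspect a gzip file's sizes without writing the expanded file to disk.
# sample.gz is a stand-in for content.rdf.u8.gz.

workdir=$(mktemp -d)
cd "$workdir" || exit 1

# 1000 identical lines compress very well, much like repetitive RDF markup.
yes '<ExternalPage about="http://example.com/"/>' | head -n 1000 | gzip -c > sample.gz

# Sizes as gzip reports them.
gzip -l sample.gz

# Count the bytes by streaming the decompressed content through wc.
streamed=$(gunzip -c sample.gz | wc -c | tr -d ' ')
echo "streamed bytes: $streamed"
```

As far as I know, the gzip format stores the uncompressed size in a 32-bit field, so the listing can be misleading for files that expand past 4 GB. The streamed count does not have that limit, at the cost of reading the whole file.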

What got my attention was the lack of documentation of the file size or the potential problems such a download could cause casual experimenters.

But, if we are going to work with this file we have to do so without decompressing it.

Since the tutorial only extracts URLs, I am taking that as the initial requirement although we will talk about more sophisticated requirements in just a bit.

On a *nix system it is possible to pass the results of one command to another command by what are called pipes. My thinking in this case was to use the decompress command not to expand the file on disk but to decompress it as a stream and send the results of that decompression to another command that would extract the URLs. After I got that part working, I sorted and then deduped the URL set.

Here is the command with [step] numbers that you should remove before trying to run it:

[1]gunzip -c content.rdf.u8.gz [2]| [3]grep -o 'http://[^"]*' [2]| [4]sort [2]| [5]uniq [6]> [7]dmoz.urls

  1. gunzip -c content.rdf.u8.gz – With the -c switch, gunzip does not change the original file but streams the uncompressed content of the file to standard out. This is our starting point for dealing with files too large to expand.
  2. | – This is the pipe character that moves the output of one command to be used by another. The shell command in this case has three (3) pipe commands.
  3. grep -o 'http://[^"]*' – With the -o switch, grep will print only the “matched” parts of a matched line (grep normally prints the entire line), with each part on a separate line. The 'http://[^"]*' is a regular expression looking for parts that start with http:// and proceed to match any character other than the double quote mark. When the double quote mark is reached, the match is complete and that part prints. Note the use of the “*” quantifier, which allows any number of characters up to the closing double quote. The entire expression is surrounded with single quote (') characters because it contains a double quote character.
  4. sort – The result of #3 is piped into #4, where it is sorted. The sort was necessary because of the next command in the pipe.
  5. uniq – The sorted result is delivered to the uniq command which deletes any duplicate URLs. A requirement for the uniq command is that the duplicates be located next to each other, hence the sort command.
  6. > – The redirection operator, which writes the results of the uniq command to a file.
  7. dmoz.urls – The file name for the results.
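With the step numbers removed, the steps above can be tried on a toy file before turning them loose on the real one. This is a sketch, not the exact run from this post: sample.rdf is a hypothetical four-line stand-in for the DMOZ dump, and sort -u is used as a shorthand for sort followed by uniq.

```shell
#!/bin/sh
# Sketch: the URL extraction pipeline on a tiny stand-in file.

workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data with one URL that appears twice.
cat > sample.rdf <<'EOF'
<ExternalPage about="http://example.org/b">
<link r:resource="http://example.org/a"/>
<ExternalPage about="http://example.org/a">
<Topic r:id="Top/Arts">
EOF
gzip sample.rdf          # leaves sample.rdf.gz and removes sample.rdf

# Stream, extract, sort and dedupe; sort -u does the work of sort | uniq.
gunzip -c sample.rdf.gz | grep -o 'http://[^"]*' | sort -u > sample.urls

cat sample.urls
```

Note that the pattern only matches links that start with http://. If the data holds any https:// links, they would need a broader pattern.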

The results were as follows:

  • dmoz.urls = 130,279,429 – Remember the projected expansion of the original was 1,975,769,200 or 1,845,489,771 larger.
  • dmoz.urls.gz = 27,832,013 – The original was 295,831,712 or 267,999,699 larger.
  • unique urls – 3,838,759 (I have no way to compare to the original)

Note that it wasn’t necessary to process the RDF in order to extract a set of URLs for seeding a search engine.

Murray Altheim made several very good suggestions with regard to Java libraries and code for this task. Those don’t appear here but will appear in a command line tool for this dataset that allows the user to choose categories of websites to be extracted for seeding a search engine.

All that is preparatory to a command line tool for creating a topic map from a selected part of this data set and then enhancing it with the results of the use of a search engine.

Apologies for getting off track on the Nutch tutorial. There are some issues that remain to be answered, typos and the like, which I will take up in the next post on this subject.

Nutch Tutorial: Supplemental

Thursday, December 8th, 2011

If you have tried following the Nutch Tutorial you have probably encountered one or more problems getting Nutch to run. I have installed Nutch on an Ubuntu 10.10 system and suggest the following additions or modifications to the instructions you find there, as of 8 December 2011.

Steps 1 and 2, perform as written.

Step 3, at least the first part has serious issues.

The example:

<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>

is misleading.

The content of <value></value> cannot contain spaces.

Therefore, <value>Patrick Durusau Nutch Spider</value> is wrong and produces:

Fetcher: No agents listed in ‘http.agent.name’ property.

Which then results in a Java exception that kills the process.

If I enter, <value>pdurusau</value>, the process continues to the next error in step 3.

Correct instruction:

3. Crawl your first website:

In $HOME/nutch-1.4/conf, open the nutch-site.xml file, which has the following content the first time you open the file:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

</configuration>

Inside the configuration element you will insert:

<property>
<name>http.agent.name</name>
<value>noSpaceName</value>
</property>

Which will result in a nutch-site.xml file that reads:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
<name>http.agent.name</name>
<value>noSpaceName</value>
</property>

</configuration>

Save the nutch-site.xml file and we can move onto the next setup error.
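If you would rather catch a bad agent name before Nutch does, a quick shell check is possible. The sketch below is my own illustration, not part of the tutorial: it writes a copy of the file into a scratch directory, assumes the value element sits on the line directly after the name element (as in the file above), and relies on the -A option of grep, which GNU and BSD grep both provide.

```shell
#!/bin/sh
# Sketch: write a nutch-site.xml like the one above and check the agent name.
# The scratch directory and the agent variable are illustrative only.

workdir=$(mktemp -d)
cd "$workdir" || exit 1

agent="noSpaceName"

cat > nutch-site.xml <<EOF
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
<name>http.agent.name</name>
<value>$agent</value>
</property>

</configuration>
EOF

# Take the line after the http.agent.name element and strip the value tags.
value=$(grep -A 1 '<name>http.agent.name</name>' nutch-site.xml \
        | sed -n 's:.*<value>\(.*\)</value>.*:\1:p')

case "$value" in
    "")    echo "http.agent.name is empty" ;;
    *" "*) echo "http.agent.name contains a space: $value" ;;
    *)     echo "http.agent.name looks usable: $value" ;;
esac
```

This is a text match, not an XML parse, so it will miss a value that is laid out differently.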

Step 3 next says:

mkdir -p urls

You are already in the conf directory so that is where you create urls, yes?

No! That results in:

Input path does not exist: [$HOME]/nutch-1.4/runtime/local/urls

So you must create the urls directory under [$HOME]/nutch-1.4/runtime/local by:

cd $HOME/nutch-1.4/runtime/local

mkdir urls

Don’t worry about the “-p” option on mkdir. It is an option that allows you to…, well, run man mkdir at a *nix command prompt if you are really interested. It would take more space than I need to spend here to explain it clearly.
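For the curious, here is a short sketch of what -p changes. It runs in a scratch directory made by mktemp, so nothing under your Nutch install is touched; the runtime/local/urls path simply mirrors the one used above.

```shell
#!/bin/sh
# Sketch: mkdir with and without -p, in a scratch directory.

base=$(mktemp -d)

# Without -p, mkdir refuses to create a directory whose parent is missing.
mkdir "$base/runtime/local/urls" 2>/dev/null || echo "plain mkdir failed"

# With -p, the missing parents (runtime, local) are created along the way.
mkdir -p "$base/runtime/local/urls"

# Running it again is not an error, even though the directory now exists.
mkdir -p "$base/runtime/local/urls" && echo "second mkdir -p succeeded"
```

In the tutorial's case the parents already exist, which is why plain mkdir urls is enough once you are in the right directory.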

The nutch-default.xml file, located under [$HOME]/nutch-1.4/conf/, sets a number of default settings for Nutch. You should not edit this file but copy properties of interest to [$HOME]/nutch-1.4/conf/nutch-site.xml to create settings that override the default settings in nutch-default.xml.

If you look at nutch-default.xml or the rest of the Nutch Tutorial at the Apache site, you may be saying to yourself, but, but…, there are a lot of other settings and possible pitfalls.

Yes, yes that’s true.

I am setting up Nutch straight from the instructions as given, to encounter the same ambiguities users fresh to it will encounter. My plan is to use Nutch with Solr (and other search engines as well) to explore data for my blog as well as developing information for creating topic maps.

So, I will be covering the minimal set of options we need to get Nutch/Solr up and running but then expanding on other options as we need them.

Will pick up with corrections/suggestions tomorrow on the rest of the tutorial.

Suggestions and/or corrections/expansions are welcome!

PS: On days when I am correcting/editing user guides, expect fewer posts overall.