Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

December 8, 2011

Nutch Tutorial: Supplemental

Filed under: Nutch,Solr — Patrick Durusau @ 7:58 pm

If you have tried following the Nutch Tutorial, you have probably encountered one or more problems getting Nutch to run. I have installed Nutch on an Ubuntu 10.10 system and suggest the following additions or modifications to the instructions you find there, as of 8 December 2011.

Steps 1 and 2, perform as written.

Step 3, at least the first part, has serious issues.

The example:

<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>

is misleading.

The content of <value></value> cannot contain spaces.

Therefore, <value>Patrick Durusau Nutch Spider</value> is wrong and produces:

Fetcher: No agents listed in 'http.agent.name' property.

Which then results in a Java exception that kills the process.

If I enter <value>pdurusau</value>, the process continues to the next error in step 3.

Correct instruction:

3. Crawl your first website:

In $HOME/nutch-1.4/conf, open the nutch-site.xml file, which has the following content the first time you open the file:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

</configuration>

Inside the configuration element you will insert:

<property>
<name>http.agent.name</name>
<value>noSpaceName</value>
</property>

Which will result in a nutch-site.xml file that reads:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
<name>http.agent.name</name>
<value>noSpaceName</value>
</property>

</configuration>

Save the nutch-site.xml file and we can move on to the next setup error.
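
The same edit can be scripted. What follows is a minimal sketch, not part of the tutorial: it writes the nutch-site.xml shown above, refuses an agent name that contains a space, and reads the value back. NUTCH_HOME and AGENT_NAME are hypothetical variable names introduced for illustration, and NUTCH_HOME defaults to a scratch directory so that nothing in a real install is touched unless you ask for it.

```shell
#!/bin/sh
# Sketch only. NUTCH_HOME defaults to a scratch directory so the script is safe to run.
# Setting NUTCH_HOME=$HOME/nutch-1.4 targets a real install and OVERWRITES its nutch-site.xml.
NUTCH_HOME="${NUTCH_HOME:-$(mktemp -d)/nutch-1.4}"
AGENT_NAME="noSpaceName"   # placeholder from the post; choose your own name without spaces

mkdir -p "$NUTCH_HOME/conf"

# Refuse names containing a space, following the observation in the post.
case "$AGENT_NAME" in
  *" "*) echo "agent name must not contain spaces: $AGENT_NAME" >&2; exit 1 ;;
esac

# Write the file exactly as shown in the post, substituting the agent name.
cat > "$NUTCH_HOME/conf/nutch-site.xml" <<EOF
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
<name>http.agent.name</name>
<value>$AGENT_NAME</value>
</property>

</configuration>
EOF

# Read the value back from the file that was just written.
WRITTEN=$(sed -n 's:.*<value>\(.*\)</value>.*:\1:p' "$NUTCH_HOME/conf/nutch-site.xml")
echo "http.agent.name = $WRITTEN"
```

The read-back step assumes the file holds a single property, as it does at this point in the setup.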

Step 3 next says:

mkdir -p urls

You are already in the conf directory so that is where you create urls, yes?

No! That results in:

Input path does not exist: [$HOME]/nutch-1.4/runtime/local/urls

So you must create the urls directory under [$HOME]/nutch-1.4/runtime/local by:

cd $HOME/nutch-1.4/runtime/local

mkdir urls

Don’t worry about the “-p” option on mkdir. It is an option that allows you to…, well, run man mkdir at a *nix command prompt if you are really interested. It would take more space than I need to spend here to explain it clearly.
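
For the curious, here is a small sketch of what "-p" changes. It is an illustration under assumptions, not a tutorial step: NUTCH_HOME is a hypothetical variable that defaults to a scratch directory, so the parent directories really are missing on the first call. On a real install, where runtime/local already exists, plain mkdir urls does the same job.

```shell
#!/bin/sh
# Sketch only. NUTCH_HOME defaults to a scratch directory;
# set NUTCH_HOME=$HOME/nutch-1.4 to work against a real install.
NUTCH_HOME="${NUTCH_HOME:-$(mktemp -d)/nutch-1.4}"

# With -p, mkdir creates any missing parents (nutch-1.4/runtime/local) along with urls.
mkdir -p "$NUTCH_HOME/runtime/local/urls"

# Running it again is harmless: with -p an existing directory is not an error.
mkdir -p "$NUTCH_HOME/runtime/local/urls" && echo "urls directory ready"

# Without -p, mkdir fails when the directory already exists.
if ! mkdir "$NUTCH_HOME/runtime/local/urls" 2>/dev/null; then
  echo "plain mkdir refuses to recreate an existing directory"
fi
```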

The nutch-default.xml file, located under [$HOME]/nutch-1.4/conf/, sets a number of default settings for Nutch. You should not edit this file but copy properties of interest to [$HOME]/nutch-1.4/conf/nutch-site.xml to create settings that override the default settings in nutch-default.xml.
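
One way to find a property worth copying is to list the names in the defaults file and print a single property block. The sketch below is only an illustration: it runs against a stand-in file created in a scratch directory, and the stand-in's content (including the example.other.setting property) is hypothetical, not the real nutch-default.xml. DEFAULTS, PROP, NAMES and BLOCK are names introduced here.

```shell
#!/bin/sh
# Sketch only. Works on a stand-in file unless DEFAULTS is already set, e.g.
# DEFAULTS=$HOME/nutch-1.4/conf/nutch-default.xml for a real install.
if [ -z "$DEFAULTS" ]; then
  DEFAULTS="$(mktemp -d)/nutch-default.xml"
  # Stand-in content, NOT the real file: just enough structure to show the lookup.
  cat > "$DEFAULTS" <<'EOF'
<configuration>
<property>
  <name>http.agent.name</name>
  <value></value>
  <description>Stand-in description.</description>
</property>
<property>
  <name>example.other.setting</name>
  <value>42</value>
  <description>Hypothetical property for illustration.</description>
</property>
</configuration>
EOF
fi

# List every property name the defaults file defines.
NAMES=$(sed -n 's:.*<name>\(.*\)</name>.*:\1:p' "$DEFAULTS")
echo "$NAMES"

# Print one whole <property> block, ready to paste into nutch-site.xml and edit.
PROP="http.agent.name"
BLOCK=$(awk -v name="<name>$PROP</name>" '
  /<property>/    { buf = ""; keep = 0 }       # start collecting a new block
                  { buf = buf $0 "\n" }        # append every line to the buffer
  index($0, name) { keep = 1 }                 # this block holds the wanted name
  /<\/property>/  { if (keep) printf "%s", buf }
' "$DEFAULTS")
echo "$BLOCK"
```

The read-only approach is deliberate: nothing is written to either configuration file, so the block can be reviewed before it is pasted inside the configuration element of nutch-site.xml.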

If you look at nutch-default.xml or the rest of the Nutch Tutorial at the Apache site, you may be saying to yourself, but, but…, there are a lot of other settings and possible pitfalls.

Yes, yes that’s true.

I am setting up Nutch straight from the instructions as given, to encounter the same ambiguities that users fresh to it will encounter. My plan is to use Nutch with Solr (and other search engines as well) to explore data for my blog and to develop information for creating topic maps.

So, I will be covering the minimal set of options we need to get Nutch/Solr up and running but then expanding on other options as we need them.

Will pick up with corrections/suggestions tomorrow on the rest of the tutorial.

Suggestions and/or corrections/expansions are welcome!

PS: On days when I am correcting/editing user guides, expect fewer posts overall.

