This continues Nutch Tutorial: Supplemental.
I am getting a consistent error from:
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
and have posted to the solr list, although my post has yet to appear on the list. More on that in my next post.
I wanted to take a quick detour into 3.2 Using Individual Commands for Whole-Web Crawling as it has some problematic advice in it.
First, Open Directory Project data can be downloaded from How to Get ODP Data. (Always nice to have a link, I think they call them hyperlinks.)
Second, as of last week, the content.rdf.u8.gz file is 295831712 bytes. Something about the file size should warn you that gunzip on this file is a very bad idea.
A better one?
Run: gunzip -l (the switch is lowercase “l” as in Larry) which delivers the following information:
compressed size: size of the compressed file
uncompressed size: size of the uncompressed file
ratio: compression ratio (0.0% if unknown)
uncompressed_name: name of the uncompressed file
Or, in this case:
gunzip -l content.rdf.u8.gz
compressed uncompressed ratio uncompressed_name
295831712 1975769200 85.0% content.rdf.u8
Yeah, that 1 under uncompressed is in the billions column. So just a tad shy of 2 GB of data.
Not everyone who is keeping up with search technology has a couple of spare GBs of drive space lying around, although it is becoming more common.
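As a sanity check on the figures from gunzip -l, here is a minimal sketch that converts the reported byte counts into decimal gigabytes and recomputes the compression ratio. The variable names are mine, and the ratio formula (1 minus compressed over uncompressed) is an approximation of what gzip reports, not its exact internal calculation.

```shell
#!/bin/sh
# Sizes in bytes, copied from the gunzip -l output quoted in this post.
compressed=295831712
uncompressed=1975769200

# Convert bytes to decimal gigabytes (10^9 bytes), two decimal places.
# LC_ALL=C keeps the decimal separator a period regardless of locale.
uncompressed_gb=$(LC_ALL=C awk -v b="$uncompressed" 'BEGIN { printf "%.2f", b / 1e9 }')

# Approximate space saved by compression, as a percentage.
ratio=$(LC_ALL=C awk -v c="$compressed" -v u="$uncompressed" \
  'BEGIN { printf "%.1f", (1 - c / u) * 100 }')

echo "uncompressed: ${uncompressed_gb} GB, ratio: ${ratio}%"
```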
What got my attention was the lack of documentation of the file size or of the potential problems such a download could cause casual experimenters.
But, if we are going to work with this file we have to do so without decompressing it.
Since the tutorial only extracts URLs, I am taking that as the initial requirement although we will talk about more sophisticated requirements in just a bit.
On a *nix system it is possible to send the output of one command to another command through what are called pipes. My thinking in this case was to use the decompress command not to expand the file on disk but to stream the decompressed content to another command that would extract the URLs. After I got that part working, I sorted and then deduped the URL set.
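To see the streaming idea in isolation, here is a small sketch that peeks into a compressed file without writing an uncompressed copy. The sample file and its contents are made up for illustration; they stand in for content.rdf.u8.gz.

```shell
#!/bin/sh
# Build a tiny stand-in for content.rdf.u8.gz (hypothetical sample data).
workdir=$(mktemp -d)
printf 'line one\nline two\nline three\n' | gzip -c > "$workdir/sample.gz"

# Stream the decompressed content to standard out and keep the first two lines.
# The .gz file stays as it is and no uncompressed copy lands on disk.
first_lines=$(gunzip -c "$workdir/sample.gz" | head -n 2)
echo "$first_lines"

# Count the lines in the decompressed stream, again without expanding the file.
line_count=$(gunzip -c "$workdir/sample.gz" | wc -l | tr -d ' ')
echo "$line_count"
```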
Here is the command, with each part explained in the list that follows:
gunzip -c content.rdf.u8.gz | grep -o 'http://[^"]*' | sort | uniq > dmoz.urls
- gunzip -c content.rdf.u8.gz – With the -c switch, gunzip does not change the original file but streams the uncompressed content of the file to standard out. This is our starting point for dealing with files too large to expand.
- | – This is the pipe character that moves the output of one command to be used by another. The shell command in this case has three (3) pipes.
- grep -o 'http://[^"]*' – With the -o switch, grep prints only the matched parts of a matching line (grep normally prints the entire line), with each part on its own line. The 'http://[^"]*' is a regular expression that matches text starting with http:// and continuing through any characters other than the double quote mark. When a double quote mark is reached, the match is complete and that part prints. Note the “*” quantifier, which allows any number of characters up to the closing double quote. The entire expression is surrounded with single quote (') characters because it contains a double quote character.
- sort – The output of grep is piped into sort, where it is sorted. The sort is necessary because of the next command in the pipe.
- uniq – The sorted result is delivered to the uniq command which deletes any duplicate URLs. A requirement for the uniq command is that the duplicates be located next to each other, hence the sort command.
- > – Redirects the output of the uniq command to a file.
- dmoz.urls – The file name for the results.
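The steps above can be tried on a tiny sample before committing to the full download. This is only a sketch: the sample lines are invented and merely resemble RDF, so the real content.rdf.u8 layout may differ, and the file names under the temporary directory are hypothetical.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Hypothetical RDF-like sample with one duplicate URL and one line without a URL.
cat > "$workdir/sample.rdf" <<'EOF'
<ExternalPage about="http://example.com/b">
<link r:resource="http://example.com/a"/>
<ExternalPage about="http://example.com/a">
<Title>No URL on this line</Title>
EOF
gzip "$workdir/sample.rdf"   # leaves sample.rdf.gz in place of sample.rdf

# Same pipeline as in the post: stream, extract, sort, dedupe, write to a file.
gunzip -c "$workdir/sample.rdf.gz" \
  | grep -o 'http://[^"]*' \
  | sort \
  | uniq > "$workdir/sample.urls"

cat "$workdir/sample.urls"
```

On this sample the result is two lines, one per distinct URL. The sort | uniq pair can also be written as sort -u, which does the same job in one command.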
The results were as follows:
- dmoz.urls = 130,279,429 bytes – Remember the projected expansion of the original was 1,975,769,200 bytes, or 1,845,489,771 bytes larger.
- dmoz.urls.gz = 27,832,013 bytes – The original was 295,831,712 bytes, or 267,999,699 bytes larger.
- unique urls – 3,838,759 (I have no way to compare to the original)
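For anyone wanting to reproduce figures like these, here is a rough sketch. The first part redoes the subtraction with the byte counts quoted above; the second shows how wc can report the byte and line counts of a URL file, using a made-up two-line sample rather than the real dmoz.urls.

```shell
#!/bin/sh
# Byte counts quoted in the results list above.
original_uncompressed=1975769200
original_compressed=295831712
urls_size=130279429
urls_gz_size=27832013

# How much smaller the URL list is than the original, uncompressed and compressed.
saved_uncompressed=$((original_uncompressed - urls_size))
saved_compressed=$((original_compressed - urls_gz_size))
echo "$saved_uncompressed $saved_compressed"

# Measuring a URL file: wc -c gives bytes, wc -l gives lines (one URL per line).
# The sample file here is hypothetical.
workdir=$(mktemp -d)
printf 'http://example.com/a\nhttp://example.com/b\n' > "$workdir/sample.urls"
byte_count=$(wc -c < "$workdir/sample.urls" | tr -d ' ')
url_count=$(wc -l < "$workdir/sample.urls" | tr -d ' ')
echo "$byte_count bytes, $url_count urls"
```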
Note that it wasn’t necessary to process the RDF in order to extract a set of URLs for seeding a search engine.
Murray Altheim made several very good suggestions with regard to Java libraries and code for this task. Those don’t appear here but will appear in a command line tool for this dataset that allows the user to choose categories of websites to be extracted for seeding a search engine.
All that is preparatory to a command line tool for creating a topic map from a selected part of this data set and then enhancing it with the results of the use of a search engine.
Apologies for getting off track on the Nutch tutorial. There are some issues that remain to be answered, typos and the like, which I will take up in the next post on this subject.