Apologies for the diversion in Nutch Tutorial: Supplemental II.
We left off last time with a safe way to extract the URLs from the RDF text without having to parse the XML and without having to expand the file onto the file system. And we produced a unique set of URLs.
We still need a random set of URLs; 1,000 was the amount mentioned in the Nutch Tutorial at Option 1.
Since we did not parse the RDF, we can’t use the subset option for org.apache.nutch.tools.DmozParser.
So, back to the Unix command line and our file with 3838759 lines in it, each with a unique URL.
Let’s do this a step at a time and we can pipe it all together below.
First, our file is: dmoz.urls.gz, so we expand it with gunzip:
gunzip dmoz.urls.gz
Results in dmoz.urls
Then we run the shuf command, which randomly shuffles the lines in the file:
shuf dmoz.urls > dmoz.shuf.urls
Remember the > operator redirects the results to a file (a pipe, |, would instead send them to another command).
Now the lines are in random order. But it is still the full set of URLs.
So we run the head command to take the first 1,000 lines off of our now randomly sorted file:
head -1000 dmoz.shuf.urls > dmoz.shuf.1000.urls
So now we have a file with 1,000 randomly chosen URLs from our DMOZ source file.
Here is how to do all that in one line:
gunzip -c dmoz.urls.gz | shuf | head -1000 > dmoz.shuf.1000.urls
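As a possible shortcut, GNU shuf can limit its own output with the -n option, which makes the head step unnecessary. The sketch below assumes GNU coreutils and uses a small generated stand-in for dmoz.urls.gz (5,000 made-up example.com URLs in a temporary directory) so that it can be run anywhere; with the real file, only the gunzip line matters.

```shell
# Build a small stand-in for dmoz.urls.gz (hypothetical data: 5,000 fake URLs).
workdir=$(mktemp -d)
seq 1 5000 | sed 's|^|http://example.com/page/|' | gzip > "$workdir/dmoz.urls.gz"

# shuf -n N prints at most N randomly chosen lines, so head is not needed.
gunzip -c "$workdir/dmoz.urls.gz" | shuf -n 1000 > "$workdir/dmoz.shuf.1000.urls"

# Count the sampled lines.
sample_count=$(wc -l < "$workdir/dmoz.shuf.1000.urls")
echo "sampled lines: $sample_count"
```

Either form should give the same kind of result; shuf -n simply avoids writing out the full shuffled list before discarding most of it.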
BTW, in case you are worried about the randomness of your set (the point being that many of us should not be hitting the same servers with our test installations), don’t be.
I ran shuf twice in a row on my set of URLs and then ran diff, which reported the first 100 lines were in a completely different order.
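A rough way to repeat that check yourself is sketched below. It is not the exact commands I ran: it uses cmp instead of diff to keep the output short, and it works on a generated stand-in file (200 made-up example.com URLs) rather than the real dmoz.urls.

```shell
# Hypothetical stand-in for dmoz.urls: 200 fake URLs in a temporary directory.
workdir=$(mktemp -d)
seq 1 200 | sed 's|^|http://example.com/page/|' > "$workdir/dmoz.urls"

# Shuffle the same input twice.
shuf "$workdir/dmoz.urls" > "$workdir/run1.urls"
shuf "$workdir/dmoz.urls" > "$workdir/run2.urls"

# cmp -s is silent and exits non-zero when the two files differ.
if cmp -s "$workdir/run1.urls" "$workdir/run2.urls"; then
    order_check="same order"
else
    order_check="different order"
fi

# After sorting, both runs should match: same URLs, only the order changed.
if [ "$(sort "$workdir/run1.urls")" = "$(sort "$workdir/run2.urls")" ]; then
    content_check="same URLs"
else
    content_check="different URLs"
fi

echo "$order_check, $content_check"
```

Two shuffles coming out in the same order is possible in principle, but for a file of any real size it is so unlikely that you can ignore it.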
BTW, to check yourself on the extracted set of 1,000 URLs, run the following:
wc -l dmoz.shuf.1000.urls
Result should be 1000.
The wc command prints newline, word and byte counts. With the -l option, it prints only the newline count.
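If you also want to confirm that the sample contains no duplicate URLs, one option is to compare the plain line count with the count after sort -u. This is only a sketch: the file is again a generated stand-in (made-up example.com URLs), and the variable names are mine.

```shell
# Hypothetical stand-in for dmoz.shuf.1000.urls, sampled from 5,000 fake URLs.
workdir=$(mktemp -d)
seq 1 5000 | sed 's|^|http://example.com/page/|' | shuf -n 1000 > "$workdir/dmoz.shuf.1000.urls"

# Total number of lines in the sample.
total_lines=$(wc -l < "$workdir/dmoz.shuf.1000.urls")

# Number of distinct lines; sort -u drops repeated lines.
unique_lines=$(sort -u "$workdir/dmoz.shuf.1000.urls" | wc -l)

echo "total: $total_lines, unique: $unique_lines"
```

Since the source file already held unique URLs, the two numbers should agree; a mismatch would suggest the input was not de-duplicated after all.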
In case you don’t have the shuf command on your system, I would try:
sort -R dmoz.urls > dmoz.sort.urls
as a substitute for shuf dmoz.urls > dmoz.shuf.urls
Hilary Mason (source of the sort suggestion) has collected more ways to extract one line (not exactly our requirement, but you can be creative) at: How to get a random line from a file in bash.
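If your system has neither shuf nor a sort that supports -R, another approach that may work is to prefix every line with a random number in awk, sort on that number, and then strip it off again. This is my own sketch, not something taken from the page above; it assumes the URLs contain no tab characters, and it runs on a generated stand-in file (made-up example.com URLs) with file names I invented for the example.

```shell
# Hypothetical stand-in for dmoz.urls: 5,000 fake URLs in a temporary directory.
workdir=$(mktemp -d)
seq 1 5000 | sed 's|^|http://example.com/page/|' > "$workdir/dmoz.urls"

# Decorate each line with a random key and a tab, then sort numerically on the key.
awk 'BEGIN { srand() } { printf "%.12f\t%s\n", rand(), $0 }' "$workdir/dmoz.urls" \
    | LC_ALL=C sort -n > "$workdir/dmoz.keyed.urls"

# Keep the first 1,000 keyed lines and cut away the key (field 1).
head -n 1000 "$workdir/dmoz.keyed.urls" | cut -f2- > "$workdir/dmoz.awk.1000.urls"

awk_count=$(wc -l < "$workdir/dmoz.awk.1000.urls")
echo "sampled lines: $awk_count"
```

Note that srand() with no argument seeds from the clock, so in some awk versions two runs within the same second can produce the same order.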
I am still having difficulties with one way to use Nutch/Solr so we will cover the “other” path, the working one, tomorrow. It looks like a bug between versions and I haven’t found the correct java class to copy over at this point. Not like a tutorial to mention that sort of thing.