Before Nutch 2.0, there were the Nutch 1.5 and 1.5.1 releases.
From the 1.5 release notes:
The 1.5 release of Nutch is now available. This release includes several improvements including upgrades of several major components including Tika 1.1 and Hadoop 1.0.0, improvements to LinkRank and WebGraph elements as well as a number of new plugins covering blacklisting, filtering and parsing to name a few. Please see the list of changes
http://www.apache.org/dist/nutch/CHANGES-1.5.txt
[WRONG URL. It should be: http://www.apache.org/dist/nutch/1.5/CHANGES-1.5.txt (the version segment “/1.5/” is missing from the path; it took me a while to notice the nature of the problem.)]
made in this version for a full breakdown of the 50 odd improvements the release boasts. A full PMC release statement can be found below
http://nutch.apache.org/#07+June+2012+-+Apache+Nutch+1.5+Released
Apache Nutch is an open source web-search software project. Stemming from Apache Lucene, it now builds on Apache Solr adding web-specifics, such as a crawler, a link-graph database and parsing support handled by Apache Tika for HTML and an array of other document formats. Nutch can run on a single machine, but gains a lot of its strength from running in a Hadoop cluster. The system can be enhanced (e.g., other document formats can be parsed) using a highly flexible, easily extensible and thoroughly maintained plugin infrastructure. Nutch is available in source and binary form (zip and tar.gz) from the following download page: http://www.apache.org/dyn/closer.cgi/nutch/
And 1.5.1:
http://www.apache.org/dist/nutch/1.5.1/CHANGES.txt
Nutch is available in source and binary form (zip and tar.gz) from the following download page: http://www.apache.org/dyn/closer.cgi/nutch/
Question: Would you put together some commodity boxes for local experimentation or would you spin up an installation in one of the clouds?
As hot as the summer promises to be near Atlanta, I am leaning towards the cloud route.
As I write that I can hear a close friend from the West Coast shouting “…trust, trust issues….” But I trust the local banking network, credit card, utilities, finance, police/fire, etc., with just as little reason as any of the “clouds.”
It is not really even “trust”; I don’t even think about it. The credit card industry knows $X in fraud is going to occur every year and treats it as a cost of liquid transactions, so it allows for it in its fees. The industry proceeds in the face of known rates of fraud. How’s that for trust? 😉 Trusting that fraud is going to happen.
The same will be true for the “clouds,” and mechanisms will evolve to regulate the amount of exposure versus potential damage. I am going to be experimenting with non-client data, so the worst exposure I have is loss of time, and perhaps some hard lessons learned on configuration/security. That is hardly a reason to avoid the “clouds” and incur the local hardware cost.
I was serious when I suggested governments should start requiring side-by-side comparisons of hardware costs for local installs versus cloud services. I would call up the major cloud services and ask them for independent bids.
Would the “clouds” be less secure? Possibly, but I don’t think any of them allow Lady Gaga CDs on premises.