From the post:
Following my scraping and crawling experiences, I was looking for a good indexer. Initially I was set on using Lucene, as I had gotten pretty good recommendations about it. Lucene really shines, but I had decided to use Ruby or another scripting language to avoid bloated code.
Browsing around, I found out about Ferret, which is a text indexing library for Ruby. The benchmarks and references were good, so I set out to do some testing to get used to it. Fortunately, the results were good, and the API is a breeze. Also, pagination is built in. How cool is that?
For an initial test, I set out to index the Linux kernel source code. Following Brian McCallister's example, I wrote two small scripts: indexer.rb and search.rb. I ran the indexer over the source tree and came up with some very interesting results. The words I searched for were ‘net’, ‘skb’, ‘x86’ and finally ‘linux’.
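The post does not reproduce indexer.rb or search.rb, so here is a rough sketch of the shape such a pair of scripts takes: walk a source tree, record which files contain which terms, then look a term up one page at a time. This is a toy in plain Ruby, not Ferret's API, file format, or ranking. The `TinyIndex` class, its method names, the `.c`/`.h` filter, and the sample file names are all assumptions made for illustration.

```ruby
require 'find'

# Toy stand-in for an indexer (hypothetical, not Ferret): maps each term
# to the files that contain it, with a per-file occurrence count.
class TinyIndex
  def initialize
    @postings = Hash.new { |hash, term| hash[term] = Hash.new(0) }
  end

  # Split text into lowercase alphanumeric terms, so "x86" stays one term.
  def self.tokenize(text)
    text.downcase.scan(/[a-z0-9]+/)
  end

  # Record every term of one document under the given path.
  def add(path, text)
    self.class.tokenize(text).each { |term| @postings[term][path] += 1 }
  end

  # Index every .c and .h file below root (the filter is an assumption).
  def add_tree(root)
    Find.find(root) do |path|
      next unless File.file?(path) && path =~ /\.[ch]\z/
      add(path, File.read(path, mode: 'rb'))
    end
  end

  # Return one page of [path, count] pairs, most occurrences first.
  def search(term, offset: 0, limit: 10)
    hits = @postings.fetch(term.downcase, {})
    ranked = hits.sort_by { |path, count| [-count, path] }
    ranked[offset, limit] || []
  end
end

# In-memory usage with made-up file contents; a real run would call
# index.add_tree('/path/to/linux') instead.
index = TinyIndex.new
index.add('net/example_a.c', 'skb = alloc_skb(len); kfree_skb(skb);')
index.add('arch/example_b.c', 'x86 setup code for linux, touches skb once')

%w[net skb x86 linux].each do |word|
  index.search(word, offset: 0, limit: 5).each do |path, count|
    puts "#{word}: #{path} (#{count})"
  end
end
```

The `offset` and `limit` arguments only mimic the idea of paginated results mentioned above. A real library such as Ferret also handles analysis, scoring, and on-disk storage, none of which this sketch attempts.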
You will probably want to drop by Ferret to pick up the source.
Ferret is described there as:
Ferret is an information retrieval library in the same vein as Apache Lucene[1]. Originally it was a full port of Lucene, but it now uses its own file format and indexing algorithm, although it is still very similar in many ways to Lucene. Everything you can do in Lucene you should be able to do in Ferret.