Grant Ingersoll writes:
We’ve been doing a lot of work at Lucid lately on scaling out Solr, so I thought I would blog about some of the things we’ve been working on and how they might help you handle large indexes with ease. First off, if you want a more basic approach using versions of Solr prior to what will be Solr4, and you don’t care about scaling out Solr indexing to match Hadoop or about being fault tolerant, I recommend you read Indexing Files via Solr and Java MapReduce. (Note that you could also modify that code to handle these things. If you need to do that, we’d be happy to help.)
Instead of doing all the extra work of making sure instances are up, etc., however, I am going to focus on using some of the new features of Solr4, i.e. SolrCloud, whose development effort has been primarily led by several of my colleagues: Yonik Seeley, Mark Miller and Sami Siren. SolrCloud removes the need to figure out where to send documents when indexing. I will pair it with Behemoth, a convenient Hadoop-based document processing toolkit created by Julien Nioche. Behemoth takes care of the need to write any Map/Reduce code and handles things like extracting content from PDFs and Word files in a Hadoop-friendly manner (think Apache Tika run in Map/Reduce). It also lets you output the results to things like Solr, Mahout, GATE and others, as well as annotate the intermediary results. Behemoth isn’t super sophisticated in terms of ETL (Extract-Transform-Load) capabilities, but it is lightweight, easy to extend and gets the job done on Hadoop without you having to spend time worrying about writing mappers and reducers.
If you are pushing the boundaries of your Solr 3.* installation or just want to know more about Solr4, this post is for you.