Archive for the ‘SolrCloud’ Category

Introducing the Solr Scale Toolkit

Saturday, June 7th, 2014

Introducing the Solr Scale Toolkit by Timothy Potter.

From the post:

SolrCloud is a set of features in Apache Solr that enable elastic scaling of distributed search indexes using sharding and replication. One of the hurdles to adopting SolrCloud has been the lack of tools for deploying and managing a SolrCloud cluster. In this post, I introduce the Solr Scale Toolkit, an open-source project sponsored by LucidWorks, which provides tools and guidance for deploying and managing SolrCloud in cloud-based platforms such as Amazon EC2. In the last section, I use the toolkit to run some performance benchmarks against Solr 4.8.1 to see just how “scalable” Solr really is.


When you download a recent release of Solr (4.8.1 is the latest at the time of this writing), it’s actually quite easy to get a SolrCloud cluster running on your local workstation. Solr allows you to start an embedded ZooKeeper instance to enable “cloud” mode using a simple command-line option: -DzkRun. If you’ve not done this before, I recommend following the instructions provided by the Solr Reference Guide:
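If you want to try that out, the commands look roughly like this for a Solr 4.x example distribution. The ports, paths and two-shard layout here are assumptions based on the SolrCloud wiki examples, not taken from Timothy’s post:

```shell
# From the example/ directory of a Solr 4.x download.
cd example

# Start the first node with embedded ZooKeeper (-DzkRun), bootstrapping
# the config for a two-shard collection.
java -DzkRun -DnumShards=2 \
     -Dbootstrap_confdir=./solr/collection1/conf \
     -Dcollection.configName=myconf \
     -jar start.jar

# In a second terminal, start another node pointing at that ZooKeeper.
# Embedded ZK listens on the Solr port + 1000, i.e. 9983 here.
cd example2
java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar
```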

Once you’ve worked through the out-of-the-box experience with SolrCloud, you quickly realize you need tools to help you automate deployment and system administration tasks across multiple servers. Moreover, once you get a well-configured cluster running, there are ongoing system maintenance tasks that also should be automated, such as doing rolling restarts, performing off-site backups, or simply trying to find an error message across multiple log files on different servers.
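The log-search chore alone is a good example of what people end up scripting by hand. A throwaway loop like this works, but it is exactly the kind of thing that wants proper tooling (hostnames and the log path are assumptions):

```shell
# Hypothetical hosts and log location; adjust for your cluster.
for h in solr1.example.com solr2.example.com solr3.example.com; do
  echo "== $h =="
  ssh "$h" 'grep -i "ERROR" /var/log/solr/solr.log | tail -n 5'
done
```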

Until now, most organizations had to integrate SolrCloud operations into an existing environment using tools like Chef or Puppet. While those are still valid approaches, the Solr Scale Toolkit provides a simple, Python-based solution that is easy to install and use to manage SolrCloud. In the remaining sections of this post, I walk you through some of the key features of the toolkit and encourage you to follow along. To begin there’s a little setup that is required to use the toolkit.

If you are looking to scale Solr, Timothy’s post is the right place to start!

Take serious heed of the following advice:

One of the most important tasks when planning to use SolrCloud is to determine how many servers you need to support your index(es). Unfortunately, there’s not a simple formula for determining this because there are too many variables involved. However, most experienced SolrCloud users do agree that the only way to determine computing resources for your production cluster is to test with your own data and queries. So for this blog, I’m going to demonstrate how to provision the computing resources for a small cluster but you should know that the same process works for larger clusters. In fact, the toolkit was developed to enable large-scale testing of SolrCloud. I leave it as an exercise for the reader to do their own cluster-size planning.

If anyone offers you a fixed-rate SolrCloud, you should know they have calculated the cluster to be good for them, and, if possible, good for you.

You have been warned.

Multi level composite-id routing in SolrCloud

Monday, January 13th, 2014

Multi level composite-id routing in SolrCloud by Anshum Gupta.

From the post:

SolrCloud over the last year has evolved into a rather intelligent system with a lot of interesting and useful features going in. One of them has been the work for intelligent routing of documents and queries.

SolrCloud started off with a basic hash based routing in 4.0. It then got interesting with the composite id router being introduced with 4.1 which enabled smarter routing of documents and queries to achieve things like multi-tenancy and co-location. With 4.7, the 2-level composite id routing will be expanded to work for 3-levels (SOLR-5320).

A good post about how document routing generally works can be found here. Now, let’s look at how the composite-id routing extends to 3-levels and how we can really use it to query specific documents in our corpus.

An important thing to note here is that the 3-level router only extends the 2-level one. It’s the same router and the same Java class, i.e. you don’t really need to ‘set it up’.

Where would you want to use the multi-level composite-id router?

The multi-level implementation further extends the support for multi tenancy and co-location of documents provided by the already existing composite-id router. Consider a scenario where a single setup is used to host data for multiple applications (or departments) and each of them have a set of users. Each user further has documents associated with them. Using a 3-level composite-id router, a user can route the documents to the right shards at index time without having to really worry about the actual routing. This would also enable users to target queries for specific users or applications using the shard.keys parameter at query time.
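The scheme Anshum describes can be sketched in a few lines. The `!` separator and the `shard.keys` parameter are from the post; the tenant and user names below are invented:

```python
def composite_id(app, user, doc):
    """Build a 3-level composite document id. SolrCloud hashes the
    'app' and 'user' parts for routing, so one tenant's (or one
    user's) documents co-locate on the same shard."""
    return "%s!%s!%s" % (app, user, doc)

def route_prefix(app, user=None):
    """Value for the shard.keys request parameter: restrict a query
    to one application, or to one user within an application."""
    if user is None:
        return "%s!" % app
    return "%s!%s!" % (app, user)

# Index-time ids (hypothetical tenants):
doc_id = composite_id("hr", "alice", "doc42")   # 'hr!alice!doc42'

# Query-time routing, passed as shard.keys=...:
params = {"q": "report", "shard.keys": route_prefix("hr", "alice")}
```

The point of the post is that none of this needs separate router configuration; the id format alone drives the routing.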

Does that sound related to topic maps?

What if you remembered that “document” for Lucene means:

Documents are the unit of indexing and search. A Document is a set of fields. Each field has a name and a textual value. A field may be stored with the document, in which case it is returned with search hits on the document. Thus each document should typically contain one or more stored fields which uniquely identify it.

Probably not an efficient way to handle multiple identifiers but that depends on your use case.
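To make the multiple-identifiers point concrete, here is a toy model of Lucene’s “document is a set of fields” view. This is purely illustrative (the field names are invented), but it shows that nothing stops a document from carrying several stored identifier fields:

```python
# Toy model: a document is just a set of named fields, and nothing
# prevents repeating a field name for multiple identifiers.
class Field:
    def __init__(self, name, value, stored=True):
        self.name, self.value, self.stored = name, value, stored

class Document:
    def __init__(self, *fields):
        self.fields = list(fields)

    def get_values(self, name):
        return [f.value for f in self.fields if f.name == name]

# One subject, several identifiers, each just another stored field:
doc = Document(
    Field("identifier", "urn:isbn:0-13-110362-8"),
    Field("identifier", "http://example.com/k-and-r"),
    Field("title", "The C Programming Language"),
)
ids = doc.get_values("identifier")   # both identifiers come back
```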

Apache Lucene: Then and Now

Thursday, October 10th, 2013

Apache Lucene: Then and Now by Doug Cutting.

From the description at Washington DC Hadoop Users Group:

Doug Cutting originally wrote Lucene in 1997-8. It joined the Apache Software Foundation’s Jakarta family of open-source Java products in September 2001 and became its own top-level Apache project in February 2005. Until recently it included a number of sub-projects, such as Lucene.NET, Mahout, Solr and Nutch. Solr has merged into the Lucene project itself and Mahout, Nutch, and Tika have moved to become independent top-level projects. While suitable for any application which requires full text indexing and searching capability, Lucene has been widely recognized for its utility in the implementation of Internet search engines and local, single-site searching. At the core of Lucene’s logical architecture is the idea of a document containing fields of text. This flexibility allows Lucene’s API to be independent of the file format. Text from PDFs, HTML, Microsoft Word, and OpenDocument documents, as well as many others (except images), can all be indexed as long as their textual information can be extracted.

In today’s discussion, Doug will share background on the impetus and creation of Lucene. He will talk about the evolution of the project and explain what the core technology has enabled today. Doug will also share his thoughts on what the future holds for Lucene and Solr.

Interesting walk down history lane with the creator of Lucene, Doug Cutting.

Hard/Soft Commits, Transaction Logs (SolrCloud)

Saturday, August 24th, 2013

Understanding Transaction Logs, Soft Commit and Commit in SolrCloud by Erick Erickson.

From the post:

As of Solr 4.0, there is a new “soft commit” capability, and a new parameter for hard commits – openSearcher. Currently, there’s quite a bit of confusion about the interplay between soft and hard commit actions, and especially what it all means for the transaction log. The stock solrconfig.xml file explains the options, but with the usual documentation-in-example limits: if there were a full explanation of everything, the example file would be about 10MB and nobody would ever read through the whole thing. This article outlines the consequences of hard and soft commits and the new openSearcher option for hard commits. The release documentation can be found in the Solr Reference Guide; this post is a more leisurely overview of the topic. I persuaded a couple of the committers to give me some details. I’m sure I was told accurate information; any transcription errors are mine!

The mantra

Repeat after me: “Hard commits are about durability, soft commits are about visibility.” Hard and soft commits are related concepts, but serve different purposes. Concealed in this simple statement are many details; we’ll try to illuminate some of them.
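In stock solrconfig.xml, the two knobs Erick is describing look roughly like this. The interval values are illustrative, not recommendations:

```xml
<!-- Hard commit: flush updates to stable storage (durability).
     openSearcher=false means it does NOT make new documents
     visible to searchers. -->
<autoCommit>
  <maxTime>60000</maxTime>            <!-- every 60s, illustrative -->
  <openSearcher>false</openSearcher>
</autoCommit>

<!-- Soft commit: cheap, makes new documents visible (visibility),
     but does not flush the transaction log to disk. -->
<autoSoftCommit>
  <maxTime>1000</maxTime>             <!-- every 1s, illustrative -->
</autoSoftCommit>
```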

Interested? 😉

No harm in knowing the details. Could come in very handy.

Installing Distributed Solr 4 with Fabric

Thursday, May 23rd, 2013

Installing Distributed Solr 4 with Fabric by Martijn Koster

From the post:

Solr 4 has a subset of features that allow it to be run as a distributed, fault-tolerant cluster, referred to as “SolrCloud”. Installing and configuring Solr on a multi-node cluster can seem daunting when you’re a developer who just wants to give the latest release a try. The wiki page is long and complex, and configuring nodes manually is laborious and error-prone. And while your OS has ZooKeeper/Solr packages, they are probably outdated. But it doesn’t have to be a lot of work: in this post I will show you how to deploy and test a Solr 4 cluster using just a few commands, using mechanisms you can easily adjust for your own deployments.

I am using a cluster of virtual machines running Ubuntu 12.04 64-bit, controlled from my MacBook Pro. The Solr configuration will mimic the “Two shard cluster with shard replicas and zookeeper ensemble” example from the wiki.

You can run this on AWS EC2, but some special considerations apply, see the footnote.

We’ll use Fabric, a light-weight deployment tool that is basically a Python library to easily execute commands on remote nodes over ssh. Compared to Chef/Puppet it is simpler to learn and use, and because it’s an imperative approach it makes sequential orchestration of dependencies more explicit. Most importantly, it does not require a separate server or separate node-side software installation.
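Martijn’s actual scripts do much more, but the shape of a Fabric script is easy to sketch. Everything below is an invented example in the old Fabric 1.x `fabric.api` style, not his code; the hosts, user and service name are assumptions:

```python
# Minimal fabfile.py sketch (Fabric 1.x style, hypothetical hosts).
from fabric.api import env, run, sudo, task

env.hosts = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # your VMs here
env.user = "ubuntu"

@task
def solr_status():
    """Check that Solr answers on each node."""
    run("curl -s http://localhost:8983/solr/admin/ping | head -n 5")

@task
def rolling_restart():
    """Restart Solr one node at a time; Fabric iterates hosts serially,
    which is exactly what a rolling restart needs."""
    sudo("service solr restart")
```

You would then run `fab solr_status` or `fab rolling_restart` from your workstation, which is the imperative, explicitly sequenced style the post contrasts with Chef/Puppet.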

DISCLAIMER: these instructions and associated scripts are released under the Apache License; use at your own risk.

I strongly recommend you use disposable virtual machines to experiment with.

Something to get you excited about the upcoming weekend!


Battle of the Giants: Apache Solr 4.0 vs. ElasticSearch 0.20

Saturday, November 10th, 2012

Battle of the Giants: Apache Solr 4.0 vs. ElasticSearch 0.20 by Rafał Kuć.

A very nice summary (slides) of Rafał’s comparison of the latest releases of Solr and ElasticSearch.

They differ and those differences fit some use cases better than others.

And the winner is: … (well, I won’t spoil the surprise.)

Read the slides.

Unless you are Rafał, you will learn something you didn’t know before.

Solr vs ElasticSearch: Part 4 – Faceting

Tuesday, October 30th, 2012

Solr vs ElasticSearch: Part 4 – Faceting by Rafał Kuć.

From the post:

Solr 4 (aka SolrCloud) has just been released, so it’s the perfect time to continue our ElasticSearch vs. Solr series. In the last three parts of the ElasticSearch vs. Solr series we gave a general overview of the two search engines, about data handling, and about their full text search capabilities. In this part we look at how these two engines handle faceting.

Rafał continues his excellent comparison of Solr and ElasticSearch.

Understanding your software options is almost as important as understanding your data.

High Availability Search with SolrCloud

Sunday, October 28th, 2012

High Availability Search with SolrCloud by Brent Lemons.

Brent explains that using embedded ZooKeeper is useful for testing/learning SolrCloud, but high availability requires more.

As in separate installations of SolrCloud and ZooKeeper, both as high availability applications.

He walks through the steps to create and test such an installation.

If you have or expect to have a high availability search requirement, Brent’s post will be helpful.

Battle of the Giants: Apache Solr 4.0 vs ElasticSearch

Tuesday, September 25th, 2012

Battle of the Giants: Apache Solr 4.0 vs ElasticSearch

From the post:

Apache Solr 4.0 release is imminent and we have a heavily anticipated Solr vs. ElasticSearch blog post series going on. What better time to share that our Rafał Kuć will be giving a talk titled Battle of the giants: Apache Solr 4.0 vs ElasticSearch at the upcoming ApacheCon/Lucene EuroCon in Germany this November.


In this talk the audience will hear how the long-awaited Apache Solr 4.0 (aka SolrCloud) compares to the second search engine built on top of Apache Lucene – ElasticSearch: from the architectural differences and behavior in situations like split-brain, to cluster recovery; from distributed indexing and document distribution control, to handling multiple shards and replicas in a single cluster. During the talk, we will also compare the most used and anticipated features, such as facet handling and document grouping. At the end we will talk about performance differences, cluster monitoring and troubleshooting.

ApacheCon Europe 2012
Rhein-Neckar-Arena, Sinsheim, Germany
5–8 November 2012

Email, tweet, publicize ApacheCon Europe 2012!

Blog especially! A pale imitation but those of us unable to attend benefit from your posts!

Solr vs. ElasticSearch: Part 1 – Overview

Friday, August 24th, 2012

Solr vs. ElasticSearch: Part 1 – Overview by Rafał Kuć.

From the post:

A good Solr vs. ElasticSearch coverage is long overdue. We make good use of our own Search Analytics and pay attention to what people search for. Not surprisingly, lots of people are wondering when to choose Solr and when ElasticSearch.

As the Apache Lucene 4.0 release approaches and with it Solr 4.0 release as well, we thought it would be beneficial to take a deeper look and compare the two leading open source search engines built on top of Lucene – Apache Solr and ElasticSearch. Because the topic is very wide and can go deep, we are publishing our research as a series of blog posts starting with this post, which provides the general overview of the functionality provided by both search engines.

Rafał gets this series of posts off to a good start!

PS: Solr vs. ElasticSearch: Part 2 – Data Handling

Lucene Revolution 2012 – Slides/Videos

Thursday, June 7th, 2012

Lucene Revolution 2012 – Slides/Videos

The slides and videos from Lucene Revolution 2012 are up!

Now you don’t have to search for old re-runs on Hulu to watch during lunch!

Compare SearchBlox 7.0 vs. Solr

Monday, May 28th, 2012

Compare SearchBlox 7.0 vs. Solr by Timo Selvaraj.

From the post:

SearchBlox 7 is a (free) enterprise solution for website, ecommerce, intranet and portal search. The new 7.0 version makes it easy to add faceted search without the hassles of managing a schema and scales horizontally without any manual configuration or external software/scripts. SearchBlox enables you to achieve term, range and date based faceted search without manually maintaining a schema file as in Solr. SearchBlox enables you to have distributed indexing and searching abilities without using any separate scripts/programs as in SolrCloud. SearchBlox provides on-demand dynamic faceting of fields without specifying them through a config or script.

Expecting a comparison of SearchBlox 7.0 and Solr?

You are going to be disappointed.

Summary of what Timo thinks about SearchBlox 7.0.

Not a bad thing, just not a basis for comparison.

That you have to supply yourself.

I am going to take SearchBlox 7.0 for a spin later this week.

Solr 4 preview: SolrCloud, NoSQL, and more

Monday, May 21st, 2012

Solr 4 preview: SolrCloud, NoSQL, and more

From the post:

The first alpha release of Solr 4 is quickly approaching, bringing powerful new features to enhance existing Solr powered applications, as well as enabling new applications by further blurring the lines between full-text search and NoSQL.

The largest set of features goes by the development code-name “Solr Cloud” and involves bringing easy scalability to Solr. Distributed indexing with no single point of failure has been designed from the ground up for near real-time (NRT) search and for NoSQL features such as realtime-get, optimistic locking, and durable updates.

We’ve incorporated Apache ZooKeeper, the rock-solid distributed coordination project that is immune to issues like split-brain syndrome that tend to plague other hand-rolled solutions. ZooKeeper holds the Solr configuration, and contains the cluster meta-data such as hosts, collections, shards, and replicas, which are core to providing an elastic search capability.

When a new node is brought up, it will automatically be assigned a role such as becoming an additional replica for a shard. A bounced node can do a quick “peer sync” by exchanging updates with its peers in order to bring itself back up to date. New nodes, or those that have been down too long, recover by replicating the whole index of a peer while concurrently buffering any new updates.
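Two of the NoSQL features mentioned, realtime-get and optimistic locking, can be sketched with curl. The collection name, port and version value below are assumptions, not from the post:

```shell
# Realtime get: fetch a document by id even before a commit has made
# it visible to ordinary searches.
curl "http://localhost:8983/solr/collection1/get?id=book1"

# Optimistic locking: send the _version_ you last read; if another
# writer has updated the document in the meantime, Solr rejects the
# update with a version-conflict error instead of silently overwriting.
curl "http://localhost:8983/solr/collection1/update?commit=true" \
  -H "Content-Type: application/json" \
  -d '[{"id":"book1","title":"Solr 4","_version_":1234567890123456789}]'
```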

Run, don’t walk, to learn about the new features for Solr 4.

You won’t be disappointed.

Interested to see the “…blurring [of] the lines between full-text search and NoSQL.”

Would be even more interested to see the “…blurring of indexing and data/data formats.”

That is to say that data, along with its format, is always indexed in digital media.

So why can’t I see the data as a table, as a graph, as a …., depending upon my requirements?

No ETL, JVD – Just View Differently.

Suspect I will have to wait a while for that, but in the mean time, enjoy Solr 4 alpha.

Scaling Solr Indexing with SolrCloud, Hadoop and Behemoth

Monday, April 2nd, 2012

Scaling Solr Indexing with SolrCloud, Hadoop and Behemoth

Grant Ingersoll writes:

We’ve been doing a lot of work at Lucid lately on scaling out Solr, so I thought I would blog about some of the things we’ve been working on recently and how it might help you handle large indexes with ease. First off, if you want a more basic approach using versions of Solr prior to what will be Solr4 and you don’t care about scaling out Solr indexing to match Hadoop or being fault tolerant, I recommend you read Indexing Files via Solr and Java MapReduce. (Note, you could also modify that code to handle these things. If you need to do that, we’d be happy to help.)

Instead of doing all the extra work of making sure instances are up, etc., however, I am going to focus on using some of the new features of Solr4 (i.e. SolrCloud, whose development effort has been primarily led by several of my colleagues: Yonik Seeley, Mark Miller and Sami Siren), which remove the need to figure out where to send documents when indexing. I will pair that with a convenient Hadoop-based document processing toolkit, created by Julien Nioche, called Behemoth. Behemoth takes care of the need to write any Map/Reduce code and also handles things like extracting content from PDFs and Word files in a Hadoop-friendly manner (think Apache Tika run in Map/Reduce), while allowing you to output the results to things like Solr, Mahout or GATE, and to annotate the intermediary results. Behemoth isn’t super sophisticated in terms of ETL (Extract-Transform-Load) capabilities, but it is lightweight, easy to extend and gets the job done on Hadoop without you having to spend time worrying about writing mappers and reducers.

If you are pushing the boundaries of your Solr 3.* installation or just want to know more about Solr4, this post is for you.

A First Exploration Of SolrCloud

Thursday, February 9th, 2012

A First Exploration Of SolrCloud

From the post:

SolrCloud has recently been in the news and was merged into Solr trunk, so it was high time to have a fresh look at it.

The SolrCloud wiki page gives various examples but left a few things unclear for me. The examples only show Solr instances which host one core/shard, and it doesn’t go deep on the relation between cores, collections and shards, or how to manage configurations.

In this blog, we will have a look at an example where we host multiple shards per instance, and explain some details along the way.
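In released Solr 4.x the Collections API makes the multi-shards-per-instance layout the post builds by hand a one-liner. This is a hedged sketch (names and counts are illustrative, and `maxShardsPerNode` is needed when shards outnumber live nodes):

```shell
# Create a 4-shard collection on fewer than 4 nodes, so each Solr
# instance hosts several shards:
curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=demo&numShards=4&replicationFactor=1&maxShardsPerNode=4"
```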

If you have any interest in SolrCloud, this is a post for you. Forward it to your friends if they are interested in Solr. And family. Well, maybe not that last one. 😉

I have a weakness for posts that take the time to call out “…shard and slice are often used in ambiguous ways…,” establish a difference and then use those terms consistently.

One of the primary weaknesses of software projects is that “documentation” is treated with about the same concern as “requirements.”

The original programmers may understand the ambiguities, and if you want a cult program, that’s great. But successful software, that is, software that is widely used, has to be understood by as many programmers as possible. Possibly even by users, if it is an end-use product.

Think of it this way: You don’t want to be distracted from the next great software project by endless questions that you have to stop and answer. Do the documentation along the way and you won’t lose time on the next great project. Which we are all waiting to see. Documentation is a win-win situation.

The New SolrCloud: Overview

Thursday, February 2nd, 2012

The New SolrCloud: Overview by Rafał Kuć.

From the post:

Just the other day we wrote about Sensei, the new distributed, real-time full-text search database built on top of Lucene and here we are again writing about another “new” distributed, real-time, full-text search server also built on top of Lucene: SolrCloud.

In this post we’ll share some interesting SolrCloud bits and pieces that matter mostly to those working with large data and query volumes, but that all search lovers should find really interesting, too. If you have any questions about what we wrote (or did not write!) in this post, please leave a comment – we’re good at following up to comments! Or just ask @sematext!

Please note that functionality described in this post is now part of trunk in the Lucene and Solr SVN repository. This means that it will be available when Lucene and Solr 4.0 are released, but you can also use the trunk version just like we did, if you don’t mind living on the bleeding edge.

A good overview and one that could be useful with semi-technical management types.

If you need more details, see: SolrCloud wiki.

SolrCloud is Coming (and looking to mix in even more ‘NoSQL’)

Monday, January 23rd, 2012

SolrCloud is Coming (and looking to mix in even more ‘NoSQL’) by Mark Miller.

From the post:

The second phase of SolrCloud has been in full swing for a couple of months now and it looks like we are going to be able to commit this work to trunk very soon! In Phase 1 we built on top of Solr’s distributed search capabilities and added cluster state, central config, and built-in read-side fault tolerance. Phase 2 is even more ambitious and focuses on the write side. We are talking full-blown fault tolerance for reads and writes, near real-time support, real-time GET, true single-node durability, optimistic locking, cluster elasticity, improvements to the Phase 1 features, and more.

Once we get Phase 2 into trunk we will work on hardening and finishing a couple of missing features – then SolrCloud should be ready to be part of the upcoming Lucene/Solr 4.0 release.

If you want to read more about SolrCloud and where we are with Phase 2, check out the new wiki page that we are working on at – feedback appreciated!

Occurs to me that tweaking SolrCloud (or just Solr) might make a nice short course for library students. If not to become Solr mavens, just to get a better feel for the range of possibilities.

Distributed Indexing – SolrCloud

Saturday, January 7th, 2012

Distributed Indexing – SolrCloud

Not for the faint of heart, but I noticed that progress is being made on distributed indexing for the SolrCloud project.

Whether you are a hard core coder or someone who is interested in using this feature (read feedback), now would be a good time to start paying attention to this work.

I added a new category for “Distributed Indexing” because this isn’t only going to come up for Solr. And I suspect there are aspects of “distributed indexing” that are going to be applicable to distributed topic maps as well.