Mike McCandless’s post, Lucene index in RAM with Azul’s Zing JVM, will help make your case for putting your index in RAM!
From the post:
Google’s entire index has been in RAM for at least 5 years now. Why not do the same with an Apache Lucene search index?
RAM has become very affordable recently, so for high-traffic sites the performance gains from holding the entire index in RAM should quickly pay for the up-front hardware cost.
The obvious approach is to load the index into Lucene’s RAMDirectory. Unfortunately, this class is known to put a heavy load on the garbage collector (GC): each file is naively held as a List of byte[1024] fragments (there are open Jira issues to address this but they haven’t been committed yet). It also has unnecessary synchronization. If the application is updating the index (not just searching), another challenge is how to persist ongoing changes from RAMDirectory back to disk. Startup is much slower as the index must first be loaded into RAM. Given these problems, Lucene developers generally recommend using RAMDirectory only for small indices or for testing purposes, and otherwise trusting the operating system to manage RAM by using MMapDirectory (see Uwe’s excellent post for more details).
Recently I heard about the Zing JVM, from Azul, which provides a pauseless garbage collector even for very large heaps. In theory the high GC load of RAMDirectory should not be a problem for Zing. Let’s test it! But first, a quick digression on the importance of measuring search response time of all requests.
There are obvious speed advantages to holding indexes in RAM.
Curious, is RAM just a quick disk? Or do we need to think about data structures/access differently with RAM? Pointers?
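One way to make the question concrete is to look at the two access styles the quoted post contrasts: copying a file onto the Java heap as many small byte arrays, versus memory-mapping it and letting the operating system decide what stays in RAM. The sketch below uses only the standard java.nio classes, not Lucene itself; the class name RamVersusMmap, its method names, and the temp file are hypothetical and exist only for illustration. It is not how RAMDirectory or MMapDirectory are actually implemented, just a rough picture of the difference.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

public class RamVersusMmap {
    // Fragment size chosen to mirror the byte[1024] fragments mentioned in the post.
    static final int FRAGMENT_SIZE = 1024;

    // Heap style: copy the whole file into a list of small byte[] fragments.
    // Every fragment is a separate heap object the garbage collector must track.
    static List<byte[]> loadIntoHeap(Path file) throws IOException {
        List<byte[]> fragments = new ArrayList<>();
        try (InputStream in = Files.newInputStream(file)) {
            while (true) {
                byte[] fragment = new byte[FRAGMENT_SIZE];
                int n = in.readNBytes(fragment, 0, FRAGMENT_SIZE);
                if (n == 0) break;              // end of file, nothing read
                fragments.add(fragment);
                if (n < FRAGMENT_SIZE) break;   // short final fragment
            }
        }
        return fragments;
    }

    // Random access into the fragment list: locate the fragment, then the offset.
    static byte readFromHeap(List<byte[]> fragments, long pos) {
        return fragments.get((int) (pos / FRAGMENT_SIZE))[(int) (pos % FRAGMENT_SIZE)];
    }

    // Mapped style: the bytes live outside the Java heap and the OS pages them
    // in on demand. A single MappedByteBuffer is limited to 2 GB.
    static MappedByteBuffer mapFile(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical stand-in for an index file.
        Path file = Files.createTempFile("fake-index", ".bin");
        file.toFile().deleteOnExit();
        byte[] data = new byte[5000];
        for (int i = 0; i < data.length; i++) data[i] = (byte) (i % 127);
        Files.write(file, data);

        List<byte[]> heapCopy = loadIntoHeap(file);
        MappedByteBuffer mapped = mapFile(file);

        long pos = 3000;
        System.out.println("heap fragments allocated: " + heapCopy.size());
        System.out.println("same byte via both paths: "
                + (readFromHeap(heapCopy, pos) == mapped.get((int) pos)));
    }
}
```

Both paths return the same bytes. What differs is who owns the memory: the heap copy pays an up-front load and leaves one object per kilobyte for the collector to manage, while the mapping starts immediately and leaves caching to the operating system.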