Mike McCandless’s post, Lucene index in RAM with Azul’s Zing JVM, will help make your case for putting your index in RAM!
From the post:
Google’s entire index has been in RAM for at least 5 years now. Why not do the same with an Apache Lucene search index?
RAM has become very affordable recently, so for high-traffic sites the performance gains from holding the entire index in RAM should quickly pay for the up-front hardware cost.
The obvious approach is to load the index into Lucene’s RAMDirectory, right?
Unfortunately, this class is known to put a heavy load on the garbage collector (GC): each file is naively held as a List of byte[1024] fragments (there are open Jira issues to address this but they haven’t been committed yet). It also has unnecessary synchronization. If the application is updating the index (not just searching), another challenge is how to persist ongoing changes from RAMDirectory back to disk. Startup is much slower as the index must first be loaded into RAM. Given these problems, Lucene developers generally recommend using RAMDirectory only for small indices or for testing purposes, and otherwise trusting the operating system to manage RAM by using MMapDirectory (see Uwe’s excellent post for more details).
While there are open issues to improve RAMDirectory (LUCENE-4123 and LUCENE-3659), they haven’t been committed and many users simply use RAMDirectory anyway.
Recently I heard about the Zing JVM, from Azul, which provides a pauseless garbage collector even for very large heaps. In theory the high GC load of RAMDirectory should not be a problem for Zing. Let’s test it! But first, a quick digression on the importance of measuring search response time of all requests.
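To make the contrast in the quoted passage concrete, here is a rough sketch of the two storage styles it describes: many small heap-allocated fragments versus a memory-mapped file that the operating system pages in. This is not Lucene code. The class IndexStorageSketch and its methods are hypothetical names invented for illustration, and it uses only the standard java.nio API rather than RAMDirectory or MMapDirectory themselves.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; this is not how Lucene is implemented internally. */
class IndexStorageSketch {
    // Fragment size taken from the quoted description (byte[1024]).
    static final int FRAGMENT_SIZE = 1024;

    /**
     * Copies the data into fixed-size heap fragments, one byte[] object per 1 KB.
     * The last fragment is zero-padded. Every fragment is a separate object
     * that the garbage collector has to track.
     */
    static List<byte[]> toFragments(byte[] data) {
        List<byte[]> fragments = new ArrayList<>();
        for (int offset = 0; offset < data.length; offset += FRAGMENT_SIZE) {
            byte[] fragment = new byte[FRAGMENT_SIZE];
            int length = Math.min(FRAGMENT_SIZE, data.length - offset);
            System.arraycopy(data, offset, fragment, 0, length);
            fragments.add(fragment);
        }
        return fragments;
    }

    /** Reads one byte by locating its fragment first, then the offset inside it. */
    static byte readFragmented(List<byte[]> fragments, long position) {
        int fragmentIndex = (int) (position / FRAGMENT_SIZE);
        int offsetInFragment = (int) (position % FRAGMENT_SIZE);
        return fragments.get(fragmentIndex)[offsetInFragment];
    }

    /**
     * Maps the whole file read-only. The bytes live in the OS page cache,
     * outside the Java heap. A single mapping is limited to 2 GB.
     */
    static MappedByteBuffer mapReadOnly(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[4 * 1024 * 1024]; // 4 MB stand-in for an index file
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        Path file = Files.createTempFile("index-sketch", ".bin");
        file.toFile().deleteOnExit();
        Files.write(file, data);

        List<byte[]> fragments = toFragments(data);
        MappedByteBuffer mapped = mapReadOnly(file);

        int position = 3_000_000;
        System.out.println("heap objects for fragments: " + fragments.size());
        System.out.println("same byte from both: "
                + (readFragmented(fragments, position) == mapped.get(position)));
    }
}
```

Both paths return the same bytes. The difference is where the data lives: the fragment list puts one object per kilobyte on the heap, so the count grows with index size, while the mapped buffer leaves caching to the operating system, which is the trade-off the post attributes to MMapDirectory.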
There are obvious speed advantages to holding indexes in RAM.
I’m curious: is RAM just a quick disk? Or do we need to think about data structures and access differently with RAM? Pointers?