Lucene 4.0: The Revolution by Simon Willnauer.
From the post:
The near-limitless innovative potential of a thriving open source community often has to be tempered by the need for a steady roadmap with version compatibility. As a result, once the decision to break backward compatibility in Lucene 4.0 had been made, it opened the floodgates on a host of step changes, which, together, will deliver a product whose performance is unrecognisable from previous 3.x releases.
One of the most significant changes in Lucene 4.0 is the full switch to using bytes (UTF8) in place of text strings for indexing within the search engine library. This change has improved the efficiency of a number of core processes: the ‘term dictionary’, used as a core part of the index, can now be loaded up to 30 times faster; it uses 10% of the memory; and search speeds are increased by removing the need for string conversion.
This switch to using bytes for indexing has also facilitated one of the main goals for Lucene 4.0, which is ‘flexible indexing’. The data structure for the index format can now be chosen and loaded into Lucene as a pluggable codec. As such, optimised codecs can be loaded to suit the indexing of individual datasets or even individual fields.
The performance enhancements through flexible indexing are highly case specific. However, flexible indexing introduces an entirely new dimension to the Lucene project. New indexing codecs can be developed and existing ones updated without the need for hard-coding within Lucene. There is no longer any need for project-level compromise on the best general-purpose index formats and data structures. A new field of specialised codec development can take place independently from development of the Lucene kernel.
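To make the bytes-instead-of-strings point concrete, here is a minimal sketch of a term dictionary held as raw UTF-8 bytes. It is not Lucene code: the class `Utf8Terms` and its methods are hypothetical names invented for illustration, and it uses only the Java standard library. Terms are encoded once, sorted by unsigned byte order, and then looked up by comparing bytes directly, with no string conversion at query time.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Lucene.
final class Utf8Terms {

    // Encode a term once, at indexing time, as UTF-8 bytes.
    static byte[] encode(String term) {
        return term.getBytes(StandardCharsets.UTF_8);
    }

    // Lexicographic comparison treating each byte as unsigned (0..255).
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    // Build a sorted in-memory dictionary of byte terms.
    static byte[][] buildDictionary(String... terms) {
        byte[][] dict = new byte[terms.length][];
        for (int i = 0; i < terms.length; i++) {
            dict[i] = encode(terms[i]);
        }
        Arrays.sort(dict, Utf8Terms::compare);
        return dict;
    }

    // Binary search over the sorted dictionary; returns the index or -1.
    static int lookup(byte[][] sortedTerms, byte[] query) {
        int lo = 0;
        int hi = sortedTerms.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int c = compare(sortedTerms[mid], query);
            if (c == 0) {
                return mid;
            }
            if (c < 0) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return -1;
    }
}
```

For example, `Utf8Terms.buildDictionary("search", "index", "café", "codec")` sorts "café" first, and `Utf8Terms.lookup(dict, Utf8Terms.encode("codec"))` finds its position by comparing bytes only. Lucene's real term dictionary is far more sophisticated; this only shows why working on bytes avoids repeated string handling.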
It looks like now is the time to start learning the new features of Lucene 4.0!
Flexible indexing! That sounds very cool.
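As a rough picture of what a per-field pluggable codec could look like, here is a small self-contained sketch. It does not use or mirror Lucene's real codec classes: `PostingsFormatSketch`, `FixedWidthFormat`, `DeltaVarIntFormat` and `PerFieldFormats` are all hypothetical names, and the sketch assumes doc IDs are non-negative and sorted in ascending order. The idea it illustrates is only that the on-disk encoding sits behind an interface and can be chosen field by field.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface, not Lucene's API: encodes sorted doc IDs to bytes and back.
interface PostingsFormatSketch {
    byte[] encode(int[] sortedDocIds);
    int[] decode(byte[] data, int count);
}

// General-purpose format: a fixed 4 bytes per doc ID.
final class FixedWidthFormat implements PostingsFormatSketch {
    public byte[] encode(int[] ids) {
        ByteBuffer buf = ByteBuffer.allocate(ids.length * 4);
        for (int id : ids) {
            buf.putInt(id);
        }
        return buf.array();
    }

    public int[] decode(byte[] data, int count) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        int[] ids = new int[count];
        for (int i = 0; i < count; i++) {
            ids[i] = buf.getInt();
        }
        return ids;
    }
}

// Specialised format: store gaps between IDs as variable-length bytes.
final class DeltaVarIntFormat implements PostingsFormatSketch {
    public byte[] encode(int[] ids) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int id : ids) {
            int delta = id - prev;
            // Emit 7 bits at a time; the high bit marks "more bytes follow".
            while ((delta & ~0x7F) != 0) {
                out.write((delta & 0x7F) | 0x80);
                delta >>>= 7;
            }
            out.write(delta);
            prev = id;
        }
        return out.toByteArray();
    }

    public int[] decode(byte[] data, int count) {
        int[] ids = new int[count];
        int pos = 0;
        int prev = 0;
        for (int i = 0; i < count; i++) {
            int delta = 0;
            int shift = 0;
            byte b;
            do {
                b = data[pos++];
                delta |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            prev += delta;
            ids[i] = prev;
        }
        return ids;
    }
}

// Registry that picks a format per field, with a fallback for unregistered fields.
final class PerFieldFormats {
    private final Map<String, PostingsFormatSketch> byField = new HashMap<>();
    private final PostingsFormatSketch fallback;

    PerFieldFormats(PostingsFormatSketch fallback) {
        this.fallback = fallback;
    }

    PerFieldFormats register(String field, PostingsFormatSketch format) {
        byField.put(field, format);
        return this;
    }

    PostingsFormatSketch forField(String field) {
        return byField.getOrDefault(field, fallback);
    }
}
```

With `new PerFieldFormats(new FixedWidthFormat()).register("body", new DeltaVarIntFormat())`, the "body" field gets the compact delta encoding while every other field falls back to the fixed-width one. Adding a new format means writing one more class that implements the interface, without touching the registry or the other formats, which is the kind of independence the post describes.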