Lucene 4.0.0 alpha, at long last! by Mike McCandless.
Grabbing enough of the post to make you crazy until you read it in full (there’s lots more):
The 4.0.0 alpha release of Lucene and Solr is finally out!
This is a major release with lots of great changes. Here I briefly describe the most important Lucene changes, but first the basics:
- All deprecated APIs as of 3.6.0 have been removed.
- Pre-3.0 indices are no longer supported.
- MIGRATE.txt describes how to update your application code.
- The index format won’t change (unless a serious bug fix requires it) between this release and 4.0 GA, but APIs may still change before 4.0.0 beta.
Please try the release and report back!
Pluggable Codec
The biggest change is the new pluggable Codec architecture, which provides full control over how all elements (terms, postings, stored fields, term vectors, deleted documents, segment infos, field infos) of the index are written. You can create your own or use one of the provided codecs, and you can customize the postings format on a per-field basis.
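To make the per-field idea concrete, here is a minimal toy model in plain Java. It is not Lucene's API: `ToyCodec`, `ToyPostingsFormat`, `use` and `formatFor` are hypothetical names invented for this sketch, and the format names are just labels. It only illustrates the shape of "one default format, overridden for chosen fields".

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: these types are hypothetical and are NOT Lucene classes.
class PerFieldFormatSketch {

    // Stand-in for a postings format; a real one controls how postings are written.
    interface ToyPostingsFormat {
        String name();
    }

    static ToyPostingsFormat named(String name) {
        return () -> name;
    }

    // Stand-in for a codec that picks a postings format for each field.
    static class ToyCodec {
        private final ToyPostingsFormat defaultFormat;
        private final Map<String, ToyPostingsFormat> overrides = new HashMap<>();

        ToyCodec(ToyPostingsFormat defaultFormat) {
            this.defaultFormat = defaultFormat;
        }

        // Register a format for one field; returns this for chaining.
        ToyCodec use(String field, ToyPostingsFormat format) {
            overrides.put(field, format);
            return this;
        }

        // Fall back to the default when a field has no override.
        ToyPostingsFormat formatFor(String field) {
            ToyPostingsFormat format = overrides.get(field);
            return format != null ? format : defaultFormat;
        }
    }

    // Assumed setup: an in-RAM format for the primary key, the default elsewhere.
    static ToyCodec exampleCodec() {
        return new ToyCodec(named("Lucene40")).use("id", named("Memory"));
    }

    public static void main(String[] args) {
        ToyCodec codec = exampleCodec();
        System.out.println("id   -> " + codec.formatFor("id").name());
        System.out.println("body -> " + codec.formatFor("body").name());
    }
}
```

In a real application the choice would be made through Lucene's own codec classes; check the release's javadocs and MIGRATE.txt for the actual types and method names.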
There are some fun core codecs:
- Lucene40 is the default codec.
- Lucene3x (read-only) reads any index written with Lucene 3.x.
- SimpleText stores everything in plain text files (great for learning and debugging, but awful for production!).
- MemoryPostingsFormat stores all postings (terms, documents, positions, offsets) in RAM as a fast and compact FST, useful for fields with limited postings (primary key (id) field, date field, etc.)
- PulsingPostingsFormat inlines postings for low-frequency terms directly into the terms dictionary, saving a disk seek on lookup.
- AppendingCodec avoids seeking while writing, necessary for file-systems such as Hadoop DFS.
If you create your own Codec it’s easy to confirm all of Lucene/Solr’s tests pass with it. If tests fail then likely your Codec has a bug!
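The "pulsing" idea from the codec list (inline the postings of low-frequency terms into the terms dictionary) can be sketched with a toy in-memory model. Everything here is an assumption made for illustration: the class, the `cutoff` threshold, and the seek counter are hypothetical and say nothing about how PulsingPostingsFormat is actually implemented.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: hypothetical names, not Lucene's PulsingPostingsFormat.
class PulsingSketch {

    // A terms-dictionary entry: postings are either inlined or referenced.
    static final class Entry {
        final int[] inlineDocs;  // non-null when the postings were inlined
        final int postingsSlot;  // slot in the separate postings store, or -1

        Entry(int[] inlineDocs, int postingsSlot) {
            this.inlineDocs = inlineDocs;
            this.postingsSlot = postingsSlot;
        }
    }

    private final int cutoff;  // assumed threshold: max number of docs to inline
    private final Map<String, Entry> termsDict = new HashMap<>();
    private final List<int[]> postingsStore = new ArrayList<>();
    private int simulatedSeeks = 0;

    PulsingSketch(int cutoff) {
        this.cutoff = cutoff;
    }

    void addTerm(String term, int... docs) {
        if (docs.length <= cutoff) {
            // Low-frequency term: keep the postings inside the dictionary entry.
            termsDict.put(term, new Entry(docs.clone(), -1));
        } else {
            // Otherwise store the postings separately and keep only a pointer.
            postingsStore.add(docs.clone());
            termsDict.put(term, new Entry(null, postingsStore.size() - 1));
        }
    }

    int[] lookup(String term) {
        Entry entry = termsDict.get(term);
        if (entry == null) {
            return new int[0];
        }
        if (entry.inlineDocs != null) {
            return entry.inlineDocs;  // answered from the dictionary, no extra seek
        }
        simulatedSeeks++;  // stands in for a seek into the postings file
        return postingsStore.get(entry.postingsSlot);
    }

    int seekCount() {
        return simulatedSeeks;
    }

    public static void main(String[] args) {
        PulsingSketch dict = new PulsingSketch(1);
        dict.addTerm("id42", 7);       // primary-key style term, one document
        dict.addTerm("the", 1, 2, 3);  // common term, several documents
        dict.lookup("id42");
        System.out.println("seeks after id42: " + dict.seekCount());
        dict.lookup("the");
        System.out.println("seeks after the:  " + dict.seekCount());
    }
}
```

The point of the sketch is the branch in `lookup`: a primary-key lookup never touches the separate postings store, which is where the saved seek comes from.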
A new 4-dimensional postings API (to read fields, terms, documents, positions) replaces the previous postings API.
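The four dimensions can be pictured as four nested loops: fields, then terms, then documents, then positions. The sketch below walks a toy in-memory inverted index in that order. It uses only the Java standard library, and the class and method names are hypothetical rather than the new Lucene API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: a nested-map inverted index, not Lucene's postings API.
class FourDimensionalWalkSketch {

    // field -> term -> docId -> positions
    private final Map<String, Map<String, Map<Integer, List<Integer>>>> index = new TreeMap<>();

    // Naive whitespace tokenization; the position is the token's offset in the field.
    void addDocument(int docId, String field, String text) {
        String[] tokens = text.split(" ");
        for (int pos = 0; pos < tokens.length; pos++) {
            index.computeIfAbsent(field, f -> new TreeMap<>())
                 .computeIfAbsent(tokens[pos], t -> new TreeMap<>())
                 .computeIfAbsent(docId, d -> new ArrayList<>())
                 .add(pos);
        }
    }

    // Visit every (field, term, document, position) tuple in sorted order.
    List<String> walk() {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Map<String, Map<Integer, List<Integer>>>> f : index.entrySet()) {
            for (Map.Entry<String, Map<Integer, List<Integer>>> t : f.getValue().entrySet()) {
                for (Map.Entry<Integer, List<Integer>> d : t.getValue().entrySet()) {
                    for (int pos : d.getValue()) {
                        out.add(f.getKey() + "/" + t.getKey() + "/" + d.getKey() + "/" + pos);
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        FourDimensionalWalkSketch sketch = new FourDimensionalWalkSketch();
        sketch.addDocument(0, "body", "a b a");
        sketch.addDocument(1, "body", "b");
        for (String tuple : sketch.walk()) {
            System.out.println(tuple);
        }
    }
}
```

A real index would expose each level through its own enumerator rather than materialized maps, so a caller can stop at whichever dimension it needs.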
….
A good thing that tomorrow is a holiday in the U.S. 😉