Building a new Lucene postings format by Mike McCandless.
From the post:
As of 4.0 Lucene has switched to a new pluggable codec architecture, giving the application full control over the on-disk format of all index files. We have a nice collection of builtin codec components, and developers can create their own such as this recent example using a Redis back-end to hold updatable fields. This is an important change since it removes the previous sizable barriers to innovating on Lucene’s index formats.
A codec is actually a collection of formats, one for each part of the index. For example, StoredFieldsFormat handles stored fields, NormsFormat handles norms, etc. There are eight formats in total, and a codec could simply be a new mix of pre-existing formats, or perhaps you create your own TermVectorsFormat and otherwise use all the formats from the Lucene40 codec, for example.
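To make the quoted idea concrete, here is a minimal sketch (mine, not from Mike's post) of a codec as a collection of per-part formats, where a new codec reuses everything from a base codec except one part. It is a toy model and does not use Lucene's classes: `Format`, `ToyCodec`, `withFormat`, and the part names are all hypothetical, and only three of the eight parts are modeled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for one per-part format (stored fields, norms, ...).
// This is NOT Lucene's API; it only models the idea of the quoted paragraph.
interface Format {
    String name();
}

// Toy codec: maps each part of the index to the format that handles it.
class ToyCodec {
    private final String name;
    private final Map<String, Format> formats;

    ToyCodec(String name, Map<String, Format> formats) {
        this.name = name;
        this.formats = new LinkedHashMap<>(formats); // defensive copy
    }

    String name() {
        return name;
    }

    Format formatFor(String part) {
        Format format = formats.get(part);
        if (format == null) {
            throw new IllegalArgumentException("No format for part: " + part);
        }
        return format;
    }

    // Builds a new codec that reuses every format except the one for `part`.
    ToyCodec withFormat(String newName, String part, Format replacement) {
        if (!formats.containsKey(part)) {
            throw new IllegalArgumentException("Unknown part: " + part);
        }
        Map<String, Format> copy = new LinkedHashMap<>(formats);
        copy.put(part, replacement);
        return new ToyCodec(newName, copy);
    }
}

class CodecMixDemo {
    // Assumed base codec with three illustrative parts.
    static ToyCodec baseCodec() {
        Map<String, Format> parts = new LinkedHashMap<>();
        parts.put("storedFields", () -> "BaseStoredFields");
        parts.put("norms", () -> "BaseNorms");
        parts.put("termVectors", () -> "BaseTermVectors");
        return new ToyCodec("Base", parts);
    }

    public static void main(String[] args) {
        ToyCodec base = baseCodec();
        // Swap in a custom term vectors format; keep everything else.
        ToyCodec custom = base.withFormat("Custom", "termVectors", () -> "MyTermVectors");
        System.out.println(custom.formatFor("termVectors").name());
        System.out.println(custom.formatFor("norms").name());
    }
}
```

Because `withFormat` copies the map, the base codec is left unchanged, which mirrors the "new mix of pre-existing formats" idea: the custom codec differs from the base in exactly one part.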
Currently, testing a format requires specifying the entire format, which makes errors hard to diagnose.
Mike addresses that problem by creating a layered testing mechanism.
Great stuff!
PS: I think it will also be useful as an educational tool: change a defined format, then test as each change is made.