Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

July 29, 2012

Building a new Lucene postings format

Filed under: Indexing,Lucene — Patrick Durusau @ 10:08 am

Building a new Lucene postings format by Mike McCandless.

From the post:

As of 4.0 Lucene has switched to a new pluggable codec architecture, giving the application full control over the on-disk format of all index files. We have a nice collection of builtin codec components, and developers can create their own such as this recent example using a Redis back-end to hold updatable fields. This is an important change since it removes the previous sizable barriers to innovating on Lucene’s index formats.

A codec is actually a collection of formats, one for each part of the index. For example, StoredFieldsFormat handles stored fields, NormsFormat handles norms, etc. There are eight formats in total, and a codec could simply be a new mix of pre-existing formats, or perhaps you create your own TermVectorsFormat and otherwise use all the formats from the Lucene40 codec, for example.
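As a rough illustration of the "new mix of pre-existing formats" idea (my sketch, not code from Mike's post and not the Lucene API, which is Java), a codec can be modeled as a record with one slot per format. The slot names below are my reading of the eight formats; the `Codec` class, `lucene40_like`, `changed_slots`, and the `MyTermVectorsFormat` label are all hypothetical names made up for the example.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class Codec:
    """Toy model: a codec is a name plus one format per part of the index."""
    name: str
    postings: str
    doc_values: str
    stored_fields: str
    term_vectors: str
    field_infos: str
    segment_info: str
    norms: str
    live_docs: str


# Every field except the codec's own name is a format slot.
FORMAT_SLOTS = [f.name for f in fields(Codec) if f.name != "name"]


def lucene40_like() -> Codec:
    """Build a stand-in for a default codec; the labels are placeholders."""
    return Codec(name="Lucene40", **{slot: "Lucene40" for slot in FORMAT_SLOTS})


def changed_slots(base: Codec, other: Codec) -> list:
    """List the format slots where two codecs differ."""
    return [s for s in FORMAT_SLOTS if getattr(base, s) != getattr(other, s)]


base = lucene40_like()
# Swap in a custom term vectors format and keep every other format as-is.
custom = replace(base, name="MyCodec", term_vectors="MyTermVectorsFormat")
```

Here `changed_slots(base, custom)` reports only the term vectors slot, which is the point of the pluggable design: a new codec can differ from an existing one in a single format.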

At present, testing a new format requires specifying the entire format at once, which makes errors hard to diagnose.

Mike addresses that problem by creating a layered testing mechanism.

Great stuff!

PS: I think it will also be useful as an educational tool: change a defined format, then test as each change is made.
