Simon says: Single Byte Norms are Dead!

From the post:

Apache Lucene turned 10 last year with a limitation that has bugged many users from day one. You may know Lucene’s core scoring model is based on TF/IDF (Vector Space Model). Lucene encapsulates all related calculations in a class called Similarity. Besides the pure TF/IDF factors, Similarity also provides a norm value per document that is, by default, a float value composed of length normalization and field boost. Nothing special so far! However, this float value is encoded into a single byte and written to the index. Lucene trades some precision for space on disk, and eventually in memory, since norms are loaded into memory per field upon first access.
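To make the precision loss concrete, here is a minimal sketch of squeezing a float norm into one byte. It is modeled on the 3-bit-mantissa style of encoding associated with Lucene's SmallFloat utility, but the class name `ByteNormSketch`, the method names, and the `1/sqrt(numTerms)` length norm are assumptions for illustration, not a copy of Lucene's source.

```java
class ByteNormSketch {
    // Assumed layout: 3 "mantissa" bits and a zero exponent of 15 (hypothetical constants).
    private static final int MANTISSA_BITS = 3;
    private static final int ZERO_EXP = 15;
    // Offset that maps the float exponent range onto the 8-bit range.
    private static final int FZERO = (63 - ZERO_EXP) << MANTISSA_BITS;

    // Encode a float into one byte by keeping only its top bits.
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        // Drop the low 21 bits of the IEEE-754 representation (truncation, not rounding).
        int small = bits >> (24 - MANTISSA_BITS);
        if (small <= FZERO) {
            // Zero and negatives map to 0; tiny positives map to the smallest non-zero code.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= FZERO + 0x100) {
            return (byte) -1; // overflow maps to the largest code (0xFF)
        }
        return (byte) (small - FZERO);
    }

    // Decode the byte back into a float; the dropped bits come back as zeros.
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - MANTISSA_BITS);
        bits += (63 - ZERO_EXP) << 24;
        return Float.intBitsToFloat(bits);
    }

    // Assumed default length normalization: 1 / sqrt(number of terms in the field).
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        for (int n : new int[] {7, 8, 9, 10, 11}) {
            float exact = lengthNorm(n);
            float stored = decode(encode(exact));
            System.out.println(n + " terms: exact=" + exact + " stored=" + stored);
        }
    }
}
```

Under these assumptions, fields with 8, 9 and 10 terms all end up with the same stored norm, so the scorer can no longer tell them apart. That is the kind of information a single byte throws away.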

In lots of cases this precision loss is a fair trade-off, but once you find yourself in a situation where you need to store more information based on statistics collected during indexing, you end up writing your own auxiliary data structure or “forking” Lucene for your app and messing with the source.

The upcoming version of Lucene has already added support for many more scoring models.

The abstractions added to Lucene to implement those models already open the door for applications that either want to roll their own “awesome” scoring model or modify the low-level scorer implementations. Yet, norms are still one byte!

Don’t worry! The post has a happy ending!

Read on if you want to be on the cutting edge of Lucene work.

Thanks Lucene Team!
