Apache Lucene 5.0.0 « Another Word For It

Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

February 22, 2015

Apache Lucene 5.0.0

Filed under: Indexing,Lucene — Patrick Durusau @ 7:35 pm


For the impatient:

http://lucene.apache.org/core/mirrors-core-latest-redir.html

Lucene CHANGES.txt

From the post:

Highlights of the Lucene release include:

Stronger index safety

All file access now uses Java’s NIO.2 APIs, which give Lucene stronger index safety through better error handling and safer commits.

Every Lucene segment now stores a unique id per-segment and per-commit to aid in accurate replication of index files.

During merging, IndexWriter now always checks the incoming segments for corruption before merging. This can mean, on upgrading to 5.0.0, that merging may uncover long-standing latent corruption in an older 4.x index.

Reduced heap usage

Lucene now supports random-writable and advance-able sparse bitsets (RoaringDocIdSet and SparseFixedBitSet), so the heap required is in proportion to how many bits are set, not how many total documents exist in the index.
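To make the heap claim concrete, here is a toy sparse bitset. It is only a sketch of the general idea (store nothing for empty regions), not the implementation of RoaringDocIdSet or SparseFixedBitSet; the class name ToySparseBitSet and its map-of-words layout are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Toy sparse bitset: only 64-bit words that contain at least one set bit
// are stored, so memory grows with the number of set bits rather than
// with the size of the document id space.
// Hypothetical illustration; this is NOT Lucene's SparseFixedBitSet.
final class ToySparseBitSet {
    // Maps word index (docId / 64) to the 64 bits of that word.
    private final Map<Integer, Long> words = new HashMap<>();

    // Marks a non-negative document id as set.
    void set(int docId) {
        int wordIndex = docId >>> 6;       // docId / 64
        long mask = 1L << (docId & 63);    // bit position inside the word
        words.merge(wordIndex, mask, (a, b) -> a | b);
    }

    // Returns true if the document id was set earlier.
    boolean get(int docId) {
        Long word = words.get(docId >>> 6);
        return word != null && (word & (1L << (docId & 63))) != 0;
    }

    // Number of 64-bit words actually allocated.
    int allocatedWords() {
        return words.size();
    }

    public static void main(String[] args) {
        ToySparseBitSet bits = new ToySparseBitSet();
        bits.set(3);
        bits.set(5);
        bits.set(1_000_000_000);
        System.out.println(bits.get(5));            // true
        System.out.println(bits.allocatedWords());  // 2
    }
}
```

Three set bits cost two words here, even though the largest id is one billion. A dense bitset over the same id space would need roughly 125 MB.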

Heap usage during IndexWriter merging is also much lower with the new Lucene50Codec, since doc values and norms for the segments being merged are no longer fully loaded into heap for all fields; now they are loaded for the one field currently being merged, and then dropped.

The default norms format now uses sparse encoding when appropriate, so indices that enable norms for many sparse fields will see a large reduction in required heap at search time.

5.0 has a new API to print a tree structure showing a recursive breakdown of which parts are using how much heap.

Other features

FieldCache is gone (moved to a dedicated UninvertingReader in the misc module). This means that when you intend to sort on a field, you should index that field using doc values, which is much faster and uses less heap than FieldCache.

Tokenizers and Analyzers no longer require Reader on init.

NormsFormat now gets its own dedicated NormsConsumer/Producer.

SortedSetSortField, used to sort on a multi-valued field, is promoted from sandbox to Lucene’s core.

PostingsFormat now uses a “pull” API when writing postings, just like doc values. This is powerful because you can do things in your postings format that require making more than one pass through the postings such as iterating over all postings for each term to decide which compression format it should use.

The new DateRangeField type enables indexing and searching of date ranges, particularly multi-valued ones.

A new ExitableDirectoryReader extends FilterDirectoryReader and enables exiting requests that take too long to enumerate over terms.
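The idea behind an exitable enumeration can be sketched without Lucene: wrap an iterator and check a deadline before handing out each element. The sketch below is a generic illustration under that assumption. TimeLimitedIterator, TimeExceededException, and the injectable clock are hypothetical names, not part of the Lucene API.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.LongSupplier;

// Wraps an iterator and aborts the enumeration once a time budget is used up.
// Hypothetical sketch of the "exit when it takes too long" idea only.
final class TimeLimitedIterator<T> implements Iterator<T> {

    // Thrown when the enumeration runs past its deadline.
    static final class TimeExceededException extends RuntimeException {
        TimeExceededException(String message) {
            super(message);
        }
    }

    private final Iterator<T> delegate;
    private final LongSupplier clockNanos;  // injectable so tests can fake time
    private final long deadlineNanos;

    TimeLimitedIterator(Iterator<T> delegate, LongSupplier clockNanos, long budgetNanos) {
        this.delegate = delegate;
        this.clockNanos = clockNanos;
        this.deadlineNanos = clockNanos.getAsLong() + budgetNanos;
    }

    @Override
    public boolean hasNext() {
        return delegate.hasNext();
    }

    @Override
    public T next() {
        // Check the clock before every element; subtraction keeps the
        // comparison safe if the nano clock wraps around.
        if (clockNanos.getAsLong() - deadlineNanos > 0) {
            throw new TimeExceededException("enumeration exceeded its time budget");
        }
        return delegate.next();
    }

    public static void main(String[] args) {
        Iterator<String> terms = Arrays.asList("apache", "lucene", "solr").iterator();
        Iterator<String> limited =
            new TimeLimitedIterator<>(terms, System::nanoTime, 1_000_000_000L);
        while (limited.hasNext()) {
            System.out.println(limited.next());
        }
    }
}
```

Checking on every call keeps the sketch simple; a real implementation might check less often to reduce clock overhead.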

Suggesters can now be built from multi-valued fields, because DocumentDictionary now enumerates each value in a multi-valued field separately.

ConcurrentMergeScheduler now detects whether the index is on an SSD and chooses better default settings accordingly. This only works on Linux for now; other operating systems continue to use the previous defaults (tuned for spinning disks).

Auto-IO-throttling has been added to ConcurrentMergeScheduler, to rate limit IO writes for each merge depending on incoming merge rate.
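A rough sketch of what adaptive write throttling can look like: raise the allowed write rate when merges pile up, lower it when they do not, and pause writers that run ahead of the current rate. This is not ConcurrentMergeScheduler's code. The class ToyMergeThrottle, the 1.2 and 0.9 adjustment factors, and the bounds in the example are all assumptions chosen for illustration.

```java
// Toy adaptive IO throttle for merge writes.
// Hypothetical sketch; the factors and limits are illustrative assumptions,
// not values taken from Lucene.
final class ToyMergeThrottle {
    private final double minMbPerSec;
    private final double maxMbPerSec;
    private double mbPerSec;

    ToyMergeThrottle(double startMbPerSec, double minMbPerSec, double maxMbPerSec) {
        this.mbPerSec = startMbPerSec;
        this.minMbPerSec = minMbPerSec;
        this.maxMbPerSec = maxMbPerSec;
    }

    // Called whenever a new merge arrives. If merges are backing up,
    // allow faster writes; otherwise slow down to leave IO for searches.
    void onMergeArrived(boolean backlogged) {
        if (backlogged) {
            mbPerSec = Math.min(maxMbPerSec, mbPerSec * 1.2);
        } else {
            mbPerSec = Math.max(minMbPerSec, mbPerSec * 0.9);
        }
    }

    // How long a writer should pause after writing bytesWritten in
    // elapsedNanos so that its average rate stays at the current limit.
    long pauseNanos(long bytesWritten, long elapsedNanos) {
        double bytesPerSec = mbPerSec * 1024 * 1024;
        long targetNanos = (long) (bytesWritten / bytesPerSec * 1_000_000_000.0);
        return Math.max(0L, targetNanos - elapsedNanos);
    }

    double mbPerSec() {
        return mbPerSec;
    }

    public static void main(String[] args) {
        ToyMergeThrottle throttle = new ToyMergeThrottle(20, 5, 100);
        long twentyMb = 20L * 1024 * 1024;
        // 20 MB written in 0.4 s at a 20 MB/s limit: pause for the rest of the second.
        System.out.println(throttle.pauseNanos(twentyMb, 400_000_000L));
        throttle.onMergeArrived(true);
        System.out.println(throttle.mbPerSec());
    }
}
```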

A new CustomAnalyzer lets you configure analyzers the way you do in Solr’s index schema. It has a builder API for configuring Tokenizers, TokenFilters, and CharFilters by their SPI names and parameters, as documented by the corresponding factories.

Memory index now supports payloads.

Added a filter cache with a usage tracking policy that caches filters based on frequency of use.
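One way to picture a usage tracking policy: remember the last N filters seen and cache a filter only once it has shown up often enough in that window. The sketch below follows that reading. The class ToyUsageTrackingPolicy, its window size, and its threshold are hypothetical; Lucene's actual policy may differ in its details.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;

// Toy frequency-based caching policy: a key becomes worth caching once it
// has been used at least minFrequency times within the last historySize uses.
// Hypothetical sketch, not Lucene's implementation.
final class ToyUsageTrackingPolicy<K> {
    private final int historySize;
    private final int minFrequency;
    private final ArrayDeque<K> history = new ArrayDeque<>();
    private final Map<K, Integer> counts = new HashMap<>();

    ToyUsageTrackingPolicy(int historySize, int minFrequency) {
        this.historySize = historySize;
        this.minFrequency = minFrequency;
    }

    // Records one use of the key and forgets the oldest use when the
    // sliding window is full.
    void onUse(K key) {
        history.addLast(key);
        counts.merge(key, 1, Integer::sum);
        if (history.size() > historySize) {
            K oldest = history.removeFirst();
            // Returning null removes the entry once its count drops to zero.
            counts.computeIfPresent(oldest, (k, c) -> c == 1 ? null : c - 1);
        }
    }

    // True if the key has been used often enough recently to be cached.
    boolean shouldCache(K key) {
        return counts.getOrDefault(key, 0) >= minFrequency;
    }

    public static void main(String[] args) {
        ToyUsageTrackingPolicy<String> policy = new ToyUsageTrackingPolicy<>(4, 2);
        policy.onUse("type:book");
        System.out.println(policy.shouldCache("type:book"));  // false
        policy.onUse("type:book");
        System.out.println(policy.shouldCache("type:book"));  // true
    }
}
```

The sliding window means a filter that was popular a while ago but is no longer used stops qualifying, so one-off filters never fill the cache.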

The default codec has an option to control BEST_SPEED or BEST_COMPRESSION for stored fields.

Stored fields are merged more efficiently, especially when upgrading from previous versions or using SortingMergePolicy.
