Cleo: Flexible, partial, out-of-order and real-time typeahead search

Saturday, December 24th, 2011

From the webpage:

Cleo is a flexible software library for enabling rapid development of partial, out-of-order and real-time typeahead and autocomplete services. It is suitable for data sets of various sizes from different domains. The Cleo software library is published under the terms of the Apache Software License version 2.0, a copy of which has been included in the LICENSE file shipped with the Cleo distribution.

Not to be mistaken with query autocomplete, Cleo does not suggest search terms or queries. Cleo is a library for developing applications that can perform real typeahead queries and deliver instantaneous typeahead results/objects/elements as you type.

Cleo is also different from general-purpose search libraries because 1) it does not evaluate search terms but the prefixes of those terms, and 2) it enables search by means of Bloom Filter and forward indexes rather than inverted indexes.
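To make the second point concrete, here is a minimal Python sketch of prefix search driven by a Bloom filter plus a forward index. It is an illustration of the general idea only, not Cleo's API or implementation: the names `TypeaheadIndex`, `prefix_bit`, `MAX_PREFIX_LEN` and `FILTER_BITS` are all hypothetical, and the 64-bit, single-hash filter is an assumption chosen for brevity.

```python
import zlib

MAX_PREFIX_LEN = 4  # hypothetical: only prefixes up to this length enter the filter
FILTER_BITS = 64    # hypothetical: filter width in bits


def prefix_bit(prefix):
    """Map a prefix to one bit of the filter (a single, deterministic hash)."""
    return 1 << (zlib.crc32(prefix.encode("utf-8")) % FILTER_BITS)


def bloom_filter(terms):
    """Set one bit for every short prefix of every term."""
    bits = 0
    for term in terms:
        for n in range(1, min(len(term), MAX_PREFIX_LEN) + 1):
            bits |= prefix_bit(term[:n])
    return bits


class TypeaheadIndex:
    def __init__(self):
        self.forward = {}  # element id -> list of terms (the forward index)
        self.filters = {}  # element id -> Bloom filter bits over term prefixes

    def add(self, element_id, text):
        terms = text.lower().split()
        self.forward[element_id] = terms
        self.filters[element_id] = bloom_filter(terms)

    def search(self, query):
        prefixes = query.lower().split()
        if not prefixes:
            return []
        wanted = 0
        for p in prefixes:
            wanted |= prefix_bit(p[:MAX_PREFIX_LEN])
        results = []
        for element_id, bits in self.filters.items():
            # Cheap rejection: a missing bit means some prefix cannot match.
            if (bits & wanted) != wanted:
                continue
            # The filter can give false positives, so verify against the
            # forward index. Query order does not matter.
            terms = self.forward[element_id]
            if all(any(t.startswith(p) for t in terms) for p in prefixes):
                results.append(element_id)
        return results


index = TypeaheadIndex()
index.add(1, "the cat and the hat")
index.add(2, "the cow says moo")
print(index.search("ha ca"))  # partial, out-of-order prefixes
```

Every element is scanned, but most are discarded by a single bitwise test; only the survivors have their stored terms checked. No inverted index from terms to elements is ever built.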

You may be amused by the definition of “forward index” offered by NIST:

An index into a set of texts. This is usually created as the first step to making an inverted index.

Or perhaps more usefully from the Wikipedia entry on Index (Search Engine):

The forward index stores a list of words for each document. The following is a simplified form of the forward index:

Forward Index

Document      Words
Document 1    the, cow, says, moo
Document 2    the, cat, and, the, hat
Document 3    the, dish, ran, away, with, the, spoon

The rationale behind developing a forward index is that as documents are parsed, it is better to immediately store the words per document. The delineation enables asynchronous system processing, which partially circumvents the inverted index update bottleneck.[19] The forward index is sorted to transform it to an inverted index. The forward index is essentially a list of pairs consisting of a document and a word, collated by the document. Converting the forward index to an inverted index is only a matter of sorting the pairs by the words. In this regard, the inverted index is a word-sorted forward index.
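The conversion described in that passage can be sketched in a few lines of Python. This is only an illustration using the three sample documents from the Wikipedia table; the function name `invert` is my own, and dropping repeated words within a document is an assumption made to keep the output tidy.

```python
from itertools import groupby

# Forward index: each document mapped to its words, in document order.
forward_index = {
    "Document 1": ["the", "cow", "says", "moo"],
    "Document 2": ["the", "cat", "and", "the", "hat"],
    "Document 3": ["the", "dish", "ran", "away", "with", "the", "spoon"],
}


def invert(forward):
    """Build an inverted index by sorting (word, document) pairs by word."""
    # Flatten to pairs; the set drops repeats such as "the" in Document 2.
    pairs = sorted({(word, doc) for doc, words in forward.items() for word in words})
    # After sorting, all pairs for one word are adjacent, so group them.
    return {
        word: [doc for _, doc in group]
        for word, group in groupby(pairs, key=lambda pair: pair[0])
    }


inverted_index = invert(forward_index)
print(inverted_index["the"])
print(inverted_index["cat"])
```

The sort is the whole trick: once the pairs are ordered by word, each word's documents sit next to each other and can be collected in one pass.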

So, was it an indexing performance issue that led to the use of a “forward index,” or was it some other capability?

Suggestions on what “typeahead search” would/could mean in a topic map context?