Engineering basic algorithms of an in-memory text search engine Authors: Frederik Transier, Peter Sanders Keywords: Inverted index, in-memory search engine, randomization
Abstract:
Inverted index data structures are the key to fast text search engines. We first investigate one of the predominant operation on inverted indexes, which asks for intersecting two sorted lists of document IDs of different lengths. We explore compression and performance of different inverted list data structures. In particular, we present Lookup, a new data structure that allows intersection in expected time linear in the smaller list.
Based on this result, we present the algorithmic core of a full text data base that allows fast Boolean queries, phrase queries, and document reporting using less space than the input text. The system uses a carefully choreographed combination of classical data compression techniques and inverted-index-based search data structures. Our experiments show that inverted indexes are preferable over purely suffix-array-based techniques for in-memory (English) text search engines.
A similar system is now running in practice in each core of the distributed data base engine TREX of SAP.
An interesting comparison of inverted indexes with suffix-arrays.
I am troubled by the reconstruct the input aspects of the paper.
While it is understandable and in some cases, more efficient, for data to be held in a localized data structure, my question is what do we do when data exceeds local storage capacity?
Think about the data held by Lexis/Nexis for example. Where would we put it while creating a custom data structure for its access?
There are data sets, important data sets, that have to be accessed in place.
And those data sets need to be addressed using topic maps.
*****
You may recall from the TAO paper by Steve Pepper the illustration of topics, associations and occurrences floating above a data set.
While topic map formats have been useful in many ways, they have distracted from the vision of topic maps as an information overlay as opposed to yet-another-format.
Formats are just that, formats. Pick one.