Broccoli: Semantic Full-Text Search at your Fingertips by Hannah Bast, Florian Bäurle, Björn Buchhold, Elmar Haussmann.
We present Broccoli, a fast and easy-to-use search engine for what we call semantic full-text search. Semantic full-text search combines the capabilities of standard full-text search and ontology search. The search operates on four kinds of objects: ordinary words (e.g., edible), classes (e.g., plants), instances (e.g., Broccoli), and relations (e.g., occurs-with or native-to). Queries are trees, where nodes are arbitrary bags of these objects, and arcs are relations. The user interface guides the user in incrementally constructing such trees by instant (search-as-you-type) suggestions of words, classes, instances, or relations that lead to good hits. Both standard full-text search and pure ontology search are included as special cases. In this paper, we describe the query language of Broccoli, a new kind of index that enables fast processing of queries from that language as well as fast query suggestion, the natural language processing required, and the user interface. We evaluated query times and result quality on the full version of the English Wikipedia (32 GB XML dump) combined with the YAGO ontology (26 million facts). We have implemented a fully functional prototype based on our ideas, see http://broccoli.informatik.uni-freiburg.de.
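To make the abstract's description of queries concrete, here is a minimal sketch of a query tree in Python. It is my own illustration, not Broccoli's actual query representation: the names QueryNode, add_arc, and count_nodes are hypothetical, and the instance "Europe" is an example I made up. Each node holds a bag of words, classes, and instances, and each arc is labeled with a relation.

```python
from dataclasses import dataclass, field


@dataclass
class QueryNode:
    """One node of a query tree: a bag of words, classes, and instances."""
    words: list = field(default_factory=list)
    classes: list = field(default_factory=list)
    instances: list = field(default_factory=list)
    # Outgoing arcs as (relation, child node) pairs.
    arcs: list = field(default_factory=list)

    def add_arc(self, relation, child):
        """Attach a child node via a named relation; returns self for chaining."""
        self.arcs.append((relation, child))
        return self


def count_nodes(node):
    """Count the nodes in the tree rooted at `node`."""
    return 1 + sum(count_nodes(child) for _, child in node.arcs)


# Hypothetical query: plants that occur with the word "edible"
# and are native to Europe.
root = QueryNode(classes=["plants"])
root.add_arc("occurs-with", QueryNode(words=["edible"]))
root.add_arc("native-to", QueryNode(instances=["Europe"]))

# A single node holding only words corresponds to the plain full-text special case.
full_text_only = QueryNode(words=["edible", "leaves"])
```

A lone node with only words and no arcs corresponds to the ordinary full-text case mentioned in the abstract.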
The most impressive part of an impressive paper was the new index, context lists.
The second idea, which is the main idea behind our new index, is to have what we call context lists instead of inverted lists. The context list for a prefix contains one index item per occurrence of a word starting with that prefix, just like the inverted list for that prefix would. But along with that it also contains one index item for each occurrence of an arbitrary entity in the same context as one of these words.
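The quoted definition can be sketched in a few lines of Python. This is only my reading of that one paragraph, under stated assumptions: a context is modeled as a list of (token, is_entity) pairs, the names Item and build_context_list are hypothetical, and the sketch ignores whatever ordering, scoring, or compression the actual index uses.

```python
from collections import namedtuple

# One index item: which context it came from, the token, and whether it is an entity.
Item = namedtuple("Item", ["context_id", "token", "is_entity"])


def build_context_list(contexts, prefix):
    """Build the context list for `prefix` (illustrative sketch only).

    `contexts` is a list of contexts; each context is a list of
    (token, is_entity) pairs. This layout is an assumption, not the paper's.
    """
    items = []
    for cid, tokens in enumerate(contexts):
        # One item per occurrence of a word starting with the prefix,
        # exactly as an inverted list for that prefix would have.
        word_hits = [Item(cid, tok, False)
                     for tok, is_entity in tokens
                     if not is_entity and tok.startswith(prefix)]
        if not word_hits:
            continue  # no matching word, so this context contributes nothing
        # Plus one item per entity occurrence in the same context.
        entity_hits = [Item(cid, tok, True)
                       for tok, is_entity in tokens
                       if is_entity]
        items.extend(word_hits + entity_hits)
    return items


# Made-up example data.
contexts = [
    [("edible", False), ("leaves", False), ("Broccoli", True)],
    [("editor", False), ("Vim", True), ("Emacs", True)],
    [("green", False), ("Cabbage", True)],
]
edi_list = build_context_list(contexts, "edi")
```

The point of the layout is that a single scan of the list for "edi" yields both the matching words and the entities that co-occur with them, with no separate join against an entity index.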
The performance numbers speak for themselves.
This should be a feature in the next release of Lucene/Solr, perhaps with a configurable limit on the number of entities that can appear in a “context list.”
Was it happenstance or a desire for simplicity that caused the original indexing engines to parse text into single tokens?
Literature references on that point?