Searching relational content with Lucene’s BlockJoinQuery
Mike McCandless writes:
Lucene’s 3.4.0 release adds a new feature called index-time join (also sometimes called sub-documents, nested documents or parent/child documents), enabling efficient indexing and searching of certain types of relational content.
Most search engines can’t directly index relational content, as documents in the index logically behave like a single flat database table. Yet, relational content is everywhere! A job listing site has each company joined to the specific listings for that company. Each resume might have a separate list of skills, education and past work experience. A music search engine has an artist/band joined to albums and then joined to songs. A source code search engine would have projects joined to modules and then files.
Mike covers how to index relational content with Lucene 3.4.0, as well as the current limitations of that relational indexing. Work in progress is expected to resolve some of those limitations.
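To make the idea concrete, here is a minimal sketch of the block layout that an index-time join depends on: child documents are stored contiguously, followed by their parent, so a match on a child can be mapped to its parent with a single forward scan. This is a toy model in plain Java, not Lucene's API. The class `BlockJoinSketch`, its methods and the sample companies and listings are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Toy model of an index-time join. NOT Lucene code: every name here is
// hypothetical and exists only to illustrate the block layout.
final class BlockJoinSketch {

    // One flat "document"; a flag marks whether it is a parent or a child.
    static final class Doc {
        final boolean isParent;
        final String text;

        Doc(boolean isParent, String text) {
            this.isParent = isParent;
            this.text = text;
        }
    }

    // The flat index: each block is its children followed by their parent.
    private final List<Doc> docs = new ArrayList<>();

    // Adds one block: all children first, then the parent as the last entry.
    void addBlock(String parent, String... children) {
        for (String child : children) {
            docs.add(new Doc(false, child));
        }
        docs.add(new Doc(true, parent));
    }

    // Returns each parent that has at least one child matching the predicate.
    // Because blocks are contiguous, one forward pass is enough.
    List<String> parentsOfMatchingChildren(Predicate<String> childMatches) {
        List<String> hits = new ArrayList<>();
        boolean matchedInBlock = false;
        for (Doc doc : docs) {
            if (doc.isParent) {
                if (matchedInBlock) {
                    hits.add(doc.text);
                }
                matchedInBlock = false; // the next block starts fresh
            } else if (childMatches.test(doc.text)) {
                matchedInBlock = true;
            }
        }
        return hits;
    }

    // Sample data modeled on the job listing example: company -> listings.
    static BlockJoinSketch example() {
        BlockJoinSketch index = new BlockJoinSketch();
        index.addBlock("Acme", "java engineer", "sales manager");
        index.addBlock("Globex", "python engineer");
        index.addBlock("Initech", "java architect", "java tester");
        return index;
    }
}
```

Calling `BlockJoinSketch.example().parentsOfMatchingChildren(t -> t.contains("java"))` returns the companies that have at least one Java listing. The sketch also hints at one cost of joining at index time: the relationship is fixed by where the documents sit in the index, so a block has to be written as a unit.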
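This feature will be immediately useful wherever search results have to respect a parent/child relationship, as in the job listing, resume, music and source code examples above.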
Even more promising is the shift toward thinking about indexing as more than term -> document. Both sides of that operator need more granularity.