From the post:
The solution was to build an indexing pipeline specifically to address this user requirement, by creating “virtual documents” about each member of staff. In this case, we used the Aspire content processing framework as it provided a lot more flexibility than the indexing pipeline of the incumbent search engine, and many of the components that were needed already existed in Aspire’s component library.
Merging was done selectively. For example, documents were identified that had been authored by the staff member concerned and from those documents, certain entities were extracted including customer names, dates and specific industry jargon. The information captured was kept in fields, and so could be searched in isolation if necessary.
The result was a new class of documents, which existed only in the search engine index, containing extended information about each member of staff; from basic data such as their billing rate, location, current availability and professional qualifications, through to a range of important concepts and keywords which described their previous work, and customer and industry sector knowledge.
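To make the merging step in the quote concrete, here is a minimal sketch in plain Python. It is not Aspire's API and not the authors' implementation: the function names, the field names (`customers`, `dates`, `jargon`), the keyword lists, and the toy extractor are all hypothetical, chosen only to show selective merging into a fielded, per-person "virtual document."

```python
import re

# Hypothetical lookup lists standing in for a real entity extractor.
KNOWN_CUSTOMERS = {"Acme", "Globex"}
KNOWN_JARGON = {"hedging", "underwriting"}
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_entities(text):
    """Toy extractor: returns the entities found in text, grouped by field."""
    words = {w.strip(".,;:") for w in text.split()}
    return {
        "customers": words & KNOWN_CUSTOMERS,
        "dates": set(DATE_PATTERN.findall(text)),
        "jargon": {w.lower() for w in words} & KNOWN_JARGON,
    }


def build_virtual_documents(staff_records, documents):
    """Build one fielded record per staff member from the documents they authored."""
    virtual = {}
    for staff_id, profile in staff_records.items():
        # Start from basic profile data (location, billing rate, ...).
        virtual[staff_id] = {
            "profile": dict(profile),
            "customers": set(),
            "dates": set(),
            "jargon": set(),
        }

    for doc in documents:
        author = doc.get("author")
        if author not in virtual:
            continue  # selective merge: skip documents not authored by known staff
        entities = extract_entities(doc["text"])
        for field in ("customers", "dates", "jargon"):
            virtual[author][field].update(entities[field])

    # Convert sets to sorted lists so each field is ready to index separately.
    for record in virtual.values():
        for field in ("customers", "dates", "jargon"):
            record[field] = sorted(record[field])
    return virtual
```

Each entity type lands in its own field, so it can be searched in isolation, as the quote describes. The resulting records exist only as index entries; nothing like them appears in the source collection.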
Another tool to put in your belt, but I wonder if there is a deeper lesson to be learned here.
Creating a “virtual” document, unlike any that existed in the target collection, and indexing those “virtual” documents was a clever solution.
But it retains the notion of a “container” or “document” that is examined in isolation from all other “documents.”
Is that necessary? What are we missing if we retain it?
I don’t have any answers to those questions, but I will be thinking about them.