Topic-based Index Partitions for Efficient and Effective Selective Search
Authors: Anagha Kulkarni and Jamie Callan
Abstract:
Indexes for large collections are often divided into shards that are distributed across multiple computers and searched in parallel to provide rapid interactive search. Typically, all index shards are searched for each query. This paper investigates document allocation policies that permit searching only a few shards for each query (selective search) without sacrificing search quality. Three types of allocation policies (random, source-based and topic-based) are studied. K-means clustering is used to create topic-based shards. We manage the computational cost of applying these techniques to large datasets by defining topics on a subset of the collection. Experiments with three large collections demonstrate that selective search using topic-based shards reduces search costs by at least an order of magnitude without reducing search accuracy.
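To make the abstract's pipeline concrete, here is a minimal sketch of my own, not the authors' code. It learns topics on a sample of the collection, allocates every document to its nearest topic, and then searches only the top-ranked shards for a query. The function names, the bag-of-words vectors, the cosine-based k-means variant, and ranking shards by centroid similarity are all assumptions made for illustration; the paper's own features and shard-ranking method may differ.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (a deliberately crude document representation)."""
    return dict(Counter(text.lower().split()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def mean(vectors):
    """Component-wise average of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}


def nearest(vec, centroids):
    """Index of the centroid most similar to vec (first one wins ties)."""
    return max(range(len(centroids)), key=lambda i: cosine(vec, centroids[i]))


def learn_topics(sample_vecs, k, iterations=10):
    """K-means-style clustering on a SAMPLE only, to keep the cost manageable.

    Assumption: centroids are seeded with the first k sample documents so the
    result is deterministic; a real system would seed more carefully.
    """
    centroids = [dict(v) for v in sample_vecs[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in sample_vecs:
            groups[nearest(v, centroids)].append(v)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(g) if g else centroids[i] for i, g in enumerate(groups)]
    return centroids


def build_shards(docs, centroids):
    """Allocate every document in the full collection to its nearest topic."""
    shards = [[] for _ in centroids]
    for doc in docs:
        shards[nearest(vectorize(doc), centroids)].append(doc)
    return shards


def select_shards(query, centroids, n):
    """Rank shards by query-centroid similarity and return the top n indexes."""
    q = vectorize(query)
    ranked = sorted(range(len(centroids)),
                    key=lambda i: cosine(q, centroids[i]), reverse=True)
    return ranked[:n]
```

With six toy documents and topics learned from only the first two, a query such as "pet food" is routed to a single shard and the other shard is never touched. That is the cost saving the abstract describes, in miniature.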
What is unclear to me is whether mapping shards across independent, distinct collections that each have topic-based shards would be as effective.
That would depend on the similarity of the shards, but that similarity is measurable, not to mention mappable by a topic map.
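As a rough illustration of "measurable," here is one way shard similarity across two collections might be computed. It assumes each collection exposes a term-weight centroid per shard, which is my assumption and not something the paper provides. `map_shards` and its `threshold` parameter are hypothetical names. The resulting table is only the kind of association a topic map could record, not a topic map implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def map_shards(centroids_a, centroids_b, threshold=0.5):
    """Map each shard of collection A to its most similar shard in B.

    Returns {shard_index_in_a: (shard_index_in_b, similarity)}. Shards whose
    best match falls below the threshold are left unmapped. Assumes
    centroids_b is non-empty.
    """
    mapping = {}
    for i, ca in enumerate(centroids_a):
        scores = [cosine(ca, cb) for cb in centroids_b]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            mapping[i] = (best, scores[best])
    return mapping
```

Raising the threshold trades coverage for precision: fewer shards get mapped, but the mapped pairs are more likely to cover the same topic.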
It would be interesting if large collections started offering topic-based shard APIs to their contents, so that a distributed query could search only the shards that have been mapped as relevant to that query.