Archive for the ‘Faceted Search’ Category

Dynamic faceting with Lucene

Wednesday, May 22nd, 2013

Dynamic faceting with Lucene by Michael McCandless.

From the post:

Lucene’s facet module has seen some great improvements recently: sizable (nearly 4X) speedups and new features like DrillSideways. The Jira issues search example showcases a number of facet features. Here I’ll describe two recently committed facet features: sorted-set doc-values faceting, already available in 4.3, and dynamic range faceting, coming in the next (4.4) release.

To understand these features, and why they are important, we first need a little background. Lucene’s facet module does most of its work at indexing time: for each indexed document, it examines every facet label, each of which may be hierarchical, and maps each unique label in the hierarchy to an integer id, and then encodes all ids into a binary doc values field. A separate taxonomy index stores this mapping, and ensures that, even across segments, the same label gets the same id.

At search time, faceting cost is minimal: for each matched document, we visit all integer ids and aggregate counts in an array, summarizing the results in the end, for example as top N facet labels by count.

This is in contrast to purely dynamic faceting implementations like ElasticSearch’s and Solr’s, which do all work at search time. Such approaches are more flexible: you need not do anything special during indexing, and for every query you can pick and choose exactly which facets to compute.

However, the price for that flexibility is slower searching, as each search must do more work for every matched document. Furthermore, the impact on near-real-time reopen latency can be horribly costly if top-level data-structures, such as Solr’s UnInvertedField, must be rebuilt on every reopen. The taxonomy index used by the facet module means no extra work needs to be done on each near-real-time reopen.

The dynamic range faceting sounds particularly useful.
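As I read it, dynamic range faceting means the ranges are supplied with the query rather than fixed at indexing time, and counts are computed over a numeric value stored per document. A rough Python sketch under that assumption follows; the function name, the `age_hours` values and the range labels are made up for illustration and are not Lucene's API.

```python
def count_ranges(values, matched_docs, ranges):
    """Count matched documents per range; ranges are (label, low, high), high exclusive."""
    counts = {label: 0 for label, _, _ in ranges}
    for doc in matched_docs:
        value = values[doc]
        for label, low, high in ranges:
            # Ranges may overlap, so one document can count toward several labels.
            if low <= value < high:
                counts[label] += 1
    return counts


# Hypothetical per-document value: age of each document in hours.
age_hours = [0.5, 3, 30, 200, 12]

# The ranges are chosen at search time, so a different query could use different ones.
ranges = [("past hour", 0, 1), ("past day", 0, 24), ("past week", 0, 168)]

print(count_ranges(age_hours, [0, 1, 2, 3, 4], ranges))
# {'past hour': 1, 'past day': 3, 'past week': 4}
```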

Fun with Lucene’s faceted search module

Sunday, December 9th, 2012

Fun with Lucene’s faceted search module by Mike McCandless.

From the post:

These days faceted search and navigation is common and users have come to expect and rely upon it.

Lucene’s facet module, first appearing in the 3.4.0 release, offers a powerful implementation, making it trivial to add a faceted user interface to your search application. Shai Erera wrote up a nice overview here and worked through nice “getting started” examples in his second post.

The facet module has not been integrated into Solr, which has an entirely different implementation, nor into ElasticSearch, which also has its own entirely different implementation. Bobo is yet another facet implementation! I’m sure there are more…

The facet module can compute the usual counts for each facet, but also has advanced features such as aggregates other than hit count, sampling (for better performance when there are many hits) and complements aggregation (for better performance when the number of hits is more than half of the index). All facets are hierarchical, so the app is free to index an arbitrary tree structure for each document. With the upcoming 4.1, the facet module will fully support near-real-time (NRT) search.

Take some time over the holidays to play with faceted searches in Lucene.
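The complements trick is easy to try out on paper. A small Python sketch, assuming it works by keeping total counts for the whole index and subtracting the counts of the documents that did not match whenever the hits exceed half the index. The names and sample data are hypothetical, and this is not the module's actual code.

```python
def count_labels(doc_labels, docs):
    """Plain counting: one pass over the given documents."""
    counts = {}
    for doc in docs:
        for label in doc_labels[doc]:
            counts[label] = counts.get(label, 0) + 1
    return counts


def count_hits(doc_labels, total_counts, matched):
    """Count facets for the hits, visiting whichever side of the index is smaller."""
    matched = set(matched)
    if 2 * len(matched) <= len(doc_labels):
        return count_labels(doc_labels, matched)
    # More than half the index matched: count the misses and subtract from the totals.
    misses = [doc for doc in range(len(doc_labels)) if doc not in matched]
    miss_counts = count_labels(doc_labels, misses)
    result = {}
    for label, total in total_counts.items():
        remaining = total - miss_counts.get(label, 0)
        if remaining:
            result[label] = remaining
    return result


doc_labels = [["red"], ["red", "large"], ["blue"], ["red"], ["blue", "large"]]
total_counts = count_labels(doc_labels, range(len(doc_labels)))  # computed once, up front

matched = [0, 1, 2, 3]  # 4 of 5 documents match, so only document 4 is visited
print(count_hits(doc_labels, total_counts, matched))
# {'red': 3, 'large': 1, 'blue': 1}
```

Both paths give the same counts; the complement path just visits fewer documents when most of the index matches.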

Solr vs ElasticSearch: Part 4 – Faceting

Tuesday, October 30th, 2012

Solr vs ElasticSearch: Part 4 – Faceting by Rafał Kuć.

From the post:

Solr 4 (aka SolrCloud) has just been released, so it’s the perfect time to continue our ElasticSearch vs. Solr series. In the last three parts of the ElasticSearch vs. Solr series we gave a general overview of the two search engines, about data handling, and about their full text search capabilities. In this part we look at how these two engines handle faceting.

Rafał continues his excellent comparison of Solr and ElasticSearch.

Understanding your software options is almost as important as understanding your data.

Faceted classification – Drill Up/Down, Out?

Sunday, October 28th, 2012

Faceted classification

I use search facets in a number of contexts every day.

But today this summary from Wikipedia struck me differently than most days:

A faceted classification system allows the assignment of an object to multiple characteristics (attributes), enabling the classification to be ordered in multiple ways, rather than in a single, predetermined, taxonomic order. A facet comprises “clearly defined, mutually exclusive, and collectively exhaustive aspects, properties or characteristics of a class or specific subject”.[1] For example, a collection of books might be classified using an author facet, a subject facet, a date facet, etc. (From Faceted classification at Wikipedia.)

My general experience is that facets are used to narrow search results. That is, the result set is progressively narrowed to fewer and fewer items.

At the same time, a choice of facets can be discarded, returning to a broader result set.

So facets can move the searcher up and down in search result size, but within the bounds of the initial result set.

Has anyone experimented with adding facets from a broader pool? Say all the items in a database and not just those items in an initial search query?

Enabling the user to “drill out” from what we think of as the initial result set?

That would raise questions about managing facets over a changing underlying set, so that a user could still broaden or narrow the result set in the more traditional way.

High Availability Search with SolrCloud

Sunday, October 28th, 2012

High Availability Search with SolrCloud by Brent Lemons.

Brent explains that embedded ZooKeeper is useful for testing and learning SolrCloud, but high availability requires more: separate installations of SolrCloud and ZooKeeper, each set up as a high availability application.

He walks through the steps to create and test such an installation.

If you have or expect to have a high availability search requirement, Brent’s post will be helpful.

Faceting & result grouping

Saturday, April 14th, 2012

Faceting & result grouping by Martijn van Groningen

From the post:

Result grouping and faceting are in essence two different search features. Faceting counts the number of hits for specific field values matching the current query. Result grouping groups documents together with a common property and places these documents under a group. These groups are used as the hits in the search result. Usually result grouping and faceting are used together and a lot of times the results get misunderstood.

The main reason is that when using grouping people expect that a hit is represented by a group. Faceting isn’t aware of groups and thus the computed counts represent documents and not groups. This different behaviour can be very confusing. A lot of questions on the Solr user mailing list are about this exact confusion.

In the case that result grouping is used with faceting users expect grouped facet counts. What does this mean? This means that when counting the number of matches for a specific field value the grouped faceting should check whether the group a document belongs to isn’t already counted before. This is best illustrated with some example documents.

Examples follow that make the distinction between groups and facets in Lucene and Solr clear, along with specific suggestions for configuring your service.
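A quick way to see the difference between document counts and grouped facet counts is the Python sketch below. It is an illustration only: the hotel room data and function names are hypothetical and not taken from the post, and it assumes a document is skipped when its group has already been counted for that field value.

```python
def facet_counts(docs, field):
    """Ordinary faceting: every matching document adds one to its field value."""
    counts = {}
    for doc in docs:
        counts[doc[field]] = counts.get(doc[field], 0) + 1
    return counts


def grouped_facet_counts(docs, field, group_field):
    """Grouped faceting: each group adds at most one to a given field value."""
    seen = set()
    counts = {}
    for doc in docs:
        key = (doc[group_field], doc[field])
        if key in seen:
            continue  # this group was already counted for this value
        seen.add(key)
        counts[doc[field]] = counts.get(doc[field], 0) + 1
    return counts


# Hypothetical matches: each document is a room, and results are grouped by hotel.
rooms = [
    {"hotel": "A", "city": "Amsterdam"},
    {"hotel": "A", "city": "Amsterdam"},
    {"hotel": "B", "city": "Amsterdam"},
    {"hotel": "C", "city": "Utrecht"},
    {"hotel": "C", "city": "Utrecht"},
]

print(facet_counts(rooms, "city"))                   # {'Amsterdam': 3, 'Utrecht': 2}
print(grouped_facet_counts(rooms, "city", "hotel"))  # {'Amsterdam': 2, 'Utrecht': 1}
```

A user looking at three hotels in the result list expects the second set of numbers, not the first.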

Custom security filtering in Solr

Tuesday, April 3rd, 2012

Custom security filtering in Solr by Erik Hatcher

Yonik recently wrote about “Advanced Filter Caching in Solr” where he talked about expensive and custom filters; it was left as an exercise to the reader on the implementation details. In this post, I’m going to provide a concrete example of custom post filtering for the case of filtering documents based on access control lists.

Recap of Solr’s filtering and caching

First let’s review Solr’s filtering and caching capabilities. Queries to Solr involve a full-text, relevancy scored, query (the infamous q parameter). As users navigate they will browse into facets. The search application generates filter query (fq) parameters for faceted navigation (eg. fq=color:red, as in the article referenced above). The filter queries are not involved in document scoring, serving only to reduce the search space. Solr sports a filter cache, caching the document sets of each unique filter query. These document sets are generated in advance, cached, and reduce the documents considered by the main query. Caching can be turned off on a per-filter basis; when filters are not cached, they are used in parallel to the main query to “leap frog” to documents for consideration, and a cost can be associated with each filter in order to prioritize the leap-frogging (smallest set first would minimize documents being considered for matching).

Post filtering

Even without caching, filter sets default to generate in advance. In some cases it can be extremely expensive and prohibitive to generate a filter set. One example of this is with access control filtering that needs to take the users query context into account in order to know which documents are allowed to be returned or not. Ideally only matching documents, documents that match the query and straightforward filters, should be evaluated for security access control. It’s wasteful to evaluate any other documents that wouldn’t otherwise match anyway. So let’s run through an example… a contrived example for the sake of showing how Solr’s post filtering works.

Good examples, but also heed the author’s warning to use the techniques in this article only when necessary. Sometimes simple solutions are the best, like using the network authentication layer to prevent unauthorized users from seeing the Solr application at all. No muss, no fuss.
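The core idea of post filtering fits in a few lines of Python. This is only a sketch of the evaluation order, not Solr code: the `search` function, the `AclFilter` class and the sample documents are hypothetical. In Solr itself, as far as I know, a filter is only run as a post filter when it is marked uncached (cache=false) and given a cost of 100 or more.

```python
def search(docs, matches_query, cheap_filters, post_filter):
    """Return ids of docs passing the query, the cheap filters, then the post filter."""
    results = []
    for doc in docs:
        if not matches_query(doc):
            continue
        if not all(accepts(doc) for accepts in cheap_filters):
            continue
        # The expensive check runs last, only for documents that survived so far.
        if post_filter(doc):
            results.append(doc["id"])
    return results


class AclFilter:
    """Hypothetical access control check that records how often it is called."""

    def __init__(self, user_groups):
        self.user_groups = set(user_groups)
        self.checks = 0

    def __call__(self, doc):
        self.checks += 1
        return bool(self.user_groups & set(doc["acl"]))


docs = [
    {"id": 1, "text": "red shoes", "color": "red", "acl": ["sales"]},
    {"id": 2, "text": "red hat", "color": "red", "acl": ["sales"]},
    {"id": 3, "text": "blue shoes", "color": "blue", "acl": ["sales"]},
    {"id": 4, "text": "red shoes sale", "color": "red", "acl": ["staff"]},
    {"id": 5, "text": "green hat", "color": "green", "acl": ["sales"]},
]

acl = AclFilter(["sales"])
allowed = search(
    docs,
    matches_query=lambda doc: "shoes" in doc["text"],
    cheap_filters=[lambda doc: doc["color"] == "red"],
    post_filter=acl,
)
print(allowed, acl.checks)  # [1] 2
```

Five documents are indexed, but the access control check runs only twice, for the two documents that already matched the query and the color filter.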

Guided Exploration = Faceted Search, Backwards

Thursday, January 19th, 2012

Guided Exploration = Faceted Search, Backwards by Daniel Tunkelang.

Daniel starts off:

Information Scent

In the early 1990s, PARC researchers Peter Pirolli and Stuart Card developed the theory of information scent (more generally, information foraging) to evaluate user interfaces in terms of how well users can predict which paths will lead them to useful information. Like many HCIR researchers and practitioners, I’ve found this model to be a useful way to think about interactive information seeking systems.

Specifically, faceted search is an exemplary application of the theory of information scent. Faceted search allows users to express an information need as a keyword search, providing them with a series of opportunities to improve the precision of the initial result set by restricting it to results associated with particular facet values.

For example, if I’m looking for folks to hire for my team, I can start my search on LinkedIn with the keywords [information retrieval], restrict my results to Location: San Francisco Bay Area, and then further restrict to School: CMU.

But quickly comes to:

Guided exploration exchanges the roles of precision and recall. Faceted search starts with high recall and helps users increase precision while preserving as much recall as possible. In contrast, guided exploration starts with high precision and helps users increase recall while preserving as much precision as possible.

That sounds great in theory, but how can we implement guided exploration in practice?

A very interesting look at how to expand a result set and maintain precision at the same time.

Of particular interest for anyone who wants to implement dynamic merging of proxies based on subject similarity.

An open field of research that offers a number of exciting possibilities.
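For anyone who wants to put numbers on the precision and recall trade-off Daniel describes, here is a small Python sketch using the standard definitions. The document ids and result sets are invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Precision: share of retrieved items that are relevant. Recall: share of relevant items retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


relevant = {1, 2, 3, 4}  # hypothetical: the documents the user actually wants

# Faceted search: start broad (high recall), then narrow to raise precision.
print(precision_recall({1, 2, 3, 4, 5, 6, 7, 8}, relevant))  # (0.5, 1.0)
print(precision_recall({1, 2, 3, 5}, relevant))              # (0.75, 0.75)

# Guided exploration: start narrow (high precision), then expand to raise recall.
print(precision_recall({1, 2}, relevant))                    # (1.0, 0.5)
print(precision_recall({1, 2, 3, 5}, relevant))              # (0.75, 0.75)
```

In this toy case the two approaches arrive at the same result set from opposite directions.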