Bobo: Fast Faceted Search With Lucene
From the website:
Bobo is a Faceted Search implementation written purely in Java, an extension of Apache Lucene.
While Lucene is good with unstructured data, Bobo fills in the missing piece to handle semi-structured and structured data.
Bobo Browse is an information retrieval technology that provides navigational browsing into a semi-structured dataset. Beyond the result set from queries and selections, Bobo Browse also provides the facets from this point of browsing.
Features:
- No need for cache warm-up for the system to perform
- multi value sort – sort documents on fields that have multiple values per doc, e.g. tokenized fields
- fast field value retrieval – over 30x faster than IndexReader.document(int docid)
- facet count distribution analysis
- stable and small memory footprint
- support for runtime faceting
- result merge library for distributed facet search
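To make "providing the facets from this point of browsing" concrete, here is a minimal sketch in plain Java. It does not use Bobo's or Lucene's API; the class name, the method names and the sample hotel-room data are all made up for illustration. It simply counts facet values over an in-memory result set and recounts them after a selection narrows that set.

```java
import java.util.*;

// Illustrative only: NOT Bobo's API. Documents are modelled as
// field -> list of values, so a field may carry several values per doc.
public class FacetCountSketch {

    /** Count how many documents in the result set carry each value of the facet field. */
    static Map<String, Integer> facetCounts(List<Map<String, List<String>>> docs, String field) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, List<String>> doc : docs) {
            for (String value : doc.getOrDefault(field, List.of())) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Narrow the result set to documents that have the selected facet value. */
    static List<Map<String, List<String>>> select(
            List<Map<String, List<String>>> docs, String field, String value) {
        List<Map<String, List<String>>> out = new ArrayList<>();
        for (Map<String, List<String>> doc : docs) {
            if (doc.getOrDefault(field, List.of()).contains(value)) {
                out.add(doc);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample data.
        List<Map<String, List<String>>> rooms = List.of(
            Map.of("bed", List.of("king"), "view", List.of("sea", "garden")),
            Map.of("bed", List.of("twin"), "view", List.of("sea")),
            Map.of("bed", List.of("king"), "view", List.of("city")));

        // Facet counts for the full result set.
        System.out.println(facetCounts(rooms, "view"));   // {city=1, garden=1, sea=2}

        // After selecting bed=king, the counts reflect the new point of browsing.
        List<Map<String, List<String>>> kings = select(rooms, "bed", "king");
        System.out.println(facetCounts(kings, "view"));   // {city=1, garden=1, sea=1}
    }
}
```

A real engine would compute these counts from index structures rather than by scanning documents, which is where the performance claims in the feature list come in.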
I had to go look up the definition of facet. Merriam-Webster (I remember when it was just Webster) says:
any of the definable aspects that make up a subject (as of contemplation) or an object (as of consideration)
So a faceted search could search/browse, in theory at any rate, based on any property of a subject, even those I don’t recognize.
Different languages being the easiest example.
I could have aspects of a hotel room described in both German and Korean, both describing the same facets of the room.
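As a rough sketch of that idea, assume each label, whatever its language, is mapped by hand to a language-neutral facet identifier. The enum, the label table and the method name below are hypothetical, chosen only to illustrate the point:

```java
import java.util.*;

// Illustrative only: a hand-built table mapping labels to facets.
public class FacetLabelSketch {

    // Hypothetical facet identifiers, independent of any language.
    enum Facet { SEA_VIEW, DOUBLE_BED }

    // German and Korean labels that name the same facets.
    static final Map<String, Facet> LABELS = Map.of(
        "Meerblick", Facet.SEA_VIEW,
        "바다 전망", Facet.SEA_VIEW,
        "Doppelbett", Facet.DOUBLE_BED,
        "더블베드", Facet.DOUBLE_BED);

    /** Resolve a description's labels to the facets they denote; unknown labels are ignored. */
    static Set<Facet> facetsOf(List<String> description) {
        Set<Facet> facets = EnumSet.noneOf(Facet.class);
        for (String label : description) {
            Facet facet = LABELS.get(label);
            if (facet != null) {
                facets.add(facet);
            }
        }
        return facets;
    }

    public static void main(String[] args) {
        Set<Facet> german = facetsOf(List.of("Meerblick", "Doppelbett"));
        Set<Facet> korean = facetsOf(List.of("더블베드", "바다 전망"));
        // Both descriptions resolve to the same set of facets.
        System.out.println(german.equals(korean));   // true
    }
}
```

The hard part, of course, is building and maintaining the label table, not looking things up in it.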
Questions:
- How would you choose the facets for a subject to be included in faceted browsing? (3-5 pages, no citations)
- How would you design and test the presentation of facets to users? (3-5 pages, no citations)
- Compare the current TMQL proposal (post-Barta) with the query language for facet searching. If a topic map were treated (post-merging) as faceted subjects, which one would you prefer and why? (3-5 pages, no citations)
What Bobo (and many similar developments) show is that the specialisation on “structured”, “semi-structured”, “what-an-academic-thinks-that-structure-is” content is eventually … counterproductive.
Content spans over the whole spectrum of “structuredness” and a query language should eventually cover this.
Speaking of TMQL-post-Barta: I certainly enjoy this new … modesty. 😉
Comment by Robert Barta — December 9, 2010 @ 4:15 am
@Barta, +1!
Except that every abstraction is subject to being bridged by another abstraction.
That is, there is no “Ur” abstraction that is the final one.
That has certainly been the case since the second data model (I wonder if we could identify that?) for digital computers, but I would argue it has a much deeper and richer history than that.
I am looking forward to constructive comments from you on the next TMQL draft(s).
Although it is the path not taken by WG3 (SC 34 for that matter), nothing says a standard has to be written first. It is possible to standardize practice as well.
Of course, such a language would have to be promoted and become the de facto standard to become a candidate for standardization. (Immodesty may be required. 😉 )
Comment by Patrick Durusau — December 9, 2010 @ 6:52 am