Archive for the ‘Lucene’ Category

Eating dog food with Lucene

Tuesday, May 14th, 2013

Eating dog food with Lucene by Michael McCandless.

From the post:

Eating your own dog food is important in all walks of life: if you are a chef you should taste your own food; if you are a doctor you should treat yourself when you are sick; if you build houses for a living you should live in a house you built; if you are a parent then try living by the rules that you set for your kids (most parents would fail miserably at this!); and if you build software you should constantly use your own software.

So, for the past few weeks I’ve been doing exactly that: building a simple Lucene search application, searching all Lucene and Solr Jira issues, and using it instead of Jira’s search whenever I need to go find an issue.

It’s currently running at jirasearch.mikemccandless.com and it’s still quite rough (feedback welcome!).

Now there’s a way to learn the details!

Makes me think about the poor search capabilities at an SDO I frequent.

Could be a way to spend some quality time with Lucene and Solr.

Will have to give it some thought.

Apache Lucene / Solr 4.3 Release!

Monday, May 6th, 2013

See Lucene Changes.txt.

See Solr Changes.txt.

More good news for a Monday!

Client-side search

Thursday, April 25th, 2013

Client-side search by Gene Golovchinsky.

From the post:

When we rolled out the CHI 2013 previews site, we got a couple of requests for being able to search the site with keywords. Of course interfaces for search are one of my core research interests, so that request got me thinking. How could we do search on this site? The problem with the conventional approach to search is that it requires some server-side code to do the searching and to return results to the client. This approach wouldn’t work for our simple web site, because from the server’s perspective, our site was static — just a few HTML files, a little bit of JavaScript, and about 600 videos. Using Google to search the site wouldn’t work either, because most of the searchable content is located on two pages, with hundreds of items on each page. So what to do?

I looked around briefly trying to find some client-side indexing and retrieval code, and struck out. Finally, I decided to take a crack at writing a search engine in JavaScript. Now, before you get your expectations up, I was not trying to re-implement Lucene in JavaScript. All I wanted was some rudimentary keyword search capability. Building that in JavaScript was not so difficult.

One simplifying assumption I could make was that my document collection was static: sorry, the submission deadline for the conference has passed. Thus, I could have a static index that could be made available to each client, and all the client needed to do was match and rank.

Each of my documents had a three character id, and a set of fields. I didn’t bother with the fields, and just lumped everything together in the index. The approach was simple, again due to lots of assumptions. I treated the inverted index as a hash table that maps keywords onto lists of document ids. OK, document ids and term frequencies. Including positional information is an exercise left to the reader.
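To make the description concrete, here is a minimal sketch of that kind of index, in Python rather than the JavaScript the post used. The tokenizer, the sample documents and the ranking by summed term frequency are my assumptions for illustration, not Gene's code.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Lowercase and keep runs of letters/digits (an assumed, very crude analyzer).
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    # Inverted index as a hash table: keyword -> {doc_id: term_frequency}.
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def search(index, query):
    # Score each document by the summed frequency of the query terms it contains.
    scores = defaultdict(int)
    for term in tokenize(query):
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical documents with three-character ids, all fields lumped together.
docs = {
    "abc": "Touch input for mobile search interfaces",
    "abd": "Search search search: evaluating search interfaces",
    "abe": "Crowdsourcing video annotation",
}
index = build_index(docs)
```

With a static collection, `build_index` would run once and the resulting table would ship to each client, leaving only `search` to run in the browser.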

A refreshing reminder that simplified requirements can lead to successful applications.

Or to put it another way, not every application has to meet every possible use case.

For example, I might want to have a photo matching application that only allows users to pick match/no match for any pair of photos.

Not why, what reasons for match/no match, etc.

But it does capture the user’s identity in an association saying that photo # and photo # are of the same person.

That doesn’t provide any basis for automated comparison of those judgments, but not every judgment is required to do so.

I am starting to think of subject identification as a continuum of practices, some of which enable more reuse than others.

Which of those you choose depends upon your requirements, your resources and other factors.

How To Debug Solr With Eclipse

Tuesday, April 16th, 2013

How To Debug Solr With Eclipse by Doug Turnbull.

From the post:

Recently I was puzzled by some behavior Solr was showing me. I scratched my head and called over a colleague. We couldn’t quite figure out what was going on. Well Solr is open source so… next stop – Debuggersville!

Running Solr in the Eclipse debugger isn’t hard, but there are many scattered user group posts and blog articles that you’ll need to manually tie together into a coherent picture. So let me do you the favor of tying all of that info together for you here.

This looks very useful.

Curious if there are any statistical function debuggers?

Ones that step you through the operations and show the state of values as they change?

Thinking that could be quite useful as a sanity test when the numbers just don’t jibe.

Apache Lucene and Solr 4.2.1

Wednesday, April 10th, 2013

Bug fix releases for Apache Lucene and Solr.

Apache Lucene 4.2.1: Changes; Downloads.

Apache Solr 4.2.1: Changes; Downloads.

Beginners Guide To Enhancing Solr/Lucene Search…

Monday, April 8th, 2013

Beginners Guide To Enhancing Solr/Lucene Search With Mahout’s Machine Learning by Doug Turnbull.

From the post:

Yesterday, John and I gave a talk to the DC Hadoop Users Group about using Mahout with Solr to perform Latent Semantic Indexing — calculating and exploiting the semantic relationships between keywords. While we were there, I realized, a lot of people could benefit from a bigger picture, less in-depth, point of view outside of our specific story. In general where do Mahout and Solr fit together? What does that relationship look like, and how does one exploit Mahout to make search even more awesome? So I thought I’d blog about how you too can start to put these pieces together to simultaneously exploit Solr’s search and Mahout’s machine learning capabilities.

The root of how this all works is with a slightly obscure feature of Lucene based search — Term Vectors. Lucene based search applications give you the ability to generate term vectors from documents in the search index. It’s a feature often turned on for specific search features, but other than that can appear to be a weird opaque feature to beginners. What is a term vector, you might ask? And why would you want to get one?
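As a rough answer to "what is a term vector," here is a small Python sketch of the idea: for a single document, the terms it contains along with where and how often each occurs. The tokenizer and sample sentence are assumptions of mine; this is not Lucene's API or storage format.

```python
import re
from collections import defaultdict

def term_vector(text):
    # For one document: term -> list of token positions.
    # The term frequency is simply the length of each list.
    positions = defaultdict(list)
    for pos, term in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
        positions[term].append(pos)
    return dict(positions)

tv = term_vector("Solr search and Mahout: search meets machine learning")
frequencies = {term: len(pos) for term, pos in tv.items()}
```

Stack the frequency vectors of many documents and you have the term-document matrix that a technique like Latent Semantic Indexing starts from.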

You know my misgivings about metric approaches to non-metric data (such as semantics) but there is no denying that Latent Semantic Indexing can be useful.

Think of Latent Semantic Indexing as a useful tool.

A saw is a tool too but not every cut made with a saw is a correct one.

Yes?

How NoSQL Paid Off for Telenor

Friday, March 29th, 2013

How NoSQL Paid Off for Telenor by Sebastian Verheughe and Katrina Sponheim.

A presentation I encountered while searching for something else.

Makes a business case for Lucene/Solr and Neo4j solutions to improve customer access to data.

As opposed to the “world would be a better place” case.

What information process/need have you encountered where you can make a business case for topic maps?

Build a search engine in 20 minutes or less

Thursday, March 28th, 2013

Build a search engine in 20 minutes or less by Ben Ogorek.

I was suspicious but pleasantly surprised by the demonstration of the vector space model you will find here.

True, it doesn’t offer all the features of the latest Lucene/Solr releases but it will give you a firm grounding on vector space models.

Enjoy!

PS: One thing to keep in mind, semantics do not map to vector space. We can model word occurrences in vector space but occurrences are not semantics.
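A minimal sketch of the vector space idea in Python, assuming raw term counts as weights and cosine similarity for ranking. Ben's post may weight terms differently; the function names and sample documents here are mine.

```python
import math
import re
from collections import Counter

def tf_vector(text):
    # A document as a sparse vector: term -> raw count.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    # Cosine of the angle between two sparse term vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(query, docs):
    # Return ids of documents sharing at least one term with the query, best first.
    q = tf_vector(query)
    scored = [(cosine(q, tf_vector(text)), doc_id) for doc_id, text in docs.items()]
    ordered = sorted(scored, key=lambda s: (-s[0], s[1]))
    return [doc_id for score, doc_id in ordered if score > 0]

sample_docs = {
    "d1": "lucene search engine",
    "d2": "graph database",
    "d3": "search search",
}
```

Note that the ranking knows nothing beyond shared tokens: "river bank" and "bank loan" would look similar to it, which is the point of the PS.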

Lucene/Solr 4 – A Revolution in Enterprise Search Technology (Webinar)

Friday, March 22nd, 2013

Lucene/Solr 4 – A Revolution in Enterprise Search Technology (Webinar). Presenter: Erik Hatcher, Lucene/Solr Committer and PMC member.

Date: Wednesday, March 27, 2013
Time: 10:00am Pacific Time

From the signup page:

Lucene/Solr 4 is a ground breaking shift from previous releases. Solr 4.0 dramatically improves scalability, performance, reliability, and flexibility. Lucene 4 has been extensively upgraded. It now supports near real-time (NRT) capabilities that allow indexed documents to be rapidly visible and searchable. Additional Lucene improvements include pluggable scoring, much faster fuzzy and wildcard querying, and vastly improved memory usage.

The improvements in Lucene have automatically made Solr 4 substantially better. But Solr has also been considerably improved and magnifies these advances with a suite of new “SolrCloud” features that radically improve scalability and reliability.

In this Webinar, you will learn:

  • What are the Key Feature Enhancements of Lucene/Solr 4, including the new distributed capabilities of SolrCloud
  • How to Use the Improved Administrative User Interface
  • How Sharding has been improved
  • What are the improvements to GeoSpatial Searches, Highlighting, Advanced Query Parsers, Distributed search support, Dynamic core management, Performance statistics, and searches for rare values, such as Primary Key

Great way to get up to speed on the latest release of Lucene/Solr!

elasticsearch 0.90.0.RC1 Released

Thursday, March 21st, 2013

elasticsearch 0.90.0.RC1 Released by Shay Banon.

From the post:

elasticsearch version 0.90.0.RC1 is out, the first release candidate for the 0.90 release. You can download it here.

This release includes an upgrade to Lucene 4.2, many improvements to the suggester feature (including its own dedicated API), another round of memory improvements to field data (long type will now automatically “narrow” to the smallest type when loaded to memory) and several bug fixes. Upgrading to it from previous beta releases is highly recommended. (inserted URL to release notes)

Just to keep you on the cutting edge of search technology!

MongoDB 2.4 Release

Tuesday, March 19th, 2013

MongoDB 2.4 Release

From the webpage:

Developer Productivity

  • Capped Arrays simplify development by making it easy to incorporate fixed, sorted lists for features like leaderboards and logging.
  • Geospatial Enhancements enable new use cases with support for polygon intersections and analytics based on geospatial data.
  • Text Search provides a simplified, integrated approach to incorporating search functionality into apps (Note: this feature is currently in beta release).

Operations

  • Hash-Based Sharding simplifies deployment of large MongoDB systems.
  • Working Set Analyzer makes capacity planning easier for ops teams.
  • Improved Replication increases resiliency and reduces administration.
  • Mongo Client creates an intuitive, consistent feature set across all drivers.

Performance

  • Faster Counts and Aggregation Framework Refinements make it easier to leverage real-time, in-place analytics.
  • V8 JavaScript Engine offers better concurrency and faster performance for some operations, including MapReduce jobs.

Monitoring

  • On-Prem Monitoring provides comprehensive monitoring, visualization and alerting on more than 100 operational metrics of a MongoDB system in real time, based on the same application that powers 10gen’s popular MongoDB Monitoring Service (MMS). On-Prem Monitoring is only available with MongoDB Enterprise.



Security
….

  • Kerberos Authentication enables enterprise and government customers to integrate MongoDB into existing enterprise security systems. Kerberos support is only available in MongoDB Enterprise.
  • Role-Based Privileges allow organizations to assign more granular security policies for server, database and cluster administration.

You can read more about the improvements to MongoDB 2.4 in the Release Notes. Also, MongoDB 2.4 is available for download on MongoDB.org.

Lots to look at in MongoDB 2.4!

But I am curious about the beta text search feature.

MongoDB Text Search: Experimental Feature in MongoDB 2.4 says:

Text search (SERVER-380) is one of the most requested features for MongoDB. 10gen is working on an experimental text-search feature, to be released in v2.4, and we’re already seeing some talk in the community about the native implementation within the server. We view this as an important step towards fulfilling a community need.

MongoDB text search is still in its infancy and we encourage you to try it out on your datasets. Many applications use both MongoDB and Solr/Lucene, but realize that there is still a feature gap. For some applications, the basic text search that we are introducing may be sufficient. As you get to know text search, you can determine when MongoDB has crossed the threshold for what you need. (emphasis added)

So, why isn’t MongoDB incorporating Solr/Lucene instead of a home grown text search feature?

Seems like users could leverage their Solr/Lucene skills with their MongoDB installations.

Yes?

Lux

Saturday, March 16th, 2013

Lux

From the readme:

Lux is an open source XML search engine formed by fusing two excellent technologies: the Apache Lucene/Solr search index and the Saxon XQuery/XSLT processor.

At its core, Lux provides XML-aware indexing, an XQuery 1.0 optimizer that rewrites queries to use the indexes, and a function library for interacting with Lucene via XQuery. These capabilities are tightly integrated with Solr, and leverage its application framework in order to deliver a REST service and application server.

The REST service is accessible to applications written in almost any language, but it will be especially convenient for developers already using Solr, for whom Lux operates as a Solr plugin that provides query services using the same REST APIs as other Solr search plugins, but using a different query language (XQuery). XML documents may be inserted (and updated) using standard Solr REST calls: XML-aware indexing is triggered by the presence of an XML-aware field in a document. This means that existing application frameworks written in many different languages are positioned to use Lux as a drop-in capability for indexing and querying semi-structured content.

The application server is a great way to get started with Lux: it provides the ability to write a complete application in XQuery and XSLT with data storage backed by Lucene.

If you are looking for experience with XQuery and Lucene/Solr, look no further!

May be a good excuse for me to look at defining equivalence statements using XQuery.

I first saw this in a tweet by Michael Kay.

A Peek Behind the Neo4j Lucene Index Curtain

Friday, March 15th, 2013

A Peek Behind the Neo4j Lucene Index Curtain by Max De Marzi.

Max suggests using a copy of your Neo4j database for this exercise.

Could be worth your while to go exploring.

And you will learn something about Lucene in the bargain.

Elasticsearch and Joining

Tuesday, March 12th, 2013

Elasticsearch and Joining by Felix Hürlimann.

From the post:

With the success of elasticsearch, people, including us, start to explore the possibilities and mightiness of the system. Including border cases for which the underlying core, Lucene, never was originally intended or optimized for. One of the many requests that come up pretty quickly is the wish for joining data across types or indexes, similar to an SQL join clause that combines records from two or more tables in a database. Unfortunately full join support is not (yet?) available out of the box. But there are some possibilities and some attempts to solve parts of the issue. This post is about summarizing some of the ideas in this field.

To illustrate the different ideas, let’s work with the following example: we would like to index documents and comments with a one to many relationship between them. Each comment has an author and we would like to answer the question: Give me all documents that match a certain query and a specific author has commented on it.

The latest beta release of Elasticsearch is described as:

If you have more complex requirements for join, a new feature introduced in the latest beta release may help you. It introduces another feature that allows for a kind of join by looking up filter terms in another index or type. This allows then e.g. for queries like “Show me all comments from documents that relate to this document and the author is ‘John Doe’”.
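The "look up filter terms elsewhere, then filter" idea can be imitated in plain Python over in-memory lists. This is only a sketch of the two-step pattern using the post's documents-and-comments example; the data, field names and functions are hypothetical, and none of it is Elasticsearch's API.

```python
# Two separate "indexes": documents, and comments that point at documents.
documents = [
    {"id": 1, "text": "lucene index internals"},
    {"id": 2, "text": "solr faceting basics"},
    {"id": 3, "text": "lucene automata tutorial"},
]
comments = [
    {"doc_id": 1, "author": "John Doe"},
    {"doc_id": 2, "author": "John Doe"},
    {"doc_id": 3, "author": "Jane Roe"},
]

def docs_commented_by(author, comments):
    # Step 1: look up the filter terms (document ids) in the other index.
    return {c["doc_id"] for c in comments if c["author"] == author}

def search_with_author(query, author, documents, comments):
    # Step 2: run the text query, keeping only documents whose id was looked up.
    allowed = docs_commented_by(author, comments)
    return [d["id"] for d in documents
            if query in d["text"].split() and d["id"] in allowed]
```

The cost of this pattern is the size of the looked-up id set, which is why it handles "a specific author" better than a join across everything.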

The “looking up” in a different index or type sounds quite interesting.

Have you looked at the new beta of Elasticsearch?

Apache Lucene/Solr 4.2 Out!

Tuesday, March 12th, 2013

Apache Lucene 4.2

Download

Changes

Apache Solr 4.2

Download

Changes

See the Lucene homepage for a summary of the 4.2 changes in Lucene and Solr.

Warning: Reference good only until the next Lucene/Solr release. ;-)

Solr: Custom Ranking with Function Queries

Sunday, March 10th, 2013

Solr: Custom Ranking with Function Queries by Sujit Pal.

From the post:

Solr has had support for Function Queries since version 3.1, but before sometime last week, I did not have a use for it. Which is probably why when I would read about Function Queries, they would seem like a nice idea, but not interesting enough to pursue further.

Most people get introduced to Function Queries through the bf parameter in the DisMax Query Parser or through the geodist function in Spatial Search. So far, I haven’t had the opportunity to personally use either feature in a real application. My introduction to Function Queries was through a problem posed to me by one of my coworkers.

The problem was as follows. We want to be able to customize our search results based on what a (logged-in) user tells us about himself or herself via their profile. This could be gender, age, ethnicity and a variety of other things. On the content side, we can annotate the document with various features corresponding to these profile features. For example, we can assign a score to a document that indicates its appeal/information value to males versus females that would correspond to the profile’s gender.

So the idea is that if we know that the profile is male, we should boost the documents that have a high male appeal score and deboost the ones that have a high female appeal score, and vice versa if the profile is female. This idea can be easily extended for multi-category features such as ethnicity as well. In this post, I will describe a possible implementation that uses Function Queries to rerank search results using male/female appeal document scores.
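To see the boost/deboost idea in miniature, here is a Python sketch that reranks hits after the fact. The field names, the assumption that appeal scores lie between 0 and 1, and the multiplier formula are all mine; Sujit's post does this inside Solr with Function Queries, which this does not reproduce.

```python
def rerank(hits, gender):
    # Multiply the base relevance score by a factor in [0, 2]:
    # above 1 when the document appeals to the profile's gender,
    # below 1 when it appeals mostly to the other one.
    def boosted(hit):
        if gender == "male":
            own, other = hit["male_appeal"], hit["female_appeal"]
        else:
            own, other = hit["female_appeal"], hit["male_appeal"]
        return hit["score"] * (1.0 + own - other)
    return sorted(hits, key=boosted, reverse=True)

# Hypothetical search hits with annotated appeal scores in [0, 1].
hits = [
    {"id": "a", "score": 1.0, "male_appeal": 0.9, "female_appeal": 0.1},
    {"id": "b", "score": 1.2, "male_appeal": 0.2, "female_appeal": 0.8},
]
```

Document "b" wins on raw relevance, but a male profile flips the order while a female profile reinforces it.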

Does your topic map deliver information based on user characteristics?

Have you re-invented the ranking or are you using an off-the-shelf solution?

Solr + “flash sale site”

Saturday, March 9th, 2013

How Solr powers search on America’s largest flash sale site by Ade Trenaman.

The post caught my attention with “flash sale,” which I had to look up. ;-)

Even after discovering it means “deal of the day,” the slides were interesting.

Especially the commentary on synonym lists!

What someone else considers to be a synonym may not be one for your audience.

Drill Sideways faceting with Lucene

Monday, February 25th, 2013

Drill Sideways faceting with Lucene by Mike McCandless.

From the post:

Lucene’s facet module, as I described previously, provides a powerful implementation of faceted search for Lucene. There’s been a lot of progress recently, including awesome performance gains as measured by the nightly performance tests we run for Lucene:

[3.8X speedup!]

….

For example, try searching for an LED television at Amazon, and look at the Brand field, seen in the image to the right: this is a multi-select UI, allowing you to select more than one value. When you select a value (check the box or click on the value), your search is filtered as expected, but this time the field does not disappear: it stays where it was, allowing you to then drill sideways on additional values. Much better!

LinkedIn’s faceted search, seen on the left, takes this even further: not only are all fields drill sideways and multi-select, but there is also a text box at the bottom for you to choose a value not shown in the top facet values.

To recap, a single-select field only allows picking one value at a time for filtering, while a multi-select field allows picking more than one. Separately, drilling down means adding a new filter to your search, reducing the number of matching docs. Drilling up means removing an existing filter from your search, expanding the matching documents. Drilling sideways means changing an existing filter in some way, for example picking a different value to filter on (in the single-select case), or adding another or’d value to filter on (in the multi-select case). (images omitted)
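The recap translates into a short Python sketch: hits must pass every filter, but the counts shown for a given field are computed as if that field's own filter were absent. This is a naive multi-pass illustration over an in-memory list with made-up data, not how Lucene's DrillSideways class is implemented.

```python
from collections import Counter

def matches(doc, filters, skip_dim=None):
    # A doc passes when, for every filtered field except skip_dim,
    # its value is one of the selected (OR'd) values.
    return all(doc[dim] in values
               for dim, values in filters.items()
               if dim != skip_dim and values)

def drill_sideways(docs, filters, dims):
    # Drill-down hits: every filter applies.
    hits = [d for d in docs if matches(d, filters)]
    counts = {}
    for dim in dims:
        # Sideways counts: ignore this field's own filter so that
        # unselected values stay visible with meaningful counts.
        counts[dim] = Counter(d[dim] for d in docs
                              if matches(d, filters, skip_dim=dim))
    return hits, counts

# Hypothetical television catalog.
tvs = [
    {"brand": "Sony", "size": "40"},
    {"brand": "Sony", "size": "50"},
    {"brand": "LG", "size": "40"},
    {"brand": "Samsung", "size": "50"},
]
tv_hits, tv_counts = drill_sideways(
    tvs, {"brand": {"Sony"}, "size": {"40"}}, ["brand", "size"])
```

With Sony and 40 selected there is one hit, yet the Brand field still offers LG (one 40-inch set) and the Size field still offers 50 (one Sony).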

More details: DrillSideways class being developed under LUCENE-4748.

Just following the progress on Lucene is enough to make you dizzy!

Text processing (part 2): Inverted Index

Sunday, February 24th, 2013

Text processing (part 2): Inverted Index by Ricky Ho.

From the post:

This is the second part of my text processing series. In this blog, we’ll look into how text documents can be stored in a form that can be easily retrieved by a query. I’ll use the popular open source Apache Lucene index for illustration.

Not only do you get to learn about inverted indexes but some Lucene in the bargain.

That’s not a bad deal!

Lucene 4 Finite State Automata In 10 Minutes (Intro & Tutorial)

Sunday, February 24th, 2013

Lucene 4 Finite State Automata In 10 Minutes (Intro & Tutorial) by Doug Turnbull.

From the post:

This article is intended to help you bootstrap your ability to work with Finite State Automata (note automata == plural of automaton). Automata are a unique data structure, requiring a bit of theory to process and understand. Hopefully what’s below can give you a foundation for playing with these fun and useful Lucene data structures!

Motivation, Why Automata?

When working in search, a big part of the job is making sense of loosely-structured text. For example, suppose we have a list of about 1000 valid first names and 100,000 last names. Before ingesting data into a search application, we need to extract first and last names from free-form text.

Unfortunately the data sometimes has full names in the format “LastName, FirstName” like “Turnbull, Doug”. In other places, however, full names are listed “FirstName LastName” like “Doug Turnbull”. Add a few extra representations, and to make sense out of what strings represent valid names becomes a chore.

This becomes especially troublesome when we’re depending on these as natural identifiers for looking up or joining across multiple data sets. Each data set might textually represent the natural identifier in subtly different ways. We want to capture the representations across multiple data sets to ensure our join works properly.

So… What’s a text jockey to do when faced with such annoying inconsistencies?

You might initially think “regular expression”. Sadly, a normal regular expression can’t help in this case. Just trying to write a regular expression that allows a controlled vocabulary of 100k valid last names but nothing else is non-trivial. Not to mention the task of actually using such a regular expression.

But there is one tool that looks promising for solving this problem. Lucene 4.0’s new Automaton API. Let’s explore what this API has to offer by first reminding ourselves about a bit of CS theory.
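For a feel of why automata suit controlled vocabularies, here is a sketch in plain Python: a trie is a deterministic automaton over a finite word list, and two of them can be chained with a separator to accept both name formats. The helper names and tiny vocabularies are hypothetical, and none of this uses Lucene's Automaton API.

```python
END = None  # key marking an accepting state; never collides with a character

def build_trie(words):
    # Each dict is a state; each character key is a transition.
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def accepted_ends(trie, text, start):
    # Walk the automaton from `start`; collect every index at which
    # it sits in an accepting state.
    ends, node, i = [], trie, start
    while True:
        if END in node:
            ends.append(i)
        if i >= len(text) or text[i] not in node:
            return ends
        node = node[text[i]]
        i += 1

def parse_name(text, first_names, last_names):
    # Accept "First Last" or "Last, First"; return (first, last) or None.
    formats = ((first_names, " ", last_names, False),
               (last_names, ", ", first_names, True))
    for lead, sep, trail, swap in formats:
        for mid in accepted_ends(lead, text, 0):
            if text.startswith(sep, mid):
                rest = mid + len(sep)
                if len(text) in accepted_ends(trail, text, rest):
                    a, b = text[:mid], text[rest:]
                    return (b, a) if swap else (a, b)
    return None

firsts = build_trie(["Doug", "John"])
lasts = build_trie(["Turnbull", "Berryman"])
```

Matching costs one step per input character no matter how many names are in the vocabulary, which is what a 100k-alternative regular expression struggles to promise.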

Are you motivated?

I am!

See John Berryman’s comment about matching patterns of words.

Then think about finding topics, associations and occurrences in free form data.

Or creating a collection of automata as a tool set for building topic maps.

Searching for Dark Data

Tuesday, February 19th, 2013

Searching for Dark Data by Paul Doscher.

From the post:

We live in a highly connected world where every digital interaction spawns chain reactions of unfathomable data creation. The rapid explosion of text messaging, emails, video, digital recordings, smartphones, RFID tags and those ever-growing piles of paper – in what was supposed to be the paperless office – has created a veritable ocean of information.

Welcome to the world of Dark Data

Welcome to the world of Dark Data, the humongous mass of constantly accumulating information generated in the Information Age. Whereas Big Data refers to the vast collection of the bits and bytes that are being generated each nanosecond of each day, Dark Data is the enormous subset of unstructured, untagged information residing within it.

Research firm IDC estimates that the total amount of digital data, aka Big Data, will reach 2.7 zettabytes by the end of this year, a 48 percent increase from 2011. (One zettabyte is equal to one billion terabytes.) Approximately 90 percent of this data will be unstructured – or Dark.

Dark Data has thrown traditional business intelligence and reporting technologies for a loop. The software that countless executives have relied on to access information in the past simply cannot locate or make sense of the unstructured data that comprises the bulk of content today and tomorrow. These tools are struggling to tap the full potential of this new breed of data.

The good news is that there’s an emerging class of technologies that is ready to pick up where traditional tools left off and carry out the crucial task of extracting business value from this data.

Effective exploration of Dark Data will require something different from search tools that depend upon:

  • Pre-specified semantics (RDF) because Dark Data has no pre-specified semantics.
  • Structure because Dark Data has no structure.

Effective exploration of Dark Data will require:

Machine-assisted, interactive searching with gifted and grounded semantic comparators (people) creating pathways, tunnels and signposts into the wilderness of Dark Data.

I first saw this at: Delving into Dark Data.

Developing Your Own Solr Filter part 2

Sunday, February 17th, 2013

Developing Your Own Solr Filter part 2

From the post:

In the previous entry “Developing Your Own Solr Filter” we’ve shown how to implement a simple filter and how to use it in Apache Solr. Recently, one of our readers asked if we can extend the topic and show how to write more than a single token into the token stream. We decided to go for it and extend the previous blog entry about filter implementation.

What better way to start the week!

New Book: ElasticSearch Server!

Monday, February 11th, 2013

New Book: ElasticSearch Server!

In the blog post dedicated to Solr 4.0 Cookbook we give a small hint that cookbook was not the only project that occupies our free time. Today we can officially say that a few months of hard work is slowly coming to an end – we can announce a new book about one of the greatest pieces of open-source software – ElasticSearch Server book!

ElasticSearch server book describes the most important and commonly used features of ElasticSearch (at least from our perspective). Examples of topics discussed:

  • ElasticSearch installation and configuration
  • Static and dynamic index structure creation
  • Querying ElasticSearch with Query DSL explained
  • Using filters
  • Faceting
  • Routing
  • Indexing data that is not flat

BTW, some wag posted a comment saying a Solr blog should not talk about ElasticSearch.

I bet they don’t see the sunshine very often from that position either. ;-)

A Data Driven Approach to Query Expansion in Question Answering

Wednesday, January 30th, 2013

A Data Driven Approach to Query Expansion in Question Answering by Leon Derczynski, Jun Wang, Robert Gaizauskas, and Mark A. Greenwood.

Abstract:

Automated answering of natural language questions is an interesting and useful problem to solve. Question answering (QA) systems often perform information retrieval at an initial stage. Information retrieval (IR) performance, provided by engines such as Lucene, places a bound on overall system performance. For example, no answer bearing documents are retrieved at low ranks for almost 40% of questions.

In this paper, answer texts from previous QA evaluations held as part of the Text REtrieval Conferences (TREC) are paired with queries and analysed in an attempt to identify performance-enhancing words. These words are then used to evaluate the performance of a query expansion method.

Data driven extension words were found to help in over 70% of difficult questions. These words can be used to improve and evaluate query expansion methods. Simple blind relevance feedback (RF) was correctly predicted as unlikely to help overall performance, and a possible explanation is provided for its low value in IR for QA.

Work on query expansion in natural language answering systems. Closely related to synonymy.

Query expansion tools could be useful prompts for topic map authors seeking terms for mapping.

Getting real-time field values in Lucene

Sunday, January 27th, 2013

Getting real-time field values in Lucene by Mike McCandless.

From the post:

We know Lucene’s near-real-time search is very fast: you can easily refresh your searcher once per second, even at high indexing rates, so that any change to the index is available for searching or faceting at most one second later. For most applications this is plenty fast.

But what if you sometimes need even better than near-real-time? What if you need to look up truly live or real-time values, so for any document id you can retrieve the very last value indexed?

Just use the newly committed LiveFieldValues class!

It’s simple to use: when you instantiate it you provide it with your SearcherManager or NRTManager, so that it can subscribe to the RefreshListener to be notified when new searchers are opened, and then whenever you add, update or delete a document, you notify the LiveFieldValues instance. Finally, call the get method to get the last indexed value for a given document id.
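The bookkeeping described there can be sketched in a few lines of Python: remember values changed since the last refresh, answer from that memory first, and fall back to the searcher otherwise. The class, its method names and the dict standing in for a searcher are hypothetical simplifications, with no thread safety, and are not the Lucene class's API.

```python
_DELETED = object()  # sentinel for ids deleted since the last refresh

class LiveValues:
    def __init__(self, lookup_in_searcher):
        self._lookup = lookup_in_searcher  # reads from the last refreshed searcher
        self._current = {}                 # changes since the last refresh began
        self._old = {}                     # changes an in-flight refresh is publishing

    def add(self, doc_id, value):
        # Call on every add or update.
        self._current[doc_id] = value

    def delete(self, doc_id):
        self._current[doc_id] = _DELETED

    def before_refresh(self):
        # Pending changes stay readable while the new searcher opens.
        self._old, self._current = self._current, {}

    def after_refresh(self):
        # The new searcher now sees them, so they can be forgotten.
        self._old = {}

    def get(self, doc_id):
        for pending in (self._current, self._old):
            if doc_id in pending:
                value = pending[doc_id]
                return None if value is _DELETED else value
        return self._lookup(doc_id)

searcher_view = {}  # stand-in for what the current searcher can see
live = LiveValues(searcher_view.get)
live.add("42", "first value")
```

Here `live.get("42")` answers immediately, even though `searcher_view` knows nothing about document 42 until a refresh publishes it.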

I saw a webinar by Mike McCandless that is probably the only webinar I would ever repeat watching.

Organized, high quality technical content, etc.

Compare that to a recent webinar I watched that spent fifty-five (55) minutes reviewing information known to anyone who could say the software’s name. The speaker then lamented the lack of time to get into substantive issues.

When you see a webinar like Mike’s, drop me a line. We need to promote that sort of presentation over the other.

Make your Filters Match: Faceting in Solr [Surveillance By and For The Public?]

Friday, January 25th, 2013

Make your Filters Match: Faceting in Solr by Florian Hopf.

From the post:

Facets are a great search feature that let users easily navigate to the documents they are looking for. Solr makes it really easy to use them though when naively querying for facet values you might see some unexpected behaviour. Read on to learn the basics of what is happening when you are passing in filter queries for faceting. Also, I’ll show how you can leverage local params to choose a different query parser when selecting facet values.

Introduction

Facets are a way to display categories next to a user’s search results, often with a count of how many results are in this category. The user can then select one of those facet values to retrieve only those results that are assigned to this category. This way he doesn’t have to know what category he is looking for when entering the search term as all the available categories are delivered with the search results. This approach is really popular on sites like Amazon and eBay and is a great way to guide the user.

Solr brought faceting to the Lucene world and arguably the feature was an important driving factor for its success (Lucene 3.4 introduced faceting as well). Facets can be built from terms in the index, custom queries and ranges though in this post we will only look at field facets.

Excellent introduction to facets in Solr.

The amount of enterprise quality indexing and search software that is freely available makes me wonder why the average citizen worries about privacy?

There are far more average citizens than denizens of c-suites, government offices, and the like.

Shouldn’t they be the ones worrying about what the rest of us are compiling together?

Instead of secret, Stasi-like archives, a public archive, with the observations of ordinary citizens.

Apache Lucene 4.1 and Apache Solr™ 4.1 available

Thursday, January 24th, 2013

Lucene 4.1 can be downloaded from: http://lucene.apache.org/core/mirrors-core-latest-redir.html

Lucene CHANGES.txt

Solr 4.1 can be downloaded from: http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Solr CHANGES.txt

That’s getting the new year off to a great start!

A new Lucene highlighter is born [The final inch problem]

Monday, January 7th, 2013

A new Lucene highlighter is born by Mike McCandless.

From the post:

Robert has created an exciting new highlighter for Lucene, PostingsHighlighter, our third highlighter implementation (Highlighter and FastVectorHighlighter are the existing ones). It will be available starting in the upcoming 4.1 release.

Highlighting is crucial functionality in most search applications since it’s the first step of the hard-to-solve final inch problem, i.e. of getting the user not only to the best matching documents but getting her to the best spot(s) within each document. The larger your documents are, the more crucial it is that you address the final inch. Ideally, your user interface would let the user click on each highlight snippet to jump to where it occurs in the full document, or at least scroll to the first snippet when the user clicks on the document link. This is in general hard to solve: which application renders the content is dependent on its mime-type (i.e., the browser will render HTML, but will embed Acrobat Reader to render PDF, etc.).

Google’s Chrome browser has an ingenious solution to the final inch problem, when you use “Find…” to search the current web page: it highlights the vertical scroll bar showing you where the matches are on the page. You can then scroll to those locations, or, click on the highlights in the scroll bar to jump there. Wonderful!

All Lucene highlighters require search-time access to the start and end offsets per token, which are character offsets indicating where in the original content that token started and ended. Analyzers set these two integers per-token via the OffsetAttribute, though some analyzers and token filters are known to mess up offsets which will lead to incorrect highlights or exceptions during highlighting. Highlighting while using SynonymFilter is also problematic in certain cases, for example when a rule maps multiple input tokens to multiple output tokens, because the Lucene index doesn’t store the full token graph.

An interesting addition to the highlighters in Lucene.

Be sure to follow the link to Mike’s comments about the limitations on SynonymFilter and the difficulty of correction.
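
If start and end offsets sound abstract, the following toy sketch in Python may help. It is not Lucene code and not how PostingsHighlighter works internally. The function names and the simple regular-expression tokenizer are my own assumptions, chosen only to show why a highlighter needs character offsets into the original text.

```python
import re


def tokenize_with_offsets(text):
    """Yield (term, start, end) for each word in text.

    start and end are character offsets into the original text, playing
    the role that OffsetAttribute plays for a Lucene analyzer.
    """
    for match in re.finditer(r"\w+", text):
        yield match.group().lower(), match.start(), match.end()


def highlight(text, query_terms, pre="<b>", post="</b>"):
    """Wrap each token that matches a query term, using its offsets."""
    wanted = {term.lower() for term in query_terms}
    pieces = []
    cursor = 0  # end offset of the last piece copied from text
    for term, start, end in tokenize_with_offsets(text):
        if term in wanted:
            pieces.append(text[cursor:start])            # untouched text
            pieces.append(pre + text[start:end] + post)  # original casing
            cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


if __name__ == "__main__":
    print(highlight("Lucene has a new Highlighter", ["highlighter"]))
```

Note that the match is made on the lowercased term while the markup is wrapped around the original characters. If a filter reports offsets that no longer line up with the original text, the markup lands in the wrong place, which is the failure the quoted passage warns about.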

Closing the Knowledge Gap: A Case Study – How Cisco Unlocks Communications

Sunday, December 16th, 2012

Closing the Knowledge Gap: A Case Study – How Cisco Unlocks Communications by Tony Frazier, Director of Product Management, Cisco Systems and David Fishman, Marketing, Lucid Imagination.

A bit dated (2010) but I think you will find this interesting reading.

A couple of snippets to tempt you into reading the full post:

Consider this: you’re looking for information and immediately search the documents at your disposal to find the answer. Are you the first person who conducted this search? If you are in a reasonably large organization, given the scope and mix of electronic communications today, there could be more than 10 other employees looking for the same answer. Unearthing documents, one employee at a time, may not be the best way of tapping into that collective intellect and maximizing resources across an organization. Wouldn’t it make more sense to tap into existing discussions taking place across the network—over email, voice and increasingly video communications?

and,

Cisco’s approach to this project centered on vocabulary-based tagging and search. Every organization has the ability to define keywords for their personalized library. Cisco Pulse then tags a user’s activity, content and behavior in electronic communications to match the vocabulary, presenting valuable information that simplifies and accelerates knowledge sharing across an organization. Vocabulary-based tagging makes unlocking the relevant content of electronic communications safe and efficient.

Add a little vocabulary mapping with topic maps, toss and serve!
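
For a feel of what vocabulary-based tagging with a little mapping could look like, here is a rough sketch in Python. It is in no way Cisco Pulse's implementation. The vocabulary, the tag names and the tag_message function are all hypothetical, invented for illustration.

```python
import re

# Hypothetical vocabulary: canonical tag -> phrases that map to it.
VOCABULARY = {
    "search": ["lucene", "solr", "full-text search"],
    "video": ["telepresence", "video conferencing"],
}


def tag_message(text, vocabulary=VOCABULARY):
    """Return the sorted canonical tags whose phrases occur in text."""
    lowered = text.lower()
    tags = set()
    for tag, phrases in vocabulary.items():
        for phrase in phrases:
            # Word boundaries keep "solr" from matching inside longer words.
            if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
                tags.add(tag)
                break  # one matching phrase is enough for this tag
    return sorted(tags)


if __name__ == "__main__":
    print(tag_message("Anyone tried Solr for indexing our video conferencing notes?"))
```

Mapping several phrases to one canonical tag is the "vocabulary mapping" part. A topic map would go further and record why those phrases identify the same subject.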

Taming Text [Coming real soon now!]

Thursday, December 13th, 2012

Taming Text by Grant S. Ingersoll, Thomas S. Morton, and Andrew L. Farris.

During a webinar today Grant said that “Taming Text” should be out in ebook form in just a week or two.

Grant is giving up the position of being the second longest running MEAP project. (He didn’t say who was first.)

Let’s all celebrate Grant and his co-authors crossing the finish line with a record number of sales!

This promises to be a real treat!

PS: Not going to put this on my wish list, too random and clumsy a process. Will just order it direct. ;-)