Archive for the ‘Finite State Automata’ Category

Lucene 4 Finite State Automata In 10 Minutes (Intro & Tutorial)

Sunday, February 24th, 2013

Lucene 4 Finite State Automata In 10 Minutes (Intro & Tutorial) by Doug Turnbull.

From the post:

This article is intended to help you bootstrap your ability to work with Finite State Automata (note automata == plural of automaton). Automata are a unique data structure, requiring a bit of theory to process and understand. Hopefully what’s below can give you a foundation for playing with these fun and useful Lucene data structures!

Motivation, Why Automata?

When working in search, a big part of the job is making sense of loosely-structured text. For example, suppose we have a list of about 1000 valid first names and 100,000 last names. Before ingesting data into a search application, we need to extract first and last names from free-form text.

Unfortunately the data sometimes has full names in the format “LastName, FirstName” like “Turnbull, Doug”. In other places, however, full names are listed “FirstName LastName” like “Doug Turnbull”. Add a few extra representations, and to make sense out of what strings represent valid names becomes a chore.

This becomes especially troublesome when we’re depending on these as natural identifiers for looking up or joining across multiple data sets. Each data set might textually represent the natural identifier in subtly different ways. We want to capture the representations across multiple data sets to ensure our join works properly.

So… Whats a text jockey to do when faced with such annoying inconsistencies?

You might initially think “regular expression”. Sadly, a normal regular expression can’t help in this case. Just trying to write a regular expression that allows a controlled vocabulary of 100k valid last names but nothing else is non-trivial. Not to mention the task of actually using such a regular expression.

But there is one tool that looks promising for solving this problem. Lucene 4.0′s new Automaton API. Lets explore what this API has to offer by first reminding ourselves about a bit of CS theory.

Are you motivated?

I am!

See John Berryman’s comment about matching patterns of words.

Then think about finding topics, associations and occurrences in free form data.

Or creating a collection of automata as a tool set for building topic maps.

Finite State Automata in Lucene

Monday, May 14th, 2012

Finite State Automata in Luceneby Mike McCandless

From the post:

Lucene Revolution 2012 is now done, and the talk Robert and I gave went well! We showed how we are using automata (FSAs and FSTs) to make great improvements throughout Lucene.

You can view the slides here.

This was the first time I used Google Docs exclusively for a talk, and I was impressed! The real-time collaboration was awesome: we each could see the edits the other was doing, live. You never have to “save” your document: instead, every time you make a change, the document is saved to a new revision and you can then use infinite undo, or step back through all revisions, to go back.

Finally, Google Docs covers the whole life-cycle of your talk: editing/iterating, presenting (it presents in full-screen just fine, but does require an internet connection; I exported to PDF ahead of time as a backup) and, finally, sharing with the rest of the world!

I must confess to disappointment when I read at slide 23 that “multi-token synonyms mess up graph.”

Particularly since I suspect that not only do synonyms need to be “multi-token” but “multi-dimensional” as well.