Archive for the ‘Finite State Automata’ Category

Taxonomies and Toolkits of Regular Language Algorithms

Wednesday, September 10th, 2014

Taxonomies and Toolkits of Regular Language Algorithms by Bruce William Watson.

From 1.1 Problem Statement:

A number of fundamental computing science problems have been extensively studied since the 1950s and the 1960s. As these problems were studied, numerous solutions (in the form of algorithms) were developed over the years. Although new algorithms still appear from time to time, each of these fields can be considered mature. In the solutions to many of the well-studied computing science problems, we can identify three deficiencies:

  1. Algorithms solving the same problem are difficult to compare to one another. This is usually due to the use of different programming languages, styles of presentation, or simply the addition of unnecessary details.
  2. Collections of implementations of algorithms solving a problem are difficult, if not impossible, to find. Some of the algorithms are presented in a relatively obsolete manner, either using old notations or programming languages for which no compilers exist, making it difficult to either implement the algorithm or find an existing implementation.
  3. Little is known about the comparative practical running time performance of the algorithms. The lack of existing implementations in one and the same framework, especially of the older algorithms, makes it difficult to determine the running time characteristics of the algorithms. A software engineer selecting one of the algorithms will usually do so on the basis of the algorithm’s theoretical running time, or simply by guessing.

In this dissertation, a solution to each of the three deficiencies is presented for each of the following three fundamental computing science problems:

  1. Keyword pattern matching in strings. Given a finite non-empty set of keywords (the patterns) and an input string, find the set of all occurrences of a keyword as a substring of the input string.
  2. Finite automata (FA) construction. Given a regular expression, construct a finite automaton which accepts the language denoted by the regular expression.
  3. Deterministic finite automata (DFA) minimization. Given a DFA, construct the unique minimal DFA accepting the same language.

We do not necessarily consider all the known algorithms solving the problems. For example, we restrict ourselves to batch-style algorithms1, as opposed to incremental algorithms2.

Requires updating given its age, 1995, but a work merits mention.

I first saw this in a tweet by silentbicycle.srec.

Parsing Drug Dosages in text…

Sunday, March 30th, 2014

Parsing Drug Dosages in text using Finite State Machines by Sujit Pal.

From the post:

Someone recently pointed out an issue with the Drug Dosage FSM in Apache cTakes on the cTakes mailing list. Looking at the code for it revealed a fairly complex implementation based on a hierarchy of Finite State Machines (FSM). The intuition behind the implementation is that Drug Dosage text in doctor’s notes tend to follow a standard-ish format, and FSMs can be used to exploit this structure and pull out relevant entities out of this text. The paper Extracting Structured Medication Event Information from Discharge Summaries has more information about this problem. The authors provide their own solution, called the Merki Medication Parser. Here is a link to their Online Demo and source code (Perl).

I’ve never used FSMs myself, although I have seen it used to model (more structured) systems. So the idea of using FSMs for parsing semi-structured text such as this seemed interesting and I decided to try it out myself. The implementation I describe here is nowhere nearly as complex as the one in cTakes, but on the flip side, is neither as accurate, nor broad nor bulletproof either.

My solution uses drug dosage phrase data provided in this Pattern Matching article by Erin Rhode (which also comes with a Perl based solution), as well as its dictionaries (with additions by me), to model the phrases with the state diagram below. I built the diagram by eyeballing the outputs from Erin Rhode’s program. I then implement the state diagram with a home-grown FSM implementation based on ideas from Electric Monk’s post on FSMs in Python and the documentation for the Java library Tungsten FSM. I initially tried to use Tungsten-FSM, but ended up with extremely verbose Scala code because of Scala’s stricter generics system.

This caught my attention because I was looking at a data import handler recently that was harvesting information from a minimal XML wrapper around mediawiki markup. Works quite well but seems like a shame to miss all the data in wiki markup.

I say “miss all the data in wiki markup” and that’s not really fair. It is dumped into a single field for indexing. But that is a field that loses the context distinctions between a note, appendix, bibliography, or even the main text.

If you need distinctions that aren’t the defaults, you may be faced with rolling your own FSM. This post should help get you started.

Lucene 4 Finite State Automata In 10 Minutes (Intro & Tutorial)

Sunday, February 24th, 2013

Lucene 4 Finite State Automata In 10 Minutes (Intro & Tutorial) by Doug Turnbull.

From the post:

This article is intended to help you bootstrap your ability to work with Finite State Automata (note automata == plural of automaton). Automata are a unique data structure, requiring a bit of theory to process and understand. Hopefully what’s below can give you a foundation for playing with these fun and useful Lucene data structures!

Motivation, Why Automata?

When working in search, a big part of the job is making sense of loosely-structured text. For example, suppose we have a list of about 1000 valid first names and 100,000 last names. Before ingesting data into a search application, we need to extract first and last names from free-form text.

Unfortunately the data sometimes has full names in the format “LastName, FirstName” like “Turnbull, Doug”. In other places, however, full names are listed “FirstName LastName” like “Doug Turnbull”. Add a few extra representations, and to make sense out of what strings represent valid names becomes a chore.

This becomes especially troublesome when we’re depending on these as natural identifiers for looking up or joining across multiple data sets. Each data set might textually represent the natural identifier in subtly different ways. We want to capture the representations across multiple data sets to ensure our join works properly.

So… Whats a text jockey to do when faced with such annoying inconsistencies?

You might initially think “regular expression”. Sadly, a normal regular expression can’t help in this case. Just trying to write a regular expression that allows a controlled vocabulary of 100k valid last names but nothing else is non-trivial. Not to mention the task of actually using such a regular expression.

But there is one tool that looks promising for solving this problem. Lucene 4.0′s new Automaton API. Lets explore what this API has to offer by first reminding ourselves about a bit of CS theory.

Are you motivated?

I am!

See John Berryman’s comment about matching patterns of words.

Then think about finding topics, associations and occurrences in free form data.

Or creating a collection of automata as a tool set for building topic maps.

Finite State Automata in Lucene

Monday, May 14th, 2012

Finite State Automata in Luceneby Mike McCandless

From the post:

Lucene Revolution 2012 is now done, and the talk Robert and I gave went well! We showed how we are using automata (FSAs and FSTs) to make great improvements throughout Lucene.

You can view the slides here.

This was the first time I used Google Docs exclusively for a talk, and I was impressed! The real-time collaboration was awesome: we each could see the edits the other was doing, live. You never have to “save” your document: instead, every time you make a change, the document is saved to a new revision and you can then use infinite undo, or step back through all revisions, to go back.

Finally, Google Docs covers the whole life-cycle of your talk: editing/iterating, presenting (it presents in full-screen just fine, but does require an internet connection; I exported to PDF ahead of time as a backup) and, finally, sharing with the rest of the world!

I must confess to disappointment when I read at slide 23 that “multi-token synonyms mess up graph.”

Particularly since I suspect that not only do synonyms need to be “multi-token” but “multi-dimensional” as well.