Archive for the ‘Automata’ Category

Recreational Constraint Programmer

Monday, November 16th, 2015 [Embedding of the video disabled at the source. Follow the link.]

From the description:

Many of us have hazy memories of finite state machines from computer science theory classes in college. But finite state machines (FSMs) have real, practical value, and it is useful to know how to build and apply them in Clojure. For example, FSMs have long been popular to model game AIs and workflow rules, and FSMs provide the behind-the-scenes magic that powers Java’s regexes and core.async’s go blocks. In this talk, we’ll look at two programming puzzles that, suprisingly, have very elegant solutions when looked at through the lens of FSMs, with code demonstrations using two different Clojure libraries for automata (automat and reduce-fsm), as well as loco, a Clojure constraint solver.

If you have never heard anyone describe themselves as a “recreational constraint programmer,” you really need to see this video!

If you think about having a single representative for a subject as a constraint on a set of topics, the question becomes what properties must each topic have to facilitate that constraint?

Some properties, such as family names, will lead to over-merging of topics and other properties, such as possession of one and only one social security number, will under-merge topics where a person has multiple social security numbers.

The best code demonstration in the video was the generation of a fairly complex cross-word puzzle, sans the clues for each word. I think the clues were left as an exercise for the reader. 😉

Code Repositories:

Encouraging enough that you might want to revisit regular expressions.


Taxonomies and Toolkits of Regular Language Algorithms

Wednesday, September 10th, 2014

Taxonomies and Toolkits of Regular Language Algorithms by Bruce William Watson.

From 1.1 Problem Statement:

A number of fundamental computing science problems have been extensively studied since the 1950s and the 1960s. As these problems were studied, numerous solutions (in the form of algorithms) were developed over the years. Although new algorithms still appear from time to time, each of these fields can be considered mature. In the solutions to many of the well-studied computing science problems, we can identify three deficiencies:

  1. Algorithms solving the same problem are difficult to compare to one another. This is usually due to the use of different programming languages, styles of presentation, or simply the addition of unnecessary details.
  2. Collections of implementations of algorithms solving a problem are difficult, if not impossible, to find. Some of the algorithms are presented in a relatively obsolete manner, either using old notations or programming languages for which no compilers exist, making it difficult to either implement the algorithm or find an existing implementation.
  3. Little is known about the comparative practical running time performance of the algorithms. The lack of existing implementations in one and the same framework, especially of the older algorithms, makes it difficult to determine the running time characteristics of the algorithms. A software engineer selecting one of the algorithms will usually do so on the basis of the algorithm’s theoretical running time, or simply by guessing.

In this dissertation, a solution to each of the three deficiencies is presented for each of the following three fundamental computing science problems:

  1. Keyword pattern matching in strings. Given a finite non-empty set of keywords (the patterns) and an input string, find the set of all occurrences of a keyword as a substring of the input string.
  2. Finite automata (FA) construction. Given a regular expression, construct a finite automaton which accepts the language denoted by the regular expression.
  3. Deterministic finite automata (DFA) minimization. Given a DFA, construct the unique minimal DFA accepting the same language.

We do not necessarily consider all the known algorithms solving the problems. For example, we restrict ourselves to batch-style algorithms1, as opposed to incremental algorithms2.

Requires updating given its age, 1995, but a work merits mention.

I first saw this in a tweet by silentbicycle.srec.

Scoring tennis using finite-state automata

Saturday, August 30th, 2014

Scoring tennis using finite-state automata by Michael McCandless.

From the post:

For some reason having to do with the medieval French, the scoring system for tennis is very strange.

In actuality, the game is easy to explain: to win, you must score at least 4 points and win by at least 2. Yet in practice, you are supposed to use strange labels like “love” (0 points), “15” (1 point), “30” (2 points), “40” (3 points), “deuce” (3 or more points each, and the players are tied), “all” (players are tied) instead of simply tracking points as numbers, as other sports do.

This is of course wildly confusing to newcomers. Fortunately, the convoluted logic is easy to express as a finite-state automaton (FSA):

And you thought that CS course in automata wasn’t going to be useful. 😉

Michael goes on to say:

FSA minimization saved only 3 states for the game of tennis, resulting in a 10% smaller automaton, and maybe this simplifies keeping track of scores in your games by a bit, but in other FSA applications in Lucene, such as the analyzing suggester, MemoryPostingsFormat and the terms index, minimization is vital since saves substantial disk and RAM for Lucene applications!

A funny introduction with a serious purpose!

Speaking of Automata

Tuesday, August 5th, 2014

Since I just mentioned Michael McCandless’ post on automata in Lucene 4.10, it seems like a good time to say that Jeffrey Ullman will be teaching automata-003 starting Monday Sept. 1, 2014.

I got a bulk email from the course administrators saying that Stanford had given its permission and that arrangements are underway with Coursera.

If you want to see the prior course: Or you could start watching early!

A new proximity query for Lucene, using automatons

Tuesday, August 5th, 2014

A new proximity query for Lucene, using automatons by Michael McCandless.

From the post:

As of Lucene 4.10 there will be a new proximity query to further generalize on MultiPhraseQuery and the span queries: it allows you to directly build an arbitrary automaton expressing how the terms must occur in sequence, including any transitions to handle slop.


This is a very expert query, allowing you fine control over exactly what sequence of tokens constitutes a match. You build the automaton state-by-state and transition-by-transition, including explicitly adding any transitions (sorry, no QueryParser support yet, patches welcome!). Once that’s done, the query determinizes the automaton and then uses the same infrastructure (e.g. CompiledAutomaton) that queries like FuzzyQuery use for fast term matching, but applied to term positions instead of term bytes. The query is naively scored like a phrase query, which may not be ideal in some cases.

Micahael walks through current proximity queries before diving into the new proximity query for Lucene 4.10.

As always, this is a real treat!

Automata [Starts 4 Nov. 2013]

Tuesday, October 8th, 2013

Automata by Jeff Ullman.

From the course description:

Why Study Automata Theory?

This subject is not just for those planning to enter the field of complexity theory, although it is a good place to start if that is your goal. Rather, the course will emphasize those aspects of the theory that people really use in practice. Finite automata, regular expressions, and context-free grammars are ideas that have stood the test of time. They are essential tools for compilers. But more importantly, they are used in many systems that require input that is less general than a full programming language yet more complex than “push this button.”

The concepts of undecidable problems and intractable problems serve a different purpose. Undecidable problems are those for which no computer solution can ever exist, while intractable problems are those for which there is strong evidence that, although they can be solved by a computer, they cannot be solved sufficiently fast that the solution is truly useful in practice. Understanding this theory, and in particular being able to prove that a problem you are facing belongs to one of these classes, allows you to justify taking another approach — simplifying the problem or writing code to approximate the solution, for example.

During the course, I’m going to prove a number of things. The purpose of these proofs is not to torture you or confuse you. Neither are the proofs there because I doubt you would believe me were I merely to state some well-known fact. Rather, understanding how these proofs, especially inductive proofs, work, lets you think more clearly about your own work. I do not advocate proofs that programs are correct, but whenever you attempt something a bit complex, it is good to have in mind the inductive proofs that would be needed to guarantee that what you are doing really works in all cases.

Recommended Background

You should have had a second course in Computer Science — one that covers basic data structures (e.g., lists, trees, hashing), and basic algorithms (e.g., tree traversals, recursive programming, big-oh running time). In addition, a course in discrete mathematics covering propositional logic, graphs, and inductive proofs is valuable background.

If you need to review or learn some of these topics, there is a free on-line textbook Foundations of Computer Science, written by Al Aho and me, available at Recommended chapters include 2 (Recursion and Induction), 3 (Running Time of Programs), 5 (Trees), 6 (Lists), 7 (Sets), 9 (Graphs), and 12 (Propositional Logic). You will also find introductions to finite automata, regular expressions, and context-free grammars in Chapters 10 and 11. Reading Chapter 10 would be good preparation for the first week of the course.

The course includes two programming exercises for which a knowledge of Java is required. However, these exercises are optional. You will receive automated feedback, but the results will not be recorded or used to grade the course. So if you are not familiar with Java, you can still take the course without concern for prerequisites.

All of “Foundations of Computer Science” is worth reading but for this course:

Chapter 2 Iteration, Induction, and Recursion
Chapter 3 The Running Time of Programs
Chapter 5 The Tree Data Model
Chapter 6 The List Data Model
Chapter 7 The Set Data Model
Chapter 9 The Graph Data Model
Chapter 10 Patterns, Automata, and Regular Expressions
Chapter 11 Recursive Description of Patterns
Chapter 12 Propositional Logic

Six very intensive weeks but on the bright side, you will be done before the holiday season. 😉