Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

July 19, 2015

Aho-Corasick

Filed under: Algorithms,Search Algorithms — Patrick Durusau @ 2:31 pm

Aho-Corasick – Java implementation of the Aho-Corasick algorithm for efficient string matching.

From the webpage:

Nowadays most free-text searching is based on Lucene-like approaches, where the search text is parsed into its various components. For every keyword a lookup is done to see where it occurs. When looking for a couple of keywords this approach is great. But what about it if you are not looking for just a couple of keywords, but a 100,000 of them? Like, for example, checking against a dictionary?

This is where the Aho-Corasick algorithm shines. Instead of chopping up the search text, it uses all the keywords to build up a construct called a Trie. There are three crucial components to Aho-Corasick:

  • goto
  • fail
  • output

Every character encountered is presented to a state object within the goto structure. If there is a matching state, that will be elevated to the new current state.

However, if there is no matching state, the algorithm will signal a fail and fall back to states with less depth (ie, a match less long) and proceed from there, until it found a matching state, or it has reached the root state.

Whenever a state is reached that matches an entire keyword, it is emitted to an output set which can be read after the entire scan has completed.

The beauty of the algorithm is that it is O(n). No matter how many keywords you have, or how big the search text is, the performance will decline in a linear way.

Some examples you could use the Aho-Corasick algorithm for:

  • looking for certain words in texts in order to URL link or emphasize them
  • adding semantics to plain text
  • checking against a dictionary to see if syntactic errors were made

This library is the Java implementation of the afore-mentioned Aho-Corasick algorithm for efficient string matching. The algorithm is explained in great detail in the white paper written by Aho and Corasick: ftp://163.13.200.222/assistant/bearhero/prog/%A8%E4%A5%A6/ac_bm.pdf

The link to the Aho-Corasick paper timed out on me. Try: Efficient String Matching: An Aid to Bibliographic Search (CiteSeer).

The line that caught my eye was “adding semantics to plain text.

Apologies for the “lite” posting over the past several days. I have just completed a topic maps paper for the Balisage conference in collaboration with Sam Hunting. My suggestion is that you register before all the seats are gone.

No Comments

No comments yet.

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress