Archive for the ‘Suffix Tree’ Category

Suffix Trees and their Applications in String Algorithms

Wednesday, November 19th, 2014

Suffix Trees and their Applications in String Algorithms by Roberto Grossi and Giuseppe F. Italiano.

Abstract:

The suffix tree is a compacted trie that stores all suffixes of a given text string. This data structure has been intensively employed in pattern matching on strings and trees, with a wide range of applications, such as molecular biology, data processing, text editing, term rewriting, interpreter design, information retrieval, abstract data types and many others.

In this paper, we survey some applications of suffix trees and some algorithmic techniques for their construction. Special emphasis is given to the most recent developments in this area, such as parallel algorithms for suffix tree construction and generalizations of suffix trees to higher dimensions, which are important in multidimensional pattern matching.

The authors point out that “suffix tree” is only one of the names for this subject:

The importance of the suffix tree is underlined by the fact that it has been rediscovered many times in the scientific literature, disguised under different names, and that it has been studied under numerous variations. Just to mention a few appearances of the suffix tree, we cite the compacted bi-tree [101], the prefix tree [24], the PAT tree [50], the position tree [3, 65, 75], the repetition finder [82], and the subword tree [8, 24]….

Which is an advantage if you are researching another survey paper and tracing every thread on suffix trees by whatever name, not so much of an advantage if you miss this paper or an application under name other than “suffix tree.”

Of course, a search with:

“suffix tree” OR “impacted by-tree” OR “prefix tree” OR “PAT tree” OR “position tree” OR “repetition finder” OR “subword tree”

that returns About 1,830,000 results (0.22 seconds), isn’t very helpful.

In part because no one is going to examine 1,830,000 results and that is a subset of all the terms for suffix trees.

I think we can do better than that and without an unreasonable expenditure of resources. (See: Less Than Universal & Uniform Indexing)

…[S]uffix array construction algorithms

Tuesday, March 25th, 2014

A bioinformatician’s guide to the forefront of suffix array construction algorithms by Anish Man Singh Shrestha, Martin C. Frith, and Paul Horton.

Abstract:

The suffix array and its variants are text-indexing data structures that have become indispensable in the field of bioinformatics. With the uninitiated in mind, we provide an accessible exposition of the SA-IS algorithm, which is the state of the art in suffix array construction. We also describe DisLex, a technique that allows standard suffix array construction algorithms to create modified suffix arrays designed to enable a simple form of inexact matching needed to support ‘spaced seeds’ and ‘subset seeds’ used in many biological applications.

If this doesn’t sound like a real page turner, consider the authors’ concluding paragraph:

The suffix array and its variants are text-indexing data structures that have become indispensable in the field of bioinformatics. With the uninitiated in mind, we provide an accessible exposition of the SA-IS algorithm, which is the state of the art in suffix array construction. We also describe DisLex, a technique that allows standard suffix array construction algorithms to create modified suffix arrays designed to enable a simple form of inexact matching needed to support ‘spaced seeds’ and ‘subset seeds’ used in many biological applications.

Reminds me that computer science departments need to start offering courses in “string theory,” to capitalize on the popularity of that phrase. 😉

Enhancing Graph Database Indexing by Suffix Tree Structure

Tuesday, October 19th, 2010

Enhancing Graph Database Indexing by Suffix Tree Structure
Authors: Vincenzo Bonnici, Alfredo Ferro, Rosalba Giugno, Alfredo Pulvirenti, Dennis Shasha Keywords: subgraph isomorphism, graph database search, indexing, suffix tree, molecular database

Abstract:

Biomedical and chemical databases are large and rapidly growing in size. Graphs naturally model such kinds of data. To fully exploit the wealth of information in these graph databases, scientists require systems that search for all occurrences of a query graph. To deal efficiently with graph searching, advanced methods for indexing, representation and matching of graphs have been proposed.

This paper presents GraphGrepSX. The system implements efficient graph searching algorithms together with an advanced filtering technique. GraphGrepSX is compared with SING, GraphFind, CTree and GCoding. Experiments show that GraphGrepSX outperforms the compared systems on a very large collection of molecular data. In particular, it reduces the size and the time for the construction of large database index and outperforms the most popular systems. (hyperlinks added.)

Be aware that bioinformatics is at the cutting edge of search/retrieval technology. Pick up any proceedings volume for the last year to see what I mean.

A credible topic map is going to incorporate one or more of the techniques you will find there, plus semantic mapping based on those techniques.

Saying Topic-Association-Occurrence is only going to get you past the first two minutes of your presentation. You will need something familiar (to your audience) and domain specific to fill the rest of your time.

BTW, see the audience posting earlier today. Don’t guess what will interest your audience. Ask someone in that community what interests them.