Archive for the ‘Suffix Array’ Category

Speedy Short and Long DNA Reads

Monday, August 25th, 2014

Acceleration of short and long DNA read mapping without loss of accuracy using suffix array by Joaquín Tárraga, et al. (Bioinformatics (2014) doi: 10.1093/bioinformatics/btu553)

Abstract:

HPG Aligner applies suffix arrays for DNA read mapping. This implementation produces a highly sensitive and extremely fast mapping of DNA reads that scales up almost linearly with read length. The approach presented here is faster (over 20x for long reads) and more sensitive (over 98% in a wide range of read lengths) than the current, state-of-the-art mappers. HPG Aligner is not only an optimal alternative for current sequencers but also the only solution available to cope with longer reads and growing throughputs produced by forthcoming sequencing technologies.

Always nice to see an old friend, suffix arrays, in the news!

Source code: https://github.com/opencb/hpg-aligner.

For documentation and software: http://wiki.opencb.org/projects/hpg/doku.php?id=aligner:overview

I first saw this in a tweet by Bioinfocipf.

…[S]uffix array construction algorithms

Tuesday, March 25th, 2014

A bioinformatician’s guide to the forefront of suffix array construction algorithms by Anish Man Singh Shrestha, Martin C. Frith, and Paul Horton.

Abstract:

The suffix array and its variants are text-indexing data structures that have become indispensable in the field of bioinformatics. With the uninitiated in mind, we provide an accessible exposition of the SA-IS algorithm, which is the state of the art in suffix array construction. We also describe DisLex, a technique that allows standard suffix array construction algorithms to create modified suffix arrays designed to enable a simple form of inexact matching needed to support ‘spaced seeds’ and ‘subset seeds’ used in many biological applications.

If this doesn’t sound like a real page turner, take the authors at their word: the exposition is written with the uninitiated in mind, and it is one of the more accessible treatments of suffix array construction you will find.

Reminds me that computer science departments need to start offering courses in “string theory,” to capitalize on the popularity of that phrase. 😉
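The “spaced seeds” mentioned in the abstract are easy to illustrate. A seed is a mask over k-mer positions: 1 means the position must match, 0 means don’t care, so two strings can hit the same index entry despite mismatches at the ignored positions. A toy hash-based sketch of the seed idea (this is not DisLex itself, which transforms the text so a standard suffix array supports such seeds; the seed pattern and text below are made up):

```python
SEED = "110101"  # hypothetical mask: 1 = must match, 0 = don't care

def masked(s, seed=SEED):
    # Keep only the characters at the seed's "care" positions.
    return "".join(c for c, bit in zip(s, seed) if bit == "1")

def seed_index(text, seed=SEED):
    # Map each masked k-mer to the positions where it occurs.
    k = len(seed)
    index = {}
    for i in range(len(text) - k + 1):
        index.setdefault(masked(text[i:i + k], seed), []).append(i)
    return index

text = "ACGTACGACGTT"
idx = seed_index(text)
query = "ACATAC"  # differs from text[0:6] = "ACGTAC" only at a don't-care position
print(idx.get(masked(query), []))  # -> [0]: an inexact match found exactly
```

The point of DisLex, as the abstract describes it, is to get this kind of inexact matching out of ordinary suffix array construction rather than a separate hash index.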

On-demand Synonym Extraction Using Suffix Arrays

Saturday, August 24th, 2013

On-demand Synonym Extraction Using Suffix Arrays by Minoru Yoshida, Hiroshi Nakagawa, and Akira Terada. (Yoshida, M., Nakagawa, H. & Terada, A. (2013). On-demand Synonym Extraction Using Suffix Arrays. Information Extraction from the Internet. ISBN: 978-1463743994. iConcept Press. Retrieved from http://www.iconceptpress.com/books//information-extraction-from-the-internet/)

From the introduction:

The amount of electronic documents available on the World Wide Web (WWW) is continuously growing. The situation is the same in a limited part of the WWW, e.g., Web documents from specific web sites such as ones of some specific companies or universities, or some special-purpose web sites such as www.wikipedia.org, etc. This chapter mainly focuses on such a limited-size corpus. Automatic analysis of this large amount of data by text-mining techniques can produce useful knowledge that is not found by human efforts only.

We can use the power of on-memory text mining for such a limited-size corpus. Fast search for required strings or words available by putting whole documents on memory contributes to not only speeding up of basic search operations like word counting, but also making possible more complicated tasks that require a number of search operations. For such advanced text-mining tasks, this chapter considers the problem of extracting synonymous strings for a query given by users. Synonyms, or paraphrases, are words or phrases that have the same meaning but different surface strings. “HDD” and “hard drive” in documents related to computers and “BBS” and “message boards” in Web pages are examples of synonyms. They appear ubiquitously in different types of documents because the same concept can often be described by two or more expressions, and different writers may select different words or phrases to describe the same concept. In such cases, the documents that include the string “hard drive” might not be found if the query “HDD” is used, which results in a drop in the coverage of the search system. This could become a serious problem, especially for searches of limited-size corpora. Therefore, being able to find such synonyms significantly improves the usability of various systems. Our goal is to develop an algorithm that can find strings synonymous with the user input. The applications of such an algorithm include augmenting queries with synonyms in information retrieval or text-mining systems, and assisting input systems by suggesting expressions similar to the user input.

The authors concede the results of their method are inferior to the best results of other synonym extraction methods but go on to say:

However, note that the main advantage of our method is not its accuracy, but its ability to extract synonyms of any query without a priori construction of thesauri or preprocessing using other linguistic tools like POS taggers or dependency parsers, which are indispensable for previous methods.
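The distributional intuition behind this kind of on-demand extraction can be sketched in a few lines: strings that occur in the same surrounding contexts as the query are candidate synonyms. A toy illustration (not the authors’ algorithm, which uses suffix arrays for the context search; the corpus and candidate list here are invented):

```python
import re
from collections import Counter

def contexts(corpus, term):
    # Multiset of (word before, word after) pairs around each occurrence.
    pattern = re.compile(r"(\w+) " + re.escape(term) + r" (\w+)")
    return Counter(pattern.findall(corpus))

def rank_candidates(corpus, query, candidates):
    # Score each candidate by how many context pairs it shares with the query.
    q = contexts(corpus, query)
    scores = {c: sum((q & contexts(corpus, c)).values()) for c in candidates}
    return sorted(scores, key=scores.get, reverse=True)

corpus = ("replace the HDD in the laptop . replace the hard drive in the "
          "laptop . format the HDD before use . format the hard drive before use")
print(rank_candidates(corpus, "HDD", ["hard drive", "laptop"]))
# "hard drive" ranks first: it shares the contexts (the, in) and (the, before)
```

Nothing here needs a pre-built thesaurus, POS tagger, or parser, which is exactly the advantage the authors claim; a suffix array makes the context search fast enough to run per query over an on-memory corpus.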

An important point to remember about all semantic technologies. How appropriate a technique is for your project depends on your requirements, not qualities of a technique in the abstract.

Technique N may not support machine reasoning, but sending coupons to mobile phones “near” a restaurant doesn’t require that overhead. (Neither does standing outside the restaurant with flyers.)

Choose semantic techniques based on their suitability for your purposes.

Massively Parallel Suffix Array Queries…

Saturday, May 11th, 2013

Massively Parallel Suffix Array Queries and On-Demand Phrase Extraction for Statistical Machine Translation Using GPUs by Hua He, Jimmy Lin, Adam Lopez.

Abstract:

Translation models can be scaled to large corpora and arbitrarily-long phrases by looking up translations of source phrases on the fly in an indexed parallel text. However, this is impractical because on-demand extraction of phrase tables is a major computational bottleneck. We solve this problem by developing novel algorithms for general purpose graphics processing units (GPUs), which enable suffix array queries for phrase lookup and phrase extractions to be massively parallelized. Our open-source implementation improves the speed of a highly-optimized, state-of-the-art serial CPU-based implementation by at least an order of magnitude. In a Chinese-English translation task, our GPU implementation extracts translation tables from approximately 100 million words of parallel text in less than 30 milliseconds.

If you think about topic maps as mapping the identification of a subject in multiple languages to a single representative, then the value of translation software becomes obvious.

You may or may not, depending upon project requirements, want to rely solely on automated mappings of phrases.

Whether you use automated mapping of phrases as an “assist” to or as a sanity check on human curation, this work looks very interesting.