A Compressed Self-Index for Genomic Databases by Travis Gagie, Juha Kärkkäinen, Yakov Nekrich, and Simon J. Puglisi.
Abstract:
Advances in DNA sequencing technology will soon result in databases of thousands of genomes. Within a species, individuals’ genomes are almost exact copies of each other; e.g., any two human genomes are 99.9% the same. Relative Lempel-Ziv (RLZ) compression takes advantage of this property: it stores the first genome uncompressed or as an FM-index, then compresses the other genomes with a variant of LZ77 that copies phrases only from the first genome. RLZ achieves good compression and supports fast random access; in this paper we show how to support fast search as well, thus obtaining an efficient compressed self-index.
As the authors note, an area with rapidly increasing need for efficient effective indexing.
It would be a step forward to see a comparison of this method on a common genome set with:
- A hash trie filter method for approximate string matching in genomic databases by Ye-In Chang, Jiun-Rung Chen, and Min-Tze Hsu. Applied Intelligence, Volume 33 Issue 1, August 2010, Kluwer Academic Publishers Hingham, MA, USA.
- Bitmapped Vector Trie, Extreme Cleverness: Functional Data Structures in Scala (presentation by Daniel Spiewak), http://github.com/djspiewak/extreme-cleverness (source code).
I suppose I am presuming a common genome data set for indexing demonstrations.
Questions:
- Is there a common genome data set for comparison of indexing techniques?
- Are there other indexing techniques that should be included in a comparison?
Obviously important for topic maps used in genome projects.
But insights about identification of subjects that vary only slightly in one (or more) dimensions to identify different subjects, will be useful in other contexts.
An easy example would be isotopes. Let’s see, ah, celestial or other coordinate systems. Don’t know but would guess that spectra from stars/galaxies are largely common. (Do you know for sure?) What other data sets have subjects that are identified on the basis of small or incremental changes in a largely identical identifier?