Efficient comparison of sets of intervals with NC-lists by Matthias Zytnicki, YuFei Luo and Hadi Quesneville. (Bioinformatics (2013) 29 (7): 933-939. doi: 10.1093/bioinformatics/btt070)
Abstract:
Motivation: High-throughput sequencing produces in a small amount of time a large amount of data, which are usually difficult to analyze. Mapping the reads to the transcripts they originate from, to quantify the expression of the genes, is a simple, yet time demanding, example of analysis. Fast genomic comparison algorithms are thus crucial for the analysis of the ever-expanding number of reads sequenced.
Results: We used NC-lists to implement an algorithm that compares a set of query intervals with a set of reference intervals in two steps. The first step, a pre-processing done once for all, requires time O[#R log(#R) + #Q log(#Q)], where Q and R are the sets of query and reference intervals. The search phase requires constant space, and time O(#R + #Q + #M), where M is the set of overlaps. We showed that our algorithm compares favorably with five other algorithms, especially when several comparisons are performed.
Availability: The algorithm has been included to S–MART, a versatile tool box for RNA-Seq analysis, freely available at http://urgi.versailles.inra.fr/Tools/S-Mart. The algorithm can be used for many kinds of data (sequencing reads, annotations, etc.) in many formats (GFF3, BED, SAM, etc.), on any operating system. It is thus readily useable for the analysis of next-generation sequencing data.
Before you search for “NC-lists,” be aware that, as of today, some popular search engines return this article as the first “hit,” followed by a variety of lists for North Carolina.
A more useful search engine would allow me to choose the correct usage of a term and to re-run the query using the distinguished subject.
The expansion helps: Nested Containment List (NCList).
Familiar if you are working in bioinformatics.
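For readers outside bioinformatics, here is a minimal in-memory sketch of the nested containment idea, not the implementation from either paper. The names `Node`, `build_nclist` and `find_overlaps` are hypothetical, and half-open `[start, end)` intervals are an assumption made for the example. The key property is that once every contained interval is pushed down into its container's sublist, no interval in any one list contains another, so both starts and ends are sorted and a binary search can find the first candidate.

```python
class Node:
    """One interval plus the intervals nested inside it (hypothetical helper)."""
    def __init__(self, start, end):
        self.start = start
        self.end = end
        self.children = []

def build_nclist(intervals):
    """Build the top-level list from (start, end) pairs, assumed half-open."""
    # Sort by start ascending, end descending, so containers come first.
    ordered = sorted(intervals, key=lambda iv: (iv[0], -iv[1]))
    top, stack = [], []  # stack holds the chain of currently open containers
    for start, end in ordered:
        node = Node(start, end)
        # Drop containers that end before this interval does.
        while stack and stack[-1].end < end:
            stack.pop()
        (stack[-1].children if stack else top).append(node)
        stack.append(node)
    return top

def find_overlaps(nclist, qstart, qend):
    """Return every stored interval that overlaps [qstart, qend)."""
    hits = []
    # Ends increase within a list: binary search for the first end > qstart.
    lo, hi = 0, len(nclist)
    while lo < hi:
        mid = (lo + hi) // 2
        if nclist[mid].end <= qstart:
            lo = mid + 1
        else:
            hi = mid
    # Starts increase too: scan forward until an interval starts past the query.
    while lo < len(nclist) and nclist[lo].start < qend:
        node = nclist[lo]
        hits.append((node.start, node.end))
        # A child can only overlap the query if its container does.
        hits.extend(find_overlaps(node.children, qstart, qend))
        lo += 1
    return hits

reference = [(1, 10), (2, 5), (3, 4), (6, 9), (12, 20), (15, 18), (19, 25)]
nclist = build_nclist(reference)
print(find_overlaps(nclist, 4, 7))    # [(1, 10), (2, 5), (6, 9)]
print(find_overlaps(nclist, 18, 19))  # [(12, 20)]
```

The sort in `build_nclist` corresponds to the one-time pre-processing cost mentioned in the abstract; the comparison of two whole sets described there is a different, more refined search than this single-query lookup.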
More generally, consider the need to compare complex sequences of values for merging purposes.
Not a magic bullet but a technique you should keep in mind.
Origin: Nested Containment List (NCList): a new algorithm for accelerating interval query of genome alignment and interval databases, Alexander V. Alekseyenko and Christopher J. Lee. (Bioinformatics (2007) 23 (11): 1386-1393. doi: 10.1093/bioinformatics/btl647)