Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

July 17, 2012

Memory Efficient De Bruijn Graph Construction [Attn: Graph Builders, Chess Anyone?]

Filed under: Bioinformatics,Genome,Graphs,Networks — Patrick Durusau @ 10:44 am

Memory Efficient De Bruijn Graph Construction by Yang Li, Pegah Kamousi, Fangqiu Han, Shengqi Yang, Xifeng Yan, and Subhash Suri.

Abstract:

Massively parallel DNA sequencing technologies are revolutionizing genomics research. Billions of short reads generated at low costs can be assembled for reconstructing the whole genomes. Unfortunately, the large memory footprint of the existing de novo assembly algorithms makes it challenging to get the assembly done for higher eukaryotes like mammals. In this work, we investigate the memory issue of constructing de Bruijn graph, a core task in leading assembly algorithms, which often consumes several hundreds of gigabytes memory for large genomes. We propose a disk-based partition method, called Minimum Substring Partitioning (MSP), to complete the task using less than 10 gigabytes memory, without runtime slowdown. MSP breaks the short reads into multiple small disjoint partitions so that each partition can be loaded into memory, processed individually and later merged with others to form a de Bruijn graph. By leveraging the overlaps among the k-mers (substring of length k), MSP achieves astonishing compression ratio: The total size of partitions is reduced from $\Theta(kn)$ to $\Theta(n)$, where $n$ is the size of the short read database, and $k$ is the length of a $k$-mer. Experimental results show that our method can build de Bruijn graphs using a commodity computer for any large-volume sequence dataset.

A discovery in one area of data processing can have a large impact in a number of others. I suspect that will be the case with the technique described here.

The use of substrings for compression and to determine the creation of partitions was particularly clever.

Software and data sets

Questions:

  1. What are the substring characteristics of your data?
  2. How would you use a De Bruijn graph with your data?

If you don’t know the answers to those questions, you might want to find out.

Additional Resources:

De Bruijn Graph (Wikipedia)

De Bruijn Sequence (Wikipedia)

How to apply de Bruijn graphs to genome assembly by Phillip E C Compeau, Pavel A Pevzner, and Glenn Tesler. Nature Biotechnology 29, 987–991 (2011) doi:10.1038/nbt.2023

And De Bruijn graphs/sequences are not just for bioinformatics: from the Chess Programming Wiki: De Bruijn Sequences. (Lots of pointers and additional references.)

1 Comment

  1. […] If “de Bruijn graphs,” sounds familiar, see: Memory Efficient De Bruijn Graph Construction [Attn: Graph Builders, Chess Anyone?]. […]

    Pingback by Is “Massive Data” > “Big Data”? « Another Word For It — August 3, 2012 @ 8:33 am

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress