Archive for the ‘De Bruijn Graphs’ Category

The Story Behind “Scaling Metagenome Assembly with Probabilistic de Bruijn Graphs”

Friday, August 10th, 2012

The Story Behind “Scaling Metagenome Assembly with Probabilistic de Bruijn Graphs” by C. Titus Brown.

From the post:

This is the story behind our PNAS paper, “Scaling Metagenome Assembly with Probabilistic de Bruijn Graphs” (released from embargo this past Monday).

Why did we write it? How did it get started? Well, rewind the tape 2 years and more…

There we were in May 2010, sitting on 500 million Illumina reads from shotgun DNA sequencing of an Iowa prairie soil sample. We wanted to reconstruct the microbial community contents and structure of the soil sample, but we couldn’t figure out how to do that from the data. We knew that, in theory, the data contained a number of partial microbial genomes, and we had a technique — de novo genome assembly — that could (again, in theory) reconstruct those partial genomes. But when we ran the software, it choked — 500 million reads was too big a data set for the software and computers we had. Plus, we were looking forward to the future, when we would get even more data; if the software was dying on us now, what would we do when we had 10, 100, or 1000 times as much data?

A perfect post to read over the weekend!

Not all research ends successfully, but when it does, it is a story that inspires.

De novo assembly and genotyping of variants using colored de Bruijn graphs

Friday, August 3rd, 2012

De novo assembly and genotyping of variants using colored de Bruijn graphs by Zamin Iqbal, Mario Caccamo, Isaac Turner, Paul Flicek & Gil McVean. (Nature Genetics 44, 226–232 (2012))


Detecting genetic variants that are highly divergent from a reference sequence remains a major challenge in genome sequencing. We introduce de novo assembly algorithms using colored de Bruijn graphs for detecting and genotyping simple and complex genetic variants in an individual or population. We provide an efficient software implementation, Cortex, the first de novo assembler capable of assembling multiple eukaryotic genomes simultaneously. Four applications of Cortex are presented. First, we detect and validate both simple and complex structural variations in a high-coverage human genome. Second, we identify more than 3 Mb of sequence absent from the human reference genome, in pooled low-coverage population sequence data from the 1000 Genomes Project. Third, we show how population information from ten chimpanzees enables accurate variant calls without a reference sequence. Last, we estimate classical human leukocyte antigen (HLA) genotypes at HLA-B, the most variable gene in the human genome.

You will need access to Nature Genetics but rounding out today’s posts on de Bruijn graphs with a recent research article.

Comments on the Cortex software appreciated.

Genome assembly and comparison using de Bruijn graphs

Friday, August 3rd, 2012

Genome assembly and comparison using de Bruijn graphs by Daniel Robert Zerbino. (thesis)


Recent advances in sequencing technology made it possible to generate vast amounts of sequence data. The fragments produced by these high-throughput methods are, however, far shorter than in traditional Sanger sequencing. Previously, micro-reads of less than 50 base pairs were considered useful only in the presence of an existing assembly. This thesis describes solutions for assembling short read sequencing data de novo, in the absence of a reference genome.

The algorithms developed here are based on the de Bruijn graph. This data structure is highly suitable for the assembly and comparison of genomes for the following reasons. It provides a flexible tool to handle the sequence variants commonly found in genome evolution such as duplications, inversions or transpositions. In addition, it can combine sequences of highly different lengths, from short reads to assembled genomes. Finally, it ensures an effective data compression of highly redundant datasets.

This thesis presents the development of a collection of methods, called Velvet, to convert a de Bruijn graph into a traditional assembly of contiguous sequences. The first step of the process, termed Tour Bus, removes sequencing errors and handles biological variations such as polymorphisms. In its second part, Velvet aims to resolve repeats based on the available information, from low coverage long reads (Rock Band) or paired shotgun reads (Pebble). These methods were tested on various simulations for precision and efficiency, then on control experimental datasets.

De Bruijn graphs can also be used to detect and analyse structural variants from unassembled data. The final chapter of this thesis presents the results of collaborative work on the analysis of several experimental unassembled datasets.

De Bruijn graphs are covered in pages 22-42 if you want to cut to the chase.

Obviously of interest to the bioinformatics community.

Where else would you use de Bruijn graph structures?