## Archive for the ‘Genome’ Category

### Biological Database of Images and Genomes

Wednesday, April 3rd, 2013

Biological Database of Images and Genomes: tools for community annotations linking image and genomic information by Andrew T Oberlin, Dominika A Jurkovic, Mitchell F Balish and Iddo Friedberg. (Database (2013) 2013 : bat016 doi: 10.1093/database/bat016)

Abstract:

Genomic data and biomedical imaging data are undergoing exponential growth. However, our understanding of the phenotype–genotype connection linking the two types of data is lagging behind. While there are many types of software that enable the manipulation and analysis of image data and genomic data as separate entities, there is no framework established for linking the two. We present a generic set of software tools, BioDIG, that allows linking of image data to genomic data. BioDIG tools can be applied to a wide range of research problems that require linking images to genomes. BioDIG features the following: rapid construction of web-based workbenches, community-based annotation, user management and web services. By using BioDIG to create websites, researchers and curators can rapidly annotate a large number of images with genomic information. Here we present the BioDIG software tools that include an image module, a genome module and a user management module. We also introduce a BioDIG-based website, MyDIG, which is being used to annotate images of mycoplasmas.

Database URL: BioDIG website: http://biodig.org

BioDIG source code repository: http://github.com/FriedbergLab/BioDIG

The MyDIG database: http://mydig.biodig.org/

Linking image data to genomic data. Sounds like associations to me.

You?

Not to mention the heterogeneity of genomic data.

Imagine extending an image/genomic data association by additional genomic data under a different identification.

### How Stable is Your Ontology?

Tuesday, February 19th, 2013

Assessing identity, redundancy and confounds in Gene Ontology annotations over time by Jesse Gillis and Paul Pavlidis. (Bioinformatics (2013) 29 (4): 476-482. doi: 10.1093/bioinformatics/bts727)

Abstract:

Motivation: The Gene Ontology (GO) is heavily used in systems biology, but the potential for redundancy, confounds with other data sources and problems with stability over time have been little explored.

Results: We report that GO annotations are stable over short periods, with 3% of genes not being most semantically similar to themselves between monthly GO editions. However, we find that genes can alter their ‘functional identity’ over time, with 20% of genes not matching to themselves (by semantic similarity) after 2 years. We further find that annotation bias in GO, in which some genes are more characterized than others, has declined in yeast, but generally increased in humans. Finally, we discovered that many entries in protein interaction databases are owing to the same published reports that are used for GO annotations, with 66% of assessed GO groups exhibiting this confound. We provide a case study to illustrate how this information can be used in analyses of gene sets and networks.

Availability: Data available at http://chibi.ubc.ca/assessGO.

How does your ontology account for changes in identity over time?
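To make the question concrete, here is a minimal sketch of a self-match test between two annotation editions. It is an illustration only: Jaccard overlap of term sets stands in for the semantic similarity measure used in the paper, and the gene names, term IDs and function names are all made up.

```python
def jaccard(a, b):
    """Overlap of two term sets; 1.0 means identical, 0.0 means disjoint."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def genes_losing_identity(old_edition, new_edition):
    """Return genes whose best match in the new edition is a different gene."""
    lost = []
    for gene, old_terms in old_edition.items():
        if gene not in new_edition:
            continue  # skip genes dropped from the newer edition
        self_score = jaccard(old_terms, new_edition[gene])
        best_other = max(
            (jaccard(old_terms, terms)
             for other, terms in new_edition.items() if other != gene),
            default=0.0,
        )
        if best_other > self_score:
            lost.append(gene)
    return lost


# Toy annotation sets; gene names and term IDs are hypothetical.
edition_old = {
    "geneA": {"GO:1", "GO:2"},
    "geneB": {"GO:3", "GO:4"},
    "geneC": {"GO:5"},
}
edition_new = {
    "geneA": {"GO:1", "GO:2", "GO:6"},  # gained a term, still recognizable
    "geneB": {"GO:5", "GO:7"},          # annotations replaced
    "geneC": {"GO:3", "GO:4"},          # now looks like the old geneB
}

lost = genes_losing_identity(edition_old, edition_new)
print(lost)
```

In this toy data geneA still matches itself best, while geneB and geneC no longer do. An identity test that relies on annotations alone would confuse them.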

### New Public-Access Source With 3-D Information for Protein Interactions

Friday, December 21st, 2012

New Public-Access Source With 3-D Information for Protein Interactions

From the post:

Researchers have developed a platform that compiles all the atomic data, previously stored in diverse databases, on protein structures and protein interactions for eight organisms of relevance. They apply a singular homology-based modelling procedure.

The scientists Roberto Mosca, Arnaud Ceol and Patrick Aloy provide the international biomedical community with Interactome3D (interactome3d.irbbarcelona.org), an open-access and free web platform developed entirely by the Institute for Research in Biomedicine (IRB Barcelona). Interactome 3D offers for the first time the possibility to anonymously access and add molecular details of protein interactions and to obtain the information in 3D models. For researchers, atomic level details about the reactions are fundamental to unravel the bases of biology, disease development, and the design of experiments and drugs to combat diseases.

Interactome 3D provides reliable information about more than 12,000 protein interactions for eight model organisms, namely the plant Arabidopsis thaliana, the worm Caenorhabditis elegans, the fly Drosophila melanogaster, the bacteria Escherichia coli and Helicobacter pylori, the brewer’s yeast Saccharomyces cerevisiae, the mouse Mus musculus, and Homo sapiens. These models are considered the most relevant in biomedical research and genetic studies. The journal Nature Methods presents the research results and accredits the platform on the basis of its high reliability and precision in modelling interactions, which reaches an average of 75%.

Further details can be found at:

Interactome3D: adding structural details to protein networks by Roberto Mosca, Arnaud Céol and Patrick Aloy. (Nature Methods (2012) doi:10.1038/nmeth.2289)

Abstract:

Network-centered approaches are increasingly used to understand the fundamentals of biology. However, the molecular details contained in the interaction networks, often necessary to understand cellular processes, are very limited, and the experimental difficulties surrounding the determination of protein complex structures make computational modeling techniques paramount. Here we present Interactome3D, a resource for the structural annotation and modeling of protein-protein interactions. Through the integration of interaction data from the main pathway repositories, we provide structural details at atomic resolution for over 12,000 protein-protein interactions in eight model organisms. Unlike static databases, Interactome3D also allows biologists to upload newly discovered interactions and pathways in any species, select the best combination of structural templates and build three-dimensional models in a fully automated manner. Finally, we illustrate the value of Interactome3D through the structural annotation of the complement cascade pathway, rationalizing a potential common mechanism of action suggested for several disease-causing mutations.

Interesting not only for its implications for bioinformatics but also for its use of homology modeling (superficially, similar proteins have similar interaction sites) to assist in that work.

The topic map analogy would be to show that, in a subject domain, different identifications of the same subject tend to have the same associations or to fall into other patterns.

The next step would be constructing a subject identity test based upon a template of associations or other values.
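A rough sketch of what such a test might look like, assuming associations are available as simple (subject, type, other player) triples. The identifiers, the association types and the 0.6 threshold are all hypothetical, chosen only to show the shape of the idea.

```python
def association_profile(subject, associations):
    """Collect (association type, other player) pairs for one identifier."""
    return {(kind, other) for s, kind, other in associations if s == subject}


def same_subject(id_a, id_b, associations, threshold=0.6):
    """Guess whether two identifiers name one subject by profile overlap."""
    a = association_profile(id_a, associations)
    b = association_profile(id_b, associations)
    if not a or not b:
        return False  # no evidence either way
    overlap = len(a & b) / len(a | b)
    return overlap >= threshold


# Hypothetical triples: (identifier, association type, other player).
associations = [
    ("id:alpha", "interacts_with", "proteinX"),
    ("id:alpha", "interacts_with", "proteinY"),
    ("id:alpha", "located_in", "nucleus"),
    ("id:alpha-alt", "interacts_with", "proteinX"),
    ("id:alpha-alt", "interacts_with", "proteinY"),
    ("id:alpha-alt", "located_in", "nucleus"),
    ("id:alpha-alt", "part_of", "complexZ"),
    ("id:beta", "interacts_with", "proteinX"),
    ("id:beta", "located_in", "membrane"),
]

print(same_subject("id:alpha", "id:alpha-alt", associations))
print(same_subject("id:alpha", "id:beta", associations))
```

The threshold is the interesting design choice: set it too low and distinct subjects merge, too high and the same subject under two identifications never gets recognized.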

### KEGG: Kyoto Encyclopedia of Genes and Genomes

Monday, November 12th, 2012

KEGG: Kyoto Encyclopedia of Genes and Genomes

From the webpage:

KEGG is a database resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies (See Release notes for new and updated features).

Anyone in biological research is probably already using KEGG. Take the opportunity to educate yourself about this resource, in particular how to use it with other resources.

The KEGG project, like any other project, needs funding. Consider passing this site along to funders interested in biological research resources.

I first saw this in a tweet by Anita de Waard.

### Bio4j 0.8, some numbers

Saturday, October 20th, 2012

Bio4j 0.8, some numbers by Pablo Pareja Tobes.

Bio4j 0.8 was recently released and now it’s time to have a deeper look at its numbers (as you can see we are quickly approaching the 1 billion relationships and 100M nodes):

• Number of Relationships: 717.484.649
• Number of Nodes: 92.667.745
• Relationship types: 144
• Node types: 42

If Pablo gets tired of his brilliant career in bioinformatics he can always run for office in the United States with claims like: “…we are quickly approaching the 1 billion relationships….”

Still, a stunning achievement!

See Pablo’s post for more analysis.

Pass the project along to anyone with doubts about graph databases.

### The “O” Word (Ontology) Isn’t Enough

Tuesday, October 16th, 2012

The Units Ontology makes reference to the Gene Ontology as an example of a successful web ontology effort.

As it should. The Gene Ontology (GO) is the only successful web ontology effort. A universe with one (1) inhabitant.

The GO has a number of differences from wannabe successful ontology candidates. (see the article below)

The first difference echoes loudly across the semantic engineering universe:

One of the factors that account for GO’s success is that it originated from within the biological community rather than being created and subsequently imposed by external knowledge engineers. Terms were created by those who had expertise in the domain, thus avoiding the huge effort that would have been required for a computer scientist to learn and organize large amounts of biological functional information. This also led to general acceptance of the terminology and its organization within the community. This is not to say that there have been no disagreements among biologists over the conceptualization, and there is of course a protocol for arriving at a consensus when there is such a disagreement. However, a model of a domain is more likely to conform to the shared view of a community if the modelers are within or at least consult to a large degree with members of that community.

Did you catch that first line?

One of the factors that account for GO’s success is that it originated from within the biological community rather than being created and subsequently imposed by external knowledge engineers.

Saying the “O” word (ontology), and promising that it will benefit everyone if they will just listen to you, isn’t enough.

There are other factors to consider:

A Short Study on the Success of the Gene Ontology by Michael Bada, Robert Stevens, Carole Goble, Yolanda Gil, Michael Ashburner, Judith A. Blake, J. Michael Cherry, Midori Harris, Suzanna Lewis.

Abstract:

While most ontologies have been used only by the groups who created them and for their initially defined purposes, the Gene Ontology (GO), an evolving structured controlled vocabulary of nearly 16,000 terms in the domain of biological functionality, has been widely used for annotation of biological-database entries and in biomedical research. As a set of learned lessons offered to other ontology developers, we list and briefly discuss the characteristics of GO that we believe are most responsible for its success: community involvement; clear goals; limited scope; simple, intuitive structure; continuous evolution; active curation; and early use.

### Bio4j 0.8 is here!

Saturday, October 13th, 2012

Bio4j 0.8 is here! by Pablo Pareja Tobes.

You will find “5.488.000 new proteins and 3.233.000 genes” and other improvements!

Whether you are interested in graph databases (Neo4j), bioinformatics or both, this is welcome news!

### PathNet: A tool for pathway analysis using topological information

Friday, October 12th, 2012

PathNet: A tool for pathway analysis using topological information by Bhaskar Dutta, Anders Wallqvist and Jaques Reifman. (Source Code for Biology and Medicine 2012, 7:10 doi:10.1186/1751-0473-7-10)

Abstract:

Background

Identification of canonical pathways through enrichment of differentially expressed genes in a given pathway is a widely used method for interpreting gene lists generated from high-throughput experimental studies. However, most algorithms treat pathways as sets of genes, disregarding any inter- and intra-pathway connectivity information, and do not provide insights beyond identifying lists of pathways.

Results

We developed an algorithm (PathNet) that utilizes the connectivity information in canonical pathway descriptions to help identify study-relevant pathways and characterize non-obvious dependencies and connections among pathways using gene expression data. PathNet considers both the differential expression of genes and their pathway neighbors to strengthen the evidence that a pathway is implicated in the biological conditions characterizing the experiment. As an adjunct to this analysis, PathNet uses the connectivity of the differentially expressed genes among all pathways to score pathway contextual associations and statistically identify biological relations among pathways. In this study, we used PathNet to identify biologically relevant results in two Alzheimer’s disease microarray datasets, and compared its performance with existing methods. Importantly, PathNet identified deregulation of the ubiquitin-mediated proteolysis pathway as an important component in Alzheimer’s disease progression, despite the absence of this pathway in the standard enrichment analyses.

Conclusions

PathNet is a novel method for identifying enrichment and association between canonical pathways in the context of gene expression data. It takes into account topological information present in pathways to reveal biological information. PathNet is available as an R workspace image from http://www.bhsai.org/downloads/pathnet/.

Important work for genomics but also a reminder that a list of paths is just that, a list of paths.

The value-add and creative aspect of data analysis is in the scoring of those paths in order to wring more information from them.

How is it for you? Just lists of paths or something a bit more clever?

### Mapping solution to heterogeneous data sources

Monday, September 10th, 2012

dbSNO: a database of cysteine S-nitrosylation by Tzong-Yi Lee, Yi-Ju Chen, Cheng-Tsung Lu, Wei-Chieh Ching, Yu-Chuan Teng, Hsien-Da Huang and Yu-Ju Chen. (Bioinformatics (2012) 28 (17): 2293-2295. doi: 10.1093/bioinformatics/bts436)

OK, the title doesn’t jump out and say “mapping solution here!”

Reading a bit further, you discover that text mining is used to locate sequences and that data is then mapped to “UniProtKB protein entries.”

• UniProt ID
• Organism
• Position
• PubMed Id
• Sequence

My concern is what happens when X is mapped to a UniProtKB protein entry to:

• The prior identifier for X (in the article or source), and
• The mapping from X to the UniProtKB protein entry?

If both of those are captured, then prior literature can be annotated upon rendering to point to later aggregation of information on a subject.

If the prior identifier, place of usage, the mapping, etc., are not captured, then prior literature, when we encounter it, remains frozen in time.

Mapping solutions work, but repay the effort several times over if the prior identifier and its mapping to the “new” identifier are captured as part of the process.
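As a sketch of what “capturing” could mean in practice, the following keeps the prior identifier, where it was used, and how it was mapped, alongside the new identifier. Nothing here comes from dbSNO itself: the class names, fields and the placeholder identifiers are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdentifierMapping:
    prior_id: str    # identifier as it appeared in the article or source
    source: str      # where the prior identifier was used
    target_id: str   # identifier it was mapped to
    method: str      # how the mapping was made


class MappingLog:
    """Keeps every mapping so lookups work in both directions."""

    def __init__(self):
        self._by_prior = {}
        self._by_target = {}

    def record(self, mapping):
        self._by_prior[(mapping.source, mapping.prior_id)] = mapping
        self._by_target.setdefault(mapping.target_id, []).append(mapping)

    def resolve(self, source, prior_id):
        """Prior identifier in a given source -> current identifier, or None."""
        mapping = self._by_prior.get((source, prior_id))
        return mapping.target_id if mapping else None

    def prior_identifiers(self, target_id):
        """Current identifier -> every (source, prior identifier) mapped to it."""
        return [(m.source, m.prior_id) for m in self._by_target.get(target_id, [])]


# Placeholder identifiers, not real records.
log = MappingLog()
log.record(IdentifierMapping("protein-X", "article-1", "ENTRY-0001", "sequence identity"))
log.record(IdentifierMapping("pX", "article-2", "ENTRY-0001", "sequence identity"))

print(log.resolve("article-1", "protein-X"))
print(log.prior_identifiers("ENTRY-0001"))
```

With the reverse lookup in hand, an older article that says “pX” can be annotated on rendering to point at whatever has since been aggregated under the new entry.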

Abstract

Summary: S-nitrosylation (SNO), a selective and reversible protein post-translational modification that involves the covalent attachment of nitric oxide (NO) to the sulfur atom of cysteine, critically regulates protein activity, localization and stability. Due to its importance in regulating protein functions and cell signaling, a mass spectrometry-based proteomics method rapidly evolved to increase the dataset of experimentally determined SNO sites. However, there is currently no database dedicated to the integration of all experimentally verified S-nitrosylation sites with their structural or functional information. Thus, the dbSNO database is created to integrate all available datasets and to provide their structural analysis. Up to April 15, 2012, the dbSNO has manually accumulated >3000 experimentally verified S-nitrosylated peptides from 219 research articles using a text mining approach. To solve the heterogeneity among the data collected from different sources, the sequence identity of these reported S-nitrosylated peptides are mapped to the UniProtKB protein entries. To delineate the structural correlation and consensus motif of these SNO sites, the dbSNO database also provides structural and functional analyses, including the motifs of substrate sites, solvent accessibility, protein secondary and tertiary structures, protein domains and gene ontology.

Availability: The dbSNO is now freely accessible via http://dbSNO.mbc.nctu.edu.tw. The database content is regularly updated upon collecting new data obtained from continuously surveying research articles.

### Reveal—visual eQTL analytics [Statistics of Identity/Association]

Monday, September 10th, 2012

Reveal—visual eQTL analytics by Günter Jäger, Florian Battke and Kay Nieselt. (Bioinformatics (2012) 28 (18): i542-i548. doi: 10.1093/bioinformatics/bts382)

Abstract

Motivation: The analysis of expression quantitative trait locus (eQTL) data is a challenging scientific endeavor, involving the processing of very large, heterogeneous and complex data. Typical eQTL analyses involve three types of data: sequence-based data reflecting the genotypic variations, gene expression data and meta-data describing the phenotype. Based on these, certain genotypes can be connected with specific phenotypic outcomes to infer causal associations of genetic variation, expression and disease.

To this end, statistical methods are used to find significant associations between single nucleotide polymorphisms (SNPs) or pairs of SNPs and gene expression. A major challenge lies in summarizing the large amount of data as well as statistical results and to generate informative, interactive visualizations.

Results: We present Reveal, our visual analytics approach to this challenge. We introduce a graph-based visualization of associations between SNPs and gene expression and a detailed genotype view relating summarized patient cohort genotypes with data from individual patients and statistical analyses.

Availability: Reveal is included in Mayday, our framework for visual exploration and analysis. It is available at http://it.inf.uni-tuebingen.de/software/reveal/.

Contact: guenter.jaeger@uni-tuebingen.de

Interesting work on a number of fronts, not the least of them being “…analysis of expression quantitative trait locus (eQTL) data.”

Its use of statistical methods to discover “significant associations,” interactive visualizations and processing of “large, heterogeneous and complex data” are of more immediate interest to me.

Wikipedia is evidence for subjects (including relationships) that can be usefully identified using URLs. But that is only a fraction of all the subjects and relationships we may want to include in our topic maps.

An area I need to work up for my next topic map course is probabilistic identification of subjects and their relationships. What statistical techniques are useful for what fields? Or even what subjects within what fields? What are the processing tradeoffs versus certainty of identification?

### Next Generation Sequencing, GNU-Make and .INTERMEDIATE

Friday, August 31st, 2012

Next Generation Sequencing, GNU-Make and .INTERMEDIATE by Pierre Lindenbaum.

From the post:

I gave a crash course about NGS to a few colleagues today. For my demonstration I wrote a simple Makefile. Basically, it downloads a subset of the human chromosome 22, indexes it with bwa, generates a set of fastqs with wgsim, aligns the fastqs, generates the *.sai, the *.sam, the *.bam, sorts the bam and calls the SNPs with mpileup.

An illustration that there is plenty of life left in GNU Make.

Plus an interesting tip on the use of .INTERMEDIATE in make scripts.

As a starting point, consider Make (software).

### BiologicalNetworks

Wednesday, August 15th, 2012

BiologicalNetworks

From the webpage:

BiologicalNetworks research environment enables integrative analysis of:

• Interaction networks, metabolic and signaling pathways together with transcriptomic, metabolomic and proteomic experiments data
• Transcriptional regulation modules (modular networks)
• Genomic sequences including gene regulatory regions (e.g. binding sites, promoter regions) and respective transcription factors, as well as NGS data
• Comparative genomics, homologous/orthologous genes and phylogenies
• 3D protein structures and ligand binding, small molecules and drugs
• Multiple ontologies including GeneOntology, Cell and Tissue types, Diseases, Anatomy and taxonomies

BiologicalNetworks backend database (IntegromeDB) integrates >1000 curated data sources (from the NAR list) for thousands of eukaryotic, prokaryotic and viral organisms and millions of public biomedical, biochemical, drug, disease and health-related web resources.

Correction: As of 3 July 2012, “IntegromeDB’s index reaches 1 Billion (biomedical resources links) milestone.”

IntegromeDB collects all the biomedical, biochemical, drug and disease related data available in the public domain and brings you the most relevant data for your search. It provides you with an integrative view on the genomic, proteomic, transcriptomic, genetic and functional information featuring gene/protein names, synonyms and alternative IDs, gene function, orthologies, gene expression, pathways and molecular (protein-protein, TF-gene, genetic, etc.) interactions, mutations and SNPs, disease relationships, drugs and compounds, and many other. Explore and enjoy!

Sounds a lot like a topic map doesn’t it?

One interesting feature is Inconsistency in the integrated data.

How would you:

• Improve the consistency of integrated data?
• Enable crowd participation in curation of data?
• Enable the integration of data files into other data systems?

### The Story Behind “Scaling Metagenome Assembly with Probabilistic de Bruijn Graphs”

Friday, August 10th, 2012

From the post:

This is the story behind our PNAS paper, “Scaling Metagenome Assembly with Probabilistic de Bruijn Graphs” (released from embargo this past Monday).

Why did we write it? How did it get started? Well, rewind the tape 2 years and more…

There we were in May 2010, sitting on 500 million Illumina reads from shotgun DNA sequencing of an Iowa prairie soil sample. We wanted to reconstruct the microbial community contents and structure of the soil sample, but we couldn’t figure out how to do that from the data. We knew that, in theory, the data contained a number of partial microbial genomes, and we had a technique — de novo genome assembly — that could (again, in theory) reconstruct those partial genomes. But when we ran the software, it choked — 500 million reads was too big a data set for the software and computers we had. Plus, we were looking forward to the future, when we would get even more data; if the software was dying on us now, what would we do when we had 10, 100, or 1000 times as much data?

A perfect post to read over the weekend!

Not all research ends successfully, but when it does, it is a story that inspires.

### The 2012 Nucleic Acids Research Database Issue…

Wednesday, August 8th, 2012

The 2012 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection by Michael Y. Galperin and Xosé M. Fernández-Suárez.

Abstract:

The 19th annual Database Issue of Nucleic Acids Research features descriptions of 92 new online databases covering various areas of molecular biology and 100 papers describing recent updates to the databases previously described in NAR and other journals. The highlights of this issue include, among others, a description of neXtProt, a knowledgebase on human proteins; a detailed explanation of the principles behind the NCBI Taxonomy Database; NCBI and EBI papers on the recently launched BioSample databases that store sample information for a variety of database resources; descriptions of the recent developments in the Gene Ontology and UniProt Gene Ontology Annotation projects; updates on Pfam, SMART and InterPro domain databases; update papers on KEGG and TAIR, two universally acclaimed databases that face an uncertain future; and a separate section with 10 wiki-based databases, introduced in an accompanying editorial. The NAR online Molecular Biology Database Collection, available at http://www.oxfordjournals.org/nar/database/a/, has been updated and now lists 1380 databases. Brief machine-readable descriptions of the databases featured in this issue, according to the BioDBcore standards, will be provided at the http://biosharing.org/biodbcore web site. The full content of the Database Issue is freely available online on the Nucleic Acids Research web site (http://nar.oxfordjournals.org/).

Abstract of the article describing: Nucleic Acids Research, Database issue, Volume 40 Issue D1 January 2012.

Very much like being a kid in a candy store. Hard to know what to look at next! Both for subject matter experts and those of us interested in the technology aspects of the databases.

### ANNOVAR: functional annotation of genetic variants….

Wednesday, August 8th, 2012

ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data by Kai Wang, Mingyao Li, and Hakon Hakonarson. (Nucl. Acids Res. (2010) 38 (16): e164. doi: 10.1093/nar/gkq603)

Just in case you are unfamiliar with ANNOVAR, the software mentioned in: gSearch: a fast and flexible general search tool for whole-genome sequencing:

Abstract:

High-throughput sequencing platforms are generating massive amounts of genetic variation data for diverse genomes, but it remains a challenge to pinpoint a small subset of functionally important variants. To fill these unmet needs, we developed the ANNOVAR tool to annotate single nucleotide variants (SNVs) and insertions/deletions, such as examining their functional consequence on genes, inferring cytogenetic bands, reporting functional importance scores, finding variants in conserved regions, or identifying variants reported in the 1000 Genomes Project and dbSNP. ANNOVAR can utilize annotation databases from the UCSC Genome Browser or any annotation data set conforming to Generic Feature Format version 3 (GFF3). We also illustrate a ‘variants reduction’ protocol on 4.7 million SNVs and indels from a human genome, including two causal mutations for Miller syndrome, a rare recessive disease. Through a stepwise procedure, we excluded variants that are unlikely to be causal, and identified 20 candidate genes including the causal gene. Using a desktop computer, ANNOVAR requires ∼4 min to perform gene-based annotation and ∼15 min to perform variants reduction on 4.7 million variants, making it practical to handle hundreds of human genomes in a day. ANNOVAR is freely available at http://www.openbioinformatics.org/annovar/.

Approximately two years separate ANNOVAR from gSearch. That should give you an idea of the speed of development in bioinformatics. They haven’t labored over finding a syntax for everyone to use for more than a decade. I suspect there is a lesson in there somewhere.

### gSearch: a fast and flexible general search tool for whole-genome sequencing

Wednesday, August 8th, 2012

gSearch: a fast and flexible general search tool for whole-genome sequencing by Taemin Song, Kyu-Baek Hwang, Michael Hsing, Kyungjoon Lee, Justin Bohn, and Sek Won Kong.

Abstract:

Background: Various processes such as annotation and filtering of variants or comparison of variants in different genomes are required in whole-genome or exome analysis pipelines. However, processing different databases and searching among millions of genomic loci is not trivial.

Results: gSearch compares sequence variants in the Genome Variation Format (GVF) or Variant Call Format (VCF) with a pre-compiled annotation or with variants in other genomes. Its search algorithms are subsequently optimized and implemented in a multi-threaded manner. The proposed method is not a stand-alone annotation tool with its own reference databases. Rather, it is a search utility that readily accepts public or user-prepared reference files in various formats including GVF, Generic Feature Format version 3 (GFF3), Gene Transfer Format (GTF), VCF and Browser Extensible Data (BED) format. Compared to existing tools such as ANNOVAR, gSearch runs more than 10 times faster. For example, it is capable of annotating 52.8 million variants with allele frequencies in 6 min.

Availability: gSearch is available at http://ml.ssu.ac.kr/gSearch. It can be used as an independent search tool or can easily be integrated to existing pipelines through various programming environments such as Perl, Ruby and Python.

As the abstract says: “…searching among millions of genomic loci is not trivial.”

Either for integration with topic map tools in a pipeline or for searching technology, definitely worth a close reading.
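For a feel of the basic lookup problem, here is a small sketch of annotating variant positions against a sorted list of features with binary search. It is not gSearch’s algorithm (the abstract does not describe it); the sketch assumes features on a chromosome do not overlap and uses 1-based inclusive coordinates, and the feature names are invented.

```python
import bisect


def build_index(features):
    """Group features by chromosome and sort them by start position."""
    grouped = {}
    for chrom, start, end, name in features:
        grouped.setdefault(chrom, []).append((start, end, name))
    index = {}
    for chrom, rows in grouped.items():
        rows.sort()
        # Keep a parallel list of start positions for bisect.
        index[chrom] = ([start for start, _, _ in rows], rows)
    return index


def annotate(index, chrom, pos):
    """Return the name of the feature covering pos, or None."""
    if chrom not in index:
        return None
    starts, rows = index[chrom]
    i = bisect.bisect_right(starts, pos) - 1  # last feature starting at or before pos
    if i >= 0:
        start, end, name = rows[i]
        if start <= pos <= end:
            return name
    return None


# Hypothetical, non-overlapping features: (chromosome, start, end, name).
features = [
    ("chr1", 300, 400, "geneB"),
    ("chr1", 100, 200, "geneA"),
    ("chr2", 50, 80, "geneC"),
]
index = build_index(features)

for chrom, pos in [("chr1", 150), ("chr1", 250), ("chr2", 80)]:
    print(chrom, pos, annotate(index, chrom, pos))
```

Each lookup is logarithmic in the number of features on the chromosome. Overlapping features would need an interval tree or similar structure, which is where the “not trivial” part begins.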

### De novo assembly and genotyping of variants using colored de Bruijn graphs

Friday, August 3rd, 2012

De novo assembly and genotyping of variants using colored de Bruijn graphs by Zamin Iqbal, Mario Caccamo, Isaac Turner, Paul Flicek & Gil McVean. (Nature Genetics 44, 226–232 (2012))

Abstract:

Detecting genetic variants that are highly divergent from a reference sequence remains a major challenge in genome sequencing. We introduce de novo assembly algorithms using colored de Bruijn graphs for detecting and genotyping simple and complex genetic variants in an individual or population. We provide an efficient software implementation, Cortex, the first de novo assembler capable of assembling multiple eukaryotic genomes simultaneously. Four applications of Cortex are presented. First, we detect and validate both simple and complex structural variations in a high-coverage human genome. Second, we identify more than 3 Mb of sequence absent from the human reference genome, in pooled low-coverage population sequence data from the 1000 Genomes Project. Third, we show how population information from ten chimpanzees enables accurate variant calls without a reference sequence. Last, we estimate classical human leukocyte antigen (HLA) genotypes at HLA-B, the most variable gene in the human genome.

You will need access to Nature Genetics, but I am rounding out today’s posts on de Bruijn graphs with a recent research article.

Comments on the Cortex software appreciated.

### Genome assembly and comparison using de Bruijn graphs

Friday, August 3rd, 2012

Genome assembly and comparison using de Bruijn graphs by Daniel Robert Zerbino. (thesis)

Abstract:

Recent advances in sequencing technology made it possible to generate vast amounts of sequence data. The fragments produced by these high-throughput methods are, however, far shorter than in traditional Sanger sequencing. Previously, micro-reads of less than 50 base pairs were considered useful only in the presence of an existing assembly. This thesis describes solutions for assembling short read sequencing data de novo, in the absence of a reference genome.

The algorithms developed here are based on the de Bruijn graph. This data structure is highly suitable for the assembly and comparison of genomes for the following reasons. It provides a flexible tool to handle the sequence variants commonly found in genome evolution such as duplications, inversions or transpositions. In addition, it can combine sequences of highly different lengths, from short reads to assembled genomes. Finally, it ensures an effective data compression of highly redundant datasets.

This thesis presents the development of a collection of methods, called Velvet, to convert a de Bruijn graph into a traditional assembly of contiguous sequences. The first step of the process, termed Tour Bus, removes sequencing errors and handles biological variations such as polymorphisms. In its second part, Velvet aims to resolve repeats based on the available information, from low coverage long reads (Rock Band) or paired shotgun reads (Pebble). These methods were tested on various simulations for precision and efficiency, then on control experimental datasets.

De Bruijn graphs can also be used to detect and analyse structural variants from unassembled data. The final chapter of this thesis presents the results of collaborative work on the analysis of several experimental unassembled datasets.

De Bruijn graphs are covered in pages 22-42 if you want to cut to the chase.
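If you would rather see the data structure than read about it, here is a minimal sketch of building a de Bruijn graph from reads. It is not Velvet's implementation: the function name `de_bruijn` and the toy reads are my own assumptions, and it skips error removal, reverse complements and repeat resolution entirely.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers.

    Hypothetical helper for illustration only, not taken from Velvet.
    """
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of length k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer links its prefix (k-1)-mer to its suffix (k-1)-mer.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)

# Two overlapping toy reads collapse onto the same nodes and edges.
graph = de_bruijn(["ACGTAC", "GTACG"], k=3)
```

Overlapping reads land on the same nodes, which is where the compression of redundant data mentioned in the abstract comes from.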

Obviously of interest to the bioinformatics community.

Where else would you use de Bruijn graph structures?

### Is “Massive Data” > “Big Data”?

Friday, August 3rd, 2012

Science News announces: New Computational Technique Relieves Logjam from Massive Amounts of Data, which is a better title than: Scaling metagenome sequence assembly with probabilistic de Bruijn graphs by Jason Pell, Arend Hintze, Rosangela Canino-Koning, Adina Howe, James M. Tiedje, and C. Titus Brown.

But I have to wonder about “massive data,” versus “big data,” versus “really big data,” versus “massive data streams,” as informative phrases. True, I have a weakness for an eye-catching headline but in prose, shouldn’t we say what data is under consideration? Let the readers draw their own conclusions?

Deep sequencing has enabled the investigation of a wide range of environmental microbial ecosystems, but the high memory requirements for de novo assembly of short-read shotgun sequencing data from these complex populations are an increasingly large practical barrier. Here we introduce a memory-efficient graph representation with which we can analyze the k-mer connectivity of metagenomic samples. The graph representation is based on a probabilistic data structure, a Bloom filter, that allows us to efficiently store assembly graphs in as little as 4 bits per k-mer, albeit inexactly. We show that this data structure accurately represents DNA assembly graphs in low memory. We apply this data structure to the problem of partitioning assembly graphs into components as a prelude to assembly, and show that this reduces the overall memory requirements for de novo assembly of metagenomes. On one soil metagenome assembly, this approach achieves a nearly 40-fold decrease in the maximum memory requirements for assembly. This probabilistic graph representation is a significant theoretical advance in storing assembly graphs and also yields immediate leverage on metagenomic assembly.
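To make the idea concrete, here is a rough sketch of storing k-mers in a Bloom filter and recovering graph edges by asking the filter about possible successors. The class `KmerBloom`, the `neighbors` helper, the SHA-256 based hashing and the sizes are all assumptions of mine for illustration, not the authors' code.

```python
import hashlib

class KmerBloom:
    """A small Bloom filter over k-mers (illustrative, not the paper's code)."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, kmer):
        # Derive n_hashes bit positions by salting one hash function.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, kmer):
        # May report false positives, never false negatives.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(kmer))

def neighbors(bloom, kmer):
    """Successor k-mers that the filter claims to contain."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in bloom]

sequence = "ACGTACGGTC"
k = 4
kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
bloom = KmerBloom(n_bits=1024, n_hashes=3)
for kmer in kmers:
    bloom.add(kmer)
```

The graph is never stored explicitly. Edges exist only as answers to membership queries, which is why the representation is inexact: a false positive in the filter shows up as a spurious node.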

If “de Bruijn graphs” sounds familiar, see: Memory Efficient De Bruijn Graph Construction [Attn: Graph Builders, Chess Anyone?].

### BioExtract Server

Monday, July 30th, 2012

BioExtract Server: data access, analysis, storage and workflow creation

BioExtract harnesses the power of online informatics tools for creating and customizing workflows. Users can query online sequence data, analyze it using an array of informatics tools (web service and desktop), create and share custom workflows for repeated analysis, and save the resulting data and workflows in standardized reports. This work was initially supported by NSF grant 0090732. Current work is being supported by NSF DBI-0606909.

A great tool for sequence data researchers and a good example of what is possible with other structured data sets.

Much has been made (and rightly so) of the need for and difficulties of processing unstructured data.

But we should not ignore the structured data dumps being released by governments and other groups around the world.

And we should recognize that hosted workflows and processing can make insights into data a matter of skill, rather than ownership of enough hardware.

### Mining the pharmacogenomics literature—a survey of the state of the art

Thursday, July 26th, 2012

Mining the pharmacogenomics literature—a survey of the state of the art by Udo Hahn, K. Bretonnel Cohen, and Yael Garten. (Brief Bioinform (2012) 13 (4): 460-494. doi: 10.1093/bib/bbs018)

Abstract:

This article surveys efforts on text mining of the pharmacogenomics literature, mainly from the period 2008 to 2011. Pharmacogenomics (or pharmacogenetics) is the field that studies how human genetic variation impacts drug response. Therefore, publications span the intersection of research in genotypes, phenotypes and pharmacology, a topic that has increasingly become a focus of active research in recent years. This survey covers efforts dealing with the automatic recognition of relevant named entities (e.g. genes, gene variants and proteins, diseases and other pathological phenomena, drugs and other chemicals relevant for medical treatment), as well as various forms of relations between them. A wide range of text genres is considered, such as scientific publications (abstracts, as well as full texts), patent texts and clinical narratives. We also discuss infrastructure and resources needed for advanced text analytics, e.g. document corpora annotated with corresponding semantic metadata (gold standards and training data), biomedical terminologies and ontologies providing domain-specific background knowledge at different levels of formality and specificity, software architectures for building complex and scalable text analytics pipelines and Web services grounded to them, as well as comprehensive ways to disseminate and interact with the typically huge amounts of semiformal knowledge structures extracted by text mining tools. Finally, we consider some of the novel applications that have already been developed in the field of pharmacogenomic text mining and point out perspectives for future research.

At thirty-six (36) pages and well over 200 references, this is going to take a while to digest.

How are entity recognition issues same/different?

What techniques have you seen before? How different/same?

What other techniques would you suggest?

### Memory Efficient De Bruijn Graph Construction [Attn: Graph Builders, Chess Anyone?]

Tuesday, July 17th, 2012

Memory Efficient De Bruijn Graph Construction by Yang Li, Pegah Kamousi, Fangqiu Han, Shengqi Yang, Xifeng Yan, and Subhash Suri.

Abstract:

Massively parallel DNA sequencing technologies are revolutionizing genomics research. Billions of short reads generated at low costs can be assembled for reconstructing the whole genomes. Unfortunately, the large memory footprint of the existing de novo assembly algorithms makes it challenging to get the assembly done for higher eukaryotes like mammals. In this work, we investigate the memory issue of constructing de Bruijn graph, a core task in leading assembly algorithms, which often consumes several hundreds of gigabytes memory for large genomes. We propose a disk-based partition method, called Minimum Substring Partitioning (MSP), to complete the task using less than 10 gigabytes memory, without runtime slowdown. MSP breaks the short reads into multiple small disjoint partitions so that each partition can be loaded into memory, processed individually and later merged with others to form a de Bruijn graph. By leveraging the overlaps among the k-mers (substring of length k), MSP achieves astonishing compression ratio: The total size of partitions is reduced from $\Theta(kn)$ to $\Theta(n)$, where $n$ is the size of the short read database, and $k$ is the length of a $k$-mer. Experimental results show that our method can build de Bruijn graphs using a commodity computer for any large-volume sequence dataset.

A discovery in one area of data processing can have a large impact in a number of others. I suspect that will be the case with the technique described here.

The use of substrings for compression and to determine the creation of partitions was particularly clever.
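A rough sketch of the partitioning idea, as I read the abstract: consecutive k-mers in a read that share the same smallest length-p substring are kept together as one longer string and sent to the partition named by that substring. The function names and the toy read are assumptions of mine, and the sketch ignores reverse complements and the disk I/O that the actual method depends on.

```python
from collections import defaultdict

def min_substring(kmer, p):
    """Lexicographically smallest length-p substring of a k-mer."""
    return min(kmer[i:i + p] for i in range(len(kmer) - p + 1))

def partition_read(read, k, p):
    """Split a read into runs of consecutive k-mers sharing a minimum p-substring.

    Returns (minimum substring, covering substring of the read) pairs.
    """
    runs = []  # each entry: (minimum substring, start, end) into the read
    for i in range(len(read) - k + 1):
        m = min_substring(read[i:i + k], p)
        if runs and runs[-1][0] == m:
            # Same key as the previous k-mer: extend the current run.
            runs[-1] = (m, runs[-1][1], i + k)
        else:
            runs.append((m, i, i + k))
    return [(m, read[start:end]) for m, start, end in runs]

read = "ACGTTGCATGCA"
k, p = 5, 2
partitions = defaultdict(list)
for key, chunk in partition_read(read, k, p):
    partitions[key].append(chunk)
```

Because a run of j adjacent k-mers is stored as one string of length k + j - 1 rather than j strings of length k, the overlap between neighbors is where the space saving comes from.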

Software and data sets

Questions:

1. What are the substring characteristics of your data?
2. How would you use a De Bruijn graph with your data?

If you don’t know the answers to those questions, you might want to find out.

De Bruijn Graph (Wikipedia)

De Bruijn Sequence (Wikipedia)

How to apply de Bruijn graphs to genome assembly by Phillip E C Compeau, Pavel A Pevzner, and Glenn Tesler. Nature Biotechnology 29, 987–991 (2011) doi:10.1038/nbt.2023

And De Bruijn graphs/sequences are not just for bioinformatics: from the Chess Programming Wiki: De Bruijn Sequences. (Lots of pointers and additional references.)

### ISA-TAB

Sunday, July 15th, 2012

ISA-TAB format page at SourceForge.

Where you will find:

ISA-TAB 1.0 – Candidate release (PDF file)

Example ISA-TAB files.

ISAValidator

Abstract from ISA-TAB 1.0:

This document describes ISA-TAB, a general purpose framework with which to capture and communicate the complex metadata required to interpret experiments employing combinations of technologies, and the associated data files. Sections 1 to 3 introduce the ISA-TAB proposal, describe the rationale behind its development, provide an overview of its structure and relate it to other formats. Section 4 describes the specification in detail; section 5 provides examples of design patterns.

ISA-TAB builds on the existing paradigm that is MAGE-TAB – a tab-delimited format to exchange microarray data. ISA-TAB necessarily maintains backward compatibility with existing MAGE-TAB files to facilitate adoption; conserving the simplicity of MAGE-TAB for simple experimental designs, while incorporating new features to capture the full complexity of experiments employing a combination of technologies. Like MAGE-TAB before it, ISA-TAB is simply a format; the decision on how to regulate its use (i.e. enforcing completion of mandatory fields or use of a controlled terminology) is a matter for those communities, which will implement the format in their systems and for which submission and exchange of minimal information is critical. In this case, an additional layer or of constraints should be agreed and required on top of the ISA-TAB specification.

Knowledge of the MAGE-TAB format is required, on which see: MAGE-TAB.

As terminologies/vocabularies/ontologies evolve, ISA-TAB formatted files are a good example of targets for topic maps.

Researchers can continue their use of ISA-TAB formatted files undisturbed by changes in terminology, vocabulary or even ontology due to the semantic navigation layer provided by topic maps.

Or perhaps more correctly, one researcher or librarian can create a mapping of such changes that benefit all the other members of their lab.

### Journal of Data Mining in Genomics and Proteomics

Saturday, July 14th, 2012

Journal of Data Mining in Genomics and Proteomics

From the Aims and Scope page:

Journal of Data Mining in Genomics & Proteomics (JDMGP), a broad-based journal was founded on two key tenets: To publish the most exciting researches with respect to the subjects of Proteomics & Genomics. Secondly, to provide a rapid turn-around time possible for reviewing and publishing, and to disseminate the articles freely for research, teaching and reference purposes.

In today’s wired world information is available at the click of the button, curtsey the Internet. JDMGP-Open Access gives a worldwide audience larger than that of any subscription-based journal in OMICS field, no matter how prestigious or popular, and probably increases the visibility and impact of published work. JDMGP-Open Access gives barrier-free access to the literature for research. It increases convenience, reach, and retrieval power. Free online literature is available for software that facilitates full-text searching, indexing, mining, summarizing, translating, querying, linking, recommending, alerting, “mash-ups” and other forms of processing and analysis. JDMGP-Open Access puts rich and poor on an equal footing for these key resources and eliminates the need for permissions to reproduce and distribute content.

A publication (among many) from the OMICS Publishing Group, which sponsors a large number of online publications.

Has the potential to be an interesting source of information. Not much in the way of back files but then it is a very young journal.

### Finding Structure in Text, Genome and Other Symbolic Sequences

Saturday, July 14th, 2012

Finding Structure in Text, Genome and Other Symbolic Sequences by Ted Dunning. (thesis, 1998)

Abstract:

The statistical methods derived and described in this thesis provide new ways to elucidate the structural properties of text and other symbolic sequences. Generically, these methods allow detection of a difference in the frequency of a single feature, the detection of a difference between the frequencies of an ensemble of features and the attribution of the source of a text. These three abstract tasks suffice to solve problems in a wide variety of settings. Furthermore, the techniques described in this thesis can be extended to provide a wide range of additional tests beyond the ones described here.

A variety of applications for these methods are examined in detail. These applications are drawn from the area of text analysis and genetic sequence analysis. The textually oriented tasks include finding interesting collocations and cooccurent phrases, language identification, and information retrieval. The biologically oriented tasks include species identification and the discovery of previously unreported long range structure in genes. In the applications reported here where direct comparison is possible, the performance of these new methods substantially exceeds the state of the art.

Overall, the methods described here provide new and effective ways to analyse text and other symbolic sequences. Their particular strength is that they deal well with situations where relatively little data are available. Since these methods are abstract in nature, they can be applied in novel situations with relative ease.

Recently posted but dating from 1998.

Older materials are interesting because the careers of their authors can be tracked, say at DBLP Ted Dunning.

Or it can lead you to check an author in Citeseer:

Abstract:

Much work has been done on the statistical analysis of text. In some cases reported in the literature, inappropriate statistical methods have been used, and statistical significance of results have not been addressed. In particular, asymptotic normality assumptions have often been used unjustifiably, leading to flawed results. This assumption of normal distribution limits the ability to analyze rare events. Unfortunately rare events do make up a large fraction of real text. However, more applicable methods based on likelihood ratio tests are available that yield good results with relatively small samples. These tests can be implemented efficiently, and have been used for the detection of composite terms and for the determination of domain-specific terms. In some cases, these measures perform much better than the methods previously used. In cases where traditional contingency table methods work well, the likelihood ratio tests described here are nearly identical. This paper describes the basis of a measure based on likelihood ratios that can be applied to the analysis of text.

Which has over 600 citations, only one of which is from the author. (I could comment about a well-known self-citing ontologist but I won’t.)
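For the curious, a likelihood ratio statistic of the kind the abstract describes can be sketched for a 2x2 contingency table (say, how often two words occur together versus apart). This is the standard G² form of the test as I understand it; the function name `g2` and the example counts are my own, not taken from the paper.

```python
import math

def g2(k11, k12, k21, k22):
    """Log-likelihood ratio statistic (G^2) for a 2x2 contingency table.

    k11: both events occur, k12 and k21: only one occurs, k22: neither.
    """
    n = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    cells = [(k11, row1, col1), (k12, row1, col2),
             (k21, row2, col1), (k22, row2, col2)]
    total = 0.0
    for observed, row, col in cells:
        # A zero cell contributes nothing (the limit of x * log(x) is 0).
        if observed > 0:
            total += observed * math.log(observed * n / (row * col))
    return 2.0 * total

# Counts that match independence exactly give a score of zero.
independent = g2(10, 20, 30, 60)
# Two events that always occur together give a large score.
associated = g2(10, 0, 0, 10)
```

Unlike tests that lean on normality assumptions, nothing here breaks down when the counts are small, which is the abstract's point about rare events.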

The observations in the thesis about “large” data sets are dated but it merits your attention as fundamental work in the field of textual analysis.

As a bonus, it is quite well written and makes an enjoyable read.

### Compressive Genomics [Compression as Merging]

Wednesday, July 11th, 2012

Compressive genomics by Po-Ru Loh, Michael Baym, and Bonnie Berger (Nature Biotechnology 30, 627–630 (2012) doi:10.1038/nbt.2241)

From the introduction:

In the past two decades, genomic sequencing capabilities have increased exponentially[cites omitted] outstripping advances in computing power[cites omitted]. Extracting new insights from the data sets currently being generated will require not only faster computers, but also smarter algorithms. However, most genomes currently sequenced are highly similar to ones already collected[cite omitted]; thus, the amount of new sequence information is growing much more slowly.

Here we show that this redundancy can be exploited by compressing sequence data in such a way as to allow direct computation on the compressed data using methods we term ‘compressive’ algorithms. This approach reduces the task of computing on many similar genomes to only slightly more than that of operating on just one. Moreover, its relative advantage over existing algorithms will grow with the accumulation of genomic data. We demonstrate this approach by implementing compressive versions of both the Basic Local Alignment Search Tool (BLAST)[cite omitted] and the BLAST-Like Alignment Tool (BLAT)[cite omitted], and we emphasize how compressive genomics will enable biologists to keep pace with current data.

Software available at: Compression-accelerated BLAST and BLAT.

A new line of attack on searching “big data.”

Making “big data” into “smaller data” and enabling analysis of it while still “smaller data.”

Enabling the searching of highly similar genomes by compression is a form of merging isn’t it? That is a sequence (read subject) that occurs multiple times over similar genomes is given a single representative, while preserving its relationship to all the individual genome instances.
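A toy version of that reading, to make "compression as merging" concrete: identical fragments across genomes are merged into one representative that keeps links back to every place it occurred, and a search touches each representative only once. The names `merge_fragments` and `search` and the fixed-size exact-match fragments are simplifying assumptions of mine. The paper works with similar, not merely identical, sequences.

```python
from collections import defaultdict

def merge_fragments(genomes, size):
    """Merge identical fixed-size fragments across genomes.

    genomes: dict of name -> sequence.
    Returns dict of fragment -> list of (genome name, offset) occurrences.
    """
    index = defaultdict(list)
    for name, seq in genomes.items():
        for offset in range(0, len(seq) - size + 1, size):
            index[seq[offset:offset + size]].append((name, offset))
    return dict(index)

def search(index, query):
    """Scan each representative once, then expand hits to all occurrences."""
    hits = []
    for fragment, occurrences in index.items():
        if query in fragment:
            hits.extend(occurrences)
    return sorted(hits)

# Hypothetical toy genomes that differ only in their last fragment.
genomes = {"g1": "ACGTACGTTTGA", "g2": "ACGTACGTCCGA"}
index = merge_fragments(genomes, size=4)
hits = search(index, "CGT")
```

Six fragments collapse to three representatives here, and the search cost follows the number of representatives rather than the number of genomes.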

What makes merger computationally tractable here, while topic map systems, at least some of them, are reported to have scalability issues: Scalability of Topic Map Systems by Marcel Hoyer?

What other examples of computationally tractable merging would you suggest? Including different merging approaches/algorithms. Thinking it might be a useful paper/study to work from scalable merging examples towards less scalable ones. Perhaps to discover what choices have an impact on scalability.

### MicrobeDB: a locally maintainable database of microbial genomic sequences

Sunday, July 8th, 2012

MicrobeDB: a locally maintainable database of microbial genomic sequences by Morgan G. I. Langille, Matthew R. Laird, William W. L. Hsiao, Terry A. Chiu, Jonathan A. Eisen, and Fiona S. L. Brinkman. (Bioinformatics (2012) 28 (14): 1947-1948. doi: 10.1093/bioinformatics/bts273)

Abstract

Summary: Analysis of microbial genomes often requires the general organization and comparison of tens to thousands of genomes both from public repositories and unpublished sources. MicrobeDB provides a foundation for such projects by the automation of downloading published, completed bacterial and archaeal genomes from key sources, parsing annotations of all genomes (both public and private) into a local database, and allowing interaction with the database through an easy to use programming interface. MicrobeDB creates a simple to use, easy to maintain, centralized local resource for various large-scale comparative genomic analyses and a back-end for future microbial application design.

Availability: MicrobeDB is freely available under the GNU-GPL at: http://github.com/mlangill/microbedb/

No doubt a useful project but the article seems to be at war with itself:

Although many of these centers provide genomic data in a variety of static formats such as Genbank and Fasta, these are often inadequate for complex queries. To carry out these analyses efficiently, a relational database such as MySQL (http://mysql.com) can be used to allow rapid querying across many genomes at once. Some existing data providers such as CMR allow downloading of their database files directly, but these databases are designed for large web-based infrastructures and contain numerous tables that demand a steep learning curve. Also, addition of unpublished genomes to these databases is often not supported. A well known and widely used system is the Generic Model Organism Database (GMOD) project (http://gmod.org). GMOD is an open-source project that provides a common platform for building model organism databases such as FlyBase (McQuilton et al., 2011) and WormBase (Yook et al., 2011). GMOD supports a variety of options such as GBrowse (Stein et al., 2002) and a variety of database choices including Chado (Mungall and Emmert, 2007) and BioSQL (http://biosql.org). GMOD provides a comprehensive system, but for many researchers such a complex system is not needed.

On one hand, current solutions are “…often inadequate for complex queries” and just a few lines later, “…such a complex system is not needed.”

I have no doubt that using unfamiliar and complex table structures is a burden on any user. Not to mention lacking the ability to add “unpublished genomes” or fixing versions of data for analysis.

What concerns me is the “solution” being seen as yet another set of “local” options. Which impedes the future use of the now “localized” data.

The issues raised here need to be addressed but one-off solutions seem like a particularly poor choice.

### Genome-scale analysis of interaction dynamics reveals organization of biological networks

Saturday, July 7th, 2012

Genome-scale analysis of interaction dynamics reveals organization of biological networks by Jishnu Das, Jaaved Mohammed, and Haiyuan Yu. (Bioinformatics (2012) 28 (14): 1873-1878. doi: 10.1093/bioinformatics/bts283)

Summary:

Analyzing large-scale interaction networks has generated numerous insights in systems biology. However, such studies have primarily been focused on highly co-expressed, stable interactions. Most transient interactions that carry out equally important functions, especially in signal transduction pathways, are yet to be elucidated and are often wrongly discarded as false positives. Here, we revisit a previously described Smith–Waterman-like dynamic programming algorithm and use it to distinguish stable and transient interactions on a genomic scale in human and yeast. We find that in biological networks, transient interactions are key links topologically connecting tightly regulated functional modules formed by stable interactions and are essential to maintaining the integrity of cellular networks. We also perform a systematic analysis of interaction dynamics across different technologies and find that high-throughput yeast two-hybrid is the only available technology for detecting transient interactions on a large scale.

Research of obvious importance to anyone investigating biological networks, but I mention it for the problem of how to represent transient relationships/interactions in a network.

Assuming a graph/network topology, how does a transient relationship impact a path traversal?

Assuming a graph/network topology, do we ignore the transience for graph theoretical properties such as shortest path?

Do we need graph theoretical queries versus biological network queries? Are the results always the same?

Can transient relationships result in transient properties? How do we record those?

Better yet, how do we ignore transient properties and under what conditions? (Leaving to one side how we would formally/computationally accomplish that ignorance.) What are the theoretical issues?
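One way to see what is at stake in these questions is to label each edge as stable or transient and compare a shortest path with and without the transient edges. This is only a sketch: the edge labels, the node names and the function `shortest_path` are hypothetical, and a plain breadth-first search stands in for whatever traversal you actually use.

```python
from collections import deque

def shortest_path(edges, start, goal, include_transient=True):
    """Breadth-first shortest path over an undirected, labeled edge list.

    edges: iterable of (node_a, node_b, kind), kind "stable" or "transient".
    Returns the path as a list of nodes, or None if the goal is unreachable.
    """
    adjacency = {}
    for a, b, kind in edges:
        if kind == "transient" and not include_transient:
            continue
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in sorted(adjacency.get(node, ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Hypothetical network: a stable chain plus one transient shortcut.
edges = [("A", "B", "stable"), ("B", "C", "stable"),
         ("C", "D", "stable"), ("A", "D", "transient")]
with_transient = shortest_path(edges, "A", "D")
stable_only = shortest_path(edges, "A", "D", include_transient=False)
```

The two calls return different paths for the same pair of nodes, so any graph theoretical property built on shortest paths depends on whether transience is ignored.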

You can find the full text of this article at Professor Yu’s site: http://yulab.icmb.cornell.edu/PDF/Das_B2012.pdf

### Mosaic: making biological sense of complex networks

Thursday, July 5th, 2012

Mosaic: making biological sense of complex networks by Chao Zhang, Kristina Hanspers, Allan Kuchinsky, Nathan Salomonis, Dong Xu, and Alexander R. Pico. (Bioinformatics (2012) 28 (14): 1943-1944. doi: 10.1093/bioinformatics/bts278)

Abstract:

We present a Cytoscape plugin called Mosaic to support interactive network annotation, partitioning, layout and coloring based on gene ontology or other relevant annotations.

From the Introduction:

The increasing throughput and quality of molecular measurements in the domains of genomics, proteomics and metabolomics continue to fuel the understanding of biological processes. Collected per molecule, the scope of these data extends to physical, genetic and biochemical interactions that in turn comprise extensive networks. There are software tools available to visualize and analyze data-derived biological networks (Smoot et al., 2011). One challenge faced by these tools is how to make sense of such networks often represented as massive ‘hairballs’. Many network analysis algorithms filter or partition networks based on topological features, optionally weighted by orthogonal node or edge data (Bader and Hogue, 2003; Royer et al., 2008). Another approach is to mathematically model networks and rely on their statistical properties to make associations with other networks, phenotypes and drug effects, sidestepping the issue of making sense of the network itself altogether (Machado et al., 2011). Acknowledging that there is still great value in engaging the minds of researchers in exploratory data analysis at the level of networks (Kelder et al., 2010), we have produced a Cytoscape plugin called Mosaic to support interactive network annotation and visualization that includes partitioning, layout and coloring based on biologically relevant ontologies (Fig. 1). Mosaic shows slices of a given network in the visual language of biological pathways, which are familiar to any biologist and are ideal frameworks for integrating knowledge.

[Fig. 1 omitted]

Cytoscape is a free and open source network visualization platform that actively supports independent plugin development (Smoot et al., 2011). For annotation, Mosaic relies primarily on the full gene ontology (GO) or simplified ‘slim’ versions (http://www.geneontology.org/GO.slims.shtml). The cellular layout of partitioned subnetworks strictly depends on the cellular component branch of GO, but the other two functions, partitioning and coloring, can be driven by any annotation associated with a major gene or protein identifier system.

You will need:

As per the Mosaic project page.

The Mosaic page offers additional documentation, which will take a while to process. I am particularly interested in annotations of the network driving partitioning.
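As a first approximation of what annotation-driven partitioning means, here is a sketch that groups edges by the annotation their endpoints share and sets aside the edges that cross groups. It is not how the Mosaic plugin is implemented: the function name, the gene names and the one-term-per-node assumption are mine (real GO annotation assigns many terms per gene).

```python
from collections import defaultdict

def partition_by_annotation(edges, annotation):
    """Split a network into subnetworks by a shared annotation term.

    edges: list of (node_a, node_b); annotation: dict of node -> term.
    Returns (partitions, cross_edges), where partitions maps term -> edges.
    """
    partitions = defaultdict(list)
    cross_edges = []
    for a, b in edges:
        term_a = annotation.get(a)
        term_b = annotation.get(b)
        if term_a is not None and term_a == term_b:
            partitions[term_a].append((a, b))
        else:
            # Endpoints differ in annotation (or lack one): keep separately.
            cross_edges.append((a, b))
    return dict(partitions), cross_edges

# Hypothetical genes labeled with cellular component terms.
annotation = {"g1": "nucleus", "g2": "nucleus",
              "g3": "membrane", "g4": "membrane"}
edges = [("g1", "g2"), ("g3", "g4"), ("g2", "g3")]
partitions, cross_edges = partition_by_annotation(edges, annotation)
```

Swap in a different annotation and the same network falls into different pieces, which is why I am curious about annotations driving the partitioning.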

### MuteinDB

Friday, June 29th, 2012

MuteinDB: the mutein database linking substrates, products and enzymatic reactions directly with genetic variants of enzymes by Andreas Braun, Bettina Halwachs, Martina Geier, Katrin Weinhandl, Michael Guggemos, Jan Marienhagen, Anna J. Ruff, Ulrich Schwaneberg, Vincent Rabin, Daniel E. Torres Pazmiño, Gerhard G. Thallinger, and Anton Glieder.

Abstract:

Mutational events as well as the selection of the optimal variant are essential steps in the evolution of living organisms. The same principle is used in laboratory to extend the natural biodiversity to obtain better catalysts for applications in biomanufacturing or for improved biopharmaceuticals. Furthermore, single mutation in genes of drug-metabolizing enzymes can also result in dramatic changes in pharmacokinetics. These changes are a major cause of patient-specific drug responses and are, therefore, the molecular basis for personalized medicine. MuteinDB systematically links laboratory-generated enzyme variants (muteins) and natural isoforms with their biochemical properties including kinetic data of catalyzed reactions. Detailed information about kinetic characteristics of muteins is available in a systematic way and searchable for known mutations and catalyzed reactions as well as their substrates and known products. MuteinDB is broadly applicable to any known protein and their variants and makes mutagenesis and biochemical data searchable and comparable in a simple and easy-to-use manner. For the import of new mutein data, a simple, standardized, spreadsheet-based data format has been defined. To demonstrate the broad applicability of the MuteinDB, first data sets have been incorporated for selected cytochrome P450 enzymes as well as for nitrilases and peroxidases.

Database URL: http://www.muteindb.org/

Why is this relevant to topic maps or semantic diversity you ask?

I will let the authors answer:

Information about specific proteins and their muteins are widely spread in the literature. Many studies only describe single mutation and its effects without comparison to already known muteins. Possible additive effects of single amino acid changes are scarcely described or used. Even after a thorough and time-consuming literature search, researchers face the problem of assembling and presenting the data in an easy understandable and comprehensive way. Essential information may be lost such as details about potentially cooperative mutations or reactions one would not expect in certain protein families. Therefore, a web-accessible database combining available knowledge about a specific enzyme and its muteins in a single place are highly desirable. Such a database would allow researchers to access relevant information about their protein of interest in a fast and easy way and accelerate the engineering of new and improved variants. (Third paragraph of the introduction)

I would have never dreamed that gene data would be spread to Hell and back.

The article will give you insight into how gene data is collected, searched, organized, etc. All of which will be valuable to you whether you are designing or using information systems in this area.

I was a bit let down when I read about data formats:

Most of them are XML based, which can be difficult to create and manipulate. Therefore, simpler, spreadsheet-based formats have been introduced which are more accessible for the individual researcher.

I’ve never had any difficulties with XML based formats but will admit that may not be a universal experience. Sounds to me like the XML community should concentrate a bit less on making people write angle-bang syntax and more on long term useful results. (Which I think XML can deliver.)