Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

July 15, 2012

ISA-TAB

Filed under: Bioinformatics,Biomedical,Genome — Patrick Durusau @ 9:06 am

ISA-TAB format page at SourceForge.

Where you will find:

ISA-TAB 1.0 – Candidate release (PDF file)

Example ISA-TAB files.

ISAValidator

Abstract from ISA-TAB 1.0:

This document describes ISA-TAB, a general purpose framework with which to capture and communicate the complex metadata required to interpret experiments employing combinations of technologies, and the associated data files. Sections 1 to 3 introduce the ISA-TAB proposal, describe the rationale behind its development, provide an overview of its structure and relate it to other formats. Section 4 describes the specification in detail; section 5 provides examples of design patterns.

ISA-TAB builds on the existing paradigm that is MAGE-TAB – a tab-delimited format to exchange microarray data. ISA-TAB necessarily maintains backward compatibility with existing MAGE-TAB files to facilitate adoption; conserving the simplicity of MAGE-TAB for simple experimental designs, while incorporating new features to capture the full complexity of experiments employing a combination of technologies. Like MAGE-TAB before it, ISA-TAB is simply a format; the decision on how to regulate its use (i.e. enforcing completion of mandatory fields or use of a controlled terminology) is a matter for those communities, which will implement the format in their systems and for which submission and exchange of minimal information is critical. In this case, an additional layer of constraints should be agreed and required on top of the ISA-TAB specification.

Knowledge of the MAGE-TAB format is required; on that, see MAGE-TAB.
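Since ISA-TAB, like MAGE-TAB, is plain tab-delimited text, you can start exploring a study file with nothing more than the Python standard library. A minimal sketch (the file name and column header here are my assumptions, not part of the specification):

```python
import csv

# Read a hypothetical ISA-TAB style study file: one header row of field
# names, then one tab-delimited row per sample.
with open("s_study.txt", newline="") as f:
    rows = list(csv.DictReader(f, delimiter="\t"))

# Group samples by a characteristic, e.g. organism, for later lookup.
by_organism = {}
for row in rows:
    organism = row.get("Characteristics[Organism]", "unknown")
    by_organism.setdefault(organism, []).append(row)

for organism, samples in by_organism.items():
    print(organism, len(samples))
```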

As terminologies/vocabularies/ontologies evolve, ISA-TAB formatted files are a good example of targets for topic maps.

Researchers can continue their use of ISA-TAB formatted files undisturbed by changes in terminology, vocabulary or even ontology due to the semantic navigation layer provided by topic maps.

Or perhaps more correctly, one researcher or librarian can create a mapping of such changes that benefits all the other members of their lab.

July 14, 2012

Journal of Data Mining in Genomics and Proteomics

Filed under: Bioinformatics,Biomedical,Data Mining,Genome,Medical Informatics,Proteomics — Patrick Durusau @ 12:20 pm

Journal of Data Mining in Genomics and Proteomics

From the Aims and Scope page:

Journal of Data Mining in Genomics & Proteomics (JDMGP), a broad-based journal, was founded on two key tenets: first, to publish the most exciting research with respect to the subjects of Proteomics & Genomics; second, to provide the most rapid turn-around time possible for reviewing and publishing, and to disseminate the articles freely for research, teaching and reference purposes.

In today’s wired world information is available at the click of a button, courtesy of the Internet. JDMGP-Open Access gives a worldwide audience larger than that of any subscription-based journal in the OMICS field, no matter how prestigious or popular, and probably increases the visibility and impact of published work. JDMGP-Open Access gives barrier-free access to the literature for research. It increases convenience, reach, and retrieval power. Free online literature is available for software that facilitates full-text searching, indexing, mining, summarizing, translating, querying, linking, recommending, alerting, “mash-ups” and other forms of processing and analysis. JDMGP-Open Access puts rich and poor on an equal footing for these key resources and eliminates the need for permissions to reproduce and distribute content.

A publication (among many) from the OMICS Publishing Group, which sponsors a large number of online publications.

It has the potential to be an interesting source of information. There is not much in the way of back files, but then it is a very young journal.

Finding Structure in Text, Genome and Other Symbolic Sequences

Filed under: Genome,Statistics,Symbol,Text Analytics,Text Corpus,Text Mining — Patrick Durusau @ 8:58 am

Finding Structure in Text, Genome and Other Symbolic Sequences by Ted Dunning. (thesis, 1998)

Abstract:

The statistical methods derived and described in this thesis provide new ways to elucidate the structural properties of text and other symbolic sequences. Generically, these methods allow detection of a difference in the frequency of a single feature, the detection of a difference between the frequencies of an ensemble of features and the attribution of the source of a text. These three abstract tasks suffice to solve problems in a wide variety of settings. Furthermore, the techniques described in this thesis can be extended to provide a wide range of additional tests beyond the ones described here.

A variety of applications for these methods are examined in detail. These applications are drawn from the area of text analysis and genetic sequence analysis. The textually oriented tasks include finding interesting collocations and co-occurrent phrases, language identification, and information retrieval. The biologically oriented tasks include species identification and the discovery of previously unreported long range structure in genes. In the applications reported here where direct comparison is possible, the performance of these new methods substantially exceeds the state of the art.

Overall, the methods described here provide new and effective ways to analyse text and other symbolic sequences. Their particular strength is that they deal well with situations where relatively little data are available. Since these methods are abstract in nature, they can be applied in novel situations with relative ease.

Recently posted but dating from 1998.

Older materials are interesting because the careers of their authors can be tracked, say at DBLP: Ted Dunning.

Or it can lead you to check an author in Citeseer:

Accurate Methods for the Statistics of Surprise and Coincidence (1993)

Abstract:

Much work has been done on the statistical analysis of text. In some cases reported in the literature, inappropriate statistical methods have been used, and statistical significance of results have not been addressed. In particular, asymptotic normality assumptions have often been used unjustifiably, leading to flawed results. This assumption of normal distribution limits the ability to analyze rare events. Unfortunately rare events do make up a large fraction of real text. However, more applicable methods based on likelihood ratio tests are available that yield good results with relatively small samples. These tests can be implemented efficiently, and have been used for the detection of composite terms and for the determination of domain-specific terms. In some cases, these measures perform much better than the methods previously used. In cases where traditional contingency table methods work well, the likelihood ratio tests described here are nearly identical. This paper describes the basis of a measure based on likelihood ratios that can be applied to the analysis of text.
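The heart of the paper is the log-likelihood ratio statistic G² computed over a 2×2 contingency table, which is short enough to sketch in Python (the example counts are made up):

```python
from math import log

def llr_2x2(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio statistic G^2 for a 2x2
    contingency table: 2 * sum over cells of k * ln(k / expected)."""
    total = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    g2 = 0.0
    for k, r, c in ((k11, row1, col1), (k12, row1, col2),
                    (k21, row2, col1), (k22, row2, col2)):
        expected = r * c / total
        if k > 0:  # a cell with k == 0 contributes nothing in the limit
            g2 += k * log(k / expected)
    return 2.0 * g2

# Toy counts: bigram "new york" 110 times, "new" without "york" 2442,
# "york" without "new" 111, in a corpus of 50,000 bigrams.
print(llr_2x2(110, 2442, 111, 50000 - 110 - 2442 - 111))
```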

Which has over 600 citations, only one of which is from the author. (I could comment about a well-known self-citing ontologist but I won’t.)

The observations in the thesis about “large” data sets are dated, but the thesis merits your attention as fundamental work in the field of textual analysis.

As a bonus, it is quite well written and makes an enjoyable read.

July 11, 2012

Compressive Genomics [Compression as Merging]

Filed under: Bioinformatics,Compression,Genome,Merging,Scalability — Patrick Durusau @ 2:27 pm

Compressive genomics by Po-Ru Loh, Michael Baym, and Bonnie Berger (Nature Biotechnology 30, 627–630 (2012) doi:10.1038/nbt.2241)

From the introduction:

In the past two decades, genomic sequencing capabilities have increased exponentially[cites omitted] outstripping advances in computing power[cites omitted]. Extracting new insights from the data sets currently being generated will require not only faster computers, but also smarter algorithms. However, most genomes currently sequenced are highly similar to ones already collected[cite omitted]; thus, the amount of new sequence information is growing much more slowly.

Here we show that this redundancy can be exploited by compressing sequence data in such a way as to allow direct computation on the compressed data using methods we term ‘compressive’ algorithms. This approach reduces the task of computing on many similar genomes to only slightly more than that of operating on just one. Moreover, its relative advantage over existing algorithms will grow with the accumulation of genomic data. We demonstrate this approach by implementing compressive versions of both the Basic Local Alignment Search Tool (BLAST)[cite omitted] and the BLAST-Like Alignment Tool (BLAT)[cite omitted], and we emphasize how compressive genomics will enable biologists to keep pace with current data.

Software available at: Compression-accelerated BLAST and BLAT.

A new line of attack on searching “big data.”

It makes “big data” into “smaller data” and enables analysis of it while it remains “smaller data.”

Enabling the searching of highly similar genomes by compression is a form of merging, isn’t it? That is, a sequence (read: subject) that occurs multiple times across similar genomes is given a single representative, while preserving its relationship to all the individual genome instances.
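A toy version of that single-representative idea in Python (data invented): identical reads across genomes collapse to one dictionary key, and the genome instances are preserved as values.

```python
from collections import defaultdict

# Hypothetical (genome_id, read) pairs; identical reads across genomes
# are "merged" under a single representative key.
reads = [
    ("genome_A", "GATTACA"),
    ("genome_B", "GATTACA"),
    ("genome_C", "CCTGAGG"),
]

representatives = defaultdict(list)
for genome_id, read in reads:
    representatives[read].append(genome_id)

# A search now touches each distinct read once, not once per genome,
# while the mapping back to individual genomes is preserved.
for read, genomes in representatives.items():
    print(read, "->", genomes)
```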

What makes merging computationally tractable here, when at least some topic map systems are reported to have scalability issues? See Scalability of Topic Map Systems by Marcel Hoyer.

What other examples of computationally tractable merging would you suggest, including different merging approaches/algorithms? It might make a useful paper/study to work from scalable merging examples towards less scalable ones, perhaps to discover what choices have an impact on scalability.

July 8, 2012

MicrobeDB: a locally maintainable database of microbial genomic sequences

Filed under: Bioinformatics,Biomedical,Database,Genome,MySQL — Patrick Durusau @ 3:54 pm

MicrobeDB: a locally maintainable database of microbial genomic sequences by Morgan G. I. Langille, Matthew R. Laird, William W. L. Hsiao, Terry A. Chiu, Jonathan A. Eisen, and Fiona S. L. Brinkman. (Bioinformatics (2012) 28 (14): 1947-1948. doi: 10.1093/bioinformatics/bts273)

Abstract

Summary: Analysis of microbial genomes often requires the general organization and comparison of tens to thousands of genomes both from public repositories and unpublished sources. MicrobeDB provides a foundation for such projects by the automation of downloading published, completed bacterial and archaeal genomes from key sources, parsing annotations of all genomes (both public and private) into a local database, and allowing interaction with the database through an easy to use programming interface. MicrobeDB creates a simple to use, easy to maintain, centralized local resource for various large-scale comparative genomic analyses and a back-end for future microbial application design.

Availability: MicrobeDB is freely available under the GNU-GPL at: http://github.com/mlangill/microbedb/

No doubt a useful project but the article seems to be at war with itself:

Although many of these centers provide genomic data in a variety of static formats such as Genbank and Fasta, these are often inadequate for complex queries. To carry out these analyses efficiently, a relational database such as MySQL (http://mysql.com) can be used to allow rapid querying across many genomes at once. Some existing data providers such as CMR allow downloading of their database files directly, but these databases are designed for large web-based infrastructures and contain numerous tables that demand a steep learning curve. Also, addition of unpublished genomes to these databases is often not supported. A well known and widely used system is the Generic Model Organism Database (GMOD) project (http://gmod.org). GMOD is an open-source project that provides a common platform for building model organism databases such as FlyBase (McQuilton et al., 2011) and WormBase (Yook et al., 2011). GMOD supports a variety of options such as GBrowse (Stein et al., 2002) and a variety of database choices including Chado (Mungall and Emmert, 2007) and BioSQL (http://biosql.org). GMOD provides a comprehensive system, but for many researchers such a complex system is not needed.

On one hand, current solutions are “…often inadequate for complex queries” and just a few lines later, “…such a complex system is not needed.”

I have no doubt that using unfamiliar and complex table structures is a burden on any user. Not to mention the inability to add “unpublished genomes” or to fix versions of data for analysis.

What concerns me is the “solution” being seen as yet another set of “local” options, which impedes future use of the now “localized” data.

The issues raised here need to be addressed, but one-off solutions seem like a particularly poor choice.
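To make the “complex queries” point concrete, here is the kind of cross-genome query a local relational database makes cheap, sketched with Python’s built-in sqlite3 (the schema and data are invented, not MicrobeDB’s):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE genome (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE gene (id INTEGER PRIMARY KEY, genome_id INTEGER,
                   locus_tag TEXT, product TEXT);
INSERT INTO genome VALUES (1, 'E. coli K-12'), (2, 'B. subtilis 168');
INSERT INTO gene VALUES
  (1, 1, 'b0001', 'thr operon leader peptide'),
  (2, 1, 'b0002', 'aspartokinase I'),
  (3, 2, 'BSU00010', 'replication initiator DnaA');
""")

# One query across all genomes at once -- awkward with flat GenBank files.
for name, n_genes in conn.execute("""
    SELECT g.name, COUNT(*) FROM gene AS ge
    JOIN genome AS g ON g.id = ge.genome_id
    GROUP BY g.name"""):
    print(name, n_genes)
```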

July 7, 2012

Genome-scale analysis of interaction dynamics reveals organization of biological networks

Filed under: Bioinformatics,Biomedical,Genome,Graphs,Networks — Patrick Durusau @ 5:25 am

Genome-scale analysis of interaction dynamics reveals organization of biological networks by Jishnu Das, Jaaved Mohammed, and Haiyuan Yu. (Bioinformatics (2012) 28 (14): 1873-1878. doi: 10.1093/bioinformatics/bts283)

Summary:

Analyzing large-scale interaction networks has generated numerous insights in systems biology. However, such studies have primarily been focused on highly co-expressed, stable interactions. Most transient interactions that carry out equally important functions, especially in signal transduction pathways, are yet to be elucidated and are often wrongly discarded as false positives. Here, we revisit a previously described Smith–Waterman-like dynamic programming algorithm and use it to distinguish stable and transient interactions on a genomic scale in human and yeast. We find that in biological networks, transient interactions are key links topologically connecting tightly regulated functional modules formed by stable interactions and are essential to maintaining the integrity of cellular networks. We also perform a systematic analysis of interaction dynamics across different technologies and find that high-throughput yeast two-hybrid is the only available technology for detecting transient interactions on a large scale.

Research of obvious importance to anyone investigating biological networks, but I mention it for the problem it raises: how do we represent transient relationships/interactions in a network?

Assuming a graph/network topology, how does a transient relationship impact a path traversal?

Assuming a graph/network topology, do we ignore the transience for graph theoretical properties such as shortest path?

Do we need graph theoretical queries versus biological network queries? Are the results always the same?

Can transient relationships result in transient properties? How do we record those?

Better yet, how do we ignore transient properties and under what conditions? (Leaving to one side how we would formally/computationally accomplish that ignorance.) What are the theoretical issues?
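One cheap way to experiment with those questions: tag each edge as stable or transient and compare traversals with and without the transient ones. A sketch using the networkx library (the toy network and labels are invented):

```python
import networkx as nx

G = nx.Graph()
G.add_edge("A", "B", kind="stable")
G.add_edge("B", "C", kind="transient")  # e.g. a signaling interaction
G.add_edge("B", "D", kind="stable")
G.add_edge("D", "C", kind="stable")

# A read-only view of the network containing only the stable edges.
stable = nx.subgraph_view(
    G, filter_edge=lambda u, v: G[u][v]["kind"] == "stable")

print(nx.shortest_path(G, "A", "C"))       # uses the transient edge: A-B-C
print(nx.shortest_path(stable, "A", "C"))  # ignores it: A-B-D-C
```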

You can find the full text of this article at Professor Yu’s site: http://yulab.icmb.cornell.edu/PDF/Das_B2012.pdf

July 5, 2012

Mosaic: making biological sense of complex networks

Filed under: Bioinformatics,Gene Ontology,Genome,Graphs,Networks — Patrick Durusau @ 12:14 pm

Mosaic: making biological sense of complex networks by Chao Zhang, Kristina Hanspers, Allan Kuchinsky, Nathan Salomonis, Dong Xu, and Alexander R. Pico. (Bioinformatics (2012) 28 (14): 1943-1944. doi: 10.1093/bioinformatics/bts278)

Abstract:

We present a Cytoscape plugin called Mosaic to support interactive network annotation, partitioning, layout and coloring based on gene ontology or other relevant annotations.

From the Introduction:

The increasing throughput and quality of molecular measurements in the domains of genomics, proteomics and metabolomics continue to fuel the understanding of biological processes. Collected per molecule, the scope of these data extends to physical, genetic and biochemical interactions that in turn comprise extensive networks. There are software tools available to visualize and analyze data-derived biological networks (Smoot et al., 2011). One challenge faced by these tools is how to make sense of such networks often represented as massive ‘hairballs’. Many network analysis algorithms filter or partition networks based on topological features, optionally weighted by orthogonal node or edge data (Bader and Hogue, 2003; Royer et al., 2008). Another approach is to mathematically model networks and rely on their statistical properties to make associations with other networks, phenotypes and drug effects, sidestepping the issue of making sense of the network itself altogether (Machado et al., 2011). Acknowledging that there is still great value in engaging the minds of researchers in exploratory data analysis at the level of networks (Kelder et al., 2010), we have produced a Cytoscape plugin called Mosaic to support interactive network annotation and visualization that includes partitioning, layout and coloring based on biologically relevant ontologies (Fig. 1). Mosaic shows slices of a given network in the visual language of biological pathways, which are familiar to any biologist and are ideal frameworks for integrating knowledge.

[Fig. 1 omitted]

Cytoscape is a free and open source network visualization platform that actively supports independent plugin development (Smoot et al., 2011). For annotation, Mosaic relies primarily on the full gene ontology (GO) or simplified ‘slim’ versions (http://www.geneontology.org/GO.slims.shtml). The cellular layout of partitioned subnetworks strictly depends on the cellular component branch of GO, but the other two functions, partitioning and coloring, can be driven by any annotation associated with a major gene or protein identifier system.

You will need the prerequisites listed on the Mosaic project page.

The Mosaic page offers additional documentation, which will take a while to process. I am particularly interested in annotations of the network driving partitioning.
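The partition-by-annotation idea itself is easy to prototype outside Cytoscape. A sketch in Python, with invented gene-to-GO-slim assignments:

```python
from collections import defaultdict

# Hypothetical gene -> GO slim cellular component annotation.
annotation = {
    "TP53": "nucleus", "MDM2": "nucleus",
    "EGFR": "plasma membrane", "KRAS": "plasma membrane",
    "CS":   "mitochondrion",
}

partitions = defaultdict(set)
for gene, compartment in annotation.items():
    partitions[compartment].add(gene)

# Each partition becomes one "tile" of the mosaic to lay out and color.
for compartment, genes in sorted(partitions.items()):
    print(compartment, sorted(genes))
```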

June 29, 2012

MuteinDB

Filed under: Bioinformatics,Biomedical,Genome,XML — Patrick Durusau @ 3:16 pm

MuteinDB: the mutein database linking substrates, products and enzymatic reactions directly with genetic variants of enzymes by Andreas Braun, Bettina Halwachs, Martina Geier, Katrin Weinhandl, Michael Guggemos, Jan Marienhagen, Anna J. Ruff, Ulrich Schwaneberg, Vincent Rabin, Daniel E. Torres Pazmiño, Gerhard G. Thallinger, and Anton Glieder.

Abstract:

Mutational events as well as the selection of the optimal variant are essential steps in the evolution of living organisms. The same principle is used in laboratory to extend the natural biodiversity to obtain better catalysts for applications in biomanufacturing or for improved biopharmaceuticals. Furthermore, single mutation in genes of drug-metabolizing enzymes can also result in dramatic changes in pharmacokinetics. These changes are a major cause of patient-specific drug responses and are, therefore, the molecular basis for personalized medicine. MuteinDB systematically links laboratory-generated enzyme variants (muteins) and natural isoforms with their biochemical properties including kinetic data of catalyzed reactions. Detailed information about kinetic characteristics of muteins is available in a systematic way and searchable for known mutations and catalyzed reactions as well as their substrates and known products. MuteinDB is broadly applicable to any known protein and their variants and makes mutagenesis and biochemical data searchable and comparable in a simple and easy-to-use manner. For the import of new mutein data, a simple, standardized, spreadsheet-based data format has been defined. To demonstrate the broad applicability of the MuteinDB, first data sets have been incorporated for selected cytochrome P450 enzymes as well as for nitrilases and peroxidases.

Database URL: http://www.muteindb.org/

Why is this relevant to topic maps or semantic diversity you ask?

I will let the author’s answer:

Information about specific proteins and their muteins are widely spread in the literature. Many studies only describe single mutation and its effects without comparison to already known muteins. Possible additive effects of single amino acid changes are scarcely described or used. Even after a thorough and time-consuming literature search, researchers face the problem of assembling and presenting the data in an easy understandable and comprehensive way. Essential information may be lost such as details about potentially cooperative mutations or reactions one would not expect in certain protein families. Therefore, a web-accessible database combining available knowledge about a specific enzyme and its muteins in a single place are highly desirable. Such a database would allow researchers to access relevant information about their protein of interest in a fast and easy way and accelerate the engineering of new and improved variants. (Third paragraph of the introduction)

I would have never dreamed that gene data would be spread to Hell and back. 😉

The article will give you insight into how gene data is collected, searched, organized, etc. All of which will be valuable to you whether you are designing or using information systems in this area.

I was a bit let down when I read about data formats:

Most of them are XML based, which can be difficult to create and manipulate. Therefore, simpler, spreadsheet-based formats have been introduced which are more accessible for the individual researcher.

I’ve never had any difficulties with XML-based formats, but will admit that may not be a universal experience. Sounds to me like the XML community should concentrate a bit less on making people write angle-bang syntax and more on long-term useful results. (Which I think XML can deliver.)
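For what it is worth, round-tripping a small record with Python’s standard library shows how little ceremony XML actually requires. The element names below are my invention, not MuteinDB’s format:

```python
import xml.etree.ElementTree as ET

# Build a small, hypothetical mutein record...
mutein = ET.Element("mutein", gene="CYP102A1")
ET.SubElement(mutein, "mutation").text = "F87A"
ET.SubElement(mutein, "reaction", substrate="lauric acid").text = "hydroxylation"

xml_text = ET.tostring(mutein, encoding="unicode")
print(xml_text)

# ...and parse it back.
parsed = ET.fromstring(xml_text)
print(parsed.get("gene"), parsed.find("mutation").text)
```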

June 25, 2012

…a phylogeny-aware graph algorithm

Filed under: Algorithms,Alignment,Genome,Graphs,Sequence Detection — Patrick Durusau @ 2:43 pm

Accurate extension of multiple sequence alignments using a phylogeny-aware graph algorithm by Loytynoja, A., Vilella, A. J., Goldman, N.

From the post:

Motivation: Accurate alignment of large numbers of sequences is demanding and the computational burden is further increased by downstream analyses depending on these alignments. With the abundance of sequence data, an integrative approach of adding new sequences to existing alignments without their full re-computation and maintaining the relative matching of existing sequences is an attractive option. Another current challenge is the extension of reference alignments with fragmented sequences, as those coming from next-generation metagenomics, that contain relatively little information. Widely used methods for alignment extension are based on profile representation of reference sequences. These do not incorporate and use phylogenetic information and are affected by the composition of the reference alignment and the phylogenetic positions of query sequences.

Results: We have developed a method for phylogeny-aware alignment of partial-order sequence graphs and apply it here to the extension of alignments with new data. Our new method, called PAGAN, infers ancestral sequences for the reference alignment and adds new sequences in their phylogenetic context, either to predefined positions or by finding the best placement for sequences of unknown origin. Unlike profile-based alternatives, PAGAN considers the phylogenetic relatedness of the sequences and is not affected by inclusion of more diverged sequences in the reference set. Our analyses show that PAGAN outperforms alternative methods for alignment extension and provides superior accuracy for both DNA and protein data, the improvement being especially large for fragmented sequences. Moreover, PAGAN-generated alignments of noisy next-generation sequencing (NGS) sequences are accurate enough for the use of RNA-seq data in evolutionary analyses.

Availability: PAGAN is written in C++, licensed under the GPL and its source code is available at http://code.google.com/p/pagan-msa.

Contact: ari.loytynoja@helsinki.fi

Does your graph software support “…phylogeny-aware alignment of partial-order sequence graphs…?”
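For anyone wanting to try: a partial-order sequence graph is just a DAG in which each aligned sequence is one path through shared nodes. A minimal sketch with toy sequences (the (column, residue) encoding is mine):

```python
# Merge two aligned toy sequences, GAT-ACA and GATTACA, into a
# partial-order graph: shared residues become shared nodes, the
# insertion becomes a branch. Nodes are (column, residue) pairs.
edges = set()

def add_path(residues):
    # residues: list of (column, residue) pairs, gap columns omitted
    for a, b in zip(residues, residues[1:]):
        edges.add((a, b))

add_path([(0, "G"), (1, "A"), (2, "T"), (4, "A"), (5, "C"), (6, "A")])
add_path([(0, "G"), (1, "A"), (2, "T"), (3, "T"), (4, "A"), (5, "C"), (6, "A")])

# The branch after (2, 'T') keeps both histories, T->A and T->T->A,
# which a flat profile representation would average away.
for a, b in sorted(edges):
    print(a, "->", b)
```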

June 23, 2012

Big Data in Genomics and Cancer Treatment

Filed under: BigData,Genome,Hadoop,MapReduce — Patrick Durusau @ 6:48 pm

Big Data in Genomics and Cancer Treatment by Tanya Maslyanko.

From the post:

Why genomics?

Big data. These are two words the world has been hearing a lot lately and it has been in relevance to a wide array of use cases in social media, government regulation, auto insurance, retail targeting, etc. The list goes on. However, a very important concept that should receive the same (if not more) recognition is the presence of big data in human genome research.

Three billion base pairs make up the DNA present in humans. It’s probably safe to say that such a massive amount of data should be organized in a useful way, especially if it presents the possibility of eliminating cancer. Cancer treatment has been around since its first documented case in Egypt (1500 BC) when humans began distinguishing between malignant and benign tumors by learning how to surgically remove them. It is intriguing and scientifically helpful to take a look at how far the world’s knowledge of cancer has progressed since that time and what kind of role big data (and its management and analysis) plays in the search for a cure.

The most concerning issue with cancer, and the ultimate reason for why it still hasn’t been completely cured, is that it mutates differently for every individual and reacts in unexpected ways with people’s genetic make up. Professionals and researchers in the field of oncology have to assert the fact that each patient requires personalized treatment and medication in order to manage the specific type of cancer that they have. Elaine Mardis, PhD, co-director of the Genome Institute at the School of Medicine, believes that it is essential to identify mutations at the root of each tumor and to map their genetic evolution in order to make progress in the battle against cancer. “Genome analysis can play a role at multiple time points during a patient’s treatment, to identify ‘driver’ mutations in the tumor genome and to determine whether cells carrying those mutations have been eliminated by treatment.”

A not terribly technical but useful summary and pointers to the use of Hadoop in connection with genomics and cancer research/treatment. It may help give some substance to the buzz words “big data.”

June 12, 2012

Network Medicine: Using Visualization to Decode Complex Diseases

Filed under: Bioinformatics,Biomedical,Genome,Graphs,Networks — Patrick Durusau @ 6:26 pm

Network Medicine: Using Visualization to Decode Complex Diseases

From the post:

Albert-László Barabási is a physicist, but maybe best known for his work in the field of network theory. In his TEDMED talk titled “Network Medicine: A Network Based Approach to Decode Complex Diseases” [tedmed.com], Albert-László applies advanced network theory to the field of biology.

Using a metaphor of Manhattan maps, he explains how an all-encompassing map of the relationships between genes, proteins and metabolites can form the key to truly understanding the mechanisms behind many diseases. He further makes the point that diseases should not be divided up into separate organ-based branches of medicine, but rather seen as a tightly interconnected network.

More information and movies at the post (information aesthetics)

Turns out that relationships (can you say graph/network?) are going to be critical in the treatment of disease. (Not treatment of symptoms, treatment of disease.)

May 30, 2012

How to Stay Current in Bioinformatics/Genomics [Role for Topic Maps as Filters?]

Filed under: Bioinformatics,Filters,Genome — Patrick Durusau @ 3:09 pm

How to Stay Current in Bioinformatics/Genomics by Stephen Turner.

From the post:

A few folks have asked me how I get my news and stay on top of what’s going on in my field, so I thought I’d share my strategy. With so many sources of information begging for your attention, the difficulty is not necessarily finding what’s interesting, but filtering out what isn’t. What you don’t read is just as important as what you do, so when it comes to things like RSS, Twitter, and especially e-mail, it’s essential to filter out sources where the content consistently fails to be relevant or capture your interest. I run a bioinformatics core, so I’m more broadly interested in applied methodology and study design rather than any particular phenotype, model system, disease, or method. With that in mind, here’s how I stay current with things that are relevant to me. Please leave comments with what you’re reading and what you find useful that I omitted here.

Here is a concrete example of the information feeds used to stay current on bioinformatics/genomics.

A topic map mantra has been: “All the information about a subject in one place.”

Should that change to “current information about subject(s)…,” with topic maps as a filtering strategy rather than an aggregation strategy?

I think of filters as “subtractive,” but that is only one view of filtering. Filters can be “additive” as well.

Take a look at the information feeds Stephen is using.

Would you use topic maps as “additive” or “subtractive” filters?
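The distinction is easy to state in code. A sketch over a hypothetical feed of titles:

```python
titles = [
    "Hadoop for genome assembly",
    "Celebrity gossip roundup",
    "BWT-based read compression",
]

# Subtractive: start from everything, remove what fails a test.
subtractive = [t for t in titles if "gossip" not in t.lower()]

# Additive: start from nothing, admit only what passes a test.
keywords = ("genome", "bwt")
additive = [t for t in titles if any(k in t.lower() for k in keywords)]

print(subtractive)  # drops the gossip item
print(additive)     # keeps only the keyword matches
```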

May 29, 2012

Statistics for Genomics (Spring 2012)

Filed under: Genome,R,Statistics — Patrick Durusau @ 6:27 pm

Statistics for Genomics (Spring 2012) by Rafael Irizarry.

Rafael is in the process of posting lectures from his statistics for genomics course online.

Updates:

RafaLab’s Facebook page

Twitter feed

Good way to learn R, statistics and a good bit about genomics.

May 8, 2012

Downloading the XML data from the Exome Variant Server

Filed under: Bioinformatics,Biomedical,Genome — Patrick Durusau @ 10:44 am

Downloading the XML data from the Exome Variant Server

Pierre Lindenbaum writes:

From EVS: “The goal of the NHLBI GO Exome Sequencing Project (ESP) is to discover novel genes and mechanisms contributing to heart, lung and blood disorders by pioneering the application of next-generation sequencing of the protein coding regions of the human genome across diverse, richly-phenotyped populations and to share these datasets and findings with the scientific community to extend and enrich the diagnosis, management and treatment of heart, lung and blood disorders.”

The NHLBI Exome Sequencing Project provides a download area but I wanted to build a local database for the richer XML data returned by their Web Services (previously described here on my blog). The following Java program sends some XML/SOAP requests to the EVS server for each chromosome using a genomic window of 150000 bp and parses the XML response.
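Pierre’s program is Java, but the windowing loop at its core is easy to sketch in Python. (The chromosome lengths below are stand-ins, and the actual SOAP request is elided since I don’t want to guess at the EVS API.)

```python
WINDOW = 150_000  # genomic window size from the post, in bp

# Stand-in chromosome lengths (bp); real values come from the genome build.
chromosomes = {"chr21": 48_129_895, "chr22": 51_304_566}

def windows(length, size=WINDOW):
    """Yield (start, end) windows covering positions 1..length."""
    for start in range(1, length + 1, size):
        yield start, min(start + size - 1, length)

for chrom, length in chromosomes.items():
    for start, end in windows(length):
        pass  # here one would send the XML/SOAP request for chrom:start-end
```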

If you are interested in tools that will assist you in populating a genome-centric topic map, Pierre’s blog is an important one to follow.

May 2, 2012

Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform

Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform by Anthony J. Cox, Markus J. Bauer, Tobias Jakobi, and Giovanna Rosone.

Abstract:

Motivation

The Burrows-Wheeler transform (BWT) is the foundation of many algorithms for compression and indexing of text data, but the cost of computing the BWT of very large string collections has prevented these techniques from being widely applied to the large sets of sequences often encountered as the outcome of DNA sequencing experiments. In previous work, we presented a novel algorithm that allows the BWT of human genome scale data to be computed on very moderate hardware, thus enabling us to investigate the BWT as a tool for the compression of such datasets.

Results

We first used simulated reads to explore the relationship between the level of compression and the error rate, the length of the reads and the level of sampling of the underlying genome and compare choices of second-stage compression algorithm.

We demonstrate that compression may be greatly improved by a particular reordering of the sequences in the collection and give a novel ‘implicit sorting’ strategy that enables these benefits to be realised without the overhead of sorting the reads. With these techniques, a 45x coverage of real human genome sequence data compresses losslessly to under 0.5 bits per base, allowing the 135.3Gbp of sequence to fit into only 8.2Gbytes of space (trimming a small proportion of low-quality bases from the reads improves the compression still further).

This is more than 4 times smaller than the size achieved by a standard BWT-based compressor (bzip2) on the untrimmed reads, but an important further advantage of our approach is that it facilitates the building of compressed full text indexes such as the FM-index on large-scale DNA sequence collections.
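If the BWT is new to you, a naive Python sketch shows why it helps compression; this is fine for toy strings, while the paper’s contribution is doing it at genome scale:

```python
def bwt(s, sentinel="$"):
    """Naive Burrows-Wheeler transform: the last column of the sorted
    rotations. O(n^2 log n); real tools use far better constructions."""
    s += sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

# Repetitive input produces long runs in the BWT output, which a
# second-stage compressor (run-length coding, bzip2, ...) then exploits.
print(bwt("GATTACAGATTACAGATTACA"))
```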

Important work for several reasons.

First, if the human genome is thought of as “big data,” it opens the possibility that compressed full text indexes can be built for other instances of “big data.”

Second, indexing is similar to topic mapping in the sense that pointers to information about a particular subject are gathered to a common location. Indexes often account for synonyms (see also) and distinguish the use of the same word for different subjects (polysemy).

Third, depending on the granularity of tokenizing and indexing, index entries should be capable of recombination to create new index entries.

Source code for this approach:

Code to construct the BWT and SAP-array on large genomic data sets is part of the BEETL library, available as a GitHub repository at git@github.com:BEETL/BEETL.git.

Comments?
