Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

July 31, 2012

Processing Rat Brain Neuronal Signals Using A Hadoop Computing Cluster – Part I

Filed under: Bioinformatics,Biomedical,Hadoop,Signal Processing — Patrick Durusau @ 4:54 pm

Processing Rat Brain Neuronal Signals Using A Hadoop Computing Cluster – Part I by Jadin C. Jackson, PhD & Bradley S. Rubin, PhD.

From the introduction:

In this three-part series of posts, we will share our experiences tackling a scientific computing challenge that may serve as a useful practical example for those readers considering Hadoop and Hive as an option to meet their growing technical and scientific computing needs. This first part describes some of the background behind our application and the advantages of Hadoop that make it an attractive framework in which to implement our solution. Part II dives into the technical details of the data we aimed to analyze and of our solution. Finally, we wrap up this series in Part III with a description of some of our main results, and most importantly perhaps, a list of things we learned along the way, as well as future possibilities for improvements.

And:

Problem Statement

Prior to starting this work, Jadin had data previously gathered by himself and from neuroscience researchers who are interested in the role of the brain region called the hippocampus. In both rats and humans, this region is responsible for both spatial processing and memory storage and retrieval. For example, as a rat runs a maze, neurons in the hippocampus, each representing a point in space, fire in sequence. When the rat revisits a path, and pauses to make decisions about how to proceed, those same neurons fire in similar sequences as the rat considers the previous consequences of taking one path versus another. In addition to this binary-like firing of neurons, brain waves, produced by ensembles of neurons, are present in different frequency bands. These act somewhat like clock signals, and the phase relationships of these signals correlate to specific brain signal pathways that provide input to this sub-region of the hippocampus.

The goal of the underlying neuroscience research is to correlate the physical state of the rat with specific characteristics of the signals coming from the neural circuitry in the hippocampus. Those signal differences reflect the origin of signals to the hippocampus. Signals that arise within the hippocampus indicate actions based on memory input, such as reencountering previously encountered situations. Signals that arise outside the hippocampus correspond to other cognitive processing. In this work, we digitally signal process the individual neuronal signal output and turn it into spectral information related to the brain region of origin for the signal input.

If this doesn’t sound like a topic map related problem on your first read, what would you call the “…brain region of origin for the signal input[?]”

That is, if you wanted to say something about it. Or wanted to associate information, oh, I don’t know, captured from a signal processing application with it?

Hmmm, that’s what I thought too.

Besides, it is a good opportunity for you to exercise your Hadoop skills. Never a bad thing to work on the unfamiliar.
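
If you want a feel for the “spectral information” step described above, here is a toy sketch in Python: turn a sampled signal into per-band power with an FFT. The band names and edges below are my own illustration, not taken from the article.

    import numpy as np

    def band_power(signal, fs, bands):
        """Total FFT power of `signal` (sampled at `fs` Hz) per frequency band."""
        freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
        power = np.abs(np.fft.rfft(signal)) ** 2
        return {name: power[(freqs >= lo) & (freqs < hi)].sum()
                for name, (lo, hi) in bands.items()}

    fs = 1000.0                                      # sampling rate, Hz
    t = np.arange(0, 2.0, 1.0 / fs)                  # two seconds of samples
    sig = np.sin(2 * np.pi * 8 * t) + 0.3 * np.random.randn(t.size)  # 8 Hz rhythm + noise
    print(band_power(sig, fs, {"theta": (4, 12), "gamma": (30, 80)}))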

July 30, 2012

BioExtract Server

Filed under: Bioinformatics,Genome — Patrick Durusau @ 2:54 pm

BioExtract Server: data access, analysis, storage and workflow creation

From “About us:”

BioExtract harnesses the power of online informatics tools for creating and customizing workflows. Users can query online sequence data, analyze it using an array of informatics tools (web service and desktop), create and share custom workflows for repeated analysis, and save the resulting data and workflows in standardized reports. This work was initially supported by NSF grant 0090732. Current work is being supported by NSF DBI-0606909.

A great tool for sequence data researchers and a good example of what is possible with other structured data sets.

Much has been made (and rightly so) of the need for and difficulties of processing unstructured data.

But we should not ignore the structured data dumps being released by governments and other groups around the world.

And we should recognize that hosted workflows and processing can make insights into data a matter of skill, rather than ownership of enough hardware.

Neo4j and Bioinformatics [Webinar]

Filed under: Bio4j,Bioinformatics,Neo4j — Patrick Durusau @ 5:39 am

Neo4j and Bioinformatics [Webinar]

Thursday August 9 10:00 PDT / 19:00 CEST

From the webpage:

The world of data is changing. Big Data and NOSQL are bringing new ways of understanding your data.

This opens a whole new world of possibilities for a wide range of fields, and bioinformatics is no exception. This paradigm provides bioinformaticians with a powerful and intuitive framework, to deal with biological data that is naturally interconnected.

Pablo Pareja will give an overview of Bio4j project, and then move to some of its recent applications.

  • BG7: a new system for bacterial genome annotation designed for NGS data
  • MG7: metagenomics + taxonomy integration
  • Evolutionary studies, transcriptional networks, network analysis…
  • Future directions

Speaker: Pablo Pareja, Project Leader of Bio4j

If you are thinking about “scale,” consider the current stats on Bio4j:

The current version of Bio4j includes:

Relationships: 530,642,683

Nodes: 76,071,411

Relationship types: 139

Node types: 38

With room to spare!

July 26, 2012

Network biology methods integrating biological data for translational science

Filed under: Bioinformatics,Text Mining — Patrick Durusau @ 1:35 pm

Network biology methods integrating biological data for translational science by Gurkan Bebek, Mehmet Koyutürk, Nathan D. Price, and Mark R. Chance. (Brief Bioinform (2012) 13 (4): 446-459. doi: 10.1093/bib/bbr075)

Abstract:

The explosion of biomedical data, both on the genomic and proteomic side as well as clinical data, will require complex integration and analysis to provide new molecular variables to better understand the molecular basis of phenotype. Currently, much data exist in silos and is not analyzed in frameworks where all data are brought to bear in the development of biomarkers and novel functional targets. This is beginning to change. Network biology approaches, which emphasize the interactions between genes, proteins and metabolites provide a framework for data integration such that genome, proteome, metabolome and other -omics data can be jointly analyzed to understand and predict disease phenotypes. In this review, recent advances in network biology approaches and results are identified. A common theme is the potential for network analysis to provide multiplexed and functionally connected biomarkers for analyzing the molecular basis of disease, thus changing our approaches to analyzing and modeling genome- and proteome-wide data.

Integrating as well as filtering data for various modeling purposes is standard topic map fare.

Looking forward to complex integration needs driving further development of topic maps!

Mining the pharmacogenomics literature—a survey of the state of the art

Filed under: Bioinformatics,Genome,Pharmaceutical Research,Text Mining — Patrick Durusau @ 1:23 pm

Mining the pharmacogenomics literature—a survey of the state of the art by Udo Hahn, K. Bretonnel Cohen, and Yael Garten. (Brief Bioinform (2012) 13 (4): 460-494. doi: 10.1093/bib/bbs018)

Abstract:

This article surveys efforts on text mining of the pharmacogenomics literature, mainly from the period 2008 to 2011. Pharmacogenomics (or pharmacogenetics) is the field that studies how human genetic variation impacts drug response. Therefore, publications span the intersection of research in genotypes, phenotypes and pharmacology, a topic that has increasingly become a focus of active research in recent years. This survey covers efforts dealing with the automatic recognition of relevant named entities (e.g. genes, gene variants and proteins, diseases and other pathological phenomena, drugs and other chemicals relevant for medical treatment), as well as various forms of relations between them. A wide range of text genres is considered, such as scientific publications (abstracts, as well as full texts), patent texts and clinical narratives. We also discuss infrastructure and resources needed for advanced text analytics, e.g. document corpora annotated with corresponding semantic metadata (gold standards and training data), biomedical terminologies and ontologies providing domain-specific background knowledge at different levels of formality and specificity, software architectures for building complex and scalable text analytics pipelines and Web services grounded to them, as well as comprehensive ways to disseminate and interact with the typically huge amounts of semiformal knowledge structures extracted by text mining tools. Finally, we consider some of the novel applications that have already been developed in the field of pharmacogenomic text mining and point out perspectives for future research.

At thirty-six (36) pages and well over 200 references, this is going to take a while to digest.

Some questions to be thinking about while reading:

How are entity recognition issues same/different?

What techniques have you seen before? How different/same?

What other techniques would you suggest?

July 20, 2012

PyKnot: a PyMOL tool for the discovery and analysis of knots in proteins

Filed under: Bioinformatics,Graphics,Visualization — Patrick Durusau @ 4:25 pm

PyKnot: a PyMOL tool for the discovery and analysis of knots in proteins by Rhonald C. Lua. (Bioinformatics 2012 28: 2069-2071.)

Abstract:

Summary: Understanding the differences between knotted and unknotted protein structures may offer insights into how proteins fold. To characterize the type of knot in a protein, we have developed PyKnot, a plugin that works seamlessly within the PyMOL molecular viewer and gives quick results including the knot’s invariants, crossing numbers and simplified knot projections and backbones. PyKnot may be useful to researchers interested in classifying knots in macromolecules and provides tools for students of biology and chemistry with which to learn topology and macromolecular visualization.

Availability: PyMOL is available at http://www.pymol.org. The PyKnot module and tutorial videos are available at http://youtu.be/p95aif6xqcM.

Contact: rhonald.lua@gmail.com

Apologies but this article is not open access.

You can, however, still reach the PyMOL and PyKnot software and supporting documentation.

Learning how others use visualization can’t be a bad thing!

Optimal simultaneous superpositioning of multiple structures with missing data

Filed under: Alignment,Bioinformatics,Multidimensional,Subject Identity,Superpositioning — Patrick Durusau @ 3:55 pm

Optimal simultaneous superpositioning of multiple structures with missing data by Douglas L. Theobald and Phillip A. Steindel. (Bioinformatics 2012 28: 1972-1979.)

Abstract:

Motivation: Superpositioning is an essential technique in structural biology that facilitates the comparison and analysis of conformational differences among topologically similar structures. Performing a superposition requires a one-to-one correspondence, or alignment, of the point sets in the different structures. However, in practice, some points are usually ‘missing’ from several structures, for example, when the alignment contains gaps. Current superposition methods deal with missing data simply by superpositioning a subset of points that are shared among all the structures. This practice is inefficient, as it ignores important data, and it fails to satisfy the common least-squares criterion. In the extreme, disregarding missing positions prohibits the calculation of a superposition altogether.

Results: Here, we present a general solution for determining an optimal superposition when some of the data are missing. We use the expectation–maximization algorithm, a classic statistical technique for dealing with incomplete data, to find both maximum-likelihood solutions and the optimal least-squares solution as a special case.

Availability and implementation: The methods presented here are implemented in THESEUS 2.0, a program for superpositioning macromolecular structures. ANSI C source code and selected compiled binaries for various computing platforms are freely available under the GNU open source license from http://www.theseus3d.org.

Contact: dtheobald@brandeis.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

From the introduction:

How should we properly compare and contrast the 3D conformations of similar structures? This fundamental problem in structural biology is commonly addressed by performing a superposition, which removes arbitrary differences in translation and rotation so that a set of structures is oriented in a common reference frame (Flower, 1999). For instance, the conventional solution to the superpositioning problem uses the least-squares optimality criterion, which orients the structures in space so as to minimize the sum of the squared distances between all corresponding points in the different structures. Superpositioning problems, also known as Procrustes problems, arise frequently in many scientific fields, including anthropology, archaeology, astronomy, computer vision, economics, evolutionary biology, geology, image analysis, medicine, morphometrics, paleontology, psychology and molecular biology (Dryden and Mardia, 1998; Gower and Dijksterhuis, 2004; Lele and Richtsmeier, 2001). A particular case we consider here is the superpositioning of multiple 3D macromolecular coordinate sets, where the points to be superpositioned correspond to atoms. Although our analysis specifically concerns the conformations of macromolecules, the methods developed herein are generally applicable to any entity that can be represented as a set of Cartesian points in a multidimensional space, whether the particular structures under study are proteins, skulls, MRI scans or geological strata.

We draw an important distinction here between a structural ‘alignment’ and a ‘superposition.’ An alignment is a discrete mapping between the residues of two or more structures. One of the most common ways to represent an alignment is using the familiar row and column matrix format of sequence alignments using the single letter abbreviations for residues (Fig. 1). An alignment may be based on sequence information or on structural information (or on both). A superposition, on the other hand, is a particular orientation of structures in 3D space. [emphasis added]

I have deep reservations about the representation of semantics using Cartesian metrics, but in fact that happens quite frequently. And allegedly, usefully.

Leaving my doubts to one side, this superpositioning technique could prove to be a useful exploration technique.

If you experiment with this technique, a report of your experiences would be appreciated.
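
If you want the conventional least-squares case from the introduction in concrete form, here is a minimal Kabsch-style sketch in Python (complete data, one-to-one correspondence between points). It illustrates the baseline technique only, not the paper’s EM method for missing data.

    import numpy as np

    def superpose(P, Q):
        """Least-squares superposition of point set P onto Q.
        P, Q: (n, 3) arrays, rows in one-to-one correspondence, no missing points."""
        Pc, Qc = P - P.mean(axis=0), Q - Q.mean(axis=0)
        U, _, Vt = np.linalg.svd(Pc.T @ Qc)           # SVD of the covariance matrix
        d = np.sign(np.linalg.det(Vt.T @ U.T))        # guard against reflections
        R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T       # optimal rotation
        P_fit = Pc @ R.T + Q.mean(axis=0)             # rotate, then recenter on Q
        rmsd = np.sqrt(((P_fit - Q) ** 2).sum() / len(P))
        return P_fit, rmsd

    # A rotated and translated copy should superpose with RMSD near zero.
    Q = np.random.randn(10, 3)
    theta = 0.7
    Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
    P = Q @ Rz.T + np.array([1.0, -2.0, 0.5])
    print(superpose(P, Q)[1])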

Software support for SBGN maps: SBGN-ML and LibSBGN

Filed under: Bioinformatics,Biomedical,Graphs,Hypergraphs — Patrick Durusau @ 3:30 pm

Software support for SBGN maps: SBGN-ML and LibSBGN by Martijn P. van Iersel, Alice C. Villéger, Tobias Czauderna, Sarah E. Boyd, Frank T. Bergmann, Augustin Luna, Emek Demir, Anatoly Sorokin, Ugur Dogrusoz, Yukiko Matsuoka, Akira Funahashi, Mirit I. Aladjem, Huaiyu Mi, Stuart L. Moodie, Hiroaki Kitano, Nicolas Le Novère, and Falk Schreiber. (Bioinformatics 2012 28: 2016-2021.)

Warning: Unless you really like mapping and markup languages this is likely to be a boring story. If you do (and I do), it is the sort of thing you will print out and enjoy reading. Just so you know.

Abstract:

Motivation: LibSBGN is a software library for reading, writing and manipulating Systems Biology Graphical Notation (SBGN) maps stored using the recently developed SBGN-ML file format. The library (available in C++ and Java) makes it easy for developers to add SBGN support to their tools, whereas the file format facilitates the exchange of maps between compatible software applications. The library also supports validation of maps, which simplifies the task of ensuring compliance with the detailed SBGN specifications. With this effort we hope to increase the adoption of SBGN in bioinformatics tools, ultimately enabling more researchers to visualize biological knowledge in a precise and unambiguous manner.

Availability and implementation: Milestone 2 was released in December 2011. Source code, example files and binaries are freely available under the terms of either the LGPL v2.1+ or Apache v2.0 open source licenses from http://libsbgn.sourceforge.net.

Contact: sbgn-libsbgn@lists.sourceforge.net

I included the hyperlinks to standards and software for the introduction but not the article references. Those are of interest too but for the moment I only want to entice you to read the article in full. There is a lot of graph work going on in bioinformatics and we would all do well to be more aware of it.

The Systems Biology Graphical Notation (SBGN, Le Novère et al., 2009) facilitates the representation and exchange of complex biological knowledge in a concise and unambiguous manner: as standardized pathway maps. It has been developed and supported by a vibrant community of biologists, biochemists, software developers, bioinformaticians and pathway databases experts.

SBGN is described in detail in the online specifications (see http://sbgn.org/Documents/Specifications). Here we summarize its concepts only briefly. SBGN defines three orthogonal visual languages: Process Description (PD), Entity Relationship (ER) and Activity Flow (AF). SBGN maps must follow the visual vocabulary, syntax and layout rules of one of these languages. The choice of language depends on the type of pathway or process being depicted and the amount of available information. The PD language, which originates from Kitano’s Process Diagrams (Kitano et al., 2005) and the related CellDesigner tool (Funahashi et al., 2008), is equivalent to a bipartite graph (with a few exceptions) with one type of nodes representing pools of biological entities, and a second type of nodes representing biological processes such as biochemical reactions, transport, binding and degradation. Arcs represent consumption, production or control, and can only connect nodes of differing types. The PD language is very suitable for metabolic pathways, but struggles to concisely depict the combinatorial complexity of certain proteins with many phosphorylation states. The ER language, on the other hand, is inspired by Kohn’s Molecular Interaction Maps (Kohn et al., 2006), and describes relations between biomolecules. In ER, two entities can be linked with an interaction arc. The outcome of an interaction (for example, a protein complex), is considered an entity in itself, represented by a black dot, which can engage in further interactions. Thus ER represents dependencies between interactions, or putting it differently, it can represent which interaction is necessary for another one to take place. Interactions are possible between two or more entities, which make ER maps roughly equivalent to a hypergraph in which an arc can connect more than two nodes. ER is more concise than PD when it comes to representing protein modifications and protein interactions, although it is less capable when it comes to presenting biochemical reactions. Finally, the third language in the SBGN family is AF, which represents the activities of biomolecules at a higher conceptual level. AF is suitable to represent the flow of causality between biomolecules even when detailed knowledge on biological processes is missing.

Efficient integration of the SBGN standard into the research cycle requires adoption by visualization and modeling software. Encouragingly, a growing number of pathway tools (see http://sbgn.org/SBGN_Software) offer some form of SBGN compatibility. However, current software implementations of SBGN are often incomplete and sometimes incorrect. This is not surprising: as SBGN covers a broad spectrum of biological phenomena, complete and accurate implementation of the full SBGN specifications represents a complex, error-prone and time-consuming task for individual tool developers. This development step could be simplified, and redundant implementation efforts avoided, by accurately translating the full SBGN specifications into a single software library, available freely for any tool developer to reuse in their own project. Moreover, the maps produced by any given tool usually cannot be reused in another tool, because SBGN only defines how biological information should be visualized, but not how the maps should be stored electronically. Related community standards for exchanging pathway knowledge, namely BioPAX (Demir et al., 2010) and SBML (Hucka et al., 2003), have proved insufficient for this role (more on this topic in Section 4). Therefore, we observed a second need, for a dedicated, standardized SBGN file format.

Following these observations, we started a community effort with two goals: to encourage the adoption of SBGN by facilitating its implementation in pathway tools, and to increase interoperability between SBGN-compatible software. This has resulted in a file format called SBGN-ML and a software library called LibSBGN. Each of these two components will be explained separately in the next sections.

Of course, there is always the data prior to this markup and the data that comes afterwards, so you could say I see a role for topic maps. 😉
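
To make the PD-as-bipartite-graph point above concrete, here is a toy Python sketch enforcing the rule that arcs may only connect nodes of differing types. The node and arc names are illustrative; this is not SBGN-ML syntax.

    class PDMap:
        """Toy Process Description map: entity pool nodes, process nodes, typed arcs."""
        def __init__(self):
            self.kind = {}       # node name -> "entity" or "process"
            self.arcs = []       # (source, target, role)

        def add_node(self, name, kind):
            assert kind in ("entity", "process")
            self.kind[name] = kind

        def add_arc(self, src, dst, role):
            # bipartite rule: consumption/production/control arcs connect differing types
            if self.kind[src] == self.kind[dst]:
                raise ValueError(f"arc {src} -> {dst} links two {self.kind[src]} nodes")
            self.arcs.append((src, dst, role))

    m = PDMap()
    m.add_node("glucose", "entity")
    m.add_node("hexokinase reaction", "process")
    m.add_node("glucose-6-phosphate", "entity")
    m.add_arc("glucose", "hexokinase reaction", "consumption")
    m.add_arc("hexokinase reaction", "glucose-6-phosphate", "production")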

July 19, 2012

Biological Dark Matter [Intellectual Dark Matter?]

Filed under: Bioinformatics,Data Mining — Patrick Durusau @ 6:05 am

Biological Dark Matter

Nathan Wolfe answers a child’s question of “what is left to explore?” with an exposition on how little we know about the most abundant life form of all, the virus.

Opportunities abound for data mining and mapping the results of data mining on viruses.

Protection against the next pandemic is vitally important but I would have answered differently.

In addition to viruses, consider data structures, graph algorithms, materials science, digital chip design, programming languages and astronomy, just to name a few areas where substantial progress has been made and more is anticipated.

Those just happen to be areas of interest to me. I am sure you could create even longer lists of areas of interest to you where substantial progress has been made.

We need to convey a sense of excitement and discovery in all areas of the sciences and humanities.

Perhaps we should call it: Intellectual Dark Matter? (another name for the unknown?)

July 17, 2012

Memory Efficient De Bruijn Graph Construction [Attn: Graph Builders, Chess Anyone?]

Filed under: Bioinformatics,Genome,Graphs,Networks — Patrick Durusau @ 10:44 am

Memory Efficient De Bruijn Graph Construction by Yang Li, Pegah Kamousi, Fangqiu Han, Shengqi Yang, Xifeng Yan, and Subhash Suri.

Abstract:

Massively parallel DNA sequencing technologies are revolutionizing genomics research. Billions of short reads generated at low costs can be assembled for reconstructing the whole genomes. Unfortunately, the large memory footprint of the existing de novo assembly algorithms makes it challenging to get the assembly done for higher eukaryotes like mammals. In this work, we investigate the memory issue of constructing de Bruijn graph, a core task in leading assembly algorithms, which often consumes several hundreds of gigabytes memory for large genomes. We propose a disk-based partition method, called Minimum Substring Partitioning (MSP), to complete the task using less than 10 gigabytes memory, without runtime slowdown. MSP breaks the short reads into multiple small disjoint partitions so that each partition can be loaded into memory, processed individually and later merged with others to form a de Bruijn graph. By leveraging the overlaps among the k-mers (substring of length k), MSP achieves astonishing compression ratio: The total size of partitions is reduced from $\Theta(kn)$ to $\Theta(n)$, where $n$ is the size of the short read database, and $k$ is the length of a $k$-mer. Experimental results show that our method can build de Bruijn graphs using a commodity computer for any large-volume sequence dataset.

A discovery in one area of data processing can have a large impact in a number of others. I suspect that will be the case with the technique described here.

The use of substrings for compression and to determine the creation of partitions was particularly clever.
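
Here is a minimal sketch of that idea in Python: route each k-mer to a partition keyed by its lexicographically smallest p-length substring, so overlapping k-mers usually land in the same partition and can be merged into longer strings there. The parameter names (k, p) follow the abstract; everything else is my illustration, not the authors’ code.

    def minimum_substring(kmer, p):
        """Lexicographically smallest substring of length p."""
        return min(kmer[i:i + p] for i in range(len(kmer) - p + 1))

    def partition_reads(reads, k, p):
        """Group the k-mers of each read into partitions by minimum substring."""
        partitions = {}
        for read in reads:
            for i in range(len(read) - k + 1):
                kmer = read[i:i + k]
                partitions.setdefault(minimum_substring(kmer, p), []).append(kmer)
        return partitions

    # Adjacent, overlapping k-mers tend to share a minimum substring.
    for key, kmers in sorted(partition_reads(["ACGTACGTACGT"], k=5, p=3).items()):
        print(key, kmers)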

Software and data sets

Questions:

  1. What are the substring characteristics of your data?
  2. How would you use a De Bruijn graph with your data?

If you don’t know the answers to those questions, you might want to find out.

Additional Resources:

De Bruijn Graph (Wikipedia)

De Bruijn Sequence (Wikipedia)

How to apply de Bruijn graphs to genome assembly by Phillip E C Compeau, Pavel A Pevzner, and Glenn Tesler. Nature Biotechnology 29, 987–991 (2011) doi:10.1038/nbt.2023

And De Bruijn graphs/sequences are not just for bioinformatics: from the Chess Programming Wiki: De Bruijn Sequences. (Lots of pointers and additional references.)

July 15, 2012

The Ontology for Biomedical Investigations (OBI)

Filed under: Bioinformatics,Biomedical,Medical Informatics,Ontology — Patrick Durusau @ 9:40 am

The Ontology for Biomedical Investigations (OBI)

From the webpage:

The Ontology for Biomedical Investigations (OBI) project is developing an integrated ontology for the description of biological and clinical investigations. This includes a set of ‘universal’ terms that are applicable across various biological and technological domains, and domain-specific terms relevant only to a given domain. This ontology will support the consistent annotation of biomedical investigations, regardless of the particular field of study. The ontology will represent the design of an investigation, the protocols and instrumentation used, the material used, the data generated and the type of analysis performed on it. Currently OBI is being built under the Basic Formal Ontology (BFO).

  • Develop an Ontology for Biomedical Investigations in collaboration with groups representing different biological and technological domains involved in Biomedical Investigations
  • Make OBI compatible with other bio-ontologies
  • Develop OBI using an open source approach
  • Create a valuable resource for the biomedical communities to provide a source of terms for consistent annotation of investigations

An ontology that will be of interest if you are integrating biomedical materials.

At least as a starting point.

My listing of ontologies, vocabularies, etc., is woefully incomplete for any field and represents at best a starting point for your own, more comprehensive investigations. If you do find these starting points useful, please send pointers to your more complete investigations for any field.

Functional Genomics Data Society – FGED

Filed under: Bioinformatics,Biomedical,Functional Genomics — Patrick Durusau @ 9:29 am

Functional Genomics Data Society – FGED

While searching out the MAGE-TAB standard, I found:

The Functional Genomics Data Society – FGED Society, founded in 1999 as the MGED Society, advocates for open access to genomic data sets and works towards providing concrete solutions to achieve this. Our goal is to assure that investment in functional genomics data generates the maximum public benefit. Our work on defining minimum information specifications for reporting data in functional genomics papers have already enabled large data sets to be used and reused to their greater potential in biological and medical research.

We work with other organisations to develop standards for biological research data quality, annotation and exchange. We facilitate the creation and use of software tools that build on these standards and allow researchers to annotate and share their data easily. We promote scientific discovery that is driven by genome wide and other biological research data integration and meta-analysis.

Home of the MAGE-TAB standard, along with links to other resources and collaborations.

ISA-TAB

Filed under: Bioinformatics,Biomedical,Genome — Patrick Durusau @ 9:06 am

ISA-TAB format page at SourceForge.

Where you will find:

ISA-TAB 1.0 – Candidate release (PDF file)

Example ISA-TAB files.

ISAValidator

Abstract from ISA-TAB 1.0:

This document describes ISA-TAB, a general purpose framework with which to capture and communicate the complex metadata required to interpret experiments employing combinations of technologies, and the associated data files. Sections 1 to 3 introduce the ISA-TAB proposal, describe the rationale behind its development, provide an overview of its structure and relate it to other formats. Section 4 describes the specification in detail; section 5 provides examples of design patterns.

ISA-TAB builds on the existing paradigm that is MAGE-TAB – a tab-delimited format to exchange microarray data. ISA-TAB necessarily maintains backward compatibility with existing MAGE-TAB files to facilitate adoption; conserving the simplicity of MAGE-TAB for simple experimental designs, while incorporating new features to capture the full complexity of experiments employing a combination of technologies. Like MAGE-TAB before it, ISA-TAB is simply a format; the decision on how to regulate its use (i.e. enforcing completion of mandatory fields or use of a controlled terminology) is a matter for those communities, which will implement the format in their systems and for which submission and exchange of minimal information is critical. In this case, an additional layer of constraints should be agreed and required on top of the ISA-TAB specification.

Knowledge of the MAGE-TAB format is required, on which see: MAGE-TAB.

As terminologies/vocabularies/ontologies evolve, ISA-TAB formatted files are a good example of targets for topic maps.

Researchers can continue their use of ISA-TAB formatted files undisturbed by changes in terminology, vocabulary or even ontology due to the semantic navigation layer provided by topic maps.

Or perhaps more correctly, one researcher or librarian can create a mapping of such changes that benefit all the other members of their lab.

GigaScience

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 8:32 am

GigaScience

From the description:

GigaScience is a new integrated database and journal co-published in collaboration between BGI Shenzhen and BioMed Central, to meet the needs of a new generation of biological and biomedical research as it enters the era of “big-data.” BGI (formerly known as Beijing Genomics Institute) was founded in 1999 and has since become the largest genomic organization in the world and has a proven track record of innovative, high profile research.

To achieve its goals, GigaScience has developed a novel publishing format that integrates manuscript publication with a database that will provide DOI assignment to every dataset. Supporting the open-data movement, we require that all supporting data and source code be publically available in a suitable public repository and/or under a public domain CC0 license in the BGI GigaScience database. Using the BGI cloud as a test environment, we also consider open-source software tools/methods for the analysis or handling of large-scale data. When submitting a manuscript, please contact us if you have datasets or cloud applications you would like us to host. To maximize data usability submitters are encouraged to follow best practice for metadata reporting and are given the opportunity to submit in ISA-Tab format.

A new journal to watch. One of the early articles is accompanied by an 83 GB data file.

Doing a separate post on the ISA-Tab format.

While I write that, imagine a format that carries with it known subject mappings into the literature? Or references to subject mappings into the literature?

July 14, 2012

Journal of Proteomics & Bioinformatics

Filed under: Bioinformatics,Proteomics — Patrick Durusau @ 12:44 pm

Journal of Proteomics & Bioinformatics

From Aims and Scope:

Journal of Proteomics & Bioinformatics (JPB), a broad-based journal was founded on two key tenets: To publish the most exciting researches with respect to the subjects of Proteomics & Bioinformatics. Secondly, to provide a rapid turn-around time possible for reviewing and publishing, and to disseminate the articles freely for research, teaching and reference purposes. [The rest was boilerplate about open access so I didn’t bother repeating it.]

Another open access journal from Omics Publishing Group but this one has a publication history back to 2008.

Will look through the archive for material of interest.

Journal of Data Mining in Genomics and Proteomics

Filed under: Bioinformatics,Biomedical,Data Mining,Genome,Medical Informatics,Proteomics — Patrick Durusau @ 12:20 pm

Journal of Data Mining in Genomics and Proteomics

From the Aims and Scope page:

Journal of Data Mining in Genomics & Proteomics (JDMGP), a broad-based journal was founded on two key tenets: To publish the most exciting researches with respect to the subjects of Proteomics & Genomics. Secondly, to provide a rapid turn-around time possible for reviewing and publishing, and to disseminate the articles freely for research, teaching and reference purposes.

In today’s wired world information is available at the click of a button, courtesy of the Internet. JDMGP-Open Access gives a worldwide audience larger than that of any subscription-based journal in the OMICS field, no matter how prestigious or popular, and probably increases the visibility and impact of published work. JDMGP-Open Access gives barrier-free access to the literature for research. It increases convenience, reach, and retrieval power. Free online literature is available for software that facilitates full-text searching, indexing, mining, summarizing, translating, querying, linking, recommending, alerting, “mash-ups” and other forms of processing and analysis. JDMGP-Open Access puts rich and poor on an equal footing for these key resources and eliminates the need for permissions to reproduce and distribute content.

A publication (among many) from the OMICS Publishing Group, which sponsors a large number of online publications.

Has the potential to be an interesting source of information. Not much in the way of back files but then it is a very young journal.

July 13, 2012

Parsing the Newick format in C using flex and bison

Filed under: Bioinformatics,Graphs,Trees — Patrick Durusau @ 5:47 am

Parsing the Newick format in C using flex and bison by Pierre Lindenbaum.

From the post:

The following post is my answer for this question on biostar “Newick 2 Json converter“.

The Newick tree format is a simple format used to write out trees (using parentheses and commas) in a text file.

The original question asked for a parser based on perl but here, I’ve implemented a C parser using flex/bison.

If that doesn’t grab your interest, consider the following from the Wikipedia article cited by Pierre on the Newick tree format:

In mathematics, Newick tree format (or Newick notation or New Hampshire tree format) is a way of representing graph-theoretical trees with edge lengths using parentheses and commas. It was adopted by James Archie, William H. E. Day, Joseph Felsenstein, Wayne Maddison, Christopher Meacham, F. James Rohlf, and David Swofford, at two meetings in 1986, the second of which was at Newick’s restaurant in Dover, New Hampshire, US. The adopted format is a generalization of the format developed by Meacham in 1984 for the first tree-drawing programs in Felsenstein’s PHYLIP package.[1]

Of interest both for conversion and for the representation of graph-theoretical trees. About the same time as GML and other efforts on trees.
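
As a companion to Pierre’s flex/bison parser, here is a minimal recursive-descent sketch of the same grammar in Python, producing a JSON-ready dict. It covers the common case (labels and branch lengths) and ignores quoted labels and comments.

    def parse_newick(s):
        """Parse a Newick string like '(A:0.1,(B:0.2,C:0.3):0.4);' into a dict."""
        pos = 0

        def node():
            nonlocal pos
            children = []
            if s[pos] == "(":                    # internal node: parse children
                pos += 1
                children.append(node())
                while s[pos] == ",":
                    pos += 1
                    children.append(node())
                assert s[pos] == ")"
                pos += 1
            start = pos                          # label (possibly empty)
            while pos < len(s) and s[pos] not in ",():;":
                pos += 1
            name = s[start:pos]
            length = None
            if pos < len(s) and s[pos] == ":":   # optional branch length
                pos += 1
                start = pos
                while pos < len(s) and s[pos] not in ",();":
                    pos += 1
                length = float(s[start:pos])
            return {"name": name, "length": length, "children": children}

        tree = node()
        assert s[pos] == ";"
        return tree

    print(parse_newick("(A:0.1,(B:0.2,C:0.3):0.4);"))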

In case you are in Dover, Newick’s survives to this day. I don’t know if they are aware of the reason for their fame but you could mention it.

July 11, 2012

Compressive Genomics [Compression as Merging]

Filed under: Bioinformatics,Compression,Genome,Merging,Scalability — Patrick Durusau @ 2:27 pm

Compressive genomics by Po-Ru Loh, Michael Baym, and Bonnie Berger (Nature Biotechnology 30, 627–630 (2012) doi:10.1038/nbt.2241)

From the introduction:

In the past two decades, genomic sequencing capabilities have increased exponentially[cites omitted] outstripping advances in computing power[cites omitted]. Extracting new insights from the data sets currently being generated will require not only faster computers, but also smarter algorithms. However, most genomes currently sequenced are highly similar to ones already collected[cite omitted]; thus, the amount of new sequence information is growing much more slowly.

Here we show that this redundancy can be exploited by compressing sequence data in such a way as to allow direct computation on the compressed data using methods we term ‘compressive’ algorithms. This approach reduces the task of computing on many similar genomes to only slightly more than that of operating on just one. Moreover, its relative advantage over existing algorithms will grow with the accumulation of genomic data. We demonstrate this approach by implementing compressive versions of both the Basic Local Alignment Search Tool (BLAST)[cite omitted] and the BLAST-Like Alignment Tool (BLAT)[cite omitted], and we emphasize how compressive genomics will enable biologists to keep pace with current data.

Software available at: Compression-accelerated BLAST and BLAT.

A new line of attack on searching “big data.”

Making “big data” into “smaller data” and enabling analysis of it while still “smaller data.”

Enabling the searching of highly similar genomes by compression is a form of merging, isn’t it? That is, a sequence (read: subject) that occurs multiple times across similar genomes is given a single representative, while preserving its relationship to all the individual genome instances.
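
A toy illustration of that reading: one representative entry per repeated subsequence, preserving links back to every genome and offset where it occurs. This is my own sketch, not the data structure used by compressive BLAST/BLAT.

    from collections import defaultdict

    def merged_representatives(genomes, k):
        """Map each length-k subsequence to every (genome, offset) where it occurs,
        keeping only those occurring more than once: the 'merged' subjects."""
        occurrences = defaultdict(list)
        for name, seq in genomes.items():
            for i in range(len(seq) - k + 1):
                occurrences[seq[i:i + k]].append((name, i))
        return {s: locs for s, locs in occurrences.items() if len(locs) > 1}

    genomes = {"genome1": "ACGTACGTTTGACGTACGT",
               "genome2": "ACGTACGTAACCACGTACGT"}
    for rep, locs in merged_representatives(genomes, k=8).items():
        print(rep, "->", locs)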

What makes merging computationally tractable here, when topic map systems, at least some of them, are reported to have scalability issues? (See Scalability of Topic Map Systems by Marcel Hoyer.)

What other examples of computationally tractable merging would you suggest, including different merging approaches/algorithms? It might make a useful paper/study to work from scalable merging examples towards less scalable ones, perhaps to discover which choices have an impact on scalability.

July 8, 2012

MicrobeDB: a locally maintainable database of microbial genomic sequences

Filed under: Bioinformatics,Biomedical,Database,Genome,MySQL — Patrick Durusau @ 3:54 pm

MicrobeDB: a locally maintainable database of microbial genomic sequences by Morgan G. I. Langille, Matthew R. Laird, William W. L. Hsiao, Terry A. Chiu, Jonathan A. Eisen, and Fiona S. L. Brinkman. (Bioinformatics (2012) 28 (14): 1947-1948. doi: 10.1093/bioinformatics/bts273)

Abstract

Summary: Analysis of microbial genomes often requires the general organization and comparison of tens to thousands of genomes both from public repositories and unpublished sources. MicrobeDB provides a foundation for such projects by the automation of downloading published, completed bacterial and archaeal genomes from key sources, parsing annotations of all genomes (both public and private) into a local database, and allowing interaction with the database through an easy to use programming interface. MicrobeDB creates a simple to use, easy to maintain, centralized local resource for various large-scale comparative genomic analyses and a back-end for future microbial application design.

Availability: MicrobeDB is freely available under the GNU-GPL at: http://github.com/mlangill/microbedb/

No doubt a useful project but the article seems to be at war with itself:

Although many of these centers provide genomic data in a variety of static formats such as Genbank and Fasta, these are often inadequate for complex queries. To carry out these analyses efficiently, a relational database such as MySQL (http://mysql.com) can be used to allow rapid querying across many genomes at once. Some existing data providers such as CMR allow downloading of their database files directly, but these databases are designed for large web-based infrastructures and contain numerous tables that demand a steep learning curve. Also, addition of unpublished genomes to these databases is often not supported. A well known and widely used system is the Generic Model Organism Database (GMOD) project (http://gmod.org). GMOD is an open-source project that provides a common platform for building model organism databases such as FlyBase (McQuilton et al., 2011) and WormBase (Yook et al., 2011). GMOD supports a variety of options such as GBrowse (Stein et al., 2002) and a variety of database choices including Chado (Mungall and Emmert, 2007) and BioSQL (http://biosql.org). GMOD provides a comprehensive system, but for many researchers such a complex system is not needed.

On one hand, current solutions are “…often inadequate for complex queries” and just a few lines later, “…such a complex system is not needed.”

I have no doubt that using unfamiliar and complex table structures is a burden on any user. Not to mention lacking the ability to add “unpublished genomes” or to fix versions of data for analysis.

What concerns me is the “solution” being seen as yet another set of “local” options, which impedes the future use of the now “localized” data.

The issues raised here need to be addressed, but one-off solutions seem like a particularly poor choice.

July 7, 2012

Genome-scale analysis of interaction dynamics reveals organization of biological networks

Filed under: Bioinformatics,Biomedical,Genome,Graphs,Networks — Patrick Durusau @ 5:25 am

Genome-scale analysis of interaction dynamics reveals organization of biological networks by Jishnu Das, Jaaved Mohammed, and Haiyuan Yu. (Bioinformatics (2012) 28 (14): 1873-1878. doi: 10.1093/bioinformatics/bts283)

Summary:

Analyzing large-scale interaction networks has generated numerous insights in systems biology. However, such studies have primarily been focused on highly co-expressed, stable interactions. Most transient interactions that carry out equally important functions, especially in signal transduction pathways, are yet to be elucidated and are often wrongly discarded as false positives. Here, we revisit a previously described Smith–Waterman-like dynamic programming algorithm and use it to distinguish stable and transient interactions on a genomic scale in human and yeast. We find that in biological networks, transient interactions are key links topologically connecting tightly regulated functional modules formed by stable interactions and are essential to maintaining the integrity of cellular networks. We also perform a systematic analysis of interaction dynamics across different technologies and find that high-throughput yeast two-hybrid is the only available technology for detecting transient interactions on a large scale.

Research of obvious importance to anyone investigating biological networks, but I mention it for the problem it raises: how do we represent transient relationships/interactions in a network?

Assuming a graph/network topology, how does a transient relationship impact a path traversal?

Assuming a graph/network topology, do we ignore the transience for graph theoretical properties such as shortest path?

Do we need graph theoretical queries versus biological network queries? Are the results always the same?

Can transient relationships result in transient properties? How do we record those?

Better yet, how do we ignore transient properties and under what conditions? (Leaving to one side how we would formally/computationally accomplish that ignorance.) What are the theoretical issues?
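
One concrete way to experiment with these questions: tag edges as stable or transient and compare traversals with and without the transient links. A minimal sketch with networkx, on a made-up network:

    import networkx as nx

    G = nx.Graph()
    G.add_edge("A", "B", transient=False)
    G.add_edge("B", "C", transient=True)     # transient link bridging modules
    G.add_edge("B", "D", transient=False)
    G.add_edge("D", "C", transient=False)

    stable = G.edge_subgraph((u, v) for u, v, d in G.edges(data=True)
                             if not d["transient"])

    print(nx.shortest_path(G, "A", "C"))       # may traverse transient edges
    print(nx.shortest_path(stable, "A", "C"))  # stable edges only: a longer path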

You can find the full text of this article at Professor Yu’s site: http://yulab.icmb.cornell.edu/PDF/Das_B2012.pdf

July 6, 2012

Tutorial on biological networks [The Heterogeneity of Nature]

Filed under: Bioinformatics,Biomedical,Graphs,Heterogeneous Data,Networks — Patrick Durusau @ 3:54 pm

Tutorial on biological networks by Francisco G. Vital-Lopez, Vesna Memišević, and Bhaskar Dutta. (WIREs Data Mining Knowl Discov, 2: 298-325. doi: 10.1002/widm.1061)

Abstract:

Understanding how the functioning of a biological system emerges from the interactions among its components is a long-standing goal of network science. Fomented by developments in high-throughput technologies to characterize biomolecules and their interactions, network science has emerged as one of the fastest growing areas in computational and systems biology research. Although the number of research and review articles on different aspects of network science is increasing, updated resources that provide a broad, yet concise, review of this area in the context of systems biology are few. The objective of this article is to provide an overview of the research on biological networks to a general audience, who have some knowledge of biology and statistics, but are not necessarily familiar with this research field. Based on the different aspects of network science research, the article is broadly divided into four sections: (1) network construction, (2) topological analysis, (3) network and data integration, and (4) visualization tools. We specifically focused on the most widely studied types of biological networks, which are, metabolic, gene regulatory, protein–protein interaction, genetic interaction, and signaling networks. In future, with further developments on experimental and computational methods, we expect that the analysis of biological networks will assume a leading role in basic and translational research.

A survey article like this is a frozen artifact in time, so I would suggest reading it before it is too badly out of date. It will be sad to see it ravaged by time and pitted by later research that renders entire sections obsolete. Or of interest only to medical literature spelunkers of some future time.

Developers of homogeneous and “correct” models of biological networks should take warning from the closing lines of this survey article:

Currently different types of networks, such as PPI, GRN, or metabolic networks are analyzed separately. These heterogeneous networks have to be integrated systematically to generate comprehensive network, which creates a realistic representation of biological systems.[cite omitted] The integrated networks have to be combined with different types of molecular profiling data that measures different facades of the biological system. A recent multi institutional collaborative project, named The Cancer Genome Atlas,[cite omitted] has already started generating much multi-‘omics’ data for large cancer patient cohorts. Thus, we can expect to witness an exciting and fast paced growth on biological network research in the coming years.

Interesting.

Nature uses heterogeneous networks, with great success.

We can keep building homogeneous networks or we can start building heterogeneous networks (at least to the extent we are capable).

What do you think?

July 5, 2012

Mosaic: making biological sense of complex networks

Filed under: Bioinformatics,Gene Ontology,Genome,Graphs,Networks — Patrick Durusau @ 12:14 pm

Mosaic: making biological sense of complex networks by Chao Zhang, Kristina Hanspers, Allan Kuchinsky, Nathan Salomonis, Dong Xu, and Alexander R. Pico. (Bioinformatics (2012) 28 (14): 1943-1944. doi: 10.1093/bioinformatics/bts278)

Abstract:

We present a Cytoscape plugin called Mosaic to support interactive network annotation, partitioning, layout and coloring based on gene ontology or other relevant annotations.

From the Introduction:

The increasing throughput and quality of molecular measurements in the domains of genomics, proteomics and metabolomics continue to fuel the understanding of biological processes. Collected per molecule, the scope of these data extends to physical, genetic and biochemical interactions that in turn comprise extensive networks. There are software tools available to visualize and analyze data-derived biological networks (Smoot et al., 2011). One challenge faced by these tools is how to make sense of such networks often represented as massive ‘hairballs’. Many network analysis algorithms filter or partition networks based on topological features, optionally weighted by orthogonal node or edge data (Bader and Hogue, 2003; Royer et al., 2008). Another approach is to mathematically model networks and rely on their statistical properties to make associations with other networks, phenotypes and drug effects, sidestepping the issue of making sense of the network itself altogether (Machado et al., 2011). Acknowledging that there is still great value in engaging the minds of researchers in exploratory data analysis at the level of networks (Kelder et al., 2010), we have produced a Cytoscape plugin called Mosaic to support interactive network annotation and visualization that includes partitioning, layout and coloring based on biologically relevant ontologies (Fig. 1). Mosaic shows slices of a given network in the visual language of biological pathways, which are familiar to any biologist and are ideal frameworks for integrating knowledge.

[Fig. 1 omitted]

Cytoscape is a free and open source network visualization platform that actively supports independent plugin development (Smoot et al., 2011). For annotation, Mosaic relies primarily on the full gene ontology (GO) or simplified ‘slim’ versions (http://www.geneontology.org/GO.slims.shtml). The cellular layout of partitioned subnetworks strictly depends on the cellular component branch of GO, but the other two functions, partitioning and coloring, can be driven by any annotation associated with a major gene or protein identifier system.

You will need the prerequisites listed on the Mosaic project page.

The Mosaic page offers additional documentation, which will take a while to process. I am particularly interested in annotations of the network driving partitioning.

June 29, 2012

MuteinDB

Filed under: Bioinformatics,Biomedical,Genome,XML — Patrick Durusau @ 3:16 pm

MuteinDB: the mutein database linking substrates, products and enzymatic reactions directly with genetic variants of enzymes by Andreas Braun, Bettina Halwachs, Martina Geier, Katrin Weinhandl, Michael Guggemos, Jan Marienhagen, Anna J. Ruff, Ulrich Schwaneberg, Vincent Rabin, Daniel E. Torres Pazmiño, Gerhard G. Thallinger, and Anton Glieder.

Abstract:

Mutational events as well as the selection of the optimal variant are essential steps in the evolution of living organisms. The same principle is used in laboratory to extend the natural biodiversity to obtain better catalysts for applications in biomanufacturing or for improved biopharmaceuticals. Furthermore, single mutation in genes of drug-metabolizing enzymes can also result in dramatic changes in pharmacokinetics. These changes are a major cause of patient-specific drug responses and are, therefore, the molecular basis for personalized medicine. MuteinDB systematically links laboratory-generated enzyme variants (muteins) and natural isoforms with their biochemical properties including kinetic data of catalyzed reactions. Detailed information about kinetic characteristics of muteins is available in a systematic way and searchable for known mutations and catalyzed reactions as well as their substrates and known products. MuteinDB is broadly applicable to any known protein and their variants and makes mutagenesis and biochemical data searchable and comparable in a simple and easy-to-use manner. For the import of new mutein data, a simple, standardized, spreadsheet-based data format has been defined. To demonstrate the broad applicability of the MuteinDB, first data sets have been incorporated for selected cytochrome P450 enzymes as well as for nitrilases and peroxidases.

Database URL: http://www.muteindb.org/

Why is this relevant to topic maps or semantic diversity, you ask?

I will let the authors answer:

Information about specific proteins and their muteins are widely spread in the literature. Many studies only describe single mutation and its effects without comparison to already known muteins. Possible additive effects of single amino acid changes are scarcely described or used. Even after a thorough and time-consuming literature search, researchers face the problem of assembling and presenting the data in an easy understandable and comprehensive way. Essential information may be lost such as details about potentially cooperative mutations or reactions one would not expect in certain protein families. Therefore, a web-accessible database combining available knowledge about a specific enzyme and its muteins in a single place are highly desirable. Such a database would allow researchers to access relevant information about their protein of interest in a fast and easy way and accelerate the engineering of new and improved variants. (Third paragraph of the introduction)

I would have never dreamed that gene data would be spread to Hell and back. 😉

The article will give you insight into how gene data is collected, searched, organized, etc. All of which will be valuable to you whether you are designing or using information systems in this area.

I was a bit let down when I read about data formats:

Most of them are XML based, which can be difficult to create and manipulate. Therefore, simpler, spreadsheet-based formats have been introduced which are more accessible for the individual researcher.

I’ve never had any difficulties with XML based formats but will admit that may not be a universal experience. Sounds to me like the XML community should concentrate a bit less on making people write angle-bang syntax and more on long term useful results. (Which I think XML can deliver.)

June 25, 2012

In the red corner – PubMed and in the blue corner – Google Scholar

Filed under: Bioinformatics,Biomedical,PubMed,Search Engines,Searching — Patrick Durusau @ 7:40 pm

Medical literature searches: a comparison of PubMed and Google Scholar by Eva Nourbakhsh, Rebecca Nugent, Helen Wang, Cihan Cevik and Kenneth Nugent. (Health Information & Libraries Journal, Article first published online: 19 JUN 2012)

From the abstract:

Background

Medical literature searches provide critical information for clinicians. However, the best strategy for identifying relevant high-quality literature is unknown.

Objectives

We compared search results using PubMed and Google Scholar on four clinical questions and analysed these results with respect to article relevance and quality.

Methods

Abstracts from the first 20 citations for each search were classified into three relevance categories. We used the weighted kappa statistic to analyse reviewer agreement and nonparametric rank tests to compare the number of citations for each article and the corresponding journals’ impact factors.

Results

Reviewers ranked 67.6% of PubMed articles and 80% of Google Scholar articles as at least possibly relevant (P = 0.116) with high agreement (all kappa P-values < 0.01). Google Scholar articles had a higher median number of citations (34 vs. 1.5, P < 0.0001) and came from higher impact factor journals (5.17 vs. 3.55, P = 0.036).

Conclusions

PubMed searches and Google Scholar searches often identify different articles. In this study, Google Scholar articles were more likely to be classified as relevant, had higher numbers of citations and were published in higher impact factor journals. The identification of frequently cited articles using Google Scholar for searches probably has value for initial literature searches.

I have several concerns that may or may not be allayed by further investigation:

  • Four queries seem like an inadequate basis for evaluation. Not that I expect to see one “winner” and one “loser,” but I am more concerned with what led to the differences in results.
  • It is unclear why a citation from a journal with a higher impact factor is superior to one from a journal with a lesser impact factor. I assume the point of the query is to obtain a useful result (in the sense of medical treatment, not tenure).
  • Neither system enabled users to build upon the query experience of prior users with a similar query.
  • Neither system enabled users to avoid re-reading the same texts as others had read before them.

Thoughts?
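
As an aside on the methods: if the weighted kappa statistic is unfamiliar, here is a minimal Java sketch of linear-weighted kappa for two reviewers rating items into ordered relevance categories. The 3×3 table of counts is hypothetical, not data from the study.

```java
public class WeightedKappa {
    // Linear-weighted kappa for two raters over k ordered categories.
    // counts[i][j] = number of items rater A put in category i and rater B in category j.
    static double weightedKappa(int[][] counts) {
        int k = counts.length;
        double n = 0;
        double[] rowSum = new double[k], colSum = new double[k];
        for (int i = 0; i < k; i++)
            for (int j = 0; j < k; j++) {
                n += counts[i][j];
                rowSum[i] += counts[i][j];
                colSum[j] += counts[i][j];
            }
        double obsDisagree = 0, expDisagree = 0;
        for (int i = 0; i < k; i++)
            for (int j = 0; j < k; j++) {
                // Linear disagreement weight: 0 on the diagonal, 1 at maximum distance.
                double w = Math.abs(i - j) / (double) (k - 1);
                obsDisagree += w * counts[i][j] / n;                 // observed weighted disagreement
                expDisagree += w * (rowSum[i] / n) * (colSum[j] / n); // chance-expected disagreement
            }
        return 1.0 - obsDisagree / expDisagree;
    }

    public static void main(String[] args) {
        // Hypothetical counts: relevant / possibly relevant / not relevant.
        int[][] counts = { {9, 2, 0}, {1, 4, 2}, {0, 1, 1} };
        System.out.printf("weighted kappa = %.3f%n", weightedKappa(counts));
    }
}
```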

Improving links between literature and biological data with text mining: a case study with GEO, PDB and MEDLINE

Filed under: Bioinformatics,Biomedical,Text Mining — Patrick Durusau @ 7:15 pm

Improving links between literature and biological data with text mining: a case study with GEO, PDB and MEDLINE by Névéol, A., Wilbur, W. J., Lu, Z.

Abstract:

High-throughput experiments and bioinformatics techniques are creating an exploding volume of data that are becoming overwhelming to keep track of for biologists and researchers who need to access, analyze and process existing data. Much of the available data are being deposited in specialized databases, such as the Gene Expression Omnibus (GEO) for microarrays or the Protein Data Bank (PDB) for protein structures and coordinates. Data sets are also being described by their authors in publications archived in literature databases such as MEDLINE and PubMed Central. Currently, the curation of links between biological databases and the literature mainly relies on manual labour, which makes it a time-consuming and daunting task. Herein, we analysed the current state of link curation between GEO, PDB and MEDLINE. We found that the link curation is heterogeneous depending on the sources and databases involved, and that overlap between sources is low, <50% for PDB and GEO. Furthermore, we showed that text-mining tools can automatically provide valuable evidence to help curators broaden the scope of articles and database entries that they review. As a result, we made recommendations to improve the coverage of curated links, as well as the consistency of information available from different databases while maintaining high-quality curation.

Database URLs: MEDLINE http://www.ncbi.nlm.nih.gov/PubMed, GEO http://www.ncbi.nlm.nih.gov/geo/, PDB http://www.rcsb.org/pdb/.

A good illustration of the use of automated means to augment the capacity of curators of data links.

Or topic map authors performing the same task.
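
If you are curious what “valuable evidence” from text mining can look like in its simplest form, here is a toy Java sketch that scans an abstract for GEO- and PDB-style accession patterns. Both regular expressions are deliberate simplifications (real curation pipelines need context and disambiguation), and the example sentence is mine.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AccessionScanner {
    // GEO series accessions look like "GSE12345"; PDB entries are a digit
    // followed by three alphanumerics, e.g. "1TUP". Both patterns are
    // simplifications and will produce false positives on real text.
    private static final Pattern GEO = Pattern.compile("\\bGSE\\d+\\b");
    private static final Pattern PDB = Pattern.compile("\\b[1-9][A-Za-z0-9]{3}\\b");

    public static void main(String[] args) {
        String abstractText = "Expression profiles (GEO accession GSE2034) were compared "
            + "against the p53 core domain structure 1TUP deposited in the PDB.";
        for (Pattern p : new Pattern[] { GEO, PDB }) {
            Matcher m = p.matcher(abstractText);
            while (m.find())
                System.out.println("candidate link: " + m.group());
        }
    }
}
```

Candidate matches like these are exactly the sort of leads a curator can confirm or reject far faster than reading cold.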

June 22, 2012

Sage Bionetworks and Amazon SWF

Sage Bionetworks and Amazon SWF

From the post:

Over the past couple of decades the medical research community has witnessed a huge increase in the creation of genetic and other biomolecular data on human patients. However, their ability to meaningfully interpret this information and translate it into advances in patient care has been much more modest. The difficulty of accessing, understanding, and reusing data, analysis methods, or disease models across multiple labs with complementary expertise is a major barrier to the effective interpretation of genomic data. Sage Bionetworks is a non-profit biomedical research organization that seeks to revolutionize the way researchers work together by catalyzing a shift to an open, transparent research environment. Such a shift would benefit future patients by accelerating development of disease treatments, and society as a whole by reducing the costs and improving the efficacy of health care.

To drive collaboration among researchers, Sage Bionetworks built an on-line environment, called Synapse. Synapse hosts clinical-genomic datasets and provides researchers with a platform for collaborative analyses. Just like GitHub and SourceForge provide tools and shared code for software engineers, Synapse provides a shared compute space and suite of analysis tools for researchers. Synapse leverages a variety of AWS products to handle basic infrastructure tasks, which has freed the Sage Bionetworks development team to focus on the most scientifically relevant and unique aspects of their application.

Amazon Simple Workflow Service (Amazon SWF) is a key technology leveraged in Synapse. Synapse relies on Amazon SWF to orchestrate complex, heterogeneous scientific workflows. Michael Kellen, Director of Technology for Sage Bionetworks, states, “SWF allowed us to quickly decompose analysis pipelines in an orderly way by separating state transition logic from the actual activities in each step of the pipeline. This allowed software engineers to work on the state transition logic and our scientists to implement the activities, all at the same time. Moreover, by using Amazon SWF, Synapse is able to use a heterogeneous set of computing resources, including our servers hosted in-house, shared infrastructure hosted at our partners’ sites, and public resources, such as Amazon’s Elastic Compute Cloud (Amazon EC2). This gives us immense flexibility in where we run computational jobs, which enables Synapse to leverage the right combination of infrastructure for every project.”

The Sage Bionetworks case study (above) and another one, NASA JPL and Amazon SWF, will get you excited about reaching out to the documentation on Amazon Simple Workflow Service (Amazon SWF).

In ways that presentations consisting of slides read aloud about the management advantages of Amazon SWF simply cannot. At least not for me.

Take the tip and follow the case studies, then onto the documentation.
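
To make Kellen’s point about separating state-transition logic from activities concrete, here is a minimal Java sketch of the pattern itself, not the Amazon SWF API: the pipeline runner knows only step names and ordering, while the scientific steps are plugged in independently. All the names here are my own.

```java
import java.util.List;
import java.util.Map;

public class PipelineSketch {
    // Activities: the scientific steps, written by the scientists.
    interface Activity { String run(String input); }

    // The "decider": pure state-transition logic, written by the engineers.
    // It only knows step names and ordering, never what a step does.
    static String runPipeline(List<String> steps, Map<String, Activity> registry, String input) {
        String data = input;
        for (String step : steps) {
            Activity a = registry.get(step);
            if (a == null) throw new IllegalStateException("no worker registered for: " + step);
            System.out.println("dispatching step: " + step);
            data = a.run(data); // in SWF this dispatch crosses machines via task lists
        }
        return data;
    }

    public static void main(String[] args) {
        Map<String, Activity> registry = Map.of(
            "align",     in -> in + " -> aligned",
            "normalize", in -> in + " -> normalized",
            "annotate",  in -> in + " -> annotated");
        System.out.println(runPipeline(List.of("align", "normalize", "annotate"),
                                       registry, "raw reads"));
    }
}
```

Because the dispatch is just a lookup, the two halves can be developed by different teams at the same time, which is the advantage Kellen describes.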

Full disclosure: I have always been fascinated by space and really hard bioinformatics problems. And I have < 0 interest in DRM antics on material that, if piped to /dev/null, would raise a user’s IQ.

June 21, 2012

Graph DB + Bioinformatics: Bio4j,…

Filed under: Amazon Web Services AWS,Bioinformatics,Neo4j — Patrick Durusau @ 3:04 pm

Graph DB + Bioinformatics: Bio4j, recent applications and future directions by Pablo Pareja.

If you haven’t seen one of Pablo’s slide decks on Bio4j, get ready for a real treat!

Let me quote the numbers from slide 42, which is entitled: “Bio4j + MG7 + 24 Chip-Seq samples”

  • 157,639,502 nodes
  • 742,615,705 relationships
  • 632,832,045 properties
  • 149 relationship types
  • 44 node types

And it works just fine!

Granted, he is not running this on his cellphone, but if you are going to process serious data, you need serious computing power. (OK, he uses Amazon Web Services. Like I said, not his cellphone.)

Did I mention everything done by Oh no sequences! (www.ohnosequences.com) is 100% Open source?

There is much to learn here. Enjoy!
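
If you want to experiment with node and relationship types on your own (much smaller) graph, here is a minimal sketch assuming the embedded Neo4j 3.x Java API is on the classpath. The accession and GO term are real identifiers used purely for illustration; Bio4j’s actual schema is far richer.

```java
import java.io.File;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class TinyBioGraph {
    public static void main(String[] args) {
        GraphDatabaseService db =
            new GraphDatabaseFactory().newEmbeddedDatabase(new File("tiny-graph.db"));
        try (Transaction tx = db.beginTx()) {
            // Two node types and one relationship type, in miniature.
            Node protein = db.createNode(Label.label("Protein"));
            protein.setProperty("accession", "P04637"); // p53, for illustration
            Node goTerm = db.createNode(Label.label("GoTerm"));
            goTerm.setProperty("id", "GO:0006915"); // apoptotic process
            protein.createRelationshipTo(goTerm, RelationshipType.withName("ANNOTATED_WITH"));
            tx.success();
        }
        db.shutdown();
    }
}
```

Scale the node and relationship types up by an order of magnitude or two and you have the shape of what Pablo’s slides report.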

June 12, 2012

Network Medicine: Using Visualization to Decode Complex Diseases

Filed under: Bioinformatics,Biomedical,Genome,Graphs,Networks — Patrick Durusau @ 6:26 pm

Network Medicine: Using Visualization to Decode Complex Diseases

From the post:

Albert-László Barabási is a physicist, but maybe best known for his work in the field of network theory. In his TEDMED talk titled “Network Medicine: A Network Based Approach to Decode Complex Diseases” [tedmed.com], Albert-László applies advanced network theory to the field of biology.

Using a metaphor of Manhattan maps, he explains how an all-encompassing map of the relationships between genes, proteins and metabolites can form the key to truly understanding the mechanisms behind many diseases. He further makes the point that diseases should not be divided up into separate organ-based branches of medicine, but rather treated as a tightly interconnected network.

More information and movies at the post (information aesthetics)

Turns out that relationships (can you say graph/network?) are going to be critical in the treatment of disease. (Not treatment of symptoms, treatment of disease.)

May 30, 2012

How to Stay Current in Bioinformatics/Genomics [Role for Topic Maps as Filters?]

Filed under: Bioinformatics,Filters,Genome — Patrick Durusau @ 3:09 pm

How to Stay Current in Bioinformatics/Genomics by Stephen Turner.

From the post:

A few folks have asked me how I get my news and stay on top of what’s going on in my field, so I thought I’d share my strategy. With so many sources of information begging for your attention, the difficulty is not necessarily finding what’s interesting, but filtering out what isn’t. What you don’t read is just as important as what you do, so when it comes to things like RSS, Twitter, and especially e-mail, it’s essential to filter out sources where the content consistently fails to be relevant or capture your interest. I run a bioinformatics core, so I’m more broadly interested in applied methodology and study design rather than any particular phenotype, model system, disease, or method. With that in mind, here’s how I stay current with things that are relevant to me. Please leave comments with what you’re reading and what you find useful that I omitted here.

Here is a concrete example of the information feeds used to stay current on bioinformatics/genomics.

A topic map mantra has been: “All the information about a subject in one place.”

Should that change to “Current information about subject(s)…,” with topic maps serving as a filtering strategy rather than an aggregation strategy?

I think of filters as “subtractive,” but that is only one view of filtering.

We can have “additive” filters as well.

Take a look at the information feeds Stephen is using.

Would you use topic maps as “additive” or “subtractive” filters?
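
For the code-minded, here is a minimal Java sketch of the two filtering styles over a handful of made-up feed titles:

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class FeedFilters {
    public static void main(String[] args) {
        List<String> titles = List.of(
            "New RNA-seq normalization method",
            "Conference travel grants announced",
            "Hadoop pipeline for variant calling",
            "Department picnic photos");

        // Subtractive: start with everything, drop what you know you don't want.
        Predicate<String> noise = t -> t.contains("picnic") || t.contains("travel");
        List<String> subtractive = titles.stream()
                .filter(noise.negate()).collect(Collectors.toList());

        // Additive: start with nothing, keep only what matches your interests.
        Predicate<String> interests = t -> t.contains("RNA-seq") || t.contains("Hadoop");
        List<String> additive = titles.stream()
                .filter(interests).collect(Collectors.toList());

        System.out.println("subtractive: " + subtractive);
        System.out.println("additive:    " + additive);
    }
}
```

Note the asymmetry: subtractive filtering degrades gracefully (you see too much), while additive filtering fails silently (you never see what your keywords missed). That matters when choosing a strategy.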

May 8, 2012

Downloading the XML data from the Exome Variant Server

Filed under: Bioinformatics,Biomedical,Genome — Patrick Durusau @ 10:44 am

Downloading the XML data from the Exome Variant Server

Pierre Lindenbaum writes:

From EVS: “The goal of the NHLBI GO Exome Sequencing Project (ESP) is to discover novel genes and mechanisms contributing to heart, lung and blood disorders by pioneering the application of next-generation sequencing of the protein coding regions of the human genome across diverse, richly-phenotyped populations and to share these datasets and findings with the scientific community to extend and enrich the diagnosis, management and treatment of heart, lung and blood disorders.”

The NHLBI Exome Sequencing Project provides a download area, but I wanted to build a local database for the richer XML data returned by their Web Services (previously described here on my blog). The following Java program sends XML/SOAP requests to the EVS server for each chromosome, using a genomic window of 150,000 bp, and parses the XML response.

If you are interested in tools that will assist you in populating a genome-centric topic map, Pierre’s blog is an important one to follow.
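
As a hedged sketch of the windowed-request pattern Pierre describes, in Java: the endpoint URL, operation name, and SOAP envelope below are placeholders of my own devising, not the actual EVS service definition, so treat this as shape rather than substance.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class EvsWindowedFetch {
    // Hypothetical endpoint and operation; the real EVS service differs.
    private static final String ENDPOINT = "http://example.org/evs/soap";
    private static final int WINDOW = 150_000;

    static void fetchWindow(String chrom, int start, int end) throws Exception {
        String envelope =
            "<soap:Envelope xmlns:soap=\"http://schemas.xmlsoap.org/soap/envelope/\">"
          + "<soap:Body><getVariants><region>" + chrom + ":" + start + "-" + end
          + "</region></getVariants></soap:Body></soap:Envelope>";
        HttpURLConnection con = (HttpURLConnection) new URL(ENDPOINT).openConnection();
        con.setRequestMethod("POST");
        con.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
        con.setDoOutput(true);
        try (OutputStream out = con.getOutputStream()) {
            out.write(envelope.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println(chrom + ":" + start + "-" + end + " -> HTTP " + con.getResponseCode());
        // A real client would now parse the XML response body and store it locally.
    }

    public static void main(String[] args) throws Exception {
        int chromLength = 1_000_000; // stand-in; use the real chromosome length
        for (int start = 1; start <= chromLength; start += WINDOW) {
            fetchWindow("chr21", start, Math.min(start + WINDOW - 1, chromLength));
        }
    }
}
```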

