Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

September 10, 2012

Reveal—visual eQTL analytics [Statistics of Identity/Association]

Filed under: Bioinformatics,Biomedical,Genome — Patrick Durusau @ 1:36 pm

Reveal—visual eQTL analytics by Günter Jäger, Florian Battke and Kay Nieselt. (Bioinformatics (2012) 28 (18): i542-i548. doi: 10.1093/bioinformatics/bts382)

Abstract

Motivation: The analysis of expression quantitative trait locus (eQTL) data is a challenging scientific endeavor, involving the processing of very large, heterogeneous and complex data. Typical eQTL analyses involve three types of data: sequence-based data reflecting the genotypic variations, gene expression data and meta-data describing the phenotype. Based on these, certain genotypes can be connected with specific phenotypic outcomes to infer causal associations of genetic variation, expression and disease.

To this end, statistical methods are used to find significant associations between single nucleotide polymorphisms (SNPs) or pairs of SNPs and gene expression. A major challenge lies in summarizing the large amount of data as well as statistical results and to generate informative, interactive visualizations.

Results: We present Reveal, our visual analytics approach to this challenge. We introduce a graph-based visualization of associations between SNPs and gene expression and a detailed genotype view relating summarized patient cohort genotypes with data from individual patients and statistical analyses.

Availability: Reveal is included in Mayday, our framework for visual exploration and analysis. It is available at http://it.inf.uni-tuebingen.de/software/reveal/.

Contact: guenter.jaeger@uni-tuebingen.de

Interesting work on a number of fronts, not the least of which is “…analysis of expression quantitative trait locus (eQTL) data.”

Its use of statistical methods to discover “significant associations,” its interactive visualizations, and its processing of “large, heterogeneous and complex data” are of more immediate interest to me.

Wikipedia is evidence for subjects (including relationships) that can be usefully identified using URLs. But that is only a fraction of all the subjects and relationships we may want to include in our topic maps.

An area I need to work up for my next topic map course is probabilistic identification of subjects and their relationships. What statistical techniques are useful for what fields? Or even what subjects within what fields? What are the processing tradeoffs versus certainty of identification?
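
To give a flavor of what I mean, here is a toy sketch of probabilistic subject identification from name evidence alone; the names, the similarity measure, and the threshold are all assumptions for illustration, and real techniques would weigh much richer evidence:

    # Toy sketch: probabilistic subject identification from name evidence.
    # Names, similarity measure, and threshold are illustrative assumptions only.
    from difflib import SequenceMatcher

    def name_similarity(a: str, b: str) -> float:
        """Crude string similarity in [0, 1] between two subject names."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def match_probability(candidate: str, known_names: list[str]) -> float:
        """Treat the best name match as a rough 'probability' of identity."""
        return max(name_similarity(candidate, n) for n in known_names)

    known_subject = ["expression quantitative trait locus", "eQTL"]
    candidates = ["expression QTL", "protein-protein interaction", "eQTLs"]

    THRESHOLD = 0.6  # the processing vs. certainty tradeoff lives here
    for c in candidates:
        p = match_probability(c, known_subject)
        print(f"{c!r}: score={p:.2f}  same subject? {p >= THRESHOLD}")

Different fields would swap in different evidence (sequence features, citation context, co-occurrence statistics), which is exactly the survey question I want the course to answer.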

Suggestions/comments?

September 9, 2012

Reverse engineering of gene regulatory networks from biological data [self-conscious?]

Filed under: Bioinformatics,Biomedical,Networks — Patrick Durusau @ 4:25 pm

Reverse engineering of gene regulatory networks from biological data by Li-Zhi Liu, Fang-Xiang Wu, Wen-Jun Zhang. (Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, Volume 2, Issue 5, pages 365–385, September/October 2012)

Abstract:

Reverse engineering of gene regulatory networks (GRNs) is one of the most challenging tasks in systems biology and bioinformatics. It aims at revealing network topologies and regulation relationships between components from biological data. Owing to the development of biotechnologies, various types of biological data are collected from experiments. With the availability of these data, many methods have been developed to infer GRNs. This paper firstly provides an introduction to the basic biological background and the general idea of GRN inferences. Then, different methods are surveyed from two aspects: models that those methods are based on and inference algorithms that those methods use. The advantages and disadvantages of these models and algorithms are discussed.

As you might expect, heterogeneous data is one topic of interest in this paper:

Models Based on Heterogeneous Data

Besides the dimensionality problem, the data from microarray experiments always contain many noises and measurement errors. Therefore, an accurate network can hardly be obtained due to the limited information in microarray data. With the development of technologies, a large amount of other diverse types of genomic data are collected. Many researchers are motivated to study GRNs by combining these data with microarray data. Because different types of the genomic data reflect different aspects of underlying networks, the inferences of GRNs based on the integration of different types of data are expected to provide more accurate and reliable results than based on microarray data alone. However, effectively integrating heterogeneous data is currently a hot research topic and a nontrivial task because they are generally collected along with much noise and related to each other in a complex way. (emphasis added)

Truth be known, high dimensionality and heterogeneous data are more accurate reflections of the objects of our study.

Conversely, the lower the dimensions of a model or the greater the homogeneity of the data, the less accurate they become.

Are we creating less accurate reflections to allow for the inabilities of our machines?

Will that make our machines less self-conscious about their limitations?

Or will that make us less self-conscious about our machines’ limitations?

September 7, 2012

EU-ADR Web Platform

Filed under: Bioinformatics,Biomedical,Drug Discovery,Medical Informatics — Patrick Durusau @ 10:29 am

EU-ADR Web Platform

I was disappointed not to find the UMLS concepts and related terms mapping used by participants in the EU-ADR project.

I did find these workflows at the EU-ADR Web Platform:

MEDLINE ADR

In the filtering process of well known signals, the aim of the “MEDLINE ADR” workflow is to automate the search of publications related to ADRs corresponding to a given drug/adverse event association. To do so, we defined an approach based on the MeSH thesaurus, using the subheadings «chemically induced» and «adverse effects» with the “Pharmacological Action” knowledge. Using a threshold of ≥3 extracted publications, the automated search method presented a sensitivity of 93% and a specificity of 97% on the true positive and true negative sets (WP 2.2). We then determined a threshold number of extracted publications ≥ 3 to confirm the knowledge of this association in the literature. This approach offers the opportunity to automatically determine if an ADR (association of a drug and an adverse event) has already been described in MEDLINE. However, the causality relationship between the drug and an event may be judged only by an expert reading the full text article and determining if the methodology of this article was correct and if the association is statistically significant.

MEDLINE Co-occurrence

The “MEDLINE Co-occurrence” workflow performs a comprehensive data processing operation, searching the given Drug-Event combination in the PubMed database. Final workflow results include a final score, measuring found drugs relevance regarding the initial Drug-Event pair, as well as pointers to web pages for the discovered drugs.

DailyMed

The “DailyMed” workflow performs a comprehensive data processing operation, searching the given Drug-Event combination in the DailyMed database. Final workflow results include a final score, measuring found drugs relevance regarding the initial Drug-Event pair, as well as pointers to web pages for the discovered drugs.

DrugBank

The “DrugBank” workflow performs a comprehensive data processing operation, searching the given Drug-Event combination in the DrugBank database. Final workflow results include a final score, measuring found drugs relevance regarding the initial Drug-Event pair, as well as pointers to web pages for the discovered drugs.

Substantiation

The “Substantiation” workflow tries to establish a connection between the clinical event and the drug through a gene or protein, by identifying the proteins that are targets of the drug and are also associated with the event. In addition it also considers information about drug metabolites in this process. In such cases it can be argued that the binding of the drug to the protein would lead to the observed event phenotype. Associations between the event and proteins are found by querying our integrated gene-disease association database (Bauer-Mehren, et al., 2010). As this database provides annotations of the gene-disease associations to the articles reporting the association and in case of text-mining derived associations even the exact sentence, the article or sentence can be studied in more detail in order to inspect the supporting evidence for each gene-disease association. It has to be mentioned that our gene-disease association database also contains information about genetic variants or SNPs and their association to diseases or adverse drug events. The methodology for providing information about the binding of a drug (or metabolite) to protein targets is reported in deliverable 4.2, and includes extraction from different databases (annotated chemical libraries) and application of prediction methods based on chemical similarity.

A glimpse of what is state of the art today and a basis for building better tools for tomorrow.
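
The “MEDLINE ADR” workflow above is concrete enough to sketch the core check. A minimal, hedged approximation in Python: count PubMed hits for a drug/adverse event pair using the MeSH subheadings mentioned in the description and apply the ≥3 publication threshold. The query form and the use of NCBI’s esearch endpoint are my assumptions, not the EU-ADR implementation.

    # Sketch only: approximate the MEDLINE ADR filter for a drug/adverse event pair.
    # The exact EU-ADR query construction is not given above; this query form and
    # the use of NCBI E-utilities are assumptions based on the description.
    import json
    import urllib.parse
    import urllib.request

    EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

    def medline_adr_count(drug: str, event: str) -> int:
        """Count PubMed citations linking the drug's adverse effects to the event."""
        term = f'"{drug}/adverse effects"[MeSH] AND "{event}/chemically induced"[MeSH]'
        url = EUTILS + "?" + urllib.parse.urlencode(
            {"db": "pubmed", "term": term, "retmode": "json"})
        with urllib.request.urlopen(url) as resp:
            data = json.load(resp)
        return int(data["esearchresult"]["count"])

    count = medline_adr_count("Rofecoxib", "Myocardial Infarction")
    print("publications:", count, "known ADR?", count >= 3)

As the platform text cautions, a count clears the signal for filtering; causality still needs an expert reading the full text.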

Harmonization of Reported Medical Events in Europe

Filed under: Bioinformatics,Biomedical,Health care,Medical Informatics — Patrick Durusau @ 10:00 am

Harmonization process for the identification of medical events in eight European healthcare databases: the experience from the EU-ADR project by Paul Avillach, et al. (J Am Med Inform Assoc doi:10.1136/amiajnl-2012-000933)

Abstract

Objective Data from electronic healthcare records (EHR) can be used to monitor drug safety, but in order to compare and pool data from different EHR databases, the extraction of potential adverse events must be harmonized. In this paper, we describe the procedure used for harmonizing the extraction from eight European EHR databases of five events of interest deemed to be important in pharmacovigilance: acute myocardial infarction (AMI); acute renal failure (ARF); anaphylactic shock (AS); bullous eruption (BE); and rhabdomyolysis (RHABD).

Design The participating databases comprise general practitioners’ medical records and claims for hospitalization and other healthcare services. Clinical information is collected using four different disease terminologies and free text in two different languages. The Unified Medical Language System was used to identify concepts and corresponding codes in each terminology. A common database model was used to share and pool data and verify the semantic basis of the event extraction queries. Feedback from the database holders was obtained at various stages to refine the extraction queries.

….

Conclusions The iterative harmonization process enabled a more homogeneous identification of events across differently structured databases using different coding based algorithms. This workflow can facilitate transparent and reproducible event extractions and understanding of differences between databases.

Not to be overly critical, but the one thing left out of the abstract was some hint about the “…procedure used for harmonizing the extraction…,” which is what interests me.

The workflow diagram from figure 2 is worth transposing into HTML markup:

  • Event definition
    • Choice of the event
    • Event Definition Form (EDF) containing the medical definition and diagnostic criteria for the event
  • Concepts selection and projection into the terminologies
    • Search for Unified Medical Language System (UMLS) concepts corresponding to the medical definition as reported in the EDF
    • Projection of UMLS concepts into the different terminologies used in the participating databases
    • Publication on the project’s forum of the first list of UMLS concepts and corresponding codes and terms for each terminology
  • Revision of concepts and related terms
    • Feedback from database holders about the list of concepts with corresponding codes and related terms that they have previously used to identify the event of interest
    • Report on literature review on search criteria being used in previous observational studies that explored the event of interest
    • Text mining in database to identify potentially missing codes through the identification of terms associated with the event in databases
    • Conference call for finalizing the list of concepts
    • Search for new UMLS concepts from the proposed terms
    • Final list of UMLS concepts and related codes posted on the forum
  • Translation of concepts and coding algorithms into queries
    • Queries in each database were built using:
      1. the common data model;
      2. the concept projection into different terminologies; and
      3. the chosen algorithms for event definition
    • Query Analysis
      • Database holders extract data on the event of interest using codes and free text from pre-defined concepts and with database-specific refinement strategies
      • Database holders calculate incidence rates and comparisons are made among databases
      • Database holders compare search queries via the forum

At least for non-members, the EU-ADR website does not appear to offer access to the UMLS concepts and related codes mapping. That mapping could be used to increase accessibility to any database using those codes.
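
If that mapping (or a local UMLS installation) were available, the “projection of UMLS concepts into the different terminologies” step above is straightforward to sketch. A minimal sketch, assuming the standard pipe-delimited MRCONSO.RRF file from the UMLS Metathesaurus; the column positions, file name, and example CUI are assumptions on my part:

    # Sketch: project a UMLS concept (CUI) into the terminologies used by the
    # participating databases. Assumes a local UMLS MRCONSO.RRF file with its
    # standard pipe-delimited layout (CUI in column 0, source vocabulary in
    # column 11, source code in column 13, term string in column 14).
    from collections import defaultdict

    def project_cui(cui: str, path: str = "MRCONSO.RRF") -> dict:
        codes = defaultdict(set)
        with open(path, encoding="utf-8") as f:
            for line in f:
                fields = line.rstrip("\n").split("|")
                if fields[0] == cui:
                    sab, code, term = fields[11], fields[13], fields[14]
                    codes[sab].add((code, term))
        return codes

    # Example: myocardial infarction (CUI C0027051) projected into whichever
    # vocabularies (ICD-9-CM, ICD-10, READ, ...) the local UMLS subset contains.
    for vocab, entries in project_cui("C0027051").items():
        print(vocab, sorted(entries)[:3])

Anything more serious would also handle suppressed terms, language filtering, and the free-text refinement the paper describes.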

September 6, 2012

Human Ignorance, Deep and Profound

Filed under: Bioinformatics,Biomedical,Graphs,Networks — Patrick Durusau @ 3:33 am

Scientists have discovered over 4 million gene switches in parts of the human genome formerly dismissed as “junk” (a scientific shorthand for “we don’t know what this means”). From the New York Times article Bits of Mystery DNA, Far From ‘Junk,’ Play Crucial Role by Gina Kolata:

…The human genome is packed with at least four million gene switches that reside in bits of DNA that once were dismissed as “junk” but that turn out to play critical roles in controlling how cells, organs and other tissues behave. The discovery, considered a major medical and scientific breakthrough, has enormous implications for human health because many complex diseases appear to be caused by tiny changes in hundreds of gene switches.

The findings, which are the fruit of an immense federal project involving 440 scientists from 32 laboratories around the world, will have immediate applications for understanding how alterations in the non-gene parts of DNA contribute to human diseases, which may in turn lead to new drugs. They can also help explain how the environment can affect disease risk. In the case of identical twins, small changes in environmental exposure can slightly alter gene switches, with the result that one twin gets a disease and the other does not.

As scientists delved into the “junk” — parts of the DNA that are not actual genes containing instructions for proteins — they discovered a complex system that controls genes. At least 80 percent of this DNA is active and needed. The result of the work is an annotated road map of much of this DNA, noting what it is doing and how. It includes the system of switches that, acting like dimmer switches for lights, control which genes are used in a cell and when they are used, and determine, for instance, whether a cell becomes a liver cell or a neuron.

Reminds me of the discovery that glial cells aren’t just packing material to support neurons. We had been missing about half the human brain by size.

While I find both discoveries exciting, I am also mindful that we are not getting any closer to complete knowledge.

Rather, they open up opportunities to correct prior mistakes and, at some future time, to discover present ones.

PS: As you probably suspect, relationships between gene switches are extremely complex. New graph databases/algorithms anyone?

August 30, 2012

MyMiner: a web application for computer-assisted biocuration and text annotation

Filed under: Annotation,Bioinformatics,Biomedical,Classification — Patrick Durusau @ 10:35 am

MyMiner: a web application for computer-assisted biocuration and text annotation by David Salgado, Martin Krallinger, Marc Depaule, Elodie Drula, Ashish V. Tendulkar, Florian Leitner, Alfonso Valencia and Christophe Marcelle. ( Bioinformatics (2012) 28 (17): 2285-2287. doi: 10.1093/bioinformatics/bts435 )

Abstract:

Motivation: The exponential growth of scientific literature has resulted in a massive amount of unstructured natural language data that cannot be directly handled by means of bioinformatics tools. Such tools generally require structured data, often generated through a cumbersome process of manual literature curation. Herein, we present MyMiner, a free and user-friendly text annotation tool aimed to assist in carrying out the main biocuration tasks and to provide labelled data for the development of text mining systems. MyMiner allows easy classification and labelling of textual data according to user-specified classes as well as predefined biological entities. The usefulness and efficiency of this application have been tested for a range of real-life annotation scenarios of various research topics.

Availability: http://myminer.armi.monash.edu.au.

Contacts: david.salgado@monash.edu and christophe.marcelle@monash.edu

Supplementary Information: Supplementary data are available at Bioinformatics online.

A useful tool and good tutorial materials.

I could easily see something similar for CS research (unless such already exists).

August 25, 2012

FragVLib a free database mining software for generating “Fragment-based Virtual Library” using pocket similarity…

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 6:13 pm

FragVLib a free database mining software for generating “Fragment-based Virtual Library” using pocket similarity search of ligand-receptor complexes by Raed Khashan. (Journal of Cheminformatics 2012, 4:18 doi:10.1186/1758-2946-4-18)

Abstract:

Background

With the exponential increase in the number of available ligand-receptor complexes, researchers are becoming more dedicated to mine these complexes to facilitate the drug design and development process. Therefore, we present FragVLib, free software which is developed as a tool for performing similarity search across database(s) of ligand-receptor complexes for identifying binding pockets which are similar to that of a target receptor.

Results

The search is based on 3D-geometric and chemical similarity of the atoms forming the binding pocket. For each match identified, the ligand’s fragment(s) corresponding to that binding pocket are extracted, thus, forming a virtual library of fragments (FragVLib) that is useful for structure-based drug design.

Conclusions

An efficient algorithm is implemented in FragVLib to facilitate the pocket similarity search. The resulting fragments can be used for structure-based drug design tools such as Fragment-Based Lead Discovery (FBLD). They can also be used for finding bioisosteres and as an idea generator.

Suggestions of other uses of 3D-geometric shapes for similarity?

August 19, 2012

Bi-directional semantic similarity….

Filed under: Bioinformatics,Biomedical,Semantics,Similarity — Patrick Durusau @ 6:32 pm

Bi-directional semantic similarity for gene ontology to optimize biological and clinical analyses by Sang Jay Bien, Chan Hee Park, Hae Jin Shim, Woongcheol Yang, Jihun Kim and Ju Han Kim.

Abstract:

Background Semantic similarity analysis facilitates automated semantic explanations of biological and clinical data annotated by biomedical ontologies. Gene ontology (GO) has become one of the most important biomedical ontologies with a set of controlled vocabularies, providing rich semantic annotations for genes and molecular phenotypes for diseases. Current methods for measuring GO semantic similarities are limited to considering only the ancestor terms while neglecting the descendants. One can find many GO term pairs whose ancestors are identical but whose descendants are very different and vice versa. Moreover, the lower parts of GO trees are full of terms with more specific semantics.

Methods This study proposed a method of measuring semantic similarities between GO terms using the entire GO tree structure, including both the upper (ancestral) and the lower (descendant) parts. Comprehensive comparison studies were performed with well-known information content-based and graph structure-based semantic similarity measures with protein sequence similarities, gene expression-profile correlations, protein–protein interactions, and biological pathway analyses.

Conclusion The proposed bidirectional measure of semantic similarity outperformed other graph-based and information content-based methods.

Makes me curious: what has the experience with direction and identification been with other ontologies?
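
To make the bidirectional idea concrete, here is a minimal sketch over a toy is_a DAG: Jaccard overlap of ancestor sets combined with Jaccard overlap of descendant sets, equally weighted. The toy graph and the equal weighting are my assumptions; the paper’s actual measure is more sophisticated than this.

    # Toy sketch of a bidirectional (ancestor + descendant) term similarity.
    # The equal weighting and the example DAG are assumptions for illustration.
    import networkx as nx

    def jaccard(a: set, b: set) -> float:
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    def bidirectional_similarity(g: nx.DiGraph, t1: str, t2: str) -> float:
        up = jaccard(nx.ancestors(g, t1), nx.ancestors(g, t2))        # shared ancestors
        down = jaccard(nx.descendants(g, t1), nx.descendants(g, t2))  # shared descendants
        return 0.5 * up + 0.5 * down

    # Edges point from parent term to child term, so nx.ancestors/descendants
    # line up with ontology ancestors/descendants.
    go = nx.DiGraph([("molecular_function", "binding"),
                     ("binding", "protein binding"),
                     ("binding", "DNA binding"),
                     ("protein binding", "transcription factor binding")])

    print(bidirectional_similarity(go, "protein binding", "DNA binding"))

Two terms with identical ancestors but very different descendants score lower here than under an ancestors-only measure, which is the paper’s point.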

Concept Annotation in the CRAFT corpus

Filed under: Bioinformatics,Biomedical,Corpora,Natural Language Processing — Patrick Durusau @ 4:47 pm

Concept Annotation in the CRAFT corpus by Michael Bada, Miriam Eckert, Donald Evans, Kristin Garcia, Krista Shipley, Dmitry Sitnikov, William A. Baumgartner, K. Bretonnel Cohen, Karin Verspoor, Judith A. Blake and Lawrence E. Hunter. (BMC Bioinformatics 2012, 13:161 doi:10.1186/1471-2105-13-161)

Abstract:

Background

Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text.

Results

This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement.

Conclusions

As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that comprise the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems. The corpus, annotation guidelines, and other associated resources are freely available at http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml.

Lessons on what it takes to create a “gold standard” corpus to advance NLP application development.

What do you think the odds are of “high inter[author] agreement” in the absence of such planning and effort?

Sorry, I meant “high interannotator agreement.”

Guess we have to plan for “low inter[author] agreement.”

Suggestions?

Gold Standard (or Bronze, Tin?)

A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools by Karin M Verspoor, Kevin B Cohen, Arrick Lanfranchi, Colin Warner, Helen L Johnson, Christophe Roeder, Jinho D Choi, Christopher Funk, Yuriy Malenkiy, Miriam Eckert, Nianwen Xue, William A Baumgartner, Michael Bada, Martha Palmer and Lawrence E Hunter. BMC Bioinformatics 2012, 13:207 doi:10.1186/1471-2105-13-207.

Abstract:

Background

We introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus. We further assess the performance of existing tools for performing sentence splitting, tokenization, syntactic parsing, and named entity recognition on this corpus.

Results

Many biomedical natural language processing systems demonstrated large differences between their previously published results and their performance on the CRAFT corpus when tested with the publicly available models or rule sets. Trainable systems differed widely with respect to their ability to build high-performing models based on this data.

Conclusions

The finding that some systems were able to train high-performing models based on this corpus is additional evidence, beyond high inter-annotator agreement, that the quality of the CRAFT corpus is high. The overall poor performance of various systems indicates that considerable work needs to be done to enable natural language processing systems to work well when the input is full-text journal articles. The CRAFT corpus provides a valuable resource to the biomedical natural language processing community for evaluation and training of new models for biomedical full text publications.

This is the article that I discovered first and that led me to all of the BioNLP resources covered in the posts below.

Important as a deeply annotated text corpus.

But also a reminder that human annotators created the “gold standard,” against which other efforts are judged.

If you are ill, do you want gold standard research into the medical literature (which involves librarians)? Or is bronze or tin standard research good enough?

PS: I will be going back to pick up the other resources as appropriate.

CRAFT: THE COLORADO RICHLY ANNOTATED FULL TEXT CORPUS

Filed under: Bioinformatics,Biomedical,Corpora,Natural Language Processing — Patrick Durusau @ 3:41 pm

CRAFT: THE COLORADO RICHLY ANNOTATED FULL TEXT CORPUS

From the Quick Facts:

  • 67 full text articles
  • >560,000 Tokens
  • >21,000 Sentences
  • ~100,000 concept annotations to 7 different biomedical ontologies/terminologies
    • Chemical Entities of Biological Interest (ChEBI)
    • Cell Type Ontology (CL)
    • Entrez Gene
    • Gene Ontology (biological process, cellular component, and molecular function)
    • NCBI Taxonomy
    • Protein Ontology
    • Sequence Ontology
  • Penn Treebank markup for each sentence
  • Multiple output formats available

Let’s see: 67 articles resulted in 100,000 concept annotations, or about 1,493 per article for seven (7) ontologies/terminologies.

Ready to test this mapping out in your topic map application?

BioNLP-Corpora

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 2:46 pm

BioNLP-Corpora

From the webpage:

BioNLP-Corpora is a repository of biologically and linguistically annotated corpora and biological datasets.

It is one of the projects of the BioNLP initiative by the Center for Computational Pharmacology at the University of Colorado Denver Health Sciences Center to create and distribute code, software, and data for applying natural language processing techniques to biomedical texts.

There are many resources available for download at BioNLP-Corpora:

Like the guy says in the original Star Wars, “…almost there….”

In addition to pointing out really useful resources, I am following a path that arose from the discovery of one resource.

One more website and then the article I found that led to all the BioNLP* resources.

BioNLP

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 2:39 pm

BioNLP

From the homepage (worth repeating in full):

BioNLP is an initiative by the Center for Computational Pharmacology at the University of Colorado Denver Health Sciences Center to create and distribute code, software, and data for applying natural language processing techniques to biomedical texts. There are many projects associated with BioNLP.

Projects

  • BioLemmatizer: a biomedical literature specific lemmatizer.
  • BioNLP-Corpora: a repository of biologically and linguistically annotated corpora and biomedical datasets. This project includes
    • Colorado Richly Annotated Full-Text Corpus (CRAFT)
    • PICorpus
    • GeneHomonym
    • Annotation Projects
    • MEDLINE Mining projects
    • Anaphora Corpus
    • TestSuite Corpora
  • BioNLP-UIMA: Unstructured Information Management Architecture (UIMA) components geared towards the use and evaluation of tools for biomedical natural language processing, including tools for our own OpenDMAP and MutationFinder use.
  • common: a library of utility code for common tasks
  • Knowtator: a Protege plug-in for text annotation.
  • medline-xml-parser: a code library containing an XML parser for the 2012 Medline XML distribution format
  • MutationFinder: an information extraction system for extracting descriptions of point mutations from free text.
  • OboAnalyzer: an analysis tool to detect OBO ontology terms that use different linguistic conventions for expressing similar semantics.
  • OpenDMAP: an ontology-driven, rule-based concept analysis and information extraction system
  • Parentheses Classifier: a classifier for the content of parenthesized text
  • Simple Semantic Classifier: a text classifier for OBO domains
  • uima-shims: a library of simple interfaces designed to facilitate the development of type-system-independent UIMA components

August 15, 2012

BiologicalNetworks

Filed under: Bioinformatics,Biomedical,Genome — Patrick Durusau @ 10:25 am

BiologicalNetworks

From the webpage:

BiologicalNetworks research environment enables integrative analysis of:

  • Interaction networks, metabolic and signaling pathways together with transcriptomic, metabolomic and proteomic experiments data
  • Transcriptional regulation modules (modular networks)
  • Genomic sequences including gene regulatory regions (e.g. binding sites, promoter regions) and respective transcription factors, as well as NGS data
  • Comparative genomics, homologous/orthologous genes and phylogenies
  • 3D protein structures and ligand binding, small molecules and drugs
  • Multiple ontologies including GeneOntology, Cell and Tissue types, Diseases, Anatomy and taxonomies

BiologicalNetworks backend database (IntegromeDB) integrates >1000 curated data sources (from the NAR list) for thousands of eukaryotic, prokaryotic and viral organisms and millions of public biomedical, biochemical, drug, disease and health-related web resources.

Correction: As of 3 July 2012, “IntegromeDB’s index reaches 1 Billion (biomedical resources links) milestone.”

IntegromeDB collects all the biomedical, biochemical, drug and disease related data available in the public domain and brings you the most relevant data for your search. It provides you with an integrative view on the genomic, proteomic, transcriptomic, genetic and functional information featuring gene/protein names, synonyms and alternative IDs, gene function, orthologies, gene expression, pathways and molecular (protein-protein, TF-gene, genetic, etc.) interactions, mutations and SNPs, disease relationships, drugs and compounds, and many other. Explore and enjoy!

Sounds a lot like a topic map, doesn’t it?

One interesting feature is Inconsistency in the integrated data.

The data sets are available for download as RDF files.

How would you:

  • Improve the consistency of integrated data?
  • Enable crowd participation in curation of data?
  • Enable the integration of data files into other data systems?
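
On the first question, any answer has to start with something machine-checkable. Here is a minimal sketch using rdflib against one of the downloadable RDF files; the file name, and the choice of “multiple rdfs:label values” as the inconsistency to flag, are assumptions for illustration (multiple labels may of course be legitimate synonyms, which is where a topic map earns its keep).

    # Sketch: load one of the downloadable RDF files and flag a simple kind of
    # inconsistency -- resources carrying more than one rdfs:label. The file name
    # is a placeholder; substitute a file actually downloaded from IntegromeDB.
    from collections import defaultdict
    from rdflib import Graph, RDFS

    g = Graph()
    g.parse("integromedb_sample.rdf")  # rdflib guesses the format from the file

    labels = defaultdict(set)
    for subject, _, label in g.triples((None, RDFS.label, None)):
        labels[subject].add(str(label))

    for subject, names in labels.items():
        if len(names) > 1:
            print(subject, "has conflicting labels:", sorted(names))

Curation and crowd participation would then be a matter of routing each flagged subject to someone who can say whether the labels are synonyms, errors, or distinct subjects.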

August 10, 2012

The Story Behind “Scaling Metagenome Assembly with Probabilistic de Bruijn Graphs”

Filed under: Bioinformatics,Biomedical,De Bruijn Graphs,Genome,Graphs — Patrick Durusau @ 3:11 pm

The Story Behind “Scaling Metagenome Assembly with Probabilistic de Bruijn Graphs” by C. Titus Brown.

From the post:

This is the story behind our PNAS paper, “Scaling Metagenome Assembly with Probabilistic de Bruijn Graphs” (released from embargo this past Monday).

Why did we write it? How did it get started? Well, rewind the tape 2 years and more…

There we were in May 2010, sitting on 500 million Illumina reads from shotgun DNA sequencing of an Iowa prairie soil sample. We wanted to reconstruct the microbial community contents and structure of the soil sample, but we couldn’t figure out how to do that from the data. We knew that, in theory, the data contained a number of partial microbial genomes, and we had a technique — de novo genome assembly — that could (again, in theory) reconstruct those partial genomes. But when we ran the software, it choked — 500 million reads was too big a data set for the software and computers we had. Plus, we were looking forward to the future, when we would get even more data; if the software was dying on us now, what would we do when we had 10, 100, or 1000 times as much data?

A perfect post to read over the weekend!

Not all research ends successfully, but when it does, it is a story that inspires.

[C]rowdsourcing … knowledge base construction

Filed under: Biomedical,Crowd Sourcing,Data Mining,Medical Informatics — Patrick Durusau @ 1:48 pm

Development and evaluation of a crowdsourcing methodology for knowledge base construction: identifying relationships between clinical problems and medications by Allison B McCoy, Adam Wright, Archana Laxmisan, Madelene J Ottosen, Jacob A McCoy, David Butten, and Dean F Sittig. (J Am Med Inform Assoc 2012; 19:713-718 doi:10.1136/amiajnl-2012-000852)

Abstract:

Objective We describe a novel, crowdsourcing method for generating a knowledge base of problem–medication pairs that takes advantage of manually asserted links between medications and problems.

Methods Through iterative review, we developed metrics to estimate the appropriateness of manually entered problem–medication links for inclusion in a knowledge base that can be used to infer previously unasserted links between problems and medications.

Results Clinicians manually linked 231 223 medications (55.30% of prescribed medications) to problems within the electronic health record, generating 41 203 distinct problem–medication pairs, although not all were accurate. We developed methods to evaluate the accuracy of the pairs, and after limiting the pairs to those meeting an estimated 95% appropriateness threshold, 11 166 pairs remained. The pairs in the knowledge base accounted for 183 127 total links asserted (76.47% of all links). Retrospective application of the knowledge base linked 68 316 medications not previously linked by a clinician to an indicated problem (36.53% of unlinked medications). Expert review of the combined knowledge base, including inferred and manually linked problem–medication pairs, found a sensitivity of 65.8% and a specificity of 97.9%.

Conclusion Crowdsourcing is an effective, inexpensive method for generating a knowledge base of problem–medication pairs that is automatically mapped to local terminologies, up-to-date, and reflective of local prescribing practices and trends.

I would not apply the term “crowdsourcing” here, in part because the “crowd” is hardly unknown. It is not a crowd at all, but an identifiable group of clinicians.

That doesn’t invalidate the results, which show the utility of data mining for creating knowledge bases.

As a matter of usage, let’s not confuse anonymous “crowds” with specific groups of people.
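
Whatever we call the method, the filtering step it relies on is easy to sketch: tally how often each problem–medication pair is asserted, score it against all links for that medication, and keep pairs above a cutoff. The frequency-based score and the cutoff below are stand-ins of mine, not the appropriateness metrics developed in the paper.

    # Sketch of the filtering idea: keep problem-medication pairs asserted often
    # enough, relative to all links for that medication, to trust for inference.
    # The frequency score and cutoff are assumptions, not the paper's metrics.
    from collections import Counter

    # (problem, medication) links as clinicians might assert them in an EHR
    links = [
        ("hypertension", "lisinopril"), ("hypertension", "lisinopril"),
        ("hypertension", "lisinopril"), ("diabetes", "metformin"),
        ("diabetes", "metformin"), ("hypertension", "metformin"),
    ]

    pair_counts = Counter(links)
    med_counts = Counter(med for _, med in links)

    CUTOFF = 0.5  # minimum share of a medication's links for a pair to be kept
    knowledge_base = {
        pair for pair, n in pair_counts.items()
        if n / med_counts[pair[1]] >= CUTOFF
    }
    print(sorted(knowledge_base))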

Phenol-Explorer 2.0:… [Topic Maps As Search Templates]

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 12:31 pm

Phenol-Explorer 2.0: a major update of the Phenol-Explorer database integrating data on polyphenol metabolism and pharmacokinetics in humans and experimental animals by Joseph A. Rothwell, Mireia Urpi-Sarda, Maria Boto-Ordoñez, Craig Knox, Rafael Llorach, Roman Eisner, Joseph Cruz, Vanessa Neveu, David Wishart, Claudine Manach, Cristina Andres-Lacueva, and Augustin Scalbert.

Abstract:

Phenol-Explorer, launched in 2009, is the only comprehensive web-based database on the content in foods of polyphenols, a major class of food bioactives that receive considerable attention due to their role in the prevention of diseases. Polyphenols are rarely absorbed and excreted in their ingested forms, but extensively metabolized in the body, and until now, no database has allowed the recall of identities and concentrations of polyphenol metabolites in biofluids after the consumption of polyphenol-rich sources. Knowledge of these metabolites is essential in the planning of experiments whose aim is to elucidate the effects of polyphenols on health. Release 2.0 is the first major update of the database, allowing the rapid retrieval of data on the biotransformations and pharmacokinetics of dietary polyphenols. Data on 375 polyphenol metabolites identified in urine and plasma were collected from 236 peer-reviewed publications on polyphenol metabolism in humans and experimental animals and added to the database by means of an extended relational design. Pharmacokinetic parameters have been collected and can be retrieved in both tabular and graphical form. The web interface has been enhanced and now allows the filtering of information according to various criteria. Phenol-Explorer 2.0, which will be periodically updated, should prove to be an even more useful and capable resource for polyphenol scientists because bioactivities and health effects of polyphenols are dependent on the nature and concentrations of metabolites reaching the target tissues. The Phenol-Explorer database is publicly available and can be found online at http://www.phenol-explorer.eu.

I wanted to call your attention to Table 1: Search Strategy and Terms, step 4, which reads:

Polyphenol* or flavan* or flavon* or anthocyan* or isoflav* or phytoestrogen* or phyto-estrogen* or lignin* or stilbene* or chalcon* or phenolic acid* or ellagic* or coumarin* or hydroxycinnamic* or quercetin* or kaempferol* or rutin* or apigenin* or luteolin* or catechin* or epicatechin* or gallocatechin* or epigallocatechin* or procyanidin* or hesperetin* or naringenin* or cyanidin* or malvidin* or petunid* or peonid* or daidz* or genist* or glycit* or equol* or gallic* or vanillic* or chlorogenic* or tyrosol* or hydoxytyrosol* or resveratrol* or viniferin*

Which of these terms are synonyms for “tyrosol”?

No peeking!

Wikipedia (a generalist source) lists five (5) names, including tyrosol, and 5 different identifiers.

Common Chemistry, which you can access by the CAS number, has twenty-one (21) synonyms.

Ready?

Would you believe 0?

See for yourself: Wikipedia Tyrosol; Common Chemistry – CAS 501-94-0.

Another question: In one week (or even tomorrow), how much of the query in step 4 will you remember?

Some obvious comments:

  • The creators of Phenol-Explorer 2.0 have done a great service to the community by curating this data resource.
  • Creating comprehensive queries is a creative enterprise and not easy to duplicate.

Perhaps less obvious comments:

  • The terms in the query have synonyms, which is no great surprise.
  • If the terms were represented as topics in a topic map, synonyms could be captured for those terms.
  • Capturing of synonyms for terms would support expansion or contraction of search queries.
  • Capturing terms (and their synonyms) in a topic map, would permit merging of terms/synonyms from other researchers.

Final question: Have you thought about using topic maps as search templates?
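
To make the search-template idea concrete, here is a minimal sketch: a tiny topic-map-like structure in which each topic carries its synonyms, and a function that expands selected topics into the kind of OR query shown in step 4. The synonym lists are abbreviated illustrations on my part, not a curated mapping.

    # Sketch: a topic-map-like synonym table used as a reusable search template.
    # Synonym lists are illustrative only; a real map would curate them from
    # sources such as CAS Common Chemistry, Wikipedia, ChEBI, etc.
    topics = {
        "tyrosol": ["tyrosol*", "4-hydroxyphenylethanol", "2-(4-hydroxyphenyl)ethanol"],
        "resveratrol": ["resveratrol*", "3,5,4'-trihydroxystilbene"],
        "quercetin": ["quercetin*", "sophoretin"],
    }

    def build_query(selected, topic_map):
        """Expand selected topics into a single OR query string."""
        terms = []
        for topic in selected:
            terms.extend(topic_map[topic])
        return " or ".join(terms)

    print(build_query(["tyrosol", "resveratrol"], topics))
    # -> tyrosol* or 4-hydroxyphenylethanol or 2-(4-hydroxyphenyl)ethanol or ...

Update the map once and every query built from it picks up the new synonyms; merge another researcher’s map and your queries expand (or contract) accordingly.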

i2b2: Informatics for Integrating Biology and the Bedside

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 4:43 am

i2b2: Informatics for Integrating Biology and the Bedside

I discovered this site while chasing down a coreference resolution workshop. From the homepage:

Informatics for Integrating Biology and the Bedside (i2b2) is an NIH-funded National Center for Biomedical Computing (NCBC) based at Partners HealthCare System in Boston, Mass. Established in 2004 in response to an NIH Roadmap Initiative RFA, this NCBC is one of four national centers awarded in this first competition (http://www.bisti.nih.gov/ncbc/); currently there are seven NCBCs. One of 12 specific initiatives in the New Pathways to Discovery Cluster, the NCBCs will initiate the development of a national computational infrastructure for biomedical computing. The NCBCs and related R01s constitute the National Program of Excellence in Biomedical Computing.

The i2b2 Center, led by Director Isaac Kohane, M.D., Ph.D., Professor of Pediatrics at Harvard Medical School at Children’s Hospital Boston, is comprised of seven cores involving investigators from the Harvard-affiliated hospitals, MIT, Harvard School of Public Health, Joslin Diabetes Center, Harvard Medical School and the Harvard/MIT Division of Health Sciences and Technology. This Center is funded under a Cooperative agreement with the National Institutes of Health.

The i2b2 Center is developing a scalable computational framework to address the bottleneck limiting the translation of genomic findings and hypotheses in model systems relevant to human health. New computational paradigms (Core 1) and methodologies (Cores 2) are being developed and tested in several diseases (airways disease, hypertension, type 2 diabetes mellitus, Huntington’s Disease, rheumatoid arthritis, and major depressive disorder) (Core 3 Driving Biological Projects).

The i2b2 Center (Core 5) offers a Summer Institute in Bioinformatics and Integrative Genomics for qualified undergraduate students, supports an Academic Users’ Group of over 125 members, sponsors annual Shared Tasks for Challenges in Natural Language Processing for Clinical Data, distributes an NLP DataSet for research purpose, and sponsors regular Symposia and Workshops for the community.

Sounds like prime hunting grounds for vocabularies that cross disciplinary boundaries and the like.

Extensive resources. Will explore and report back.

August 9, 2012

The Cell: An Image Library

Filed under: Bioinformatics,Biomedical,Data Source,Medical Informatics — Patrick Durusau @ 3:50 pm

The Cell: An Image Library

For the casual user, an impressive collection of cell images.

For the professional user, the advanced search page gives you an idea of the depth of images in this collection.

A good source of images for curated (not “mash up”) alignment with other materials, such as instructional resources on biology or medicine.

August 8, 2012

The 2012 Nucleic Acids Research Database Issue…

Filed under: Bioinformatics,Biomedical,Genome — Patrick Durusau @ 1:50 pm

The 2012 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection by Michael Y. Galperin, and Xosé M. Fernández-Suárez.

Abstract:

The 19th annual Database Issue of Nucleic Acids Research features descriptions of 92 new online databases covering various areas of molecular biology and 100 papers describing recent updates to the databases previously described in NAR and other journals. The highlights of this issue include, among others, a description of neXtProt, a knowledgebase on human proteins; a detailed explanation of the principles behind the NCBI Taxonomy Database; NCBI and EBI papers on the recently launched BioSample databases that store sample information for a variety of database resources; descriptions of the recent developments in the Gene Ontology and UniProt Gene Ontology Annotation projects; updates on Pfam, SMART and InterPro domain databases; update papers on KEGG and TAIR, two universally acclaimed databases that face an uncertain future; and a separate section with 10 wiki-based databases, introduced in an accompanying editorial. The NAR online Molecular Biology Database Collection, available at http://www.oxfordjournals.org/nar/database/a/, has been updated and now lists 1380 databases. Brief machine-readable descriptions of the databases featured in this issue, according to the BioDBcore standards, will be provided at the http://biosharing.org/biodbcore web site. The full content of the Database Issue is freely available online on the Nucleic Acids Research web site (http://nar.oxfordjournals.org/).

That is the abstract of the article describing the issue: Nucleic Acids Research, Database issue, Volume 40, Issue D1, January 2012.

Very much like being a kid in a candy store. Hard to know what to look at next! Both for subject matter experts and those of us interested in the technology aspects of the databases.

ANNOVAR: functional annotation of genetic variants….

Filed under: Bioinformatics,Biomedical,Genome — Patrick Durusau @ 1:50 pm

ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data by Kai Wang, Mingyao Li, and Hakon Hakonarson. (Nucl. Acids Res. (2010) 38 (16): e164. doi: 10.1093/nar/gkq603)

Just in case you are unfamiliar with ANNOVAR, the software mentioned in: gSearch: a fast and flexible general search tool for whole-genome sequencing:

Abstract:

High-throughput sequencing platforms are generating massive amounts of genetic variation data for diverse genomes, but it remains a challenge to pinpoint a small subset of functionally important variants. To fill these unmet needs, we developed the ANNOVAR tool to annotate single nucleotide variants (SNVs) and insertions/deletions, such as examining their functional consequence on genes, inferring cytogenetic bands, reporting functional importance scores, finding variants in conserved regions, or identifying variants reported in the 1000 Genomes Project and dbSNP. ANNOVAR can utilize annotation databases from the UCSC Genome Browser or any annotation data set conforming to Generic Feature Format version 3 (GFF3). We also illustrate a ‘variants reduction’ protocol on 4.7 million SNVs and indels from a human genome, including two causal mutations for Miller syndrome, a rare recessive disease. Through a stepwise procedure, we excluded variants that are unlikely to be causal, and identified 20 candidate genes including the causal gene. Using a desktop computer, ANNOVAR requires ∼4 min to perform gene-based annotation and ∼15 min to perform variants reduction on 4.7 million variants, making it practical to handle hundreds of human genomes in a day. ANNOVAR is freely available at http://www.openbioinformatics.org/annovar/.

Approximately two years separates ANNOVAR from gSearch. Should give you an idea of the speed of development in bioinformatics. They haven’t labored over finding a syntax for everyone to use for more than a decade. I suspect there is a lesson in there somewhere.

gSearch: a fast and flexible general search tool for whole-genome sequencing

Filed under: Bioinformatics,Biomedical,Genome — Patrick Durusau @ 1:49 pm

gSearch: a fast and flexible general search tool for whole-genome sequencing by Taemin Song, Kyu-Baek Hwang, Michael Hsing, Kyungjoon Lee, Justin Bohn, and Sek Won Kong.

Abstract:

Background: Various processes such as annotation and filtering of variants or comparison of variants in different genomes are required in whole-genome or exome analysis pipelines. However, processing different databases and searching among millions of genomic loci is not trivial.

Results: gSearch compares sequence variants in the Genome Variation Format (GVF) or Variant Call Format (VCF) with a pre-compiled annotation or with variants in other genomes. Its search algorithms are subsequently optimized and implemented in a multi-threaded manner. The proposed method is not a stand-alone annotation tool with its own reference databases. Rather, it is a search utility that readily accepts public or user-prepared reference files in various formats including GVF, Generic Feature Format version 3 (GFF3), Gene Transfer Format (GTF), VCF and Browser Extensible Data (BED) format. Compared to existing tools such as ANNOVAR, gSearch runs more than 10 times faster. For example, it is capable of annotating 52.8 million variants with allele frequencies in 6 min.

Availability: gSearch is available at http://ml.ssu.ac.kr/gSearch. It can be used as an independent search tool or can easily be integrated to existing pipelines through various programming environments such as Perl, Ruby and Python.

As the abstract says: “…searching among millions of genomic loci is not trivial.”

Either for integration with topic map tools in a pipeline or for searching technology, definitely worth a close reading.
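
The core lookup behind such a tool (or behind a topic map pipeline stage) is simple to sketch, even if doing it quickly over millions of loci is not: index a reference annotation by (chromosome, position) and stream a VCF against it. The file names and column handling below are placeholders of mine; gSearch itself is multi-threaded, handles several formats, and is vastly faster.

    # Sketch: naive variant lookup -- stream a VCF and report which records hit a
    # pre-indexed set of annotated loci. File names are placeholders; gSearch does
    # this multi-threaded, against many formats, and far more efficiently.
    def load_reference_loci(path):
        """Index a simple tab-separated annotation file as {(chrom, pos): name}."""
        loci = {}
        with open(path) as f:
            for line in f:
                if line.startswith("#"):
                    continue
                chrom, pos, name = line.rstrip("\n").split("\t")[:3]
                loci[(chrom, int(pos))] = name
        return loci

    def annotate_vcf(vcf_path, loci):
        with open(vcf_path) as f:
            for line in f:
                if line.startswith("#"):
                    continue
                chrom, pos = line.split("\t")[:2]
                hit = loci.get((chrom, int(pos)))
                if hit:
                    yield chrom, int(pos), hit

    reference = load_reference_loci("annotations.tsv")
    for chrom, pos, name in annotate_vcf("sample.vcf", reference):
        print(f"{chrom}:{pos} overlaps {name}")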

BioContext: an integrated text mining system…

Filed under: Bioinformatics,Biomedical,Entity Extraction,Text Mining — Patrick Durusau @ 1:49 pm

BioContext: an integrated text mining system for large-scale extraction and contextualization of biomolecular events by Martin Gerner, Farzaneh Sarafraz, Casey M. Bergman, and Goran Nenadic. (Bioinformatics (2012) 28 (16): 2154-2161. doi: 10.1093/bioinformatics/bts332)

Abstract:

Motivation: Although the amount of data in biology is rapidly increasing, critical information for understanding biological events like phosphorylation or gene expression remains locked in the biomedical literature. Most current text mining (TM) approaches to extract information about biological events are focused on either limited-scale studies and/or abstracts, with data extracted lacking context and rarely available to support further research.

Results: Here we present BioContext, an integrated TM system which extracts, extends and integrates results from a number of tools performing entity recognition, biomolecular event extraction and contextualization. Application of our system to 10.9 million MEDLINE abstracts and 234 000 open-access full-text articles from PubMed Central yielded over 36 million mentions representing 11.4 million distinct events. Event participants included over 290 000 distinct genes/proteins that are mentioned more than 80 million times and linked where possible to Entrez Gene identifiers. Over a third of events contain contextual information such as the anatomical location of the event occurrence or whether the event is reported as negated or speculative.

Availability: The BioContext pipeline is available for download (under the BSD license) at http://www.biocontext.org, along with the extracted data which is also available for online browsing.

If you are interested in text mining by professionals, this is a good place to start.

Should be of particular interest to anyone interested in mining literature for construction of a topic map.

August 5, 2012

Journal of Pathology Informatics (JPI)

Filed under: Bioinformatics,Biomedical,Medical Informatics,Pathology Informatics — Patrick Durusau @ 10:09 am

Journal of Pathology Informatics (JPI)

About:

The Journal of Pathology Informatics (JPI) is an open access peer-reviewed journal dedicated to the advancement of pathology informatics. This is the official journal of the Association for Pathology Informatics (API). The journal aims to publish broadly about pathology informatics and freely disseminate all articles worldwide. This journal is of interest to pathologists, informaticians, academics, researchers, health IT specialists, information officers, IT staff, vendors, and anyone with an interest in informatics. We encourage submissions from anyone with an interest in the field of pathology informatics. We publish all types of papers related to pathology informatics including original research articles, technical notes, reviews, viewpoints, commentaries, editorials, book reviews, and correspondence to the editors. All submissions are subject to peer review by the well-regarded editorial board and by expert referees in appropriate specialties.

Another site to add to your whitelist of sites to search for informatics information.

> 4,000 Ways to say “You’re OK” [Breast Cancer Diagnosis]

The feasibility of using natural language processing to extract clinical information from breast pathology reports by Julliette M Buckley, et al.

Abstract:

Objective: The opportunity to integrate clinical decision support systems into clinical practice is limited due to the lack of structured, machine readable data in the current format of the electronic health record. Natural language processing has been designed to convert free text into machine readable data. The aim of the current study was to ascertain the feasibility of using natural language processing to extract clinical information from >76,000 breast pathology reports.

Approach and Procedure: Breast pathology reports from three institutions were analyzed using natural language processing software (Clearforest, Waltham, MA) to extract information on a variety of pathologic diagnoses of interest. Data tables were created from the extracted information according to date of surgery, side of surgery, and medical record number. The variety of ways in which each diagnosis could be represented was recorded, as a means of demonstrating the complexity of machine interpretation of free text.

Results: There was widespread variation in how pathologists reported common pathologic diagnoses. We report, for example, 124 ways of saying invasive ductal carcinoma and 95 ways of saying invasive lobular carcinoma. There were >4000 ways of saying invasive ductal carcinoma was not present. Natural language processor sensitivity and specificity were 99.1% and 96.5% when compared to expert human coders.

Conclusion: We have demonstrated how a large body of free text medical information such as seen in breast pathology reports, can be converted to a machine readable format using natural language processing, and described the inherent complexities of the task.

The advantages of using current language practices include:

  • No new vocabulary needs to be developed.
  • No adoption curve for a new vocabulary.
  • No training required for users to introduce the new vocabulary.
  • Works with historical data.

and I am sure there are others.

Add natural language usage to your topic map for immediately useful results for your clients.
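
A full clinical NLP pipeline is beyond a blog post, but the “many surface forms, one subject” point is easy to illustrate. A minimal sketch that maps a few variant phrasings onto canonical diagnoses with regular expressions; the patterns and the example report are invented for illustration and nowhere near the 124 (or 4,000+) real variants reported in the paper.

    # Sketch: normalize a few variant phrasings from pathology reports onto
    # canonical diagnosis subjects. Patterns and the example report are invented;
    # a production system handles thousands of surface forms, negation, and
    # report structure.
    import re

    PATTERNS = {
        "invasive ductal carcinoma": [
            r"\binvasive ductal carcinoma\b",
            r"\binfiltrating ductal carcinoma\b",
            r"\bIDC\b",
        ],
        "invasive lobular carcinoma": [
            r"\binvasive lobular carcinoma\b",
            r"\bILC\b",
        ],
    }

    def extract_diagnoses(report: str) -> set:
        found = set()
        for canonical, patterns in PATTERNS.items():
            if any(re.search(p, report, re.IGNORECASE) for p in patterns):
                found.add(canonical)
        return found

    report = "Sections show infiltrating ductal carcinoma, grade 2. No ILC seen."
    print(extract_diagnoses(report))  # naive: "No ILC seen" is a negation it misses

The variant phrasings are exactly the kind of synonym sets a topic map captures once and reuses everywhere.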

August 2, 2012

Processing Rat Brain Neuronal Signals Using a Hadoop Computing Cluster – Part III

Filed under: Bioinformatics,Biomedical,Hadoop,MapReduce — Patrick Durusau @ 9:23 pm

Processing Rat Brain Neuronal Signals Using a Hadoop Computing Cluster – Part III by Jadin C. Jackson, PhD & Bradley S. Rubin, PhD.

From the post:

Up to this point, we’ve described our reasons for using Hadoop and Hive on our neural recordings (Part I), the reasons why the analyses of these recordings are interesting from a scientific perspective, and detailed descriptions of our implementation of these analyses using Hadoop and Hive (Part II). The last part of this story cuts straight to the results and then discusses important lessons we learned along the way and future goals for improving the analysis framework we’ve built so far.

Biomedical researchers will be interested in the results, but I am more interested in the observation that Hadoop makes it possible to retain results for ad hoc analysis.

Community Based Annotation (mapping?)

Filed under: Annotation,Bioinformatics,Biomedical,Interface Research/Design,Ontology — Patrick Durusau @ 1:51 pm

Enabling authors to annotate their articles is examined in: Assessment of community-submitted ontology annotations from a novel database-journal partnership by Tanya Z. Berardini, Donghui Li, Robert Muller, Raymond Chetty, Larry Ploetz, Shanker Singh, April Wensel and Eva Huala.

Abstract:

As the scientific literature grows, leading to an increasing volume of published experimental data, so does the need to access and analyze this data using computational tools. The most commonly used method to convert published experimental data on gene function into controlled vocabulary annotations relies on a professional curator, employed by a model organism database or a more general resource such as UniProt, to read published articles and compose annotation statements based on the articles’ contents. A more cost-effective and scalable approach capable of capturing gene function data across the whole range of biological research organisms in computable form is urgently needed.

We have analyzed a set of ontology annotations generated through collaborations between the Arabidopsis Information Resource and several plant science journals. Analysis of the submissions entered using the online submission tool shows that most community annotations were well supported and the ontology terms chosen were at an appropriate level of specificity. Of the 503 individual annotations that were submitted, 97% were approved and community submissions captured 72% of all possible annotations. This new method for capturing experimental results in a computable form provides a cost-effective way to greatly increase the available body of annotations without sacrificing annotation quality.

It is encouraging that this annotation effort started with the persons most likely to know the correct answers: the authors of the papers in question.

The low initial participation rate (16%) and the improved rate after an email reminder (53%) were less encouraging.

I suspect that unless and until prior annotation practices (by researchers) become a line item on funding requests (how many annotations were accepted by publishers of your prior research?), annotation will remain a low-priority item.

Perhaps I should suggest that as a study area for the NIH?

Publishers, researchers who build annotation software, annotated data sources and their maintainers, are all likely to be interested.

Would you be interested as well?

August 1, 2012

Processing Rat Brain Neuronal Signals Using a Hadoop Computing Cluster – Part II

Filed under: Bioinformatics,Biomedical,Hadoop,Signal Processing — Patrick Durusau @ 7:19 pm

Processing Rat Brain Neuronal Signals Using a Hadoop Computing Cluster – Part II by Jadin C. Jackson, PhD & Bradley S. Rubin, PhD.

From the post:

As mentioned in Part I, although Hadoop and other Big Data technologies are typically applied to I/O intensive workloads, where parallel data channels dramatically increase I/O throughput, there is growing interest in applying these technologies to CPU intensive workloads. In this work, we used Hadoop and Hive to digitally signal process individual neuron voltage signals captured from electrodes embedded in the rat brain. Previously, this processing was performed on a single Matlab workstation, a workload that was both CPU intensive and data intensive, especially for intermediate output data. With Hadoop/Hive, we were not only able to apply parallelism to the various processing steps, but had the additional benefit of having all the data online for additional ad hoc analysis. Here, we describe the technical details of our implementation, including the biological relevance of the neural signals and analysis parameters. In Part III, we will then describe the tradeoffs between the Matlab and Hadoop/Hive approach, performance results, and several issues identified with using Hadoop/Hive in this type of application.

Details of the setup for processing rat brain signals with Hadoop.

Looking back, I did not see any mention of the data sets. Perhaps in Part III?
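
For readers who want a feel for the CPU-bound style of job being described, here is a minimal Hadoop Streaming mapper sketch in Python: one signal chunk per input line, an FFT per chunk, spectral power emitted as key/value pairs. The input layout (electrode id, tab, comma-separated voltage samples) and the sampling rate are assumptions for illustration, not the authors' actual format.

    #!/usr/bin/env python
    # Sketch: Hadoop Streaming mapper turning raw voltage samples into
    # spectral power per frequency bin.
    import sys
    import numpy as np

    SAMPLE_RATE = 1000.0  # Hz; an assumed sampling rate, not the authors' value

    for line in sys.stdin:
        electrode_id, samples = line.rstrip("\n").split("\t", 1)
        signal = np.array([float(v) for v in samples.split(",")])
        power = np.abs(np.fft.rfft(signal)) ** 2
        freqs = np.fft.rfftfreq(signal.size, d=1.0 / SAMPLE_RATE)
        for f, p in zip(freqs, power):
            # key = electrode id + frequency bin, value = spectral power;
            # downstream Hive tables can be built directly from this output
            print("%s,%.1f\t%g" % (electrode_id, f, p))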

July 31, 2012

Processing Rat Brain Neuronal Signals Using A Hadoop Computing Cluster – Part I

Filed under: Bioinformatics,Biomedical,Hadoop,Signal Processing — Patrick Durusau @ 4:54 pm

Processing Rat Brain Neuronal Signals Using A Hadoop Computing Cluster – Part I by Jadin C. Jackson, PhD & Bradley S. Rubin, PhD.

From the introduction:

In this three-part series of posts, we will share our experiences tackling a scientific computing challenge that may serve as a useful practical example for those readers considering Hadoop and Hive as an option to meet their growing technical and scientific computing needs. This first part describes some of the background behind our application and the advantages of Hadoop that make it an attractive framework in which to implement our solution. Part II dives into the technical details of the data we aimed to analyze and of our solution. Finally, we wrap up this series in Part III with a description of some of our main results, and most importantly perhaps, a list of things we learned along the way, as well as future possibilities for improvements.

And:

Problem Statement

Prior to starting this work, Jadin had data previously gathered by himself and from neuroscience researchers who are interested in the role of the brain region called the hippocampus. In both rats and humans, this region is responsible for both spatial processing and memory storage and retrieval. For example, as a rat runs a maze, neurons in the hippocampus, each representing a point in space, fire in sequence. When the rat revisits a path, and pauses to make decisions about how to proceed, those same neurons fire in similar sequences as the rat considers the previous consequences of taking one path versus another. In addition to this binary-like firing of neurons, brain waves, produced by ensembles of neurons, are present in different frequency bands. These act somewhat like clock signals, and the phase relationships of these signals correlate to specific brain signal pathways that provide input to this sub-region of the hippocampus.

The goal of the underlying neuroscience research is to correlate the physical state of the rat with specific characteristics of the signals coming from the neural circuitry in the hippocampus. Those signal differences reflect the origin of signals to the hippocampus. Signals that arise within the hippocampus indicate actions based on memory input, such as reencountering previously encountered situations. Signals that arise outside the hippocampus correspond to other cognitive processing. In this work, we digitally signal process the individual neuronal signal output and turn it into spectral information related to the brain region of origin for the signal input.

If this doesn’t sound like a topic map related problem on your first read, what would you call the “…brain region of origin for the signal input[?]”

That is if you wanted to say something about it. Or wanted to associate information, oh, I don’t know, captured from a signal processing application with it?

Hmmm, that’s what I thought too.

Besides, it is a good opportunity for you to exercise your Hadoop skills. Never a bad thing to work on the unfamiliar.
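
If you want a toy picture of the association I have in mind, here is a Python sketch that treats the brain region of origin as a subject and ties signal-processing output to it. The region names, the phase threshold and the classification rule are invented for illustration; they are not results from these posts.

    # Sketch: associate signal-processing output with the subject
    # "brain region of origin."
    topics = {
        "CA1": {"kind": "brain region", "part_of": "hippocampus"},
        "entorhinal_cortex": {"kind": "brain region"},
    }

    def region_of_origin(theta_phase_offset):
        """Crude stand-in for the real spectral/phase analysis."""
        return "CA1" if abs(theta_phase_offset) < 0.5 else "entorhinal_cortex"

    associations = []
    for electrode_id, phase in [("e01", 0.2), ("e02", 1.4)]:
        associations.append({
            "type": "signal-originates-in",
            "signal": electrode_id,
            "region": region_of_origin(phase),
        })

    print(associations)

Once the region of origin is a first-class subject, anything else you know about it (anatomy, prior studies, other recordings) can be merged in.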

July 20, 2012

Software support for SBGN maps: SBGN-ML and LibSBGN

Filed under: Bioinformatics,Biomedical,Graphs,Hypergraphs — Patrick Durusau @ 3:30 pm

Software support for SBGN maps: SBGN-ML and LibSBGN by Martijn P. van Iersel, Alice C. Villéger, Tobias Czauderna, Sarah E. Boyd, Frank T. Bergmann, Augustin Luna, Emek Demir, Anatoly Sorokin, Ugur Dogrusoz, Yukiko Matsuoka, Akira Funahashi, Mirit I. Aladjem, Huaiyu Mi, Stuart L. Moodie, Hiroaki Kitano, Nicolas Le Novère, and Falk Schreiber. (Bioinformatics (2012) 28: 2016-2021.)

Warning: Unless you really like mapping and markup languages this is likely to be a boring story. If you do (and I do), it is the sort of thing you will print out and enjoy reading. Just so you know.

Abstract:

Motivation: LibSBGN is a software library for reading, writing and manipulating Systems Biology Graphical Notation (SBGN) maps stored using the recently developed SBGN-ML file format. The library (available in C++ and Java) makes it easy for developers to add SBGN support to their tools, whereas the file format facilitates the exchange of maps between compatible software applications. The library also supports validation of maps, which simplifies the task of ensuring compliance with the detailed SBGN specifications. With this effort we hope to increase the adoption of SBGN in bioinformatics tools, ultimately enabling more researchers to visualize biological knowledge in a precise and unambiguous manner.

Availability and implementation: Milestone 2 was released in December 2011. Source code, example files and binaries are freely available under the terms of either the LGPL v2.1+ or Apache v2.0 open source licenses from http://libsbgn.sourceforge.net.

Contact: sbgn-libsbgn@lists.sourceforge.net

I included the hyperlinks to standards and software for the introduction but not the article references. Those are of interest too but for the moment I only want to entice you to read the article in full. There is a lot of graph work going on in bioinformatics and we would all do well to be more aware of it.

The Systems Biology Graphical Notation (SBGN, Le Novère et al., 2009) facilitates the representation and exchange of complex biological knowledge in a concise and unambiguous manner: as standardized pathway maps. It has been developed and supported by a vibrant community of biologists, biochemists, software developers, bioinformaticians and pathway databases experts.

SBGN is described in detail in the online specifications (see http://sbgn.org/Documents/Specifications). Here we summarize its concepts only briefly. SBGN defines three orthogonal visual languages: Process Description (PD), Entity Relationship (ER) and Activity Flow (AF). SBGN maps must follow the visual vocabulary, syntax and layout rules of one of these languages. The choice of language depends on the type of pathway or process being depicted and the amount of available information. The PD language, which originates from Kitano’s Process Diagrams (Kitano et al., 2005) and the related CellDesigner tool (Funahashi et al., 2008), is equivalent to a bipartite graph (with a few exceptions) with one type of nodes representing pools of biological entities, and a second type of nodes representing biological processes such as biochemical reactions, transport, binding and degradation. Arcs represent consumption, production or control, and can only connect nodes of differing types. The PD language is very suitable for metabolic pathways, but struggles to concisely depict the combinatorial complexity of certain proteins with many phosphorylation states. The ER language, on the other hand, is inspired by Kohn’s Molecular Interaction Maps (Kohn et al., 2006), and describes relations between biomolecules. In ER, two entities can be linked with an interaction arc. The outcome of an interaction (for example, a protein complex), is considered an entity in itself, represented by a black dot, which can engage in further interactions. Thus ER represents dependencies between interactions, or putting it differently, it can represent which interaction is necessary for another one to take place. Interactions are possible between two or more entities, which make ER maps roughly equivalent to a hypergraph in which an arc can connect more than two nodes. ER is more concise than PD when it comes to representing protein modifications and protein interactions, although it is less capable when it comes to presenting biochemical reactions. Finally, the third language in the SBGN family is AF, which represents the activities of biomolecules at a higher conceptual level. AF is suitable to represent the flow of causality between biomolecules even when detailed knowledge on biological processes is missing.

Efficient integration of the SBGN standard into the research cycle requires adoption by visualization and modeling software. Encouragingly, a growing number of pathway tools (see http://sbgn.org/SBGN_Software) offer some form of SBGN compatibility. However, current software implementations of SBGN are often incomplete and sometimes incorrect. This is not surprising: as SBGN covers a broad spectrum of biological phenomena, complete and accurate implementation of the full SBGN specifications represents a complex, error-prone and time-consuming task for individual tool developers. This development step could be simplified, and redundant implementation efforts avoided, by accurately translating the full SBGN specifications into a single software library, available freely for any tool developer to reuse in their own project. Moreover, the maps produced by any given tool usually cannot be reused in another tool, because SBGN only defines how biological information should be visualized, but not how the maps should be stored electronically. Related community standards for exchanging pathway knowledge, namely BioPAX (Demir et al., 2010) and SBML (Hucka et al., 2003), have proved insufficient for this role (more on this topic in Section 4). Therefore, we observed a second need, for a dedicated, standardized SBGN file format.

Following these observations, we started a community effort with two goals: to encourage the adoption of SBGN by facilitating its implementation in pathway tools, and to increase interoperability between SBGN-compatible software. This has resulted in a file format called SBGN-ML and a software library called LibSBGN. Each of these two components will be explained separately in the next sections.

Of course, there is always the data prior to this markup and the data that comes afterwards, so you could say I see a role for topic maps. 😉
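
Since the ER language is described above as roughly a hypergraph, here is a tiny Python sketch of that data structure: an interaction (hyperedge) can join more than two entities, and its outcome can itself participate in further interactions. The entity and interaction names are invented; this is not the LibSBGN API.

    # Sketch: ER-style interactions as hyperedges whose outcomes are
    # themselves entities.
    interactions = {}

    def add_interaction(name, participants):
        """Record a hyperedge and return its outcome, usable as a participant."""
        outcome = "outcome:" + name
        interactions[name] = {"participants": list(participants), "outcome": outcome}
        return outcome

    complex_ab = add_interaction("A_binds_B", ["A", "B"])
    add_interaction("AB_phosphorylates_C", [complex_ab, "kinase", "C"])  # 3 participants

    for name, edge in interactions.items():
        print(name, "->", edge["participants"])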
