Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

August 27, 2012

Reasoning with the Variation Ontology using Apache Jena #OWL #RDF

Filed under: Bioinformatics,Jena,OWL,RDF,Reasoning — Patrick Durusau @ 1:46 pm

Reasoning with the Variation Ontology using Apache Jena #OWL #RDF by Pierre Lindenbaum.

From the post:

The Variation Ontology (VariO), “is an ontology for standardized, systematic description of effects, consequences and mechanisms of variations”.

In this post I will use the Apache Jena library for RDF to load this ontology. It will then be used to extract a set of variations that are a sub-class of a given class of Variation.
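A minimal sketch of that load-and-query step with Jena’s ontology API might look like the following (not Pierre’s actual code; the file name and class URI are placeholders, and package names differ between older com.hp.hpl.jena releases and current org.apache.jena ones):

```java
import org.apache.jena.ontology.OntClass;
import org.apache.jena.ontology.OntModel;
import org.apache.jena.ontology.OntModelSpec;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.util.iterator.ExtendedIterator;

public class VariOSubclasses {
    public static void main(String[] args) {
        // In-memory ontology model with no reasoner attached
        // (swap in OntModelSpec.OWL_MEM_MICRO_RULE_INF for inferred subclasses).
        OntModel model = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM);
        model.read("vario.owl"); // local path or URL of the Variation Ontology file (placeholder)

        // Look up a class of variation by its URI (hypothetical URI shown here).
        OntClass variation = model.getOntClass("http://purl.obolibrary.org/obo/VariO_0001");
        if (variation == null) {
            System.err.println("Class not found in the ontology");
            return;
        }

        // List every class asserted as a (direct or indirect) subclass of it.
        ExtendedIterator<OntClass> subs = variation.listSubClasses(false);
        while (subs.hasNext()) {
            OntClass sub = subs.next();
            System.out.println(sub.getURI() + "  " + sub.getLabel(null));
        }
    }
}
```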

If you are interested in this example, you may also be interested in the Variation Ontology.

The VariO homepage reports:

VariO allows

  • consistent naming
  • annotation of variation effects
  • data integration
  • comparison of variations and datasets
  • statistical studies
  • development of software tools

It isn’t clear, on a quick read, how VariO accomplishes:

  • data integration
  • comparison of variations and datasets

Unless it means that uniform recordation using VariO enables “data integration” and “comparison of variations and datasets”?

True, but what nomenclature, uniformly used, does not enable “data integration” and “comparison of variations and datasets”?

Is there one?

August 25, 2012

FragVLib a free database mining software for generating “Fragment-based Virtual Library” using pocket similarity…

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 6:13 pm

FragVLib a free database mining software for generating “Fragment-based Virtual Library” using pocket similarity search of ligand-receptor complexes by Raed Khashan. Journal of Cheminformatics 2012, 4:18 doi:10.1186/1758-2946-4-18.

Abstract:

Background

With the exponential increase in the number of available ligand-receptor complexes, researchers are becoming more dedicated to mine these complexes to facilitate the drug design and development process. Therefore, we present FragVLib, free software which is developed as a tool for performing similarity search across database(s) of ligand-receptor complexes for identifying binding pockets which are similar to that of a target receptor.

Results

The search is based on 3D-geometric and chemical similarity of the atoms forming the binding pocket. For each match identified, the ligand’s fragment(s) corresponding to that binding pocket are extracted, thus, forming a virtual library of fragments (FragVLib) that is useful for structure-based drug design.

Conclusions

An efficient algorithm is implemented in FragVLib to facilitate the pocket similarity search. The resulting fragments can be used for structure-based drug design tools such as Fragment-Based Lead Discovery (FBLD). They can also be used for finding bioisosteres and as an idea generator.

Suggestions of other uses of 3D-geometric shapes for similarity?

August 19, 2012

Bi-directional semantic similarity….

Filed under: Bioinformatics,Biomedical,Semantics,Similarity — Patrick Durusau @ 6:32 pm

Bi-directional semantic similarity for gene ontology to optimize biological and clinical analyses by Sang Jay Bien, Chan Hee Park, Hae Jin Shim, Woongcheol Yang, Jihun Kim and Ju Han Kim.

Abstract:

Background Semantic similarity analysis facilitates automated semantic explanations of biological and clinical data annotated by biomedical ontologies. Gene ontology (GO) has become one of the most important biomedical ontologies with a set of controlled vocabularies, providing rich semantic annotations for genes and molecular phenotypes for diseases. Current methods for measuring GO semantic similarities are limited to considering only the ancestor terms while neglecting the descendants. One can find many GO term pairs whose ancestors are identical but whose descendants are very different and vice versa. Moreover, the lower parts of GO trees are full of terms with more specific semantics.

Methods This study proposed a method of measuring semantic similarities between GO terms using the entire GO tree structure, including both the upper (ancestral) and the lower (descendant) parts. Comprehensive comparison studies were performed with well-known information content-based and graph structure-based semantic similarity measures with protein sequence similarities, gene expression-profile correlations, protein–protein interactions, and biological pathway analyses.

Conclusion The proposed bidirectional measure of semantic similarity outperformed other graph-based and information content-based methods.
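The core intuition, scoring a term pair by the overlap of both its ancestor set and its descendant set, can be sketched as a toy Jaccard-style mix over the GO DAG (an illustration of the idea only, not the authors’ formula):

```java
import java.util.*;

public class BidirectionalSimilarity {

    // parents.get(t) = direct parent terms of t in the ontology DAG;
    // children.get(t) = direct child terms of t.
    static Map<String, Set<String>> parents = new HashMap<>();
    static Map<String, Set<String>> children = new HashMap<>();

    // Collect all terms reachable from 'term' by repeatedly following 'edges'.
    static Set<String> closure(String term, Map<String, Set<String>> edges) {
        Set<String> seen = new HashSet<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(term);
        while (!stack.isEmpty()) {
            String t = stack.pop();
            for (String next : edges.getOrDefault(t, Set.of())) {
                if (seen.add(next)) stack.push(next);
            }
        }
        return seen;
    }

    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 0.0;
        Set<String> inter = new HashSet<>(a); inter.retainAll(b);
        Set<String> union = new HashSet<>(a); union.addAll(b);
        return (double) inter.size() / union.size();
    }

    // Weighted mix of upward (ancestor) and downward (descendant) overlap.
    static double similarity(String t1, String t2, double weight) {
        double up   = jaccard(closure(t1, parents),  closure(t2, parents));
        double down = jaccard(closure(t1, children), closure(t2, children));
        return weight * up + (1 - weight) * down;
    }
}
```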

Makes me curious: what has the experience with direction and identification been with other ontologies?

Concept Annotation in the CRAFT corpus

Filed under: Bioinformatics,Biomedical,Corpora,Natural Language Processing — Patrick Durusau @ 4:47 pm

Concept Annotation in the CRAFT corpus by Michael Bada, Miriam Eckert, Donald Evans, Kristin Garcia, Krista Shipley, Dmitry Sitnikov, William A. Baumgartner, K. Bretonnel Cohen, Karin Verspoor, Judith A. Blake and Lawrence E. Hunter. BMC Bioinformatics 2012, 13:161 doi:10.1186/1471-2105-13-161.

Abstract:

Background

Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text.

Results

This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement.

Conclusions

As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that comprise the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems. The corpus, annotation guidelines, and other associated resources are freely available at http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml.

Lessons on what it takes to create a “gold standard” corpus to advance NLP application development.

What do you think the odds are of “high inter[author] agreement” in the absence of such planning and effort?

Sorry, I meant “high interannotator agreement.”

Guess we have to plan for “low inter[author] agreement.”

Suggestions?

Gold Standard (or Bronze, Tin?)

A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools by Karin M Verspoor, Kevin B Cohen, Arrick Lanfranchi, Colin Warner, Helen L Johnson, Christophe Roeder, Jinho D Choi, Christopher Funk, Yuriy Malenkiy, Miriam Eckert, Nianwen Xue, William A Baumgartner, Michael Bada, Martha Palmer and Lawrence E Hunter. BMC Bioinformatics 2012, 13:207 doi:10.1186/1471-2105-13-207.

Abstract:

Background

We introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus. We further assess the performance of existing tools for performing sentence splitting, tokenization, syntactic parsing, and named entity recognition on this corpus.

Results

Many biomedical natural language processing systems demonstrated large differences between their previously published results and their performance on the CRAFT corpus when tested with the publicly available models or rule sets. Trainable systems differed widely with respect to their ability to build high-performing models based on this data.

Conclusions

The finding that some systems were able to train high-performing models based on this corpus is additional evidence, beyond high inter-annotator agreement, that the quality of the CRAFT corpus is high. The overall poor performance of various systems indicates that considerable work needs to be done to enable natural language processing systems to work well when the input is full-text journal articles. The CRAFT corpus provides a valuable resource to the biomedical natural language processing community for evaluation and training of new models for biomedical full text publications.

This is the article that I discovered and then worked my way to, starting from BioNLP.

Important as a deeply annotated text corpus.

But also a reminder that human annotators created the “gold standard,” against which other efforts are judged.

If you are ill, do you want gold standard research into the medical literature (which involves librarians)? Or is bronze or tin standard research good enough?

PS: I will be going back to pick up the other resources as appropriate.

CRAFT: THE COLORADO RICHLY ANNOTATED FULL TEXT CORPUS

Filed under: Bioinformatics,Biomedical,Corpora,Natural Language Processing — Patrick Durusau @ 3:41 pm

CRAFT: THE COLORADO RICHLY ANNOTATED FULL TEXT CORPUS

From the Quick Facts:

  • 67 full text articles
  • >560,000 Tokens
  • >21,000 Sentences
  • ~100,000 concept annotations to 7 different biomedical ontologies/terminologies
    • Chemical Entities of Biological Interest (ChEBI)
    • Cell Type Ontology (CL)
    • Entrez Gene
    • Gene Ontology (biological process, cellular component, and molecular function)
    • NCBI Taxonomy
    • Protein Ontology
    • Sequence Ontology
  • Penn Treebank markup for each sentence
  • Multiple output formats available

Let’s see: 67 articles resulted in 100,000 concept annotations, or about 1,493 per article for seven (7) ontologies/terminologies.

Ready to test this mapping out in your topic map application?

BioNLP-Corpora

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 2:46 pm

BioNLP-Corpora

From the webpage:

BioNLP-Corpora is a repository of biologically and linguistically annotated corpora and biological datasets.

It is one of the projects of the BioNLP initiative by the Center for Computational Pharmacology at the University of Colorado Denver Health Sciences Center to create and distribute code, software, and data for applying natural language processing techniques to biomedical texts.

There are many resources available for download at BioNLP-Corpora:

Like the guy says in the original Star Wars, “…almost there….”

In addition to being really useful resources, I am following a path that arose from the discovery of one resource.

One more website and then the article I found that led to all the BioNLP* resources.

BioNLP

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 2:39 pm

BioNLP

From the homepage (worth repeating in full):

BioNLP is an initiative by the Center for Computational Pharmacology at the University of Colorado Denver Health Sciences Center to create and distribute code, software, and data for applying natural language processing techniques to biomedical texts. There are many projects associated with BioNLP.

Projects

  • BioLemmatizer: a biomedical literature specific lemmatizer.
  • BioNLP-Corpora: a repository of biologically and linguistically annotated corpora and biomedical datasets. This project includes
    • Colorado Richly Annotated Full-Text Corpus (CRAFT)
    • PICorpus
    • GeneHomonym
    • Annotation Projects
    • MEDLINE Mining projects
    • Anaphora Corpus
    • TestSuite Corpora
  • BioNLP-UIMA: Unstructured Information Management Architecture (UIMA) components geared towards the use and evaluation of tools for biomedical natural language processing, including tools for our own OpenDMAP and MutationFinder use.
  • common: a library of utility code for common tasks
  • Knowtator: a Protege plug-in for text annotation.
  • medline-xml-parser: a code library containing an XML parser for the 2012 Medline XML distribution format
  • MutationFinder: an information extraction system for extracting descriptions of point mutations from free text.
  • OboAnalyzer: an analysis tool to detect OBO ontology terms that use different linguistic conventions for expressing similar semantics.
  • OpenDMAP: an ontology-driven, rule-based concept analysis and information extraction system
  • Parentheses Classifier: a classifier for the content of parenthesized text
  • Simple Semantic Classifier: a text classifier for OBO domains
  • uima-shims: a library of simple interfaces designed to facilitate the development of type-system-independent UIMA components

August 15, 2012

BiologicalNetworks

Filed under: Bioinformatics,Biomedical,Genome — Patrick Durusau @ 10:25 am

BiologicalNetworks

From the webpage:

BiologicalNetworks research environment enables integrative analysis of:

  • Interaction networks, metabolic and signaling pathways together with transcriptomic, metabolomic and proteomic experiments data
  • Transcriptional regulation modules (modular networks)
  • Genomic sequences including gene regulatory regions (e.g. binding sites, promoter regions) and respective transcription factors, as well as NGS data
  • Comparative genomics, homologous/orthologous genes and phylogenies
  • 3D protein structures and ligand binding, small molecules and drugs
  • Multiple ontologies including GeneOntology, Cell and Tissue types, Diseases, Anatomy and taxonomies

BiologicalNetworks backend database (IntegromeDB) integrates >1000 curated data sources (from the NAR list) for thousands of eukaryotic, prokaryotic and viral organisms and millions of public biomedical, biochemical, drug, disease and health-related web resources.

Correction: As of 3 July 2012, “IntegromeDB’s index reaches 1 Billion (biomedical resources links) milestone.”

IntegromeDB collects all the biomedical, biochemical, drug and disease related data available in the public domain and brings you the most relevant data for your search. It provides you with an integrative view on the genomic, proteomic, transcriptomic, genetic and functional information featuring gene/protein names, synonyms and alternative IDs, gene function, orthologies, gene expression, pathways and molecular (protein-protein, TF-gene, genetic, etc.) interactions, mutations and SNPs, disease relationships, drugs and compounds, and many other. Explore and enjoy!

Sounds a lot like a topic map, doesn’t it?

One interesting feature is its reporting of “Inconsistency in the integrated data.”

The data sets are available for download as RDF files.

How would you:

  • Improve the consistency of integrated data?
  • Enable crowd participation in curation of data?
  • Enable the integration of data files into other data systems?

August 13, 2012

Caleydo Project

Filed under: Bioinformatics,Graphics,Graphs,Networks,Visualization — Patrick Durusau @ 4:04 pm

Caleydo Project

From the webpage:

Caleydo is an open source visual analysis framework targeted at biomolecular data. The biggest strength of Caleydo is the visualization of interdependencies between multiple datasets. Caleydo can load tabular data and groupings/clusterings. You can explore relationships between multiple groupings, between different datasets and see how your data maps onto pathways.

Caleydo has been successfully used to analyze mRNA, miRNA, methylation, copy number variation, mutation status and clinical data as well as other dataset types.

The screenshot from mybiosoftware.com really caught my attention:

Caleydo Screenshot

Targets biomolecular data but may have broader applications.

August 12, 2012

Systematic benchmark of substructure search in molecular graphs – From Ullmann to VF2

Filed under: Algorithms,Bioinformatics,Graphs,Molecular Graphs — Patrick Durusau @ 1:11 pm

Systematic benchmark of substructure search in molecular graphs – From Ullmann to VF2 by Hans-Christian Ehrlich and Matthias Rarey. (Journal of Cheminformatics 2012, 4:13 doi:10.1186/1758-2946-4-13)

Abstract:

Background

Searching for substructures in molecules belongs to the most elementary tasks in cheminformatics and is nowadays part of virtually every cheminformatics software. The underlying algorithms, used over several decades, are designed for the application to general graphs. Applied on molecular graphs, little effort has been spent on characterizing their performance. Therefore, it is not clear how current substructure search algorithms behave on such special graphs. One of the main reasons why such an evaluation was not performed in the past was the absence of appropriate data sets.

Results

In this paper, we present a systematic evaluation of Ullmann’s and the VF2 subgraph isomorphism algorithms on molecular data. The benchmark set consists of a collection of 1236 SMARTS substructure expressions and selected molecules from the ZINC database. The benchmark evaluates substructures search times for complete database scans as well as individual substructure-molecule-pairs. In detail, we focus on the influence of substructure formulation and size, the impact of molecule size, and the ability of both algorithms to be used on multiple cores.

Conclusions

The results show a clear superiority of the VF2 algorithm in all test scenarios. In general, both algorithms solve most instances in less than one millisecond, which we consider to be acceptable. Still, in direct comparison, the VF2 is most often several folds faster than Ullmann’s algorithm. Additionally, Ullmann’s algorithm shows a surprising number of run time outliers.

Questions:

How do your graphs compare to molecular graphs? Similarities? Differences?

For searching molecular graphs, what algorithm does your software use for substructure searches?
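If you want to experiment with your own graphs, the idea underlying both Ullmann’s algorithm and VF2 is a depth-first search that extends a partial node mapping and backtracks on failure; the two differ mainly in how aggressively they prune candidates. A bare-bones matcher along those lines (labels only, none of the real pruning, so far slower than either) might look like this:

```java
import java.util.*;

public class SubgraphMatch {

    // A tiny labeled, undirected graph: node labels plus an adjacency list.
    static class Graph {
        List<String> labels = new ArrayList<>();
        List<Set<Integer>> adj = new ArrayList<>();
        int addNode(String label) {
            labels.add(label);
            adj.add(new HashSet<>());
            return labels.size() - 1;
        }
        void addEdge(int a, int b) { adj.get(a).add(b); adj.get(b).add(a); }
    }

    // Try to map every node of 'query' onto a distinct node of 'target'
    // with matching labels and with every query edge preserved.
    static boolean match(Graph query, Graph target) {
        return extend(query, target, new int[query.labels.size()], 0,
                      new boolean[target.labels.size()]);
    }

    private static boolean extend(Graph q, Graph t, int[] map, int next, boolean[] used) {
        if (next == q.labels.size()) return true;            // all query nodes mapped
        for (int cand = 0; cand < t.labels.size(); cand++) {
            if (used[cand] || !q.labels.get(next).equals(t.labels.get(cand))) continue;
            boolean ok = true;
            for (int prev = 0; prev < next; prev++) {         // check edges back to mapped nodes
                if (q.adj.get(next).contains(prev)
                        && !t.adj.get(cand).contains(map[prev])) { ok = false; break; }
            }
            if (!ok) continue;
            map[next] = cand; used[cand] = true;
            if (extend(q, t, map, next + 1, used)) return true;
            used[cand] = false;                               // backtrack
        }
        return false;
    }
}
```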

August 11, 2012

Neo4j and Bioinformatics

Filed under: Bio4j,Bioinformatics,Graphs,Neo4j — Patrick Durusau @ 6:01 pm

Neo4j and Bioinformatics

From the description:

Pablo Pareja will give an overview of the Bio4j project, and then move to some of its recent applications: BG7, a new system for bacterial genome annotation designed for NGS data; MG7, metagenomics + taxonomy integration; evolutionary studies, transcriptional networks, network analysis…

It may just be me, but the sound seems “faint.” Even when set to full volume, it is difficult to hear Pablo clearly.

I have tried this on two different computers with different OSes so I don’t think it is a problem on my end.

Your experience?

BTW, slides are here.

August 10, 2012

The Story Behind “Scaling Metagenome Assembly with Probabilistic de Bruijn Graphs”

Filed under: Bioinformatics,Biomedical,De Bruijn Graphs,Genome,Graphs — Patrick Durusau @ 3:11 pm

The Story Behind “Scaling Metagenome Assembly with Probabilistic de Bruijn Graphs” by C. Titus Brown.

From the post:

This is the story behind our PNAS paper, “Scaling Metagenome Assembly with Probabilistic de Bruijn Graphs” (released from embargo this past Monday).

Why did we write it? How did it get started? Well, rewind the tape 2 years and more…

There we were in May 2010, sitting on 500 million Illumina reads from shotgun DNA sequencing of an Iowa prairie soil sample. We wanted to reconstruct the microbial community contents and structure of the soil sample, but we couldn’t figure out how to do that from the data. We knew that, in theory, the data contained a number of partial microbial genomes, and we had a technique — de novo genome assembly — that could (again, in theory) reconstruct those partial genomes. But when we ran the software, it choked — 500 million reads was too big a data set for the software and computers we had. Plus, we were looking forward to the future, when we would get even more data; if the software was dying on us now, what would we do when we had 10, 100, or 1000 times as much data?

A perfect post to read over the weekend!

Not all research ends successfully, but when it does, it is a story that inspires.

Phenol-Explorer 2.0:… [Topic Maps As Search Templates]

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 12:31 pm

Phenol-Explorer 2.0: a major update of the Phenol-Explorer database integrating data on polyphenol metabolism and pharmacokinetics in humans and experimental animals by Joseph A. Rothwell, Mireia Urpi-Sarda, Maria Boto-Ordoñez, Craig Knox, Rafael Llorach, Roman Eisner, Joseph Cruz, Vanessa Neveu, David Wishart, Claudine Manach, Cristina Andres-Lacueva, and Augustin Scalbert.

Abstract:

Phenol-Explorer, launched in 2009, is the only comprehensive web-based database on the content in foods of polyphenols, a major class of food bioactives that receive considerable attention due to their role in the prevention of diseases. Polyphenols are rarely absorbed and excreted in their ingested forms, but extensively metabolized in the body, and until now, no database has allowed the recall of identities and concentrations of polyphenol metabolites in biofluids after the consumption of polyphenol-rich sources. Knowledge of these metabolites is essential in the planning of experiments whose aim is to elucidate the effects of polyphenols on health. Release 2.0 is the first major update of the database, allowing the rapid retrieval of data on the biotransformations and pharmacokinetics of dietary polyphenols. Data on 375 polyphenol metabolites identified in urine and plasma were collected from 236 peer-reviewed publications on polyphenol metabolism in humans and experimental animals and added to the database by means of an extended relational design. Pharmacokinetic parameters have been collected and can be retrieved in both tabular and graphical form. The web interface has been enhanced and now allows the filtering of information according to various criteria. Phenol-Explorer 2.0, which will be periodically updated, should prove to be an even more useful and capable resource for polyphenol scientists because bioactivities and health effects of polyphenols are dependent on the nature and concentrations of metabolites reaching the target tissues. The Phenol-Explorer database is publicly available and can be found online at http://www.phenol-explorer.eu.

I wanted to call your attention to Table 1: Search Strategy and Terms, step 4 which reads:

Polyphenol* or flavan* or flavon* or anthocyan* or isoflav* or phytoestrogen* or phyto-estrogen* or lignin* or stilbene* or chalcon* or phenolic acid* or ellagic* or coumarin* or hydroxycinnamic* or quercetin* or kaempferol* or rutin* or apigenin* or luteolin* or catechin* or epicatechin* or gallocatechin* or epigallocatechin* or procyanidin* or hesperetin* or naringenin* or cyanidin* or malvidin* or petunid* or peonid* or daidz* or genist* or glycit* or equol* or gallic* or vanillic* or chlorogenic* or tyrosol* or hydoxytyrosol* or resveratrol* or viniferin*

Which of these terms are synonyms for “tyrosol”?

No peeking!

Wikipedia (a generalist source) lists five (5) names, including tyrosol, and 5 different identifiers.

Common Chemistry, which you can access by the CAS number, has twenty-one (21) synonyms.

Ready?

Would you believe 0?

See for yourself: Wikipedia Tyrosol; Common Chemistry – CAS 501-94-0.

Another question: In one week (or even tomorrow), how much of the query in step 4 will you remember?

Some obvious comments:

  • The creators of Phenol-Explorer 2.0 have done a great service to the community by curating this data resource.
  • Creating comprehensive queries is a creative enterprise and not easy to duplicate.

Perhaps less obvious comments:

  • The terms in the query have synonyms, which is no great surprise.
  • If the terms were represented as topics in a topic map, synonyms could be captured for those terms.
  • Capturing of synonyms for terms would support expansion or contraction of search queries.
  • Capturing terms (and their synonyms) in a topic map, would permit merging of terms/synonyms from other researchers.

Final question: Have you thought about using topic maps as search templates?
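As a rough illustration of that final question, a search template could store each query term as a topic with its synonyms and expand into the literal query string on demand. The synonym lists below are invented for the example; in a real topic map they would be curated from sources such as Common Chemistry rather than hard-coded:

```java
import java.util.*;

public class SearchTemplate {

    // Each topic maps a preferred term to the set of names that identify the same subject.
    private final Map<String, Set<String>> topics = new LinkedHashMap<>();

    void addTopic(String preferred, String... synonyms) {
        Set<String> names = new LinkedHashSet<>();
        names.add(preferred);
        names.addAll(Arrays.asList(synonyms));
        topics.put(preferred, names);
    }

    // Expand every topic into "(name1 OR name2 ...)" and join the groups with OR,
    // mirroring the style of the Phenol-Explorer search strategy.
    String toQuery() {
        List<String> groups = new ArrayList<>();
        for (Set<String> names : topics.values()) {
            groups.add("(" + String.join(" OR ", names) + ")");
        }
        return String.join(" OR ", groups);
    }

    public static void main(String[] args) {
        SearchTemplate template = new SearchTemplate();
        // Hypothetical synonym sets for two of the query terms.
        template.addTopic("tyrosol*", "2-(4-hydroxyphenyl)ethanol", "4-hydroxyphenethyl alcohol");
        template.addTopic("resveratrol*", "trans-resveratrol");
        System.out.println(template.toQuery());
    }
}
```

Adding, removing or merging topics then regenerates the query, which is the point of treating the template as a map rather than a string.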

i2b2: Informatics for Integrating Biology and the Bedside

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 4:43 am

i2b2: Informatics for Integrating Biology and the Bedside

I discovered this site while chasing down a coreference resolution workshop. From the homepage:

Informatics for Integrating Biology and the Bedside (i2b2) is an NIH-funded National Center for Biomedical Computing (NCBC) based at Partners HealthCare System in Boston, Mass. Established in 2004 in response to an NIH Roadmap Initiative RFA, this NCBC is one of four national centers awarded in this first competition (http://www.bisti.nih.gov/ncbc/); currently there are seven NCBCs. One of 12 specific initiatives in the New Pathways to Discovery Cluster, the NCBCs will initiate the development of a national computational infrastructure for biomedical computing. The NCBCs and related R01s constitute the National Program of Excellence in Biomedical Computing.

The i2b2 Center, led by Director Isaac Kohane, M.D., Ph.D., Professor of Pediatrics at Harvard Medical School at Children’s Hospital Boston, is comprised of seven cores involving investigators from the Harvard-affiliated hospitals, MIT, Harvard School of Public Health, Joslin Diabetes Center, Harvard Medical School and the Harvard/MIT Division of Health Sciences and Technology. This Center is funded under a Cooperative agreement with the National Institutes of Health.

The i2b2 Center is developing a scalable computational framework to address the bottleneck limiting the translation of genomic findings and hypotheses in model systems relevant to human health. New computational paradigms (Core 1) and methodologies (Cores 2) are being developed and tested in several diseases (airways disease, hypertension, type 2 diabetes mellitus, Huntington’s Disease, rheumatoid arthritis, and major depressive disorder) (Core 3 Driving Biological Projects).

The i2b2 Center (Core 5) offers a Summer Institute in Bioinformatics and Integrative Genomics for qualified undergraduate students, supports an Academic Users’ Group of over 125 members, sponsors annual Shared Tasks for Challenges in Natural Language Processing for Clinical Data, distributes an NLP DataSet for research purpose, and sponsors regular Symposia and Workshops for the community.

Sounds like prime hunting grounds for vocabularies that cross disciplinary boundaries and the like.

Extensive resources. Will explore and report back.

August 9, 2012

The Cell: An Image Library

Filed under: Bioinformatics,Biomedical,Data Source,Medical Informatics — Patrick Durusau @ 3:50 pm

The Cell: An Image Library

For the casual user, an impressive collection of cell images.

For the professional user, the advanced search page gives you an idea of the depth of images in this collection.

A good source of images for curated (not “mash up”) alignment with other materials, such as instructional resources on biology or medicine.

August 8, 2012

The 2012 Nucleic Acids Research Database Issue…

Filed under: Bioinformatics,Biomedical,Genome — Patrick Durusau @ 1:50 pm

The 2012 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection by Michael Y. Galperin, and Xosé M. Fernández-Suárez.

Abstract:

The 19th annual Database Issue of Nucleic Acids Research features descriptions of 92 new online databases covering various areas of molecular biology and 100 papers describing recent updates to the databases previously described in NAR and other journals. The highlights of this issue include, among others, a description of neXtProt, a knowledgebase on human proteins; a detailed explanation of the principles behind the NCBI Taxonomy Database; NCBI and EBI papers on the recently launched BioSample databases that store sample information for a variety of database resources; descriptions of the recent developments in the Gene Ontology and UniProt Gene Ontology Annotation projects; updates on Pfam, SMART and InterPro domain databases; update papers on KEGG and TAIR, two universally acclaimed databases that face an uncertain future; and a separate section with 10 wiki-based databases, introduced in an accompanying editorial. The NAR online Molecular Biology Database Collection, available at http://www.oxfordjournals.org/nar/database/a/, has been updated and now lists 1380 databases. Brief machine-readable descriptions of the databases featured in this issue, according to the BioDBcore standards, will be provided at the http://biosharing.org/biodbcore web site. The full content of the Database Issue is freely available online on the Nucleic Acids Research web site (http://nar.oxfordjournals.org/).

This is the abstract of the article describing the Nucleic Acids Research Database Issue, Volume 40, Issue D1, January 2012.

Very much like being a kid in a candy store. Hard to know what to look at next! Both for subject matter experts and those of us interested in the technology aspects of the databases.

ANNOVAR: functional annotation of genetic variants….

Filed under: Bioinformatics,Biomedical,Genome — Patrick Durusau @ 1:50 pm

ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data by Kai Wang, Mingyao Li, and Hakon Hakonarson. (Nucl. Acids Res. (2010) 38 (16): e164. doi: 10.1093/nar/gkq603)

Just in case you are unfamiliar with ANNOVAR, the software mentioned in gSearch: a fast and flexible general search tool for whole-genome sequencing:

Abstract:

High-throughput sequencing platforms are generating massive amounts of genetic variation data for diverse genomes, but it remains a challenge to pinpoint a small subset of functionally important variants. To fill these unmet needs, we developed the ANNOVAR tool to annotate single nucleotide variants (SNVs) and insertions/deletions, such as examining their functional consequence on genes, inferring cytogenetic bands, reporting functional importance scores, finding variants in conserved regions, or identifying variants reported in the 1000 Genomes Project and dbSNP. ANNOVAR can utilize annotation databases from the UCSC Genome Browser or any annotation data set conforming to Generic Feature Format version 3 (GFF3). We also illustrate a ‘variants reduction’ protocol on 4.7 million SNVs and indels from a human genome, including two causal mutations for Miller syndrome, a rare recessive disease. Through a stepwise procedure, we excluded variants that are unlikely to be causal, and identified 20 candidate genes including the causal gene. Using a desktop computer, ANNOVAR requires ∼4 min to perform gene-based annotation and ∼15 min to perform variants reduction on 4.7 million variants, making it practical to handle hundreds of human genomes in a day. ANNOVAR is freely available at http://www.openbioinformatics.org/annovar/.

Approximately two years separate ANNOVAR from gSearch. That should give you an idea of the speed of development in bioinformatics. They haven’t labored over finding a syntax for everyone to use for more than a decade. I suspect there is a lesson in there somewhere.

gSearch: a fast and flexible general search tool for whole-genome sequencing

Filed under: Bioinformatics,Biomedical,Genome — Patrick Durusau @ 1:49 pm

gSearch: a fast and flexible general search tool for whole-genome sequencing by Taemin Song, Kyu-Baek Hwang, Michael Hsing, Kyungjoon Lee, Justin Bohn, and Sek Won Kong.

Abstract:

Background: Various processes such as annotation and filtering of variants or comparison of variants in different genomes are required in whole-genome or exome analysis pipelines. However, processing different databases and searching among millions of genomic loci is not trivial.

Results: gSearch compares sequence variants in the Genome Variation Format (GVF) or Variant Call Format (VCF) with a pre-compiled annotation or with variants in other genomes. Its search algorithms are subsequently optimized and implemented in a multi-threaded manner. The proposed method is not a stand-alone annotation tool with its own reference databases. Rather, it is a search utility that readily accepts public or user-prepared reference files in various formats including GVF, Generic Feature Format version 3 (GFF3), Gene Transfer Format (GTF), VCF and Browser Extensible Data (BED) format. Compared to existing tools such as ANNOVAR, gSearch runs more than 10 times faster. For example, it is capable of annotating 52.8 million variants with allele frequencies in 6 min.

Availability: gSearch is available at http://ml.ssu.ac.kr/gSearch. It can be used as an independent search tool or can easily be integrated to existing pipelines through various programming environments such as Perl, Ruby and Python.

As the abstract says: “…searching among millions of genomic loci is not trivial.”

Either for integration with topic map tools in a pipeline or for searching technology, definitely worth a close reading.
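To see why the search is non-trivial at scale, consider the elementary operation behind variant annotation: deciding whether a position falls inside any annotated interval. A toy version with sorted, non-overlapping intervals and binary search is shown below; it is my own illustration of the problem, not gSearch’s algorithm, which also has to handle overlapping features, many file formats and multi-threading:

```java
import java.util.Arrays;

public class IntervalLookup {

    // Non-overlapping annotation intervals on one chromosome, sorted by start.
    private final int[] starts;
    private final int[] ends;

    IntervalLookup(int[] starts, int[] ends) {
        this.starts = starts;
        this.ends = ends;
    }

    // Binary-search the rightmost interval whose start is <= pos, then check its end.
    boolean covers(int pos) {
        int idx = Arrays.binarySearch(starts, pos);
        if (idx < 0) idx = -idx - 2;            // index of the last start <= pos
        return idx >= 0 && pos <= ends[idx];
    }

    public static void main(String[] args) {
        IntervalLookup exons = new IntervalLookup(
                new int[]{100, 500, 900},
                new int[]{199, 650, 950});
        System.out.println(exons.covers(120));  // true
        System.out.println(exons.covers(700));  // false
    }
}
```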

BioContext: an integrated text mining system…

Filed under: Bioinformatics,Biomedical,Entity Extraction,Text Mining — Patrick Durusau @ 1:49 pm

BioContext: an integrated text mining system for large-scale extraction and contextualization of biomolecular events by Martin Gerner, Farzaneh Sarafraz, Casey M. Bergman, and Goran Nenadic. (Bioinformatics (2012) 28 (16): 2154-2161. doi: 10.1093/bioinformatics/bts332)

Abstract:

Motivation: Although the amount of data in biology is rapidly increasing, critical information for understanding biological events like phosphorylation or gene expression remains locked in the biomedical literature. Most current text mining (TM) approaches to extract information about biological events are focused on either limited-scale studies and/or abstracts, with data extracted lacking context and rarely available to support further research.

Results: Here we present BioContext, an integrated TM system which extracts, extends and integrates results from a number of tools performing entity recognition, biomolecular event extraction and contextualization. Application of our system to 10.9 million MEDLINE abstracts and 234 000 open-access full-text articles from PubMed Central yielded over 36 million mentions representing 11.4 million distinct events. Event participants included over 290 000 distinct genes/proteins that are mentioned more than 80 million times and linked where possible to Entrez Gene identifiers. Over a third of events contain contextual information such as the anatomical location of the event occurrence or whether the event is reported as negated or speculative.

Availability: The BioContext pipeline is available for download (under the BSD license) at http://www.biocontext.org, along with the extracted data which is also available for online browsing.

If you are interested in text mining by professionals, this is a good place to start.

Should be of particular interest to anyone interested in mining literature for construction of a topic map.

August 7, 2012

RESQUE: Network reduction…. [Are you listening NSA?]

Filed under: Bioinformatics,Graphs,Networks — Patrick Durusau @ 4:23 pm

RESQUE: Network reduction using semi-Markov random walk scores for efficient querying of biological networks by Sayed Mohammad Ebrahim Sahraeian and Byung-Jun Yoon. (Bioinformatics (2012) 28 (16): 2129-2136. doi: 10.1093/bioinformatics/bts341)

Abstract:

Motivation: Recent technological advances in measuring molecular interactions have resulted in an increasing number of large-scale biological networks. Translation of these enormous network data into meaningful biological insights requires efficient computational techniques that can unearth the biological information that is encoded in the networks. One such example is network querying, which aims to identify similar subnetwork regions in a large target network that are similar to a given query network. Network querying tools can be used to identify novel biological pathways that are homologous to known pathways, thereby enabling knowledge transfer across different organisms.

Results: In this article, we introduce an efficient algorithm for querying large-scale biological networks, called RESQUE. The proposed algorithm adopts a semi-Markov random walk (SMRW) model to probabilistically estimate the correspondence scores between nodes that belong to different networks. The target network is iteratively reduced based on the estimated correspondence scores, which are also iteratively re-estimated to improve accuracy until the best matching subnetwork emerges. We demonstrate that the proposed network querying scheme is computationally efficient, can handle any network query with an arbitrary topology and yields accurate querying results.

Availability: The source code of RESQUE is freely available at http://www.ece.tamu.edu/~bjyoon/RESQUE/

If you promise not to tell, you can get a preprint version of the article at the source code link.

RESQUE: REduction-based scheme using Semi-Markov scores for networkQUEing.

Sounds like a starting point if you are interested in the Visualize This! (NSA Network Visualization Contest).


If you know of anyone looking for a tele-commuting researcher who finds interesting things, pass my name along. Thanks!

August 5, 2012

Journal of the American Medical Informatics Association (JAMIA)

Filed under: Bioinformatics,Informatics,Medical Informatics,Pathology Informatics — Patrick Durusau @ 10:53 am

Journal of the American Medical Informatics Association (JAMIA)

Aims and Scope

JAMIA is AMIA‘s premier peer-reviewed journal for biomedical and health informatics. Covering the full spectrum of activities in the field, JAMIA includes informatics articles in the areas of clinical care, clinical research, translational science, implementation science, imaging, education, consumer health, public health, and policy. JAMIA’s articles describe innovative informatics research and systems that help to advance biomedical science and to promote health. Case reports, perspectives and reviews also help readers stay connected with the most important informatics developments in implementation, policy and education.

Another informatics journal to whitelist for searching.

Content is freely available after twelve (12) months.

Cancer, NLP & Kaiser Permanente Southern California (KPSC)

Filed under: Bioinformatics,Medical Informatics,Pathology Informatics,Uncategorized — Patrick Durusau @ 10:38 am

Kaiser Permanente Southern California (KPSC) deserves high marks for the research in:

Identifying primary and recurrent cancers using a SAS-based natural language processing algorithm by Justin A Strauss, et al.

Abstract:

Objective Significant limitations exist in the timely and complete identification of primary and recurrent cancers for clinical and epidemiologic research. A SAS-based coding, extraction, and nomenclature tool (SCENT) was developed to address this problem.

Materials and methods SCENT employs hierarchical classification rules to identify and extract information from electronic pathology reports. Reports are analyzed and coded using a dictionary of clinical concepts and associated SNOMED codes. To assess the accuracy of SCENT, validation was conducted using manual review of pathology reports from a random sample of 400 breast and 400 prostate cancer patients diagnosed at Kaiser Permanente Southern California. Trained abstractors classified the malignancy status of each report.

Results Classifications of SCENT were highly concordant with those of abstractors, achieving κ of 0.96 and 0.95 in the breast and prostate cancer groups, respectively. SCENT identified 51 of 54 new primary and 60 of 61 recurrent cancer cases across both groups, with only three false positives in 792 true benign cases. Measures of sensitivity, specificity, positive predictive value, and negative predictive value exceeded 94% in both cancer groups.

Discussion Favorable validation results suggest that SCENT can be used to identify, extract, and code information from pathology report text. Consequently, SCENT has wide applicability in research and clinical care. Further assessment will be needed to validate performance with other clinical text sources, particularly those with greater linguistic variability.

Conclusion SCENT is proof of concept for SAS-based natural language processing applications that can be easily shared between institutions and used to support clinical and epidemiologic research.

Before I forget:

Data sharing statement SCENT is freely available for non-commercial use and modification. Program source code and requisite support files may be downloaded from: http://www.kp-scalresearch.org/research/tools_scent.aspx

Topic map promotion point: the application was built to account for linguistic variability, not to stamp it out.

Tools built to fit users are more likely to succeed, don’t you think?
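The dictionary-lookup step at the heart of such a system is easy to sketch outside of SAS. The phrases and codes below are invented placeholders; SCENT itself adds tokenization, negation handling and the hierarchical classification rules described in the paper:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ConceptTagger {

    // Toy dictionary: lower-cased phrase -> concept code (codes here are placeholders).
    private static final Map<String, String> DICTIONARY = new LinkedHashMap<>();
    static {
        DICTIONARY.put("invasive ductal carcinoma", "SNOMED:EXAMPLE-001");
        DICTIONARY.put("adenocarcinoma of prostate", "SNOMED:EXAMPLE-002");
        DICTIONARY.put("benign", "SNOMED:EXAMPLE-003");
    }

    // Report every dictionary phrase found in the report text.
    static Map<String, String> tag(String reportText) {
        String text = reportText.toLowerCase();
        Map<String, String> hits = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : DICTIONARY.entrySet()) {
            if (text.contains(entry.getKey())) {
                hits.put(entry.getKey(), entry.getValue());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(tag("Final diagnosis: Invasive ductal carcinoma, left breast."));
    }
}
```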

Journal of Pathology Informatics (JPI)

Filed under: Bioinformatics,Biomedical,Medical Informatics,Pathology Informatics — Patrick Durusau @ 10:09 am

Journal of Pathology Informatics (JPI)

About:

The Journal of Pathology Informatics (JPI) is an open access peer-reviewed journal dedicated to the advancement of pathology informatics. This is the official journal of the Association for Pathology Informatics (API). The journal aims to publish broadly about pathology informatics and freely disseminate all articles worldwide. This journal is of interest to pathologists, informaticians, academics, researchers, health IT specialists, information officers, IT staff, vendors, and anyone with an interest in informatics. We encourage submissions from anyone with an interest in the field of pathology informatics. We publish all types of papers related to pathology informatics including original research articles, technical notes, reviews, viewpoints, commentaries, editorials, book reviews, and correspondence to the editors. All submissions are subject to peer review by the well-regarded editorial board and by expert referees in appropriate specialties.

Another site to add to your whitelist of sites to search for informatics information.

August 3, 2012

De novo assembly and genotyping of variants using colored de Bruijn graphs

Filed under: Bioinformatics,De Bruijn Graphs,Genome,Graphs,Networks — Patrick Durusau @ 2:06 pm

De novo assembly and genotyping of variants using colored de Bruijn graphs by Zamin Iqbal, Mario Caccamo, Isaac Turner, Paul Flicek & Gil McVean. (Nature Genetics 44, 226–232 (2012))

Abstract:

Detecting genetic variants that are highly divergent from a reference sequence remains a major challenge in genome sequencing. We introduce de novo assembly algorithms using colored de Bruijn graphs for detecting and genotyping simple and complex genetic variants in an individual or population. We provide an efficient software implementation, Cortex, the first de novo assembler capable of assembling multiple eukaryotic genomes simultaneously. Four applications of Cortex are presented. First, we detect and validate both simple and complex structural variations in a high-coverage human genome. Second, we identify more than 3 Mb of sequence absent from the human reference genome, in pooled low-coverage population sequence data from the 1000 Genomes Project. Third, we show how population information from ten chimpanzees enables accurate variant calls without a reference sequence. Last, we estimate classical human leukocyte antigen (HLA) genotypes at HLA-B, the most variable gene in the human genome.

You will need access to Nature Genetics, but this rounds out today’s posts on de Bruijn graphs with a recent research article.

Comments on the Cortex software appreciated.

Genome assembly and comparison using de Bruijn graphs

Filed under: Bioinformatics,De Bruijn Graphs,Genome,Graphs,Networks — Patrick Durusau @ 10:41 am

Genome assembly and comparison using de Bruijn graphs by Daniel Robert Zerbino. (thesis)

Abstract:

Recent advances in sequencing technology made it possible to generate vast amounts of sequence data. The fragments produced by these high-throughput methods are, however, far shorter than in traditional Sanger sequencing. Previously, micro-reads of less than 50 base pairs were considered useful only in the presence of an existing assembly. This thesis describes solutions for assembling short read sequencing data de novo, in the absence of a reference genome.

The algorithms developed here are based on the de Bruijn graph. This data structure is highly suitable for the assembly and comparison of genomes for the following reasons. It provides a flexible tool to handle the sequence variants commonly found in genome evolution such as duplications, inversions or transpositions. In addition, it can combine sequences of highly different lengths, from short reads to assembled genomes. Finally, it ensures an effective data compression of highly redundant datasets.

This thesis presents the development of a collection of methods, called Velvet, to convert a de Bruijn graph into a traditional assembly of contiguous sequences. The first step of the process, termed Tour Bus, removes sequencing errors and handles biological variations such as polymorphisms. In its second part, Velvet aims to resolve repeats based on the available information, from low coverage long reads (Rock Band) or paired shotgun reads (Pebble). These methods were tested on various simulations for precision and efficiency, then on control experimental datasets.

De Bruijn graphs can also be used to detect and analyse structural variants from unassembled data. The final chapter of this thesis presents the results of collaborative work on the analysis of several experimental unassembled datasets.

De Bruijn graphs are covered on pages 22-42 if you want to cut to the chase.

Obviously of interest to the bioinformatics community.
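If the data structure itself is unfamiliar, the basic construction is short: cut every read into overlapping k-mers, make each k-mer an edge from its length k-1 prefix to its length k-1 suffix, and assembly becomes a matter of walking paths in the resulting graph. A toy construction (fixed k, no error correction or read pairing):

```java
import java.util.*;

public class DeBruijn {

    // Build a de Bruijn graph: nodes are (k-1)-mers, edges connect the prefix
    // of each k-mer to its suffix. Returns the adjacency map.
    static Map<String, List<String>> build(List<String> reads, int k) {
        Map<String, List<String>> graph = new HashMap<>();
        for (String read : reads) {
            for (int i = 0; i + k <= read.length(); i++) {
                String kmer = read.substring(i, i + k);
                String prefix = kmer.substring(0, k - 1);
                String suffix = kmer.substring(1);
                graph.computeIfAbsent(prefix, key -> new ArrayList<>()).add(suffix);
            }
        }
        return graph;
    }

    public static void main(String[] args) {
        // Two overlapping "reads"; with k = 4 their subgraphs share nodes and merge.
        Map<String, List<String>> g = build(List.of("ACGTACG", "GTACGTT"), 4);
        g.forEach((node, next) -> System.out.println(node + " -> " + next));
    }
}
```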

Where else would you use de Bruijn graph structures?

Is “Massive Data” > “Big Data”?

Filed under: BigData,Bioinformatics,Genome,Graphs,Networks — Patrick Durusau @ 8:32 am

Science News announces: New Computational Technique Relieves Logjam from Massive Amounts of Data, which is a better title than: Scaling metagenome sequence assembly with probabilistic de Bruijn graphs by Jason Pell, Arend Hintze, Rosangela Canino-Koning, Adina Howe, James M. Tiedje, and C. Titus Brown.

But I have to wonder about “massive data,” versus “big data,” versus “really big data,” versus “massive data streams,” as informative phrases. True, I have a weakness for an eye-catching headline, but in prose shouldn’t we say what data is under consideration and let the readers draw their own conclusions?

The paper abstract reads:

Deep sequencing has enabled the investigation of a wide range of environmental microbial ecosystems, but the high memory requirements for de novo assembly of short-read shotgun sequencing data from these complex populations are an increasingly large practical barrier. Here we introduce a memory-efficient graph representation with which we can analyze the k-mer connectivity of metagenomic samples. The graph representation is based on a probabilistic data structure, a Bloom filter, that allows us to efficiently store assembly graphs in as little as 4 bits per k-mer, albeit inexactly. We show that this data structure accurately represents DNA assembly graphs in low memory. We apply this data structure to the problem of partitioning assembly graphs into components as a prelude to assembly, and show that this reduces the overall memory requirements for de novo assembly of metagenomes. On one soil metagenome assembly, this approach achieves a nearly 40-fold decrease in the maximum memory requirements for assembly. This probabilistic graph representation is a significant theoretical advance in storing assembly graphs and also yields immediate leverage on metagenomic assembly.
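The memory saving comes from storing k-mer membership in a Bloom filter rather than an exact table, trading a small false-positive rate for a much smaller footprint. A stripped-down version (sizes and hash choices here are arbitrary, unlike the carefully tuned ones in the paper):

```java
import java.util.BitSet;

public class KmerBloomFilter {

    private final BitSet bits;
    private final int size;
    private final int numHashes;

    KmerBloomFilter(int size, int numHashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashes = numHashes;
    }

    // Derive several positions from two base hashes (double hashing).
    private int position(String kmer, int i) {
        int h1 = kmer.hashCode();
        int h2 = kmer.hashCode() * 31 + 17;   // crude second hash, fine for a demo
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String kmer) {
        for (int i = 0; i < numHashes; i++) bits.set(position(kmer, i));
    }

    // May return a false positive, never a false negative.
    boolean mightContain(String kmer) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(kmer, i))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        KmerBloomFilter filter = new KmerBloomFilter(1 << 20, 4);
        filter.add("ACGTACGTACGTACGTACGT");
        System.out.println(filter.mightContain("ACGTACGTACGTACGTACGT")); // true
        System.out.println(filter.mightContain("TTTTTTTTTTTTTTTTTTTT")); // almost surely false
    }
}
```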

If “de Bruijn graphs” sounds familiar, see: Memory Efficient De Bruijn Graph Construction [Attn: Graph Builders, Chess Anyone?].

August 2, 2012

Processing Rat Brain Neuronal Signals Using a Hadoop Computing Cluster – Part III

Filed under: Bioinformatics,Biomedical,Hadoop,MapReduce — Patrick Durusau @ 9:23 pm

Processing Rat Brain Neuronal Signals Using a Hadoop Computing Cluster – Part III by Jadin C. Jackson, PhD & Bradley S. Rubin, PhD.

From the post:

Up to this point, we’ve described our reasons for using Hadoop and Hive on our neural recordings (Part I), the reasons why the analyses of these recordings are interesting from a scientific perspective, and detailed descriptions of our implementation of these analyses using Hadoop and Hive (Part II). The last part of this story cuts straight to the results and then discusses important lessons we learned along the way and future goals for improving the analysis framework we’ve built so far.

Biomedical researchers will be interested in the results, but I am more interested in the observation that Hadoop makes it possible to retain results for ad hoc analysis.

Community Based Annotation (mapping?)

Filed under: Annotation,Bioinformatics,Biomedical,Interface Research/Design,Ontology — Patrick Durusau @ 1:51 pm

Enabling authors to annotate their articles is examined in: Assessment of community-submitted ontology annotations from a novel database-journal partnership by Tanya Z. Berardini, Donghui Li, Robert Muller, Raymond Chetty, Larry Ploetz, Shanker Singh, April Wensel and Eva Huala.

Abstract:

As the scientific literature grows, leading to an increasing volume of published experimental data, so does the need to access and analyze this data using computational tools. The most commonly used method to convert published experimental data on gene function into controlled vocabulary annotations relies on a professional curator, employed by a model organism database or a more general resource such as UniProt, to read published articles and compose annotation statements based on the articles’ contents. A more cost-effective and scalable approach capable of capturing gene function data across the whole range of biological research organisms in computable form is urgently needed.

We have analyzed a set of ontology annotations generated through collaborations between the Arabidopsis Information Resource and several plant science journals. Analysis of the submissions entered using the online submission tool shows that most community annotations were well supported and the ontology terms chosen were at an appropriate level of specificity. Of the 503 individual annotations that were submitted, 97% were approved and community submissions captured 72% of all possible annotations. This new method for capturing experimental results in a computable form provides a cost-effective way to greatly increase the available body of annotations without sacrificing annotation quality.

It is encouraging that this annotation effort started with the persons most likely to know the correct answers, authors of the papers in question.

The low initial participation rate (16%) and the improved rate after an email reminder (53%) were less encouraging.

I suspect that unless and until prior annotation practice (by researchers) becomes a line item on funding requests (how many annotations were accepted by publishers of your prior research?), we will continue to see annotation treated as a low-priority item.

Perhaps I should suggest that as a study area for the NIH?

Publishers, researchers who build annotation software, annotated data sources and their maintainers, are all likely to be interested.

Would you be interested as well?

August 1, 2012

Processing Rat Brain Neuronal Signals Using a Hadoop Computing Cluster – Part II

Filed under: Bioinformatics,Biomedical,Hadoop,Signal Processing — Patrick Durusau @ 7:19 pm

Processing Rat Brain Neuronal Signals Using a Hadoop Computing Cluster – Part II by Jadin C. Jackson, PhD & Bradley S. Rubin, PhD.

From the post:

As mentioned in Part I, although Hadoop and other Big Data technologies are typically applied to I/O intensive workloads, where parallel data channels dramatically increase I/O throughput, there is growing interest in applying these technologies to CPU intensive workloads. In this work, we used Hadoop and Hive to digitally signal process individual neuron voltage signals captured from electrodes embedded in the rat brain. Previously, this processing was performed on a single Matlab workstation, a workload that was both CPU intensive and data intensive, especially for intermediate output data. With Hadoop/Hive, we were not only able to apply parallelism to the various processing steps, but had the additional benefit of having all the data online for additional ad hoc analysis. Here, we describe the technical details of our implementation, including the biological relevance of the neural signals and analysis parameters. In Part III, we will then describe the tradeoffs between the Matlab and Hadoop/Hive approach, performance results, and several issues identified with using Hadoop/Hive in this type of application.

Details of the setup for processing rat brain signals with Hadoop.
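For readers who have not written a Hadoop job, the general shape of such a pipeline is a mapper that keys each sample by electrode and a reducer that computes a per-electrode statistic. The skeleton below is my own illustration, not the authors’ code, and assumes a simple “electrode,timestamp,voltage” text input:

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ElectrodeStats {

    // Emit (electrode id, voltage) for every input line.
    public static class SampleMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");   // electrode,timestamp,voltage
            if (fields.length < 3) return;                    // skip malformed lines
            context.write(new Text(fields[0]),
                          new DoubleWritable(Double.parseDouble(fields[2])));
        }
    }

    // Average the voltages seen for each electrode.
    public static class MeanReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double sum = 0;
            long count = 0;
            for (DoubleWritable v : values) { sum += v.get(); count++; }
            if (count > 0) context.write(key, new DoubleWritable(sum / count));
        }
    }
}
```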

Looking back, I did not see any mention of data sets. Perhaps in Part III?
