Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

October 16, 2013

Announcing BioCoder

Filed under: Bioinformatics,Biology,Genomics — Patrick Durusau @ 5:08 pm

Announcing BioCoder by Mike Loukides.

From the post:

We’re pleased to announce BioCoder, a newsletter on the rapidly expanding field of biology. We’re focusing on DIY bio and synthetic biology, but we’re open to anything that’s interesting.

Why biology? Why now? Biology is currently going through a revolution as radical as the personal computer revolution. Up until the mid-70s, computing was dominated by large, extremely expensive machines that were installed in special rooms and operated by people wearing white lab coats. Programming was the domain of professionals. That changed radically with the advent of microprocessors, the homebrew computer club, and the first generation of personal computers. I put the beginning of the shift in 1975, when a friend of mine built a computer in his dorm room. But whenever it started, the phase transition was thorough and radical. We’ve built a new economy around computing: we’ve seen several startups become gigantic enterprises, and we’ve seen several giants collapse because they couldn’t compete with the more nimble startups.

Bioinformatics and amateur genome exploration are a growing hobby area. Yes, a hobby area.

For background, see: Playing with genes by David Smith.

Your bioinformatics skills, which you learned for cross-over use in other fields, could come in handy.

A couple of resources to get you started:

DIYgenomics

DIY Genomics

Seems like a ripe field for mining and organization.

There is no publication date set on Weaponized Viruses in a Nutshell.

September 26, 2013

Computational Chemogenomics

Filed under: Bioinformatics,Biomedical,Genomics — Patrick Durusau @ 11:00 am

Computational Chemogenomics by Edgar Jacoby (Novartis Pharma AG, Switzerland).

Description:

In the post-genomic era, one of the key challenges for drug discovery consists in making optimal use of comprehensive genomic data to identify effective new medicines. Chemogenomics addresses this challenge and aims to systematically identify all ligands and modulators for all gene products expressed, besides allowing accelerated exploration of their biological function.

Computational chemogenomics focuses on applications of compound library design and virtual screening to expand the bioactive chemical space, to target hopping of chemotypes to identify synergies within related drug discovery projects or to repurpose known drugs, to propose mechanisms of action of compounds, and to identify off-target effects by cross-reactivity analysis.

Both ligand-based and structure-based in silico approaches, as reviewed in this book, play important roles in all these applications. Computational chemogenomics is expected to increase the quality and productivity of drug discovery and lead to the discovery of new medicines.

If you are on the cutting edge of bioinformatics or want to keep up with the cutting edge in bioinformatics, this is a volume to consider.

The hard copy price is $149.95, so it may be a while before I acquire a copy.

September 24, 2013

Rumors of Legends (the TMRM kind?)

Filed under: Bioinformatics,Biomedical,Legends,Semantics,TMRM,XML — Patrick Durusau @ 3:42 pm

BioC: a minimalist approach to interoperability for biomedical text processing (numerous authors, see the article).

Abstract:

A vast amount of scientific information is encoded in natural language text, and the quantity of such text has become so great that it is no longer economically feasible to have a human as the first step in the search process. Natural language processing and text mining tools have become essential to facilitate the search for and extraction of information from text. This has led to vigorous research efforts to create useful tools and to create humanly labeled text corpora, which can be used to improve such tools. To encourage combining these efforts into larger, more powerful and more capable systems, a common interchange format to represent, store and exchange the data in a simple manner between different language processing systems and text mining tools is highly desirable. Here we propose a simple extensible mark-up language format to share text documents and annotations. The proposed annotation approach allows a large number of different annotations to be represented including sentences, tokens, parts of speech, named entities such as genes or diseases and relationships between named entities. In addition, we provide simple code to hold this data, read it from and write it back to extensible mark-up language files and perform some sample processing. We also describe completed as well as ongoing work to apply the approach in several directions. Code and data are available at http://bioc.sourceforge.net/.

From the introduction:

With the proliferation of natural language text, text mining has emerged as an important research area. As a result many researchers are developing natural language processing (NLP) and information retrieval tools for text mining purposes. However, while the capabilities and the quality of tools continue to grow, it remains challenging to combine these into more complex systems. Every new generation of researchers creates their own software specific to their research, their environment and the format of the data they study; possibly due to the fact that this is the path requiring the least labor. However, with every new cycle restarting in this manner, the sophistication of systems that can be developed is limited. (emphasis added)

That is the experience with creating electronic versions of the Hebrew Bible. Every project has started from a blank screen, requiring re-proofing of the same text, etc. As a result, there is no electronic encoding of the masora magna (think long margin notes). Duplicated effort has a real cost to scholarship.

The authors stray into legend land when they write:

Our approach to these problems is what we would like to call a ‘minimalist’ approach. How ‘little’ can one do to obtain interoperability? We provide an extensible mark-up language (XML) document type definition (DTD) defining ways in which a document can contain text, annotations and relations. Major XML elements may contain ‘infon’ elements, which store key-value pairs with any desired semantic information. We have adapted the term ‘infon’ from the writings of Devlin (1), where it is given the sense of a discrete item of information. An associated ‘key’ file is necessary to define the semantics that appear in tags such as the infon elements. Key files are simple text files where the developer defines the semantics associated with the data. Different corpora or annotation sets sharing the same semantics may reuse an existing key file, thus representing an accepted standard for a particular data type. In addition, key files may describe a new kind of data not seen before. At this point we prescribe no semantic standards. BioC users are encouraged to create their own key files to represent their BioC data collections. In time, we believe, the most useful key files will develop a life of their own, thus providing emerging standards that are naturally adopted by the community.
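To make the infon idea concrete, here is a minimal sketch (not the project’s own code) of a BioC-style passage built with Python’s standard library: key-value infon elements carry the semantics, and the keys would be defined in a separate key file. Element and key names here are illustrative.

```python
# Minimal sketch (not the official BioC code) of a BioC-style XML document
# where <infon> elements carry key-value semantic annotations. Element names
# follow the BioC DTD loosely; the keys and values are illustrative only.
import xml.etree.ElementTree as ET

collection = ET.Element("collection")
doc = ET.SubElement(collection, "document")
ET.SubElement(doc, "id").text = "PMC0000001"

passage = ET.SubElement(doc, "passage")
ET.SubElement(passage, "offset").text = "0"
ET.SubElement(passage, "text").text = "BRCA1 mutations are linked to breast cancer."

annotation = ET.SubElement(passage, "annotation", id="A1")
# A separate plain-text "key" file would define what "type" and
# "NCBI gene id" mean for this collection.
ET.SubElement(annotation, "infon", key="type").text = "gene"
ET.SubElement(annotation, "infon", key="NCBI gene id").text = "672"
ET.SubElement(annotation, "location", offset="0", length="5")
ET.SubElement(annotation, "text").text = "BRCA1"

print(ET.tostring(collection, encoding="unicode"))
```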

The “key files” don’t specify subject identities for the purposes of merging. But defining the semantics of data is a first step in that direction.

I like the idea of popular “key files” (read legends) taking on a life of their own due to their usefulness. An economic activity based on reducing the friction in using or re-using data. That should have legs.

BTW, don’t overlook the authors’ data and code, available at: http://bioc.sourceforge.net/.

September 17, 2013

Groups: knowledge spreadsheets for symbolic biocomputing [Semantic Objects]

Filed under: Bioinformatics,Knowledge Map,Knowledge Representation — Patrick Durusau @ 4:53 pm

Groups: knowledge spreadsheets for symbolic biocomputing by Michael Travers, Suzanne M. Paley, Jeff Shrager, Timothy A. Holland and Peter D. Karp.

Abstract:

Knowledge spreadsheets (KSs) are a visual tool for interactive data analysis and exploration. They differ from traditional spreadsheets in that rather than being oriented toward numeric data, they work with symbolic knowledge representation structures and provide operations that take into account the semantics of the application domain. ‘Groups’ is an implementation of KSs within the Pathway Tools system. Groups allows Pathway Tools users to define a group of objects (e.g. groups of genes or metabolites) from a Pathway/Genome Database. Groups can be transformed (e.g. by transforming a metabolite group to the group of pathways in which those metabolites are substrates); combined through set operations; analysed (e.g. through enrichment analysis); and visualized (e.g. by painting onto a metabolic map diagram). Users of the Pathway Tools-based BioCyc.org website have made extensive use of Groups, and an informal survey of Groups users suggests that Groups has achieved the goal of allowing biologists themselves to perform some data manipulations that previously would have required the assistance of a programmer.

Database URL: BioCyc.org.

Not my area, so a biologist would have to comment on the substantive aspects of using these particular knowledge spreadsheets.

But there is much in this article that could be applied more broadly.

From the introduction:

A long-standing problem in computing is that of providing non-programmers with intuitive, yet powerful tools for manipulating and analysing sets of entities. For example, a number of bioinformatics database websites provide users with powerful tools for composing database queries, but once a user obtains the query results, they are largely on their own. What if a user wants to store the query results for future reference, or combine them with other query results, or transform the results, or share them with a colleague? Sets of entities of interest arise in other contexts for life scientists, such as the entities that are identified as significantly perturbed in a high-throughput experiment (e.g. a set of differentially occurring metabolites), or a set of genes of interest that emerge from an experimental investigation.

We observe that spreadsheets have become a dominant form of end-user programming and data analysis for scientists. Although traditional spreadsheets provide a compelling interaction model, and are excellent tools for the manipulation of the tables of numbers that are typical of accounting and data analysis problems, they are less easily used with the complex symbolic computations typical of symbolic biocomputing. For example, they cannot perform semantic transformations such as converting a gene list to the list of pathways the genes act in.

We coined the term knowledge spreadsheet (KS) to describe spreadsheets that are characterized by their ability to manipulate semantic objects and relationships instead of just numbers and strings. Both traditional spreadsheets and KSs represent data in tabular structures, but in a KS the contents of a cell will typically be an object from a knowledge base (KB) [such as a MetaCyc (1) frame or a URI entity from an RDF store]. Given that a column in a KS will typically contain objects of the same ontological type, a KS can offer high-level semantically knowledgeable operations on the data. For example, given a group with a column of metabolites, a semantic operation could create a parallel column in which each cell contained the reactions that produced that metabolite. Another difference between our implementation of KSs and traditional spreadsheets is that cells in our KSs can contain multiple values.
(…)
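To make the “semantic transform” idea concrete, here is a minimal Python sketch (not Pathway Tools code) that maps a column of metabolites to a parallel, multi-valued column of producing reactions. The knowledge base is a toy dictionary standing in for MetaCyc frames or an RDF store.

```python
# Minimal sketch of a knowledge-spreadsheet "semantic transform" (not the
# Pathway Tools implementation). The knowledge base is a toy dictionary;
# real systems would query MetaCyc frames or an RDF store.
toy_kb = {
    "pyruvate":   {"produced_by": ["pyruvate kinase", "alanine transaminase"]},
    "acetyl-CoA": {"produced_by": ["pyruvate dehydrogenase"]},
}

def transform_column(cells, relation):
    """Map each cell (an object in the KB) to a new cell that may hold
    multiple values, one reason KS cells are multi-valued."""
    return [toy_kb.get(obj, {}).get(relation, []) for obj in cells]

metabolite_column = ["pyruvate", "acetyl-CoA"]
reaction_column = transform_column(metabolite_column, "produced_by")
for met, rxns in zip(metabolite_column, reaction_column):
    print(f"{met:12s} -> {rxns}")
```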

Can you think of any domain that would not benefit from better handling of “semantic objects?”

As you read the article closely, any number of ideas or techniques for manipulating “semantic objects” will come to mind.

August 8, 2013

Using the Unix Chainsaw:…

Filed under: Bioinformatics,Linux OS,Programming — Patrick Durusau @ 2:50 pm

Using the Unix Chainsaw: Named Pipes and Process Substitution by Vince Buffalo.

From the post:

It’s hard not to fall in love with Unix as a bioinformatician. In a past post I mentioned how Unix pipes are an extremely elegant way to interface bioinformatics programs (and do inter-process communication in general). In exploring other ways of interfacing programs in Unix, I’ve discovered two great but overlooked ways of interfacing programs: the named pipe and process substitution.

Why We Love Unix and Pipes

A few weeks ago I stumbled across a great talk by Gary Bernhardt entitled The Unix Chainsaw. Bernhardt’s “chainsaw” analogy is great: people sometimes fear doing work in Unix because it’s a powerful tool, and it’s easy to screw up with powerful tools. I think in the process of grokking Unix it’s not uncommon to ask “is this clever and elegant? or completely fucking stupid?”. This is normal, especially if you come from a language like Lisp or Python (or even C really). Unix is a get-shit-done system. I’ve used a chainsaw, and you’re simultaneously amazed at (1) how easily it slices through a tree, and (2) that you’re dumb enough to use this thing three feet away from your vital organs. This is Unix.
(…)
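The post itself works in shell syntax. As a rough parallel in Python (the input file name is hypothetical), os.mkfifo creates the same kind of named pipe, with a child process writing into it and this process reading from it as if it were an ordinary file.

```python
# Rough Python parallel to the shell examples in the post (not taken from
# the post itself). "reads.fastq.gz" is a hypothetical input file.
import os
import subprocess
import tempfile

fifo = os.path.join(tempfile.mkdtemp(), "reads.fifo")
os.mkfifo(fifo)

# The child opens the FIFO for writing (via the shell redirect), so this
# process does not block before it gets a chance to open it for reading.
writer = subprocess.Popen(f"gzip -dc reads.fastq.gz > {fifo}", shell=True)

with open(fifo) as stream:            # blocks until the writer connects
    for i, line in enumerate(stream):
        if i >= 4:                    # peek at the first FASTQ record
            break
        print(line.rstrip())

writer.wait()
os.remove(fifo)
```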

“The Unix Chainsaw.” Definitely a title for a drama about a group of shell hackers that uncover fraud and waste in large government projects. 😉

If you are not already a power user on *nix, this could be a step in that direction.

August 4, 2013

Building Smaller Data

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 9:41 am

Throw the Bath Water Out, Keep the Baby: Keeping Medically-Relevant Terms for Text Mining by Jay Jarman, MS and Donald J. Berndt, PhD.

Abstract:

The purpose of this research is to answer the question, can medically-relevant terms be extracted from text notes and text mined for the purpose of classification and obtain equal or better results than text mining the original note? A novel method is used to extract medically-relevant terms for the purpose of text mining. A dataset of 5,009 EMR text notes (1,151 related to falls) was obtained from a Veterans Administration Medical Center. The dataset was processed with a natural language processing (NLP) application which extracted concepts based on SNOMED-CT terms from the Unified Medical Language System (UMLS) Metathesaurus. SAS Enterprise Miner was used to text mine both the set of complete text notes and the set represented by the extracted concepts. Logistic regression models were built from the results, with the extracted concept model performing slightly better than the complete note model.

The researchers created two datasets: one composed of the original medical text notes and a second composed of named entities extracted using NLP and medical vocabularies.

The named-entity-only dataset performed slightly better than the full-text-mining approach.

A smaller dataset outperformed the larger dataset of complete notes.

Wait! Isn’t that backwards? I thought “big data” was always better than “smaller data?”

Maybe not?

Maybe having the “right” dataset is better than having a “big data” set.
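A minimal sketch of the kind of comparison the paper makes, with toy notes standing in for EMR text and scikit-learn standing in for the SAS Enterprise Miner / UMLS pipeline; the “concepts” below are hand-picked stand-ins for NLP-extracted SNOMED-CT terms.

```python
# Toy comparison of full notes vs. extracted concepts (not the paper's
# SAS Enterprise Miner / UMLS pipeline). Notes and concepts are invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

full_notes = [
    "pt found on floor next to bed, c/o hip pain after fall",
    "routine follow up, blood pressure stable, no complaints",
    "slipped in bathroom, bruising on left arm, fall precautions",
    "medication refill visit, reviewed diabetic diet",
]
concept_notes = [          # the same notes reduced to extracted concepts
    "fall hip pain",
    "hypertension",
    "fall contusion",
    "diabetes mellitus",
]
labels = [1, 0, 1, 0]      # 1 = fall-related note

for name, corpus in [("full notes", full_notes), ("concepts only", concept_notes)]:
    X = CountVectorizer().fit_transform(corpus)
    acc = cross_val_score(LogisticRegression(), X, labels, cv=2).mean()
    print(f"{name:14s} accuracy: {acc:.2f}")
```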

The 97% Junk Part of Human DNA

Filed under: Bioinformatics,Biomedical,Gene Ontology,Genome,Genomics — Patrick Durusau @ 9:21 am

Researchers from the Gene and Stem Cell Therapy Program at Sydney’s Centenary Institute have confirmed that, far from being “junk,” the 97 per cent of human DNA that does not encode instructions for making proteins can play a significant role in controlling cell development.

And in doing so, the researchers have unravelled a previously unknown mechanism for regulating the activity of genes, increasing our understanding of the way cells develop and opening the way to new possibilities for therapy.

Using the latest gene sequencing techniques and sophisticated computer analysis, a research group led by Professor John Rasko AO and including Centenary’s Head of Bioinformatics, Dr William Ritchie, has shown how particular white blood cells use non-coding DNA to regulate the activity of a group of genes that determines their shape and function. The work is published today in the scientific journal Cell.*

There’s a poke with a sharp stick to any gene ontology.

Roles in associations of genes have suddenly expanded.

Your call:

  1. Wait until a committee can officially name the new roles and parts of the “junk” that play those roles, or
  2. Create names/roles on the fly and merge those with subsequent identifiers on an ongoing basis as our understanding improves.

Any questions?

*Justin J.-L. Wong, William Ritchie, Olivia A. Ebner, Matthias Selbach, Jason W.H. Wong, Yizhou Huang, Dadi Gao, Natalia Pinello, Maria Gonzalez, Kinsha Baidya, Annora Thoeng, Teh-Liane Khoo, Charles G. Bailey, Jeff Holst, John E.J. Rasko. Orchestrated Intron Retention Regulates Normal Granulocyte Differentiation. Cell, 2013; 154 (3): 583 DOI: 10.1016/j.cell.2013.06.052

July 28, 2013

NIH Big Data to Knowledge (BD2K) Initiative [TM Opportunity?]

Filed under: Bioinformatics,Biomedical,Funding — Patrick Durusau @ 3:23 pm

NIH Big Data to Knowledge (BD2K) Initiative by Shar Steed.

From the post:

The National Institutes of Health (NIH) has announced the Centers of Excellence for Big Data Computing in the Biomedical Sciences (U54) funding opportunity announcement, the first in its Big Data to Knowledge (BD2K) Initiative.

The purpose of the BD2K initiative is to help biomedical scientists fully utilize Big Data being generated by research communities. As technology advances, scientists are generating and using large, complex, and diverse datasets, which is making the biomedical research enterprise more data-intensive and data-driven. According to the BD2K website:

[further down in the post]

Data integration: An applicant may propose a Center that will develop efficient and meaningful ways to create connections across data types (i.e., unimodal or multimodal data integration).

That sounds like topic maps, doesn’t it?

At least if we get away from black/white merging practices, where you either match one of a set of IRIs or you don’t.

For more details:

A webinar for applicants is scheduled for Thursday, September 12, 2013, from 3 – 4:30 pm EDT. Click here for more information.

Be aware of this workshop:

August 21, 2013 – August 22, 2013
NIH Data Catalogue
Chair:
Francine Berman, Ph.D.

This workshop seeks to identify the least duplicative and burdensome, and most sustainable and scalable method to create and maintain an NIH Data Catalog. An NIH Data Catalog would make biomedical data findable and citable, as PubMed does for scientific publications, and would link data to relevant grants, publications, software, or other relevant resources. The Data Catalog would be integrated with other BD2K initiatives as part of the broad NIH response to the challenges and opportunities of Big Data and seek to create an ongoing dialog with stakeholders and users from the biomedical community.

Contact: BD2Kworkshops@mail.nih.gov

Let’s see: “…least duplicative and burdensome, and most sustainable and scalable method to create and maintain an NIH Data Catalog.”

Recast existing data as RDF with a suitable OWL Ontology. – Duplicative, burdensome, not sustainable or scalable.

Accept all existing data as it exists and write subject identity and merging rules: Non-duplicative, existing systems persist so less burdensome, re-use of existing data = sustainable, only open question is scalability.

Sounds like a topic map opportunity to me.

You?

July 23, 2013

Microbial Life Database

Filed under: Biodiversity,Bioinformatics — Patrick Durusau @ 3:01 pm

Microbial Life Database

From the webpage:

The Microbial Life Database (MLD) is a project under continuos development to visualize the ecological, physiological and morphological diversity of microbial life. A database is being constructed including data for nearly 600 well-known prokaryote genera mostly described in Bergey’s Manual of Determinative Bacteriology and published by the Bergey’s Trust. Correction and additions come from many other sources. The database is divided by genera but we are working on a version by species. This is the current database v02 in Google Spreadsheets format. Below is a bubble chart of the number of species included in each of the microbial groups. You can click the graph to go to an interactive version with more details. If you want to contribute to this database please send an email to Prof. Abel Mendez.

I don’t have any immediate need for this data set but it is the sort of project where semantic reefs are found. 😉

July 3, 2013

CHD@ZJU…

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 9:37 am

CHD@ZJU: a knowledgebase providing network-based research platform on coronary heart disease by Leihong Wu, Xiang Li, Jihong Yang, Yufeng Liu, Xiaohui Fan and Yiyu Cheng. (Database (2013) 2013 : bat047 doi: 10.1093/database/bat047)

From the webpage:

Abstract:

Coronary heart disease (CHD), the leading cause of global morbidity and mortality in adults, has been reported to be associated with hundreds of genes. A comprehensive understanding of the CHD-related genes and their corresponding interactions is essential to advance the translational research on CHD. Accordingly, we construct this knowledgebase, CHD@ZJU, which records CHD-related information (genes, pathways, drugs and references) collected from different resources and through text-mining method followed by manual confirmation. In current release, CHD@ZJU contains 660 CHD-related genes, 45 common pathways and 1405 drugs accompanied with >8000 supporting references. Almost half of the genes collected in CHD@ZJU were novel to other publicly available CHD databases. Additionally, CHD@ZJU incorporated the protein–protein interactions to investigate the cross-talk within the pathways from a multi-layer network view. These functions offered by CHD@ZJU would allow researchers to dissect the molecular mechanism of CHD in a systematic manner and therefore facilitate the research on CHD-related multi-target therapeutic discovery.

Database URL: http://tcm.zju.edu.cn/chd/

The article outlines the construction of CHD@ZJU as follows:

Figure 1.
Procedure for CHD@ZJU construction. CHD-related genes were extracted with text-mining technique and manual confirmation. PPI, pathway and drugs information were then collected from public resources such as KEGG and HPRD. Interactome network of every pathway was constructed based on their corresponding genes and related PPIs, and the whole CHD diseasome network was then constructed with all CHD-related genes. With CHD@ZJU, users could find information related to CHD from gene, pathway and the whole biological network level.

While assisted by computer technology, there is a manual confirmation step that binds all the information together.

June 1, 2013

1000 Genomes…

Filed under: 1000 Genomes,Bioinformatics,Genomics — Patrick Durusau @ 3:44 pm

1000 Genomes: A Deep Catalog of Human Genetic Variation

From the overview:

Recent improvements in sequencing technology (“next-gen” sequencing platforms) have sharply reduced the cost of sequencing. The 1000 Genomes Project is the first project to sequence the genomes of a large number of people, to provide a comprehensive resource on human genetic variation.

As with other major human genome reference projects, data from the 1000 Genomes Project will be made available quickly to the worldwide scientific community through freely accessible public databases. (See Data use statement.)

The goal of the 1000 Genomes Project is to find most genetic variants that have frequencies of at least 1% in the populations studied. This goal can be attained by sequencing many individuals lightly. To sequence a person’s genome, many copies of the DNA are broken into short pieces and each piece is sequenced. The many copies of DNA mean that the DNA pieces are more-or-less randomly distributed across the genome. The pieces are then aligned to the reference sequence and joined together. To find the complete genomic sequence of one person with current sequencing platforms requires sequencing that person’s DNA the equivalent of about 28 times (called 28X). If the amount of sequence done is only an average of once across the genome (1X), then much of the sequence will be missed, because some genomic locations will be covered by several pieces while others will have none. The deeper the sequencing coverage, the more of the genome will be covered at least once. Also, people are diploid; the deeper the sequencing coverage, the more likely that both chromosomes at a location will be included. In addition, deeper coverage is particularly useful for detecting structural variants, and allows sequencing errors to be corrected.

Sequencing is still too expensive to deeply sequence the many samples being studied for this project. However, any particular region of the genome generally contains a limited number of haplotypes. Data can be combined across many samples to allow efficient detection of most of the variants in a region. The Project currently plans to sequence each sample to about 4X coverage; at this depth sequencing cannot provide the complete genotype of each sample, but should allow the detection of most variants with frequencies as low as 1%. Combining the data from 2500 samples should allow highly accurate estimation (imputation) of the variants and genotypes for each sample that were not seen directly by the light sequencing.
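The 1X vs. 28X discussion becomes concrete with the standard Poisson (Lander-Waterman) approximation: at average depth c, the chance that a given base is sequenced at least once is roughly 1 - e^(-c). A quick check (approximate numbers, not project figures):

```python
# Poisson (Lander-Waterman) approximation for sequencing coverage:
# P(base covered at least once) = 1 - exp(-c) at average depth c.
import math

for depth in (1, 4, 28):
    p_covered = 1 - math.exp(-depth)
    print(f"{depth:>2}X coverage -> ~{p_covered * 100:.2f}% of bases hit at least once")
```

Which is why light 1X sequencing misses a large fraction of the genome while 28X effectively covers it all.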

If you are looking for a large data set other than CiteSeer and DBpedia, you should consider something from the 1000 Genomes project.

Lots of significant data with more on the way.

May 20, 2013

The Index-Based Subgraph Matching Algorithm (ISMA)…

Filed under: Bioinformatics,Graphs,Indexing — Patrick Durusau @ 4:23 pm

The Index-Based Subgraph Matching Algorithm (ISMA): Fast Subgraph Enumeration in Large Networks Using Optimized Search Trees by Sofie Demeyer, Tom Michoel, Jan Fostier, Pieter Audenaert, Mario Pickavet, Piet Demeester. (Demeyer S, Michoel T, Fostier J, Audenaert P, Pickavet M, et al. (2013) The Index-Based Subgraph Matching Algorithm (ISMA): Fast Subgraph Enumeration in Large Networks Using Optimized Search Trees. PLoS ONE 8(4): e61183. doi:10.1371/journal.pone.0061183)

Abstract:

Subgraph matching algorithms are designed to find all instances of predefined subgraphs in a large graph or network and play an important role in the discovery and analysis of so-called network motifs, subgraph patterns which occur more often than expected by chance. We present the index-based subgraph matching algorithm (ISMA), a novel tree-based algorithm. ISMA realizes a speedup compared to existing algorithms by carefully selecting the order in which the nodes of a query subgraph are investigated. In order to achieve this, we developed a number of data structures and maximally exploited symmetry characteristics of the subgraph. We compared ISMA to a naive recursive tree-based algorithm and to a number of well-known subgraph matching algorithms. Our algorithm outperforms the other algorithms, especially on large networks and with large query subgraphs. An implementation of ISMA in Java is freely available at http://sourceforge.net/projects/isma/.

From the introduction:

Over the last decade, network theory has come to play a central role in our understanding of complex systems in fields as diverse as molecular biology, sociology, economics, the internet, and others [1]. The central question in all these fields is to understand behavior at the level of the whole system from the topology of interactions between its individual constituents. In this respect, the existence of network motifs, small subgraph patterns which occur more often in a network than expected by chance, has turned out to be one of the defining properties of real-world complex networks, in particular biological networks [2]. Network motifs act as the fundamental information processing units in cellular regulatory networks [3] and they form the building blocks of larger functional modules (also known as network communities) [4]–[6]. The discovery and analysis of network motifs crucially depends on the ability to enumerate all instances of a given query subgraph in a network or graph of interest, a classical problem in pattern recognition [7], that is known to be NP complete [8].
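Not ISMA itself (the paper’s Java implementation is on SourceForge), but the underlying task of enumerating all instances of a query subgraph can be sketched with NetworkX’s VF2 matcher on a toy network.

```python
# Sketch of subgraph enumeration with NetworkX's VF2 matcher (not the ISMA
# algorithm from the paper). The network and motif are toy examples.
import networkx as nx
from networkx.algorithms.isomorphism import GraphMatcher

network = nx.Graph([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("D", "E")])
triangle = nx.complete_graph(3)          # the query motif: a 3-node clique

matcher = GraphMatcher(network, triangle)
instances = {frozenset(m) for m in matcher.subgraph_isomorphisms_iter()}
print("triangle instances:", [sorted(i) for i in instances])
```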

Heavy sledding but important for exploration of large graphs/networks and the subsequent representation of those findings in a topic map.

I first saw this in Nat Torkington’s Four short links: 13 May 2013.

May 17, 2013

A self-updating road map of The Cancer Genome Atlas

Filed under: Bioinformatics,Biology,Biomedical,Medical Informatics,RDF,Semantic Web,SPARQL — Patrick Durusau @ 4:33 pm

A self-updating road map of The Cancer Genome Atlas by David E. Robbins, Alexander Grüneberg, Helena F. Deus, Murat M. Tanik and Jonas S. Almeida. (Bioinformatics (2013) 29 (10): 1333-1340. doi: 10.1093/bioinformatics/btt141)

Abstract:

Motivation: Since 2011, The Cancer Genome Atlas’ (TCGA) files have been accessible through HTTP from a public site, creating entirely new possibilities for cancer informatics by enhancing data discovery and retrieval. Significantly, these enhancements enable the reporting of analysis results that can be fully traced to and reproduced using their source data. However, to realize this possibility, a continually updated road map of files in the TCGA is required. Creation of such a road map represents a significant data modeling challenge, due to the size and fluidity of this resource: each of the 33 cancer types is instantiated in only partially overlapping sets of analytical platforms, while the number of data files available doubles approximately every 7 months.

Results: We developed an engine to index and annotate the TCGA files, relying exclusively on third-generation web technologies (Web 3.0). Specifically, this engine uses JavaScript in conjunction with the World Wide Web Consortium’s (W3C) Resource Description Framework (RDF), and SPARQL, the query language for RDF, to capture metadata of files in the TCGA open-access HTTP directory. The resulting index may be queried using SPARQL, and enables file-level provenance annotations as well as discovery of arbitrary subsets of files, based on their metadata, using web standard languages. In turn, these abilities enhance the reproducibility and distribution of novel results delivered as elements of a web-based computational ecosystem. The development of the TCGA Roadmap engine was found to provide specific clues about how biomedical big data initiatives should be exposed as public resources for exploratory analysis, data mining and reproducible research. These specific design elements align with the concept of knowledge reengineering and represent a sharp departure from top-down approaches in grid initiatives such as CaBIG. They also present a much more interoperable and reproducible alternative to the still pervasive use of data portals.

Availability: A prepared dashboard, including links to source code and a SPARQL endpoint, is available at http://bit.ly/TCGARoadmap. A video tutorial is available at http://bit.ly/TCGARoadmapTutorial.
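A toy rdflib sketch of the underlying idea: describe files as RDF triples and slice them with SPARQL. The predicates and URIs below are invented for illustration, not the TCGA Roadmap’s actual vocabulary.

```python
# Toy sketch of "files as RDF metadata, queried with SPARQL". Predicates
# and URIs are invented, not the TCGA Roadmap vocabulary.
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/tcga#")
g = Graph()
for name, cancer, platform in [
    ("file1.maf", "BRCA", "mutation"),
    ("file2.txt", "BRCA", "expression"),
    ("file3.maf", "LUAD", "mutation"),
]:
    f = URIRef(f"http://example.org/files/{name}")
    g.add((f, EX.cancerType, Literal(cancer)))
    g.add((f, EX.platform, Literal(platform)))

results = g.query("""
    PREFIX ex: <http://example.org/tcga#>
    SELECT ?file WHERE {
        ?file ex:cancerType "BRCA" ;
              ex:platform   "mutation" .
    }""")
for row in results:
    print(row.file)
```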

Curious how the granularity of required semantics and the uniformity of the underlying data set impact the choice of semantic approaches?

And does access to data files present different challenges than say access to research publications in the same field?

May 16, 2013

HAL: a hierarchical format for storing…

Filed under: Bioinformatics,Genomics,Graphs — Patrick Durusau @ 12:27 pm

HAL: a hierarchical format for storing and analyzing multiple genome alignments by Glenn Hickey, Benedict Paten, Dent Earl, Daniel Zerbino and David Haussler. (Bioinformatics (2013) 29 (10): 1341-1342. doi: 10.1093/bioinformatics/btt128)

Abstract:

Motivation: Large multiple genome alignments and inferred ancestral genomes are ideal resources for comparative studies of molecular evolution, and advances in sequencing and computing technology are making them increasingly obtainable. These structures can provide a rich understanding of the genetic relationships between all subsets of species they contain. Current formats for storing genomic alignments, such as XMFA and MAF, are all indexed or ordered using a single reference genome, however, which limits the information that can be queried with respect to other species and clades. This loss of information grows with the number of species under comparison, as well as their phylogenetic distance.

Results: We present HAL, a compressed, graph-based hierarchical alignment format for storing multiple genome alignments and ancestral reconstructions. HAL graphs are indexed on all genomes they contain. Furthermore, they are organized phylogenetically, which allows for modular and parallel access to arbitrary subclades without fragmentation because of rearrangements that have occurred in other lineages. HAL graphs can be created or read with a comprehensive C++ API. A set of tools is also provided to perform basic operations, such as importing and exporting data, identifying mutations and coordinate mapping (liftover).

Availability: All documentation and source code for the HAL API and tools are freely available at http://github.com/glennhickey/hal.

Important work for bioinformatics and genome alignment as well as specializing graphs for that work.

Graphs are a popular subject these days but successful projects will rely on graphs with particular properties and structures to be useful.

The more examples of graph-based projects, the more we learn about general principles of graphs for particular applications or requirements.

May 15, 2013

EDAM: an ontology of bioinformatics operations,…

Filed under: Bioinformatics,Ontology — Patrick Durusau @ 5:51 pm

EDAM: an ontology of bioinformatics operations, types of data and identifiers, topics and formats by Jon Ison, Matúš Kalaš, Inge Jonassen, Dan Bolser, Mahmut Uludag, Hamish McWilliam, James Malone, Rodrigo Lopez, Steve Pettifer and Peter Rice. (Bioinformatics (2013) 29 (10): 1325-1332. doi: 10.1093/bioinformatics/btt113)

Abstract:

Motivation: Advancing the search, publication and integration of bioinformatics tools and resources demands consistent machine-understandable descriptions. A comprehensive ontology allowing such descriptions is therefore required.

Results: EDAM is an ontology of bioinformatics operations (tool or workflow functions), types of data and identifiers, application domains and data formats. EDAM supports semantic annotation of diverse entities such as Web services, databases, programmatic libraries, standalone tools, interactive applications, data schemas, datasets and publications within bioinformatics. EDAM applies to organizing and finding suitable tools and data and to automating their integration into complex applications or workflows. It includes over 2200 defined concepts and has successfully been used for annotations and implementations.

Availability: The latest stable version of EDAM is available in OWL format from http://edamontology.org/EDAM.owl and in OBO format from http://edamontology.org/EDAM.obo. It can be viewed online at the NCBO BioPortal and the EBI Ontology Lookup Service. For documentation and license please refer to http://edamontology.org. This article describes version 1.2 available at http://edamontology.org/EDAM_1.2.owl.

No matter how many times I read it, I just don’t get:

Advancing the search, publication and integration of bioinformatics tools and resources demands consistent machine-understandable descriptions. A comprehensive ontology allowing such descriptions is therefore required.

I will be generous and assume the authors meant “machine-processable descriptions” when I read “machine-understandable descriptions.” It is well known that machines don’t “understand” data, they simply process it according to specified instructions.

But more to the point, machines are indifferent to the type or number of descriptions they have for any subject. It might confuse a human processor to have thirty (30) different descriptions for the same subject but there has been no showing of such a limit for machines.

Every effort to produce a “comprehensive” ontology/classification/taxonomy, pick your brand of poison, has been in the face of competing and different descriptions. That is, after all, the rationale for a comprehensive …, that there are too many choices already.

The outcome of all such efforts, assuming there are N diverse descriptions is N + 1 diverse descriptions, the 1 being the current project added to existing diverse descriptions.

April 28, 2013

Scientific Lenses over Linked Data… [Operational Equivalence]

Scientific Lenses over Linked Data: An approach to support task specific views of the data. A vision. by Christian Brenninkmeijer, Chris Evelo, Carole Goble, Alasdair J G Gray, Paul Groth, Steve Pettifer, Robert Stevens, Antony J Williams, and Egon L Willighagen.

Abstract:

Within complex scientific domains such as pharmacology, operational equivalence between two concepts is often context-, user- and task-specific. Existing Linked Data integration procedures and equivalence services do not take the context and task of the user into account. We present a vision for enabling users to control the notion of operational equivalence by applying scientific lenses over Linked Data. The scientific lenses vary the links that are activated between the datasets which affects the data returned to the user.

Two additional quotes from this paper should convince you of the importance of this work:

We aim to support users in controlling and varying their view of the data by applying a scientific lens which govern the notions of equivalence applied to the data. Users will be able to change their lens based on the task and role they are performing rather than having one fixed lens. To support this requirement, we propose an approach that applies context dependent sets of equality links. These links are stored in a stand-off fashion so that they are not intermingled with the datasets. This allows for multiple, context-dependent, linksets that can evolve without impact on the underlying datasets and support differing opinions on the relationships between data instances. This flexibility is in contrast to both Linked Data and traditional data integration approaches. We look at the role personae can play in guiding the nature of relationships between the data resources and the desired affects of applying scientific lenses over Linked Data.

and,

Within scientific datasets it is common to find links to the “equivalent” record in another dataset. However, there is no declaration of the form of the relationship. There is a great deal of variation in the notion of equivalence implied by the links both within a dataset’s usage and particularly across datasets, which degrades the quality of the data. The scientific user personae have very different needs about the notion of equivalence that should be applied between datasets. The users need a simple mechanism by which they can change the operational equivalence applied between datasets. We propose the use of scientific lenses.
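A rough sketch of the lens idea in Python (my reading, not Open PHACTS code): equivalence links live in stand-off linksets, and the active lens decides which linksets apply before records are merged. The identifiers are invented.

```python
# Sketch of lens-dependent equivalence (not Open PHACTS code): stand-off
# linksets, a lens that activates some of them, and merging under that lens.
from itertools import chain

linksets = {
    "exact_structure":      [("chembl:CHEMBL25", "drugbank:DB00945")],
    "same_parent_compound": [("chembl:CHEMBL25", "chembl:CHEMBL2296990")],
}
lenses = {
    "strict_chemistry": ["exact_structure"],
    "pharmacology":     ["exact_structure", "same_parent_compound"],
}

def equivalence_classes(lens):
    """Group identifiers connected by any link the chosen lens activates."""
    groups = []
    for a, b in chain.from_iterable(linksets[ls] for ls in lenses[lens]):
        hits = [g for g in groups if a in g or b in g]
        merged = set([a, b]).union(*hits) if hits else {a, b}
        groups = [g for g in groups if g not in hits] + [merged]
    return groups

for lens in lenses:
    print(lens, "->", equivalence_classes(lens))
```

The same identifiers merge or stay apart depending on which lens is active, which is the point of the paper.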

Obvious questions:

Does your topic map software support multiple operational equivalences?

Does your topic map interface enable users to choose “lenses” (I like lenses better than roles) to view equivalence?

Does your topic map software support declaring the nature of equivalence?

I first saw this in the slide deck: Scientific Lenses: Supporting Alternative Views of the Data by Alasdair J G Gray at: 4th Open PHACTS Community Workshop.

BTW, the notion of equivalence being represented by “links” reminds me of a comment Peter Neubauer (Neo4j) once made to me, saying that equivalence could be modeled as edges. Imagine typing equivalence edges. Will have to think about that some more.

4th Open PHACTS Community Workshop (slides) [Operational Equivalence]

Filed under: Bioinformatics,Biomedical,Drug Discovery,Linked Data,Medical Informatics — Patrick Durusau @ 12:24 pm

4th Open PHACTS Community Workshop : Using the power of Open PHACTS

From the post:

The fourth Open PHACTS Community Workshop was held at Burlington House in London on April 22 and 23, 2013. The Workshop focussed on “Using the Power of Open PHACTS” and featured the public release of the Open PHACTS application programming interface (API) and the first Open PHACTS example app, ChemBioNavigator.

The first day featured talks describing the data accessible via the Open PHACTS Discovery Platform and technical aspects of the API. The use of the API by example applications ChemBioNavigator and PharmaTrek was outlined, and the results of the Accelrys Pipeline Pilot Hackathon discussed.

The second day involved discussion of Open PHACTS sustainability and plans for the successor organisation, the Open PHACTS Foundation. The afternoon was attended by those keen to further discuss the potential of the Open PHACTS API and the future of Open PHACTS.

During talks, especially those detailing the Open PHACTS API, a good number of signup requests to the API via dev.openphacts.org were received. The hashtag #opslaunch was used to follow reactions to the workshop on Twitter (see storify), and showed the response amongst attendees to be overwhelmingly positive.

This summary is followed by slides from the two days of presentations.

Not like being there but still quite useful.

As a matter of fact, I found a lead on “operational equivalence” with this data set. More to follow in a separate post.

April 24, 2013

Brain: … [Topic Naming Constraint Reappears]

Filed under: Bioinformatics,OWL,Semantic Web — Patrick Durusau @ 1:41 pm

Brain: biomedical knowledge manipulation by Samuel Croset, John P. Overington and Dietrich Rebholz-Schuhmann. (Bioinformatics (2013) 29 (9): 1238-1239. doi: 10.1093/bioinformatics/btt109)

Abstract:

Summary: Brain is a Java software library facilitating the manipulation and creation of ontologies and knowledge bases represented with the Web Ontology Language (OWL).

Availability and implementation: The Java source code and the library are freely available at https://github.com/loopasam/Brain and on the Maven Central repository (GroupId: uk.ac.ebi.brain). The documentation is available at https://github.com/loopasam/Brain/wiki.

Contact: croset@ebi.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

Odd how things like the topic naming constraint show up in unexpected contexts. 😉

This article may be helpful if you are required to create or read OWL based data.

But as I read the article I saw:

The names (short forms) of OWL entities handled by a Brain object have to be unique. It is for instance not possible to add an OWL class, such as http://www.example.org/Cell to the ontology if an OWL entity with the short form ‘Cell’ already exists.

The explanation?

Despite being in contradiction with some Semantic Web principles, this design prevents ambiguous queries and hides as much as possible the cumbersome interaction with prefixes and Internationalized Resource Identifiers (IRI).

I suppose, but doesn’t ambiguity exist in the mind of the user? That is, they use a term that can have more than one meaning?

Having unique terms simply means inventing odd terms that no user will know.

Rather than unambiguous, isn’t that unfound?
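For illustration only (this is not the Brain API), a tiny registry shows what a unique short-form rule does: two distinct IRIs sharing the short form “Cell” cannot coexist, which is the topic naming constraint behavior in action.

```python
# Toy illustration (not the Brain API) of a unique short-form constraint:
# distinct IRIs with the same short form collide.
class ShortFormRegistry:
    def __init__(self):
        self._by_short_form = {}

    def add(self, iri):
        short = iri.rstrip("/").rsplit("/", 1)[-1].split("#")[-1]
        if short in self._by_short_form:
            raise ValueError(
                f"short form {short!r} already bound to {self._by_short_form[short]}")
        self._by_short_form[short] = iri

reg = ShortFormRegistry()
reg.add("http://www.example.org/Cell")
try:
    reg.add("http://purl.obolibrary.org/obo/Cell")
except ValueError as err:
    print("collision:", err)
```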

April 20, 2013

PhenoMiner:…

PhenoMiner: quantitative phenotype curation at the rat genome database by Stanley J. F. Laulederkind, et al. (Database (2013) 2013 : bat015 doi: 10.1093/database/bat015)

Abstract:

The Rat Genome Database (RGD) is the premier repository of rat genomic and genetic data and currently houses >40 000 rat gene records as well as human and mouse orthologs, >2000 rat and 1900 human quantitative trait loci (QTLs) records and >2900 rat strain records. Biological information curated for these data objects includes disease associations, phenotypes, pathways, molecular functions, biological processes and cellular components. Recently, a project was initiated at RGD to incorporate quantitative phenotype data for rat strains, in addition to the currently existing qualitative phenotype data for rat strains, QTLs and genes. A specialized curation tool was designed to generate manual annotations with up to six different ontologies/vocabularies used simultaneously to describe a single experimental value from the literature. Concurrently, three of those ontologies needed extensive addition of new terms to move the curation forward. The curation interface development, as well as ontology development, was an ongoing process during the early stages of the PhenoMiner curation project.

Database URL: http://rgd.mcw.edu

The line:

A specialized curation tool was designed to generate manual annotations with up to six different ontologies/vocabularies used simultaneously to describe a single experimental value from the literature.

sounded relevant to topic maps.

Turns out to be five ontologies and the article reports:

The ‘Create Record’ page (Figure 4) is where the rest of the data for a single record is entered. It consists of a series of autocomplete text boxes, drop-down text boxes and editable plain text boxes. All of the data entered are associated with terms from five ontologies/vocabularies: RS, CMO, MMO, XCO and the optional MA (Mouse Adult Gross Anatomy Dictionary) (13)

Important to note that authoring does not require the user to make explicit the properties underlying any of the terms from the different ontologies.

Some users probably know that level of detail, but what is important is capturing their knowledge of subject sameness.

A topic map extension/add-on to such a system could flesh out those bare terms to provide a basis for treating terms from different ontologies as terms for the same subjects.

That merging/mapping detail need not bother an author or casual user.

But it increases the odds that future data sets can be reliably integrated with this one.

And issues with the correctness of a mapping can be meaningfully investigated.

If it helps, think of correctness of mapping as accountability, for someone else.
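A rough sketch of that mapping layer: one PhenoMiner-style record annotated with terms from several vocabularies, plus a subject-identity map that treats terms from different ontologies as names for the same subject. The identifiers and groupings are placeholders, not curated mappings.

```python
# Sketch of a subject-identity layer over multi-ontology annotations.
# All IDs below are placeholders in the style of the vocabularies named
# in the paper; they are not curated annotations.
record = {
    "strain":      "RS:0000001",     # Rat Strain Ontology
    "measurement": "CMO:0000001",    # Clinical Measurement Ontology
    "method":      "MMO:0000001",    # Measurement Method Ontology
    "condition":   "XCO:0000001",    # Experimental Condition Ontology
    "value":       122.0,
    "units":       "mmHg",
}

# Terms from different ontologies judged to identify the same subject.
same_subject = [
    {"CMO:0000001", "EFO:0000001", "NCIT:C00001"},
]

def co_identified(term):
    """Return every term treated as a name for the same subject."""
    for group in same_subject:
        if term in group:
            return group
    return {term}

print(co_identified(record["measurement"]))
```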

April 16, 2013

The non-negative matrix factorization toolbox for biological data mining

Filed under: Bioinformatics,Data Mining,Matrix — Patrick Durusau @ 4:20 pm

The non-negative matrix factorization toolbox for biological data mining by Yifeng Li and Alioune Ngom. (Source Code for Biology and Medicine 2013, 8:10 doi:10.1186/1751-0473-8-10)

From the post:

Background: Non-negative matrix factorization (NMF) has been introduced as an important method for mining biological data. Though there currently exists packages implemented in R and other programming languages, they either provide only a few optimization algorithms or focus on a specific application field. There does not exist a complete NMF package for the bioinformatics community, and in order to perform various data mining tasks on biological data.

Results: We provide a convenient MATLAB toolbox containing both the implementations of various NMF techniques and a variety of NMF-based data mining approaches for analyzing biological data. Data mining approaches implemented within the toolbox include data clustering and bi-clustering, feature extraction and selection, sample classification, missing values imputation, data visualization, and statistical comparison.

Conclusions: A series of analysis such as molecular pattern discovery, biological process identification, dimension reduction, disease prediction, visualization, and statistical comparison can be performed using this toolbox.

Written in a bioinformatics context but also used in text data mining (Enron emails), spectral analysis and other data mining fields. (See Non-negative matrix factorization)
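The toolbox itself is MATLAB; for orientation, the core factorization looks like this with scikit-learn on a toy samples-by-genes matrix.

```python
# The paper's toolbox is MATLAB; this is the same core NMF step done with
# scikit-learn on a toy non-negative "samples x genes" matrix.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random((6, 10))                     # 6 samples, 10 genes, non-negative

model = NMF(n_components=2, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(X)                  # sample loadings on the components
H = model.components_                       # gene patterns per component

print("reconstruction error:", round(model.reconstruction_err_, 3))
print("cluster per sample:", W.argmax(axis=1))   # a common NMF-based clustering
```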

April 14, 2013

Planform:… [Graph vs. SQL?]

Filed under: Bioinformatics,Graphs,SQL,SQLite — Patrick Durusau @ 3:16 pm

Planform: an application and database of graph-encoded planarian regenerative experiments by Daniel Lobo, Taylor J. Malone and Michael Levin. Bioinformatics (2013) 29 (8): 1098-1100. doi: 10.1093/bioinformatics/btt088

Abstract:

Summary: Understanding the mechanisms governing the regeneration capabilities of many organisms is a fundamental interest in biology and medicine. An ever-increasing number of manipulation and molecular experiments are attempting to discover a comprehensive model for regeneration, with the planarian flatworm being one of the most important model species. Despite much effort, no comprehensive, constructive, mechanistic models exist yet, and it is now clear that computational tools are needed to mine this huge dataset. However, until now, there is no database of regenerative experiments, and the current genotype–phenotype ontologies and databases are based on textual descriptions, which are not understandable by computers. To overcome these difficulties, we present here Planform (Planarian formalization), a manually curated database and software tool for planarian regenerative experiments, based on a mathematical graph formalism. The database contains more than a thousand experiments from the main publications in the planarian literature. The software tool provides the user with a graphical interface to easily interact with and mine the database. The presented system is a valuable resource for the regeneration community and, more importantly, will pave the way for the application of novel artificial intelligence tools to extract knowledge from this dataset.

Availability: The database and software tool are freely available at http://planform.daniel-lobo.com.

Watch the video tour for an example of a domain specific authoring tool.

It does not use any formal graph notation/terminology or attempt a new form of ASCII art.

Users can enter data about worms with four (4) heads. That bodes well for new techniques to author topic maps.

On the use of graphs, the authors write:

We have created a formalism based on graphs to encode the resultant morphologies and manipulations of regenerative experiments (Lobo et al., 2013). Mathematical graphs are ideal to encode relationships between individuals and have been previously used to encode morphologies (Lobo et al., 2011). The formalism divided a morphology into adjacent regions (graph nodes) connected to each other (graph edges). The geometrical characteristics of the regions (connection angles, distances, shapes, type, etc.) are stored as node and link labels. Importantly, the formalism permits automatic comparisons between morphologies: we implemented a metric to quantify the difference between two morphologies based on the graph edit distance algorithm.

The experiment manipulations are encoded in a tree structure. Nodes represent specific manipulations (cuts, irradiation and transplantations) where links define the order and relations between manipulations. This approach permits encode the majority of published planarian regenerative experiments.
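A toy NetworkX sketch of the morphology-as-graph comparison (not Planform’s own metric): two “worm” morphologies as region graphs, compared by graph edit distance. The region labels are invented.

```python
# Toy morphology-as-graph comparison (not Planform's metric): region graphs
# compared by graph edit distance.
import networkx as nx

one_head = nx.Graph()
one_head.add_edges_from([("head", "trunk"), ("trunk", "tail")])

two_heads = nx.Graph()
two_heads.add_edges_from([("head_1", "trunk"), ("head_2", "trunk"), ("trunk", "tail")])

# Exact graph edit distance is exponential in general, but fine at this size.
print("edit distance:", nx.graph_edit_distance(one_head, two_heads))
```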

The graph vs. relational crowd will be disappointed to learn the project uses SQLite (“the most widely deployed SQL database engine in the world”) for the storage/access to its data. 😉

You were aware that hypergraphs were used to model relational databases in the “old days.” Yes?

I will try to pull together some of those publications in the near future.

April 11, 2013

Efficient comparison of sets of intervals with NC-lists

Filed under: Bioinformatics,Set Intersection,Sets — Patrick Durusau @ 1:00 pm

Efficient comparison of sets of intervals with NC-lists by Matthias Zytnicki, YuFei Luo and Hadi Quesneville. (Bioinformatics (2013) 29 (7): 933-939. doi: 10.1093/bioinformatics/btt070)

Abstract:

Motivation: High-throughput sequencing produces in a small amount of time a large amount of data, which are usually difficult to analyze. Mapping the reads to the transcripts they originate from, to quantify the expression of the genes, is a simple, yet time demanding, example of analysis. Fast genomic comparison algorithms are thus crucial for the analysis of the ever-expanding number of reads sequenced.

Results: We used NC-lists to implement an algorithm that compares a set of query intervals with a set of reference intervals in two steps. The first step, a pre-processing done once for all, requires time O[#R log(#R) + #Q log(#Q)], where Q and R are the sets of query and reference intervals. The search phase requires constant space, and time O(#R + #Q + #M), where M is the set of overlaps. We showed that our algorithm compares favorably with five other algorithms, especially when several comparisons are performed.

Availability: The algorithm has been included to S–MART, a versatile tool box for RNA-Seq analysis, freely available at http://urgi.versailles.inra.fr/Tools/S-Mart. The algorithm can be used for many kinds of data (sequencing reads, annotations, etc.) in many formats (GFF3, BED, SAM, etc.), on any operating system. It is thus readily useable for the analysis of next-generation sequencing data.

Before you search for “NC-lists,” be aware that you will get this article as the first “hit” today in some popular search engines. Followed by a variety of lists for North Carolina.

A more useful search engine would allow me to choose the correct usage of a term and to re-run the query using the distinguished subject.

The expansion helps: Nested Containment List (NCList).

Familiar if you are working in bioinformatics.

More generally, consider the need to compare complex sequences of values for merging purposes.

Not a magic bullet but a technique you should keep in mind.

Origin: Nested Containment List (NCList): a new algorithm for accelerating interval query of genome alignment and interval databases, Alexander V. Alekseyenko and Christopher J. Lee. (Bioinformatics (2007) 23 (11): 1386-1393. doi: 10.1093/bioinformatics/btl647)
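Not the NC-list code shipped with S-MART, but a plain sorted-sweep comparison of query and reference intervals makes the task concrete. Coordinates are illustrative (start, end) pairs on one sequence.

```python
# Plain sorted-sweep overlap detection (not the NC-list algorithm itself),
# just to make the query-vs-reference interval comparison concrete.
def overlaps(queries, references):
    """Yield (query, reference) pairs that overlap."""
    queries = sorted(queries)
    references = sorted(references)
    j = 0
    for q_start, q_end in queries:
        # drop references that end before this (and every later) query starts
        while j < len(references) and references[j][1] < q_start:
            j += 1
        k = j
        while k < len(references) and references[k][0] <= q_end:
            if references[k][1] >= q_start:
                yield (q_start, q_end), references[k]
            k += 1

reads = [(5, 15), (40, 60)]
exons = [(10, 30), (50, 55), (70, 90)]
print(list(overlaps(reads, exons)))
```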

April 7, 2013

Open PHACTS

Open PHACTS – Open Pharmacological Space

From the homepage:

Open PHACTS is building an Open Pharmacological Space in a 3-year knowledge management project of the Innovative Medicines Initiative (IMI), a unique partnership between the European Community and the European Federation of Pharmaceutical Industries and Associations (EFPIA).

The project is due to end in March 2014, and aims to deliver a sustainable service to continue after the project funding ends. The project consortium consists of leading academics in semantics, pharmacology and informatics, driven by solid industry business requirements: 28 partners, including 9 pharmaceutical companies and 3 biotechs.

Source code has just appeared on GitHub: OpenPHACTS.

Important to different communities for different reasons. My interest isn’t the same as BigPharma. 😉

A project to watch as they navigate the thickets of vocabularies, ontologies and other semantically diverse information sources.

April 5, 2013

Bioinformatics Workshops and Training Resources

Filed under: Bioinformatics — Patrick Durusau @ 2:15 pm

List of Bioinformatics Workshops and Training Resources by Stephen Turner.

Stephen has created List of Bioinformatics Workshops and Training Resources, a listing:

of both online learning resources and in-person workshops (preferentially highlighting those where workshop materials are freely available online)

Update Stephen if you discover new resources and/or create new resources that should be listed.

As in most areas, bioinformatics has a wealth of semantic issues for topic maps to address.

March 23, 2013

Using Bayesian networks to discover relations…

Filed under: Bayesian Data Analysis,Bayesian Models,Bioinformatics,Medical Informatics — Patrick Durusau @ 3:33 pm

Using Bayesian networks to discover relations between genes, environment, and disease by Chengwei Su, Angeline Andrew, Margaret R Karagas and Mark E Borsuk. (BioData Mining 2013, 6:6 doi:10.1186/1756-0381-6-6)

Abstract:

We review the applicability of Bayesian networks (BNs) for discovering relations between genes, environment, and disease. By translating probabilistic dependencies among variables into graphical models and vice versa, BNs provide a comprehensible and modular framework for representing complex systems. We first describe the Bayesian network approach and its applicability to understanding the genetic and environmental basis of disease. We then describe a variety of algorithms for learning the structure of a network from observational data. Because of their relevance to real-world applications, the topics of missing data and causal interpretation are emphasized. The BN approach is then exemplified through application to data from a population-based study of bladder cancer in New Hampshire, USA. For didactical purposes, we intentionally keep this example simple. When applied to complete data records, we find only minor differences in the performance and results of different algorithms. Subsequent incorporation of partial records through application of the EM algorithm gives us greater power to detect relations. Allowing for network structures that depart from a strict causal interpretation also enhances our ability to discover complex associations including gene-gene (epistasis) and gene-environment interactions. While BNs are already powerful tools for the genetic dissection of disease and generation of prognostic models, there remain some conceptual and computational challenges. These include the proper handling of continuous variables and unmeasured factors, the explicit incorporation of prior knowledge, and the evaluation and communication of the robustness of substantive conclusions to alternative assumptions and data manifestations.

From the introduction:

BNs have been applied in a variety of settings for the purposes of causal study and probabilistic prediction, including medical diagnosis, crime and terrorism risk, forensic science, and ecological conservation (see [7]). In bioinformatics, they have been used to analyze gene expression data [8,9], derive protein signaling networks [10-12], predict protein-protein interactions [13], perform pedigree analysis [14], conduct genetic epidemiological studies [5], and assess the performance of microsatellite markers on cancer recurrence [15].

Not to mention criminal investigations: Bayesian Network – [Crime Investigation] (YouTube). 😉

Once relations are discovered, you are free to decorate them with roles, properties, etc.; in other words, to treat them as associations.
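
For readers who want to see the mechanics, here is a minimal sketch of the BIC scoring that score-based structure learning relies on. Plain Python, with a toy dataset and candidate networks of my own invention, not code or data from the paper:

import math
from collections import Counter

# Toy records: gene variant (G), environmental exposure (E), disease status (D).
# The data and the candidate networks are invented for illustration.
data = [
    {"G": 1, "E": 1, "D": 1}, {"G": 1, "E": 0, "D": 1},
    {"G": 0, "E": 1, "D": 0}, {"G": 0, "E": 0, "D": 0},
    {"G": 1, "E": 1, "D": 1}, {"G": 0, "E": 1, "D": 1},
    {"G": 1, "E": 0, "D": 0}, {"G": 0, "E": 0, "D": 0},
]

def bic_score(dag, records):
    """BIC = log-likelihood - 0.5 * (free parameters) * log(N).
    Parameters are counted per observed parent configuration (binary nodes)."""
    n = len(records)
    score = 0.0
    for node, parents in dag.items():
        counts = Counter((tuple(r[p] for p in parents), r[node]) for r in records)
        parent_counts = Counter(tuple(r[p] for p in parents) for r in records)
        for (pa_cfg, _value), c in counts.items():
            score += c * math.log(c / parent_counts[pa_cfg])
        score -= 0.5 * len(parent_counts) * math.log(n)
    return score

# Candidate structures: disease depends on gene and exposure, or on gene alone.
dag_a = {"G": (), "E": (), "D": ("G", "E")}
dag_b = {"G": (), "E": (), "D": ("G",)}
print("G,E -> D:", round(bic_score(dag_a, data), 2))
print("G   -> D:", round(bic_score(dag_b, data), 2))

Real applications use dedicated libraries, handle continuous variables and, as the authors do, bring in partial records via the EM algorithm; the sketch only shows why one candidate structure can out-score another.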

March 16, 2013

MetaNetX.org…

Filed under: Bioinformatics,Biomedical,Genomics,Modeling,Semantic Diversity — Patrick Durusau @ 1:42 pm

MetaNetX.org: a website and repository for accessing, analysing and manipulating metabolic networks by Mathias Ganter, Thomas Bernard, Sébastien Moretti, Joerg Stelling and Marco Pagni. (Bioinformatics (2013) 29 (6): 815-816. doi: 10.1093/bioinformatics/btt036)

Abstract:

MetaNetX.org is a website for accessing, analysing and manipulating genome-scale metabolic networks (GSMs) as well as biochemical pathways. It consistently integrates data from various public resources and makes the data accessible in a standardized format using a common namespace. Currently, it provides access to hundreds of GSMs and pathways that can be interactively compared (two or more), analysed (e.g. detection of dead-end metabolites and reactions, flux balance analysis or simulation of reaction and gene knockouts), manipulated and exported. Users can also upload their own metabolic models, choose to automatically map them into the common namespace and subsequently make use of the website’s functionality.

http://metanetx.org.

The authors are addressing a familiar problem:

Genome-scale metabolic networks (GSMs) consist of compartmentalized reactions that consistently combine biochemical, genetic and genomic information. When also considering a biomass reaction and both uptake and secretion reactions, GSMs are often used to study genotype–phenotype relationships, to direct new discoveries and to identify targets in metabolic engineering (Karr et al., 2012). However, a major difficulty in GSM comparisons and reconstructions is to integrate data from different resources with different nomenclatures and conventions for both metabolites and reactions. Hence, GSM consolidation and comparison may be impossible without detailed biological knowledge and programming skills. (emphasis added)

For which they propose an uncommon solution:

MetaNetX.org is implemented as a user-friendly and self-explanatory website that handles all user requests dynamically (Fig. 1a). It allows a user to access a collection of hundreds of published models, browse and select subsets for comparison and analysis, upload or modify new models and export models in conjunction with their results. Its functionality is based on a common namespace defined by MNXref (Bernard et al., 2012). In particular, all repository or user uploaded models are automatically translated with or without compartments into the common namespace; small deviations from the original model are possible due to the automatic reconciliation steps implemented by Bernard et al. (2012). However, a user can choose not to translate his model but still make use of the website’s functionalities. Furthermore, it is possible to augment the given reaction set by user-defined reactions, for example, for model augmentation.
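
To make the “common namespace” idea concrete, here is a toy sketch of that reconciliation step. Every identifier and mapping in it is invented; the real MNXref mappings are large curated tables, and this is not the MetaNetX code or API:

# Toy reconciliation of metabolite identifiers into a common namespace.
# Every identifier and mapping below is made up for the sketch.
TO_COMMON = {
    ("kegg", "C99901"): "MNX_0001",
    ("chebi", "88801"): "MNX_0001",   # same metabolite, different source
    ("kegg", "C99902"): "MNX_0002",
}

def translate_model(reactions):
    """Rewrite each reaction's metabolites into the common namespace and
    report any identifier that could not be mapped."""
    translated, unmapped = [], []
    for rxn_id, metabolites in reactions:
        common = []
        for source, local_id in metabolites:
            target = TO_COMMON.get((source, local_id))
            if target is None:
                unmapped.append((source, local_id))
                target = f"{source}:{local_id}"   # keep the original if unknown
            common.append(target)
        translated.append((rxn_id, common))
    return translated, unmapped

model = [("R1", [("kegg", "C99901"), ("kegg", "C99902")]),
         ("R2", [("chebi", "88801"), ("chebi", "12345")])]
print(translate_model(model))

The interesting (and hard) part is building and curating the mapping table; the translation itself is the easy bit.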

The bioinformatics community recognizes the intellectual poverty of lock-step models.

Wonder when the intelligence community is going to have that “aha” moment?

March 11, 2013

The Annotation-enriched non-redundant patent sequence databases [Curation vs. Search]

Filed under: Bioinformatics,Biomedical,Marketing,Medical Informatics,Patents,Topic Maps — Patrick Durusau @ 2:01 pm

The Annotation-enriched non-redundant patent sequence databases by Weizhong Li, Bartosz Kondratowicz, Hamish McWilliam, Stephane Nauche and Rodrigo Lopez.

Not a real promising title, is it? 😉 The reason I cite it here is that, through curation, the database is “non-redundant.”

Try searching for some of these sequences at the USPTO and compare the results.

The power of curation will be immediately obvious.

Abstract:

The EMBL-European Bioinformatics Institute (EMBL-EBI) offers public access to patent sequence data, providing a valuable service to the intellectual property and scientific communities. The non-redundant (NR) patent sequence databases comprise two-level nucleotide and protein sequence clusters (NRNL1, NRNL2, NRPL1 and NRPL2) based on sequence identity (level-1) and patent family (level-2). Annotation from the source entries in these databases is merged and enhanced with additional information from the patent literature and biological context. Corrections in patent publication numbers, kind-codes and patent equivalents significantly improve the data quality. Data are available through various user interfaces including web browser, downloads via FTP, SRS, Dbfetch and EBI-Search. Sequence similarity/homology searches against the databases are available using BLAST, FASTA and PSI-Search. In this article, we describe the data collection and annotation and also outline major changes and improvements introduced since 2009. Apart from data growth, these changes include additional annotation for singleton clusters, the identifier versioning for tracking entry change and the entry mappings between the two-level databases.

Database URL: http://www.ebi.ac.uk/patentdata/nr/
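
To see what level-1 clustering by sequence identity amounts to, here is a minimal greedy sketch in the spirit of that step. It is not the EMBL-EBI pipeline; the identity measure is crude and the sequences are made up:

from difflib import SequenceMatcher

def identity(a, b):
    """Crude identity measure between two sequences (0..1)."""
    return SequenceMatcher(None, a, b).ratio()

def greedy_cluster(seqs, threshold=0.9):
    """Assign each sequence to the first representative it matches at or
    above the threshold; otherwise it starts a new cluster."""
    clusters = []   # list of (representative sequence, [member ids])
    for seq_id, seq in sorted(seqs.items(), key=lambda kv: -len(kv[1])):
        for rep_seq, members in clusters:
            if identity(seq, rep_seq) >= threshold:
                members.append(seq_id)
                break
        else:
            clusters.append((seq, [seq_id]))
    return [members for _rep, members in clusters]

# Made-up sequences: two near-identical patent entries and one unrelated tag.
seqs = {
    "patentA_1": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "patentB_7": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVA",
    "patentC_3": "MGSSHHHHHHSSGLVPRGSHM",
}
print(greedy_cluster(seqs, threshold=0.9))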

Topic maps are curated data. Which do you prefer: curation or raw search?

March 1, 2013

Bellman’s GAP…

Filed under: Bioinformatics,Programming — Patrick Durusau @ 5:32 pm

Bellman’s GAP—a language and compiler for dynamic programming in sequence analysis by Georg Sauthoff, Mathias Möhl, Stefan Janssen and Robert Giegerich. (Bioinformatics (2013) 29 (5): 551-560. doi: 10.1093/bioinformatics/btt022)

Abstract:

Motivation: Dynamic programming is ubiquitous in bioinformatics. Developing and implementing non-trivial dynamic programming algorithms is often error prone and tedious. Bellman’s GAP is a new programming system, designed to ease the development of bioinformatics tools based on the dynamic programming technique.

Results: In Bellman’s GAP, dynamic programming algorithms are described in a declarative style by tree grammars, evaluation algebras and products formed thereof. This bypasses the design of explicit dynamic programming recurrences and yields programs that are free of subscript errors, modular and easy to modify. The declarative modules are compiled into C++ code that is competitive to carefully hand-crafted implementations.

This article introduces the Bellman’s GAP system and its language, GAP-L. It then demonstrates the ease of development and the degree of re-use by creating variants of two common bioinformatics algorithms. Finally, it evaluates Bellman’s GAP as an implementation platform of ‘real-world’ bioinformatics tools.

Availability: Bellman’s GAP is available under GPL license from http://bibiserv.cebitec.uni-bielefeld.de/bellmansgap. This Web site includes a repository of re-usable modules for RNA folding based on thermodynamics.

Contact: robert@techfak.uni-bielefeld.de

Supplementary information: Supplementary data are available at Bioinformatics online

The article focuses on bioinformatics, but dynamic programming is not limited to that field.
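
For readers new to the technique, here is a textbook dynamic programming recurrence from sequence analysis, global edit distance, in plain Python rather than GAP-L. It shows the kind of table-filling Bellman’s GAP generates from a declarative description:

def edit_distance(a, b):
    """Classic dynamic programming: dp[i][j] is the minimum number of
    insertions, deletions and substitutions turning a[:i] into b[:j]."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i                      # delete all of a[:i]
    for j in range(len(b) + 1):
        dp[0][j] = j                      # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # match / substitution
    return dp[len(a)][len(b)]

print(edit_distance("GATTACA", "GACTATA"))  # prints 2

The attraction of Bellman’s GAP is that you state the grammar and scoring algebra and skip writing recurrences and subscripts like these by hand.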

There is a very amusing story about how the field came to have the name “dynamic programming” in the Wikipedia article: Dynamic Programming.

February 21, 2013

NetGestalt for Data Visualization in the Context of Pathways

Filed under: Bioinformatics,Biomedical,Graphs,Networks,Visualization — Patrick Durusau @ 7:06 pm

NetGestalt for Data Visualization in the Context of Pathways by Stephen Turner.

From the post:

Many of you may be familiar with WebGestalt, a wonderful web utility developed by Bing Zhang at Vanderbilt for doing basic gene-set enrichment analyses. Last year, we invited Bing to speak at our annual retreat for the Vanderbilt Graduate Program in Human Genetics, and he did not disappoint! Bing walked us through his new tool called NetGestalt.

NetGestalt provides users with the ability to overlay large-scale experimental data onto biological networks. Data are loaded using continuous and binary tracks that can contain either single or multiple lines of data (called composite tracks). Continuous tracks could be gene expression intensities from microarray data or any other quantitative measure that can be mapped to the genome. Binary tracks are usually insertion/deletion regions, or called regions like ChIP peaks. NetGestalt extends many of the features of WebGestalt, including enrichment analysis for modules within a biological network, and provides easy ways to visualize the overlay of multiple tracks with Venn diagrams.

Stephen also points to documentation and video tutorials.

NetGestalt uses gene symbol as the gene identifier. Data that uses other gene identifiers must be mapped to gene symbols before uploading. (Manual, page 4)

An impressive alignment of data sources even with the restriction to gene symbols.
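
Here is a minimal sketch of the mapping step the manual requires, rewriting non-symbol identifiers as gene symbols before upload. The probe identifiers and mapping table are invented; in practice the mapping comes from your platform’s annotation file or a service such as BioMart:

import csv
import io

# Invented probe-to-symbol mapping for illustration only.
PROBE_TO_SYMBOL = {
    "probe_0001_at": "TP53",
    "probe_0002_at": "BRCA1",
    "probe_0003_at": "EGFR",
}

def to_symbol_track(expression_rows):
    """Rewrite (probe_id, value) rows as (gene_symbol, value) rows,
    dropping probes without a known symbol."""
    track = []
    for probe_id, value in expression_rows:
        symbol = PROBE_TO_SYMBOL.get(probe_id)
        if symbol is not None:
            track.append((symbol, value))
    return track

rows = [("probe_0001_at", 8.2), ("probe_0002_at", 5.9), ("probe_9999_at", 3.1)]

# Write a simple tab-separated continuous track ready for upload.
buf = io.StringIO()
csv.writer(buf, delimiter="\t").writerows(to_symbol_track(rows))
print(buf.getvalue())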

February 15, 2013

Using molecular networks to assess molecular similarity

Systems chemistry: Using molecular networks to assess molecular similarity by Bailey Fallon.

From the post:

In new research published in Journal of Systems Chemistry, Sijbren Otto and colleagues have provided the first experimental approach towards molecular networks that can predict bioactivity based on an assessment of molecular similarity.

Molecular similarity is an important concept in drug discovery. Molecules that share certain features such as shape, structure or hydrogen bond donor/acceptor groups may have similar properties that make them common to a particular target. Assessment of molecular similarity has so far relied almost exclusively on computational approaches, but Dr Otto reasoned that a measure of similarity might be obtained by interrogating the molecules in solution experimentally.

Important work for drug discovery, but there are semantic lessons here as well:

Tests for similarity/sameness are domain specific.

Which means there are no universal tests for similarity/sameness.

Lacking universal tests for similarity/sameness, we should focus on developing documented, domain-specific tests for similarity/sameness.

Domain-specific tests provide quicker ROI than less useful and ultimately doomed universal solutions.

Documented domain-specific tests may, with no guarantees, enable us to find commonalities between domain measures of similarity/sameness.

But our conclusions will be based on domain experience and not on projecting from our own domain onto other, less well known domains.
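
To make the point concrete, here is a toy comparison of two domain-specific similarity tests: a set-based (Jaccard/Tanimoto) score over structural features and a character-level score over names. The feature sets are invented and neither test is from the paper:

from difflib import SequenceMatcher

def structural_similarity(features_a, features_b):
    """Set-based (Jaccard/Tanimoto) similarity over structural features."""
    a, b = set(features_a), set(features_b)
    return len(a & b) / len(a | b) if a | b else 1.0

def name_similarity(name_a, name_b):
    """Character-level similarity over compound names."""
    return SequenceMatcher(None, name_a, name_b).ratio()

# Feature sets are invented for illustration, not real chemical annotations.
aspirin        = ("aspirin",        {"benzene", "ester", "carboxylic_acid"})
salicylic_acid = ("salicylic acid", {"benzene", "hydroxyl", "carboxylic_acid"})
aspartame      = ("aspartame",      {"ester", "amide", "amine"})

for name, features in (salicylic_acid, aspartame):
    print(f"{name:15s}"
          f" structure={structural_similarity(aspirin[1], features):.2f}"
          f" name={name_similarity(aspirin[0], name):.2f}")

The two tests rank the same candidates differently. Neither is “wrong”; each answers its own domain’s question, which is the point.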
