Archive for the ‘Bioinformatics’ Category
Friday, May 17th, 2013
A self-updating road map of The Cancer Genome Atlas by David E. Robbins, Alexander Grüneberg, Helena F. Deus, Murat M. Tanik and Jonas S. Almeida. (Bioinformatics (2013) 29 (10): 1333-1340. doi: 10.1093/bioinformatics/btt141)
Abstract:
Motivation: Since 2011, The Cancer Genome Atlas’ (TCGA) files have been accessible through HTTP from a public site, creating entirely new possibilities for cancer informatics by enhancing data discovery and retrieval. Significantly, these enhancements enable the reporting of analysis results that can be fully traced to and reproduced using their source data. However, to realize this possibility, a continually updated road map of files in the TCGA is required. Creation of such a road map represents a significant data modeling challenge, due to the size and fluidity of this resource: each of the 33 cancer types is instantiated in only partially overlapping sets of analytical platforms, while the number of data files available doubles approximately every 7 months.
Results: We developed an engine to index and annotate the TCGA files, relying exclusively on third-generation web technologies (Web 3.0). Specifically, this engine uses JavaScript in conjunction with the World Wide Web Consortium’s (W3C) Resource Description Framework (RDF), and SPARQL, the query language for RDF, to capture metadata of files in the TCGA open-access HTTP directory. The resulting index may be queried using SPARQL, and enables file-level provenance annotations as well as discovery of arbitrary subsets of files, based on their metadata, using web standard languages. In turn, these abilities enhance the reproducibility and distribution of novel results delivered as elements of a web-based computational ecosystem. The development of the TCGA Roadmap engine was found to provide specific clues about how biomedical big data initiatives should be exposed as public resources for exploratory analysis, data mining and reproducible research. These specific design elements align with the concept of knowledge reengineering and represent a sharp departure from top-down approaches in grid initiatives such as CaBIG. They also present a much more interoperable and reproducible alternative to the still pervasive use of data portals.
Availability: A prepared dashboard, including links to source code and a SPARQL endpoint, is available at http://bit.ly/TCGARoadmap. A video tutorial is available at http://bit.ly/TCGARoadmapTutorial.
Curious how the granularity of required semantics and the uniformity of the underlying data set impact the choice of semantic approaches?
And does access to data files present different challenges than say access to research publications in the same field?
Posted in Bioinformatics, Biology, Biomedical, Medical Informatics, RDF, SPARQL, Semantic Web | No Comments »
Thursday, May 16th, 2013
HAL: a hierarchical format for storing and analyzing multiple genome alignments by Glenn Hickey, Benedict Paten, Dent Earl, Daniel Zerbino and David Haussler. (Bioinformatics (2013) 29 (10): 1341-1342. doi: 10.1093/bioinformatics/btt128)
Abstract:
Motivation: Large multiple genome alignments and inferred ancestral genomes are ideal resources for comparative studies of molecular evolution, and advances in sequencing and computing technology are making them increasingly obtainable. These structures can provide a rich understanding of the genetic relationships between all subsets of species they contain. Current formats for storing genomic alignments, such as XMFA and MAF, are all indexed or ordered using a single reference genome, however, which limits the information that can be queried with respect to other species and clades. This loss of information grows with the number of species under comparison, as well as their phylogenetic distance.
Results: We present HAL, a compressed, graph-based hierarchical alignment format for storing multiple genome alignments and ancestral reconstructions. HAL graphs are indexed on all genomes they contain. Furthermore, they are organized phylogenetically, which allows for modular and parallel access to arbitrary subclades without fragmentation because of rearrangements that have occurred in other lineages. HAL graphs can be created or read with a comprehensive C++ API. A set of tools is also provided to perform basic operations, such as importing and exporting data, identifying mutations and coordinate mapping (liftover).
Availability: All documentation and source code for the HAL API and tools are freely available at http://github.com/glennhickey/hal.
Important work for bioinformatics and genome alignment as well as specializing graphs for that work.
Graphs are a popular subject these days but successful projects will rely on graphs with particular properties and structures to be useful.
The more examples of graph-based projects, the more we learn about general principles of graphs for particular applications or requirements.
Posted in Bioinformatics, Genomics, Graphs | No Comments »
Wednesday, May 15th, 2013
EDAM: an ontology of bioinformatics operations, types of data and identifiers, topics and formats by Jon Ison, Matúš Kalaš, Inge Jonassen, Dan Bolser, Mahmut Uludag, Hamish McWilliam, James Malone, Rodrigo Lopez, Steve Pettifer and Peter Rice. (Bioinformatics (2013) 29 (10): 1325-1332. doi: 10.1093/bioinformatics/btt113)
Abstract:
Motivation: Advancing the search, publication and integration of bioinformatics tools and resources demands consistent machine-understandable descriptions. A comprehensive ontology allowing such descriptions is therefore required.
Results: EDAM is an ontology of bioinformatics operations (tool or workflow functions), types of data and identifiers, application domains and data formats. EDAM supports semantic annotation of diverse entities such as Web services, databases, programmatic libraries, standalone tools, interactive applications, data schemas, datasets and publications within bioinformatics. EDAM applies to organizing and finding suitable tools and data and to automating their integration into complex applications or workflows. It includes over 2200 defined concepts and has successfully been used for annotations and implementations.
Availability: The latest stable version of EDAM is available in OWL format from http://edamontology.org/EDAM.owl and in OBO format from http://edamontology.org/EDAM.obo. It can be viewed online at the NCBO BioPortal and the EBI Ontology Lookup Service. For documentation and license please refer to http://edamontology.org. This article describes version 1.2 available at http://edamontology.org/EDAM_1.2.owl.
No matter how many times I read it, I just don’t get:
Advancing the search, publication and integration of bioinformatics tools and resources demands consistent machine-understandable descriptions. A comprehensive ontology allowing such descriptions is therefore required.
I will be generous and assume the authors meant “machine-processable descriptions” when I read “machine-understandable descriptions.” It is well known that machines don’t “understand” data; they simply process it according to specified instructions.
But more to the point, machines are indifferent to the type or number of descriptions they have for any subject. It might confuse a human processor to have thirty (30) different descriptions for the same subject but there has been no showing of such a limit for machines.
Every effort to produce a “comprehensive” ontology/classification/taxonomy, pick your brand of poison, has been in the face of competing and different descriptions. That is, after all, the rationale for a comprehensive …, that there are too many choices already.
The outcome of all such efforts, assuming there are N diverse descriptions, is N + 1 diverse descriptions, the 1 being the current project added to the existing diverse descriptions.
Posted in Bioinformatics, Ontology | No Comments »
Sunday, April 28th, 2013
Scientific Lenses over Linked Data: An approach to support task specific views of the data. A vision. by Christian Brenninkmeijer, Chris Evelo, Carole Goble, Alasdair J G Gray, Paul Groth, Steve Pettifer, Robert Stevens, Antony J Williams, and Egon L Willighagen.
Abstract:
Within complex scientific domains such as pharmacology, operational equivalence between two concepts is often context-, user- and task-specific. Existing Linked Data integration procedures and equivalence services do not take the context and task of the user into account. We present a vision for enabling users to control the notion of operational equivalence by applying scientific lenses over Linked Data. The scientific lenses vary the links that are activated between the datasets which affects the data returned to the user.
Two additional quotes from this paper should convince you of the importance of this work:
We aim to support users in controlling and varying their view of the data by applying a scientific lens which govern the notions of equivalence applied to the data. Users will be able to change their lens based on the task and role they are performing rather than having one fixed lens. To support this requirement, we propose an approach that applies context dependent sets of equality links. These links are stored in a stand-off fashion so that they are not intermingled with the datasets. This allows for multiple, context-dependent, linksets that can evolve without impact on the underlying datasets and support differing opinions on the relationships between data instances. This flexibility is in contrast to both Linked Data and traditional data integration approaches. We look at the role personae can play in guiding the nature of relationships between the data resources and the desired affects of applying scientific lenses over Linked Data.
and,
Within scientific datasets it is common to find links to the “equivalent” record in another dataset. However, there is no declaration of the form of the relationship. There is a great deal of variation in the notion of equivalence implied by the links both within a dataset’s usage and particularly across datasets, which degrades the quality of the data. The scientific user personae have very different needs about the notion of equivalence that should be applied between datasets. The users need a simple mechanism by which they can change the operational equivalence applied between datasets. We propose the use of scientific lenses.
Obvious questions:
Does your topic map software support multiple operational equivalences?
Does your topic map interface enable users to choose “lenses” (I like lenses better than roles) to view equivalence?
Does your topic map software support declaring the nature of equivalence?
I first saw this in the slide deck: Scientific Lenses: Supporting Alternative Views of the Data by Alasdair J G Gray at: 4th Open PHACTS Community Workshop.
BTW, the notion of equivalence being represented by “links” reminds me of a comment Peter Neubauer (Neo4j) once made to me, saying that equivalence could be modeled as edges. Imagine typing equivalence edges. Will have to think about that some more.
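A minimal sketch of what typed equivalence edges plus a user-chosen lens might look like. Everything here is an assumption for illustration: the record identifiers, the link types and the lens names are invented, and nothing is taken from the paper or from the Open PHACTS API. The links are kept "stand-off" (apart from the records), and a lens is simply the set of link types it activates.

```python
from collections import defaultdict

# Stand-off equivalence links: (record_a, record_b, link_type).
# All identifiers and link types below are made up for illustration.
LINKS = [
    ("chembl:A", "drugbank:A", "exact_match"),
    ("drugbank:A", "chemspider:A_salt", "salt_form_of"),
    ("chemspider:A_salt", "chebi:A_isomer", "stereoisomer_of"),
]

# A lens is just the set of link types it activates (hypothetical names).
LENSES = {
    "strict": {"exact_match"},
    "pharmacology": {"exact_match", "salt_form_of"},
    "broad": {"exact_match", "salt_form_of", "stereoisomer_of"},
}


def equivalents(record, lens, links=LINKS):
    """Return every record reachable from `record` over links the lens activates."""
    active = LENSES[lens]
    neighbours = defaultdict(set)
    for a, b, kind in links:
        if kind in active:
            neighbours[a].add(b)
            neighbours[b].add(a)
    # Plain graph traversal over the activated edges only.
    seen, todo = {record}, [record]
    while todo:
        for nxt in neighbours[todo.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen
```

Changing the lens changes the answer to "what is the same as chembl:A?" without touching the underlying records, which is the point of keeping the links separate from the data.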
Posted in Bioinformatics, Biomedical, Drug Discovery, Linked Data, Medical Informatics, Operational Equivalence, Science | 2 Comments »
Sunday, April 28th, 2013
4th Open PHACTS Community Workshop : Using the power of Open PHACTS
From the post:
The fourth Open PHACTS Community Workshop was held at Burlington House in London on April 22 and 23, 2013. The Workshop focussed on “Using the Power of Open PHACTS” and featured the public release of the Open PHACTS application programming interface (API) and the first Open PHACTS example app, ChemBioNavigator.
The first day featured talks describing the data accessible via the Open PHACTS Discovery Platform and technical aspects of the API. The use of the API by example applications ChemBioNavigator and PharmaTrek was outlined, and the results of the Accelrys Pipeline Pilot Hackathon discussed.
The second day involved discussion of Open PHACTS sustainability and plans for the successor organisation, the Open PHACTS Foundation. The afternoon was attended by those keen to further discuss the potential of the Open PHACTS API and the future of Open PHACTS.
During talks, especially those detailing the Open PHACTS API, a good number of signup requests to the API via dev.openphacts.org were received. The hashtag #opslaunch was used to follow reactions to the workshop on Twitter (see storify), and showed the response amongst attendees to be overwhelmingly positive.
This summary is followed by slides from the two days of presentations.
Not like being there but still quite useful.
As a matter of fact, I found a lead on “operational equivalence” with this data set. More to follow in a separate post.
Posted in Bioinformatics, Biomedical, Drug Discovery, Linked Data, Medical Informatics | 1 Comment »
Wednesday, April 24th, 2013
Brain: biomedical knowledge manipulation by Samuel Croset, John P. Overington and Dietrich Rebholz-Schuhmann. (Bioinformatics (2013) 29 (9): 1238-1239. doi: 10.1093/bioinformatics/btt109)
Abstract:
Summary: Brain is a Java software library facilitating the manipulation and creation of ontologies and knowledge bases represented with the Web Ontology Language (OWL).
Availability and implementation: The Java source code and the library are freely available at https://github.com/loopasam/Brain and on the Maven Central repository (GroupId: uk.ac.ebi.brain). The documentation is available at https://github.com/loopasam/Brain/wiki.
Contact: croset@ebi.ac.uk
Supplementary information: Supplementary data are available at Bioinformatics online.
Odd how things like the topic naming constraint show up in unexpected contexts.
This article may be helpful if you are required to create or read OWL based data.
But as I read the article I saw:
The names (short forms) of OWL entities handled by a Brain object have to be unique. It is for instance not possible to add an OWL class, such as http://www.example.org/Cell to the ontology if an OWL entity with the short form ‘Cell’ already exists.
The explanation?
Despite being in contradiction with some Semantic Web principles, this design prevents ambiguous queries and hides as much as possible the cumbersome interaction with prefixes and Internationalized Resource Identifiers (IRI).
I suppose, but doesn’t ambiguity exist in the mind of the user? That is, they use a term that can have more than one meaning?
Having unique terms simply means inventing odd terms that no user will know.
Rather than unambiguous, isn’t that unfound?
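To make the trade-off concrete, here is a small sketch contrasting the two designs. It is not Brain's API (Brain is a Java library); the class and function names are hypothetical and the second IRI is an invented example. The first registry refuses a second entity with an existing short form, as the article describes; the second keeps every IRI and lets a lookup return all candidates.

```python
def short_form(iri):
    # Last path segment or fragment, e.g. ".../Cell" -> "Cell".
    return iri.replace("#", "/").rstrip("/").rsplit("/", 1)[-1]


class UniqueShortForms:
    """Rejects a second IRI whose short form is already taken."""

    def __init__(self):
        self.by_name = {}

    def add(self, iri):
        name = short_form(iri)
        if name in self.by_name and self.by_name[name] != iri:
            raise ValueError(
                f"short form {name!r} already used by {self.by_name[name]}"
            )
        self.by_name[name] = iri


class AmbiguousShortForms:
    """Keeps every IRI; a short form may resolve to several entities."""

    def __init__(self):
        self.by_name = {}

    def add(self, iri):
        self.by_name.setdefault(short_form(iri), set()).add(iri)

    def lookup(self, name):
        # The caller (or the user) gets to choose among the candidates.
        return sorted(self.by_name.get(name, ()))
```

With the second design a query for "Cell" is ambiguous, but the ambiguity is surfaced to the user instead of being pushed into an invented name.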
Posted in Bioinformatics, OWL, Semantic Web | No Comments »
Saturday, April 20th, 2013
PhenoMiner: quantitative phenotype curation at the rat genome database by Stanley J. F. Laulederkind, et al. (Database (2013) 2013 : bat015 doi: 10.1093/database/bat015)
Abstract:
The Rat Genome Database (RGD) is the premier repository of rat genomic and genetic data and currently houses >40 000 rat gene records as well as human and mouse orthologs, >2000 rat and 1900 human quantitative trait loci (QTLs) records and >2900 rat strain records. Biological information curated for these data objects includes disease associations, phenotypes, pathways, molecular functions, biological processes and cellular components. Recently, a project was initiated at RGD to incorporate quantitative phenotype data for rat strains, in addition to the currently existing qualitative phenotype data for rat strains, QTLs and genes. A specialized curation tool was designed to generate manual annotations with up to six different ontologies/vocabularies used simultaneously to describe a single experimental value from the literature. Concurrently, three of those ontologies needed extensive addition of new terms to move the curation forward. The curation interface development, as well as ontology development, was an ongoing process during the early stages of the PhenoMiner curation project.
Database URL: http://rgd.mcw.edu
The line:
A specialized curation tool was designed to generate manual annotations with up to six different ontologies/vocabularies used simultaneously to describe a single experimental value from the literature.
sounded relevant to topic maps.
Turns out to be five ontologies and the article reports:
The ‘Create Record’ page (Figure 4) is where the rest of the data for a single record is entered. It consists of a series of autocomplete text boxes, drop-down text boxes and editable plain text boxes. All of the data entered are associated with terms from five ontologies/vocabularies: RS, CMO, MMO, XCO and the optional MA (Mouse Adult Gross Anatomy Dictionary) (13)
Important to note that authoring does not require the user to make explicit the properties underlying any of the terms from the different ontologies.
Some users probably know that level of detail but what is important is the capturing of their knowledge of subject sameness.
A topic map extension/add-on to such a system could flesh out those bare terms to provide a basis for treating terms from different ontologies as terms for the same subjects.
That merging/mapping detail need not bother an author or casual user.
But it increases the odds that future data sets can be reliably integrated with this one.
And issues with the correctness of a mapping can be meaningfully investigated.
If it helps, think of correctness of mapping as accountability, for someone else.
Posted in Authoring Topic Maps, Bioinformatics, Biology, Genomics, Interface Research/Design | No Comments »
Tuesday, April 16th, 2013
The non-negative matrix factorization toolbox for biological data mining by Yifeng Li and Alioune Ngom. (Source Code for Biology and Medicine 2013, 8:10 doi:10.1186/1751-0473-8-10)
From the post:
Background: Non-negative matrix factorization (NMF) has been introduced as an important method for mining biological data. Though there currently exists packages implemented in R and other programming languages, they either provide only a few optimization algorithms or focus on a specific application field. There does not exist a complete NMF package for the bioinformatics community, and in order to perform various data mining tasks on biological data.
Results: We provide a convenient MATLAB toolbox containing both the implementations of various NMF techniques and a variety of NMF-based data mining approaches for analyzing biological data. Data mining approaches implemented within the toolbox include data clustering and bi-clustering, feature extraction and selection, sample classification, missing values imputation, data visualization, and statistical comparison.
Conclusions: A series of analysis such as molecular pattern discovery, biological process identification, dimension reduction, disease prediction, visualization, and statistical comparison can be performed using this toolbox.
Written in a bioinformatics context but also used in text data mining (Enron emails), spectral analysis and other data mining fields. (See Non-negative matrix factorization)
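For readers who have not met NMF before, a bare-bones sketch of the classic multiplicative-update rules (Lee and Seung) in plain Python. This is not code from the MATLAB toolbox and it implements none of its variants; the function names, the `eps` guard and the defaults are my own choices for illustration. It approximates a non-negative matrix V as the product W × H, with both factors kept non-negative.

```python
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum(
        (v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0]))
    )


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factor non-negative V (n x m) into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h
```

Because every update multiplies by a non-negative ratio, the factors never go negative, which is what makes the parts-based reading of W and H possible.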
Posted in Bioinformatics, Data Mining, Matrix | No Comments »
Sunday, April 14th, 2013
Planform: an application and database of graph-encoded planarian regenerative experiments by Daniel Lobo, Taylor J. Malone and Michael Levin. Bioinformatics (2013) 29 (8): 1098-1100. doi: 10.1093/bioinformatics/btt088
Abstract:
Summary: Understanding the mechanisms governing the regeneration capabilities of many organisms is a fundamental interest in biology and medicine. An ever-increasing number of manipulation and molecular experiments are attempting to discover a comprehensive model for regeneration, with the planarian flatworm being one of the most important model species. Despite much effort, no comprehensive, constructive, mechanistic models exist yet, and it is now clear that computational tools are needed to mine this huge dataset. However, until now, there is no database of regenerative experiments, and the current genotype–phenotype ontologies and databases are based on textual descriptions, which are not understandable by computers. To overcome these difficulties, we present here Planform (Planarian formalization), a manually curated database and software tool for planarian regenerative experiments, based on a mathematical graph formalism. The database contains more than a thousand experiments from the main publications in the planarian literature. The software tool provides the user with a graphical interface to easily interact with and mine the database. The presented system is a valuable resource for the regeneration community and, more importantly, will pave the way for the application of novel artificial intelligence tools to extract knowledge from this dataset.
Availability: The database and software tool are freely available at http://planform.daniel-lobo.com.
Watch the video tour for an example of a domain specific authoring tool.
It does not use any formal graph notation/terminology or attempt a new form of ASCII art.
Users can enter data about worms with four (4) heads. That bodes well for new techniques to author topic maps.
On the use of graphs, the authors write:
We have created a formalism based on graphs to encode the resultant morphologies and manipulations of regenerative experiments (Lobo et al., 2013). Mathematical graphs are ideal to encode relationships between individuals and have been previously used to encode morphologies (Lobo et al., 2011). The formalism divided a morphology into adjacent regions (graph nodes) connected to each other (graph edges). The geometrical characteristics of the regions (connection angles, distances, shapes, type, etc.) are stored as node and link labels. Importantly, the formalism permits automatic comparisons between morphologies: we implemented a metric to quantify the difference between two morphologies based on the graph edit distance algorithm.
The experiment manipulations are encoded in a tree structure. Nodes represent specific manipulations (cuts, irradiation and transplantations) where links define the order and relations between manipulations. This approach permits encode the majority of published planarian regenerative experiments.
The graph vs. relational crowd will be disappointed to learn the project uses SQLite (“the most widely deployed SQL database engine in the world”) for the storage/access to its data.
You were aware that hypergraphs were used to model relational databases in the “old days.” Yes?
I will try to pull together some of those publications in the near future.
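In the meantime, a small sketch of why a graph formalism and a relational store are not at odds: regions become rows in one table, links between regions become rows in another. The schema, column names and sample data are assumptions of mine, not Planform's actual database layout. It uses Python's built-in `sqlite3` module with an in-memory database.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database holding one small morphology graph."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE region (id INTEGER PRIMARY KEY, kind TEXT NOT NULL);
        CREATE TABLE link (
            src INTEGER NOT NULL REFERENCES region(id),
            dst INTEGER NOT NULL REFERENCES region(id),
            angle REAL,
            PRIMARY KEY (src, dst)
        );
    """)
    # A hypothetical two-headed worm: trunk connected to two heads and a tail.
    db.executemany("INSERT INTO region VALUES (?, ?)",
                   [(1, "head"), (2, "trunk"), (3, "tail"), (4, "head")])
    db.executemany("INSERT INTO link VALUES (?, ?, ?)",
                   [(1, 2, 0.0), (2, 3, 180.0), (2, 4, 90.0)])
    return db


def neighbours(db, region_id):
    """Regions adjacent to `region_id`, treating links as undirected edges."""
    rows = db.execute("""
        SELECT r.id, r.kind FROM link l JOIN region r ON r.id = l.dst
        WHERE l.src = ?
        UNION
        SELECT r.id, r.kind FROM link l JOIN region r ON r.id = l.src
        WHERE l.dst = ?
        ORDER BY 1
    """, (region_id, region_id))
    return rows.fetchall()
```

Node and edge labels (kind, angle) sit in ordinary columns, so "how many heads does this morphology have?" is a one-line `SELECT COUNT(*)`.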
Posted in Bioinformatics, Graphs, SQL, SQLite | No Comments »
Thursday, April 11th, 2013
Efficient comparison of sets of intervals with NC-lists by Matthias Zytnicki, YuFei Luo and Hadi Quesneville. (Bioinformatics (2013) 29 (7): 933-939. doi: 10.1093/bioinformatics/btt070)
Abstract:
Motivation: High-throughput sequencing produces in a small amount of time a large amount of data, which are usually difficult to analyze. Mapping the reads to the transcripts they originate from, to quantify the expression of the genes, is a simple, yet time demanding, example of analysis. Fast genomic comparison algorithms are thus crucial for the analysis of the ever-expanding number of reads sequenced.
Results: We used NC-lists to implement an algorithm that compares a set of query intervals with a set of reference intervals in two steps. The first step, a pre-processing done once for all, requires time O[#R log(#R) + #Q log(#Q)], where Q and R are the sets of query and reference intervals. The search phase requires constant space, and time O(#R + #Q + #M), where M is the set of overlaps. We showed that our algorithm compares favorably with five other algorithms, especially when several comparisons are performed.
Availability: The algorithm has been included to S–MART, a versatile tool box for RNA-Seq analysis, freely available at http://urgi.versailles.inra.fr/Tools/S-Mart. The algorithm can be used for many kinds of data (sequencing reads, annotations, etc.) in many formats (GFF3, BED, SAM, etc.), on any operating system. It is thus readily useable for the analysis of next-generation sequencing data.
Before you search for “NC-lists,” be aware that you will get this article as the first “hit” today in some popular search engines. Followed by a variety of lists for North Carolina.
A more useful search engine would allow me to choose the correct usage of a term and to re-run the query using the distinguished subject.
The expansion helps: Nested Containment List (NCList).
Familiar if you are working in bioinformatics.
More generally, consider the need to compare complex sequences of values for merging purposes.
Not a magic bullet but a technique you should keep in mind.
Origin: Nested Containment List (NCList): a new algorithm for accelerating interval query of genome alignment and interval databases, Alexander V. Alekseyenko and Christopher J. Lee. (Bioinformatics (2007) 23 (11): 1386-1393. doi: 10.1093/bioinformatics/btl647)
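A rough sketch of the nested containment idea, for those who want to see it in code. This is my own simplified reading, not the S-MART implementation and not the two-step set comparison from the 2013 article: intervals are half-open `(start, end)` tuples, any interval fully contained in another is pushed into that interval's sublist, and each list then has both starts and ends in increasing order, so a binary search finds where to begin scanning.

```python
def build_nclist(intervals):
    """Nest each interval under the nearest interval that contains it."""
    # Sort by start ascending, end descending, so containers come first.
    items = sorted(intervals, key=lambda iv: (iv[0], -iv[1]))
    root, stack = [], []  # stack holds the chain of currently open containers
    for start, end in items:
        # Pop containers that do not fully contain the new interval.
        while stack and end > stack[-1][1]:
            stack.pop()
        node = (start, end, [])
        (stack[-1][2] if stack else root).append(node)
        stack.append(node)
    return root


def _first_ending_after(nodes, pos):
    # Binary search: ends are increasing within one (sub)list.
    lo, hi = 0, len(nodes)
    while lo < hi:
        mid = (lo + hi) // 2
        if nodes[mid][1] > pos:
            hi = mid
        else:
            lo = mid + 1
    return lo


def find_overlaps(nodes, q_start, q_end):
    """All stored intervals overlapping the half-open query [q_start, q_end)."""
    hits = []
    i = _first_ending_after(nodes, q_start)
    while i < len(nodes) and nodes[i][0] < q_end:
        start, end, children = nodes[i]
        hits.append((start, end))
        # Only intervals that overlap can have overlapping children.
        hits.extend(find_overlaps(children, q_start, q_end))
        i += 1
    return hits
```

For example, `find_overlaps(build_nclist([(1, 10), (2, 5), (3, 4), (6, 9), (12, 15)]), 4, 7)` visits `(1, 10)` and its two children but never looks inside `(2, 5)` for a match, since `(3, 4)` ends before the query starts.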
Posted in Bioinformatics, Set Intersection, Sets | No Comments »
Sunday, April 7th, 2013
Open PHACTS – Open Pharmacological Space
From the homepage:
Open PHACTS is building an Open Pharmacological Space in a 3-year knowledge management project of the Innovative Medicines Initiative (IMI), a unique partnership between the European Community and the European Federation of Pharmaceutical Industries and Associations (EFPIA).
The project is due to end in March 2014, and aims to deliver a sustainable service to continue after the project funding ends. The project consortium consists of leading academics in semantics, pharmacology and informatics, driven by solid industry business requirements: 28 partners, including 9 pharmaceutical companies and 3 biotechs.
Source code has just appeared on GitHub: OpenPHACTS.
Important to different communities for different reasons. My interest isn’t the same as BigPharma.
A project to watch as they navigate the thickets of vocabularies, ontologies and other semantically diverse information sources.
Posted in Bioinformatics, Biology, Biomedical, Data Integration, Medical Informatics, Pharmaceutical Research | No Comments »
Saturday, March 23rd, 2013
Using Bayesian networks to discover relations between genes, environment, and disease by Chengwei Su, Angeline Andrew, Margaret R Karagas and Mark E Borsuk. (BioData Mining 2013, 6:6 doi:10.1186/1756-0381-6-6)
Abstract:
We review the applicability of Bayesian networks (BNs) for discovering relations between genes, environment, and disease. By translating probabilistic dependencies among variables into graphical models and vice versa, BNs provide a comprehensible and modular framework for representing complex systems. We first describe the Bayesian network approach and its applicability to understanding the genetic and environmental basis of disease. We then describe a variety of algorithms for learning the structure of a network from observational data. Because of their relevance to real-world applications, the topics of missing data and causal interpretation are emphasized. The BN approach is then exemplified through application to data from a population-based study of bladder cancer in New Hampshire, USA. For didactical purposes, we intentionally keep this example simple. When applied to complete data records, we find only minor differences in the performance and results of different algorithms. Subsequent incorporation of partial records through application of the EM algorithm gives us greater power to detect relations. Allowing for network structures that depart from a strict causal interpretation also enhances our ability to discover complex associations including gene-gene (epistasis) and gene-environment interactions. While BNs are already powerful tools for the genetic dissection of disease and generation of prognostic models, there remain some conceptual and computational challenges. These include the proper handling of continuous variables and unmeasured factors, the explicit incorporation of prior knowledge, and the evaluation and communication of the robustness of substantive conclusions to alternative assumptions and data manifestations.
From the introduction:
BNs have been applied in a variety of settings for the purposes of causal study and probabilistic prediction, including medical diagnosis, crime and terrorism risk, forensic science, and ecological conservation (see [7]). In bioinformatics, they have been used to analyze gene expression data [8,9], derive protein signaling networks [10-12], predict protein-protein interactions [13], perform pedigree analysis [14], conduct genetic epidemiological studies [5], and assess the performance of microsatellite markers on cancer recurrence [15].
Not to mention criminal investigations: Bayesian Network – [Crime Investigation] (YouTube).
Once relations are discovered, you are free to decorate them with roles, properties, etc., in other words, associations.
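For a feel of what a (very) small Bayesian network does once its structure is known, here is a toy sketch with three binary variables: a gene variant and an environmental exposure, both parents of disease. The structure and every probability below are invented for illustration and have nothing to do with the bladder cancer data in the paper. The sketch only answers queries by brute-force enumeration; it does not do the structure learning that the review is about.

```python
from itertools import product

# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
P_GENE = 0.3        # P(gene variant present)
P_EXPOSURE = 0.4    # P(exposed)
P_DISEASE = {       # P(disease | gene, exposure)
    (0, 0): 0.01,
    (0, 1): 0.05,
    (1, 0): 0.03,
    (1, 1): 0.20,
}


def joint(g, e, d):
    """P(G=g, E=e, D=d) under the factorization P(G) P(E) P(D | G, E)."""
    pg = P_GENE if g else 1 - P_GENE
    pe = P_EXPOSURE if e else 1 - P_EXPOSURE
    pd = P_DISEASE[(g, e)] if d else 1 - P_DISEASE[(g, e)]
    return pg * pe * pd


def probability(query, evidence=None):
    """P(query | evidence); both are dicts over the keys 'g', 'e', 'd'."""
    evidence = evidence or {}
    num = den = 0.0
    for g, e, d in product((0, 1), repeat=3):
        world = {"g": g, "e": e, "d": d}
        if any(world[k] != v for k, v in evidence.items()):
            continue
        p = joint(g, e, d)
        den += p
        if all(world[k] == v for k, v in query.items()):
            num += p
    return num / den
```

`probability({"d": 1})` gives the marginal disease risk, while `probability({"e": 1}, {"d": 1})` asks how likely exposure is given that disease was observed. Each arrow in the network is a relation that could later be decorated with roles and properties.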
Posted in Bayesian Data Analysis, Bayesian Models, Bioinformatics, Medical Informatics | No Comments »
Saturday, March 16th, 2013
MetaNetX.org: a website and repository for accessing, analysing and manipulating metabolic networks by Mathias Ganter, Thomas Bernard, Sébastien Moretti, Joerg Stelling and Marco Pagni. (Bioinformatics (2013) 29 (6): 815-816. doi: 10.1093/bioinformatics/btt036)
Abstract:
MetaNetX.org is a website for accessing, analysing and manipulating genome-scale metabolic networks (GSMs) as well as biochemical pathways. It consistently integrates data from various public resources and makes the data accessible in a standardized format using a common namespace. Currently, it provides access to hundreds of GSMs and pathways that can be interactively compared (two or more), analysed (e.g. detection of dead-end metabolites and reactions, flux balance analysis or simulation of reaction and gene knockouts), manipulated and exported. Users can also upload their own metabolic models, choose to automatically map them into the common namespace and subsequently make use of the website’s functionality.
http://metanetx.org.
The authors are addressing a familiar problem:
Genome-scale metabolic networks (GSMs) consist of compartmentalized reactions that consistently combine biochemical, genetic and genomic information. When also considering a biomass reaction and both uptake and secretion reactions, GSMs are often used to study genotype–phenotype relationships, to direct new discoveries and to identify targets in metabolic engineering (Karr et al., 2012). However, a major difficulty in GSM comparisons and reconstructions is to integrate data from different resources with different nomenclatures and conventions for both metabolites and reactions. Hence, GSM consolidation and comparison may be impossible without detailed biological knowledge and programming skills. (emphasis added)
For which they propose an uncommon solution:
MetaNetX.org is implemented as a user-friendly and self-explanatory website that handles all user requests dynamically (Fig. 1a). It allows a user to access a collection of hundreds of published models, browse and select subsets for comparison and analysis, upload or modify new models and export models in conjunction with their results. Its functionality is based on a common namespace defined by MNXref (Bernard et al., 2012). In particular, all repository or user uploaded models are automatically translated with or without compartments into the common namespace; small deviations from the original model are possible due to the automatic reconciliation steps implemented by Bernard et al. (2012). However, a user can choose not to translate his model but still make use of the website’s functionalities. Furthermore, it is possible to augment the given reaction set by user-defined reactions, for example, for model augmentation.
The bioinformatics community recognizes the intellectual poverty of lock step models.
Wonder when the intelligence community is going to have that “a ha” moment?
Posted in Bioinformatics, Biomedical, Genomics, Modeling, Semantic Diversity | No Comments »
Monday, March 11th, 2013
The Annotation-enriched non-redundant patent sequence databases by Weizhong Li, Bartosz Kondratowicz, Hamish McWilliam, Stephane Nauche and Rodrigo Lopez.
Not a really promising title, is it?
The reason I cite it here is that by curation, the database is “non-redundant.”
Try searching for some of these sequences at the USPTO and compare the results.
The power of curation will be immediately obvious.
Abstract:
The EMBL-European Bioinformatics Institute (EMBL-EBI) offers public access to patent sequence data, providing a valuable service to the intellectual property and scientific communities. The non-redundant (NR) patent sequence databases comprise two-level nucleotide and protein sequence clusters (NRNL1, NRNL2, NRPL1 and NRPL2) based on sequence identity (level-1) and patent family (level-2). Annotation from the source entries in these databases is merged and enhanced with additional information from the patent literature and biological context. Corrections in patent publication numbers, kind-codes and patent equivalents significantly improve the data quality. Data are available through various user interfaces including web browser, downloads via FTP, SRS, Dbfetch and EBI-Search. Sequence similarity/homology searches against the databases are available using BLAST, FASTA and PSI-Search. In this article, we describe the data collection and annotation and also outline major changes and improvements introduced since 2009. Apart from data growth, these changes include additional annotation for singleton clusters, the identifier versioning for tracking entry change and the entry mappings between the two-level databases.
Database URL: http://www.ebi.ac.uk/patentdata/nr/
Topic maps are curated data. Which one do you prefer?
Posted in Bioinformatics, Biomedical, Marketing, Medical Informatics, Patents, Topic Maps | No Comments »
Friday, March 1st, 2013
Bellman’s GAP—a language and compiler for dynamic programming in sequence analysis by Georg Sauthoff, Mathias Möhl, Stefan Janssen and Robert Giegerich. (Bioinformatics (2013) 29 (5): 551-560. doi: 10.1093/bioinformatics/btt022)
Abstract:
Motivation: Dynamic programming is ubiquitous in bioinformatics. Developing and implementing non-trivial dynamic programming algorithms is often error prone and tedious. Bellman’s GAP is a new programming system, designed to ease the development of bioinformatics tools based on the dynamic programming technique.
Results: In Bellman’s GAP, dynamic programming algorithms are described in a declarative style by tree grammars, evaluation algebras and products formed thereof. This bypasses the design of explicit dynamic programming recurrences and yields programs that are free of subscript errors, modular and easy to modify. The declarative modules are compiled into C++ code that is competitive to carefully hand-crafted implementations.
This article introduces the Bellman’s GAP system and its language, GAP-L. It then demonstrates the ease of development and the degree of re-use by creating variants of two common bioinformatics algorithms. Finally, it evaluates Bellman’s GAP as an implementation platform of ‘real-world’ bioinformatics tools.
Availability: Bellman’s GAP is available under GPL license from http://bibiserv.cebitec.uni-bielefeld.de/bellmansgap. This Web site includes a repository of re-usable modules for RNA folding based on thermodynamics.
Contact: robert@techfak.uni-bielefeld.de
Supplementary information: Supplementary data are available at Bioinformatics online
Focused on bioinformatics but dynamic programming is not limited to that field.
There is a very amusing story about how the field came to have the name “dynamic programming” in the Wikipedia article: Dynamic Programming.
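For readers who have not written one by hand, here is a minimal sketch of the kind of explicit recurrence, subscripts and all, that the abstract says Bellman’s GAP lets you avoid. It is plain Python computing edit distance, not GAP-L, and the function name and scoring (unit cost per edit) are my own choices for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the classic dynamic programming table."""
    # dist[i][j] holds the number of edits needed to turn a[:i] into b[:j].
    dist = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]

    # Base cases: turning a prefix into the empty string (and vice versa).
    for i in range(len(a) + 1):
        dist[i][0] = i
    for j in range(len(b) + 1):
        dist[0][j] = j

    # Recurrence: each cell depends on its left, upper and diagonal neighbours.
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            substitution = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # delete a[i-1]
                dist[i][j - 1] + 1,                 # insert b[j-1]
                dist[i - 1][j - 1] + substitution,  # match or substitute
            )
    return dist[len(a)][len(b)]
```

Every `i - 1` and `j - 1` above is a chance for the subscript errors the authors mention, which is the argument for describing the algorithm declaratively instead.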
Posted in Bioinformatics, Programming | No Comments »
Thursday, February 21st, 2013
NetGestalt for Data Visualization in the Context of Pathways by Stephen Turner.
From the post:
Many of you may be familiar with WebGestalt, a wonderful web utility developed by Bing Zhang at Vanderbilt for doing basic gene-set enrichment analyses. Last year, we invited Bing to speak at our annual retreat for the Vanderbilt Graduate Program in Human Genetics, and he did not disappoint! Bing walked us through his new tool called NetGestalt.
NetGestalt provides users with the ability to overlay large-scale experimental data onto biological networks. Data are loaded using continuous and binary tracks that can contain either single or multiple lines of data (called composite tracks). Continuous tracks could be gene expression intensities from microarray data or any other quantitative measure that can be mapped to the genome. Binary tracks are usually insertion/deletion regions, or called regions like ChIP peaks. NetGestalt extends many of the features of WebGestalt, including enrichment analysis for modules within a biological network, and provides easy ways to visualize the overlay of multiple tracks with Venn diagrams.
Stephen also points to documentation and video tutorials.
NetGestalt uses gene symbol as the gene identifier. Data that uses other gene identifiers must be mapped to gene symbols before uploading. (Manual, page 4)
An impressive alignment of data sources even with the restriction to gene symbols.
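The mapping step the manual asks for can be sketched as follows. This is only an illustration under my own assumptions: the lookup table, the function name and the (identifier, value) row layout are hypothetical, and nothing here reflects NetGestalt’s actual upload format.

```python
def to_gene_symbols(rows, id_to_symbol):
    """Re-key (identifier, value) rows by gene symbol.

    Returns the re-keyed values plus the identifiers that had no mapping,
    so unmapped data is reported rather than silently dropped.
    """
    mapped = {}
    unmapped = []
    for gene_id, value in rows:
        symbol = id_to_symbol.get(gene_id)
        if symbol is None:
            unmapped.append(gene_id)
        else:
            # Several identifiers may map to one symbol; keep all values.
            mapped.setdefault(symbol, []).append(value)
    return mapped, unmapped


# Hypothetical lookup table and measurements, for illustration only.
example_table = {"7157": "TP53", "672": "BRCA1"}
example_rows = [("7157", 2.5), ("672", -1.0), ("unknown-id", 0.3)]
example_mapped, example_unmapped = to_gene_symbols(example_rows, example_table)
```

Keeping the list of unmapped identifiers is the design point: the restriction to gene symbols is only as good as the record of what failed to map.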
Posted in Bioinformatics, Biomedical, Graphs, Networks, Visualization | No Comments »
Friday, February 15th, 2013
Systems chemistry: Using molecular networks to assess molecular similarity by Bailey Fallon.
From the post:
In new research published in Journal of Systems Chemistry, Sijbren Otto and colleagues have provided the first experimental approach towards molecular networks that can predict bioactivity based on an assessment of molecular similarity.
Molecular similarity is an important concept in drug discovery. Molecules that share certain features such as shape, structure or hydrogen bond donor/acceptor groups may have similar properties that make them common to a particular target. Assessment of molecular similarity has so far relied almost exclusively on computational approaches, but Dr Otto reasoned that a measure of similarity might be obtained by interrogating the molecules in solution experimentally.
Important work for drug discovery but there are semantic lessons here as well:
Tests for similarity/sameness are domain specific.
Which means there are no universal tests for similarity/sameness.
Lacking universal tests for similarity/sameness, we should focus on developing documented and domain specific tests for similarity/sameness.
Domain specific tests provide quicker ROI than less useful and doomed universal solutions.
Documented domain specific tests may, no guarantees, enable us to find commonalities between domain measures of similarity/sameness.
But our conclusions will be based on domain experience and not on projection from our domain onto other, less well-known domains.
Posted in Bioinformatics, Biology, Biomedical, Cheminformatics, Modeling, Molecular Graphs, Networks | No Comments »
Thursday, February 14th, 2013
InChI in the wild: An Assessment of InChIKey searching in Google by Christopher Southan. (Journal of Cheminformatics 2013, 5:10 doi:10.1186/1758-2946-5-10)
Abstract:
While chemical databases can be queried using the InChI string and InChIKey (IK) the latter was designed for open-web searching. It is becoming increasingly effective for this since more sources enhance crawling of their websites by the Googlebot and consequent IK indexing. Searchers who use Google as an adjunct to database access may be less familiar with the advantages of using the IK as explored in this review. As an example, the IK for atorvastatin retrieves ~200 low-redundancy links from a Google search in 0.3 of a second. These include most major databases and a very low false-positive rate. Results encompass less familiar but potentially useful sources and can be extended to isomer capture by using just the skeleton layer of the IK. Google Advanced Search can be used to filter large result sets and image searching with the IK is also effective and complementary to open-web queries. Results can be particularly useful for less-common structures as exemplified by a major metabolite of atorvastatin giving only three hits. Testing also demonstrated document-to-document and document-to-database joins via structure matching. The necessary generation of an IK from chemical names can be accomplished using open tools and resources for patents, papers, abstracts or other text sources. Active global sharing of local IK-linked information can be accomplished via surfacing in open laboratory notebooks, blogs, Twitter, figshare and other routes. While information-rich chemistry (e.g. approved drugs) can exhibit swamping and redundancy effects, the much smaller IK result sets for link-poor structures become a transformative first-pass option. The IK indexing has therefore turned Google into a de-facto open global chemical information hub by merging links to most significant sources, including over 50 million PubChem and ChemSpider records. 
The simplicity, specificity and speed of matching make it a useful option for biologists or others less familiar with chemical searching. However, compared to rigorously maintained major databases, users need to be circumspect about the consistency of Google results and provenance of retrieved links. In addition, community engagement may be necessary to ameliorate possible future degradation of utility.
An interesting use of an identifier, not as a key to a database, as a recent comment suggested, but as the basis for enhanced search results.
How else would you use identifiers “in the wild?”
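As a rough sketch of the "skeleton layer" trick mentioned in the abstract: a standard InChIKey is 27 characters, a 14-letter first block (the connectivity skeleton), a 10-letter second block and a final protonation character. The code below splits a key and builds a quoted web search URL from either the full key or just the first block. The key shown is a made-up placeholder, not a real compound, and the helper names are my own.

```python
import re
from urllib.parse import quote_plus

# 14 letters, then 8 letters + standard flag (S or N) + version, then 1 letter.
INCHIKEY_PATTERN = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


def split_inchikey(key: str) -> dict:
    """Split an InChIKey into its blocks; raise ValueError if malformed."""
    match = INCHIKEY_PATTERN.match(key.strip())
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    skeleton, other_layers, flag, version, protonation = match.groups()
    return {
        "skeleton": skeleton,          # connectivity only; shared by isomers
        "other_layers": other_layers,  # stereochemistry, isotopes, etc.
        "standard": flag == "S",
        "version": version,
        "protonation": protonation,
    }


def google_query_url(term: str) -> str:
    """Build a search URL for an exact-phrase (quoted) query."""
    return "https://www.google.com/search?q=" + quote_plus('"' + term + '"')


# Placeholder key for illustration only; substitute a real InChIKey.
PLACEHOLDER_KEY = "ABCDEFGHIJKLMN-OPQRSTUVSA-N"
parts = split_inchikey(PLACEHOLDER_KEY)
exact_url = google_query_url(PLACEHOLDER_KEY)
isomer_url = google_query_url(parts["skeleton"])
```

Searching on the first block alone trades specificity for recall, which is the isomer capture the author describes.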
Posted in Bioinformatics, Cheminformatics, InChl | No Comments »
Sunday, February 10th, 2013
Prize-based contests can provide solutions to computational biology problems by Karim R Lakhani, et al. (Nature Biotechnology 31, 108–111 (2013) doi:10.1038/nbt.2495)
From the article:
Advances in biotechnology have fueled the generation of unprecedented quantities of data across the life sciences. However, finding analysts who can address such ‘big data’ problems effectively has become a significant research bottleneck. Historically, prize-based contests have had striking success in attracting unconventional individuals who can overcome difficult challenges. To determine whether this approach could solve a real big-data biologic algorithm problem, we used a complex immunogenomics problem as the basis for a two-week online contest broadcast to participants outside academia and biomedical disciplines. Participants in our contest produced over 600 submissions containing 89 novel computational approaches to the problem. Thirty submissions exceeded the benchmark performance of the US National Institutes of Health’s MegaBLAST. The best achieved both greater accuracy and speed (1,000 times greater). Here we show the potential of using online prize-based contests to access individuals without domain-specific backgrounds to address big-data challenges in the life sciences.
….
Over the last ten years, online prize-based contest platforms have emerged to solve specific scientific and computational problems for the commercial sector. These platforms, with solvers in the range of tens to hundreds of thousands, have achieved considerable success by exposing thousands of problems to larger numbers of heterogeneous problem-solvers and by appealing to a wide range of motivations to exert effort and create innovative solutions18, 19. The large number of entrants in prize-based contests increases the probability that an ‘extreme-value’ (or maximally performing) solution can be found through multiple independent trials; this is also known as a parallel-search process19. In contrast to traditional approaches, in which experts are predefined and preselected, contest participants self-select to address problems and typically have diverse knowledge, skills and experience that would be virtually impossible to duplicate locally18. Thus, the contest sponsor can identify an appropriate solution by allowing many individuals to participate and observing the best performance. This is particularly useful for highly uncertain innovation problems in which prediction of the best solver or approach may be difficult and the best person to solve one problem may be unsuitable for another19.
An article that merits wider reading than it is likely to get behind a pay-wall.
A semantically diverse universe of potential solvers is more effective than a semantically monotone group of selected experts.
An indicator of what to expect from the monotone logic of the Semantic Web.
Good for scheduling tennis matches with Tim Berners-Lee.
For more complex tasks, rely on semantically diverse groups of humans.
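The "parallel-search" argument in the excerpt can be made concrete with a little arithmetic. If each independent attempt has probability p of beating the benchmark, the chance that at least one of n attempts does is 1 - (1 - p)^n. The sketch below is mine, not the authors’, and both the independence assumption and the value of p are invented for illustration; real contest submissions are certainly not independent.

```python
def prob_at_least_one_success(p_single: float, n_trials: int) -> float:
    """Chance that at least one of n independent trials succeeds."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability")
    if n_trials < 0:
        raise ValueError("n_trials must be non-negative")
    return 1.0 - (1.0 - p_single) ** n_trials


# Assumed 1% per-attempt success rate; the trial counts are illustrative.
for n in (1, 10, 100, 600):
    print(n, round(prob_at_least_one_success(0.01, n), 4))
```

Even with a small per-attempt chance, the probability climbs quickly with the number of entrants, which is why the sponsor only needs to observe the best performance.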
I first saw this at: Solving Big-Data Bottleneck: Scientists Team With Business Innovators to Tackle Research Hurdles.
Posted in Bioinformatics, Biology, Contest, Crowd Sourcing | No Comments »
Saturday, February 9th, 2013
‘What’s in the NIDDK CDR?’—public query tools for the NIDDK central data repository by Nauqin Pan, et al., (Database (2013) 2013 : bas058 doi: 10.1093/database/bas058)
Abstract:
The National Institute of Diabetes and Digestive Disease (NIDDK) Central Data Repository (CDR) is a web-enabled resource available to researchers and the general public. The CDR warehouses clinical data and study documentation from NIDDK funded research, including such landmark studies as The Diabetes Control and Complications Trial (DCCT, 1983–93) and the Epidemiology of Diabetes Interventions and Complications (EDIC, 1994–present) follow-up study which has been ongoing for more than 20 years. The CDR also houses data from over 7 million biospecimens representing 2 million subjects. To help users explore the vast amount of data stored in the NIDDK CDR, we developed a suite of search mechanisms called the public query tools (PQTs). Five individual tools are available to search data from multiple perspectives: study search, basic search, ontology search, variable summary and sample by condition. PQT enables users to search for information across studies. Users can search for data such as number of subjects, types of biospecimens and disease outcome variables without prior knowledge of the individual studies. This suite of tools will increase the use and maximize the value of the NIDDK data and biospecimen repositories as important resources for the research community.
Database URL: https://www.niddkrepository.org/niddk/home.do
I would like to tell you more about this research, since “[t]he National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) is part of the National Institutes of Health (NIH) and the U.S. Department of Health and Human Services” (that’s a direct quote) and so doesn’t claim copyright on its publications.
Unfortunately, the NIDDK published this paper in the Oxford journal Database, which does believe in restricting access to publicly funded research.
Do visit the search interface to see what you think about it.
Not quite the same as curated content but an improvement over raw string matching.
Posted in Bioinformatics, Biomedical, Search Interface, Searching | No Comments »
Sunday, February 3rd, 2013
ToxPi GUI: an interactive visualization tool for transparent integration of data from diverse sources of evidence by David M. Reif, Myroslav Sypa, Eric F. Lock, Fred A. Wright, Ander Wilson, Tommy Cathey, Richard R. Judson and Ivan Rusyn. (Bioinformatics (2013) 29 (3): 402-403. doi: 10.1093/bioinformatics/bts686)
Abstract:
Motivation: Scientists and regulators are often faced with complex decisions, where use of scarce resources must be prioritized using collections of diverse information. The Toxicological Prioritization Index (ToxPi™) was developed to enable integration of multiple sources of evidence on exposure and/or safety, transformed into transparent visual rankings to facilitate decision making. The rankings and associated graphical profiles can be used to prioritize resources in various decision contexts, such as testing chemical toxicity or assessing similarity of predicted compound bioactivity profiles. The amount and types of information available to decision makers are increasing exponentially, while the complex decisions must rely on specialized domain knowledge across multiple criteria of varying importance. Thus, the ToxPi bridges a gap, combining rigorous aggregation of evidence with ease of communication to stakeholders.
Results: An interactive ToxPi graphical user interface (GUI) application has been implemented to allow straightforward decision support across a variety of decision-making contexts in environmental health. The GUI allows users to easily import and recombine data, then analyze, visualize, highlight, export and communicate ToxPi results. It also provides a statistical metric of stability for both individual ToxPi scores and relative prioritized ranks.
Availability: The ToxPi GUI application, complete user manual and example data files are freely available from http://comptox.unc.edu/toxpi.php.
Contact: reif.david@gmail.com
Very cool!
Although, like having a Ford automobile in any color so long as the color was black, you can integrate any data source so long as the format is CSV and the values are numbers. Subject to other restrictions as well.
That’s an observation, not a criticism.
The application serves a purpose within a domain and does not “integrate” information in the sense of a topic map.
But a topic map could recycle its data to add other identifications and properties. Without having to re-write this application or its data.
Once curated, data should be re-used, not re-created/curated.
Topic maps give you more bang for your data buck.
Posted in Bioinformatics, Biomedical, Integration, Medical Informatics, Subject Identity | No Comments »
Monday, January 28th, 2013
PoSSuM : Pocket Similarity Searching using Multi-Sketches
From the webpage:
Today, vast amounts of protein-small molecule binding sites can be found in the Protein Data Bank (PDB). Exhaustive comparison of them is computationally demanding, but useful in the prediction of protein functions and drug discovery. We proposed a tremendously fast algorithm called “SketchSort” that enables the enumeration of similar pairs in a huge number of protein-ligand binding sites. We conducted all-pair similarity searches for 3.4 million known and potential binding sites using the proposed method and discovered over 24 million similar pairs of binding sites. We present the results as a relational database Pocket Similarity Search using Multiple-Sketches (PoSSuM), which includes all the discovered pairs with annotations of various types (e.g., CATH, SCOP, EC number, Gene ontology). PoSSuM enables rapid exploration of similar binding sites among structures with different global folds as well as similar ones. Moreover, PoSSuM is useful for predicting the binding ligand for unbound structures. Basically, the users can search similar binding pockets using two search modes:
i) “Search K” is useful for finding similar binding sites for a known ligand-binding site. Post a known ligand-binding site (a pair of “PDB ID” and “HET code”) in the PDB, and PoSSuM will search similar sites for the query site.
ii) “Search P” is useful for predicting ligands that potentially bind to a structure of interest. Post a known protein structure (PDB ID) in the PDB, and PoSSuM will search similar known-ligand binding sites for the query structure.
Obviously useful for the bioinformatics crowd but relevant for topic maps as well.
In topic map terminology, the searches are for associations with a known role player in a particular role, leaving the other role player unspecified.
It does not define or seek an exact match but provides the user with data that may help them make a match determination.
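To illustrate the topic map reading of these searches, here is a minimal sketch of looking up associations where one role player is known and the other is left open. The association type, role names and site identifiers are all hypothetical, and this says nothing about how PoSSuM or SketchSort work internally.

```python
def other_players(associations, assoc_type, known_role, known_player, open_role):
    """Return players of open_role in associations matching the known player."""
    return [
        assoc["roles"][open_role]
        for assoc in associations
        if assoc["type"] == assoc_type
        and assoc["roles"].get(known_role) == known_player
        and open_role in assoc["roles"]
    ]


# Hypothetical associations; the identifiers are placeholders, not PDB entries.
example_associations = [
    {"type": "similar-binding-site",
     "roles": {"query-site": "site-A", "matched-site": "site-B"}},
    {"type": "similar-binding-site",
     "roles": {"query-site": "site-A", "matched-site": "site-C"}},
    {"type": "similar-binding-site",
     "roles": {"query-site": "site-D", "matched-site": "site-B"}},
]

candidates = other_players(
    example_associations, "similar-binding-site",
    known_role="query-site", known_player="site-A", open_role="matched-site",
)
```

The result is a list of candidates for the user to judge, not a declared match, which is the point made above.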
Posted in Bioinformatics, Biomedical | No Comments »
Monday, January 28th, 2013
Toward a New Model of the Cell: Everything You Always Wanted to Know About Genes
From the post:
Turning vast amounts of genomic data into meaningful information about the cell is the great challenge of bioinformatics, with major implications for human biology and medicine. Researchers at the University of California, San Diego School of Medicine and colleagues have proposed a new method that creates a computational model of the cell from large networks of gene and protein interactions, discovering how genes and proteins connect to form higher-level cellular machinery.
…
“Our method creates ontology, or a specification of all the major players in the cell and the relationships between them,” said first author Janusz Dutkowski, PhD, postdoctoral researcher in the UC San Diego Department of Medicine. It uses knowledge about how genes and proteins interact with each other and automatically organizes this information to form a comprehensive catalog of gene functions, cellular components, and processes.
“What’s new about our ontology is that it is created automatically from large datasets. In this way, we see not only what is already known, but also potentially new biological components and processes — the bases for new hypotheses,” said Dutkowski.
Originally devised by philosophers attempting to explain the nature of existence, ontologies are now broadly used to encapsulate everything known about a subject in a hierarchy of terms and relationships. Intelligent information systems, such as iPhone’s Siri, are built on ontologies to enable reasoning about the real world. Ontologies are also used by scientists to structure knowledge about subjects like taxonomy, anatomy and development, bioactive compounds, disease and clinical diagnosis.
A Gene Ontology (GO) exists as well, constructed over the last decade through a joint effort of hundreds of scientists. It is considered the gold standard for understanding cell structure and gene function, containing 34,765 terms and 64,635 hierarchical relations annotating genes from more than 80 species.
“GO is very influential in biology and bioinformatics, but it is also incomplete and hard to update based on new data,” said senior author Trey Ideker, PhD, chief of the Division of Genetics in the School of Medicine and professor of bioengineering in UC San Diego’s Jacobs School of Engineering.
The conclusion to A gene ontology inferred from molecular networks (Janusz Dutkowski, Michael Kramer, Michal A Surma, Rama Balakrishnan, J Michael Cherry, Nevan J Krogan & Trey Ideker, Nature Biotechnology 31, 38–45 (2013) doi:10.1038/nbt.2463), illustrates a difference between ontology in the GO sense and that produced by the authors:
The research reported in this manuscript raises the possibility that, given the appropriate tools, ontologies might evolve over time with the addition of each new network map or high-throughput experiment that is published. More importantly, it enables a philosophical shift in bioinformatic analysis, from a regime in which the ontology is viewed as gold standard to one in which it is the major result. (emphasis added)
Ontology as representing reality as opposed to declaring it.
That is a novel concept.
Posted in Bioinformatics, Biomedical, Genomics | No Comments »
Tuesday, January 22nd, 2013
BioNLP-ST 2013
Dates:
Training Data Release 12:00 IDLW, 17 Jan. 2013
Test Data Release 22 Mar. 2013
Result Submission 29 Mar. 2013
BioNLP’11 Workshop 8-9 Aug. 2013
From the website:
The BioNLP Shared Task (BioNLP-ST) series represents a community-wide trend in text-mining for biology toward fine-grained information extraction (IE). The two previous events, BioNLP-ST 2009 and 2011, attracted wide attention, with over 30 teams submitting final results. The tasks and their data have since served as the basis of numerous studies, released event extraction systems, and published datasets. The upcoming BioNLP-ST 2013 follows the general outline and goals of the previous tasks. It identifies biologically relevant extraction targets and proposes a linguistically motivated approach to event representation. The tasks in BioNLP-ST 2013 cover many new hot topics in biology that are close to biologists’ needs. BioNLP-ST 2013 broadens the scope of the text-mining application domains in biology by introducing new issues on cancer genetics and pathway curation. It also builds on the well-known previous datasets GENIA, LLL/BI and BB to propose more realistic tasks that considered previously, closer to the actual needs of biological data integration.
The first event in 2009 triggered active research in the community on a specific fine-grained IE task. Expanding on this, the second BioNLP-ST was organized under the theme “Generalization”, which was well received by participants, who introduced numerous systems that could be straightforwardly applied to multiple tasks. This time, the BioNLP-ST takes a step further and pursues the grand theme of “Knowledge base construction”, which is addressed in various ways: semantic web (GE, GRO), pathways (PC), molecular mechanisms of cancer (CG), regulation networks (GRN) and ontology population (GRO, BB).
As in previous events, manually annotated data will be provided for training, development and evaluation of information extraction methods. According to their relevance for biological studies, the annotations are either bound to specific expressions in the text or represented as structured knowledge. Many tools for the detailed evaluation and graphical visualization of annotations and system outputs will be available for participants. Support in performing linguistic processing will be provided to the participants in the form of analyses created by various state-of-the art tools on the dataset texts.
Participation to the task will be open to the academia, industry, and all other interested parties.
Tasks:
Quick question: Do you think there is semantically diverse data available for each of these tasks?
I first saw this at: BioNLP Shared Task: Text Mining for Biology Competition.
Posted in Bioinformatics, Biomedical, Medical Informatics | No Comments »
Monday, January 21st, 2013
Designing concept maps for a precise and objective description of pharmaceutical innovations by Maia Iordatii, Alain Venot and Catherine Duclos. (BMC Medical Informatics and Decision Making 2013, 13:10 doi:10.1186/1472-6947-13-10)
Abstract:
Background
When a new drug is launched onto the market, information about the new manufactured product is contained in its monograph and evaluation report published by national drug agencies. Health professionals need to be able to determine rapidly and easily whether the new manufactured product is potentially useful for their practice. There is therefore a need to identify the best way to group together and visualize the main items of information describing the nature and potential impact of the new drug. The objective of this study was to identify these items of information and to bring them together in a model that could serve as the standard for presenting the main features of new manufactured product.
Methods
We developed a preliminary conceptual model of pharmaceutical innovations, based on the knowledge of the authors. We then refined this model, using a random sample of 40 new manufactured drugs recently approved by the national drug regulatory authorities in France and covering a broad spectrum of innovations and therapeutic areas. Finally, we used another sample of 20 new manufactured drugs to determine whether the model was sufficiently comprehensive.
Results
The results of our modeling led to three sub models described as conceptual maps representing: i) the medical context for use of the new drug (indications, type of effect, therapeutical arsenal for the same indications), ii) the nature of the novelty of the new drug (new molecule, new mechanism of action, new combination, new dosage, etc.), and iii) the impact of the drug in terms of efficacy, safety and ease of use, compared with other drugs with the same indications.
Conclusions
Our model can help to standardize information about new drugs released onto the market. It is potentially useful to the pharmaceutical industry, medical journals, editors of drug databases and medical software, and national or international drug regulation agencies, as a means of describing the main properties of new pharmaceutical products. It could also used as a guide for the writing of comprehensive and objective texts summarizing the nature and interest of new manufactured product. (emphasis added)
We all design categories starting with what we know, as pointed out under methods above.
And any three authors could undertake such a quest, with equally valid results but different terminology and perhaps even a different arrangement of concepts.
The problem isn’t the undertaking, which is useful.
The problem is the lack of a binding between such undertakings, one that would enable users to migrate between such maps as they develop over time.
A problem that topic maps offer an infrastructure to solve.
Posted in Bioinformatics, Biomedical, Concept Maps | No Comments »
Sunday, January 20th, 2013
Semantic Web meets Integrative Biology: a survey by Huajun Chen, Tong Yu and Jake Y. Chen.
Abstract:
Integrative Biology (IB) uses experimental or computational quantitative technologies to characterize biological systems at the molecular, cellular, tissue and population levels. IB typically involves the integration of the data, knowledge and capabilities across disciplinary boundaries in order to solve complex problems. We identify a series of bioinformatics problems posed by interdisciplinary integration: (i) data integration that interconnects structured data across related biomedical domains; (ii) ontology integration that brings jargons, terminologies and taxonomies from various disciplines into a unified network of ontologies; (iii) knowledge integration that integrates disparate knowledge elements from multiple sources; (iv) service integration that build applications out of services provided by different vendors. We argue that IB can benefit significantly from the integration solutions enabled by Semantic Web (SW) technologies. The SW enables scientists to share content beyond the boundaries of applications and websites, resulting into a web of data that is meaningful and understandable to any computers. In this review, we provide insight into how SW technologies can be used to build open, standardized and interoperable solutions for interdisciplinary integration on a global basis. We present a rich set of case studies in system biology, integrative neuroscience, bio-pharmaceutics and translational medicine, to highlight the technical features and benefits of SW applications in IB.
A very good summary of the issues of data integration in bioinformatics.
I disagree with the prescription, as you might imagine, but it is a good starting place for discussion of the issues of data integration.
Posted in Bioinformatics, Semantic Web | No Comments »
Sunday, January 20th, 2013
An overview of the BioCreative 2012 Workshop Track III: interactive text mining task by Cecilia N. Arighi, et al. (Database (2013) 2013 : bas056 doi: 10.1093/database/bas056)
Abstract:
In many databases, biocuration primarily involves literature curation, which usually involves retrieving relevant articles, extracting information that will translate into annotations and identifying new incoming literature. As the volume of biological literature increases, the use of text mining to assist in biocuration becomes increasingly relevant. A number of groups have developed tools for text mining from a computer science/linguistics perspective, and there are many initiatives to curate some aspect of biology from the literature. Some biocuration efforts already make use of a text mining tool, but there have not been many broad-based systematic efforts to study which aspects of a text mining tool contribute to its usefulness for a curation task. Here, we report on an effort to bring together text mining tool developers and database biocurators to test the utility and usability of tools. Six text mining systems presenting diverse biocuration tasks participated in a formal evaluation, and appropriate biocurators were recruited for testing. The performance results from this evaluation indicate that some of the systems were able to improve efficiency of curation by speeding up the curation task significantly (∼1.7- to 2.5-fold) over manual curation. In addition, some of the systems were able to improve annotation accuracy when compared with the performance on the manually curated set. In terms of inter-annotator agreement, the factors that contributed to significant differences for some of the systems included the expertise of the biocurator on the given curation task, the inherent difficulty of the curation and attention to annotation guidelines. After the task, annotators were asked to complete a survey to help identify strengths and weaknesses of the various systems. 
The analysis of this survey highlights how important task completion is to the biocurators’ overall experience of a system, regardless of the system’s high score on design, learnability and usability. In addition, strategies to refine the annotation guidelines and systems documentation, to adapt the tools to the needs and query types the end user might have and to evaluate performance in terms of efficiency, user interface, result export and traditional evaluation metrics have been analyzed during this task. This analysis will help to plan for a more intense study in BioCreative IV.
Curation is an aspect of topic map authoring, albeit with the latter capturing information for later merging with other sources of information.
Definitely an article you will want to read if you are designing text mining as part of a topic map solution.
Posted in Annotation, Bioinformatics, Curation, Text Mining | 1 Comment »
Saturday, January 19th, 2013
The Pacific Symposium on Biocomputing 2013 by Will Bush.
From the post:
For 18 years now, computational biologists have convened on the beautiful islands of Hawaii to present and discuss research emerging from new areas of biomedicine. PSB Conference Chairs Teri Klein (@teriklein), Keith Dunker, Russ Altman (@Rbaltman) and Larry Hunter (@ProfLHunter) organize innovative sessions and tutorials that are always interactive and thought-provoking. This year, sessions included Computational Drug Repositioning, Epigenomics, Aberrant Pathway and Network Activity, Personalized Medicine, Phylogenomics and Population Genomics, Post-Next Generation Sequencing, and Text and Data Mining. The Proceedings are available online here, and a few of the highlights are:
See Will’s post for the highlights. Or browse the proceedings. You are almost certainly going to find something relevant to you.
Do note Will’s use of Twitter IDs as identifiers. Unique, persistent (I assume Twitter doesn’t re-assign them), easy to access.
It wasn’t clear from Will’s post if the following image was from Biocomputing 2013 or if he stopped by a markup conference. Hard to tell.

Posted in Bioinformatics, Biomedical, Data Mining, Text Mining | No Comments »
Friday, January 18th, 2013
PPInterFinder—a mining tool for extracting causal relations on human proteins from literature by Kalpana Raja, Suresh Subramani and Jeyakumar Natarajan. (Database (2013) 2013 : bas052 doi: 10.1093/database/bas052)
Abstract:
One of the most common and challenging problem in biomedical text mining is to mine protein–protein interactions (PPIs) from MEDLINE abstracts and full-text research articles because PPIs play a major role in understanding the various biological processes and the impact of proteins in diseases. We implemented, PPInterFinder—a web-based text mining tool to extract human PPIs from biomedical literature. PPInterFinder uses relation keyword co-occurrences with protein names to extract information on PPIs from MEDLINE abstracts and consists of three phases. First, it identifies the relation keyword using a parser with Tregex and a relation keyword dictionary. Next, it automatically identifies the candidate PPI pairs with a set of rules related to PPI recognition. Finally, it extracts the relations by matching the sentence with a set of 11 specific patterns based on the syntactic nature of PPI pair. We find that PPInterFinder is capable of predicting PPIs with the accuracy of 66.05% on AIMED corpus and outperforms most of the existing systems.
Database URL: http://www.biomining-bu.in/ppinterfinder/
I thought the shortened form of the title would catch your eye.
Important work for bioinformatics, but it is also an example of domain-specific association mining.
By focusing on a specific domain and forswearing designs on being a universal association solution, PPInterFinder produces useful results today.
A lesson that should be taken and applied to semantic mappings more generally.
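To make the co-occurrence idea from the abstract concrete, here is a minimal sketch in Python. It is not PPInterFinder's implementation: there is no Tregex parser, no candidate-pair rule set and none of the 11 syntactic patterns. The dictionaries `PROTEINS` and `RELATION_KEYWORDS` and the function `candidate_ppis` are hypothetical names invented for illustration. The sketch only shows the first intuition: a sentence that mentions two known protein names together with a relation keyword yields a candidate association.

```python
import re
from itertools import combinations

# Hypothetical toy dictionaries, assumed for illustration only.
# A real system would use curated protein-name and relation-keyword resources.
PROTEINS = {"BRCA1", "BARD1", "TP53", "MDM2"}
RELATION_KEYWORDS = {"binds", "interacts", "phosphorylates", "inhibits"}


def candidate_ppis(sentence):
    """Return (protein_a, keyword, protein_b) triples that co-occur in one sentence."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    # Protein names are matched case-sensitively against the toy dictionary.
    proteins = [t for t in tokens if t in PROTEINS]
    # Relation keywords are matched case-insensitively.
    keywords = [t.lower() for t in tokens if t.lower() in RELATION_KEYWORDS]
    # Keep each protein once, in order of first appearance.
    unique = list(dict.fromkeys(proteins))
    # Every pair of distinct proteins combined with every keyword is a candidate.
    return [(a, k, b) for a, b in combinations(unique, 2) for k in keywords]


if __name__ == "__main__":
    for s in ("BRCA1 interacts with BARD1 in the nucleus.",
              "TP53 is a tumor suppressor."):
        print(s, "->", candidate_ppis(s))
```

Even this toy version illustrates the point about narrowing the domain: with a small, fixed vocabulary of subjects and relation types, a very simple rule already produces usable candidate associations, which a later stage can then filter.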
Posted in Associations, Bioinformatics, Biomedical | No Comments »