Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

February 14, 2013

InChI in the wild: An Assessment of InChIKey searching in Google

Filed under: Bioinformatics,Cheminformatics,InChI — Patrick Durusau @ 8:19 pm

InChI in the wild: An Assessment of InChIKey searching in Google by Christopher Southan. (Journal of Cheminformatics 2013, 5:10 doi:10.1186/1758-2946-5-10)


While chemical databases can be queried using the InChI string and InChIKey (IK) the latter was designed for open-web searching. It is becoming increasingly effective for this since more sources enhance crawling of their websites by the Googlebot and consequent IK indexing. Searchers who use Google as an adjunct to database access may be less familiar with the advantages of using the IK as explored in this review. As an example, the IK for atorvastatin retrieves ~200 low-redundancy links from a Google search in 0.3 of a second. These include most major databases and a very low false-positive rate. Results encompass less familiar but potentially useful sources and can be extended to isomer capture by using just the skeleton layer of the IK. Google Advanced Search can be used to filter large result sets and image searching with the IK is also effective and complementary to open-web queries. Results can be particularly useful for less-common structures as exemplified by a major metabolite of atorvastatin giving only three hits. Testing also demonstrated document-to-document and document-to-database joins via structure matching. The necessary generation of an IK from chemical names can be accomplished using open tools and resources for patents, papers, abstracts or other text sources. Active global sharing of local IK-linked information can be accomplished via surfacing in open laboratory notebooks, blogs, Twitter, figshare and other routes. While information-rich chemistry (e.g. approved drugs) can exhibit swamping and redundancy effects, the much smaller IK result sets for link-poor structures become a transformative first-pass option. The IK indexing has therefore turned Google into a de-facto open global chemical information hub by merging links to most significant sources, including over 50 million PubChem and ChemSpider records. 
The simplicity, specificity and speed of matching make it a useful option for biologists or others less familiar with chemical searching. However, compared to rigorously maintained major databases, users need to be circumspect about the consistency of Google results and provenance of retrieved links. In addition, community engagement may be necessary to ameliorate possible future degradation of utility.

An interesting use of an identifier, not as a key to a database, as a recent comment suggested, but as the basis for enhanced search results.

How else would you use identifiers “in the wild?”
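The skeleton-layer search mentioned in the abstract can be sketched in a few lines. This is a minimal sketch, assuming the standard 27-character InChIKey layout (a 14-letter connectivity block, a 10-letter second block and a 1-letter protonation flag). The function names are hypothetical, and the sample key is the one commonly listed for atorvastatin, so check it against a database before relying on it.

```python
import re

# Assumed standard InChIKey layout: 14 letters, 10 letters, 1 letter,
# separated by hyphens. The first block hashes the connectivity (skeleton).
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def skeleton_block(inchikey: str) -> str:
    """Return the first (connectivity) block of an InChIKey."""
    key = inchikey.strip().upper()
    if not INCHIKEY_RE.match(key):
        raise ValueError(f"not a well-formed InChIKey: {inchikey!r}")
    return key.split("-")[0]


def search_terms(inchikey: str) -> dict:
    """Build quoted search strings: exact structure and skeleton-only."""
    key = inchikey.strip().upper()
    return {
        "exact": f'"{key}"',
        "skeleton": f'"{skeleton_block(key)}"',
    }


if __name__ == "__main__":
    # Sample key commonly listed for atorvastatin (verify before use).
    for label, term in search_terms("XUKUURHRXDUEBC-KAYWLYCHSA-N").items():
        print(label, term)
```

Pasting the exact string into a search engine targets a single structure; the skeleton string widens the net to isomers that share the same connectivity block.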

February 10, 2013

The Power of Semantic Diversity

Filed under: Bioinformatics,Biology,Contest,Crowd Sourcing — Patrick Durusau @ 3:10 pm

Prize-based contests can provide solutions to computational biology problems by Karim R Lakhani, et al. (Nature Biotechnology 31, 108–111 (2013) doi:10.1038/nbt.2495)

From the article:

Advances in biotechnology have fueled the generation of unprecedented quantities of data across the life sciences. However, finding analysts who can address such ‘big data’ problems effectively has become a significant research bottleneck. Historically, prize-based contests have had striking success in attracting unconventional individuals who can overcome difficult challenges. To determine whether this approach could solve a real big-data biologic algorithm problem, we used a complex immunogenomics problem as the basis for a two-week online contest broadcast to participants outside academia and biomedical disciplines. Participants in our contest produced over 600 submissions containing 89 novel computational approaches to the problem. Thirty submissions exceeded the benchmark performance of the US National Institutes of Health’s MegaBLAST. The best achieved both greater accuracy and speed (1,000 times greater). Here we show the potential of using online prize-based contests to access individuals without domain-specific backgrounds to address big-data challenges in the life sciences.


Over the last ten years, online prize-based contest platforms have emerged to solve specific scientific and computational problems for the commercial sector. These platforms, with solvers in the range of tens to hundreds of thousands, have achieved considerable success by exposing thousands of problems to larger numbers of heterogeneous problem-solvers and by appealing to a wide range of motivations to exert effort and create innovative solutions [18, 19]. The large number of entrants in prize-based contests increases the probability that an ‘extreme-value’ (or maximally performing) solution can be found through multiple independent trials; this is also known as a parallel-search process [19]. In contrast to traditional approaches, in which experts are predefined and preselected, contest participants self-select to address problems and typically have diverse knowledge, skills and experience that would be virtually impossible to duplicate locally [18]. Thus, the contest sponsor can identify an appropriate solution by allowing many individuals to participate and observing the best performance. This is particularly useful for highly uncertain innovation problems in which prediction of the best solver or approach may be difficult and the best person to solve one problem may be unsuitable for another [19].
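The parallel-search argument in the passage above can be made concrete with a little arithmetic. This is a minimal sketch under assumptions that are mine, not the article's: every entrant is an independent trial, and each has the same probability `p_single` of beating the benchmark. The function name and the sample value of 0.05 are purely illustrative.

```python
def prob_at_least_one_success(p_single: float, n_entrants: int) -> float:
    """Chance that at least one of n independent trials succeeds.

    Assumes independent, identically likely trials: 1 - (1 - p)^n.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if n_entrants < 0:
        raise ValueError("n_entrants must be non-negative")
    return 1.0 - (1.0 - p_single) ** n_entrants


if __name__ == "__main__":
    # Illustrative only: how the odds grow as the pool of entrants grows.
    for n in (1, 10, 50, 100, 600):
        print(n, round(prob_at_least_one_success(0.05, n), 4))
```

Real entrants are neither independent nor identical, so treat this as intuition for why a large, varied pool helps rather than as a model of the contest.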

An article that merits wider reading than it is likely to get behind a pay-wall.

A semantically diverse universe of potential solvers is more effective than a semantically monotone group of selected experts.

An indicator of what to expect from the monotone logic of the Semantic Web.

Good for scheduling tennis matches with Tim Berners-Lee.

For more complex tasks, rely on semantically diverse groups of humans.

I first saw this at: Solving Big-Data Bottleneck: Scientists Team With Business Innovators to Tackle Research Hurdles.

February 9, 2013

‘What’s in the NIDDK CDR?’…

Filed under: Bioinformatics,Biomedical,Search Interface,Searching — Patrick Durusau @ 8:21 pm

‘What’s in the NIDDK CDR?’—public query tools for the NIDDK central data repository by Nauqin Pan, et al., (Database (2013) 2013 : bas058 doi: 10.1093/database/bas058)


The National Institute of Diabetes and Digestive Disease (NIDDK) Central Data Repository (CDR) is a web-enabled resource available to researchers and the general public. The CDR warehouses clinical data and study documentation from NIDDK funded research, including such landmark studies as The Diabetes Control and Complications Trial (DCCT, 1983–93) and the Epidemiology of Diabetes Interventions and Complications (EDIC, 1994–present) follow-up study which has been ongoing for more than 20 years. The CDR also houses data from over 7 million biospecimens representing 2 million subjects. To help users explore the vast amount of data stored in the NIDDK CDR, we developed a suite of search mechanisms called the public query tools (PQTs). Five individual tools are available to search data from multiple perspectives: study search, basic search, ontology search, variable summary and sample by condition. PQT enables users to search for information across studies. Users can search for data such as number of subjects, types of biospecimens and disease outcome variables without prior knowledge of the individual studies. This suite of tools will increase the use and maximize the value of the NIDDK data and biospecimen repositories as important resources for the research community.

Database URL:

I would like to tell you more about this research, since “[t]he National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) is part of the National Institutes of Health (NIH) and the U.S. Department of Health and Human Services” (that’s a direct quote) and so doesn’t claim copyright on its publications.

Unfortunately, the NIDDK published this paper in the Oxford journal Database, which does believe in restricting access to publicly funded research.

Do visit the search interface to see what you think about it.

Not quite the same as curated content but an improvement over raw string matching.

February 3, 2013

ToxPi GUI [Data Recycling]

Filed under: Bioinformatics,Biomedical,Integration,Medical Informatics,Subject Identity — Patrick Durusau @ 6:57 pm

ToxPi GUI: an interactive visualization tool for transparent integration of data from diverse sources of evidence by David M. Reif, Myroslav Sypa, Eric F. Lock, Fred A. Wright, Ander Wilson, Tommy Cathey, Richard R. Judson and Ivan Rusyn. (Bioinformatics (2013) 29 (3): 402-403. doi: 10.1093/bioinformatics/bts686)


Motivation: Scientists and regulators are often faced with complex decisions, where use of scarce resources must be prioritized using collections of diverse information. The Toxicological Prioritization Index (ToxPi™) was developed to enable integration of multiple sources of evidence on exposure and/or safety, transformed into transparent visual rankings to facilitate decision making. The rankings and associated graphical profiles can be used to prioritize resources in various decision contexts, such as testing chemical toxicity or assessing similarity of predicted compound bioactivity profiles. The amount and types of information available to decision makers are increasing exponentially, while the complex decisions must rely on specialized domain knowledge across multiple criteria of varying importance. Thus, the ToxPi bridges a gap, combining rigorous aggregation of evidence with ease of communication to stakeholders.

Results: An interactive ToxPi graphical user interface (GUI) application has been implemented to allow straightforward decision support across a variety of decision-making contexts in environmental health. The GUI allows users to easily import and recombine data, then analyze, visualize, highlight, export and communicate ToxPi results. It also provides a statistical metric of stability for both individual ToxPi scores and relative prioritized ranks.

Availability: The ToxPi GUI application, complete user manual and example data files are freely available from


Very cool!

Although, like a Ford automobile available in any color so long as the color was black, you can integrate any data source so long as the format is CSV and the values are numbers. Other restrictions apply as well.

That’s an observation, not a criticism.

The application serves a purpose within a domain and does not “integrate” information in the sense of a topic map.

But a topic map could recycle its data to add other identifications and properties, without having to rewrite this application or its data.

Once curated, data should be re-used, not re-created/curated.

Topic maps give you more bang for your data buck.
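A rough illustration of what recycling could look like: leave the numeric CSV untouched and layer extra identifiers on top of it. This is a minimal sketch under loud assumptions. The column names, compound names, registry identifiers and function names are all hypothetical, the CSV layout is not claimed to be ToxPi's actual input format, and the "topics" are plain dictionaries rather than any topic map library's objects.

```python
import csv
import io

# Hypothetical numeric CSV of the kind a scoring tool might consume.
SCORES_CSV = """chemical,assay_1,assay_2
compound_x,0.2,0.9
compound_y,0.7,0.1
"""

# Hypothetical extra identifications, kept outside the CSV.
EXTRA_IDENTIFIERS = {
    "compound_x": {"registry_id": "REG-0001", "synonyms": ["cmpd-x"]},
}


def load_numeric_rows(text: str) -> dict:
    """Read the CSV as-is: one name column, every other value a number."""
    rows = {}
    for row in csv.DictReader(io.StringIO(text)):
        name = row.pop("chemical")
        rows[name] = {column: float(value) for column, value in row.items()}
    return rows


def as_topics(rows: dict, identifiers: dict) -> list:
    """Wrap each row in a topic-like record carrying extra identifiers."""
    topics = []
    for name, scores in rows.items():
        extra = identifiers.get(name, {})
        topics.append({
            "names": [name, *extra.get("synonyms", [])],
            "registry_id": extra.get("registry_id"),
            "properties": scores,  # original curated values, unchanged
        })
    return topics


if __name__ == "__main__":
    for topic in as_topics(load_numeric_rows(SCORES_CSV), EXTRA_IDENTIFIERS):
        print(topic)
```

The curated numbers are read once and never edited; new identifications accumulate alongside them.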

January 28, 2013


Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 1:20 pm

PoSSuM : Pocket Similarity Searching using Multi-Sketches

From the webpage:

Today, vast amounts of protein-small molecule binding sites can be found in the Protein Data Bank (PDB). Exhaustive comparison of them is computationally demanding, but useful in the prediction of protein functions and drug discovery. We proposed a tremendously fast algorithm called “SketchSort” that enables the enumeration of similar pairs in a huge number of protein-ligand binding sites. We conducted all-pair similarity searches for 3.4 million known and potential binding sites using the proposed method and discovered over 24 million similar pairs of binding sites. We present the results as a relational database Pocket Similarity Search using Multiple-Sketches (PoSSuM), which includes all the discovered pairs with annotations of various types (e.g., CATH, SCOP, EC number, Gene ontology). PoSSuM enables rapid exploration of similar binding sites among structures with different global folds as well as similar ones. Moreover, PoSSuM is useful for predicting the binding ligand for unbound structures. Basically, the users can search similar binding pockets using two search modes:

i) “Search K” is useful for finding similar binding sites for a known ligand-binding site. Post a known ligand-binding site (a pair of “PDB ID” and “HET code”) in the PDB, and PoSSuM will search similar sites for the query site.

ii) “Search P” is useful for predicting ligands that potentially bind to a structure of interest. Post a known protein structure (PDB ID) in the PDB, and PoSSuM will search similar known-ligand binding sites for the query structure.
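The idea of comparing compact sketches instead of full site descriptions can be illustrated with a toy example. This is not SketchSort itself: it is a simplified random-hyperplane (SimHash-style) bucketing scheme that I am substituting for illustration, and the feature vectors, site names and function names are all hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations


def make_hyperplanes(dim: int, bits: int, seed: int = 0) -> list:
    """Draw `bits` random hyperplanes in `dim` dimensions (seeded)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def sketch(vector: list, hyperplanes: list) -> tuple:
    """One bit per hyperplane: which side of the plane the vector lies on."""
    return tuple(
        1 if sum(h * v for h, v in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )


def candidate_pairs(vectors: dict, hyperplanes: list) -> set:
    """Pair up items whose sketches are identical (a single hash table)."""
    buckets = defaultdict(list)
    for name, vector in vectors.items():
        buckets[sketch(vector, hyperplanes)].append(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs


if __name__ == "__main__":
    # Hypothetical feature vectors standing in for binding-site descriptors.
    sites = {
        "site_a": [1.0, 2.0, 3.0],
        "site_b": [2.0, 4.0, 6.0],     # same direction as site_a
        "site_c": [-1.0, -2.0, -3.0],  # opposite direction
    }
    planes = make_hyperplanes(dim=3, bits=16)
    print(candidate_pairs(sites, planes))
```

A single table only catches pairs whose sketches match exactly; a production method would need several sketches or orderings to catch near misses, which is the part this toy leaves out.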

Obviously useful for the bioinformatics crowd but relevant for topic maps as well.

In topic map terminology, the searches are for associations with a known role player in a particular role, leaving the other role player unspecified.

It does not define or seek an exact match but provides the user with data that may help them make a match determination.
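The "known role player, unspecified partner" query described above can be sketched with plain data structures. This is a minimal sketch, not a topic map engine: the association type, role names, site labels and function name are hypothetical.

```python
# Hypothetical associations: each has a type and a role -> player mapping.
ASSOCIATIONS = [
    {"type": "similar-binding-site",
     "roles": {"query-site": "site-1", "similar-site": "site-7"}},
    {"type": "similar-binding-site",
     "roles": {"query-site": "site-1", "similar-site": "site-9"}},
    {"type": "similar-binding-site",
     "roles": {"query-site": "site-2", "similar-site": "site-7"}},
    {"type": "binds-ligand",
     "roles": {"query-site": "site-1", "ligand": "ligand-A"}},
]


def other_role_players(associations, assoc_type, known_role, known_player):
    """Fix one role player; return (role, player) for every other role."""
    found = []
    for association in associations:
        if association["type"] != assoc_type:
            continue
        roles = association["roles"]
        if roles.get(known_role) != known_player:
            continue
        found.extend(
            (role, player) for role, player in roles.items()
            if role != known_role
        )
    return found


if __name__ == "__main__":
    print(other_role_players(
        ASSOCIATIONS, "similar-binding-site", "query-site", "site-1"))
```

The result is a list of candidates for the open role; deciding which of them is a real match is left to the user, as in the post.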

…Everything You Always Wanted to Know About Genes

Filed under: Bioinformatics,Biomedical,Genomics — Patrick Durusau @ 1:18 pm

Toward a New Model of the Cell: Everything You Always Wanted to Know About Genes

From the post:

Turning vast amounts of genomic data into meaningful information about the cell is the great challenge of bioinformatics, with major implications for human biology and medicine. Researchers at the University of California, San Diego School of Medicine and colleagues have proposed a new method that creates a computational model of the cell from large networks of gene and protein interactions, discovering how genes and proteins connect to form higher-level cellular machinery.

“Our method creates ontology, or a specification of all the major players in the cell and the relationships between them,” said first author Janusz Dutkowski, PhD, postdoctoral researcher in the UC San Diego Department of Medicine. It uses knowledge about how genes and proteins interact with each other and automatically organizes this information to form a comprehensive catalog of gene functions, cellular components, and processes.

“What’s new about our ontology is that it is created automatically from large datasets. In this way, we see not only what is already known, but also potentially new biological components and processes — the bases for new hypotheses,” said Dutkowski.

Originally devised by philosophers attempting to explain the nature of existence, ontologies are now broadly used to encapsulate everything known about a subject in a hierarchy of terms and relationships. Intelligent information systems, such as iPhone’s Siri, are built on ontologies to enable reasoning about the real world. Ontologies are also used by scientists to structure knowledge about subjects like taxonomy, anatomy and development, bioactive compounds, disease and clinical diagnosis.

A Gene Ontology (GO) exists as well, constructed over the last decade through a joint effort of hundreds of scientists. It is considered the gold standard for understanding cell structure and gene function, containing 34,765 terms and 64,635 hierarchical relations annotating genes from more than 80 species.

“GO is very influential in biology and bioinformatics, but it is also incomplete and hard to update based on new data,” said senior author Trey Ideker, PhD, chief of the Division of Genetics in the School of Medicine and professor of bioengineering in UC San Diego’s Jacobs School of Engineering.

The conclusion to A gene ontology inferred from molecular networks (Janusz Dutkowski, Michael Kramer, Michal A Surma, Rama Balakrishnan, J Michael Cherry, Nevan J Krogan & Trey Ideker, Nature Biotechnology 31, 38–45 (2013) doi:10.1038/nbt.2463), illustrates a difference between ontology in the GO sense and that produced by the authors:

The research reported in this manuscript raises the possibility that, given the appropriate tools, ontologies might evolve over time with the addition of each new network map or high-throughput experiment that is published. More importantly, it enables a philosophical shift in bioinformatic analysis, from a regime in which the ontology is viewed as gold standard to one in which it is the major result. (emphasis added)

Ontology as representing reality as opposed to declaring it.

That is a novel concept.

January 22, 2013

BioNLP-ST 2013

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 2:42 pm

BioNLP-ST 2013


Training Data Release 12:00 IDLW, 17 Jan. 2013
Test Data Release 22 Mar. 2013
Result Submission 29 Mar. 2013
BioNLP’11 Workshop 8-9 Aug. 2013

From the website:

The BioNLP Shared Task (BioNLP-ST) series represents a community-wide trend in text-mining for biology toward fine-grained information extraction (IE). The two previous events, BioNLP-ST 2009 and 2011, attracted wide attention, with over 30 teams submitting final results. The tasks and their data have since served as the basis of numerous studies, released event extraction systems, and published datasets. The upcoming BioNLP-ST 2013 follows the general outline and goals of the previous tasks. It identifies biologically relevant extraction targets and proposes a linguistically motivated approach to event representation. The tasks in BioNLP-ST 2013 cover many new hot topics in biology that are close to biologists’ needs. BioNLP-ST 2013 broadens the scope of the text-mining application domains in biology by introducing new issues on cancer genetics and pathway curation. It also builds on the well-known previous datasets GENIA, LLL/BI and BB to propose more realistic tasks that considered previously, closer to the actual needs of biological data integration.

The first event in 2009 triggered active research in the community on a specific fine-grained IE task. Expanding on this, the second BioNLP-ST was organized under the theme “Generalization”, which was well received by participants, who introduced numerous systems that could be straightforwardly applied to multiple tasks. This time, the BioNLP-ST takes a step further and pursues the grand theme of “Knowledge base construction”, which is addressed in various ways: semantic web (GE, GRO), pathways (PC), molecular mechanisms of cancer (CG), regulation networks (GRN) and ontology population (GRO, BB).

As in previous events, manually annotated data will be provided for training, development and evaluation of information extraction methods. According to their relevance for biological studies, the annotations are either bound to specific expressions in the text or represented as structured knowledge. Many tools for the detailed evaluation and graphical visualization of annotations and system outputs will be available for participants. Support in performing linguistic processing will be provided to the participants in the form of analyses created by various state-of-the art tools on the dataset texts.

Participation to the task will be open to the academia, industry, and all other interested parties.


Quick question: Do you think there is semantically diverse data available for each of these tasks?

I first saw this at: BioNLP Shared Task: Text Mining for Biology Competition.

January 21, 2013

Concept Maps – Pharmaceuticals

Filed under: Bioinformatics,Biomedical,Concept Maps — Patrick Durusau @ 7:29 pm

Designing concept maps for a precise and objective description of pharmaceutical innovations by Maia Iordatii, Alain Venot and Catherine Duclos. (BMC Medical Informatics and Decision Making 2013, 13:10 doi:10.1186/1472-6947-13-10)



When a new drug is launched onto the market, information about the new manufactured product is contained in its monograph and evaluation report published by national drug agencies. Health professionals need to be able to determine rapidly and easily whether the new manufactured product is potentially useful for their practice. There is therefore a need to identify the best way to group together and visualize the main items of information describing the nature and potential impact of the new drug. The objective of this study was to identify these items of information and to bring them together in a model that could serve as the standard for presenting the main features of new manufactured product.


We developed a preliminary conceptual model of pharmaceutical innovations, based on the knowledge of the authors. We then refined this model, using a random sample of 40 new manufactured drugs recently approved by the national drug regulatory authorities in France and covering a broad spectrum of innovations and therapeutic areas. Finally, we used another sample of 20 new manufactured drugs to determine whether the model was sufficiently comprehensive.


The results of our modeling led to three sub models described as conceptual maps representing: i) the medical context for use of the new drug (indications, type of effect, therapeutical arsenal for the same indications), ii) the nature of the novelty of the new drug (new molecule, new mechanism of action, new combination, new dosage, etc.), and iii) the impact of the drug in terms of efficacy, safety and ease of use, compared with other drugs with the same indications.


Our model can help to standardize information about new drugs released onto the market. It is potentially useful to the pharmaceutical industry, medical journals, editors of drug databases and medical software, and national or international drug regulation agencies, as a means of describing the main properties of new pharmaceutical products. It could also used as a guide for the writing of comprehensive and objective texts summarizing the nature and interest of new manufactured product. (emphasis added)

We all design categories starting with what we know, as pointed out under methods above.

And any three authors could undertake such a quest, with equally valid results but different terminology and perhaps even a different arrangement of concepts.

The problem isn’t the undertaking, which is useful.

The problem is the lack of a binding between such undertakings, one that would enable users to migrate between such maps as they develop over time.

A problem that topic maps offer an infrastructure to solve.

January 20, 2013

Semantic Web meets Integrative Biology: a survey

Filed under: Bioinformatics,Semantic Web — Patrick Durusau @ 8:04 pm

Semantic Web meets Integrative Biology: a survey by Huajun Chen, Tong Yu and Jake Y. Chen.


Integrative Biology (IB) uses experimental or computational quantitative technologies to characterize biological systems at the molecular, cellular, tissue and population levels. IB typically involves the integration of the data, knowledge and capabilities across disciplinary boundaries in order to solve complex problems. We identify a series of bioinformatics problems posed by interdisciplinary integration: (i) data integration that interconnects structured data across related biomedical domains; (ii) ontology integration that brings jargons, terminologies and taxonomies from various disciplines into a unified network of ontologies; (iii) knowledge integration that integrates disparate knowledge elements from multiple sources; (iv) service integration that build applications out of services provided by different vendors. We argue that IB can benefit significantly from the integration solutions enabled by Semantic Web (SW) technologies. The SW enables scientists to share content beyond the boundaries of applications and websites, resulting into a web of data that is meaningful and understandable to any computers. In this review, we provide insight into how SW technologies can be used to build open, standardized and interoperable solutions for interdisciplinary integration on a global basis. We present a rich set of case studies in system biology, integrative neuroscience, bio-pharmaceutics and translational medicine, to highlight the technical features and benefits of SW applications in IB.

A very good summary of the issues of data integration in bioinformatics.

I disagree with the prescription, as you might imagine, but it is a good starting place for discussion of the issues of data integration.

Interactive Text Mining

Filed under: Annotation,Bioinformatics,Curation,Text Mining — Patrick Durusau @ 8:03 pm

An overview of the BioCreative 2012 Workshop Track III: interactive text mining task by Cecilia N. Arighi, et al. (Database (2013) 2013 : bas056 doi: 10.1093/database/bas056)


In many databases, biocuration primarily involves literature curation, which usually involves retrieving relevant articles, extracting information that will translate into annotations and identifying new incoming literature. As the volume of biological literature increases, the use of text mining to assist in biocuration becomes increasingly relevant. A number of groups have developed tools for text mining from a computer science/linguistics perspective, and there are many initiatives to curate some aspect of biology from the literature. Some biocuration efforts already make use of a text mining tool, but there have not been many broad-based systematic efforts to study which aspects of a text mining tool contribute to its usefulness for a curation task. Here, we report on an effort to bring together text mining tool developers and database biocurators to test the utility and usability of tools. Six text mining systems presenting diverse biocuration tasks participated in a formal evaluation, and appropriate biocurators were recruited for testing. The performance results from this evaluation indicate that some of the systems were able to improve efficiency of curation by speeding up the curation task significantly (∼1.7- to 2.5-fold) over manual curation. In addition, some of the systems were able to improve annotation accuracy when compared with the performance on the manually curated set. In terms of inter-annotator agreement, the factors that contributed to significant differences for some of the systems included the expertise of the biocurator on the given curation task, the inherent difficulty of the curation and attention to annotation guidelines. After the task, annotators were asked to complete a survey to help identify strengths and weaknesses of the various systems. 
The analysis of this survey highlights how important task completion is to the biocurators’ overall experience of a system, regardless of the system’s high score on design, learnability and usability. In addition, strategies to refine the annotation guidelines and systems documentation, to adapt the tools to the needs and query types the end user might have and to evaluate performance in terms of efficiency, user interface, result export and traditional evaluation metrics have been analyzed during this task. This analysis will help to plan for a more intense study in BioCreative IV.

Curation is an aspect of topic map authoring, albeit with the latter capturing information for later merging with other sources of information.

Definitely an article you will want to read if you are designing text mining as part of a topic map solution.

January 19, 2013

The Pacific Symposium on Biocomputing 2013 [Proceedings]

Filed under: Bioinformatics,Biomedical,Data Mining,Text Mining — Patrick Durusau @ 7:09 pm

The Pacific Symposium on Biocomputing 2013 by Will Bush.

From the post:

For 18 years now, computational biologists have convened on the beautiful islands of Hawaii to present and discuss research emerging from new areas of biomedicine. PSB Conference Chairs Teri Klein (@teriklein), Keith Dunker, Russ Altman (@Rbaltman) and Larry Hunter (@ProfLHunter) organize innovative sessions and tutorials that are always interactive and thought-provoking. This year, sessions included Computational Drug Repositioning, Epigenomics, Aberrant Pathway and Network Activity, Personalized Medicine, Phylogenomics and Population Genomics, Post-Next Generation Sequencing, and Text and Data Mining. The Proceedings are available online here, and a few of the highlights are:

See Will’s post for the highlights. Or browse the proceedings. You are almost certainly going to find something relevant to you.

Do note Will’s use of Twitter IDs as identifiers. Unique, persistent (I assume Twitter doesn’t re-assign them), easy to access.

It wasn’t clear from Will’s post if the following image was from Biocomputing 2013 or if he stopped by a markup conference. Hard to tell. 😉

[Image: Biocomputing 2013]

January 18, 2013


Filed under: Associations,Bioinformatics,Biomedical — Patrick Durusau @ 7:18 pm

PPInterFinder—a mining tool for extracting causal relations on human proteins from literature by Kalpana Raja, Suresh Subramani and Jeyakumar Natarajan. (Database (2013) 2013 : bas052 doi: 10.1093/database/bas052)


One of the most common and challenging problem in biomedical text mining is to mine protein–protein interactions (PPIs) from MEDLINE abstracts and full-text research articles because PPIs play a major role in understanding the various biological processes and the impact of proteins in diseases. We implemented, PPInterFinder—a web-based text mining tool to extract human PPIs from biomedical literature. PPInterFinder uses relation keyword co-occurrences with protein names to extract information on PPIs from MEDLINE abstracts and consists of three phases. First, it identifies the relation keyword using a parser with Tregex and a relation keyword dictionary. Next, it automatically identifies the candidate PPI pairs with a set of rules related to PPI recognition. Finally, it extracts the relations by matching the sentence with a set of 11 specific patterns based on the syntactic nature of PPI pair. We find that PPInterFinder is capable of predicting PPIs with the accuracy of 66.05% on AIMED corpus and outperforms most of the existing systems.

Database URL:

I thought the shortened form of the title would catch your eye. 😉

Important work for bioinformatics but it is also an example of domain specific association mining.

By focusing on a specific domain and forswearing designs on being a universal association solution, PPInterFinder produces useful results today.

A lesson that should be taken and applied to semantic mappings more generally.
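The keyword co-occurrence idea from the abstract can be illustrated with a toy extractor. This is a minimal sketch and not PPInterFinder's method: it uses plain tokenizing instead of a Tregex parser, skips the 11 syntactic patterns entirely, and its small protein and keyword dictionaries are hypothetical stand-ins.

```python
import re
from itertools import combinations

# Hypothetical, tiny dictionaries standing in for real lexicons.
PROTEINS = {"BRCA1", "BARD1", "TP53"}
RELATION_KEYWORDS = {"binds", "interacts", "activates", "inhibits"}


def extract_candidates(sentence: str) -> list:
    """Return (protein, keyword, protein) triples that co-occur in a sentence."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    proteins = list(dict.fromkeys(t for t in tokens if t in PROTEINS))
    keywords = list(dict.fromkeys(
        t.lower() for t in tokens if t.lower() in RELATION_KEYWORDS))
    return [
        (first, keyword, second)
        for first, second in combinations(proteins, 2)
        for keyword in keywords
    ]


if __name__ == "__main__":
    print(extract_candidates("BRCA1 interacts with BARD1 in vivo."))
    print(extract_candidates("TP53 is a tumor suppressor."))
```

Narrowing the dictionaries to one domain is what keeps even a crude matcher like this from drowning in false positives, which is the point of the post.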

January 17, 2013

UniChem…[How Much Precision Can You Afford?]

Filed under: Bioinformatics,Biomedical,Cheminformatics,Topic Maps — Patrick Durusau @ 7:26 pm

UniChem: a unified chemical structure cross-referencing and identifier tracking system by Jon Chambers, Mark Davies, Anna Gaulton, Anne Hersey, Sameer Velankar, Robert Petryszak, Janna Hastings, Louisa Bellis, Shaun McGlinchey and John P Overington. (Journal of Cheminformatics 2013, 5:3 doi:10.1186/1758-2946-5-3)


UniChem is a freely available compound identifier mapping service on the internet, designed to optimize the efficiency with which structure-based hyperlinks may be built and maintained between chemistry-based resources. In the past, the creation and maintenance of such links at EMBL-EBI, where several chemistry-based resources exist, has required independent efforts by each of the separate teams. These efforts were complicated by the different data models, release schedules, and differing business rules for compound normalization and identifier nomenclature that exist across the organization. UniChem, a large-scale, non-redundant database of Standard InChIs with pointers between these structures and chemical identifiers from all the separate chemistry resources, was developed as a means of efficiently sharing the maintenance overhead of creating these links. Thus, for each source represented in UniChem, all links to and from all other sources are automatically calculated and immediately available for all to use. Updated mappings are immediately available upon loading of new data releases from the sources. Web services in UniChem provide users with a single simple automatable mechanism for maintaining all links from their resource to all other sources represented in UniChem. In addition, functionality to track changes in identifier usage allows users to monitor which identifiers are current, and which are obsolete. Lastly, UniChem has been deliberately designed to allow additional resources to be included with minimal effort. Indeed, the recent inclusion of data sources external to EMBL-EBI has provided a simple means of providing users with an even wider selection of resources with which to link to, all at no extra cost, while at the same time providing a simple mechanism for external resources to link to all EMBL-EBI chemistry resources.

From the background section:

Since these resources are continually developing in response to largely distinct active user communities, a full integration solution, or even the imposition of a requirement to adopt a common unifying chemical identifier, was considered unnecessarily complex, and would inhibit the freedom of each of the resources to successfully evolve in future. In addition, it was recognized that in the future more small molecule-containing databases might reside at EMBL-EBI, either because existing databases may begin to annotate their data with chemical information, or because entirely new resources are developed or adopted. This would make a full integration solution even more difficult to sustain. A need was therefore identified for a flexible integration solution, which would create, maintain and manage links between the resources, with minimal maintenance costs to the participant resources, whilst easily allowing the inclusion of additional sources in the future. Also, since the solution should allow different resources to maintain their own identifier systems, it was recognized as important for the system to have some simple means of tracking identifier usage, at least in the sense of being able to archive obsolete identifiers and assignments, and indicate when obsolete assignments were last in use.

The UniChem project highlights an important aspect of mapping identifiers: How much mapping can you afford?

Or perhaps even better: What is the cost/benefit ratio for a complete mapping?

The mapping in question isn’t an academic exercise in elegance and completeness.

Its users have an immediate need for the mapping data, and if it is not quite right, human users are in the best position to spot the problems and suggest corrections.

Not to mention that new identifiers are likely to arrive before the old ones are completely mapped.

Suggestive that evolving mappings may be an appropriate paradigm for topic maps.
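
To make "evolving mappings" a little more concrete, here is a minimal sketch in Python. It is not UniChem's implementation or API. The class and method names, the source labels and the structure keys are all hypothetical, and it only assumes what the abstract describes: identifiers from several sources hang off a shared structure key, and obsolete assignments are archived rather than deleted.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Assignment:
    source: str         # hypothetical source label, e.g. "source_a"
    identifier: str     # that source's own identifier
    last_release: int   # last release of the source in which this was seen
    current: bool = True


class EvolvingMapping:
    """Structure-keyed cross-references that are loaded release by release."""

    def __init__(self):
        self._by_structure = defaultdict(list)

    def load_release(self, source, release, pairs):
        """Load (structure_key, identifier) pairs from one release of one source."""
        pairs = list(pairs)
        seen = set(pairs)
        # Assignments missing from this release become obsolete but are kept.
        for key, assignments in self._by_structure.items():
            for a in assignments:
                if a.source == source and a.current and (key, a.identifier) not in seen:
                    a.current = False
        for key, identifier in pairs:
            for a in self._by_structure[key]:
                if a.source == source and a.identifier == identifier:
                    a.current, a.last_release = True, release
                    break
            else:
                self._by_structure[key].append(Assignment(source, identifier, release))

    def links(self, source, identifier, include_obsolete=False):
        """Return (source, identifier) pairs that share a structure with the input."""
        found = set()
        for assignments in self._by_structure.values():
            if any(a.source == source and a.identifier == identifier for a in assignments):
                for a in assignments:
                    if a.source != source and (a.current or include_obsolete):
                        found.add((a.source, a.identifier))
        return sorted(found)


mapping = EvolvingMapping()
mapping.load_release("source_a", 1, [("KEY-A", "A1"), ("KEY-B", "A2")])
mapping.load_release("source_b", 1, [("KEY-A", "B9")])
mapping.load_release("source_b", 2, [("KEY-A", "B10")])   # B9 is now obsolete
print(mapping.links("source_a", "A1"))
print(mapping.links("source_a", "A1", include_obsolete=True))
```

The point of the sketch is that nothing has to be complete before it is useful: each release simply adds to, or retires from, whatever mapping already exists.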

January 12, 2013

The Xenbase literature curation process

Filed under: Bioinformatics,Curation,Literature — Patrick Durusau @ 7:01 pm

The Xenbase literature curation process by Jeff B. Bowes, Kevin A. Snyder, Christina James-Zorn, Virgilio G. Ponferrada, Chris J. Jarabek, Kevin A. Burns, Bishnu Bhattacharyya, Aaron M. Zorn and Peter D. Vize.


Xenbase is the model organism database for Xenopus tropicalis and Xenopus laevis, two frog species used as model systems for developmental and cell biology. Xenbase curation processes centre on associating papers with genes and extracting gene expression patterns. Papers from PubMed with the keyword ‘Xenopus’ are imported into Xenbase and split into two curation tracks. In the first track, papers are automatically associated with genes and anatomy terms, images and captions are semi-automatically imported and gene expression patterns found in those images are manually annotated using controlled vocabularies. In the second track, full text of the same papers are downloaded and indexed by a number of controlled vocabularies and made available to users via the Textpresso search engine and text mining tool.

Which curation workflow will work best for your topic map activities will depend upon a number of factors.

What would you adopt, adapt or alter from the curation workflow in this article?

How would you evaluate the effectiveness of any of your changes?

Manual Alignment of Anatomy Ontologies

Filed under: Alignment,Bioinformatics,Biomedical,Ontology — Patrick Durusau @ 7:00 pm

Matching arthropod anatomy ontologies to the Hymenoptera Anatomy Ontology: results from a manual alignment by Matthew A. Bertone, István Mikó, Matthew J. Yoder, Katja C. Seltmann, James P. Balhoff, and Andrew R. Deans. (Database (2013) 2013 : bas057 doi: 10.1093/database/bas057)


Matching is an important step for increasing interoperability between heterogeneous ontologies. Here, we present alignments we produced as domain experts, using a manual mapping process, between the Hymenoptera Anatomy Ontology and other existing arthropod anatomy ontologies (representing spiders, ticks, mosquitoes and Drosophila melanogaster). The resulting alignments contain from 43 to 368 mappings (correspondences), all derived from domain-expert input. Despite the many pairwise correspondences, only 11 correspondences were found in common between all ontologies, suggesting either major intrinsic differences between each ontology or gaps in representing each group’s anatomy. Furthermore, we compare our findings with putative correspondences from Bioportal (derived from LOOM software) and summarize the results in a total evidence alignment. We briefly discuss characteristics of the ontologies and issues with the matching process.


A great example of the difficulty of matching across ontologies, particularly when the granularity or subjects of ontologies vary.
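
A small Python sketch shows why hundreds of pairwise correspondences can shrink to a handful held in common. It is only an illustration: the function name, ontology labels and term names are invented, and it is not the procedure the authors used. It assumes each alignment is a dictionary from a reference-ontology term to the matched term in another ontology.

```python
def common_correspondences(alignments):
    """Return the reference terms that have a match in every aligned ontology.

    alignments maps an ontology label to {reference_term: matched_term}.
    """
    if not alignments:
        return set()
    term_sets = [set(mapping) for mapping in alignments.values()]
    return set.intersection(*term_sets)


# Hypothetical alignments against one reference ontology.
alignments = {
    "spider": {"leg": "SPD:leg", "eye": "SPD:eye", "cuticle": "SPD:cuticle"},
    "tick": {"leg": "TCK:leg", "cuticle": "TCK:cuticle"},
    "mosquito": {"leg": "MSQ:leg", "eye": "MSQ:eye", "wing": "MSQ:wing"},
}
print(common_correspondences(alignments))
```

Every ontology added to the comparison can only keep or reduce the common set, which is one way to read the drop from hundreds of pairwise mappings to 11.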

January 8, 2013

PLOS Computational Biology: Translational Bioinformatics

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 11:42 am

PLOS Computational Biology: Translational Bioinformatics. Maricel Kann, Guest Editor, and Fran Lewitter, PLOS Computational Biology Education Editor.

Following up on the collection where Biomedical Knowledge Integration appears, only to find:

Introduction to Translational Bioinformatics Collection by Russ B. Altman. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002796

Chapter 1: Biomedical Knowledge Integration by Philip R. O. Payne. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002826

Chapter 2: Data-Driven View of Disease Biology by Casey S. Greene and Olga G. Troyanskaya. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002816

Chapter 3: Small Molecules and Disease by David S. Wishart. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002805

Chapter 4: Protein Interactions and Disease by Mileidy W. Gonzalez and Maricel G. Kann. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002819

Chapter 5: Network Biology Approach to Complex Diseases by Dong-Yeon Cho, Yoo-Ah Kim and Teresa M. Przytycka. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002820

Chapter 6: Structural Variation and Medical Genomics by Benjamin J. Raphael. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002821

Chapter 7: Pharmacogenomics by Konrad J. Karczewski, Roxana Daneshjou and Russ B. Altman. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002817

Chapter 8: Biological Knowledge Assembly and Interpretation by Ju Han Kim. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002858

Chapter 9: Analyses Using Disease Ontologies by Nigam H. Shah, Tyler Cole and Mark A. Musen. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002827

Chapter 10: Mining Genome-Wide Genetic Markers by Xiang Zhang, Shunping Huang, Zhaojun Zhang and Wei Wang. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002828

Chapter 11: Genome-Wide Association Studies by William S. Bush and Jason H. Moore. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002822

Chapter 12: Human Microbiome Analysis by Xochitl C. Morgan and Curtis Huttenhower. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002808

Chapter 13: Mining Electronic Health Records in the Genomics Era by Joshua C. Denny. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002823

Chapter 14: Cancer Genome Analysis by Miguel Vazquez, Victor de la Torre and Alfonso Valencia. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002824

An example of scholarship at its best!

Biomedical Knowledge Integration

Filed under: Bioinformatics,Biomedical,Data Integration,Medical Informatics — Patrick Durusau @ 11:41 am

Biomedical Knowledge Integration by Philip R. O. Payne.


The modern biomedical research and healthcare delivery domains have seen an unparalleled increase in the rate of innovation and novel technologies over the past several decades. Catalyzed by paradigm-shifting public and private programs focusing upon the formation and delivery of genomic and personalized medicine, the need for high-throughput and integrative approaches to the collection, management, and analysis of heterogeneous data sets has become imperative. This need is particularly pressing in the translational bioinformatics domain, where many fundamental research questions require the integration of large scale, multi-dimensional clinical phenotype and bio-molecular data sets. Modern biomedical informatics theory and practice has demonstrated the distinct benefits associated with the use of knowledge-based systems in such contexts. A knowledge-based system can be defined as an intelligent agent that employs a computationally tractable knowledge base or repository in order to reason upon data in a targeted domain and reproduce expert performance relative to such reasoning operations. The ultimate goal of the design and use of such agents is to increase the reproducibility, scalability, and accessibility of complex reasoning tasks. Examples of the application of knowledge-based systems in biomedicine span a broad spectrum, from the execution of clinical decision support, to epidemiologic surveillance of public data sets for the purposes of detecting emerging infectious diseases, to the discovery of novel hypotheses in large-scale research data sets. 
In this chapter, we will review the basic theoretical frameworks that define core knowledge types and reasoning operations with particular emphasis on the applicability of such conceptual models within the biomedical domain, and then go on to introduce a number of prototypical data integration requirements and patterns relevant to the conduct of translational bioinformatics that can be addressed via the design and use of knowledge-based systems.

A chapter in the “Translational Bioinformatics” collection for PLOS Computational Biology.

A very good survey of the knowledge integration area, which alas does not include topic maps. 🙁

Well, but it does include use cases at the end of the chapter that are biomedical specific.

Thinking those would be good cases to illustrate the use of topic maps for biomedical knowledge integration.


January 5, 2013

Semantically enabling a genome-wide association study database

Filed under: Bioinformatics,Biomedical,Genomics,Medical Informatics,Ontology — Patrick Durusau @ 2:20 pm

Semantically enabling a genome-wide association study database by Tim Beck, Robert C Free, Gudmundur A Thorisson and Anthony J Brookes. Journal of Biomedical Semantics 2012, 3:9 doi:10.1186/2041-1480-3-9.



The amount of data generated from genome-wide association studies (GWAS) has grown rapidly, but considerations for GWAS phenotype data reuse and interchange have not kept pace. This impacts on the work of GWAS Central — a free and open access resource for the advanced querying and comparison of summary-level genetic association data. The benefits of employing ontologies for standardising and structuring data are widely accepted. The complex spectrum of observed human phenotypes (and traits), and the requirement for cross-species phenotype comparisons, calls for reflection on the most appropriate solution for the organisation of human phenotype data. The Semantic Web provides standards for the possibility of further integration of GWAS data and the ability to contribute to the web of Linked Data.


A pragmatic consideration when applying phenotype ontologies to GWAS data is the ability to retrieve all data, at the most granular level possible, from querying a single ontology graph. We found the Medical Subject Headings (MeSH) terminology suitable for describing all traits (diseases and medical signs and symptoms) at various levels of granularity and the Human Phenotype Ontology (HPO) most suitable for describing phenotypic abnormalities (medical signs and symptoms) at the most granular level. Diseases within MeSH are mapped to HPO to infer the phenotypic abnormalities associated with diseases. Building on the rich semantic phenotype annotation layer, we are able to make cross-species phenotype comparisons and publish a core subset of GWAS data as RDF nanopublications.


We present a methodology for applying phenotype annotations to a comprehensive genome-wide association dataset and for ensuring compatibility with the Semantic Web. The annotations are used to assist with cross-species genotype and phenotype comparisons. However, further processing and deconstructions of terms may be required to facilitate automatic phenotype comparisons. The provision of GWAS nanopublications enables a new dimension for exploring GWAS data, by way of intrinsic links to related data resources within the Linked Data web. The value of such annotation and integration will grow as more biomedical resources adopt the standards of the Semantic Web.

Rather than:

The benefits of employing ontologies for standardising and structuring data are widely accepted.

I would rephrase that to read:

The benefits and limitations of employing ontologies for standardising and structuring data are widely known.

Decades of use of relational database schemas, informal equivalents of ontologies, leave no doubt governing structures for data have benefits.

Less often acknowledged is that those same governing structures impose limitations on data and what may be represented.

That’s not a dig at relational databases.

Just an observation that ontologies and their equivalents aren’t unalloyed precious metals.

December 27, 2012

Utopia Documents

Filed under: Annotation,Bioinformatics,Biomedical,Medical Informatics,PDF — Patrick Durusau @ 3:58 pm

Checking the “sponsored by” link for pdfx v1.0, I discovered Utopia Documents.

From the homepage:

Reading, redefined.

Utopia Documents brings a fresh new perspective to reading the scientific literature, combining the convenience and reliability of the PDF with the flexibility and power of the web. Free for Linux, Mac and Windows.

Building Bridges

The scientific article has been described as a Story That Persuades With Data, but all too often the link between data and narrative is lost somewhere in the modern publishing process. Utopia Documents helps to rebuild these connections, linking articles to underlying datasets, and making it easy to access online resources relating to an article’s content.

A Living Resource

Published articles form the ‘minutes of science’, creating a stable record of ideas and discoveries. But no idea exists in isolation, and just because something has been published doesn’t mean that the story is over. Utopia Documents reconnects PDFs with the ongoing discussion, keeping you up-to-date with the latest knowledge and metrics.


Make private notes for yourself, annotate a document for others to see or take part in an online discussion.

Explore article content

Looking for clarification of given terms? Or more information about them? Do just that, with integrated semantic search.

Interact with live data

Interact directly with curated database entries- play with molecular structures; edit sequence and alignment data; even plot and export tabular data.

A finger on the pulse

Stay up to date with the latest news. Utopia connects what you read with live data from Altmetric, Mendeley, CrossRef, Scibite and others.

A user can register for an account (enabling comments on documents) or use the application anonymously.

It is presently focused on the life sciences, but there is no impediment to expansion into computer science, for example.

It doesn’t solve semantic diversity issues, so there is an opportunity for topic maps there.

Doesn’t address the issue of documents being good at information delivery but not so good for information storage.

But issues of semantic diversity and information storage are growth areas for Utopia Documents, not reservations about its use.

Suggest you start using and exploring Utopia Documents sooner rather than later!

December 26, 2012

EOL Classification Providers [Encyclopedia of Life]

Filed under: Bioinformatics,Biomedical,Classification — Patrick Durusau @ 7:17 pm

EOL Classification Providers

From the webpage:

The information on EOL is organized using hierarchical classifications of taxa (groups of organisms) from a number of different classification providers. You can explore these hierarchies in the Names tab of EOL taxon pages. Many visitors would expect to see a single classification of life on EOL. However, we are still far from having a classification scheme that is universally accepted.

Biologists all over the world are studying the genetic relationships between organisms in order to determine each species’ place in the hierarchy of life. While this research is underway, there will be differences in opinion on how to best classify each group. Therefore, we present our visitors with a number of alternatives. Each of these hierarchies is supported by a community of scientists, and all of them feature relationships that are controversial or unresolved.

How far from universally accepted?

Consider the sources for classification:

AntWeb is generally recognized as the most advanced biodiversity information system at species level dedicated to ants. Altogether, its acceptance by the ant research community, the number of participating remote curators that maintain the site, number of pictures, simplicity of web interface, and completeness of species, make AntWeb the premier reference for dissemination of data, information, and knowledge on ants. AntWeb is serving information on tens of thousands of ant species through the EOL.

Avibase is an extensive database information system about all birds of the world, containing over 6 million records about 10,000 species and 22,000 subspecies of birds, including distribution information, taxonomy, synonyms in several languages and more. This site is managed by Denis Lepage and hosted by Bird Studies Canada, the Canadian copartner of Birdlife International. Avibase has been a work in progress since 1992 and it is offered as a free service to the bird-watching and scientific community. In addition to links, Avibase helped us install Gill, F & D Donsker (Eds). 2012. IOC World Bird Names (v 3.1). Available at as of 2 May 2012.  More bird classifications are likely to follow

The Catalogue of Life Partnership (CoLP) is an informal partnership dedicated to creating an index of the world’s organisms, called the Catalogue of Life (CoL). The CoL provides different forms of access to an integrated, quality, maintained, comprehensive consensus species checklist and taxonomic hierarchy, presently covering more than one million species, and intended to cover all know species in the near future. The Annual Checklist EOL uses contains substantial contributions of taxonomic expertise from more than fifty organizations around the world, integrated into a single work by the ongoing work of the CoLP partners. 

FishBase is a global information system with all you ever wanted to know about fishes. FishBase is a relational database with information to cater to different professionals such as research scientists, fisheries managers, zoologists and many more. The FishBase Website contains data on practically every fish species known to science. The project was developed at the WorldFish Center in collaboration with the Food and Agriculture Organization of the United Nations and many other partners, and with support from the European Commission. FishBase is serving information on more than 30,000 fish species through EOL.

Index Fungorum
The Index Fungorum, the global fungal nomenclator coordinated and supported by the Index Fungorum Partnership (CABI, CBS, Landcare Research-NZ), contains names of fungi (including yeasts, lichens, chromistan fungal analogues, protozoan fungal analogues and fossil forms) at all ranks.

The Integrated Taxonomic Information System (ITIS) is a partnership of federal agencies and other organizations from the United States, Canada, and Mexico, with data stewards and experts from around the world. The ITIS database is an automated reference of scientific and common names of biota of interest to North America. It contains more than 600,000 scientific and common names in all kingdoms, and is accessible via the World Wide Web in English, French, Spanish, and Portuguese. ITIS is part of the US National Biological Information Infrastructure.

International Union for Conservation of Nature (IUCN) helps the world find pragmatic solutions to our most pressing environment and development challenges. IUCN supports scientific research; manages field projects all over the world; and brings governments, non-government organizations, United Nations agencies, companies and local communities together to develop and implement policy, laws and best practice. EOL partnered with the IUCN to indicate status of each species according to the Red List of Threatened Species.

Metalmark Moths of the World
Metalmark moths (Lepidoptera: Choreutidae) are a poorly known, mostly tropical family of microlepidopterans. The Metalmark Moths of the World LifeDesk provides species pages and an updated classification for the group.

As a U.S. national resource for molecular biology information, NCBI’s mission is to develop new information technologies to aid in the understanding of fundamental molecular and genetic processes that control health and disease. The NCBI taxonomy database contains the names of all organisms that are represented in the genetic databases with at least one nucleotide or protein sequence.

The Paleobiology Database
The Paleobiology Database is a public resource for the global scientific community. It has been organized and operated by a multi-disciplinary, multi-institutional, international group of paleobiological researchers. Its purpose is to provide global, collection-based occurrence and taxonomic data for marine and terrestrial animals and plants of any geological age, as well as web-based software for statistical analysis of the data. The project’s wider, long-term goal is to encourage collaborative efforts to answer large-scale paleobiological questions by developing a useful database infrastructure and bringing together large data sets.

The Reptile Database 
This database provides information on the classification of all living reptiles by listing all species and their pertinent higher taxa. The database therefore covers all living snakes, lizards, turtles, amphisbaenians, tuataras, and crocodiles. It is a source of taxonomic data, thus providing primarily (scientific) names, synonyms, distributions and related data. The database is currently supported by the Systematics working group of the German Herpetological Society (DGHT)

The aim of a World Register of Marine Species (WoRMS) is to provide an authoritative and comprehensive list of names of marine organisms, including information on synonymy. While highest priority goes to valid names, other names in use are included so that this register can serve as a guide to interpret taxonomic literature.

Those are “current” classifications, which don’t reflect historical classifications (used by our ancestors), nor future classifications.

The four states of matter becoming > 500 states of matter for example.

Instead of “universal acceptance,” how does “working agreement for a specific purpose” sound?

December 21, 2012

New Public-Access Source With 3-D Information for Protein Interactions

Filed under: Bioinformatics,Biomedical,Genome,Genomics — Patrick Durusau @ 5:24 pm

New Public-Access Source With 3-D Information for Protein Interactions

From the post:

Researchers have developed a platform that compiles all the atomic data, previously stored in diverse databases, on protein structures and protein interactions for eight organisms of relevance. They apply a singular homology-based modelling procedure.

The scientists Roberto Mosca, Arnaud Ceol and Patrick Aloy provide the international biomedical community with Interactome3D, an open-access and free web platform developed entirely by the Institute for Research in Biomedicine (IRB Barcelona). Interactome 3D offers for the first time the possibility to anonymously access and add molecular details of protein interactions and to obtain the information in 3D models. For researchers, atomic level details about the reactions are fundamental to unravel the bases of biology, disease development, and the design of experiments and drugs to combat diseases.

Interactome 3D provides reliable information about more than 12,000 protein interactions for eight model organisms, namely the plant Arabidopsis thaliana, the worm Caenorhabditis elegans, the fly Drosophila melanogaster, the bacteria Escherichia coli and Helicobacter pylori, the brewer’s yeast Saccharomyces cerevisiae, the mouse Mus musculus, and Homo sapiens. These models are considered the most relevant in biomedical research and genetic studies. The journal Nature Methods presents the research results and accredits the platform on the basis of its high reliability and precision in modelling interactions, which reaches an average of 75%.

Further details can be found at:

Interactome3D: adding structural details to protein networks by Roberto Mosca, Arnaud Céol and Patrick Aloy. (Nature Methods (2012) doi:10.1038/nmeth.2289)


Network-centered approaches are increasingly used to understand the fundamentals of biology. However, the molecular details contained in the interaction networks, often necessary to understand cellular processes, are very limited, and the experimental difficulties surrounding the determination of protein complex structures make computational modeling techniques paramount. Here we present Interactome3D, a resource for the structural annotation and modeling of protein-protein interactions. Through the integration of interaction data from the main pathway repositories, we provide structural details at atomic resolution for over 12,000 protein-protein interactions in eight model organisms. Unlike static databases, Interactome3D also allows biologists to upload newly discovered interactions and pathways in any species, select the best combination of structural templates and build three-dimensional models in a fully automated manner. Finally, we illustrate the value of Interactome3D through the structural annotation of the complement cascade pathway, rationalizing a potential common mechanism of action suggested for several disease-causing mutations.

Interesting not only for its implications for bioinformatics but for the development of homology modeling (superficially, similar proteins have similar interaction sites) to assist in their work.

The topic map analogy would be to show that, within a subject domain, different identifications of the same subject tend to have the same associations or to fall into other patterns.

Then constructing a subject identity test based upon a template of associations or other values.
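
As a rough sketch of what such a test might look like, consider the Python below. Everything in it is an assumption made for illustration: the function names, the (association type, role) pairs and the 0.8 threshold are hypothetical, and it is not drawn from Interactome3D or from any topic map engine. The template is simply the set of association pairs shared by every known identification of a subject, and a candidate passes if it exhibits enough of that template.

```python
def build_template(known_profiles):
    """Keep only the (association_type, role) pairs shared by every known identification."""
    profiles = [frozenset(p) for p in known_profiles]
    if not profiles:
        return frozenset()
    return frozenset.intersection(*profiles)


def same_subject(candidate_profile, template, threshold=0.8):
    """Hypothetical identity test: enough of the template must appear in the candidate."""
    if not template:
        return False  # an empty template offers no evidence either way
    overlap = len(template & frozenset(candidate_profile))
    return overlap / len(template) >= threshold


# Two known identifications of the same subject (made-up association pairs).
known = [
    {("interacts_with", "ligand"), ("located_in", "component"), ("cited_by", "subject")},
    {("interacts_with", "ligand"), ("located_in", "component")},
]
template = build_template(known)

candidate = {("interacts_with", "ligand"), ("located_in", "component"), ("named_in", "subject")}
print(same_subject(candidate, template))
```

Like homology modeling, a test of this kind gives a likely match rather than a guaranteed one, so the threshold is a tuning choice for a particular domain.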

December 18, 2012

Bio-Linux 7 – Released November 2012

Filed under: Bio-Linux,Bioinformatics,Biomedical,Linux OS — Patrick Durusau @ 5:24 pm

Bio-Linux 7 – Released November 2012

From the webpage:

Bio-Linux 7 is a fully featured, powerful, configurable and easy to maintain bioinformatics workstation. Bio-Linux provides more than 500 bioinformatics programs on an Ubuntu Linux 12.04 LTS base. There is a graphical menu for bioinformatics programs, as well as easy access to the Bio-Linux bioinformatics documentation system and sample data useful for testing programs.

Bio-Linux 7 adds many improvements over previous versions, including the Galaxy analysis environment.  There are also various packages to handle new generation sequence data types.

You can install Bio-Linux on your machine, either as the only operating system, or as part of a dual-boot setup which allows you to use your current system and Bio-Linux on the same hardware.

Bio-Linux also runs Live from the DVD or a USB stick. This runs in the memory of your machine and does not involve installing anything. This is a great, no-hassle way to try out Bio-Linux, demonstrate or teach with it, or to work with when you are on the move.

Bio-Linux is built on open source systems and software, and so is free to install and use. See What’s new on Bio-Linux 7. Also, check out the 2006 paper on Bio-Linux and open source systems for biologists.

Useful for exploring bioinformatics tools for Ubuntu.

But useful as well for considering how those tools could be used in data/text mining for other domains.

Not to mention the packaging for installation to DVD or USB stick.

Are there any topic map engines that are set up for burning to DVD or USB stick?

Packaging them that way with more than a minimal set of maps and/or data sets might be a useful avenue to explore.

December 17, 2012

taxize: Taxonomic search and phylogeny retrieval [R]

Filed under: Bioinformatics,Biomedical,Phylogenetic Trees,Searching,Taxonomy — Patrick Durusau @ 4:58 pm

taxize: Taxonomic search and phylogeny retrieval by Scott Chamberlain, Eduard Szoecs and Carl Boettiger.

From the documentation:

We are developing taxize as a package to allow users to search over many websites for species names (scientific and common) and download up- and downstream taxonomic hierarchical information – and many other things. The functions in the package that hit a specific API have a prefix and suffix separated by an underscore. They follow the format of service_whatitdoes. For example, gnr_resolve uses the Global Names Resolver API to resolve species names. General functions in the package that don’t hit a specific API don’t have two words separated by an underscore, e.g., classification. You need API keys for Encyclopedia of Life (EOL), the Universal Biological Indexer and Organizer (uBio), Tropicos, and Plantminer.

Just in case you need species names and/or taxonomic hierarchy information for your topic map.

Go3R [Searching for Alternatives to Animal Testing]


A semantic search engine for finding alternatives to animal testing.

I mention it as an example of a search interface that assists the user in searching.

The help documentation is a bit sparse if you are looking for an opportunity to contribute to such a project.

I did locate some additional information on the project, all usefully with the same title to make locating it “easy.” 😉

[Introduction] Knowledge-based semantic search engine for alternative methods to animal experiments

[PubMed – entry] Go3R – semantic Internet search engine for alternative methods to animal testing by Sauer UG, Wächter T, Grune B, Doms A, Alvers MR, Spielmann H, Schroeder M. (ALTEX. 2009;26(1):17-31).


Consideration and incorporation of all available scientific information is an important part of the planning of any scientific project. As regards research with sentient animals, EU Directive 86/609/EEC for the protection of laboratory animals requires scientists to consider whether any planned animal experiment can be substituted by other scientifically satisfactory methods not entailing the use of animals or entailing less animals or less animal suffering, before performing the experiment. Thus, collection of relevant information is indispensable in order to meet this legal obligation. However, no standard procedures or services exist to provide convenient access to the information required to reliably determine whether it is possible to replace, reduce or refine a planned animal experiment in accordance with the 3Rs principle. The search engine Go3R, which is available free of charge, runs up to become such a standard service. Go3R is the world-wide first search engine on alternative methods building on new semantic technologies that use an expert-knowledge based ontology to identify relevant documents. Due to Go3R’s concept and design, the search engine can be used without lengthy instructions. It enables all those involved in the planning, authorisation and performance of animal experiments to determine the availability of non-animal methodologies in a fast, comprehensive and transparent manner. Thereby, Go3R strives to significantly contribute to the avoidance and replacement of animal experiments.

[ALTEX entry – full text available] Go3R – Semantic Internet Search Engine for Alternative Methods to Animal Testing

December 15, 2012

Neuroscience Information Framework (NIF)

Filed under: Bioinformatics,Biomedical,Medical Informatics,Neuroinformatics,Searching — Patrick Durusau @ 8:21 pm

Neuroscience Information Framework (NIF)

From the about page:

The Neuroscience Information Framework is a dynamic inventory of Web-based neuroscience resources: data, materials, and tools accessible via any computer connected to the Internet. An initiative of the NIH Blueprint for Neuroscience Research, NIF advances neuroscience research by enabling discovery and access to public research data and tools worldwide through an open source, networked environment.

An example of a subject-specific information resource that provides much deeper coverage than is possible with Google.

If you aren’t trying to index everything, you can outperform more general search solutions.

Rosalind


Filed under: Bioinformatics,Python,Teaching — Patrick Durusau @ 8:16 pm


From the homepage:

Rosalind is a platform for learning bioinformatics through problem solving.

Rather than teaching topic maps from the “basics” forward, what about teaching problems for which topic maps are a likely solution?

And introduce syntax/practices as solutions to particular issues?

Suggestions for problems?

November 27, 2012

SINAInnovation: Innovation and Data

Filed under: Bioinformatics,Cloudera,Data — Patrick Durusau @ 2:26 pm

SINAInnovation: Innovation and Data by Jeffrey Hammerbacher.

From the description:

Cloudera Co-founder Jeff Hammerbacher speaks about data and innovation in the biology and medicine fields.

Interesting presentation, particularly on creating structures for innovation.

One of his insights I would summarize as “break early, rebuild fast.” His term for it was “lower batch size.” Try a new idea and, when it fails, try another.

I do wonder about his goal to “Lower the cost of data storage and processing to zero.”

It may get to be “too cheap to meter” but that isn’t the same thing as being zero. Somewhere in the infrastructure, someone is paying bills for storage and processing.

I mention that because some political parties think that infrastructure can exist without ongoing maintenance and care.

Failing infrastructures don’t lead to innovation.

SINAInnovation description:

SINAInnovations was a three-day conference at The Mount Sinai Medical Center that examined all aspects of innovation and therapeutic discovery within academic medical centers, from how it can be taught and fostered within academia, to how it can accelerate drug discovery and the commercialization of emerging biotechnologies.

November 26, 2012

Collaborative biocuration… [Pre-Topic Map Tasks]

Filed under: Authoring Topic Maps,Bioinformatics,Biomedical,Curation,Genomics,Searching — Patrick Durusau @ 9:22 am

Collaborative biocuration—text-mining development task for document prioritization for curation by Thomas C. Wiegers, Allan Peter Davis and Carolyn J. Mattingly. (Database (2012) 2012 : bas037 doi: 10.1093/database/bas037)


The Critical Assessment of Information Extraction systems in Biology (BioCreAtIvE) challenge evaluation is a community-wide effort for evaluating text mining and information extraction systems for the biological domain. The ‘BioCreative Workshop 2012’ subcommittee identified three areas, or tracks, that comprised independent, but complementary aspects of data curation in which they sought community input: literature triage (Track I); curation workflow (Track II) and text mining/natural language processing (NLP) systems (Track III). Track I participants were invited to develop tools or systems that would effectively triage and prioritize articles for curation and present results in a prototype web interface. Training and test datasets were derived from the Comparative Toxicogenomics Database (CTD; and consisted of manuscripts from which chemical–gene–disease data were manually curated. A total of seven groups participated in Track I. For the triage component, the effectiveness of participant systems was measured by aggregate gene, disease and chemical ‘named-entity recognition’ (NER) across articles; the effectiveness of ‘information retrieval’ (IR) was also measured based on ‘mean average precision’ (MAP). Top recall scores for gene, disease and chemical NER were 49, 65 and 82%, respectively; the top MAP score was 80%. Each participating group also developed a prototype web interface; these interfaces were evaluated based on functionality and ease-of-use by CTD’s biocuration project manager. In this article, we present a detailed description of the challenge and a summary of the results.

The results:

“Top recall scores for gene, disease and chemical NER were 49, 65 and 82%, respectively; the top MAP score was 80%.”

indicate there is plenty of room for improvement. Perhaps even commercially viable improvement.
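
For readers who have not worked with these measures, here is a minimal sketch of recall and mean average precision (MAP) in their textbook form. The abstract does not spell out the challenge's exact scoring procedure, so the function names, the sample identifiers and the assumption that each ranked list contains no duplicates are mine, not BioCreative's.

```python
from typing import Iterable, Sequence, Set, Tuple


def recall(predicted: Iterable[str], gold: Set[str]) -> float:
    """Fraction of the gold-standard entities that the system found."""
    if not gold:
        return 0.0
    return len(set(predicted) & gold) / len(gold)


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Average of precision@k over each rank k that holds a relevant item.

    Assumes `ranked` has no duplicate entries (an assumption of this sketch).
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant)


def mean_average_precision(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Mean of average_precision over (ranked list, relevant set) pairs."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical data, for illustration only.
ner_recall = recall(["TP53", "BRCA1"], {"TP53", "EGFR"})
map_score = mean_average_precision([
    (["a", "b", "c", "d"], {"a", "c"}),
    (["x", "y"], {"y"}),
])
```

A recall of 49% for gene NER means that just over half of the curated gene mentions were missed, which is the room for improvement I have in mind.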

In hindsight, not talking about how to make a topic map alongside ISO 13250 may have been a mistake. Even admitting there are multiple ways to get there, a technical report outlining one or two ways would have made the process more transparent.

Answering the question “What can you say with a topic map?” with “Anything you want.” was a truthful answer but not a helpful one.

I should try to crib something from one of those “how to write a research paper” guides. I haven’t looked at one in years, but the process is remarkably similar to what would result in a topic map.

Some of the mechanics are different but the underlying intellectual process is quite similar. Everyone who has been to college (at least of my age) had a course that talked about writing research papers. So it should be familiar terminology.


November 25, 2012

FluxMap: visual exploration of flux distributions in biological networks [Import/Graphs]

Filed under: Bioinformatics,Biomedical,Graphs,Networks,Visualization — Patrick Durusau @ 2:29 pm

FluxMap: visual exploration of flux distributions in biological networks.

From the webpage:

FluxMap is an easy to use tool for the advanced visualisation of simulated or measured flux data in biological networks. Flux data import is achieved via a structured template basing on intuitive reaction equations. Flux data is mapped onto any network and visualised using edge thickness. Various visualisation options and interaction possibilities enable comparison and visual analysis of complex experimental setups in an interactive way.

Manuals and tutorials here.

Another application that makes it easy to create graphs from data, this one importing spreadsheet-based data.

Wonder why some highly touted commercial graph databases don’t offer the same ease of use?
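
To make “visualised using edge thickness” concrete, here is a minimal sketch of one way to turn flux values into line widths. It is not FluxMap's actual algorithm, which the webpage does not describe. The linear scaling, the width limits, the function name and the sample network are all assumptions of mine.

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]


def edge_widths(
    fluxes: Dict[Edge, float],
    min_width: float = 1.0,
    max_width: float = 10.0,
) -> Dict[Edge, float]:
    """Scale absolute flux values linearly into [min_width, max_width].

    The sign of a flux (its direction) is ignored for thickness; a real
    tool would presumably show direction some other way, e.g. arrowheads.
    """
    largest = max((abs(v) for v in fluxes.values()), default=0.0)
    if largest == 0:
        # No flux anywhere: draw every edge at the minimum width.
        return {edge: min_width for edge in fluxes}
    span = max_width - min_width
    return {
        edge: min_width + span * abs(value) / largest
        for edge, value in fluxes.items()
    }


# Hypothetical three-reaction network, for illustration only.
widths = edge_widths({("A", "B"): 10.0, ("B", "C"): -5.0, ("C", "D"): 0.0})
```

The resulting dictionary could be handed to whatever graph drawing library you prefer as per-edge line widths.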

HIVE: Handy Integration and Visualisation of multimodal Experimental Data

Filed under: Bioinformatics,Biomedical,Graphs,Mapping,Merging,Visualization — Patrick Durusau @ 2:05 pm

HIVE: Handy Integration and Visualisation of multimodal Experimental Data

From the webpage:

HIVE is an Add-on for the VANTED system. VANTED is a graph editor extended for the visualisation and analysis of biological experimental data in context of pathways/networks. HIVE stands for

Handy Integration and Visualisation of multimodal Experimental Data

and extends the functionality of VANTED by adding the handling of volumes and images, together with a workspace approach, allowing one to integrate data of different biological data domains.

You need to see the demo video to appreciate this application!

It offers import of data, mapping rules to merge data from different data sets, easy visualization as a graph and other features.

Did I mention it has 3-D image techniques as well?

PS: Yes, it is another example of “Who moved my acronym?”
