Archive for the ‘Biomedical’ Category

A self-updating road map of The Cancer Genome Atlas

Friday, May 17th, 2013

A self-updating road map of The Cancer Genome Atlas by David E. Robbins, Alexander Grüneberg, Helena F. Deus, Murat M. Tanik and Jonas S. Almeida. (Bioinformatics (2013) 29 (10): 1333-1340. doi: 10.1093/bioinformatics/btt141)

Abstract:

Motivation: Since 2011, The Cancer Genome Atlas’ (TCGA) files have been accessible through HTTP from a public site, creating entirely new possibilities for cancer informatics by enhancing data discovery and retrieval. Significantly, these enhancements enable the reporting of analysis results that can be fully traced to and reproduced using their source data. However, to realize this possibility, a continually updated road map of files in the TCGA is required. Creation of such a road map represents a significant data modeling challenge, due to the size and fluidity of this resource: each of the 33 cancer types is instantiated in only partially overlapping sets of analytical platforms, while the number of data files available doubles approximately every 7 months.

Results: We developed an engine to index and annotate the TCGA files, relying exclusively on third-generation web technologies (Web 3.0). Specifically, this engine uses JavaScript in conjunction with the World Wide Web Consortium’s (W3C) Resource Description Framework (RDF), and SPARQL, the query language for RDF, to capture metadata of files in the TCGA open-access HTTP directory. The resulting index may be queried using SPARQL, and enables file-level provenance annotations as well as discovery of arbitrary subsets of files, based on their metadata, using web standard languages. In turn, these abilities enhance the reproducibility and distribution of novel results delivered as elements of a web-based computational ecosystem. The development of the TCGA Roadmap engine was found to provide specific clues about how biomedical big data initiatives should be exposed as public resources for exploratory analysis, data mining and reproducible research. These specific design elements align with the concept of knowledge reengineering and represent a sharp departure from top-down approaches in grid initiatives such as CaBIG. They also present a much more interoperable and reproducible alternative to the still pervasive use of data portals.

Availability: A prepared dashboard, including links to source code and a SPARQL endpoint, is available at http://bit.ly/TCGARoadmap. A video tutorial is available at http://bit.ly/TCGARoadmapTutorial.
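
To make "discovery of arbitrary subsets of files, based on their metadata" a bit more concrete, here is a minimal sketch in Python. It is not the TCGA Roadmap engine (which, per the abstract, uses JavaScript, RDF and SPARQL). It only mimics a SPARQL basic graph pattern over a plain list of triples. The file names, the predicates (`diseaseStudy`, `platform`) and their values are all hypothetical.

```python
# Toy index of (subject, predicate, object) triples. All names are hypothetical.
TRIPLES = [
    ("file:1", "diseaseStudy", "gbm"),
    ("file:1", "platform", "platform-a"),
    ("file:2", "diseaseStudy", "gbm"),
    ("file:2", "platform", "platform-b"),
    ("file:3", "diseaseStudy", "ov"),
    ("file:3", "platform", "platform-a"),
]


def match(pattern, triple, bindings):
    """Unify one triple pattern with one triple; variables start with '?'."""
    result = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if result.setdefault(term, value) != value:
                return None  # variable already bound to a different value
        elif term != value:
            return None  # constant term does not match
    return result


def query(patterns, triples=TRIPLES):
    """Return every set of variable bindings that satisfies all patterns."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [
            extended
            for bindings in solutions
            for triple in triples
            if (extended := match(pattern, triple, bindings)) is not None
        ]
    return solutions


# Roughly: SELECT ?f WHERE { ?f diseaseStudy "gbm" . ?f platform "platform-a" }
files = [
    solution["?f"]
    for solution in query([
        ("?f", "diseaseStudy", "gbm"),
        ("?f", "platform", "platform-a"),
    ])
]
print(files)
```

The point of the sketch is that the subset of files is described by its metadata, not by a path someone had to know in advance.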

Curious how the granularity of required semantics and the uniformity of the underlying data set impact the choice of semantic approaches?

And does access to data files present different challenges than, say, access to research publications in the same field?

Scientific Lenses over Linked Data… [Operational Equivalence]

Sunday, April 28th, 2013

Scientific Lenses over Linked Data: An approach to support task specific views of the data. A vision. by Christian Brenninkmeijer, Chris Evelo, Carole Goble, Alasdair J G Gray, Paul Groth, Steve Pettifer, Robert Stevens, Antony J Williams, and Egon L Willighagen.

Abstract:

Within complex scientific domains such as pharmacology, operational equivalence between two concepts is often context-, user- and task-specific. Existing Linked Data integration procedures and equivalence services do not take the context and task of the user into account. We present a vision for enabling users to control the notion of operational equivalence by applying scientific lenses over Linked Data. The scientific lenses vary the links that are activated between the datasets which affects the data returned to the user.

Two additional quotes from this paper should convince you of the importance of this work:

We aim to support users in controlling and varying their view of the data by applying a scientific lens which govern the notions of equivalence applied to the data. Users will be able to change their lens based on the task and role they are performing rather than having one fixed lens. To support this requirement, we propose an approach that applies context dependent sets of equality links. These links are stored in a stand-off fashion so that they are not intermingled with the datasets. This allows for multiple, context-dependent, linksets that can evolve without impact on the underlying datasets and support differing opinions on the relationships between data instances. This flexibility is in contrast to both Linked Data and traditional data integration approaches. We look at the role personae can play in guiding the nature of relationships between the data resources and the desired affects of applying scientific lenses over Linked Data.

and,

Within scientific datasets it is common to find links to the “equivalent” record in another dataset. However, there is no declaration of the form of the relationship. There is a great deal of variation in the notion of equivalence implied by the links both within a dataset’s usage and particularly across datasets, which degrades the quality of the data. The scientific user personae have very different needs about the notion of equivalence that should be applied between datasets. The users need a simple mechanism by which they can change the operational equivalence applied between datasets. We propose the use of scientific lenses.

Obvious questions:

Does your topic map software support multiple operational equivalences?

Does your topic map interface enable users to choose “lenses” (I like lenses better than roles) to view equivalence?

Does your topic map software support declaring the nature of equivalence?

I first saw this in the slide deck: Scientific Lenses: Supporting Alternative Views of the Data by Alasdair J G Gray at: 4th Open PHACTS Community Workshop.

BTW, the notion of equivalence being represented by “links” reminds me of a comment Peter Neubauer (Neo4j) once made to me, saying that equivalence could be modeled as edges. Imagine typing equivalence edges. Will have to think about that some more.
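
Thinking out loud, typed equivalence edges plus lenses might look something like the sketch below. This is my own illustration, not the Open PHACTS implementation: the record names, link types and the two lenses are hypothetical, and treating equivalence as transitive over the active edges is an assumption.

```python
from collections import defaultdict

# Stand-off linkset: (record_a, record_b, link_type), kept apart from the datasets.
# Record names and link types are hypothetical.
LINKS = [
    ("ds1:x", "ds2:x", "exact-structure"),
    ("ds2:x", "ds3:x-salt", "same-parent-compound"),
    ("ds3:x-salt", "ds4:x-brand", "same-drug-name"),
]

# A lens is simply the set of link types a user chooses to treat as equivalence.
LENSES = {
    "chemist": {"exact-structure"},
    "pharmacologist": {"exact-structure", "same-parent-compound", "same-drug-name"},
}


def equivalents(record, lens_name):
    """Records reachable from `record` over edges whose type the lens activates."""
    active = LENSES[lens_name]
    neighbours = defaultdict(set)
    for a, b, link_type in LINKS:
        if link_type in active:
            neighbours[a].add(b)
            neighbours[b].add(a)
    seen, stack = {record}, [record]
    while stack:
        for nxt in neighbours[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


print(sorted(equivalents("ds1:x", "chemist")))
print(sorted(equivalents("ds1:x", "pharmacologist")))
```

Same data, same links, different answers depending on the lens. Nothing in the datasets themselves has to change.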

4th Open PHACTS Community Workshop (slides) [Operational Equivalence]

Sunday, April 28th, 2013

4th Open PHACTS Community Workshop : Using the power of Open PHACTS

From the post:

The fourth Open PHACTS Community Workshop was held at Burlington House in London on April 22 and 23, 2013. The Workshop focussed on “Using the Power of Open PHACTS” and featured the public release of the Open PHACTS application programming interface (API) and the first Open PHACTS example app, ChemBioNavigator.

The first day featured talks describing the data accessible via the Open PHACTS Discovery Platform and technical aspects of the API. The use of the API by example applications ChemBioNavigator and PharmaTrek was outlined, and the results of the Accelrys Pipeline Pilot Hackathon discussed.

The second day involved discussion of Open PHACTS sustainability and plans for the successor organisation, the Open PHACTS Foundation. The afternoon was attended by those keen to further discuss the potential of the Open PHACTS API and the future of Open PHACTS.

During talks, especially those detailing the Open PHACTS API, a good number of signup requests to the API via dev.openphacts.org were received. The hashtag #opslaunch was used to follow reactions to the workshop on Twitter (see storify), and showed the response amongst attendees to be overwhelmingly positive.

This summary is followed by slides from the two days of presentations.

Not like being there but still quite useful.

As a matter of fact, I found a lead on “operational equivalence” with this data set. More to follow in a separate post.

Open PHACTS

Sunday, April 7th, 2013

Open PHACTS – Open Pharmacological Space

From the homepage:

Open PHACTS is building an Open Pharmacological Space in a 3-year knowledge management project of the Innovative Medicines Initiative (IMI), a unique partnership between the European Community and the European Federation of Pharmaceutical Industries and Associations (EFPIA).

The project is due to end in March 2014, and aims to deliver a sustainable service to continue after the project funding ends. The project consortium consists of leading academics in semantics, pharmacology and informatics, driven by solid industry business requirements: 28 partners, including 9 pharmaceutical companies and 3 biotechs.

Source code has just appeared on GitHub: OpenPHACTS.

Important to different communities for different reasons. My interest isn’t the same as BigPharma. ;-)

A project to watch as they navigate the thickets of vocabularies, ontologies and other semantically diverse information sources.

Visualizing Biological Data Using the SVGmap Browser

Thursday, April 4th, 2013

Visualizing Biological Data Using the SVGmap Browser by Casey Bergman.

From the post:

Early in 2012, Nuria Lopez-Bigas’ Biomedical Genomics Group published a paper in Bioinformatics describing a very interesting tool for visualizing biological data in a spatial context called SVGmap. The basic idea behind SVGMap is (like most good ideas) quite straightforward – to plot numerical data on a pre-defined image to give biological context to the data in an easy-to-interpret visual form.

To do this, SVGmap takes as input an image in Scalable Vector Graphics (SVG) format where elements of the image are tagged with an identifier, plus a table of numerical data with values assigned to the same identifier as in the elements of the image. SVGMap then integrates these files using either a graphical user interface that runs in standard web browser or a command line interface application that runs in your terminal, allowing the user to display color-coded numerical data on the original image. The overall framework of SVGMap is shown below in an image taken from a post on the Biomedical Genomics Group blog.

[Image: overall framework of SVGMap]

We’ve been using SVGMap over the last year to visualize tissue-specific gene expression data in Drosophila melanogaster from the FlyAtlas project, which comes as one of the pre-configured “experiments” in the SVGMap web application.

More recently, we’ve been also using the source distribution of SVGMap to display information about the insertion preferences of transposable elements in a tissue-specific context, which as required installing and configuring a local instance of SVGMap and run it via the browser. The documentation for SVGMap is good enough to do this on your own, but it took a while for us to get a working instance the first time around. We ran into the same issues again the second time, so I thought I write up my notes for future reference and to help others get SVGMap up and running as fast as possible.
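
The basic idea described in the post (values keyed to the identifiers of image elements, rendered as colors) is easy to sketch. What follows is not SVGMap's code or its file formats, just a minimal Python illustration using the standard library. The two-rectangle "drawing", the values and the white-to-red scale are all made up.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)  # keep serialized output free of ns0: prefixes

# A made-up two-region drawing; real input would be an annotated SVG file.
SVG = (
    f'<svg xmlns="{SVG_NS}">'
    '<rect id="head" width="10" height="10"/>'
    '<rect id="wing" width="10" height="10"/>'
    '</svg>'
)

# Made-up values keyed by the same identifiers used in the image.
VALUES = {"head": 2.0, "wing": 8.0}


def colour(value, low, high):
    """Map a value onto a white-to-red scale and return a hex colour."""
    t = 0.0 if high == low else (value - low) / (high - low)
    other = round(255 * (1 - t))
    return f"#ff{other:02x}{other:02x}"


def paint(svg_text, values):
    """Set the fill of every element whose id appears in `values`."""
    root = ET.fromstring(svg_text)
    low, high = min(values.values()), max(values.values())
    for element in root.iter():
        key = element.get("id")
        if key in values:
            element.set("fill", colour(values[key], low, high))
    return root


painted = paint(SVG, VALUES)
fills = {e.get("id"): e.get("fill") for e in painted.iter() if e.get("id")}
print(fills)
```

The identifier shared by the image element and the data row is doing all the work here.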

Topic map interfaces aren’t required to take a particular form.

A drawing of a fly could be a topic map interface.

Useful for people studying flies, less useful (maybe) if you are mapping Lady Gaga’s discography.

What interface do you want to create for a topic map?

MetaNetX.org…

Saturday, March 16th, 2013

MetaNetX.org: a website and repository for accessing, analysing and manipulating metabolic networks by Mathias Ganter, Thomas Bernard, Sébastien Moretti, Joerg Stelling and Marco Pagni. (Bioinformatics (2013) 29 (6): 815-816. doi: 10.1093/bioinformatics/btt036)

Abstract:

MetaNetX.org is a website for accessing, analysing and manipulating genome-scale metabolic networks (GSMs) as well as biochemical pathways. It consistently integrates data from various public resources and makes the data accessible in a standardized format using a common namespace. Currently, it provides access to hundreds of GSMs and pathways that can be interactively compared (two or more), analysed (e.g. detection of dead-end metabolites and reactions, flux balance analysis or simulation of reaction and gene knockouts), manipulated and exported. Users can also upload their own metabolic models, choose to automatically map them into the common namespace and subsequently make use of the website’s functionality.

http://metanetx.org.

The authors are addressing a familiar problem:

Genome-scale metabolic networks (GSMs) consist of compartmentalized reactions that consistently combine biochemical, genetic and genomic information. When also considering a biomass reaction and both uptake and secretion reactions, GSMs are often used to study genotype–phenotype relationships, to direct new discoveries and to identify targets in metabolic engineering (Karr et al., 2012). However, a major difficulty in GSM comparisons and reconstructions is to integrate data from different resources with different nomenclatures and conventions for both metabolites and reactions. Hence, GSM consolidation and comparison may be impossible without detailed biological knowledge and programming skills. (emphasis added)

For which they propose an uncommon solution:

MetaNetX.org is implemented as a user-friendly and self-explanatory website that handles all user requests dynamically (Fig. 1a). It allows a user to access a collection of hundreds of published models, browse and select subsets for comparison and analysis, upload or modify new models and export models in conjunction with their results. Its functionality is based on a common namespace defined by MNXref (Bernard et al., 2012). In particular, all repository or user uploaded models are automatically translated with or without compartments into the common namespace; small deviations from the original model are possible due to the automatic reconciliation steps implemented by Bernard et al. (2012). However, a user can choose not to translate his model but still make use of the website’s functionalities. Furthermore, it is possible to augment the given reaction set by user-defined reactions, for example, for model augmentation.
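
A rough sketch of what "translated into the common namespace" buys you when comparing models. This is not MNXref or MetaNetX code. The cross-reference table, the source names (`db_a`, `db_b`) and every identifier below are invented for illustration.

```python
# Hypothetical cross-reference table: (source, local id) -> common id.
XREF = {
    ("db_a", "a_glucose"): "COMMON:1",
    ("db_b", "b_0042"): "COMMON:1",
    ("db_a", "a_atp"): "COMMON:2",
    ("db_b", "b_0007"): "COMMON:2",
}


def translate(model, source):
    """Rewrite a model's metabolite ids into the common namespace.

    Ids with no known mapping are kept, prefixed with their source,
    so nothing is silently dropped.
    """
    return {XREF.get((source, m), f"{source}:{m}") for m in model}


model_a = translate({"a_glucose", "a_atp", "a_unmapped"}, "db_a")
model_b = translate({"b_0042", "b_0007"}, "db_b")

shared = model_a & model_b   # comparable only after translation
only_a = model_a - model_b   # includes anything that could not be mapped
print(sorted(shared), sorted(only_a))
```

Keeping the unmapped identifiers visible, rather than discarding them, is a design choice of this sketch, in the spirit of letting users keep their own names.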

The bioinformatics community recognizes the intellectual poverty of lock step models.

Wonder when the intelligence community is going to have that “a ha” moment?

The Annotation-enriched non-redundant patent sequence databases [Curation vs. Search]

Monday, March 11th, 2013

The Annotation-enriched non-redundant patent sequence databases by Weizhong Li, Bartosz Kondratowicz, Hamish McWilliam, Stephane Nauche and Rodrigo Lopez.

Not a real promising title, is it? ;-) The reason I cite it here is that by curation, the database is “non-redundant.”

Try searching for some of these sequences at the USPTO and compare the results.

The power of curation will be immediately obvious.

Abstract:

The EMBL-European Bioinformatics Institute (EMBL-EBI) offers public access to patent sequence data, providing a valuable service to the intellectual property and scientific communities. The non-redundant (NR) patent sequence databases comprise two-level nucleotide and protein sequence clusters (NRNL1, NRNL2, NRPL1 and NRPL2) based on sequence identity (level-1) and patent family (level-2). Annotation from the source entries in these databases is merged and enhanced with additional information from the patent literature and biological context. Corrections in patent publication numbers, kind-codes and patent equivalents significantly improve the data quality. Data are available through various user interfaces including web browser, downloads via FTP, SRS, Dbfetch and EBI-Search. Sequence similarity/homology searches against the databases are available using BLAST, FASTA and PSI-Search. In this article, we describe the data collection and annotation and also outline major changes and improvements introduced since 2009. Apart from data growth, these changes include additional annotation for singleton clusters, the identifier versioning for tracking entry change and the entry mappings between the two-level databases.

Database URL: http://www.ebi.ac.uk/patentdata/nr/
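
For a feel of what two-level clustering does to redundancy, here is a small Python sketch. The abstract does not spell out how the two levels nest, so this assumes level-1 groups entries with identical sequences and level-2 further splits each of those groups by patent family. The entries, sequences and family names are made up, and this is not EMBL-EBI's pipeline.

```python
from collections import defaultdict

# Made-up entries: (entry id, sequence, patent family).
ENTRIES = [
    ("E1", "ACGT", "fam1"),
    ("E2", "ACGT", "fam1"),
    ("E3", "ACGT", "fam2"),
    ("E4", "GGCC", "fam2"),
]


def cluster(entries, key):
    """Group entry ids by whatever `key` extracts from each entry."""
    groups = defaultdict(list)
    for entry in entries:
        groups[key(entry)].append(entry[0])
    return dict(groups)


# Assumed level-1: identical sequence.
level1 = cluster(ENTRIES, key=lambda e: e[1])
# Assumed level-2: identical sequence within the same patent family.
level2 = cluster(ENTRIES, key=lambda e: (e[1], e[2]))

print(level1)
print(level2)
```

Four raw entries become two or three records, depending on the level, and each record still knows which entries it stands for.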

Topic maps are curated data. Which one do you prefer?

NetGestalt for Data Visualization in the Context of Pathways

Thursday, February 21st, 2013

NetGestalt for Data Visualization in the Context of Pathways by Stephen Turner.

From the post:

Many of you may be familiar with WebGestalt, a wonderful web utility developed by Bing Zhang at Vanderbilt for doing basic gene-set enrichment analyses. Last year, we invited Bing to speak at our annual retreat for the Vanderbilt Graduate Program in Human Genetics, and he did not disappoint! Bing walked us through his new tool called NetGestalt.

NetGestalt provides users with the ability to overlay large-scale experimental data onto biological networks. Data are loaded using continuous and binary tracks that can contain either single or multiple lines of data (called composite tracks). Continuous tracks could be gene expression intensities from microarray data or any other quantitative measure that can be mapped to the genome. Binary tracks are usually insertion/deletion regions, or called regions like ChIP peaks. NetGestalt extends many of the features of WebGestalt, including enrichment analysis for modules within a biological network, and provides easy ways to visualize the overlay of multiple tracks with Venn diagrams.

Stephen also points to documentation and video tutorials.

NetGestalt uses gene symbol as the gene identifier. Data that uses other gene identifiers must be mapped to gene symbols before uploading. (Manual, page 4)

An impressive alignment of data sources even with the restriction to gene symbols.

New Query Tool Searches EHR Unstructured Data

Friday, February 15th, 2013

New Query Tool Searches EHR Unstructured Data by Ken Terry.

From the post:

A new electronic health record “intelligence platform” developed at Massachusetts General Hospital (MGH) and its parent organization, Partners Healthcare, is being touted as a solution to the problem of searching structured and unstructured data in EHRs for clinically useful information.

QPID Inc., a new firm spun off from Partners and backed by venture capital funds, is now selling its Web-based search engine to other healthcare organizations. Known as the Queriable Patient Inference Dossier (QPID), the tool is designed to allow clinicians to make ad hoc queries about particular patients and receive the desired information within seconds.

Today, 80% of stored health information is believed to be unstructured. It is trapped in free text such as physician notes and reports, discharge summaries, scanned documents and e-mail messages. One reason for the prevalence of unstructured data is that the standard methods for entering structured data, such as drop-down menus and check boxes, don’t fit into traditional physician workflow. Many doctors still dictate their notes, and the transcription goes into the EHR as free text.

and,

QPID, which was first used in the radiology department of MGH in 2005, incorporates an EHR search engine, a library of search queries based on clinical concepts, and a programming system for application and query development. When a clinician submits a query, QPID presents the desired data in a “dashboard” format that includes abnormal results, contraindications and other alerts, Doyle said.

The core of the system is a form of natural language processing (NLP) based on a library encompassing “thousands and thousands” of clinical concepts, he said. Because it was developed collaboratively by physicians and scientists, QPID identifies medical concepts imbedded in unstructured data more effectively than do other NLP systems from IBM, Nuance and M*Modal, Doyle maintained.

Take away points for data search/integration solutions:

  1. 80% of stored health information (need)
  2. traditional methods for data entry….don’t fit into traditional physician workflow (user requirement)
  3. developed collaboratively by physicians and scientists (semantics originate with users, not top down)

I am interested in how QPID conforms (or not) to local medical terminology practices.

To duplicate their earlier success, conforming to local terminology practices is critical.

If for no other reason, it will give physicians and other health professionals “ownership” of the vocabulary and hence faith in the system.

Using molecular networks to assess molecular similarity

Friday, February 15th, 2013

Systems chemistry: Using molecular networks to assess molecular similarity by Bailey Fallon.

From the post:

In new research published in Journal of Systems Chemistry, Sijbren Otto and colleagues have provided the first experimental approach towards molecular networks that can predict bioactivity based on an assessment of molecular similarity.

Molecular similarity is an important concept in drug discovery. Molecules that share certain features such as shape, structure or hydrogen bond donor/acceptor groups may have similar properties that make them common to a particular target. Assessment of molecular similarity has so far relied almost exclusively on computational approaches, but Dr Otto reasoned that a measure of similarity might be obtained by interrogating the molecules in solution experimentally.

Important work for drug discovery but there are semantic lessons here as well:

Tests for similarity/sameness are domain specific.

Which means there are no universal tests for similarity/sameness.

Lacking universal tests for similarity/sameness, we should focus on developing documented and domain specific tests for similarity/sameness.

Domain specific tests provide quicker ROI than less useful and doomed universal solutions.

Documented domain specific tests may, no guarantees, enable us to find commonalities between domain measures of similarity/sameness.

But our conclusions will be based on domain experience and not projection from our domain onto others, less well known domains.

A Tale of Five Languages

Monday, February 11th, 2013

Evaluating standard terminologies for encoding allergy information by Foster R Goss, Li Zhou, Joseph M Plasek, Carol Broverman, George Robinson, Blackford Middleton, Roberto A Rocha. (J Am Med Inform Assoc doi:10.1136/amiajnl-2012-000816)

Abstract:

Objective Allergy documentation and exchange are vital to ensuring patient safety. This study aims to analyze and compare various existing standard terminologies for representing allergy information.

Methods Five terminologies were identified, including the Systemized Nomenclature of Medical Clinical Terms (SNOMED CT), National Drug File–Reference Terminology (NDF-RT), Medication Dictionary for Regulatory Activities (MedDRA), Unique Ingredient Identifier (UNII), and RxNorm. A qualitative analysis was conducted to compare desirable characteristics of each terminology, including content coverage, concept orientation, formal definitions, multiple granularities, vocabulary structure, subset capability, and maintainability. A quantitative analysis was also performed to compare the content coverage of each terminology for (1) common food, drug, and environmental allergens and (2) descriptive concepts for common drug allergies, adverse reactions (AR), and no known allergies.

Results Our qualitative results show that SNOMED CT fulfilled the greatest number of desirable characteristics, followed by NDF-RT, RxNorm, UNII, and MedDRA. Our quantitative results demonstrate that RxNorm had the highest concept coverage for representing drug allergens, followed by UNII, SNOMED CT, NDF-RT, and MedDRA. For food and environmental allergens, UNII demonstrated the highest concept coverage, followed by SNOMED CT. For representing descriptive allergy concepts and adverse reactions, SNOMED CT and NDF-RT showed the highest coverage. Only SNOMED CT was capable of representing unique concepts for encoding no known allergies.

Conclusions The proper terminology for encoding a patient’s allergy is complex, as multiple elements need to be captured to form a fully structured clinical finding. Our results suggest that while gaps still exist, a combination of SNOMED CT and RxNorm can satisfy most criteria for encoding common allergies and provide sufficient content coverage.

Interesting article but some things that may not be apparent to the casual reader:

MedDRA:

The Medical Dictionary for Regulatory Activities (MedDRA) was developed by the International Conference on Harmonisation (ICH) and is owned by the International Federation of Pharmaceutical Manufacturers and Associations (IFPMA) acting as trustee for the ICH steering committee. The Maintenance and Support Services Organization (MSSO) serves as the repository, maintainer, and distributor of MedDRA as well as the source for the most up-to-date information regarding MedDRA and its application within the biopharmaceutical industry and regulators. (source: http://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/MDR/index.html)

MedDRA has a metathesaurus with translations into: Czech, Dutch, French, German, Hungarian, Italian, Japanese, Portuguese, and Spanish.

Unique Ingredient Identifier (UNII)

The overall purpose of the joint FDA/USP Substance Registration System (SRS) is to support health information technology initiatives by generating unique ingredient identifiers (UNIIs) for substances in drugs, biologics, foods, and devices. The UNII is a non-proprietary, free, unique, unambiguous, non semantic, alphanumeric identifier based on a substance’s molecular structure and/or descriptive information.

The UNII may be found in:

  • NLM’s Unified Medical Language System (UMLS)
  • National Cancer Institute’s Enterprise Vocabulary Service
  • USP Dictionary of USAN and International Drug Names (future)
  • FDA Data Standards Council website
  • VA National Drug File Reference Terminology (NDF-RT)
  • FDA Inactive Ingredient Query Application

(source: http://www.fda.gov/ForIndustry/DataStandards/SubstanceRegistrationSystem-UniqueIngredientIdentifierUNII/)

National Drug File – Reference Terminology (NDF-RT)

The National Drug File – Reference Terminology (NDF-RT) is produced by the U.S. Department of Veterans Affairs, Veterans Health Administration (VHA).

NDF-RT combines the NDF hierarchical drug classification with a multi-category reference model. The categories are:

  1. Cellular or Molecular Interactions [MoA]
  2. Chemical Ingredients [Chemical/Ingredient]
  3. Clinical Kinetics [PK]
  4. Diseases, Manifestations or Physiologic States [Disease/Finding]
  5. Dose Forms [Dose Form]
  6. Pharmaceutical Preparations
  7. Physiological Effects [PE]
  8. Therapeutic Categories [TC]
  9. VA Drug Interactions [VA Drug Interaction]

(source: http://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/NDFRT/)

MedDRA, UNII, and NDF-RT have been in use for years, MedDRA internationally in multiple languages. An uncounted number of medical records, histories and no doubt publications rely upon these vocabularies.

Assume the conclusion: SNOMED CT with RxNorm (links between drug vocabularies) provide the best coverage for “encoding common allergies.”

A critical question remains:

How to access medical records using other terminologies?

Recall from the adventures of owl:sameAs (The Semantic Web Is Failing — But Why? (Part 5)) that any single string identifier is subject to multiple interpretations, interpretations that can only be disambiguated by additional information.

You might present a search engine with string to string mappings but those are inherently less robust and harder to maintain than richer mappings.

The sort of richer mappings that are supported by topic maps.
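
A small Python sketch of the difference, under loud assumptions: the term codes, terminology names and properties are placeholders, not entries from SNOMED CT, RxNorm or any other real vocabulary. The abbreviation stands in for any string that means different things in different contexts.

```python
# Plain string-to-string mapping: one local term, one target, no context.
STRING_MAP = {"ASA": "term-001"}

# Richer mapping: each candidate carries the properties needed to tell it apart.
# Terminology names, codes and properties are placeholders.
RICH_MAP = {
    "ASA": [
        {"target": "term-001", "terminology": "drug-vocab", "kind": "ingredient"},
        {"target": "term-777", "terminology": "finding-vocab", "kind": "classification"},
    ]
}


def resolve(term, **context):
    """Return the targets for `term` whose properties agree with the context."""
    return [
        candidate["target"]
        for candidate in RICH_MAP.get(term, [])
        if all(candidate.get(k) == v for k, v in context.items())
    ]


print(STRING_MAP["ASA"])                  # always the same answer, right or wrong
print(resolve("ASA"))                     # both readings, ambiguity is visible
print(resolve("ASA", kind="ingredient"))  # additional information disambiguates
```

The string map cannot even tell you that it might be wrong. The richer mapping can, and it records why one target was chosen over another.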

‘What’s in the NIDDK CDR?’…

Saturday, February 9th, 2013

‘What’s in the NIDDK CDR?’—public query tools for the NIDDK central data repository by Nauqin Pan, et al., (Database (2013) 2013 : bas058 doi: 10.1093/database/bas058)

Abstract:

The National Institute of Diabetes and Digestive Disease (NIDDK) Central Data Repository (CDR) is a web-enabled resource available to researchers and the general public. The CDR warehouses clinical data and study documentation from NIDDK funded research, including such landmark studies as The Diabetes Control and Complications Trial (DCCT, 1983–93) and the Epidemiology of Diabetes Interventions and Complications (EDIC, 1994–present) follow-up study which has been ongoing for more than 20 years. The CDR also houses data from over 7 million biospecimens representing 2 million subjects. To help users explore the vast amount of data stored in the NIDDK CDR, we developed a suite of search mechanisms called the public query tools (PQTs). Five individual tools are available to search data from multiple perspectives: study search, basic search, ontology search, variable summary and sample by condition. PQT enables users to search for information across studies. Users can search for data such as number of subjects, types of biospecimens and disease outcome variables without prior knowledge of the individual studies. This suite of tools will increase the use and maximize the value of the NIDDK data and biospecimen repositories as important resources for the research community.

Database URL: https://www.niddkrepository.org/niddk/home.do

I would like to tell you more about this research, since “[t]he National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) is part of the National Institutes of Health (NIH) and the U.S. Department of Health and Human Services” (that’s a direct quote) and so doesn’t claim copyright on its publications.

Unfortunately, the NIDDK published this paper in the Oxford journal Database, which does believe in restricting access to publicly funded research.

Do visit the search interface to see what you think about it.

Not quite the same as curated content but an improvement over raw string matching.

ToxPi GUI [Data Recycling]

Sunday, February 3rd, 2013

ToxPi GUI: an interactive visualization tool for transparent integration of data from diverse sources of evidence by David M. Reif, Myroslav Sypa, Eric F. Lock, Fred A. Wright, Ander Wilson, Tommy Cathey, Richard R. Judson and Ivan Rusyn. (Bioinformatics (2013) 29 (3): 402-403. doi: 10.1093/bioinformatics/bts686)

Abstract:

Motivation: Scientists and regulators are often faced with complex decisions, where use of scarce resources must be prioritized using collections of diverse information. The Toxicological Prioritization Index (ToxPi™) was developed to enable integration of multiple sources of evidence on exposure and/or safety, transformed into transparent visual rankings to facilitate decision making. The rankings and associated graphical profiles can be used to prioritize resources in various decision contexts, such as testing chemical toxicity or assessing similarity of predicted compound bioactivity profiles. The amount and types of information available to decision makers are increasing exponentially, while the complex decisions must rely on specialized domain knowledge across multiple criteria of varying importance. Thus, the ToxPi bridges a gap, combining rigorous aggregation of evidence with ease of communication to stakeholders.

Results: An interactive ToxPi graphical user interface (GUI) application has been implemented to allow straightforward decision support across a variety of decision-making contexts in environmental health. The GUI allows users to easily import and recombine data, then analyze, visualize, highlight, export and communicate ToxPi results. It also provides a statistical metric of stability for both individual ToxPi scores and relative prioritized ranks.

Availability: The ToxPi GUI application, complete user manual and example data files are freely available from http://comptox.unc.edu/toxpi.php.

Contact: reif.david@gmail.com

Very cool!

Although, like being able to have a Ford automobile in any color so long as the color is black, you can integrate any data source so long as the format is csv and the values are numbers. Other restrictions apply as well.

That’s an observation, not a criticism.

The application serves a purpose within a domain and does not “integrate” information in the sense of a topic map.

But a topic map could recycle its data to add other identifications and properties. Without having to re-write this application or its data.
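
By "recycle" I mean something like the following Python sketch: the CSV stays exactly as it is, and additional identifications live in a separate, stand-off structure keyed by the CSV's own identifier. The column names, values and extra properties are made up, and this is not a real ToxPi data file.

```python
import csv
import io

# Stand-in for an existing CSV of numeric values; contents are made up.
CSV_TEXT = "chemical,assay_a,assay_b\nchem-1,0.2,0.9\nchem-2,0.7,0.1\n"

# Stand-off additions keyed by the CSV's own identifier; also made up.
EXTRA = {
    "chem-1": {"also_known_as": ["compound one"], "registry_id": "R-0001"},
}


def recycled(csv_text, extra):
    """Yield each CSV row merged with any extra properties; the CSV is untouched."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        merged = {k: (v if k == "chemical" else float(v)) for k, v in row.items()}
        merged.update(extra.get(row["chemical"], {}))
        yield merged


rows = list(recycled(CSV_TEXT, EXTRA))
for row in rows:
    print(row)
```

The application that wrote the CSV never has to know the extra identifications exist.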

Once curated, data should be re-used, not re-created/curated.

Topic maps give you more bang for your data buck.

PoSSuM

Monday, January 28th, 2013

PoSSuM : Pocket Similarity Searching using Multi-Sketches

From the webpage:

Today, vast amounts of protein-small molecule binding sites can be found in the Protein Data Bank (PDB). Exhaustive comparison of them is computationally demanding, but useful in the prediction of protein functions and drug discovery. We proposed a tremendously fast algorithm called “SketchSort” that enables the enumeration of similar pairs in a huge number of protein-ligand binding sites. We conducted all-pair similarity searches for 3.4 million known and potential binding sites using the proposed method and discovered over 24 million similar pairs of binding sites. We present the results as a relational database Pocket Similarity Search using Multiple-Sketches (PoSSuM), which includes all the discovered pairs with annotations of various types (e.g., CATH, SCOP, EC number, Gene ontology). PoSSuM enables rapid exploration of similar binding sites among structures with different global folds as well as similar ones. Moreover, PoSSuM is useful for predicting the binding ligand for unbound structures. Basically, the users can search similar binding pockets using two search modes:

i) “Search K” is useful for finding similar binding sites for a known ligand-binding site. Post a known ligand-binding site (a pair of “PDB ID” and “HET code”) in the PDB, and PoSSuM will search similar sites for the query site.

ii) “Search P” is useful for predicting ligands that potentially bind to a structure of interest. Post a known protein structure (PDB ID) in the PDB, and PoSSuM will search similar known-ligand binding sites for the query structure.

Obviously useful for the bioinformatics crowd but relevant for topic maps as well.

In topic map terminology, the searches are for associations with a known role player in a particular role, leaving the other role player unspecified.

It does not define or seek an exact match but provides the user with data that may help them make a match determination.
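That style of query can be sketched in a few lines of Python. This is only an illustration, assuming a hypothetical in-memory list of associations; the association type, role names and site identifiers below are invented for the example and are not drawn from PoSSuM or its schema.

```python
from typing import NamedTuple


class Association(NamedTuple):
    """An association whose roles are each filled by one player."""
    assoc_type: str
    roles: dict  # role name -> player identifier


# Hypothetical sample data; the identifiers are illustrative only.
ASSOCIATIONS = [
    Association("similar-site", {"query": "1ABC:ATP", "candidate": "2XYZ:ADP"}),
    Association("similar-site", {"query": "1ABC:ATP", "candidate": "3DEF:GTP"}),
    Association("similar-site", {"query": "4GHI:NAD", "candidate": "2XYZ:ADP"}),
]


def other_players(associations, assoc_type, known_role, known_player, open_role):
    """Return the players of open_role wherever known_player fills known_role."""
    return [
        a.roles[open_role]
        for a in associations
        if a.assoc_type == assoc_type
        and a.roles.get(known_role) == known_player
        and open_role in a.roles
    ]


# One role player is fixed, the other is left unspecified.
matches = other_players(ASSOCIATIONS, "similar-site", "query", "1ABC:ATP", "candidate")
```

The result is a list of candidates for a human to assess, not a declared match.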

…Everything You Always Wanted to Know About Genes

Monday, January 28th, 2013

Toward a New Model of the Cell: Everything You Always Wanted to Know About Genes

From the post:

Turning vast amounts of genomic data into meaningful information about the cell is the great challenge of bioinformatics, with major implications for human biology and medicine. Researchers at the University of California, San Diego School of Medicine and colleagues have proposed a new method that creates a computational model of the cell from large networks of gene and protein interactions, discovering how genes and proteins connect to form higher-level cellular machinery.

“Our method creates ontology, or a specification of all the major players in the cell and the relationships between them,” said first author Janusz Dutkowski, PhD, postdoctoral researcher in the UC San Diego Department of Medicine. It uses knowledge about how genes and proteins interact with each other and automatically organizes this information to form a comprehensive catalog of gene functions, cellular components, and processes.

“What’s new about our ontology is that it is created automatically from large datasets. In this way, we see not only what is already known, but also potentially new biological components and processes — the bases for new hypotheses,” said Dutkowski.

Originally devised by philosophers attempting to explain the nature of existence, ontologies are now broadly used to encapsulate everything known about a subject in a hierarchy of terms and relationships. Intelligent information systems, such as iPhone’s Siri, are built on ontologies to enable reasoning about the real world. Ontologies are also used by scientists to structure knowledge about subjects like taxonomy, anatomy and development, bioactive compounds, disease and clinical diagnosis.

A Gene Ontology (GO) exists as well, constructed over the last decade through a joint effort of hundreds of scientists. It is considered the gold standard for understanding cell structure and gene function, containing 34,765 terms and 64,635 hierarchical relations annotating genes from more than 80 species.

“GO is very influential in biology and bioinformatics, but it is also incomplete and hard to update based on new data,” said senior author Trey Ideker, PhD, chief of the Division of Genetics in the School of Medicine and professor of bioengineering in UC San Diego’s Jacobs School of Engineering.

The conclusion to A gene ontology inferred from molecular networks (Janusz Dutkowski, Michael Kramer, Michal A Surma, Rama Balakrishnan, J Michael Cherry, Nevan J Krogan & Trey Ideker, Nature Biotechnology 31, 38–45 (2013) doi:10.1038/nbt.2463), illustrates a difference between ontology in the GO sense and that produced by the authors:

The research reported in this manuscript raises the possibility that, given the appropriate tools, ontologies might evolve over time with the addition of each new network map or high-throughput experiment that is published. More importantly, it enables a philosophical shift in bioinformatic analysis, from a regime in which the ontology is viewed as gold standard to one in which it is the major result. (emphasis added)

Ontology as representing reality as opposed to declaring it.

That is a novel concept.

BioNLP-ST 2013

Tuesday, January 22nd, 2013

BioNLP-ST 2013

Dates:

Training Data Release 12:00 IDLW, 17 Jan. 2013
Test Data Release 22 Mar. 2013
Result Submission 29 Mar. 2013
BioNLP’11 Workshop 8-9 Aug. 2013

From the website:

The BioNLP Shared Task (BioNLP-ST) series represents a community-wide trend in text-mining for biology toward fine-grained information extraction (IE). The two previous events, BioNLP-ST 2009 and 2011, attracted wide attention, with over 30 teams submitting final results. The tasks and their data have since served as the basis of numerous studies, released event extraction systems, and published datasets. The upcoming BioNLP-ST 2013 follows the general outline and goals of the previous tasks. It identifies biologically relevant extraction targets and proposes a linguistically motivated approach to event representation. The tasks in BioNLP-ST 2013 cover many new hot topics in biology that are close to biologists’ needs. BioNLP-ST 2013 broadens the scope of the text-mining application domains in biology by introducing new issues on cancer genetics and pathway curation. It also builds on the well-known previous datasets GENIA, LLL/BI and BB to propose more realistic tasks than considered previously, closer to the actual needs of biological data integration.

The first event in 2009 triggered active research in the community on a specific fine-grained IE task. Expanding on this, the second BioNLP-ST was organized under the theme “Generalization”, which was well received by participants, who introduced numerous systems that could be straightforwardly applied to multiple tasks. This time, the BioNLP-ST takes a step further and pursues the grand theme of “Knowledge base construction”, which is addressed in various ways: semantic web (GE, GRO), pathways (PC), molecular mechanisms of cancer (CG), regulation networks (GRN) and ontology population (GRO, BB).

As in previous events, manually annotated data will be provided for training, development and evaluation of information extraction methods. According to their relevance for biological studies, the annotations are either bound to specific expressions in the text or represented as structured knowledge. Many tools for the detailed evaluation and graphical visualization of annotations and system outputs will be available for participants. Support in performing linguistic processing will be provided to the participants in the form of analyses created by various state-of-the art tools on the dataset texts.

Participation to the task will be open to the academia, industry, and all other interested parties.

Tasks:

Quick question: Do you think there is semantically diverse data available for each of these tasks?

I first saw this at: BioNLP Shared Task: Text Mining for Biology Competition.

Concept Maps – Pharmaceuticals

Monday, January 21st, 2013

Designing concept maps for a precise and objective description of pharmaceutical innovations by Maia Iordatii, Alain Venot and Catherine Duclos. (BMC Medical Informatics and Decision Making 2013, 13:10 doi:10.1186/1472-6947-13-10)

Abstract:

Background

When a new drug is launched onto the market, information about the new manufactured product is contained in its monograph and evaluation report published by national drug agencies. Health professionals need to be able to determine rapidly and easily whether the new manufactured product is potentially useful for their practice. There is therefore a need to identify the best way to group together and visualize the main items of information describing the nature and potential impact of the new drug. The objective of this study was to identify these items of information and to bring them together in a model that could serve as the standard for presenting the main features of new manufactured product.

Methods

We developed a preliminary conceptual model of pharmaceutical innovations, based on the knowledge of the authors. We then refined this model, using a random sample of 40 new manufactured drugs recently approved by the national drug regulatory authorities in France and covering a broad spectrum of innovations and therapeutic areas. Finally, we used another sample of 20 new manufactured drugs to determine whether the model was sufficiently comprehensive.

Results

The results of our modeling led to three sub models described as conceptual maps representing: i) the medical context for use of the new drug (indications, type of effect, therapeutical arsenal for the same indications), ii) the nature of the novelty of the new drug (new molecule, new mechanism of action, new combination, new dosage, etc.), and iii) the impact of the drug in terms of efficacy, safety and ease of use, compared with other drugs with the same indications.

Conclusions

Our model can help to standardize information about new drugs released onto the market. It is potentially useful to the pharmaceutical industry, medical journals, editors of drug databases and medical software, and national or international drug regulation agencies, as a means of describing the main properties of new pharmaceutical products. It could also be used as a guide for the writing of comprehensive and objective texts summarizing the nature and interest of new manufactured product. (emphasis added)

We all design categories starting with what we know, as pointed out under methods above.

And any three authors could undertake such a quest, with equally valid results but different terminology and perhaps even a different arrangement of concepts.

The problem isn’t the undertaking, which is useful.

The problem is the lack of a binding between such undertakings, one that would enable users to migrate between such maps as they develop over time.

A problem that topic maps offer an infrastructure to solve.

The Pacific Symposium on Biocomputing 2013 [Proceedings]

Saturday, January 19th, 2013

The Pacific Symposium on Biocomputing 2013 by Will Bush.

From the post:

For 18 years now, computational biologists have convened on the beautiful islands of Hawaii to present and discuss research emerging from new areas of biomedicine. PSB Conference Chairs Teri Klein (@teriklein), Keith Dunker, Russ Altman (@Rbaltman) and Larry Hunter (@ProfLHunter) organize innovative sessions and tutorials that are always interactive and thought-provoking. This year, sessions included Computational Drug Repositioning, Epigenomics, Aberrant Pathway and Network Activity, Personalized Medicine, Phylogenomics and Population Genomics, Post-Next Generation Sequencing, and Text and Data Mining. The Proceedings are available online here, and a few of the highlights are:

See Will’s post for the highlights. Or browse the proceedings. You are almost certainly going to find something relevant to you.

Do note Will’s use of Twitter IDs as identifiers. Unique, persistent (I assume Twitter doesn’t re-assign them), easy to access.

It wasn’t clear from Will’s post if the following image was from Biocomputing 2013 or if he stopped by a markup conference. Hard to tell. ;-)

[Image: Biocomputing 2013]

PPInterFinder

Friday, January 18th, 2013

PPInterFinder—a mining tool for extracting causal relations on human proteins from literature by Kalpana Raja, Suresh Subramani and Jeyakumar Natarajan. (Database (2013) 2013 : bas052 doi: 10.1093/database/bas052)

Abstract:

One of the most common and challenging problem in biomedical text mining is to mine protein–protein interactions (PPIs) from MEDLINE abstracts and full-text research articles because PPIs play a major role in understanding the various biological processes and the impact of proteins in diseases. We implemented, PPInterFinder—a web-based text mining tool to extract human PPIs from biomedical literature. PPInterFinder uses relation keyword co-occurrences with protein names to extract information on PPIs from MEDLINE abstracts and consists of three phases. First, it identifies the relation keyword using a parser with Tregex and a relation keyword dictionary. Next, it automatically identifies the candidate PPI pairs with a set of rules related to PPI recognition. Finally, it extracts the relations by matching the sentence with a set of 11 specific patterns based on the syntactic nature of PPI pair. We find that PPInterFinder is capable of predicting PPIs with the accuracy of 66.05% on AIMED corpus and outperforms most of the existing systems.

Database URL: http://www.biomining-bu.in/ppinterfinder/

I thought the shortened form of the title would catch your eye. ;-)

Important work for bioinformatics but it is also an example of domain specific association mining.

By focusing on a specific domain and forswearing designs on being a universal association solution, PPInterFinder produces useful results today.

A lesson that should be taken and applied to semantic mappings more generally.
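To make the co-occurrence idea concrete, here is a deliberately naive Python sketch. It is not PPInterFinder's method: it skips the Tregex parsing, the candidate-pair rules and the 11 syntactic patterns, and only flags sentences in which two known protein names appear alongside a relation keyword. The two small dictionaries are assumptions made up for the example.

```python
import re
from itertools import combinations

# Hypothetical, tiny dictionaries; a real system would use curated resources.
PROTEINS = {"BRCA1", "BARD1", "TP53", "MDM2"}
RELATION_KEYWORDS = {"binds", "interacts", "phosphorylates"}


def candidate_pairs(text):
    """Return (protein, protein, keyword) for sentences where all three co-occur."""
    results = []
    # Naive sentence split on whitespace that follows ., ! or ?
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = re.findall(r"[A-Za-z0-9]+", sentence)
        proteins = sorted({t for t in tokens if t in PROTEINS})
        keywords = sorted({t.lower() for t in tokens if t.lower() in RELATION_KEYWORDS})
        if keywords and len(proteins) >= 2:
            for first, second in combinations(proteins, 2):
                # Only the first keyword (alphabetically) is reported, for simplicity.
                results.append((first, second, keywords[0]))
    return results


sample = "BRCA1 interacts with BARD1 in the nucleus. TP53 is a tumor suppressor. MDM2 binds TP53."
pairs = candidate_pairs(sample)
```

Narrow dictionaries keep even this crude approach usable inside one domain, which is the point of the lesson above.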

UniChem…[How Much Precision Can You Afford?]

Thursday, January 17th, 2013

UniChem: a unified chemical structure cross-referencing and identifier tracking system by Jon Chambers, Mark Davies, Anna Gaulton, Anne Hersey, Sameer Velankar, Robert Petryszak, Janna Hastings, Louisa Bellis, Shaun McGlinchey and John P Overington. (Journal of Cheminformatics 2013, 5:3 doi:10.1186/1758-2946-5-3)

Abstract:

UniChem is a freely available compound identifier mapping service on the internet, designed to optimize the efficiency with which structure-based hyperlinks may be built and maintained between chemistry-based resources. In the past, the creation and maintenance of such links at EMBL-EBI, where several chemistry-based resources exist, has required independent efforts by each of the separate teams. These efforts were complicated by the different data models, release schedules, and differing business rules for compound normalization and identifier nomenclature that exist across the organization. UniChem, a large-scale, non-redundant database of Standard InChIs with pointers between these structures and chemical identifiers from all the separate chemistry resources, was developed as a means of efficiently sharing the maintenance overhead of creating these links. Thus, for each source represented in UniChem, all links to and from all other sources are automatically calculated and immediately available for all to use. Updated mappings are immediately available upon loading of new data releases from the sources. Web services in UniChem provide users with a single simple automatable mechanism for maintaining all links from their resource to all other sources represented in UniChem. In addition, functionality to track changes in identifier usage allows users to monitor which identifiers are current, and which are obsolete. Lastly, UniChem has been deliberately designed to allow additional resources to be included with minimal effort. Indeed, the recent inclusion of data sources external to EMBL-EBI has provided a simple means of providing users with an even wider selection of resources with which to link to, all at no extra cost, while at the same time providing a simple mechanism for external resources to link to all EMBL-EBI chemistry resources.

From the background section:

Since these resources are continually developing in response to largely distinct active user communities, a full integration solution, or even the imposition of a requirement to adopt a common unifying chemical identifier, was considered unnecessarily complex, and would inhibit the freedom of each of the resources to successfully evolve in future. In addition, it was recognized that in the future more small molecule-containing databases might reside at EMBL-EBI, either because existing databases may begin to annotate their data with chemical information, or because entirely new resources are developed or adopted. This would make a full integration solution even more difficult to sustain. A need was therefore identified for a flexible integration solution, which would create, maintain and manage links between the resources, with minimal maintenance costs to the participant resources, whilst easily allowing the inclusion of additional sources in the future. Also, since the solution should allow different resources to maintain their own identifier systems, it was recognized as important for the system to have some simple means of tracking identifier usage, at least in the sense of being able to archive obsolete identifiers and assignments, and indicate when obsolete assignments were last in use.

The UniChem project highlights an important aspect of mapping identifiers: How much mapping can you afford?

Or perhaps even better: What is the cost/benefit ratio for a complete mapping?

The mapping in question isn’t an academic exercise in elegance and completeness.

Its users have an immediate need for the mapping data and, if it is not quite right, human users are in the best position to notice errors and suggest corrections.

Not to mention that new identifiers are likely to arrive before the old ones are completely mapped.

Suggestive that evolving mappings may be an appropriate paradigm for topic maps.
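A rough Python sketch of what an evolving mapping might look like, loosely following the abstract's description: identifiers from separate sources are linked through a shared structure key, and assignments dropped by a later release are kept but marked obsolete. The class, method names, source names and identifiers are all hypothetical; this is not UniChem's data model or web service API.

```python
from collections import defaultdict


class IdentifierMap:
    """Links source identifiers through a shared structure key (e.g. a Standard InChI)."""

    def __init__(self):
        # structure key -> {source name -> {identifier -> is_current}}
        self._by_key = defaultdict(lambda: defaultdict(dict))

    def load_release(self, source, assignments):
        """Load a release for one source; assignments maps identifier -> structure key.

        Earlier assignments from the same source that are absent from the new
        release are retained but flagged as obsolete.
        """
        for sources in self._by_key.values():
            for identifier in sources.get(source, {}):
                sources[source][identifier] = False
        for identifier, key in assignments.items():
            self._by_key[key][source][identifier] = True

    def links(self, source, identifier, include_obsolete=False):
        """Return sorted (other_source, other_identifier) pairs sharing a structure key."""
        found = []
        for sources in self._by_key.values():
            status = sources.get(source, {}).get(identifier)
            if status is None or (not status and not include_obsolete):
                continue
            for other, ids in sources.items():
                if other == source:
                    continue
                for other_id, current in ids.items():
                    if current or include_obsolete:
                        found.append((other, other_id))
        return sorted(found)


# Hypothetical sources and identifiers; the InChI strings are methane and water.
mapping = IdentifierMap()
mapping.load_release("sourceA", {"A1": "InChI=1S/CH4/h1H4", "A2": "InChI=1S/H2O/h1H2"})
mapping.load_release("sourceB", {"B7": "InChI=1S/CH4/h1H4"})
before = mapping.links("sourceA", "A1")

# A later release of sourceB replaces B7 with B9 for the same structure.
mapping.load_release("sourceB", {"B9": "InChI=1S/CH4/h1H4"})
after = mapping.links("sourceA", "A1")
history = mapping.links("sourceA", "A1", include_obsolete=True)
```

Neither source had to change its own identifiers, and the old assignment remains available for anyone who needs to trace it.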

Manual Alignment of Anatomy Ontologies

Saturday, January 12th, 2013

Matching arthropod anatomy ontologies to the Hymenoptera Anatomy Ontology: results from a manual alignment by Matthew A. Bertone, István Mikó, Matthew J. Yoder, Katja C. Seltmann, James P. Balhoff, and Andrew R. Deans. (Database (2013) 2013 : bas057 doi: 10.1093/database/bas057)

Abstract:

Matching is an important step for increasing interoperability between heterogeneous ontologies. Here, we present alignments we produced as domain experts, using a manual mapping process, between the Hymenoptera Anatomy Ontology and other existing arthropod anatomy ontologies (representing spiders, ticks, mosquitoes and Drosophila melanogaster). The resulting alignments contain from 43 to 368 mappings (correspondences), all derived from domain-expert input. Despite the many pairwise correspondences, only 11 correspondences were found in common between all ontologies, suggesting either major intrinsic differences between each ontology or gaps in representing each group’s anatomy. Furthermore, we compare our findings with putative correspondences from Bioportal (derived from LOOM software) and summarize the results in a total evidence alignment. We briefly discuss characteristics of the ontologies and issues with the matching process.

Database URL: http://purl.obolibrary.org/obo/hao/2012-07-18/arthropod-mappings.obo.

A great example of the difficulty of matching across ontologies, particularly when the granularity or subjects of ontologies vary.
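The drop from hundreds of pairwise correspondences to a handful shared by every ontology is easy to see as a set intersection. A small Python sketch, assuming each alignment is stored as a dictionary from a reference-ontology term to its matched term; the ontology names and term labels are invented for the example and are not taken from the published mappings.

```python
def common_terms(alignments):
    """Return reference terms that have a correspondence in every alignment.

    alignments: dict of ontology name -> {reference term: matched term}.
    """
    term_sets = [set(mapping) for mapping in alignments.values()]
    if not term_sets:
        return set()
    return set.intersection(*term_sets)


# Hypothetical alignments against a reference ontology.
alignments = {
    "spider": {"leg": "SPD:leg", "eye": "SPD:eye", "abdomen": "SPD:opisthosoma"},
    "tick": {"leg": "TADS:leg", "eye": "TADS:eye"},
    "mosquito": {"leg": "TGMA:leg", "wing": "TGMA:wing", "eye": "TGMA:compound_eye"},
}
shared = common_terms(alignments)
```

Every alignment here has two or three mappings, yet only two terms survive the intersection. A term missing from any one alignment, whether because the group lacks the structure or because its ontology has a gap, drops out.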

Ontology Alert! Molds are able to reproduce sexually

Thursday, January 10th, 2013

Unlike we thought for 100 years: Molds are able to reproduce sexually

For over 100 years, it was assumed that the penicillin-producing mould fungus Penicillium chrysogenum only reproduced asexually through spores. An international research team led by Prof. Dr. Ulrich Kück and Julia Böhm from the Chair of General and Molecular Botany at the Ruhr-Universität has now shown for the first time that the fungus also has a sexual cycle, i.e. two “genders”. Through sexual reproduction of P. chrysogenum, the researchers generated fungal strains with new biotechnologically relevant properties – such as high penicillin production without the contaminating chrysogenin. The team from Bochum, Göttingen, Nottingham (England), Kundl (Austria) and Sandoz GmbH reports in PNAS. The article will be published in this week’s Online Early Edition and was selected as a cover story.

J. Böhm, B. Hoff, C.M. O’Gorman, S. Wolfers, V. Klix, D. Binger, I. Zadra, H. Kürnsteiner, S. Pöggeler, P.S. Dyer, U. Kück (2013): Sexual reproduction and mating-type – mediated strain development in the penicillin-producing fungus Penicillium chrysogenum, PNAS, DOI: 10.1073/pnas.1217943110

If you have hard coded asexual reproduction into your ontology, time to reconsider that decision. And get agreement on reworking all the dependent relationships.

PLOS Computational Biology: Translational Bioinformatics

Tuesday, January 8th, 2013

PLOS Computational Biology: Translational Bioinformatics. Maricel Kann, Guest Editor, and Fran Lewitter, PLOS Computational Biology Education Editor.

Following up on the collection where Biomedical Knowledge Integration appears, only to find:

Introduction to Translational Bioinformatics Collection by Russ B. Altman. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002796

Chapter 1: Biomedical Knowledge Integration by Philip R. O. Payne. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002826

Chapter 2: Data-Driven View of Disease Biology by Casey S. Greene and Olga G. Troyanskaya. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002816

Chapter 3: Small Molecules and Disease by David S. Wishart. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002805

Chapter 4: Protein Interactions and Disease by Mileidy W. Gonzalez and Maricel G. Kann. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002819

Chapter 5: Network Biology Approach to Complex Diseases by Dong-Yeon Cho, Yoo-Ah Kim and Teresa M. Przytycka. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002820

Chapter 6: Structural Variation and Medical Genomics by Benjamin J. Raphael. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002821

Chapter 7: Pharmacogenomics by Konrad J. Karczewski, Roxana Daneshjou and Russ B. Altman. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002817

Chapter 8: Biological Knowledge Assembly and Interpretation by Ju Han Kim. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002858

Chapter 9: Analyses Using Disease Ontologies by Nigam H. Shah, Tyler Cole and Mark A. Musen. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002827

Chapter 10: Mining Genome-Wide Genetic Markers by Xiang Zhang, Shunping Huang, Zhaojun Zhang and Wei Wang. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002828

Chapter 11: Genome-Wide Association Studies by William S. Bush and Jason H. Moore. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002822

Chapter 12: Human Microbiome Analysis by Xochitl C. Morgan and Curtis Huttenhower. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002808

Chapter 13: Mining Electronic Health Records in the Genomics Era by Joshua C. Denny. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002823

Chapter 14: Cancer Genome Analysis by Miguel Vazquez, Victor de la Torre and Alfonso Valencia. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002824

An example of scholarship at its best!

Biomedical Knowledge Integration

Tuesday, January 8th, 2013

Biomedical Knowledge Integration by Philip R. O. Payne.

Abstract:

The modern biomedical research and healthcare delivery domains have seen an unparalleled increase in the rate of innovation and novel technologies over the past several decades. Catalyzed by paradigm-shifting public and private programs focusing upon the formation and delivery of genomic and personalized medicine, the need for high-throughput and integrative approaches to the collection, management, and analysis of heterogeneous data sets has become imperative. This need is particularly pressing in the translational bioinformatics domain, where many fundamental research questions require the integration of large scale, multi-dimensional clinical phenotype and bio-molecular data sets. Modern biomedical informatics theory and practice has demonstrated the distinct benefits associated with the use of knowledge-based systems in such contexts. A knowledge-based system can be defined as an intelligent agent that employs a computationally tractable knowledge base or repository in order to reason upon data in a targeted domain and reproduce expert performance relative to such reasoning operations. The ultimate goal of the design and use of such agents is to increase the reproducibility, scalability, and accessibility of complex reasoning tasks. Examples of the application of knowledge-based systems in biomedicine span a broad spectrum, from the execution of clinical decision support, to epidemiologic surveillance of public data sets for the purposes of detecting emerging infectious diseases, to the discovery of novel hypotheses in large-scale research data sets. 
In this chapter, we will review the basic theoretical frameworks that define core knowledge types and reasoning operations with particular emphasis on the applicability of such conceptual models within the biomedical domain, and then go on to introduce a number of prototypical data integration requirements and patterns relevant to the conduct of translational bioinformatics that can be addressed via the design and use of knowledge-based systems.

A chapter in the “Translational Bioinformatics” collection for PLOS Computational Biology.

A very good survey of the knowledge integration area, which alas does not include topic maps. :-(

Well, but it does include use cases at the end of the chapter that are biomedical specific.

Thinking those would be good cases to illustrate the use of topic maps for biomedical knowledge integration.

Yes?

Semantically enabling a genome-wide association study database

Saturday, January 5th, 2013

Semantically enabling a genome-wide association study database by Tim Beck, Robert C Free, Gudmundur A Thorisson and Anthony J Brookes. Journal of Biomedical Semantics 2012, 3:9 doi:10.1186/2041-1480-3-9.

Abstract:

Background

The amount of data generated from genome-wide association studies (GWAS) has grown rapidly, but considerations for GWAS phenotype data reuse and interchange have not kept pace. This impacts on the work of GWAS Central — a free and open access resource for the advanced querying and comparison of summary-level genetic association data. The benefits of employing ontologies for standardising and structuring data are widely accepted. The complex spectrum of observed human phenotypes (and traits), and the requirement for cross-species phenotype comparisons, calls for reflection on the most appropriate solution for the organisation of human phenotype data. The Semantic Web provides standards for the possibility of further integration of GWAS data and the ability to contribute to the web of Linked Data.

Results

A pragmatic consideration when applying phenotype ontologies to GWAS data is the ability to retrieve all data, at the most granular level possible, from querying a single ontology graph. We found the Medical Subject Headings (MeSH) terminology suitable for describing all traits (diseases and medical signs and symptoms) at various levels of granularity and the Human Phenotype Ontology (HPO) most suitable for describing phenotypic abnormalities (medical signs and symptoms) at the most granular level. Diseases within MeSH are mapped to HPO to infer the phenotypic abnormalities associated with diseases. Building on the rich semantic phenotype annotation layer, we are able to make cross-species phenotype comparisons and publish a core subset of GWAS data as RDF nanopublications.

Conclusions

We present a methodology for applying phenotype annotations to a comprehensive genome-wide association dataset and for ensuring compatibility with the Semantic Web. The annotations are used to assist with cross-species genotype and phenotype comparisons. However, further processing and deconstructions of terms may be required to facilitate automatic phenotype comparisons. The provision of GWAS nanopublications enables a new dimension for exploring GWAS data, by way of intrinsic links to related data resources within the Linked Data web. The value of such annotation and integration will grow as more biomedical resources adopt the standards of the Semantic Web.

Rather than:

The benefits of employing ontologies for standardising and structuring data are widely accepted.

I would rephrase that to read:

The benefits and limitations of employing ontologies for standardising and structuring data are widely known.

Decades of use of relational database schemas, informal equivalents of ontologies, leave no doubt governing structures for data have benefits.

Less often acknowledged is that those same governing structures impose limitations on data and what may be represented.

That’s not a dig at relational databases.

Just an observation that ontologies and their equivalents aren’t unalloyed precious metals.

Utopia Documents

Thursday, December 27th, 2012

Checked the “sponsored by” link for pdfx v1.0 and discovered: Utopia Documents.

From the homepage:

Reading, redefined.

Utopia Documents brings a fresh new perspective to reading the scientific literature, combining the convenience and reliability of the PDF with the flexibility and power of the web. Free for Linux, Mac and Windows.

Building Bridges

The scientific article has been described as a Story That Persuades With Data, but all too often the link between data and narrative is lost somewhere in the modern publishing process. Utopia Documents helps to rebuild these connections, linking articles to underlying datasets, and making it easy to access online resources relating to an article’s content.

A Living Resource

Published articles form the ‘minutes of science’, creating a stable record of ideas and discoveries. But no idea exists in isolation, and just because something has been published doesn’t mean that the story is over. Utopia Documents reconnects PDFs with the ongoing discussion, keeping you up-to-date with the latest knowledge and metrics.

Comment

Make private notes for yourself, annotate a document for others to see or take part in an online discussion.

Explore article content

Looking for clarification of given terms? Or more information about them? Do just that, with integrated semantic search.

Interact with live data

Interact directly with curated database entries- play with molecular structures; edit sequence and alignment data; even plot and export tabular data.

A finger on the pulse

Stay up to date with the latest news. Utopia connects what you read with live data from Altmetric, Mendeley, CrossRef, Scibite and others.

A user can register for an account (enabling comments on documents) or use the application anonymously.

It is presently focused on the life sciences, but there is no impediment to expansion into computer science, for example.

It doesn’t solve semantic diversity issues, so there is an opportunity for topic maps there.

Nor does it address the issue of documents being good at information delivery but not so good at information storage.

But issues of semantic diversity and information storage are growth areas for Utopia Documents, not reservations about its use.

I suggest you start using and exploring Utopia Documents sooner rather than later!

EOL Classification Providers [Encyclopedia of Life]

Wednesday, December 26th, 2012

EOL Classification Providers

From the webpage:

The information on EOL is organized using hierarchical classifications of taxa (groups of organisms) from a number of different classification providers. You can explore these hierarchies in the Names tab of EOL taxon pages. Many visitors would expect to see a single classification of life on EOL. However, we are still far from having a classification scheme that is universally accepted.

Biologists all over the world are studying the genetic relationships between organisms in order to determine each species’ place in the hierarchy of life. While this research is underway, there will be differences in opinion on how to best classify each group. Therefore, we present our visitors with a number of alternatives. Each of these hierarchies is supported by a community of scientists, and all of them feature relationships that are controversial or unresolved.

How far from universally accepted?

Consider the sources for classification:

AntWeb
AntWeb is generally recognized as the most advanced biodiversity information system at species level dedicated to ants. Altogether, its acceptance by the ant research community, the number of participating remote curators that maintain the site, number of pictures, simplicity of web interface, and completeness of species, make AntWeb the premier reference for dissemination of data, information, and knowledge on ants. AntWeb is serving information on tens of thousands of ant species through the EOL.

Avibase
Avibase is an extensive database information system about all birds of the world, containing over 6 million records about 10,000 species and 22,000 subspecies of birds, including distribution information, taxonomy, synonyms in several languages and more. This site is managed by Denis Lepage and hosted by Bird Studies Canada, the Canadian copartner of Birdlife International. Avibase has been a work in progress since 1992 and it is offered as a free service to the bird-watching and scientific community. In addition to links, Avibase helped us install Gill, F & D Donsker (Eds). 2012. IOC World Bird Names (v 3.1). Available at http://www.worldbirdnames.org as of 2 May 2012. More bird classifications are likely to follow.

CoL
The Catalogue of Life Partnership (CoLP) is an informal partnership dedicated to creating an index of the world’s organisms, called the Catalogue of Life (CoL). The CoL provides different forms of access to an integrated, quality, maintained, comprehensive consensus species checklist and taxonomic hierarchy, presently covering more than one million species, and intended to cover all known species in the near future. The Annual Checklist EOL uses contains substantial contributions of taxonomic expertise from more than fifty organizations around the world, integrated into a single work by the ongoing work of the CoLP partners.

FishBase
FishBase is a global information system with all you ever wanted to know about fishes. FishBase is a relational database with information to cater to different professionals such as research scientists, fisheries managers, zoologists and many more. The FishBase Website contains data on practically every fish species known to science. The project was developed at the WorldFish Center in collaboration with the Food and Agriculture Organization of the United Nations and many other partners, and with support from the European Commission. FishBase is serving information on more than 30,000 fish species through EOL.

Index Fungorum
The Index Fungorum, the global fungal nomenclator coordinated and supported by the Index Fungorum Partnership (CABI, CBS, Landcare Research-NZ), contains names of fungi (including yeasts, lichens, chromistan fungal analogues, protozoan fungal analogues and fossil forms) at all ranks.

ITIS
The Integrated Taxonomic Information System (ITIS) is a partnership of federal agencies and other organizations from the United States, Canada, and Mexico, with data stewards and experts from around the world (see http://www.itis.gov). The ITIS database is an automated reference of scientific and common names of biota of interest to North America. It contains more than 600,000 scientific and common names in all kingdoms, and is accessible via the World Wide Web in English, French, Spanish, and Portuguese (http://itis.gbif.net). ITIS is part of the US National Biological Information Infrastructure (http://www.nbii.gov).

IUCN
International Union for Conservation of Nature (IUCN) helps the world find pragmatic solutions to our most pressing environment and development challenges. IUCN supports scientific research; manages field projects all over the world; and brings governments, non-government organizations, United Nations agencies, companies and local communities together to develop and implement policy, laws and best practice. EOL partnered with the IUCN to indicate status of each species according to the Red List of Threatened Species.

Metalmark Moths of the World
Metalmark moths (Lepidoptera: Choreutidae) are a poorly known, mostly tropical family of microlepidopterans. The Metalmark Moths of the World LifeDesk provides species pages and an updated classification for the group.

NCBI
As a U.S. national resource for molecular biology information, NCBI’s mission is to develop new information technologies to aid in the understanding of fundamental molecular and genetic processes that control health and disease. The NCBI taxonomy database contains the names of all organisms that are represented in the genetic databases with at least one nucleotide or protein sequence.

The Paleobiology Database
The Paleobiology Database is a public resource for the global scientific community. It has been organized and operated by a multi-disciplinary, multi-institutional, international group of paleobiological researchers. Its purpose is to provide global, collection-based occurrence and taxonomic data for marine and terrestrial animals and plants of any geological age, as well as web-based software for statistical analysis of the data. The project’s wider, long-term goal is to encourage collaborative efforts to answer large-scale paleobiological questions by developing a useful database infrastructure and bringing together large data sets.

The Reptile Database 
This database provides information on the classification of all living reptiles by listing all species and their pertinent higher taxa. The database therefore covers all living snakes, lizards, turtles, amphisbaenians, tuataras, and crocodiles. It is a source of taxonomic data, thus providing primarily (scientific) names, synonyms, distributions and related data. The database is currently supported by the Systematics working group of the German Herpetological Society (DGHT).

WoRMS
The aim of a World Register of Marine Species (WoRMS) is to provide an authoritative and comprehensive list of names of marine organisms, including information on synonymy. While highest priority goes to valid names, other names in use are included so that this register can serve as a guide to interpret taxonomic literature.

Those are “current” classifications, which don’t reflect historical classifications (used by our ancestors), nor future classifications.

Consider, for example, the four states of matter becoming more than 500 states of matter.

Instead of “universal acceptance,” how does “working agreement for a specific purpose” sound?

New Public-Access Source With 3-D Information for Protein Interactions

Friday, December 21st, 2012

New Public-Access Source With 3-D Information for Protein Interactions

From the post:

Researchers have developed a platform that compiles all the atomic data, previously stored in diverse databases, on protein structures and protein interactions for eight organisms of relevance. They apply a singular homology-based modelling procedure.

The scientists Roberto Mosca, Arnaud Ceol and Patrick Aloy provide the international biomedical community with Interactome3D (interactome3d.irbbarcelona.org), an open-access and free web platform developed entirely by the Institute for Research in Biomedicine (IRB Barcelona). Interactome 3D offers for the first time the possibility to anonymously access and add molecular details of protein interactions and to obtain the information in 3D models. For researchers, atomic level details about the reactions are fundamental to unravel the bases of biology, disease development, and the design of experiments and drugs to combat diseases.

Interactome 3D provides reliable information about more than 12,000 protein interactions for eight model organisms, namely the plant Arabidopsis thaliana, the worm Caenorhabditis elegans, the fly Drosophila melanogaster, the bacteria Escherichia coli and Helicobacter pylori, the brewer’s yeast Saccharomyces cerevisiae, the mouse Mus musculus, and Homo sapiens. These models are considered the most relevant in biomedical research and genetic studies. The journal Nature Methods presents the research results and accredits the platform on the basis of its high reliability and precision in modelling interactions, which reaches an average of 75%.

Further details can be found at:

Interactome3D: adding structural details to protein networks by Roberto Mosca, Arnaud Céol and Patrick Aloy. (Nature Methods (2012) doi:10.1038/nmeth.2289)

Abstract:

Network-centered approaches are increasingly used to understand the fundamentals of biology. However, the molecular details contained in the interaction networks, often necessary to understand cellular processes, are very limited, and the experimental difficulties surrounding the determination of protein complex structures make computational modeling techniques paramount. Here we present Interactome3D, a resource for the structural annotation and modeling of protein-protein interactions. Through the integration of interaction data from the main pathway repositories, we provide structural details at atomic resolution for over 12,000 protein-protein interactions in eight model organisms. Unlike static databases, Interactome3D also allows biologists to upload newly discovered interactions and pathways in any species, select the best combination of structural templates and build three-dimensional models in a fully automated manner. Finally, we illustrate the value of Interactome3D through the structural annotation of the complement cascade pathway, rationalizing a potential common mechanism of action suggested for several disease-causing mutations.

Interesting not only for its implications for bioinformatics but also for the development of homology modeling (superficially, similar proteins have similar interaction sites) to assist in their work.

The topic map analogy would be to show that, within a subject domain, different identifications of the same subject tend to have the same associations or to fall into other patterns.

One could then construct a subject identity test based upon a template of associations or other values.
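
A minimal sketch of what such a test might look like, in Python. Everything here is an assumption made for illustration: the `Identification` class, the `overlap` and `same_subject` functions, the placeholder labels and association names, and the choice of Jaccard overlap with a 0.5 threshold. None of it comes from Interactome3D or from any topic map standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identification:
    """One way of identifying a subject, plus the associations seen for it.

    Each association is a (type, other_subject) pair. All values used
    below are hypothetical placeholders.
    """
    label: str
    associations: frozenset


def overlap(a: frozenset, b: frozenset) -> float:
    """Jaccard overlap of two association sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def same_subject(x: Identification, y: Identification,
                 threshold: float = 0.5) -> bool:
    """Treat two identifications as the same subject when their
    associations overlap at or above the (assumed) threshold."""
    return overlap(x.associations, y.associations) >= threshold


first = Identification("label-A", frozenset({
    ("interacts_with", "S1"), ("interacts_with", "S2"), ("located_in", "S3"),
}))
second = Identification("label-B", frozenset({
    ("interacts_with", "S1"), ("interacts_with", "S2"), ("located_in", "S4"),
}))
third = Identification("label-C", frozenset({("located_in", "S9")}))

print(same_subject(first, second))  # two of four distinct associations shared
print(same_subject(first, third))   # no associations shared
```

The threshold is the “working agreement” part: how much overlap counts as the same subject depends on the purpose at hand, so it is a parameter rather than a constant.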

Bio-Linux 7 – Released November 2012

Tuesday, December 18th, 2012

Bio-Linux 7 – Released November 2012

From the webpage:

Bio-Linux 7 is a fully featured, powerful, configurable and easy to maintain bioinformatics workstation. Bio-Linux provides more than 500 bioinformatics programs on an Ubuntu Linux 12.04 LTS base. There is a graphical menu for bioinformatics programs, as well as easy access to the Bio-Linux bioinformatics documentation system and sample data useful for testing programs.

Bio-Linux 7 adds many improvements over previous versions, including the Galaxy analysis environment.  There are also various packages to handle new generation sequence data types.

You can install Bio-Linux on your machine, either as the only operating system, or as part of a dual-boot setup which allows you to use your current system and Bio-Linux on the same hardware.

Bio-Linux also runs Live from the DVD or a USB stick. This runs in the memory of your machine and does not involve installing anything. This is a great, no-hassle way to try out Bio-Linux, demonstrate or teach with it, or to work with when you are on the move.

Bio-Linux is built on open source systems and software, and so is free to install and use. See What’s new on Bio-Linux 7. Also, check out the 2006 paper on Bio-Linux and open source systems for biologists.

Useful for exploring bioinformatics tools for Ubuntu.

But useful as well for considering how those tools could be used in data/text mining for other domains.

Not to mention the packaging for installation to DVD or USB stick.

Are there any topic map engines that are set up for burning to DVD or USB stick?

Packaging them that way with more than a minimal set of maps and/or data sets might be a useful avenue to explore.

taxize: Taxonomic search and phylogeny retrieval [R]

Monday, December 17th, 2012

taxize: Taxonomic search and phylogeny retrieval by Scott Chamberlain, Eduard Szoecs and Carl Boettiger.

From the documentation:

We are developing taxize as a package to allow users to search over many websites for species names (scientific and common) and download up- and downstream taxonomic hierarchical information – and many other things. The functions in the package that hit a specific API have a prefix and suffix separated by an underscore. They follow the format of service_whatitdoes. For example, gnr_resolve uses the Global Names Resolver API to resolve species names. General functions in the package that don’t hit a specific API don’t have two words separated by an underscore, e.g., classification. You need API keys for Encyclopedia of Life (EOL), the Universal Biological Indexer and Organizer (uBio), Tropicos, and Plantminer.

Just in case you need species names and/or taxonomic hierarchy information for your topic map.
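
The naming convention quoted from the documentation (service_whatitdoes for functions that hit a specific API, no underscore for general ones) can be illustrated with a small sketch. This is Python written only for illustration, not taxize code (taxize is an R package), and `parse_function_name` is a hypothetical helper. The two names used come from the quoted documentation.

```python
from typing import Optional, Tuple


def parse_function_name(name: str) -> Optional[Tuple[str, str]]:
    """Split a name of the form service_whatitdoes into (service, action).

    Returns None for a general function, i.e. a name that does not have
    two non-empty parts separated by an underscore. Splitting happens at
    the first underscore only, which is an assumption of this sketch.
    """
    prefix, separator, suffix = name.partition("_")
    if not separator or not prefix or not suffix:
        return None
    return prefix, suffix


print(parse_function_name("gnr_resolve"))     # ('gnr', 'resolve')
print(parse_function_name("classification"))  # None
```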