Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

January 6, 2014

Needles in Stacks of Needles:…

Filed under: Bioinformatics,Biomedical,Genomics,Searching,Visualization — Patrick Durusau @ 3:33 pm

Needles in Stacks of Needles: genomics + data mining by Martin Krzywinski. (ICDM2012 Keynote)

Abstract:

In 2001, the first human genome sequence was published. Now, just over 10 years later, we are capable of sequencing a genome in just a few days. Massive parallel sequencing projects now make it possible to study the cancers of thousands of individuals. New data mining approaches are required to robustly interrogate the data for causal relationships among the inherently noisy biology. How does one identify genetic changes that are specific and causal to a disease within the rich variation that is either natural or merely correlated? The problem is one of finding a needle in a stack of needles. I will provide a non-specialist introduction to data mining methods and challenges in genomics, with a focus on the role visualization plays in the exploration of the underlying data.

This page links to the slides Martin used in his presentation.

Excellent graphics and a number of amusing points, even without the presentation itself:

Cheap Date: A fruit fly that expresses high sensitivity to alcohol.

Kenny: A fruit fly without this gene dies in two days, named for the South Park character who dies in each episode.

Ken and Barbie: Fruit flies that fail to develop external genitalia.

One observation that rings true across disciplines:

Literature is still largely composed and published opaquely.

I searched for a video recording of the presentation but came up empty.

Need a Human

Filed under: Bioinformatics,Biomedical,Genomics — Patrick Durusau @ 11:38 am

Need a Human

Shamelessly stolen from Martin Krzywinski’s ICDM2012 Keynote — Needles in Stacks of Needles.

I am about to post on that keynote but thought the image merited a post of its own.

December 27, 2013

Galaxy:…

Filed under: Bioinformatics,Biomedical,Biostatistics — Patrick Durusau @ 5:41 pm

Galaxy: Data Intensive Biology For Everyone

From the website:

Galaxy is an open, web-based platform for data intensive biomedical research. Whether on the free public server or your own instance, you can perform, reproduce, and share complete analyses.

From the Galaxy wiki:

Galaxy is an open, web-based platform for accessible, reproducible, and transparent computational biomedical research.

  • Accessible: Users without programming experience can easily specify parameters and run tools and workflows.
  • Reproducible: Galaxy captures information so that any user can repeat and understand a complete computational analysis.
  • Transparent: Users share and publish analyses via the web and create Pages, interactive, web-based documents that describe a complete analysis.

This is the Galaxy Community Wiki. It describes all things Galaxy.

Whether you are a home bio-hacker or an IT person looking to understand computational biology, Galaxy may be a good fit for you.

You can try out the public server before going to the trouble of a local install, unless, of course, you are paranoid about your bits going over the network. 😉

December 24, 2013

Resource Identification Initiative

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 4:30 pm

Resource Identification Initiative

From the webpage:

We are starting a pilot project, sponsored by the Neuroscience Information Framework and the International Neuroinformatics Coordinating Facility, to address the issue of proper resource identification within the neuroscience (actually biomedical) literature. We have now christened this project the Resource Identification Initiative (hashtag #RII) and expanded the scope beyond neuroscience. This project is designed to make it easier for researchers to identify the key resources (materials, data, tools) used to produce the scientific findings within a published study and to find other studies that used the same resources. It is also designed to make it easier for resource providers to track usage of their resources and for funders to measure impacts of resource funding. The requirements are that key resources are identified in such a manner that they are identified uniquely and are:

1) Machine readable;

2) Available outside the paywall;

3) Uniform across publishers and journals.

We are seeking broad input from the FORCE11 community to ensure that we come up with a solution that represents the best thinking available on these topics.

The pilot project was an outcome of a meeting held at the NIH on Jun 26th. A draft report from the June 26th Resource Identification meeting at the NIH is now available. As the report indicates, we have preliminary agreements from journals and publishers to implement a pilot project. We hope to extend this project well beyond the neuroscience literature, so please join this group if you are interested in participating.

….

Yes, another “unique identifier” project.

Don’t get me wrong, to the extent that a unique vocabulary can be developed and used, that’s great.

But it does not address:

  • tools/techniques/data that existed before the unique vocabulary came into existence
  • future tools/techniques/data that aren't covered by the unique vocabulary
  • mappings between old, current and future tools/techniques/data (a sketch of such a mapping follows this list)
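
To make the last point concrete, here is a minimal sketch in Python of keeping the new unique identifiers while maintaining explicit mappings from the identifiers already in the literature. Every identifier and mapping below is invented for illustration; this is not RII machinery.

```python
# A minimal sketch, not RII machinery: keep the new unique identifiers, but
# maintain explicit mappings from identifiers already in the literature
# (catalog numbers, lab names, legacy ids) onto them, so old and future
# usages can be reconciled. All identifiers below are invented.
MAPPINGS = {
    "RRID:AB_000001": {                      # hypothetical new-style resource id
        "legacy": {"Santa Cruz sc-0000", "anti-GFAP (Smith lab)"},
        "basis": "vendor catalog match",
    },
}

def resolve(identifier):
    """Map a legacy identifier onto its new-style id, if a mapping is declared."""
    for rid, entry in MAPPINGS.items():
        if identifier == rid or identifier in entry["legacy"]:
            return rid
    return None  # not covered yet -- exactly the gap noted in the list above

print(resolve("Santa Cruz sc-0000"))      # -> RRID:AB_000001
print(resolve("anti-Foo (unknown lab)"))  # -> None
```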

The project is trying to address a real need in neuroscience journals (lack of robust identification of organisms or antibodies).

If you have the time and interest, it is a worthwhile project that needs to consider the requirements for “robust” identification.

September 26, 2013

Computational Chemogenomics

Filed under: Bioinformatics,Biomedical,Genomics — Patrick Durusau @ 11:00 am

Computational Chemogenomics by Edgar Jacoby (Novartis Pharma AG, Switzerland).

Description:

In the post-genomic era, one of the key challenges for drug discovery consists in making optimal use of comprehensive genomic data to identify effective new medicines. Chemogenomics addresses this challenge and aims to systematically identify all ligands and modulators for all gene products expressed, besides allowing accelerated exploration of their biological function.

Computational chemogenomics focuses on applications of compound library design and virtual screening to expand the bioactive chemical space, to target hopping of chemotypes to identify synergies within related drug discovery projects or to repurpose known drugs, to propose mechanisms of action of compounds, and to identify off-target effects by cross-reactivity analysis.

Both ligand-based and structure-based in silico approaches, as reviewed in this book, play important roles in all these applications. Computational chemogenomics is expected to increase the quality and productivity of drug discovery and lead to the discovery of new medicines.

If you are on the cutting edge of bioinformatics or want to keep up with the cutting edge in bioinformatics, this is a volume to consider.

The hard copy price is $149.95 so it may be a while before I acquire a copy of it.

September 24, 2013

Rumors of Legends (the TMRM kind?)

Filed under: Bioinformatics,Biomedical,Legends,Semantics,TMRM,XML — Patrick Durusau @ 3:42 pm

BioC: a minimalist approach to interoperability for biomedical text processing (numerous authors, see the article).

Abstract:

A vast amount of scientific information is encoded in natural language text, and the quantity of such text has become so great that it is no longer economically feasible to have a human as the first step in the search process. Natural language processing and text mining tools have become essential to facilitate the search for and extraction of information from text. This has led to vigorous research efforts to create useful tools and to create humanly labeled text corpora, which can be used to improve such tools. To encourage combining these efforts into larger, more powerful and more capable systems, a common interchange format to represent, store and exchange the data in a simple manner between different language processing systems and text mining tools is highly desirable. Here we propose a simple extensible mark-up language format to share text documents and annotations. The proposed annotation approach allows a large number of different annotations to be represented including sentences, tokens, parts of speech, named entities such as genes or diseases and relationships between named entities. In addition, we provide simple code to hold this data, read it from and write it back to extensible mark-up language files and perform some sample processing. We also describe completed as well as ongoing work to apply the approach in several directions. Code and data are available at http://bioc.sourceforge.net/.

From the introduction:

With the proliferation of natural language text, text mining has emerged as an important research area. As a result many researchers are developing natural language processing (NLP) and information retrieval tools for text mining purposes. However, while the capabilities and the quality of tools continue to grow, it remains challenging to combine these into more complex systems. Every new generation of researchers creates their own software specific to their research, their environment and the format of the data they study; possibly due to the fact that this is the path requiring the least labor. However, with every new cycle restarting in this manner, the sophistication of systems that can be developed is limited. (emphasis added)

That is the experience with creating electronic versions of the Hebrew Bible. Every project has started from a blank screen, requiring re-proofing of the same text, etc. As a result, there is no electronic encoding of the masora magna (think long margin notes). Duplicated effort has a real cost to scholarship.

The authors stray into legend land when they write:

Our approach to these problems is what we would like to call a ‘minimalist’ approach. How ‘little’ can one do to obtain interoperability? We provide an extensible mark-up language (XML) document type definition (DTD) defining ways in which a document can contain text, annotations and relations. Major XML elements may contain ‘infon’ elements, which store key-value pairs with any desired semantic information. We have adapted the term ‘infon’ from the writings of Devlin (1), where it is given the sense of a discrete item of information. An associated ‘key’ file is necessary to define the semantics that appear in tags such as the infon elements. Key files are simple text files where the developer defines the semantics associated with the data. Different corpora or annotation sets sharing the same semantics may reuse an existing key file, thus representing an accepted standard for a particular data type. In addition, key files may describe a new kind of data not seen before. At this point we prescribe no semantic standards. BioC users are encouraged to create their own key files to represent their BioC data collections. In time, we believe, the most useful key files will develop a life of their own, thus providing emerging standards that are naturally adopted by the community.

The “key files” don’t specify subject identities for the purposes of merging. But defining the semantics of data is a first step in that direction.
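
For readers who haven't seen BioC, here is a rough sketch in Python of what a document with "infon" key-value pairs might look like. The element names follow the paper's description; the document id, gene annotation and key file name are invented, and the official DTD may differ in detail.

```python
# A rough sketch of a BioC-style document: a passage carrying an annotation
# whose semantics live in "infon" key-value pairs, with the keys themselves
# documented in a separate key file. Element names follow the paper's
# description; the ids and key names below are invented.
import xml.etree.ElementTree as ET

collection = ET.Element("collection")
ET.SubElement(collection, "key").text = "example.key"  # names the key file

document = ET.SubElement(collection, "document")
ET.SubElement(document, "id").text = "DOC-0001"  # hypothetical document id

passage = ET.SubElement(document, "passage")
ET.SubElement(passage, "offset").text = "0"
ET.SubElement(passage, "text").text = "BRCA1 mutations are linked to breast cancer."

annotation = ET.SubElement(passage, "annotation", id="A1")
# The meaning of these keys ("type", "NCBI gene id") would be defined in
# example.key, not in the XML itself -- that is the BioC division of labor.
ET.SubElement(annotation, "infon", key="type").text = "gene"
ET.SubElement(annotation, "infon", key="NCBI gene id").text = "672"
ET.SubElement(annotation, "location", offset="0", length="5")
ET.SubElement(annotation, "text").text = "BRCA1"

print(ET.tostring(collection, encoding="unicode"))
```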

I like the idea of popular “key files” (read legends) taking on a life of their own due to their usefulness. An economic activity based on reducing the friction in using or re-using data. That should have legs.

BTW, don’t overlook the authors’ data and code, available at: http://bioc.sourceforge.net/.

August 4, 2013

Building Smaller Data

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 9:41 am

Throw the Bath Water Out, Keep the Baby: Keeping Medically-Relevant Terms for Text Mining by Jay Jarman, MS and Donald J. Berndt, PhD.

Abstract:

The purpose of this research is to answer the question, can medically-relevant terms be extracted from text notes and text mined for the purpose of classification and obtain equal or better results than text mining the original note? A novel method is used to extract medically-relevant terms for the purpose of text mining. A dataset of 5,009 EMR text notes (1,151 related to falls) was obtained from a Veterans Administration Medical Center. The dataset was processed with a natural language processing (NLP) application which extracted concepts based on SNOMED-CT terms from the Unified Medical Language System (UMLS) Metathesaurus. SAS Enterprise Miner was used to text mine both the set of complete text notes and the set represented by the extracted concepts. Logistic regression models were built from the results, with the extracted concept model performing slightly better than the complete note model.

The researchers created two datasets. One composed of the original text medical notes and the second of extracted named entities using NLP and medical vocabularies.

The named entity only dataset was found to perform better than the full text mining approach.

A smaller data set that had a higher performance than the larger data set of notes.

Wait! Isn’t that backwards? I thought “big data” was always better than “smaller data?”

Maybe not?

Maybe having the “right” dataset is better than having a “big data” set.
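
As a toy illustration of the comparison (not the authors' SAS Enterprise Miner pipeline), here is a sketch in scikit-learn that classifies notes as fall-related from the full text versus from extracted concepts only. The "concept extractor" is a stand-in dictionary, not a SNOMED-CT/UMLS pipeline, and the notes are invented.

```python
# A toy sketch of the comparison, not the authors' pipeline: classify notes
# as fall-related from full text versus from extracted concepts only.
# extract_concepts() is a stand-in dictionary, not a SNOMED-CT extractor.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

notes = [
    "Patient slipped on wet floor and fell, bruising on left hip.",
    "Routine follow-up, blood pressure controlled, no complaints.",
    "Found on floor by nurse; unwitnessed fall, ordered hip x-ray.",
    "Medication refill visit, no acute issues reported.",
]
labels = [1, 0, 1, 0]  # 1 = fall-related note

def extract_concepts(note):
    """Toy stand-in for a medical concept extractor."""
    lexicon = {"fell": "Falls", "fall": "Falls", "hip": "HipInjury",
               "bruising": "Contusion", "x-ray": "Radiography"}
    return " ".join(code for term, code in lexicon.items() if term in note.lower())

for name, corpus in [("full text", notes),
                     ("concepts only", [extract_concepts(n) for n in notes])]:
    X = CountVectorizer().fit_transform(corpus)
    scores = cross_val_score(LogisticRegression(), X, labels, cv=2)
    print(name, scores.mean())
```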

The 97% Junk Part of Human DNA

Filed under: Bioinformatics,Biomedical,Gene Ontology,Genome,Genomics — Patrick Durusau @ 9:21 am

Researchers from the Gene and Stem Cell Therapy Program at Sydney’s Centenary Institute have confirmed that, far from being “junk,” the 97 per cent of human DNA that does not encode instructions for making proteins can play a significant role in controlling cell development.

And in doing so, the researchers have unravelled a previously unknown mechanism for regulating the activity of genes, increasing our understanding of the way cells develop and opening the way to new possibilities for therapy.

Using the latest gene sequencing techniques and sophisticated computer analysis, a research group led by Professor John Rasko AO and including Centenary’s Head of Bioinformatics, Dr William Ritchie, has shown how particular white blood cells use non-coding DNA to regulate the activity of a group of genes that determines their shape and function. The work is published today in the scientific journal Cell.*

There’s a poke with a sharp stick to any gene ontology.

Roles in associations of genes have suddenly expanded.

Your call:

  1. Wait until a committee can officially name the new roles and parts of the “junk” that play those roles, or
  2. Create names/roles on the fly and merge those with subsequent identifiers on an ongoing basis as our understanding improves.

Any questions?

*Justin J.-L. Wong, William Ritchie, Olivia A. Ebner, Matthias Selbach, Jason W.H. Wong, Yizhou Huang, Dadi Gao, Natalia Pinello, Maria Gonzalez, Kinsha Baidya, Annora Thoeng, Teh-Liane Khoo, Charles G. Bailey, Jeff Holst, John E.J. Rasko. Orchestrated Intron Retention Regulates Normal Granulocyte Differentiation. Cell, 2013; 154 (3): 583 DOI: 10.1016/j.cell.2013.06.052

July 28, 2013

NIH Big Data to Knowledge (BD2K) Initiative [TM Opportunity?]

Filed under: Bioinformatics,Biomedical,Funding — Patrick Durusau @ 3:23 pm

NIH Big Data to Knowledge (BD2K) Initiative by Shar Steed.

From the post:

The National Institutes of Health (NIH) has announced the Centers of Excellence for Big Data Computing in the Biomedical Sciences (U54) funding opportunity announcement, the first in its Big Data to Knowledge (BD2K) Initiative.

The purpose of the BD2K initiative is to help biomedical scientists fully utilize Big Data being generated by research communities. As technology advances, scientists are generating and using large, complex, and diverse datasets, which is making the biomedical research enterprise more data-intensive and data-driven. According to the BD2K website:

[further down in the post]

Data integration: An applicant may propose a Center that will develop efficient and meaningful ways to create connections across data types (i.e., unimodal or multimodal data integration).

That sounds like topic maps, doesn’t it?

At least if we get away from black/white, match one of a set of IRIs or not, type merging practices.

For more details:

A webinar for applicants is scheduled for Thursday, September 12, 2013, from 3 – 4:30 pm EDT. Click here for more information.

Be aware of this workshop:

August 21, 2013 – August 22, 2013
NIH Data Catalogue
Chair:
Francine Berman, Ph.D.

This workshop seeks to identify the least duplicative and burdensome, and most sustainable and scalable method to create and maintain an NIH Data Catalog. An NIH Data Catalog would make biomedical data findable and citable, as PubMed does for scientific publications, and would link data to relevant grants, publications, software, or other relevant resources. The Data Catalog would be integrated with other BD2K initiatives as part of the broad NIH response to the challenges and opportunities of Big Data and seek to create an ongoing dialog with stakeholders and users from the biomedical community.

Contact: BD2Kworkshops@mail.nih.gov

Let’s see: “…least duplicative and burdensome, and most sustainable and scalable method to create and maintain an NIH Data Catalog.”

Recast existing data as RDF with a suitable OWL Ontology. – Duplicative, burdensome, not sustainable or scalable.

Accept all existing data as it exists and write subject identity and merging rules: Non-duplicative, existing systems persist so less burdensome, re-use of existing data = sustainable, only open question is scalability.
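
A minimal sketch of that second option, assuming nothing about NIH systems: leave the source records untouched and declare subject identity rules on the side, merging catalog entries whose identifier sets overlap. All identifiers below are invented.

```python
# A minimal sketch, assuming nothing about NIH systems: source records stay
# as they are, and subject identity rules merge catalog entries whose
# identifier sets overlap. A production version would also handle transitive
# merges (a later record bridging two earlier entries).
def merge_by_identity(records):
    """records: dicts with an 'identifiers' set and a 'sources' list."""
    merged = []
    for rec in records:
        target = next((m for m in merged if m["identifiers"] & rec["identifiers"]), None)
        if target:
            target["identifiers"] |= rec["identifiers"]
            target["sources"].extend(rec["sources"])
        else:
            merged.append({"identifiers": set(rec["identifiers"]),
                           "sources": list(rec["sources"])})
    return merged

# Hypothetical catalog entries describing the same dataset differently.
records = [
    {"identifiers": {"doi:10.9999/example", "dbGaP:phs000000"}, "sources": ["repoA"]},
    {"identifiers": {"dbGaP:phs000000"}, "sources": ["repoB"]},
    {"identifiers": {"doi:10.9999/other"}, "sources": ["repoC"]},
]
print(merge_by_identity(records))  # first two entries collapse into one subject
```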

Sounds like a topic map opportunity to me.

You?

July 3, 2013

CHD@ZJU…

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 9:37 am

CHD@ZJU: a knowledgebase providing network-based research platform on coronary heart disease by Leihong Wu, Xiang Li, Jihong Yang, Yufeng Liu, Xiaohui Fan and Yiyu Cheng. (Database (2013) 2013 : bat047 doi: 10.1093/database/bat047)

From the webpage:

Abstract:

Coronary heart disease (CHD), the leading cause of global morbidity and mortality in adults, has been reported to be associated with hundreds of genes. A comprehensive understanding of the CHD-related genes and their corresponding interactions is essential to advance the translational research on CHD. Accordingly, we construct this knowledgebase, CHD@ZJU, which records CHD-related information (genes, pathways, drugs and references) collected from different resources and through text-mining method followed by manual confirmation. In current release, CHD@ZJU contains 660 CHD-related genes, 45 common pathways and 1405 drugs accompanied with >8000 supporting references. Almost half of the genes collected in CHD@ZJU were novel to other publicly available CHD databases. Additionally, CHD@ZJU incorporated the protein–protein interactions to investigate the cross-talk within the pathways from a multi-layer network view. These functions offered by CHD@ZJU would allow researchers to dissect the molecular mechanism of CHD in a systematic manner and therefore facilitate the research on CHD-related multi-target therapeutic discovery.

Database URL: http://tcm.zju.edu.cn/chd/

The article outlines the construction of CHD@ZJU as follows:


Figure 1.
Procedure for CHD@ZJU construction. CHD-related genes were extracted with text-mining technique and manual confirmation. PPI, pathway and drugs information were then collected from public resources such as KEGG and HPRD. Interactome network of every pathway was constructed based on their corresponding genes and related PPIs, and the whole CHD diseasome network was then constructed with all CHD-related genes. With CHD@ZJU, users could find information related to CHD from gene, pathway and the whole biological network level.

While assisted by computer technology, there is a manual confirmation step that binds all the information together.

May 17, 2013

A self-updating road map of The Cancer Genome Atlas

Filed under: Bioinformatics,Biology,Biomedical,Medical Informatics,RDF,Semantic Web,SPARQL — Patrick Durusau @ 4:33 pm

A self-updating road map of The Cancer Genome Atlas by David E. Robbins, Alexander Grüneberg, Helena F. Deus, Murat M. Tanik and Jonas S. Almeida. (Bioinformatics (2013) 29 (10): 1333-1340. doi: 10.1093/bioinformatics/btt141)

Abstract:

Motivation: Since 2011, The Cancer Genome Atlas’ (TCGA) files have been accessible through HTTP from a public site, creating entirely new possibilities for cancer informatics by enhancing data discovery and retrieval. Significantly, these enhancements enable the reporting of analysis results that can be fully traced to and reproduced using their source data. However, to realize this possibility, a continually updated road map of files in the TCGA is required. Creation of such a road map represents a significant data modeling challenge, due to the size and fluidity of this resource: each of the 33 cancer types is instantiated in only partially overlapping sets of analytical platforms, while the number of data files available doubles approximately every 7 months.

Results: We developed an engine to index and annotate the TCGA files, relying exclusively on third-generation web technologies (Web 3.0). Specifically, this engine uses JavaScript in conjunction with the World Wide Web Consortium’s (W3C) Resource Description Framework (RDF), and SPARQL, the query language for RDF, to capture metadata of files in the TCGA open-access HTTP directory. The resulting index may be queried using SPARQL, and enables file-level provenance annotations as well as discovery of arbitrary subsets of files, based on their metadata, using web standard languages. In turn, these abilities enhance the reproducibility and distribution of novel results delivered as elements of a web-based computational ecosystem. The development of the TCGA Roadmap engine was found to provide specific clues about how biomedical big data initiatives should be exposed as public resources for exploratory analysis, data mining and reproducible research. These specific design elements align with the concept of knowledge reengineering and represent a sharp departure from top-down approaches in grid initiatives such as CaBIG. They also present a much more interoperable and reproducible alternative to the still pervasive use of data portals.

Availability: A prepared dashboard, including links to source code and a SPARQL endpoint, is available at http://bit.ly/TCGARoadmap. A video tutorial is available at http://bit.ly/TCGARoadmapTutorial.

Curious how the granularity of required semantics and the uniformity of the underlying data set impact the choice of semantic approaches?

And does access to data files present different challenges than say access to research publications in the same field?
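
As a hedged illustration of the approach (file metadata held as RDF and retrieved with SPARQL), here is a miniature version using rdflib. The predicates, file names and values are placeholders, not the actual TCGA Roadmap vocabulary; the real endpoint is linked from the dashboard above.

```python
# A hedged, miniature version of "file metadata as RDF, queried with SPARQL".
# The predicates and values are invented placeholders, not the TCGA Roadmap
# vocabulary.
from rdflib import Graph, Literal, Namespace, URIRef

TCGA = Namespace("http://example.org/tcga#")
g = Graph()

for name, disease, platform in [
    ("file1.tar.gz", "OV",  "IlluminaGA_RNASeq"),
    ("file2.tar.gz", "GBM", "HumanMethylation450"),
    ("file3.tar.gz", "OV",  "Genome_Wide_SNP_6"),
]:
    f = URIRef(f"http://example.org/files/{name}")
    g.add((f, TCGA.diseaseStudy, Literal(disease)))
    g.add((f, TCGA.platform, Literal(platform)))

# Which files belong to the ovarian cancer (OV) study, and on what platform?
results = g.query("""
    PREFIX tcga: <http://example.org/tcga#>
    SELECT ?file ?platform WHERE {
        ?file tcga:diseaseStudy "OV" ;
              tcga:platform ?platform .
    }
""")
for file_uri, platform in results:
    print(file_uri, platform)
```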

April 28, 2013

Scientific Lenses over Linked Data… [Operational Equivalence]

Scientific Lenses over Linked Data: An approach to support task specific views of the data. A vision. by Christian Brenninkmeijer, Chris Evelo, Carole Goble, Alasdair J G Gray, Paul Groth, Steve Pettifer, Robert Stevens, Antony J Williams, and Egon L Willighagen.

Abstract:

Within complex scientific domains such as pharmacology, operational equivalence between two concepts is often context-, user- and task-specific. Existing Linked Data integration procedures and equivalence services do not take the context and task of the user into account. We present a vision for enabling users to control the notion of operational equivalence by applying scientific lenses over Linked Data. The scientific lenses vary the links that are activated between the datasets which affects the data returned to the user.

Two additional quotes from this paper should convince you of the importance of this work:

We aim to support users in controlling and varying their view of the data by applying a scientific lens which govern the notions of equivalence applied to the data. Users will be able to change their lens based on the task and role they are performing rather than having one fixed lens. To support this requirement, we propose an approach that applies context dependent sets of equality links. These links are stored in a stand-off fashion so that they are not intermingled with the datasets. This allows for multiple, context-dependent, linksets that can evolve without impact on the underlying datasets and support differing opinions on the relationships between data instances. This flexibility is in contrast to both Linked Data and traditional data integration approaches. We look at the role personae can play in guiding the nature of relationships between the data resources and the desired affects of applying scientific lenses over Linked Data.

and,

Within scientific datasets it is common to find links to the “equivalent” record in another dataset. However, there is no declaration of the form of the relationship. There is a great deal of variation in the notion of equivalence implied by the links both within a dataset’s usage and particularly across datasets, which degrades the quality of the data. The scientific user personae have very different needs about the notion of equivalence that should be applied between datasets. The users need a simple mechanism by which they can change the operational equivalence applied between datasets. We propose the use of scientific lenses.

Obvious questions:

Does your topic map software support multiple operational equivalences?

Does your topic map interface enable users to choose “lenses” (I like lenses better than roles) to view equivalence?

Does your topic map software support declaring the nature of equivalence?

I first saw this in the slide deck: Scientific Lenses: Supporting Alternative Views of the Data by Alasdair J G Gray at: 4th Open PHACTS Community Workshop.

BTW, the notion of equivalence being represented by “links” reminds me of a comment Peter Neubauer (Neo4j) once made to me, saying that equivalence could be modeled as edges. Imagine typing equivalence edges. Will have to think about that some more.
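
Here is a minimal sketch of that idea: equivalence links stored stand-off, each carrying a relationship type, with a "lens" that activates only the types a given task accepts. The dataset identifiers and link types are made up for illustration.

```python
# A minimal sketch of "equivalence as typed edges": stand-off linksets whose
# links carry a relationship type, and a lens that activates only the types a
# given task accepts. Identifiers and link types are made up for illustration.
linkset = [
    ("chembl:CHEMBL25", "drugbank:DB00945", "exact-structure-match"),
    ("chembl:CHEMBL25", "pubchem:CID2244",  "same-parent-compound"),
    ("chembl:CHEMBL25", "kegg:C01405",      "same-active-moiety"),
]

LENSES = {
    "medicinal-chemist": {"exact-structure-match"},
    "pharmacologist":    {"exact-structure-match", "same-active-moiety"},
}

def equivalents(entity, lens, links=linkset):
    """Return the identifiers treated as equivalent to `entity` under `lens`."""
    allowed = LENSES[lens]
    out = set()
    for a, b, rel in links:
        if rel in allowed and entity in (a, b):
            out.add(b if entity == a else a)
    return out

print(equivalents("chembl:CHEMBL25", "medicinal-chemist"))
print(equivalents("chembl:CHEMBL25", "pharmacologist"))
```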

4th Open PHACTS Community Workshop (slides) [Operational Equivalence]

Filed under: Bioinformatics,Biomedical,Drug Discovery,Linked Data,Medical Informatics — Patrick Durusau @ 12:24 pm

4th Open PHACTS Community Workshop : Using the power of Open PHACTS

From the post:

The fourth Open PHACTS Community Workshop was held at Burlington House in London on April 22 and 23, 2013. The Workshop focussed on “Using the Power of Open PHACTS” and featured the public release of the Open PHACTS application programming interface (API) and the first Open PHACTS example app, ChemBioNavigator.

The first day featured talks describing the data accessible via the Open PHACTS Discovery Platform and technical aspects of the API. The use of the API by example applications ChemBioNavigator and PharmaTrek was outlined, and the results of the Accelrys Pipeline Pilot Hackathon discussed.

The second day involved discussion of Open PHACTS sustainability and plans for the successor organisation, the Open PHACTS Foundation. The afternoon was attended by those keen to further discuss the potential of the Open PHACTS API and the future of Open PHACTS.

During talks, especially those detailing the Open PHACTS API, a good number of signup requests to the API via dev.openphacts.org were received. The hashtag #opslaunch was used to follow reactions to the workshop on Twitter (see storify), and showed the response amongst attendees to be overwhelmingly positive.

This summary is followed by slides from the two days of presentations.

Not like being there but still quite useful.

As a matter of fact, I found a lead on “operational equivalence” with this data set. More to follow in a separate post.

April 7, 2013

Open PHACTS

Open PHACTS – Open Pharmacological Space

From the homepage:

Open PHACTS is building an Open Pharmacological Space in a 3-year knowledge management project of the Innovative Medicines Initiative (IMI), a unique partnership between the European Community and the European Federation of Pharmaceutical Industries and Associations (EFPIA).

The project is due to end in March 2014, and aims to deliver a sustainable service to continue after the project funding ends. The project consortium consists of leading academics in semantics, pharmacology and informatics, driven by solid industry business requirements: 28 partners, including 9 pharmaceutical companies and 3 biotechs.

Source code has just appeared on GitHub: OpenPHACTS.

Important to different communities for different reasons. My interest isn’t the same as BigPharma. 😉

A project to watch as they navigate the thickets of vocabularies, ontologies and other semantically diverse information sources.

April 4, 2013

Visualizing Biological Data Using the SVGmap Browser

Filed under: Biology,Biomedical,Graphics,Mapping,Maps,SVG,Visualization — Patrick Durusau @ 1:26 pm

Visualizing Biological Data Using the SVGmap Browser by Casey Bergman.

From the post:

Early in 2012, Nuria Lopez-Bigas‘ Biomedical Genomics Group published a paper in Bioinformatics describing a very interesting tool for visualizing biological data in a spatial context called SVGmap. The basic idea behind SVGMap is (like most good ideas) quite straightforward – to plot numerical data on a pre-defined image to give biological context to the data in an easy-to-interpret visual form.

To do this, SVGmap takes as input an image in Scalable Vector Graphics (SVG) format where elements of the image are tagged with an identifier, plus a table of numerical data with values assigned to the same identifier as in the elements of the image. SVGMap then integrates these files using either a graphical user interface that runs in standard web browser or a command line interface application that runs in your terminal, allowing the user to display color-coded numerical data on the original image. The overall framework of SVGMap is shown below in an image taken from a post on the Biomedical Genomics Group blog.


We’ve been using SVGMap over the last year to visualize tissue-specific gene expression data in Drosophila melanogaster from the FlyAtlas project, which comes as one of the pre-configured “experiments” in the SVGMap web application.

More recently, we’ve also been using the source distribution of SVGMap to display information about the insertion preferences of transposable elements in a tissue-specific context, which has required installing and configuring a local instance of SVGMap and running it via the browser. The documentation for SVGMap is good enough to do this on your own, but it took a while for us to get a working instance the first time around. We ran into the same issues again the second time, so I thought I’d write up my notes for future reference and to help others get SVGMap up and running as fast as possible.
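
To make the idea concrete, here is a rough sketch (not SVGMap's own code) of coloring SVG elements, matched by id, from a table of numeric values. The tissue ids and values are invented.

```python
# A rough sketch of the SVGMap idea, not SVGMap's own code: color SVG elements,
# matched by id, according to numeric values from a table.
import csv
import io
import xml.etree.ElementTree as ET

svg_text = """<svg xmlns="http://www.w3.org/2000/svg">
  <rect id="midgut" width="40" height="20"/>
  <rect id="brain" x="50" width="40" height="20"/>
</svg>"""

table = io.StringIO("id,value\nmidgut,0.9\nbrain,0.2\n")
values = {row["id"]: float(row["value"]) for row in csv.DictReader(table)}

def to_color(v):
    """Map a 0..1 value to a white-to-red fill."""
    shade = int(255 * (1 - v))
    return f"#ff{shade:02x}{shade:02x}"

ET.register_namespace("", "http://www.w3.org/2000/svg")
root = ET.fromstring(svg_text)
for elem in root.iter():
    elem_id = elem.get("id")
    if elem_id in values:
        elem.set("fill", to_color(values[elem_id]))

print(ET.tostring(root, encoding="unicode"))
```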

Topic map interfaces aren’t required to take a particular form.

A drawing of a fly could be a topic map interface.

Useful for people studying flies, less useful (maybe) if you are mapping Lady Gaga discography.

What interface do you want to create for a topic map?

March 16, 2013

MetaNetX.org…

Filed under: Bioinformatics,Biomedical,Genomics,Modeling,Semantic Diversity — Patrick Durusau @ 1:42 pm

MetaNetX.org: a website and repository for accessing, analysing and manipulating metabolic networks by Mathias Ganter, Thomas Bernard, Sébastien Moretti, Joerg Stelling and Marco Pagni. (Bioinformatics (2013) 29 (6): 815-816. doi: 10.1093/bioinformatics/btt036)

Abstract:

MetaNetX.org is a website for accessing, analysing and manipulating genome-scale metabolic networks (GSMs) as well as biochemical pathways. It consistently integrates data from various public resources and makes the data accessible in a standardized format using a common namespace. Currently, it provides access to hundreds of GSMs and pathways that can be interactively compared (two or more), analysed (e.g. detection of dead-end metabolites and reactions, flux balance analysis or simulation of reaction and gene knockouts), manipulated and exported. Users can also upload their own metabolic models, choose to automatically map them into the common namespace and subsequently make use of the website’s functionality.

http://metanetx.org.

The authors are addressing a familiar problem:

Genome-scale metabolic networks (GSMs) consist of compartmentalized reactions that consistently combine biochemical, genetic and genomic information. When also considering a biomass reaction and both uptake and secretion reactions, GSMs are often used to study genotype–phenotype relationships, to direct new discoveries and to identify targets in metabolic engineering (Karr et al., 2012). However, a major difficulty in GSM comparisons and reconstructions is to integrate data from different resources with different nomenclatures and conventions for both metabolites and reactions. Hence, GSM consolidation and comparison may be impossible without detailed biological knowledge and programming skills. (emphasis added)

For which they propose an uncommon solution:

MetaNetX.org is implemented as a user-friendly and self-explanatory website that handles all user requests dynamically (Fig. 1a). It allows a user to access a collection of hundreds of published models, browse and select subsets for comparison and analysis, upload or modify new models and export models in conjunction with their results. Its functionality is based on a common namespace defined by MNXref (Bernard et al., 2012). In particular, all repository or user uploaded models are automatically translated with or without compartments into the common namespace; small deviations from the original model are possible due to the automatic reconciliation steps implemented by Bernard et al. (2012). However, a user can choose not to translate his model but still make use of the website’s functionalities. Furthermore, it is possible to augment the given reaction set by user-defined reactions, for example, for model augmentation.
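
A minimal sketch of the "common namespace" step, with invented mapping entries standing in for the real (and far larger, curated) MNXref tables; the MNX-style ids below are hypothetical.

```python
# A minimal sketch of namespace translation: rewrite metabolite identifiers
# from different source nomenclatures into shared MNX-style ids via a
# reconciliation table. The MNX ids here are hypothetical placeholders.
MNXREF = {
    ("kegg", "C00031"):      "MNX:GLC",   # D-glucose
    ("chebi", "CHEBI:4167"): "MNX:GLC",
    ("kegg", "C00002"):      "MNX:ATP",   # ATP
}

def to_common_namespace(reactions):
    """Rewrite (namespace, id) metabolite references into common ids, keeping
    originals when no mapping is known (mirroring the 'translate or not' choice)."""
    return [[MNXREF.get(m, m) for m in mets] for mets in reactions]

model = [[("kegg", "C00031"), ("kegg", "C00002")],
         [("chebi", "CHEBI:4167"), ("bigg", "glc__D")]]
print(to_common_namespace(model))
```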

The bioinformatics community recognizes the intellectual poverty of lock step models.

Wonder when the intelligence community is going to have that “a ha” moment?

March 11, 2013

The Annotation-enriched non-redundant patent sequence databases [Curation vs. Search]

Filed under: Bioinformatics,Biomedical,Marketing,Medical Informatics,Patents,Topic Maps — Patrick Durusau @ 2:01 pm

The Annotation-enriched non-redundant patent sequence databases by Weizhong Li, Bartosz Kondratowicz, Hamish McWilliam, Stephane Nauche and Rodrigo Lopez.

Not a real promising title, is it? 😉 The reason I cite it here is that, thanks to curation, the database is “non-redundant.”

Try searching for some of these sequences at the USPTO and compare the results.

The power of curation will be immediately obvious.

Abstract:

The EMBL-European Bioinformatics Institute (EMBL-EBI) offers public access to patent sequence data, providing a valuable service to the intellectual property and scientific communities. The non-redundant (NR) patent sequence databases comprise two-level nucleotide and protein sequence clusters (NRNL1, NRNL2, NRPL1 and NRPL2) based on sequence identity (level-1) and patent family (level-2). Annotation from the source entries in these databases is merged and enhanced with additional information from the patent literature and biological context. Corrections in patent publication numbers, kind-codes and patent equivalents significantly improve the data quality. Data are available through various user interfaces including web browser, downloads via FTP, SRS, Dbfetch and EBI-Search. Sequence similarity/homology searches against the databases are available using BLAST, FASTA and PSI-Search. In this article, we describe the data collection and annotation and also outline major changes and improvements introduced since 2009. Apart from data growth, these changes include additional annotation for singleton clusters, the identifier versioning for tracking entry change and the entry mappings between the two-level databases.

Database URL: http://www.ebi.ac.uk/patentdata/nr/

Topic maps are curated data. Which one do you prefer?

February 21, 2013

NetGestalt for Data Visualization in the Context of Pathways

Filed under: Bioinformatics,Biomedical,Graphs,Networks,Visualization — Patrick Durusau @ 7:06 pm

NetGestalt for Data Visualization in the Context of Pathways by Stephen Turner.

From the post:

Many of you may be familiar with WebGestalt, a wonderful web utility developed by Bing Zhang at Vanderbilt for doing basic gene-set enrichment analyses. Last year, we invited Bing to speak at our annual retreat for the Vanderbilt Graduate Program in Human Genetics, and he did not disappoint! Bing walked us through his new tool called NetGestalt.

NetGestalt provides users with the ability to overlay large-scale experimental data onto biological networks. Data are loaded using continuous and binary tracks that can contain either single or multiple lines of data (called composite tracks). Continuous tracks could be gene expression intensities from microarray data or any other quantitative measure that can be mapped to the genome. Binary tracks are usually insertion/deletion regions, or called regions like ChIP peaks. NetGestalt extends many of the features of WebGestalt, including enrichment analysis for modules within a biological network, and provides easy ways to visualize the overlay of multiple tracks with Venn diagrams.

Stephen also points to documentation and video tutorials.

NetGestalt uses gene symbol as the gene identifier. Data that uses other gene identifiers must be mapped to gene symbols before uploading. (Manual, page 4)

An impressive alignment of data sources even with the restriction to gene symbols.
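
For anyone facing that pre-upload step, here is a minimal sketch of mapping other gene identifiers onto gene symbols; the mapping table is a stand-in for a real Entrez/HGNC download.

```python
# A minimal sketch of the pre-upload step NetGestalt requires: map whatever
# gene identifiers a dataset uses onto gene symbols before building the track.
# The mapping table is a stand-in for a real Entrez/HGNC download.
ID_TO_SYMBOL = {
    "ENSG00000012048": "BRCA1",   # Ensembl gene id
    "672": "BRCA1",               # Entrez gene id
    "ENSG00000141510": "TP53",
}

def remap_track(rows):
    """rows: (gene_id, value) pairs; returns (symbol, value) pairs, setting
    aside ids that cannot be mapped rather than uploading them."""
    mapped, unmapped = [], []
    for gene_id, value in rows:
        symbol = ID_TO_SYMBOL.get(gene_id)
        if symbol:
            mapped.append((symbol, value))
        else:
            unmapped.append(gene_id)
    return mapped, unmapped

track = [("ENSG00000012048", 7.2), ("672", 6.9), ("ENSG00000999999", 1.0)]
print(remap_track(track))
```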

February 15, 2013

New Query Tool Searches EHR Unstructured Data

Filed under: Biomedical,Medical Informatics,Searching,Unstructured Data — Patrick Durusau @ 1:32 pm

New Query Tool Searches EHR Unstructured Data by Ken Terry.

From the post:

A new electronic health record “intelligence platform” developed at Massachusetts General Hospital (MGH) and its parent organization, Partners Healthcare, is being touted as a solution to the problem of searching structured and unstructured data in EHRs for clinically useful information.

QPID Inc., a new firm spun off from Partners and backed by venture capital funds, is now selling its Web-based search engine to other healthcare organizations. Known as the Queriable Patient Inference Dossier (QPID), the tool is designed to allow clinicians to make ad hoc queries about particular patients and receive the desired information within seconds.

Today, 80% of stored health information is believed to be unstructured. It is trapped in free text such as physician notes and reports, discharge summaries, scanned documents and e-mail messages. One reason for the prevalence of unstructured data is that the standard methods for entering structured data, such as drop-down menus and check boxes, don’t fit into traditional physician workflow. Many doctors still dictate their notes, and the transcription goes into the EHR as free text.

and,

QPID, which was first used in the radiology department of MGH in 2005, incorporates an EHR search engine, a library of search queries based on clinical concepts, and a programming system for application and query development. When a clinician submits a query, QPID presents the desired data in a “dashboard” format that includes abnormal results, contraindications and other alerts, Doyle said.

The core of the system is a form of natural language processing (NLP) based on a library encompassing “thousands and thousands” of clinical concepts, he said. Because it was developed collaboratively by physicians and scientists, QPID identifies medical concepts imbedded in unstructured data more effectively than do other NLP systems from IBM, Nuance and M*Modal, Doyle maintained.

Take away points for data search/integration solutions:

  1. 80% of stored health information (need)
  2. traditional methods for data entry….don’t fit into traditional physician workflow (user requirement)
  3. developed collaboratively by physicians and scientists (semantics originate with users, not top down)

I am interested in how (or whether) QPID conforms to local medical terminology practices.

To duplicate their earlier success, conforming to local terminology practices is critical.

If for no other reason it will give physicians and other health professionals “ownership” of the vocabulary and hence faith in the system.

Using molecular networks to assess molecular similarity

Systems chemistry: Using molecular networks to assess molecular similarity by Bailey Fallon.

From the post:

In new research published in Journal of Systems Chemistry, Sijbren Otto and colleagues have provided the first experimental approach towards molecular networks that can predict bioactivity based on an assessment of molecular similarity.

Molecular similarity is an important concept in drug discovery. Molecules that share certain features such as shape, structure or hydrogen bond donor/acceptor groups may have similar properties that make them common to a particular target. Assessment of molecular similarity has so far relied almost exclusively on computational approaches, but Dr Otto reasoned that a measure of similarity might be obtained by interrogating the molecules in solution experimentally.

Important work for drug discovery but there are semantic lessons here as well:

Tests for similarity/sameness are domain specific.

Which means there are no universal tests for similarity/sameness.

Lacking universal tests for similarity/sameness, we should focus on developing documented and domain specific tests for similarity/sameness.

Domain specific tests provide quicker ROI than less useful and doomed universal solutions.

Documented domain specific tests may, no guarantees, enable us to find commonalities between domain measures of similarity/sameness.

But our conclusions will be based on domain experience and not projection from our domain onto others, less well known domains.

February 11, 2013

A Tale of Five Languages

Filed under: Biomedical,Medical Informatics,SNOMED — Patrick Durusau @ 10:58 am

Evaluating standard terminologies for encoding allergy information by Foster R Goss, Li Zhou, Joseph M Plasek, Carol Broverman, George Robinson, Blackford Middleton, Roberto A Rocha. (J Am Med Inform Assoc doi:10.1136/amiajnl-2012-000816)

Abstract:

Objective Allergy documentation and exchange are vital to ensuring patient safety. This study aims to analyze and compare various existing standard terminologies for representing allergy information.

Methods Five terminologies were identified, including the Systemized Nomenclature of Medical Clinical Terms (SNOMED CT), National Drug File–Reference Terminology (NDF-RT), Medical Dictionary for Regulatory Activities (MedDRA), Unique Ingredient Identifier (UNII), and RxNorm. A qualitative analysis was conducted to compare desirable characteristics of each terminology, including content coverage, concept orientation, formal definitions, multiple granularities, vocabulary structure, subset capability, and maintainability. A quantitative analysis was also performed to compare the content coverage of each terminology for (1) common food, drug, and environmental allergens and (2) descriptive concepts for common drug allergies, adverse reactions (AR), and no known allergies.

Results Our qualitative results show that SNOMED CT fulfilled the greatest number of desirable characteristics, followed by NDF-RT, RxNorm, UNII, and MedDRA. Our quantitative results demonstrate that RxNorm had the highest concept coverage for representing drug allergens, followed by UNII, SNOMED CT, NDF-RT, and MedDRA. For food and environmental allergens, UNII demonstrated the highest concept coverage, followed by SNOMED CT. For representing descriptive allergy concepts and adverse reactions, SNOMED CT and NDF-RT showed the highest coverage. Only SNOMED CT was capable of representing unique concepts for encoding no known allergies.

Conclusions The proper terminology for encoding a patient’s allergy is complex, as multiple elements need to be captured to form a fully structured clinical finding. Our results suggest that while gaps still exist, a combination of SNOMED CT and RxNorm can satisfy most criteria for encoding common allergies and provide sufficient content coverage.

Interesting article but some things that may not be apparent to the casual reader:

MedDRA:

The Medical Dictionary for Regulatory Activities (MedDRA) was developed by the International Conference on Harmonisation (ICH) and is owned by the International Federation of Pharmaceutical Manufacturers and Associations (IFPMA) acting as trustee for the ICH steering committee. The Maintenance and Support Services Organization (MSSO) serves as the repository, maintainer, and distributor of MedDRA as well as the source for the most up-to-date information regarding MedDRA and its application within the biopharmaceutical industry and regulators. (source: http://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/MDR/index.html)

MedDRA has a metathesaurus with translations into: Czech, Dutch, French, German, Hungarian, Italian, Japanese, Portuguese, and Spanish.

Unique Ingredient Identifier (UNII)

The overall purpose of the joint FDA/USP Substance Registration System (SRS) is to support health information technology initiatives by generating unique ingredient identifiers (UNIIs) for substances in drugs, biologics, foods, and devices. The UNII is a non-proprietary, free, unique, unambiguous, non-semantic, alphanumeric identifier based on a substance’s molecular structure and/or descriptive information.

The UNII may be found in:

  • NLM’s Unified Medical Language System (UMLS)
  • National Cancer Institutes Enterprise Vocabulary Service
  • USP Dictionary of USAN and International Drug Names (future)
  • FDA Data Standards Council website
  • VA National Drug File Reference Terminology (NDF-RT)
  • FDA Inactive Ingredient Query Application

(source: http://www.fda.gov/ForIndustry/DataStandards/SubstanceRegistrationSystem-UniqueIngredientIdentifierUNII/)

National Drug File – Reference Terminology (NDF-RT)

The National Drug File – Reference Terminology (NDF-RT) is produced by the U.S. Department of Veterans Affairs, Veterans Health Administration (VHA).

NDF-RT combines the NDF hierarchical drug classification with a multi-category reference model. The categories are:

  1. Cellular or Molecular Interactions [MoA]
  2. Chemical Ingredients [Chemical/Ingredient]
  3. Clinical Kinetics [PK]
  4. Diseases, Manifestations or Physiologic States [Disease/Finding]
  5. Dose Forms [Dose Form]
  6. Pharmaceutical Preparations
  7. Physiological Effects [PE]
  8. Therapeutic Categories [TC]
  9. VA Drug Interactions [VA Drug Interaction]

(source: http://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/NDFRT/)

MedDRA, UNII, and NDF-RT have been in use for years, MedDRA internationally in multiple languages. An uncounted number of medical records, histories and no doubt publications rely upon these vocabularies.

Assume the conclusion holds: SNOMED CT with RxNorm (which links drug vocabularies) provides the best coverage for “encoding common allergies.”

A critical question remains:

How to access medical records using other terminologies?

Recall from the adventures of owl:sameAs (The Semantic Web Is Failing — But Why? (Part 5)) that any single string identifier is subject to multiple interpretations. Interpretations that can only be disambiguated by additional information.

You might present a search engine with string-to-string mappings, but those are inherently less robust and harder to maintain than richer mappings.

The sort of richer mappings that are supported by topic maps.
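
A minimal sketch of the contrast, with placeholder codes rather than verified SNOMED CT / MedDRA entries: a bare string-to-string mapping on one side, and a mapping that records its relation type, basis and scope on the other.

```python
# A minimal sketch contrasting a bare string-to-string mapping with a "richer"
# mapping that records the basis of the equivalence, so it can be checked,
# scoped, or revised later. Codes shown are placeholders, not verified
# SNOMED CT / MedDRA entries.
flat = {"10013700": "91936005"}   # MedDRA code -> SNOMED CT code, no context

rich = [{
    "left":  {"vocabulary": "MedDRA",    "code": "10013700", "label": "Drug allergy"},
    "right": {"vocabulary": "SNOMED CT", "code": "91936005", "label": "Allergy to penicillin"},
    "relation": "narrower-than",          # not equivalence!
    "basis": "manual review, 2013-02",
    "scope": "allergy-list encoding only",
}]

def usable_for(mappings, task):
    """Keep only mappings whose recorded scope covers the task at hand
    (a crude substring test, good enough for the sketch)."""
    return [m for m in mappings if task in m["scope"] or m["scope"] == "any"]

print(usable_for(rich, "allergy-list encoding"))
```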

February 9, 2013

‘What’s in the NIDDK CDR?’…

Filed under: Bioinformatics,Biomedical,Search Interface,Searching — Patrick Durusau @ 8:21 pm

‘What’s in the NIDDK CDR?’—public query tools for the NIDDK central data repository by Nauqin Pan, et al., (Database (2013) 2013 : bas058 doi: 10.1093/database/bas058)

Abstract:

The National Institute of Diabetes and Digestive Disease (NIDDK) Central Data Repository (CDR) is a web-enabled resource available to researchers and the general public. The CDR warehouses clinical data and study documentation from NIDDK funded research, including such landmark studies as The Diabetes Control and Complications Trial (DCCT, 1983–93) and the Epidemiology of Diabetes Interventions and Complications (EDIC, 1994–present) follow-up study which has been ongoing for more than 20 years. The CDR also houses data from over 7 million biospecimens representing 2 million subjects. To help users explore the vast amount of data stored in the NIDDK CDR, we developed a suite of search mechanisms called the public query tools (PQTs). Five individual tools are available to search data from multiple perspectives: study search, basic search, ontology search, variable summary and sample by condition. PQT enables users to search for information across studies. Users can search for data such as number of subjects, types of biospecimens and disease outcome variables without prior knowledge of the individual studies. This suite of tools will increase the use and maximize the value of the NIDDK data and biospecimen repositories as important resources for the research community.

Database URL: https://www.niddkrepository.org/niddk/home.do

I would like to tell you more about this research, since “[t]he National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) is part of the National Institutes of Health (NIH) and the U.S. Department of Health and Human Services” (that’s a direct quote) and so doesn’t claim copyright on its publications.

Unfortunately, the NIDDK published this paper in the Oxford journal Database, which does believe in restricting access to publicly funded research.

Do visit the search interface to see what you think about it.

Not quite the same as curated content but an improvement over raw string matching.

February 3, 2013

ToxPi GUI [Data Recycling]

Filed under: Bioinformatics,Biomedical,Integration,Medical Informatics,Subject Identity — Patrick Durusau @ 6:57 pm

ToxPi GUI: an interactive visualization tool for transparent integration of data from diverse sources of evidence by David M. Reif, Myroslav Sypa, Eric F. Lock, Fred A. Wright, Ander Wilson, Tommy Cathey, Richard R. Judson and Ivan Rusyn. (Bioinformatics (2013) 29 (3): 402-403. doi: 10.1093/bioinformatics/bts686)

Abstract:

Motivation: Scientists and regulators are often faced with complex decisions, where use of scarce resources must be prioritized using collections of diverse information. The Toxicological Prioritization Index (ToxPi™) was developed to enable integration of multiple sources of evidence on exposure and/or safety, transformed into transparent visual rankings to facilitate decision making. The rankings and associated graphical profiles can be used to prioritize resources in various decision contexts, such as testing chemical toxicity or assessing similarity of predicted compound bioactivity profiles. The amount and types of information available to decision makers are increasing exponentially, while the complex decisions must rely on specialized domain knowledge across multiple criteria of varying importance. Thus, the ToxPi bridges a gap, combining rigorous aggregation of evidence with ease of communication to stakeholders.

Results: An interactive ToxPi graphical user interface (GUI) application has been implemented to allow straightforward decision support across a variety of decision-making contexts in environmental health. The GUI allows users to easily import and recombine data, then analyze, visualize, highlight, export and communicate ToxPi results. It also provides a statistical metric of stability for both individual ToxPi scores and relative prioritized ranks.

Availability: The ToxPi GUI application, complete user manual and example data files are freely available from http://comptox.unc.edu/toxpi.php.

Contact: reif.david@gmail.com

Very cool!

Although, like having a Ford automobile in any color so long as the color was black, you can integrate any data source, so long as the format is CSV and the values are numbers. Subject to other restrictions as well.

That’s an observation, not a criticism.

The application serves a purpose within a domain and does not “integrate” information in the sense of a topic map.

But a topic map could recycle its data to add other identifications and properties. Without having to re-write this application or its data.

Once curated, data should be re-used, not re-created/curated.

Topic maps give you more bang for your data buck.

January 28, 2013

PoSSuM

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 1:20 pm

PoSSuM : Pocket Similarity Searching using Multi-Sketches

From the webpage:

Today, vast amounts of protein-small molecule binding sites can be found in the Protein Data Bank (PDB). Exhaustive comparison of them is computationally demanding, but useful in the prediction of protein functions and drug discovery. We proposed a tremendously fast algorithm called “SketchSort” that enables the enumeration of similar pairs in a huge number of protein-ligand binding sites. We conducted all-pair similarity searches for 3.4 million known and potential binding sites using the proposed method and discovered over 24 million similar pairs of binding sites. We present the results as a relational database Pocket Similarity Search using Multiple-Sketches (PoSSuM), which includes all the discovered pairs with annotations of various types (e.g., CATH, SCOP, EC number, Gene ontology). PoSSuM enables rapid exploration of similar binding sites among structures with different global folds as well as similar ones. Moreover, PoSSuM is useful for predicting the binding ligand for unbound structures. Basically, the users can search similar binding pockets using two search modes:

i) “Search K” is useful for finding similar binding sites for a known ligand-binding site. Post a known ligand-binding site (a pair of “PDB ID” and “HET code”) in the PDB, and PoSSuM will search similar sites for the query site.

ii) “Search P” is useful for predicting ligands that potentially bind to a structure of interest. Post a known protein structure (PDB ID) in the PDB, and PoSSuM will search similar known-ligand binding sites for the query structure.

Obviously useful for the bioinformatics crowd but relevant for topic maps as well.

In topic map terminology, the searches are for associations with a known role player in a particular role, leaving the other role player unspecified.

It does not define or seek an exact match but provides the user with data that may help them make a match determination.
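
In code, that topic-map reading might look like the following sketch; the binding-site/ligand pairs are invented, not PoSSuM data.

```python
# A minimal sketch of associations with typed roles, queried by fixing one
# role player and leaving the other open. The site/ligand pairs are invented.
associations = [
    {"type": "binds", "roles": {"site": "PDB:1abc/pocket1", "ligand": "HET:ATP"}},
    {"type": "binds", "roles": {"site": "PDB:2xyz/pocket3", "ligand": "HET:ATP"}},
    {"type": "binds", "roles": {"site": "PDB:1abc/pocket1", "ligand": "HET:ADP"}},
]

def players(assocs, assoc_type, fixed_role, fixed_player, open_role):
    """Return candidate players for `open_role`, given one known role player."""
    return [a["roles"][open_role] for a in assocs
            if a["type"] == assoc_type and a["roles"].get(fixed_role) == fixed_player]

# Which ligands are associated with this binding site? (roughly "Search P")
print(players(associations, "binds", "site", "PDB:1abc/pocket1", "ligand"))
# Which sites bind ATP? (roughly "Search K")
print(players(associations, "binds", "ligand", "HET:ATP", "site"))
```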

…Everything You Always Wanted to Know About Genes

Filed under: Bioinformatics,Biomedical,Genomics — Patrick Durusau @ 1:18 pm

Toward a New Model of the Cell: Everything You Always Wanted to Know About Genes

From the post:

Turning vast amounts of genomic data into meaningful information about the cell is the great challenge of bioinformatics, with major implications for human biology and medicine. Researchers at the University of California, San Diego School of Medicine and colleagues have proposed a new method that creates a computational model of the cell from large networks of gene and protein interactions, discovering how genes and proteins connect to form higher-level cellular machinery.

“Our method creates ontology, or a specification of all the major players in the cell and the relationships between them,” said first author Janusz Dutkowski, PhD, postdoctoral researcher in the UC San Diego Department of Medicine. It uses knowledge about how genes and proteins interact with each other and automatically organizes this information to form a comprehensive catalog of gene functions, cellular components, and processes.

“What’s new about our ontology is that it is created automatically from large datasets. In this way, we see not only what is already known, but also potentially new biological components and processes — the bases for new hypotheses,” said Dutkowski.

Originally devised by philosophers attempting to explain the nature of existence, ontologies are now broadly used to encapsulate everything known about a subject in a hierarchy of terms and relationships. Intelligent information systems, such as iPhone’s Siri, are built on ontologies to enable reasoning about the real world. Ontologies are also used by scientists to structure knowledge about subjects like taxonomy, anatomy and development, bioactive compounds, disease and clinical diagnosis.

A Gene Ontology (GO) exists as well, constructed over the last decade through a joint effort of hundreds of scientists. It is considered the gold standard for understanding cell structure and gene function, containing 34,765 terms and 64,635 hierarchical relations annotating genes from more than 80 species.

“GO is very influential in biology and bioinformatics, but it is also incomplete and hard to update based on new data,” said senior author Trey Ideker, PhD, chief of the Division of Genetics in the School of Medicine and professor of bioengineering in UC San Diego’s Jacobs School of Engineering.

The conclusion to A gene ontology inferred from molecular networks (Janusz Dutkowski, Michael Kramer, Michal A Surma, Rama Balakrishnan, J Michael Cherry, Nevan J Krogan & Trey Ideker, Nature Biotechnology 31, 38–45 (2013) doi:10.1038/nbt.2463) illustrates a difference between ontology in the GO sense and that produced by the authors:

The research reported in this manuscript raises the possibility that, given the appropriate tools, ontologies might evolve over time with the addition of each new network map or high-throughput experiment that is published. More importantly, it enables a philosophical shift in bioinformatic analysis, from a regime in which the ontology is viewed as gold standard to one in which it is the major result. (emphasis added)

Ontology as representing reality as opposed to declaring it.

That is a novel concept.
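
As a rough illustration of deriving a term hierarchy from interaction data, here is a minimal sketch using ordinary hierarchical clustering. This is not the authors' method, and the gene names and interaction scores are invented:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree
from scipy.spatial.distance import squareform

genes = ["geneA", "geneB", "geneC", "geneD"]

# Invented pairwise interaction strengths (higher = more strongly connected).
similarity = np.array([
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.1],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
])

# Turn similarities into distances and cluster; the resulting dendrogram plays
# the role of a data-driven hierarchy of "terms" (clusters) over the genes.
distance = 1.0 - similarity
tree = to_tree(linkage(squareform(distance), method="average"))

def print_terms(node, depth=0):
    """Print each node as a 'term' containing its member genes."""
    members = [genes[i] for i in node.pre_order()]
    label = "gene: " if node.is_leaf() else "term: "
    print("  " * depth + label + ", ".join(members))
    if not node.is_leaf():
        print_terms(node.get_left(), depth + 1)
        print_terms(node.get_right(), depth + 1)

print_terms(tree)
```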

January 22, 2013

BioNLP-ST 2013

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 2:42 pm

BioNLP-ST 2013

Dates:

Training Data Release 12:00 IDLW, 17 Jan. 2013
Test Data Release 22 Mar. 2013
Result Submission 29 Mar. 2013
BioNLP'13 Workshop 8-9 Aug. 2013

From the website:

The BioNLP Shared Task (BioNLP-ST) series represents a community-wide trend in text-mining for biology toward fine-grained information extraction (IE). The two previous events, BioNLP-ST 2009 and 2011, attracted wide attention, with over 30 teams submitting final results. The tasks and their data have since served as the basis of numerous studies, released event extraction systems, and published datasets. The upcoming BioNLP-ST 2013 follows the general outline and goals of the previous tasks. It identifies biologically relevant extraction targets and proposes a linguistically motivated approach to event representation. The tasks in BioNLP-ST 2013 cover many new hot topics in biology that are close to biologists’ needs. BioNLP-ST 2013 broadens the scope of the text-mining application domains in biology by introducing new issues on cancer genetics and pathway curation. It also builds on the well-known previous datasets GENIA, LLL/BI and BB to propose more realistic tasks than those considered previously, closer to the actual needs of biological data integration.

The first event in 2009 triggered active research in the community on a specific fine-grained IE task. Expanding on this, the second BioNLP-ST was organized under the theme “Generalization”, which was well received by participants, who introduced numerous systems that could be straightforwardly applied to multiple tasks. This time, the BioNLP-ST takes a step further and pursues the grand theme of “Knowledge base construction”, which is addressed in various ways: semantic web (GE, GRO), pathways (PC), molecular mechanisms of cancer (CG), regulation networks (GRN) and ontology population (GRO, BB).

As in previous events, manually annotated data will be provided for training, development and evaluation of information extraction methods. According to their relevance for biological studies, the annotations are either bound to specific expressions in the text or represented as structured knowledge. Many tools for the detailed evaluation and graphical visualization of annotations and system outputs will be available for participants. Support in performing linguistic processing will be provided to the participants in the form of analyses created by various state-of-the art tools on the dataset texts.

Participation in the task will be open to academia, industry, and all other interested parties.

Tasks:

Quick question: Do you think there is semantically diverse data available for each of these tasks?
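
For a concrete sense of what "fine-grained event extraction" targets look like, here is a constructed example in the BioNLP-ST standoff style together with a minimal parser. It is illustrative only, not actual 2013 task data:

```python
# Constructed example in the BioNLP-ST / brat standoff style (not real task
# data). The underlying sentence would be: "TP53 suppresses expression of MDM2"
A1 = "T1\tProtein 0 4\tTP53\nT2\tProtein 30 34\tMDM2"
A2 = "T3\tNegative_regulation 5 15\tsuppresses\nE1\tNegative_regulation:T3 Cause:T1 Theme:T2"

def parse_standoff(block):
    """Split standoff lines into text-bound annotations (T*) and events (E*)."""
    entities, events = {}, []
    for line in block.splitlines():
        ann_id, body = line.split("\t", 1)
        if ann_id.startswith("T"):
            span, text = body.split("\t")
            ann_type, start, end = span.split()
            entities[ann_id] = (ann_type, int(start), int(end), text)
        elif ann_id.startswith("E"):
            parts = body.split()
            event_type, trigger = parts[0].split(":")
            events.append((event_type, trigger,
                           dict(p.split(":") for p in parts[1:])))
    return entities, events

entities, events = parse_standoff(A1 + "\n" + A2)
print(entities["T1"])  # ('Protein', 0, 4, 'TP53')
print(events)          # [('Negative_regulation', 'T3', {'Cause': 'T1', 'Theme': 'T2'})]
```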

I first saw this at: BioNLP Shared Task: Text Mining for Biology Competition.

January 21, 2013

Concept Maps – Pharmaceuticals

Filed under: Bioinformatics,Biomedical,Concept Maps — Patrick Durusau @ 7:29 pm

Designing concept maps for a precise and objective description of pharmaceutical innovations by Maia Iordatii, Alain Venot and Catherine Duclos. (BMC Medical Informatics and Decision Making 2013, 13:10 doi:10.1186/1472-6947-13-10)

Abstract:

Background

When a new drug is launched onto the market, information about the new manufactured product is contained in its monograph and evaluation report published by national drug agencies. Health professionals need to be able to determine rapidly and easily whether the new manufactured product is potentially useful for their practice. There is therefore a need to identify the best way to group together and visualize the main items of information describing the nature and potential impact of the new drug. The objective of this study was to identify these items of information and to bring them together in a model that could serve as the standard for presenting the main features of new manufactured product.

Methods

We developed a preliminary conceptual model of pharmaceutical innovations, based on the knowledge of the authors. We then refined this model, using a random sample of 40 new manufactured drugs recently approved by the national drug regulatory authorities in France and covering a broad spectrum of innovations and therapeutic areas. Finally, we used another sample of 20 new manufactured drugs to determine whether the model was sufficiently comprehensive.

Results

The results of our modeling led to three sub models described as conceptual maps representing: i) the medical context for use of the new drug (indications, type of effect, therapeutical arsenal for the same indications), ii) the nature of the novelty of the new drug (new molecule, new mechanism of action, new combination, new dosage, etc.), and iii) the impact of the drug in terms of efficacy, safety and ease of use, compared with other drugs with the same indications.

Conclusions

Our model can help to standardize information about new drugs released onto the market. It is potentially useful to the pharmaceutical industry, medical journals, editors of drug databases and medical software, and national or international drug regulation agencies, as a means of describing the main properties of new pharmaceutical products. It could also be used as a guide for the writing of comprehensive and objective texts summarizing the nature and interest of new manufactured product. (emphasis added)

We all design categories starting with what we know, as pointed out under methods above.

And any three authors could undertake such a quest, with equally valid results but different terminology and perhaps even a different arrangement of concepts.

The problem isn't the undertaking, which is useful.

The problem is the lack of a binding between such undertakings, one that would enable users to migrate between such maps as they develop over time.

A problem that topic maps offer an infrastructure to solve.
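
A minimal sketch of what such a binding might look like. The two vocabularies and the mapping between them are invented for illustration and are not drawn from the paper:

```python
# Two hypothetical concept maps describing the same new drug, using
# different labels for the same items of information.
map_a = {"indication": "type 2 diabetes",
         "novelty": "new mechanism of action",
         "impact": "improved safety profile"}

map_b = {"therapeutic_context": "type 2 diabetes",
         "nature_of_innovation": "new mechanism of action",
         "added_value": "improved safety profile"}

# The "binding": an explicit mapping between the two vocabularies, which is
# exactly the piece that is usually missing between independent models.
binding = {"indication": "therapeutic_context",
           "novelty": "nature_of_innovation",
           "impact": "added_value"}

def migrate(record, binding):
    """Re-express a map-A record in map-B's vocabulary."""
    return {binding[key]: value for key, value in record.items()}

assert migrate(map_a, binding) == map_b
print(migrate(map_a, binding))
```

The point is not the two dictionaries but the third one: without an explicit binding, migrating between the maps remains manual work every time either map changes.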

January 19, 2013

The Pacific Symposium on Biocomputing 2013 [Proceedings]

Filed under: Bioinformatics,Biomedical,Data Mining,Text Mining — Patrick Durusau @ 7:09 pm

The Pacific Symposium on Biocomputing 2013 by Will Bush.

From the post:

For 18 years now, computational biologists have convened on the beautiful islands of Hawaii to present and discuss research emerging from new areas of biomedicine. PSB Conference Chairs Teri Klein (@teriklein), Keith Dunker, Russ Altman (@Rbaltman) and Larry Hunter (@ProfLHunter) organize innovative sessions and tutorials that are always interactive and thought-provoking. This year, sessions included Computational Drug Repositioning, Epigenomics, Aberrant Pathway and Network Activity, Personalized Medicine, Phylogenomics and Population Genomics, Post-Next Generation Sequencing, and Text and Data Mining. The Proceedings are available online here, and a few of the highlights are:

See Will’s post for the highlights. Or browse the proceedings. You are almost certainly going to find something relevant to you.

Do note Will’s use of Twitter IDs as identifiers. Unique, persistent (I assume Twitter doesn’t re-assign them), easy to access.

It wasn’t clear from Will’s post if the following image was from Biocomputing 2013 or if he stopped by a markup conference. Hard to tell. 😉

[Image: Biocomputing 2013]

January 18, 2013

PPInterFinder

Filed under: Associations,Bioinformatics,Biomedical — Patrick Durusau @ 7:18 pm

PPInterFinder—a mining tool for extracting causal relations on human proteins from literature by Kalpana Raja, Suresh Subramani and Jeyakumar Natarajan. (Database (2013) 2013 : bas052 doi: 10.1093/database/bas052)

Abstract:

One of the most common and challenging problem in biomedical text mining is to mine protein–protein interactions (PPIs) from MEDLINE abstracts and full-text research articles because PPIs play a major role in understanding the various biological processes and the impact of proteins in diseases. We implemented, PPInterFinder—a web-based text mining tool to extract human PPIs from biomedical literature. PPInterFinder uses relation keyword co-occurrences with protein names to extract information on PPIs from MEDLINE abstracts and consists of three phases. First, it identifies the relation keyword using a parser with Tregex and a relation keyword dictionary. Next, it automatically identifies the candidate PPI pairs with a set of rules related to PPI recognition. Finally, it extracts the relations by matching the sentence with a set of 11 specific patterns based on the syntactic nature of PPI pair. We find that PPInterFinder is capable of predicting PPIs with the accuracy of 66.05% on AIMED corpus and outperforms most of the existing systems.

Database URL: http://www.biomining-bu.in/ppinterfinder/

I thought the shortened form of the title would catch your eye. 😉

Important work for bioinformatics, but it is also an example of domain-specific association mining.

By focusing on a specific domain and forswearing designs on being a universal association solution, PPInterFinder produces useful results today.

A lesson that should be taken and applied to semantic mappings more generally.
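
Here is a drastically simplified sketch of the co-occurrence idea behind such tools. It is not PPInterFinder's pipeline, which adds Tregex-based parsing and eleven syntactic patterns; the protein names and relation keywords below are invented:

```python
import re
from itertools import combinations

# Toy dictionaries standing in for a protein lexicon and a relation keyword
# dictionary; the real tool adds syntactic filtering on top of this step.
PROTEINS = {"BRCA1", "TP53", "MDM2"}
RELATION_KEYWORDS = {"interacts", "binds", "phosphorylates", "inhibits"}

def candidate_ppis(sentence):
    """Return (protein, keyword, protein) triples co-occurring in a sentence."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    proteins = [t for t in tokens if t.upper() in PROTEINS]
    keywords = [t for t in tokens if t.lower() in RELATION_KEYWORDS]
    if len(proteins) < 2 or not keywords:
        return []
    return [(p1, keywords[0], p2) for p1, p2 in combinations(proteins, 2)]

print(candidate_ppis("MDM2 binds and inhibits TP53 in unstressed cells."))
# [('MDM2', 'binds', 'TP53')]
```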

January 17, 2013

UniChem…[How Much Precision Can You Afford?]

Filed under: Bioinformatics,Biomedical,Cheminformatics,Topic Maps — Patrick Durusau @ 7:26 pm

UniChem: a unified chemical structure cross-referencing and identifier tracking system by Jon Chambers, Mark Davies, Anna Gaulton, Anne Hersey, Sameer Velankar, Robert Petryszak, Janna Hastings, Louisa Bellis, Shaun McGlinchey and John P Overington. (Journal of Cheminformatics 2013, 5:3 doi:10.1186/1758-2946-5-3)

Abstract:

UniChem is a freely available compound identifier mapping service on the internet, designed to optimize the efficiency with which structure-based hyperlinks may be built and maintained between chemistry-based resources. In the past, the creation and maintenance of such links at EMBL-EBI, where several chemistry-based resources exist, has required independent efforts by each of the separate teams. These efforts were complicated by the different data models, release schedules, and differing business rules for compound normalization and identifier nomenclature that exist across the organization. UniChem, a large-scale, non-redundant database of Standard InChIs with pointers between these structures and chemical identifiers from all the separate chemistry resources, was developed as a means of efficiently sharing the maintenance overhead of creating these links. Thus, for each source represented in UniChem, all links to and from all other sources are automatically calculated and immediately available for all to use. Updated mappings are immediately available upon loading of new data releases from the sources. Web services in UniChem provide users with a single simple automatable mechanism for maintaining all links from their resource to all other sources represented in UniChem. In addition, functionality to track changes in identifier usage allows users to monitor which identifiers are current, and which are obsolete. Lastly, UniChem has been deliberately designed to allow additional resources to be included with minimal effort. Indeed, the recent inclusion of data sources external to EMBL-EBI has provided a simple means of providing users with an even wider selection of resources with which to link to, all at no extra cost, while at the same time providing a simple mechanism for external resources to link to all EMBL-EBI chemistry resources.

From the background section:

Since these resources are continually developing in response to largely distinct active user communities, a full integration solution, or even the imposition of a requirement to adopt a common unifying chemical identifier, was considered unnecessarily complex, and would inhibit the freedom of each of the resources to successfully evolve in future. In addition, it was recognized that in the future more small molecule-containing databases might reside at EMBL-EBI, either because existing databases may begin to annotate their data with chemical information, or because entirely new resources are developed or adopted. This would make a full integration solution even more difficult to sustain. A need was therefore identified for a flexible integration solution, which would create, maintain and manage links between the resources, with minimal maintenance costs to the participant resources, whilst easily allowing the inclusion of additional sources in the future. Also, since the solution should allow different resources to maintain their own identifier systems, it was recognized as important for the system to have some simple means of tracking identifier usage, at least in the sense of being able to archive obsolete identifiers and assignments, and indicate when obsolete assignments were last in use.

The UniChem project highlights an important aspect of mapping identifiers: How much mapping can you afford?

Or perhaps even better: What is the cost/benefit ratio for a complete mapping?

The mapping in question isn't an academic exercise in elegance and completeness.

Its users have an immediate need for the mapping data, and if it is not quite right, human users are in the best position to correct it and suggest corrections.

Not to mention that new identifiers are likely to arrive before the old ones are completely mapped.
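
A toy sketch of the bookkeeping UniChem describes: a non-redundant structure key with per-source identifiers and archiving of obsolete assignments. This is not UniChem's schema or web service, and the InChIKey and identifiers below are placeholders:

```python
from collections import defaultdict

class CrossRefStore:
    """Minimal identifier cross-referencing keyed on a structure identifier
    (e.g. a Standard InChIKey), with archiving of obsolete assignments."""

    def __init__(self):
        self.current = defaultdict(dict)   # structure_key -> {source: identifier}
        self.obsolete = defaultdict(list)  # structure_key -> [(source, identifier)]

    def assign(self, structure_key, source, identifier):
        """Record a source identifier; archive any assignment it supersedes."""
        old = self.current[structure_key].get(source)
        if old is not None and old != identifier:
            self.obsolete[structure_key].append((source, old))
        self.current[structure_key][source] = identifier

    def links(self, structure_key):
        """All current source identifiers for a structure: every source is
        linked to every other source through the shared structure key."""
        return dict(self.current[structure_key])

store = CrossRefStore()
key = "XXXXXXXXXXXXXX-UHFFFAOYSA-N"           # placeholder InChIKey
store.assign(key, "chembl", "CHEMBL0000001")  # placeholder identifiers
store.assign(key, "pdbe", "LIG")
store.assign(key, "chembl", "CHEMBL0000002")  # supersedes the earlier assignment

print(store.links(key))     # {'chembl': 'CHEMBL0000002', 'pdbe': 'LIG'}
print(store.obsolete[key])  # [('chembl', 'CHEMBL0000001')]
```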

Suggestive that evolving mappings may be an appropriate paradigm for topic maps.
