Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

January 12, 2013

Manual Alignment of Anatomy Ontologies

Filed under: Alignment,Bioinformatics,Biomedical,Ontology — Patrick Durusau @ 7:00 pm

Matching arthropod anatomy ontologies to the Hymenoptera Anatomy Ontology: results from a manual alignment by Matthew A. Bertone, István Mikó, Matthew J. Yoder, Katja C. Seltmann, James P. Balhoff, and Andrew R. Deans. (Database (2013) 2013 : bas057 doi: 10.1093/database/bas057)

Abstract:

Matching is an important step for increasing interoperability between heterogeneous ontologies. Here, we present alignments we produced as domain experts, using a manual mapping process, between the Hymenoptera Anatomy Ontology and other existing arthropod anatomy ontologies (representing spiders, ticks, mosquitoes and Drosophila melanogaster). The resulting alignments contain from 43 to 368 mappings (correspondences), all derived from domain-expert input. Despite the many pairwise correspondences, only 11 correspondences were found in common between all ontologies, suggesting either major intrinsic differences between each ontology or gaps in representing each group’s anatomy. Furthermore, we compare our findings with putative correspondences from Bioportal (derived from LOOM software) and summarize the results in a total evidence alignment. We briefly discuss characteristics of the ontologies and issues with the matching process.

Database URL: http://purl.obolibrary.org/obo/hao/2012-07-18/arthropod-mappings.obo.

A great example of the difficulty of matching across ontologies, particularly when the granularity or subjects of ontologies vary.
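
The "only 11 in common" finding is the kind of thing that reduces to a set intersection once the alignments are in hand. A toy sketch in Python, with invented terms and identifiers rather than the paper's data:

```python
# Toy alignments: each maps terms from one arthropod ontology to the
# Hymenoptera Anatomy Ontology (HAO). All pairs and IDs are invented.
alignments = {
    "spider":   {("leg", "HAO:0000494"), ("eye", "HAO:0000345"), ("seta", "HAO:0000935")},
    "tick":     {("leg", "HAO:0000494"), ("eye", "HAO:0000345")},
    "mosquito": {("leg", "HAO:0000494"), ("wing", "HAO:0001089")},
}

# HAO terms hit by every alignment, the analogue of the paper's finding
# that only 11 correspondences were common to all the ontologies.
hao_targets = [{hao for (_, hao) in pairs} for pairs in alignments.values()]
common = set.intersection(*hao_targets)
print(common)  # {'HAO:0000494'}
```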

January 10, 2013

Ontology Alert! Molds are able to reproduce sexually

Filed under: Biomedical,Ontology — Patrick Durusau @ 1:45 pm

Unlike we thought for 100 years: Molds are able to reproduce sexually

For over 100 years, it was assumed that the penicillin-producing mould fungus Penicillium chrysogenum only reproduced asexually through spores. An international research team led by Prof. Dr. Ulrich Kück and Julia Böhm from the Chair of General and Molecular Botany at the Ruhr-Universität has now shown for the first time that the fungus also has a sexual cycle, i.e. two “genders”. Through sexual reproduction of P. chrysogenum, the researchers generated fungal strains with new biotechnologically relevant properties – such as high penicillin production without the contaminating chrysogenin. The team from Bochum, Göttingen, Nottingham (England), Kundl (Austria) and Sandoz GmbH reports in PNAS. The article will be published in this week’s Online Early Edition and was selected as a cover story.

J. Böhm, B. Hoff, C.M. O’Gorman, S. Wolfers, V. Klix, D. Binger, I. Zadra, H. Kürnsteiner, S. Pöggeler, P.S. Dyer, U. Kück (2013): Sexual reproduction and mating-type-mediated strain development in the penicillin-producing fungus Penicillium chrysogenum, PNAS, DOI: 10.1073/pnas.1217943110

If you have hard-coded asexual reproduction into your ontology, it is time to reconsider that decision. And to get agreement on reworking all the dependent relationships.
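
What would that look like mechanically? A hedged sketch with rdflib (namespace and property names invented, nothing from an actual ontology): retract the asexual-only assertion, then hunt for statements that depended on it.

```python
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/fungi#")  # invented namespace
g = Graph()

# The century-old assumption, plus a statement that depends on it.
g.add((EX.PenicilliumChrysogenum, EX.reproductionMode, EX.AsexualOnly))
g.add((EX.SporeDispersalModel, EX.assumes, EX.AsexualOnly))

# Retract the outdated assertion and record the new state of knowledge...
g.remove((EX.PenicilliumChrysogenum, EX.reproductionMode, EX.AsexualOnly))
g.add((EX.PenicilliumChrysogenum, EX.reproductionMode, EX.SexualAndAsexual))

# ...then find everything still pointing at the retired value. Each hit is a
# dependent relationship that needs renegotiation.
for s, p in g.subject_predicates(EX.AsexualOnly):
    print(f"review: {s} {p}")
```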

January 8, 2013

PLOS Computational Biology: Translational Bioinformatics

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 11:42 am

PLOS Computational Biology: Translational Bioinformatics. Maricel Kann, Guest Editor, and Fran Lewitter, PLOS Computational Biology Education Editor.

Following up on the collection in which Biomedical Knowledge Integration appears, I found:

Introduction to Translational Bioinformatics Collection by Russ B. Altman. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002796

Chapter 1: Biomedical Knowledge Integration by Philip R. O. Payne. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002826

Chapter 2: Data-Driven View of Disease Biology by Casey S. Greene and Olga G. Troyanskaya. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002816

Chapter 3: Small Molecules and Disease by David S. Wishart. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002805

Chapter 4: Protein Interactions and Disease by Mileidy W. Gonzalez and Maricel G. Kann. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002819

Chapter 5: Network Biology Approach to Complex Diseases by Dong-Yeon Cho, Yoo-Ah Kim and Teresa M. Przytycka. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002820

Chapter 6: Structural Variation and Medical Genomics by Benjamin J. Raphael. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002821

Chapter 7: Pharmacogenomics by Konrad J. Karczewski, Roxana Daneshjou and Russ B. Altman. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002817

Chapter 8: Biological Knowledge Assembly and Interpretation by Han Kim. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002858

Chapter 9: Analyses Using Disease Ontologies by Nigam H. Shah, Tyler Cole and Mark A. Musen. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002827

Chapter 10: Mining Genome-Wide Genetic Markers by Xiang Zhang, Shunping Huang, Zhaojun Zhang and Wei Wang. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002828

Chapter 11: Genome-Wide Association Studies by William S. Bush and Jason H. Moore. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002822

Chapter 12: Human Microbiome Analysis by Xochitl C. Morgan and Curtis Huttenhower. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002808

Chapter 13: Mining Electronic Health Records in the Genomics Era by Joshua C. Denny. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002823

Chapter 14: Cancer Genome Analysis by Miguel Vazquez, Victor de la Torre and Alfonso Valencia. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002824

An example of scholarship at its best!

Biomedical Knowledge Integration

Filed under: Bioinformatics,Biomedical,Data Integration,Medical Informatics — Patrick Durusau @ 11:41 am

Biomedical Knowledge Integration by Philip R. O. Payne.

Abstract:

The modern biomedical research and healthcare delivery domains have seen an unparalleled increase in the rate of innovation and novel technologies over the past several decades. Catalyzed by paradigm-shifting public and private programs focusing upon the formation and delivery of genomic and personalized medicine, the need for high-throughput and integrative approaches to the collection, management, and analysis of heterogeneous data sets has become imperative. This need is particularly pressing in the translational bioinformatics domain, where many fundamental research questions require the integration of large scale, multi-dimensional clinical phenotype and bio-molecular data sets. Modern biomedical informatics theory and practice has demonstrated the distinct benefits associated with the use of knowledge-based systems in such contexts. A knowledge-based system can be defined as an intelligent agent that employs a computationally tractable knowledge base or repository in order to reason upon data in a targeted domain and reproduce expert performance relative to such reasoning operations. The ultimate goal of the design and use of such agents is to increase the reproducibility, scalability, and accessibility of complex reasoning tasks. Examples of the application of knowledge-based systems in biomedicine span a broad spectrum, from the execution of clinical decision support, to epidemiologic surveillance of public data sets for the purposes of detecting emerging infectious diseases, to the discovery of novel hypotheses in large-scale research data sets. In this chapter, we will review the basic theoretical frameworks that define core knowledge types and reasoning operations with particular emphasis on the applicability of such conceptual models within the biomedical domain, and then go on to introduce a number of prototypical data integration requirements and patterns relevant to the conduct of translational bioinformatics that can be addressed via the design and use of knowledge-based systems.

A chapter in the “Translational Bioinformatics” collection for PLOS Computational Biology.

A very good survey of the knowledge integration area, which alas does not include topic maps. 🙁

It does, however, include use cases at the end of the chapter that are biomedical-specific.

Thinking those would be good cases to illustrate the use of topic maps for biomedical knowledge integration.

Yes?
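
Here is the sort of thing I have in mind: a minimal sketch in plain Python (records and identifiers chosen for illustration) of the basic topic map move, merging records about the same gene whenever their subject identifier sets overlap.

```python
# Toy records: the same gene identified differently by two sources.
records = [
    {"ids": {"hgnc:TP53"}, "props": {"name": "tumor protein p53"}},
    {"ids": {"ncbigene:7157", "hgnc:TP53"}, "props": {"chromosome": "17"}},
    {"ids": {"ncbigene:672"}, "props": {"name": "BRCA1"}},
]

def merge(records):
    """Merge records whose identifier sets overlap (single pass, toy version)."""
    merged = []
    for rec in records:
        for m in merged:
            if m["ids"] & rec["ids"]:  # any shared identifier => same subject
                m["ids"] |= rec["ids"]
                m["props"].update(rec["props"])
                break
        else:
            merged.append({"ids": set(rec["ids"]), "props": dict(rec["props"])})
    return merged

for m in merge(records):
    print(sorted(m["ids"]), m["props"])
```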

January 5, 2013

Semantically enabling a genome-wide association study database

Filed under: Bioinformatics,Biomedical,Genomics,Medical Informatics,Ontology — Patrick Durusau @ 2:20 pm

Semantically enabling a genome-wide association study database by Tim Beck, Robert C Free, Gudmundur A Thorisson and Anthony J Brookes. Journal of Biomedical Semantics 2012, 3:9 doi:10.1186/2041-1480-3-9.

Abstract:

Background

The amount of data generated from genome-wide association studies (GWAS) has grown rapidly, but considerations for GWAS phenotype data reuse and interchange have not kept pace. This impacts on the work of GWAS Central — a free and open access resource for the advanced querying and comparison of summary-level genetic association data. The benefits of employing ontologies for standardising and structuring data are widely accepted. The complex spectrum of observed human phenotypes (and traits), and the requirement for cross-species phenotype comparisons, calls for reflection on the most appropriate solution for the organisation of human phenotype data. The Semantic Web provides standards for the possibility of further integration of GWAS data and the ability to contribute to the web of Linked Data.

Results

A pragmatic consideration when applying phenotype ontologies to GWAS data is the ability to retrieve all data, at the most granular level possible, from querying a single ontology graph. We found the Medical Subject Headings (MeSH) terminology suitable for describing all traits (diseases and medical signs and symptoms) at various levels of granularity and the Human Phenotype Ontology (HPO) most suitable for describing phenotypic abnormalities (medical signs and symptoms) at the most granular level. Diseases within MeSH are mapped to HPO to infer the phenotypic abnormalities associated with diseases. Building on the rich semantic phenotype annotation layer, we are able to make cross-species phenotype comparisons and publish a core subset of GWAS data as RDF nanopublications.

Conclusions

We present a methodology for applying phenotype annotations to a comprehensive genome-wide association dataset and for ensuring compatibility with the Semantic Web. The annotations are used to assist with cross-species genotype and phenotype comparisons. However, further processing and deconstructions of terms may be required to facilitate automatic phenotype comparisons. The provision of GWAS nanopublications enables a new dimension for exploring GWAS data, by way of intrinsic links to related data resources within the Linked Data web. The value of such annotation and integration will grow as more biomedical resources adopt the standards of the Semantic Web.

Rather than:

The benefits of employing ontologies for standardising and structuring data are widely accepted.

I would rephrase that to read:

The benefits and limitations of employing ontologies for standardising and structuring data are widely known.

Decades of use of relational database schemas, informal equivalents of ontologies, leave no doubt that governing structures for data have benefits.

Less often acknowledged is that those same governing structures impose limitations on data and what may be represented.

That’s not a dig at relational databases.

Just an observation that ontologies and their equivalents aren’t unalloyed precious metals.
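
A contrived illustration of both sides, in Python: a fixed schema gives you typed, uniform, queryable data, and silently has no place for anything else. (Field names and the rsID are placeholders, not from GWAS Central.)

```python
SCHEMA = {"snp_id": str, "p_value": float}  # the "governing structure"

def load(row, schema=SCHEMA):
    """Keep only fields the schema knows about; everything else is lost."""
    kept = {k: schema[k](v) for k, v in row.items() if k in schema}
    dropped = {k: v for k, v in row.items() if k not in schema}
    return kept, dropped

row = {"snp_id": "rs0000000", "p_value": "3.2e-08",
       "phenotype_note": "self-reported, questionnaire wave 2"}
kept, dropped = load(row)
print(kept)     # the benefit: typed, uniform, queryable
print(dropped)  # the limitation: what the structure had no place for
```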

December 27, 2012

Utopia Documents

Filed under: Annotation,Bioinformatics,Biomedical,Medical Informatics,PDF — Patrick Durusau @ 3:58 pm

While checking the “sponsored by” link for pdfx v1.0, I discovered: Utopia Documents.

From the homepage:

Reading, redefined.

Utopia Documents brings a fresh new perspective to reading the scientific literature, combining the convenience and reliability of the PDF with the flexibility and power of the web. Free for Linux, Mac and Windows.

Building Bridges

The scientific article has been described as a Story That Persuades With Data, but all too often the link between data and narrative is lost somewhere in the modern publishing process. Utopia Documents helps to rebuild these connections, linking articles to underlying datasets, and making it easy to access online resources relating to an article’s content.

A Living Resource

Published articles form the ‘minutes of science’, creating a stable record of ideas and discoveries. But no idea exists in isolation, and just because something has been published doesn’t mean that the story is over. Utopia Documents reconnects PDFs with the ongoing discussion, keeping you up-to-date with the latest knowledge and metrics.

Comment

Make private notes for yourself, annotate a document for others to see or take part in an online discussion.

Explore article content

Looking for clarification of given terms? Or more information about them? Do just that, with integrated semantic search.

Interact with live data

Interact directly with curated database entries: play with molecular structures; edit sequence and alignment data; even plot and export tabular data.

A finger on the pulse

Stay up to date with the latest news. Utopia connects what you read with live data from Altmetric, Mendeley, CrossRef, Scibite and others.

A user can register for an account (enabling comments on documents) or use the application anonymously.

Presently focused on the life sciences, but there is no impediment to expansion into computer science, for example.

It doesn’t solve semantic diversity issues, so there is an opportunity for topic maps there.

It doesn’t address the issue of documents being good at information delivery but not so good for information storage.

But issues of semantic diversity and information storage are growth areas for Utopia Documents, not reservations about its use.

Suggest you start using and exploring Utopia Documents sooner rather than later!

December 26, 2012

EOL Classification Providers [Encyclopedia of Life]

Filed under: Bioinformatics,Biomedical,Classification — Patrick Durusau @ 7:17 pm

EOL Classification Providers

From the webpage:

The information on EOL is organized using hierarchical classifications of taxa (groups of organisms) from a number of different classification providers. You can explore these hierarchies in the Names tab of EOL taxon pages. Many visitors would expect to see a single classification of life on EOL. However, we are still far from having a classification scheme that is universally accepted.

Biologists all over the world are studying the genetic relationships between organisms in order to determine each species’ place in the hierarchy of life. While this research is underway, there will be differences in opinion on how to best classify each group. Therefore, we present our visitors with a number of alternatives. Each of these hierarchies is supported by a community of scientists, and all of them feature relationships that are controversial or unresolved.

How far from universally accepted?

Consider the sources for classification:

AntWeb
AntWeb is generally recognized as the most advanced biodiversity information system at species level dedicated to ants. Altogether, its acceptance by the ant research community, the number of participating remote curators that maintain the site, number of pictures, simplicity of web interface, and completeness of species, make AntWeb the premier reference for dissemination of data, information, and knowledge on ants. AntWeb is serving information on tens of thousands of ant species through the EOL.

Avibase
Avibase is an extensive database information system about all birds of the world, containing over 6 million records about 10,000 species and 22,000 subspecies of birds, including distribution information, taxonomy, synonyms in several languages and more. This site is managed by Denis Lepage and hosted by Bird Studies Canada, the Canadian copartner of Birdlife International. Avibase has been a work in progress since 1992 and it is offered as a free service to the bird-watching and scientific community. In addition to links, Avibase helped us install Gill, F & D Donsker (Eds). 2012. IOC World Bird Names (v 3.1). Available at http://www.worldbirdnames.org as of 2 May 2012. More bird classifications are likely to follow.

CoL
The Catalogue of Life Partnership (CoLP) is an informal partnership dedicated to creating an index of the world’s organisms, called the Catalogue of Life (CoL). The CoL provides different forms of access to an integrated, quality, maintained, comprehensive consensus species checklist and taxonomic hierarchy, presently covering more than one million species, and intended to cover all known species in the near future. The Annual Checklist EOL uses contains substantial contributions of taxonomic expertise from more than fifty organizations around the world, integrated into a single work by the ongoing work of the CoLP partners.

FishBase
FishBase is a global information system with all you ever wanted to know about fishes. FishBase is a relational database with information to cater to different professionals such as research scientists, fisheries managers, zoologists and many more. The FishBase Website contains data on practically every fish species known to science. The project was developed at the WorldFish Center in collaboration with the Food and Agriculture Organization of the United Nations and many other partners, and with support from the European Commission. FishBase is serving information on more than 30,000 fish species through EOL.

Index Fungorum
The Index Fungorum, the global fungal nomenclator coordinated and supported by the Index Fungorum Partnership (CABI, CBS, Landcare Research-NZ), contains names of fungi (including yeasts, lichens, chromistan fungal analogues, protozoan fungal analogues and fossil forms) at all ranks.

ITIS
The Integrated Taxonomic Information System (ITIS) is a partnership of federal agencies and other organizations from the United States, Canada, and Mexico, with data stewards and experts from around the world (see http://www.itis.gov). The ITIS database is an automated reference of scientific and common names of biota of interest to North America. It contains more than 600,000 scientific and common names in all kingdoms, and is accessible via the World Wide Web in English, French, Spanish, and Portuguese (http://itis.gbif.net). ITIS is part of the US National Biological Information Infrastructure (http://www.nbii.gov).

IUCN
International Union for Conservation of Nature (IUCN) helps the world find pragmatic solutions to our most pressing environment and development challenges. IUCN supports scientific research; manages field projects all over the world; and brings governments, non-government organizations, United Nations agencies, companies and local communities together to develop and implement policy, laws and best practice. EOL partnered with the IUCN to indicate status of each species according to the Red List of Threatened Species.

Metalmark Moths of the World
Metalmark moths (Lepidoptera: Choreutidae) are a poorly known, mostly tropical family of microlepidopterans. The Metalmark Moths of the World LifeDesk provides species pages and an updated classification for the group.

NCBI
As a U.S. national resource for molecular biology information, NCBI’s mission is to develop new information technologies to aid in the understanding of fundamental molecular and genetic processes that control health and disease. The NCBI taxonomy database contains the names of all organisms that are represented in the genetic databases with at least one nucleotide or protein sequence.

The Paleobiology Database
The Paleobiology Database is a public resource for the global scientific community. It has been organized and operated by a multi-disciplinary, multi-institutional, international group of paleobiological researchers. Its purpose is to provide global, collection-based occurrence and taxonomic data for marine and terrestrial animals and plants of any geological age, as well as web-based software for statistical analysis of the data. The project’s wider, long-term goal is to encourage collaborative efforts to answer large-scale paleobiological questions by developing a useful database infrastructure and bringing together large data sets.

The Reptile Database 
This database provides information on the classification of all living reptiles by listing all species and their pertinent higher taxa. The database therefore covers all living snakes, lizards, turtles, amphisbaenians, tuataras, and crocodiles. It is a source of taxonomic data, thus providing primarily (scientific) names, synonyms, distributions and related data. The database is currently supported by the Systematics working group of the German Herpetological Society (DGHT).

WoRMS
The aim of a World Register of Marine Species (WoRMS) is to provide an authoritative and comprehensive list of names of marine organisms, including information on synonymy. While highest priority goes to valid names, other names in use are included so that this register can serve as a guide to interpret taxonomic literature.

Those are “current” classifications, which reflect neither historical classifications (used by our ancestors) nor future ones.

The four states of matter becoming more than 500 states of matter, for example.

Instead of “universal acceptance,” how does “working agreement for a specific purpose” sound?
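
In code, a “working agreement for a specific purpose” might amount to comparing classifications only as deeply as the task at hand requires. A sketch with invented lineages (not EOL’s actual data):

```python
# The same taxon, placed differently by two hypothetical providers.
lineages = {
    "CoL":  ["Animalia", "Arthropoda", "Insecta", "Hymenoptera", "Apis mellifera"],
    "NCBI": ["cellular organisms", "Eukaryota", "Arthropoda", "Insecta",
             "Hymenoptera", "Apis mellifera"],
}

def agree_at(rank_depth, a, b):
    """Compare only the tail of each lineage, ignoring how the
    providers root their hierarchies."""
    return a[-rank_depth:] == b[-rank_depth:]

# For sorting bees among insects, the providers agree; at the root they don't.
print(agree_at(3, lineages["CoL"], lineages["NCBI"]))  # True
print(lineages["CoL"][0] == lineages["NCBI"][0])       # False
```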

December 21, 2012

New Public-Access Source With 3-D Information for Protein Interactions

Filed under: Bioinformatics,Biomedical,Genome,Genomics — Patrick Durusau @ 5:24 pm

New Public-Access Source With 3-D Information for Protein Interactions

From the post:

Researchers have developed a platform that compiles all the atomic data, previously stored in diverse databases, on protein structures and protein interactions for eight organisms of relevance. They apply a singular homology-based modelling procedure.

The scientists Roberto Mosca, Arnaud Ceol and Patrick Aloy provide the international biomedical community with Interactome3D (interactome3d.irbbarcelona.org), an open-access and free web platform developed entirely by the Institute for Research in Biomedicine (IRB Barcelona). Interactome 3D offers for the first time the possibility to anonymously access and add molecular details of protein interactions and to obtain the information in 3D models. For researchers, atomic level details about the reactions are fundamental to unravel the bases of biology, disease development, and the design of experiments and drugs to combat diseases.

Interactome 3D provides reliable information about more than 12,000 protein interactions for eight model organisms, namely the plant Arabidopsis thaliana, the worm Caenorhabditis elegans, the fly Drosophila melanogaster, the bacteria Escherichia coli and Helicobacter pylori, the brewer’s yeast Saccharomyces cerevisiae, the mouse Mus musculus, and Homo sapiens. These models are considered the most relevant in biomedical research and genetic studies. The journal Nature Methods presents the research results and accredits the platform on the basis of its high reliability and precision in modelling interactions, which reaches an average of 75%.

Further details can be found at:

Interactome3D: adding structural details to protein networks by Roberto Mosca, Arnaud Céol and Patrick Aloy. (Nature Methods (2012) doi:10.1038/nmeth.2289)

Abstract:

Network-centered approaches are increasingly used to understand the fundamentals of biology. However, the molecular details contained in the interaction networks, often necessary to understand cellular processes, are very limited, and the experimental difficulties surrounding the determination of protein complex structures make computational modeling techniques paramount. Here we present Interactome3D, a resource for the structural annotation and modeling of protein-protein interactions. Through the integration of interaction data from the main pathway repositories, we provide structural details at atomic resolution for over 12,000 protein-protein interactions in eight model organisms. Unlike static databases, Interactome3D also allows biologists to upload newly discovered interactions and pathways in any species, select the best combination of structural templates and build three-dimensional models in a fully automated manner. Finally, we illustrate the value of Interactome3D through the structural annotation of the complement cascade pathway, rationalizing a potential common mechanism of action suggested for several disease-causing mutations.

Interesting not only for its implications for bioinformatics, but also for the development of homology modeling (superficially: similar proteins have similar interaction sites) to assist in that work.

The topic map analogy would be to show, within a subject domain, that different identifications of the same subject tend to have the same associations or to fall into other patterns.

A subject identity test could then be constructed based upon a template of associations or other values.
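
A rough sketch of such a test, with made-up associations: the identity template is a set of association patterns, and any node whose associations contain the template is a candidate for the same subject.

```python
# Hypothetical nodes and their associations, as (relation, target) pairs.
associations = {
    "p53_human": {("binds", "MDM2"), ("regulates", "CDKN1A"), ("located_in", "nucleus")},
    "TP53":      {("binds", "MDM2"), ("regulates", "CDKN1A"), ("located_in", "nucleus")},
    "insulin":   {("binds", "INSR"), ("located_in", "secretory granule")},
}

# The identity template: associations a node must have to count as "this" subject.
TEMPLATE = {("binds", "MDM2"), ("regulates", "CDKN1A")}

def same_subject_candidates(assoc, template):
    """Names whose association sets contain the identity template."""
    return [name for name, edges in assoc.items() if template <= edges]

print(same_subject_candidates(associations, TEMPLATE))  # ['p53_human', 'TP53']
```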

December 18, 2012

Bio-Linux 7 – Released November 2012

Filed under: Bio-Linux,Bioinformatics,Biomedical,Linux OS — Patrick Durusau @ 5:24 pm

Bio-Linux 7 – Released November 2012

From the webpage:

Bio-Linux 7 is a fully featured, powerful, configurable and easy to maintain bioinformatics workstation. Bio-Linux provides more than 500 bioinformatics programs on an Ubuntu Linux 12.04 LTS base. There is a graphical menu for bioinformatics programs, as well as easy access to the Bio-Linux bioinformatics documentation system and sample data useful for testing programs.

Bio-Linux 7 adds many improvements over previous versions, including the Galaxy analysis environment.  There are also various packages to handle new generation sequence data types.

You can install Bio-Linux on your machine, either as the only operating system, or as part of a dual-boot setup which allows you to use your current system and Bio-Linux on the same hardware.

Bio-Linux also runs Live from the DVD or a USB stick. This runs in the memory of your machine and does not involve installing anything. This is a great, no-hassle way to try out Bio-Linux, demonstrate or teach with it, or to work with when you are on the move.

Bio-Linux is built on open source systems and software, and so is free to install and use. See What’s new on Bio-Linux 7. Also, check out the 2006 paper on Bio-Linux and open source systems for biologists.

Useful for exploring bioinformatics tools for Ubuntu.

But useful as well for considering how those tools could be used in data/text mining for other domains.

Not to mention the packaging for installation to DVD or USB stick.

Are there any topic map engines that are set up for burning to DVD or USB stick?

Packaging them that way with more than a minimal set of maps and/or data sets might be a useful avenue to explore.

December 17, 2012

taxize: Taxonomic search and phylogeny retrieval [R]

Filed under: Bioinformatics,Biomedical,Phylogenetic Trees,Searching,Taxonomy — Patrick Durusau @ 4:58 pm

taxize: Taxonomic search and phylogeny retrieval by Scott Chamberlain, Eduard Szoecs and Carl Boettiger.

From the documentation:

We are developing taxize as a package to allow users to search over many websites for species names (scientific and common) and download up- and downstream taxonomic hierarchical information – and many other things. The functions in the package that hit a specific API have a prefix and suffix separated by an underscore. They follow the format of service_whatitdoes. For example, gnr_resolve uses the Global Names Resolver API to resolve species names. General functions in the package that don’t hit a specific API don’t have two words separated by an underscore, e.g., classification. You need API keys for Encyclopedia of Life (EOL), the Universal Biological Indexer and Organizer (uBio), Tropicos, and Plantminer.

Just in case you need species names and/or taxonomic hierarchy information for your topic map.
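
If R isn’t your tool, the Global Names Resolver service that gnr_resolve wraps can be called directly. A hedged Python sketch; the endpoint, parameter format and response fields are my reading of the GNR documentation, so verify them before relying on this:

```python
import requests

def gnr_resolve(names):
    """Resolve scientific names against the Global Names Resolver."""
    resp = requests.get(
        "http://resolver.globalnames.org/name_resolvers.json",
        params={"names": "|".join(names)},  # assumption: pipe-separated names
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

data = gnr_resolve(["Helianthus annuus", "Homo sapiens"])
# The field names below are an assumption about the response layout.
for item in data.get("data", []):
    for r in item.get("results", [])[:1]:
        print(item.get("supplied_name_string"), "->", r.get("name_string"))
```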

Go3R [Searching for Alternatives to Animal Testing]

Go3R

A semantic search engine for finding alternatives to animal testing.

I mention it as an example of a search interface that assists the user in searching.

The help documentation is a bit sparse if you are looking for an opportunity to contribute to such a project.

I did locate some additional information on the project, all usefully with the same title to make locating it “easy.” 😉

[Introduction] Knowledge-based semantic search engine for alternative methods to animal experiments

[PubMed – entry] Go3R – semantic Internet search engine for alternative methods to animal testing by Sauer UG, Wächter T, Grune B, Doms A, Alvers MR, Spielmann H, Schroeder M. (ALTEX. 2009;26(1):17-31).

Abstract:

Consideration and incorporation of all available scientific information is an important part of the planning of any scientific project. As regards research with sentient animals, EU Directive 86/609/EEC for the protection of laboratory animals requires scientists to consider whether any planned animal experiment can be substituted by other scientifically satisfactory methods not entailing the use of animals or entailing less animals or less animal suffering, before performing the experiment. Thus, collection of relevant information is indispensable in order to meet this legal obligation. However, no standard procedures or services exist to provide convenient access to the information required to reliably determine whether it is possible to replace, reduce or refine a planned animal experiment in accordance with the 3Rs principle. The search engine Go3R, which is available free of charge under http://Go3R.org, runs up to become such a standard service. Go3R is the world-wide first search engine on alternative methods building on new semantic technologies that use an expert-knowledge based ontology to identify relevant documents. Due to Go3R’s concept and design, the search engine can be used without lengthy instructions. It enables all those involved in the planning, authorisation and performance of animal experiments to determine the availability of non-animal methodologies in a fast, comprehensive and transparent manner. Thereby, Go3R strives to significantly contribute to the avoidance and replacement of animal experiments.

[ALTEX entry – full text available] Go3R – Semantic Internet Search Engine for Alternative Methods to Animal Testing

December 15, 2012

Neuroscience Information Framework (NIF)

Filed under: Bioinformatics,Biomedical,Medical Informatics,Neuroinformatics,Searching — Patrick Durusau @ 8:21 pm

Neuroscience Information Framework (NIF)

From the about page:

The Neuroscience Information Framework is a dynamic inventory of Web-based neuroscience resources: data, materials, and tools accessible via any computer connected to the Internet. An initiative of the NIH Blueprint for Neuroscience Research, NIF advances neuroscience research by enabling discovery and access to public research data and tools worldwide through an open source, networked environment.

Example of a subject-specific information resource that provides much deeper coverage than is possible with Google, for example.

If you aren’t trying to index everything, you can outperform more general search solutions.

November 26, 2012

Collaborative biocuration… [Pre-Topic Map Tasks]

Filed under: Authoring Topic Maps,Bioinformatics,Biomedical,Curation,Genomics,Searching — Patrick Durusau @ 9:22 am

Collaborative biocuration—text-mining development task for document prioritization for curation by Thomas C. Wiegers, Allan Peter Davis and Carolyn J. Mattingly. (Database (2012) 2012 : bas037 doi: 10.1093/database/bas037)

Abstract:

The Critical Assessment of Information Extraction systems in Biology (BioCreAtIvE) challenge evaluation is a community-wide effort for evaluating text mining and information extraction systems for the biological domain. The ‘BioCreative Workshop 2012’ subcommittee identified three areas, or tracks, that comprised independent, but complementary aspects of data curation in which they sought community input: literature triage (Track I); curation workflow (Track II) and text mining/natural language processing (NLP) systems (Track III). Track I participants were invited to develop tools or systems that would effectively triage and prioritize articles for curation and present results in a prototype web interface. Training and test datasets were derived from the Comparative Toxicogenomics Database (CTD; http://ctdbase.org) and consisted of manuscripts from which chemical–gene–disease data were manually curated. A total of seven groups participated in Track I. For the triage component, the effectiveness of participant systems was measured by aggregate gene, disease and chemical ‘named-entity recognition’ (NER) across articles; the effectiveness of ‘information retrieval’ (IR) was also measured based on ‘mean average precision’ (MAP). Top recall scores for gene, disease and chemical NER were 49, 65 and 82%, respectively; the top MAP score was 80%. Each participating group also developed a prototype web interface; these interfaces were evaluated based on functionality and ease-of-use by CTD’s biocuration project manager. In this article, we present a detailed description of the challenge and a summary of the results.

The results:

“Top recall scores for gene, disease and chemical NER were 49, 65 and 82%, respectively; the top MAP score was 80%.”

indicate there is plenty of room for improvement. Perhaps even commercially viable improvement.
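
For readers who don’t work with IR metrics daily, here is the standard mean average precision calculation in miniature (my sketch, not the challenge’s evaluation code):

```python
def average_precision(ranked, relevant):
    """Average of precision values at each rank where a relevant doc appears."""
    hits, precisions = 0, []
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_docs, relevant_docs) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Toy check: one query, relevant docs at ranks 1 and 3 -> AP = (1/1 + 2/3) / 2
print(mean_average_precision([(["d1", "d2", "d3"], {"d1", "d3"})]))  # ~0.833
```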

In hindsight, not talking about how to make a topic map along with ISO 13250 may have been a mistake. Even admitting there are multiple ways to get there, a technical report outlining one or two ways would have made the process more transparent.

Answering the question “What can you say with a topic map?” with “Anything you want.” was a truthful answer, but not a helpful one.

I should try to crib something from one of those “how to write a research paper” guides. I haven’t looked at one in years, but the process is remarkably similar to the one that would produce a topic map.

Some of the mechanics are different, but the underlying intellectual process is quite similar. Everyone who has been to college (at least those of my age) had a course that talked about writing research papers. So it should be familiar terminology.

Thoughts/suggestions?

November 25, 2012

FluxMap: visual exploration of flux distributions in biological networks [Import/Graphs]

Filed under: Bioinformatics,Biomedical,Graphs,Networks,Visualization — Patrick Durusau @ 2:29 pm

FluxMap: visual exploration of flux distributions in biological networks.

From the webpage:

FluxMap is an easy to use tool for the advanced visualisation of simulated or measured flux data in biological networks. Flux data import is achieved via a structured template based on intuitive reaction equations. Flux data is mapped onto any network and visualised using edge thickness. Various visualisation options and interaction possibilities enable comparison and visual analysis of complex experimental setups in an interactive way.

Manuals and tutorials here.

Another application that makes it easy to create graphs from data, this one importing spreadsheet-based data.

Wonder why some highly touted commercial graph databases don’t offer the same ease of use?
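
Especially when the core trick, mapping flux magnitude onto edge thickness, is this small. A sketch with networkx and matplotlib, using invented flux values:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import networkx as nx

# Reaction equations reduced to edges carrying a flux attribute (made-up values).
fluxes = [("glucose", "G6P", 10.0), ("G6P", "F6P", 7.5), ("G6P", "6PG", 2.5)]

G = nx.DiGraph()
for substrate, product, flux in fluxes:
    G.add_edge(substrate, product, flux=flux)

pos = nx.spring_layout(G, seed=42)
widths = [G[u][v]["flux"] / 2.0 for u, v in G.edges()]  # thickness ~ flux
nx.draw(G, pos, with_labels=True, width=widths, node_color="lightgray")
plt.savefig("fluxmap_sketch.png")
```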

HIVE: Handy Integration and Visualisation of multimodal Experimental Data

Filed under: Bioinformatics,Biomedical,Graphs,Mapping,Merging,Visualization — Patrick Durusau @ 2:05 pm

HIVE: Handy Integration and Visualisation of multimodal Experimental Data

From the webpage:

HIVE is an Add-on for the VANTED system. VANTED is a graph editor extended for the visualisation and analysis of biological experimental data in context of pathways/networks. HIVE stands for

Handy Integration and Visualisation of multimodal Experimental Data

and extends the functionality of VANTED by adding the handling of volumes and images, together with a workspace approach, allowing one to integrate data of different biological data domains.

You need to see the demo video to appreciate this application!

It offers import of data, mapping rules to merge data from different data sets, easy visualization as a graph and other features.

Did I mention it also has 3-D image techniques as well?

PS: Yes, it is another example of “Who moved my acronym?”

November 24, 2012

BIOKDD 2013: …Biological Knowledge Discovery and Data Mining

Filed under: Bioinformatics,Biomedical,Conferences,Data Mining,Knowledge Discovery — Patrick Durusau @ 11:24 am

BIOKDD 2013 : 4th International Workshop on Biological Knowledge Discovery and Data Mining

When Aug 26, 2013 – Aug 30, 2013
Where Prague, Czech Republic
Abstract Registration Due Apr 3, 2013
Submission Deadline Apr 10, 2013
Notification Due May 10, 2013
Final Version Due May 20, 2013

From the call for papers:

With the development of Molecular Biology during the last decades, we are witnessing an exponential growth of both the volume and the complexity of biological data. For example, the Human Genome Project provided the sequence of the 3 billion DNA bases that constitute the human genome. And, consequently, we are provided too with the sequences of about 100,000 proteins. Therefore, we are entering the post-genomic era: after having focused so many efforts on the accumulation of data, we have now to focus as much effort, and even more, on the analysis of these data. Analyzing this huge volume of data is a challenging task because, not only, of its complexity and its multiple and numerous correlated factors, but also, because of the continuous evolution of our understanding of the biological mechanisms. Classical approaches of biological data analysis are no longer efficient and produce only a very limited amount of information, compared to the numerous and complex biological mechanisms under study. From here comes the necessity to use computer tools and develop new in silico high performance approaches to support us in the analysis of biological data and, hence, to help us in our understanding of the correlations that exist between, on one hand, structures and functional patterns of biological sequences and, on the other hand, genetic and biochemical mechanisms. Knowledge Discovery and Data Mining (KDD) are a response to these new trends.

Topics of the BIOKDD’13 workshop include, but are not limited to:

Data Preprocessing: Biological Data Storage, Representation and Management (data warehouses, databases, sequences, trees, graphs, biological networks and pathways, …), Biological Data Cleaning (errors removal, redundant data removal, completion of missing data, …), Feature Extraction (motifs, subgraphs, …), Feature Selection (filter approaches, wrapper approaches, hybrid approaches, embedded approaches, …)

Data Mining: Biological Data Regression (regression of biological sequences…), Biological data clustering/biclustering (microarray data biclustering, clustering/biclustering of biological sequences, …), Biological Data Classification (classification of biological sequences…), Association Rules Learning from Biological Data, Text mining and Application to Biological Sequences, Web mining and Application to Biological Data, Parallel, Cloud and Grid Computing for Biological Data Mining

Data Postprocessing: Biological Nuggets of Knowledge Filtering, Biological Nuggets of Knowledge Representation and Visualization, Biological Nuggets of Knowledge Evaluation (calculation of the classification error rate, evaluation of the association rules via numerical indicators, e.g. measurements of interest, … ), Biological Nuggets of Knowledge Integration

Being held in conjunction with 24th International Conference on Database and Expert Systems Applications – DEXA 2013.

In case you are wondering about BIOKDD, consider the BIOKDD Programme for 2012.

Or the DEXA program for 2012.

Looks like a very strong set of conferences and workshops.

November 22, 2012

Developing a biocuration workflow for AgBase… [Authoring Interfaces]

Filed under: Bioinformatics,Biomedical,Curation,Genomics,Text Mining — Patrick Durusau @ 9:50 am

Developing a biocuration workflow for AgBase, a non-model organism database by Lakshmi Pillai, Philippe Chouvarine, Catalina O. Tudor, Carl J. Schmidt, K. Vijay-Shanker and Fiona M. McCarthy.

Abstract:

AgBase provides annotation for agricultural gene products using the Gene Ontology (GO) and Plant Ontology, as appropriate. Unlike model organism species, agricultural species have a body of literature that does not just focus on gene function; to improve efficiency, we use text mining to identify literature for curation. The first component of our annotation interface is the gene prioritization interface that ranks gene products for annotation. Biocurators select the top-ranked gene and mark annotation for these genes as ‘in progress’ or ‘completed’; links enable biocurators to move directly to our biocuration interface (BI). Our BI includes all current GO annotation for gene products and is the main interface to add/modify AgBase curation data. The BI also displays Extracting Genic Information from Text (eGIFT) results for each gene product. eGIFT is a web-based, text-mining tool that associates ranked, informative terms (iTerms) and the articles and sentences containing them, with genes. Moreover, iTerms are linked to GO terms, where they match either a GO term name or a synonym. This enables AgBase biocurators to rapidly identify literature for further curation based on possible GO terms. Because most agricultural species do not have standardized literature, eGIFT searches all gene names and synonyms to associate articles with genes. As many of the gene names can be ambiguous, eGIFT applies a disambiguation step to remove matches that do not correspond to this gene, and filtering is applied to remove abstracts that mention a gene in passing. The BI is linked to our Journal Database (JDB) where corresponding journal citations are stored. Just as importantly, biocurators also add to the JDB citations that have no GO annotation. The AgBase BI also supports bulk annotation upload to facilitate our Inferred from electronic annotation of agricultural gene products. All annotations must pass standard GO Consortium quality checking before release in AgBase.

Database URL: http://www.agbase.msstate.edu/

Another approach to biocuration. I will be posting on eGIFT separately, but do note this is a domain-specific tool.

The authors did not set out to create the universal curation tool but one suited to their specific data and requirements.

I think there is an important lesson here for semantic authoring interfaces. Word processors offer very generic interfaces but consequently little in the way of structure. Authoring annotated information requires more structure and that requires domain specifics.

Now there is an idea: create topic map authoring interfaces on top of a common skeleton, instead of hard-coding how users “should” use the tool.
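
Something like the following skeleton, where the workflow is shared and the domain specifics are supplied by subclasses (all names invented, nothing from AgBase):

```python
from abc import ABC, abstractmethod

class AuthoringSkeleton(ABC):
    """Shared workflow: prioritize items, then annotate each one."""

    def run(self, items):
        for item in self.prioritize(items):
            self.annotate(item)

    @abstractmethod
    def prioritize(self, items): ...

    @abstractmethod
    def annotate(self, item): ...

class GeneCurationInterface(AuthoringSkeleton):
    """Domain specifics: rank by a text-mining score, then curate."""

    def prioritize(self, items):
        return sorted(items, key=lambda g: g["score"], reverse=True)

    def annotate(self, item):
        print(f"curating {item['gene']} (score {item['score']})")

GeneCurationInterface().run([{"gene": "shh", "score": 0.4},
                             {"gene": "pax6", "score": 0.9}])
```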

November 17, 2012

Visualising associations between paired ‘omics’ data sets

Visualising associations between paired ‘omics’ data sets by Ignacio González, Kim-Anh Lê Cao, Melissa J Davis and Sébastien Déjean.

Abstract:

Background

Each omics platform is now able to generate a large amount of data. Genomics, proteomics, metabolomics, interactomics are compiled at an ever increasing pace and now form a core part of the fundamental systems biology framework. Recently, several integrative approaches have been proposed to extract meaningful information. However, these approaches lack visualisation outputs to fully unravel the complex associations between different biological entities.

Results

The multivariate statistical approaches ‘regularized Canonical Correlation Analysis’ and ‘sparse Partial Least Squares regression’ were recently developed to integrate two types of highly dimensional ‘omics’ data and to select relevant information. Using the results of these methods, we propose to revisit a few graphical outputs to better understand the relationships between two ‘omics’ data sets and to better visualise the correlation structure between the different biological entities. These graphical outputs include Correlation Circle plots, Relevance Networks and Clustered Image Maps. We demonstrate the usefulness of such graphical outputs on several biological data sets and further assess their biological relevance using gene ontology analysis.

Conclusions

Such graphical outputs are undoubtedly useful to aid the interpretation of these promising integrative analysis tools and will certainly help in addressing fundamental biological questions and understanding systems as a whole.

Availability

The graphical tools described in this paper are implemented in the freely available R package mixOmics and in its associated web application.

Just in case you are looking for something a little more challenging this weekend than political feeds on Twitter. 😉

Is “higher dimensional” data everywhere? Just more obvious in the biological sciences?

If so, there are lessons here for manipulation/visualization of higher dimensional data in other areas as well.
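
As a taste of what such graphical outputs involve, here is a bare-bones relative of the Clustered Image Map (minus the clustering): cross-correlations between two synthetic ‘omics’ blocks, drawn as a heatmap with numpy and matplotlib.

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
genes = rng.normal(size=(30, 5))  # 30 samples x 5 genes (synthetic)
# Metabolites partly driven by the first three genes, plus noise.
metabolites = genes[:, :3] @ rng.normal(size=(3, 4)) + rng.normal(size=(30, 4))

# Cross-correlations between the two data types.
corr = np.array([[np.corrcoef(genes[:, i], metabolites[:, j])[0, 1]
                  for j in range(metabolites.shape[1])]
                 for i in range(genes.shape[1])])

plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.colorbar(label="Pearson r")
plt.xlabel("metabolites")
plt.ylabel("genes")
plt.savefig("cim_sketch.png")
```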

RMol: …SD/Molfile structure information into R Objects

Filed under: Bioinformatics,Biomedical,Cheminformatics — Patrick Durusau @ 2:17 pm

RMol: A Toolset for Transforming SD/Molfile structure information into R Objects by Martin Grabner, Kurt Varmuza and Matthias Dehmer.

Abstract:

Background

The graph-theoretical analysis of molecular networks has a long tradition in chemoinformatics. As demonstrated frequently, a well designed format to encode chemical structures and structure-related information of organic compounds is the Molfile format. But when it comes to use modern programming languages for statistical data analysis in Bio- and Chemoinformatics, R as one of the most powerful free languages lacks tools to process R Molfile data collections and import molecular network data into R.

Results

We design an R object which allows a lossless information mapping of structural information from Molfiles into R objects. This provides the basis to use the RMol object as an anchor for connecting Molfile data collections with R libraries for analyzing graphs. Associated with the RMol objects, a set of R functions completes the toolset to organize, describe and manipulate the converted data sets. Further, we bypass R-typical limits for manipulating large data sets by storing R objects in bz-compressed serialized files instead of employing RData files.

Conclusions

By design, RMol is an R tool set without dependencies on other libraries or programming languages. It is useful to integrate into pipelines for serialized batch analysis by using network data and, therefore, helps to process sdf-data sets in R efficiently. It is freely available under the BSD licence. The script source can be downloaded from http://sourceforge.net/p/rmol-toolset.

Important work, not the least because of the explosion of interest in bio/cheminformatics.

If I understand the rationale for the software, it:

  1. enables use of existing R tools for graph/network analysis
  2. fits well into workflows with serialized pipelines
  3. dependencies are reduced by extraction of SD-File information
  4. storing chemical and molecular network information in R objects avoids repetitive transformations

All of which are true but I have a nagging concern about the need for transformation.

Knowing the structure of Molfiles and the requirements of R tools for graph/network analysis, how are the results of transformation different from R tools viewing Molfiles “as if” they were composed of R objects?

The mapping is already well known, because that is what RMol uses to create the results of transformation. Moreover, for any particular use, more data may be transformed than is required for a particular analysis.

Not to take anything away from very useful work, but the days of transformation of data are numbered. As data sets grow in size, there will be fewer and fewer places to store a “transformed” data set.
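
The “view” alternative is easy to sketch. Using an invented, simplified Molfile-like text (real Molfiles have a different header layout), the contrast is between copying everything into new objects up front and parsing only what is asked for:

```python
# A made-up, simplified single-molecule record for illustration only.
MOLFILE = """\
aspirin
  2  1
C 0.0 0.0
O 1.0 0.0
  1  2  1
"""

class MolView:
    """Expose file content 'as if' it were native objects, on demand.

    Nothing is transformed or stored; each property parses the raw
    text only when accessed."""

    def __init__(self, text):
        self._lines = text.splitlines()

    @property
    def name(self):
        return self._lines[0]

    @property
    def atom_count(self):
        return int(self._lines[1].split()[0])  # parsed only when accessed

view = MolView(MOLFILE)
print(view.name, view.atom_count)  # aspirin 2
```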

BTW, pay particular attention to the bibliography in this paper. Numerous references to follow if you are interested in this area.

November 13, 2012

The Power of Graphs for Analyzing Biological Datasets

Filed under: Bioinformatics,Biomedical,Graphs — Patrick Durusau @ 6:23 am

The Power of Graphs for Analyzing Biological Datasets by Davy Suvee.

Very good slide deck on using graphs with biological datasets.

May give you some ideas of what capabilities you need to offer in this area.

November 12, 2012

KEGG: Kyoto Encyclopedia of Genes and Genomes

Filed under: Bioinformatics,Biomedical,Genome — Patrick Durusau @ 8:01 pm

KEGG: Kyoto Encyclopedia of Genes and Genomes

From the webpage:

KEGG is a database resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies (See Release notes for new and updated features).

Anyone in biological research is probably already using KEGG. Take the opportunity to educate yourself about this resource. In particular how to use it with other resources.
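
For instance, KEGG’s REST interface makes cross-resource linking scriptable. A sketch assuming the /link and /conv operations documented at rest.kegg.jp (check the current spec before use):

```python
import requests

BASE = "https://rest.kegg.jp"

def kegg(op, *args):
    """KEGG REST returns plain tab-separated text."""
    resp = requests.get(f"{BASE}/{op}/" + "/".join(args), timeout=30)
    resp.raise_for_status()
    return [line.split("\t") for line in resp.text.strip().splitlines()]

# Pathways containing human TP53 (hsa:7157)...
for gene, pathway in kegg("link", "pathway", "hsa:7157"):
    print(gene, "->", pathway)

# ...and its identifier in NCBI's gene database, a hook for crossing
# over to other resources.
print(kegg("conv", "ncbi-geneid", "hsa:7157"))
```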

The KEGG project, like any other project, needs funding. Consider passing this site along to funders interested in biological research resources.

I first saw this in a tweet by Anita de Waard.

An Ontological Representation of Biomedical Data Sources and Records [Data & Record as Subjects]

Filed under: Bioinformatics,Biomedical,Medical Informatics,Ontology,RDF — Patrick Durusau @ 7:27 pm

An Ontological Representation of Biomedical Data Sources and Records by Michael Bada, Kevin Livingston, and Lawrence Hunter.

Abstract:

Large RDF-triple stores have been the basis of prominent recent attempts to integrate the vast quantities of data in semantically divergent databases. However, these repositories often conflate data-source records, which are information content entities, and the biomedical concepts and assertions denoted by them. We propose an ontological model for the representation of data sources and their records as an extension of the Information Artifact Ontology. Using this model, we have consistently represented the contents of 17 prominent biomedical databases as a 5.6-billion RDF-triple knowledge base, enabling querying and inference over this large store of integrated data.

Recognition of the need to treat data containers as subjects, along with the data they contain, is always refreshing.

In particular because the evolution of data sources can be captured, as the authors remark:

Our ontology is fully capable of handling the evolution of data sources: If the schema of a given data set is changed, a new instance of the schema is simply created, along with the instances of the fields of the new schema. If the data sets of a data source change (or a new set is made available), an instance for each new data set can be created, along with instances for its schema and fields. (Modeling of incremental change rather than creation of new instances may be desirable but poses significant representational challenges.) Additionally, using our model, if a researcher wishes to work with multiple versions of a given data source (e.g., to analyze some aspect of multiple versions of a given database), an instance for each version of the data source can be created. If different versions of a data source consist of different data sets (e.g., different file organizations) and/or different schemas and fields, the explicit representation of all of these elements and their linkages will make the respective structures of the disparate data-source versions unambiguous. Furthermore, it may be the case that only a subset of a data source needs to be represented; in such a case, only instances of the data sets, schemas, and fields of interest are created.
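
The record/referent distinction at the heart of the model is easy to mock up with rdflib (URIs invented for illustration, not the paper’s actual ontology):

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/kb#")  # invented namespace
g = Graph()

record = EX.uniprot_record_P04637  # the information content entity
gene = EX.gene_TP53                # the biomedical entity it is about

g.add((record, RDF.type, EX.DataSourceRecord))
g.add((record, EX.partOfDataSet, EX.uniprot_2012_11))  # versioned data set
g.add((record, EX.denotes, gene))                      # record vs. referent
g.add((gene, RDFS.label, Literal("TP53")))

# Queries can now distinguish claims about the record (provenance, version)
# from claims about the gene itself.
for s, p, o in g.triples((record, None, None)):
    print(s, p, o)
```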

I first saw this in a tweet by Anita de Waard.

November 9, 2012

Semantic Technologies — Biomedical Informatics — Individualized Medicine

Filed under: Bioinformatics,Biomedical,Medical Informatics,Ontology,Semantic Web — Patrick Durusau @ 11:14 am

Joint Workshop on Semantic Technologies Applied to Biomedical Informatics and Individualized Medicine (SATBI+SWIM 2012) (In conjunction with International Semantic Web Conference (ISWC 2012) Boston, Massachusetts, U.S.A. November 11-15, 2012)

If you are at ISWC, consider attending.

To help with that choice, the accepted papers:

Jim McCusker, Jeongmin Lee, Chavon Thomas and Deborah L. McGuinness. Public Health Surveillance Using Global Health Explorer. [PDF]

Anita de Waard and Jodi Schneider. Formalising Uncertainty: An Ontology of Reasoning, Certainty and Attribution (ORCA). [PDF]

Alexander Baranya, Luis Landaeta, Alexandra La Cruz and Maria-Esther Vidal. A Workflow for Improving Medical Visualization of Semantically Annotated CT-Images. [PDF]

Derek Corrigan, Jean Karl Soler and Brendan Delaney. Development of an Ontological Model of Evidence for TRANSFoRm Utilizing Transition Project Data. [PDF]

Amina Chniti, Abdelali BOUSSADI, Patrice DEGOULET, Patrick Albert and Jean Charlet. Pharmaceutical Validation of Medication Orders Using an OWL Ontology and Business Rules. [PDF]

October 19, 2012

Random Forest Methodology – Bioinformatics

Filed under: Bioinformatics,Biomedical,Random Forests — Patrick Durusau @ 3:47 pm

Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics by Anne-Laure Boulesteix, Silke Janitza, Jochen Kruppa, Inke R. König

(Boulesteix, A.-L., Janitza, S., Kruppa, J. and König, I. R. (2012), Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics. WIREs Data Mining Knowl Discov, 2: 493–507. doi: 10.1002/widm.1072)

Abstract:

The random forest (RF) algorithm by Leo Breiman has become a standard data analysis tool in bioinformatics. It has shown excellent performance in settings where the number of variables is much larger than the number of observations, can cope with complex interaction structures as well as highly correlated variables and return measures of variable importance. This paper synthesizes 10 years of RF development with emphasis on applications to bioinformatics and computational biology. Special attention is paid to practical aspects such as the selection of parameters, available RF implementations, and important pitfalls and biases of RF and its variable importance measures (VIMs). The paper surveys recent developments of the methodology relevant to bioinformatics as well as some representative examples of RF applications in this context and possible directions for future research.

Something to expand your horizons a bit.

And a new way to say “curse of dimensionality,” to wit:

‘n ≪ p curse’

New to me anyway.
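
Here is the n ≪ p setting in miniature, with scikit-learn and synthetic data (my sketch, not the paper’s): far more variables than observations, and the forest’s variable importance measures still, usually, recover the signal.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n, p = 40, 500                            # n << p
X = rng.normal(size=(n, p))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # only the first two variables matter

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# The informative variables should (usually) surface at the top of the VIMs.
top = np.argsort(rf.feature_importances_)[::-1][:5]
print(top, rf.feature_importances_[top])
```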

I was amused to read at the Wikipedia article on random forests that its disadvantages include:

Unlike decision trees, the classifications made by Random Forests are difficult for humans to interpret.

Turnabout is fair play, since many classifications made by humans are difficult for computers to interpret. 😉

October 16, 2012

The “O” Word (Ontology) Isn’t Enough

Filed under: Bioinformatics,Biomedical,Gene Ontology,Genome,Medical Informatics,Ontology — Patrick Durusau @ 10:36 am

The Units Ontology makes reference to the Gene Ontology as an example of a successful web ontology effort.

As it should. The Gene Ontology (GO) is the only successful web ontology effort. A universe with one (1) inhabitant.

The GO has a number of differences from wannabe successful ontology candidates (see the article below).

The first difference echoes loudly across the semantic engineering universe:

One of the factors that account for GO’s success is that it originated from within the biological community rather than being created and subsequently imposed by external knowledge engineers. Terms were created by those who had expertise in the domain, thus avoiding the huge effort that would have been required for a computer scientist to learn and organize large amounts of biological functional information. This also led to general acceptance of the terminology and its organization within the community. This is not to say that there have been no disagreements among biologists over the conceptualization, and there is of course a protocol for arriving at a consensus when there is such a disagreement. However, a model of a domain is more likely to conform to the shared view of a community if the modelers are within or at least consult to a large degree with members of that community.

Did you catch that first line?

One of the factors that account for GO’s success is that it originated from within the biological community rather than being created and subsequently imposed by external knowledge engineers.

Saying the “O” word, ontology, that will benefit everyone if they will just listen to you, isn’t enough.

There are other factors to consider:

A Short Study on the Success of the Gene Ontology by Michael Bada, Robert Stevens, Carole Goble, Yolanda Gil, Michael Ashburner, Judith A. Blake, J. Michael Cherry, Midori Harris, Suzanna Lewis.

Abstract:

While most ontologies have been used only by the groups who created them and for their initially defined purposes, the Gene Ontology (GO), an evolving structured controlled vocabulary of nearly 16,000 terms in the domain of biological functionality, has been widely used for annotation of biological-database entries and in biomedical research. As a set of learned lessons offered to other ontology developers, we list and briefly discuss the characteristics of GO that we believe are most responsible for its success: community involvement; clear goals; limited scope; simple, intuitive structure; continuous evolution; active curation; and early use.
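To make "simple, intuitive structure" concrete: GO is a directed acyclic graph of terms connected mostly by is_a relations, and an annotation to a term implies annotation to every ancestor of that term (the "true path rule"). A minimal sketch, mine rather than the paper's; the two term IDs are real GO terms, but the code is purely illustrative:

```python
# Illustrative sketch of GO's DAG structure (not from the paper).
# GO:0008152 (metabolic process) is_a GO:0008150 (biological_process);
# annotations propagate upward to every ancestor.
terms = {
    "GO:0008150": {"name": "biological_process", "is_a": []},
    "GO:0008152": {"name": "metabolic process", "is_a": ["GO:0008150"]},
}

def ancestors(term_id):
    """Return the transitive is_a closure of a term."""
    seen = set()
    stack = list(terms[term_id]["is_a"])
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(terms[parent]["is_a"])
    return seen

# A gene annotated to "metabolic process" is implicitly annotated
# to "biological_process" as well.
print(ancestors("GO:0008152"))   # {'GO:0008150'}
```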

October 15, 2012

Information needs of public health practitioners: a review of the literature [Indexing Needs]

Filed under: Biomedical,Indexing,Medical Informatics,Searching — Patrick Durusau @ 4:37 am

Information needs of public health practitioners: a review of the literature by Jennifer Ford and Helena Korjonen.

Abstract:

Objective

To review published literature covering the information needs of public health practitioners and papers highlighting gaps and potential solutions in order to summarise what is already known about this population and models tested to support them.

Methods

The search strategy included bibliographic databases LISTA, LISA, PubMed and Web of Knowledge. The results of this literature review were used to create two tables displaying published literature.

Findings

The literature highlighted that some research has taken place into different public health subgroups with consistent findings. Gaps in information provision have also been identified by looking at the information services provided.

Conclusion

There is a need for further research into the information needs of subgroups of public health practitioners, as this group is diverse and requires varying information. Models of informatics that can support public health must be developed and published so that the public health information community can share experiences and solutions and begin to build an evidence base to produce superior information systems for the goal of a healthier society.

One of the key points for topic map advocates:

The need for improved indexing of public health information was highlighted by Alpi, discussing the role of expert searching in public health information retrieval.2 Existing taxonomies such as the MeSH system used by PubMed/Medline are perceived as inadequate for indexing the breadth of public health literature and are seen to be too clinically focussed.2 There is also concern at the lack of systematic indexing of grey literature.2 Given that more than one study has highlighted the high level of use of grey literature by public health practitioners, this lack of indexing should be of real concern to public health information specialists and practitioners. LaPelle also found that participants in her research had experienced difficulties with search terms for public health which is indicative of the current inadequacy of public health indexing.1

Other opportunities for topic maps are apparent in the literature review, but inadequate indexing should be topic maps' bread and butter.
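For the curious, here is what that bread and butter looks like in miniature (my sketch, not a real topic map engine; every identifier below is hypothetical): two index entries for the same public health subject, one from MeSH and one from a grey-literature scheme, merge once they share a subject identifier.

```python
# Hedged sketch, not a real topic map engine: two index entries for
# the same public-health subject merge when their subject identifiers
# overlap. All identifiers below are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Topic:
    names: set = field(default_factory=set)
    subject_ids: set = field(default_factory=set)

def merge_if_same_subject(a, b):
    """Merge two topics when they share at least one subject identifier."""
    if a.subject_ids & b.subject_ids:
        return Topic(a.names | b.names, a.subject_ids | b.subject_ids)
    return None

mesh = Topic({"Health Promotion"},
             {"http://example.org/mesh/health-promotion"})       # hypothetical
grey = Topic({"community health campaigns"},
             {"http://example.org/mesh/health-promotion",
              "http://example.org/grey-lit/campaigns"})           # hypothetical

merged = merge_if_same_subject(mesh, grey)
print(merged.names)   # both names now index the same subject
```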

October 13, 2012

Bio4j 0.8 is here!

Filed under: Bio4j,Bioinformatics,Biomedical,Genome — Patrick Durusau @ 1:57 pm

Bio4j 0.8 is here! by Pablo Pareja Tobes.

You will find “5.488.000 new proteins and 3.233.000 genes” and other improvements!

Whether you are interested in graph databases (Neo4j), bioinformatics or both, this is welcome news!

October 12, 2012

PathNet: A tool for pathway analysis using topological information

Filed under: Bioinformatics,Biomedical,Genome,Graphs,Networks — Patrick Durusau @ 3:12 pm

PathNet: A tool for pathway analysis using topological information by Bhaskar Dutta, Anders Wallqvist and Jaques Reifman. (Source Code for Biology and Medicine 2012, 7:10 doi:10.1186/1751-0473-7-10)

Abstract:

Background

Identification of canonical pathways through enrichment of differentially expressed genes in a given pathway is a widely used method for interpreting gene lists generated from high-throughput experimental studies. However, most algorithms treat pathways as sets of genes, disregarding any inter- and intra-pathway connectivity information, and do not provide insights beyond identifying lists of pathways.

Results

We developed an algorithm (PathNet) that utilizes the connectivity information in canonical pathway descriptions to help identify study-relevant pathways and characterize non-obvious dependencies and connections among pathways using gene expression data. PathNet considers both the differential expression of genes and their pathway neighbors to strengthen the evidence that a pathway is implicated in the biological conditions characterizing the experiment. As an adjunct to this analysis, PathNet uses the connectivity of the differentially expressed genes among all pathways to score pathway contextual associations and statistically identify biological relations among pathways. In this study, we used PathNet to identify biologically relevant results in two Alzheimer's disease microarray datasets, and compared its performance with existing methods. Importantly, PathNet identified deregulation of the ubiquitin-mediated proteolysis pathway as an important component in Alzheimer's disease progression, despite the absence of this pathway in the standard enrichment analyses.

Conclusions

PathNet is a novel method for identifying enrichment and association between canonical pathways in the context of gene expression data. It takes into account topological information present in pathways to reveal biological information. PathNet is available as an R workspace image from http://www.bhsai.org/downloads/pathnet/.

Important work for genomics but also a reminder that a list of paths is just that, a list of paths.

The value-add and creative aspect of data analysis is in the scoring of those paths in order to wring more information from them.
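As a toy illustration of that kind of scoring (my sketch, not PathNet's actual algorithm, which is distributed as the R workspace linked above): combine each gene's own evidence with evidence from its pathway neighbors, so that topology, not just set membership, shapes the result.

```python
# Toy sketch of topology-aware scoring (not PathNet's actual algorithm).
# Each gene's evidence is boosted by its differentially expressed
# neighbors. The pathway topology and scores are hypothetical.
pathway_edges = {
    "A": ["B", "C"],
    "B": ["A"],
    "C": ["A"],
    "D": [],               # isolated gene
}
direct_evidence = {"A": 0.2, "B": 0.9, "C": 0.8, "D": 0.9}  # e.g. 1 - p-value

def combined_score(gene, alpha=0.5):
    """Weight direct evidence against the mean evidence of neighbors."""
    neighbors = pathway_edges[gene]
    if not neighbors:
        return direct_evidence[gene]
    neighbor_mean = sum(direct_evidence[n] for n in neighbors) / len(neighbors)
    return alpha * direct_evidence[gene] + (1 - alpha) * neighbor_mean

for g in pathway_edges:
    print(g, round(combined_score(g), 2))
# Gene A's weak direct evidence is strengthened by its strongly
# expressed neighbors -- a plain gene-set method would ignore them.
```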

How is it for you? Just lists of paths or something a bit more clever?

September 23, 2012

Working More Effectively With Statisticians

Filed under: Bioinformatics,Biomedical,Data Quality,Statistics — Patrick Durusau @ 10:33 am

Working More Effectively With Statisticians by Deborah M. Anderson. (Fall 2012 Newsletter of Society for Clinical Data Management, pages 5-8)

Abstract:

The role of the clinical trial biostatistician is to lend scientific expertise to the goal of demonstrating safety and efficacy of investigative treatments. Their success, and the outcome of the clinical trial, is predicated on adequate data quality, among other factors. Consequently, the clinical data manager plays a critical role in the statistical analysis of clinical trial data. In order to better fulfill this role, data managers must work together with the biostatisticians and be aligned in their understanding of data quality. This article proposes ten specific recommendations for data managers in order to facilitate more effective collaboration with biostatisticians.

See the article for the details but the recommendations are generally applicable to all data collection projects:

Recommendation #1: Communicate early and often with the biostatistician and provide frequent data extracts for review.

Recommendation #2: Employ caution when advising sites or interactive voice/web response (IVR/IVW) vendors on handling of randomization errors.

Recommendation #3: Collect the actual investigational treatment and dose group for each subject.

Recommendation #4: Think carefully and consult the biostatistician about the best way to structure investigational treatment exposure and accountability data.

Recommendation #5: Clarify in electronic data capture (EDC) specifications whether a question is only a “prompt” screen or whether the answer to the question will be collected explicitly in the database.

Recommendation #6: Recognize the most critical data items from a statistical analysis perspective and apply the highest quality standards to them.

Recommendation #7: Be alert to protocol deviations/violations (PDVs).

Recommendation #8: Plan for a database freeze and final review before database lock.

Recommendation #9: Archive a snapshot of the clinical database at key analysis milestones and at the end of the study (see the sketch after this list).

Recommendation #10: Educate yourself about fundamental statistical principles whenever the opportunity arises.
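Recommendation #9 is the easiest to automate. A hedged sketch (my illustration, not from the article; file paths and names are hypothetical) of a timestamped, checksummed snapshot of a study database export:

```python
# Hedged sketch for Recommendation #9 (my illustration, not from the
# article): archive a timestamped, checksummed copy of a database
# export at each analysis milestone. Paths are hypothetical.
import hashlib
import shutil
from datetime import datetime, timezone
from pathlib import Path

def archive_snapshot(export_file, archive_dir, milestone):
    """Copy the export to the archive and record its SHA-256 hash."""
    export = Path(export_file)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dest = Path(archive_dir) / f"{export.stem}_{milestone}_{stamp}{export.suffix}"
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(export, dest)
    digest = hashlib.sha256(dest.read_bytes()).hexdigest()
    # A sidecar hash file lets anyone verify the snapshot later.
    (dest.parent / (dest.name + ".sha256")).write_text(f"{digest}  {dest.name}\n")
    return dest

# archive_snapshot("study_db_export.csv", "archive/", "interim-analysis")
```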

I first saw this at John Johnson’s Data cleaning is harder than statistical analysis.

September 10, 2012

Mapping solution to heterogeneous data sources

Filed under: Bioinformatics,Biomedical,Genome,Heterogeneous Data,Mapping — Patrick Durusau @ 2:21 pm

dbSNO: a database of cysteine S-nitrosylation by Tzong-Yi Lee, Yi-Ju Chen, Cheng-Tsung Lu, Wei-Chieh Ching, Yu-Chuan Teng, Hsien-Da Huang and Yu-Ju Chen. (Bioinformatics (2012) 28 (17): 2293-2295. doi: 10.1093/bioinformatics/bts436)

OK, the title doesn’t jump out and say “mapping solution here!” 😉

Reading a bit further, you discover that text mining is used to locate sequences and that the data is then mapped to “UniProtKB protein entries.”

The data set provides access to:

  • UniProt ID
  • Organism
  • Position
  • PubMed Id
  • Sequence

My concern is what happens, when X is mapped to a UniProtKB protein entry, to:

  • The prior identifier for X (in the article or source), and
  • The mapping from X to the UniProtKB protein entry?

If both of those are captured, then prior literature can be annotated upon rendering to point to later aggregation of information on a subject.

If the prior identifier, place of usage, the mapping, etc., are not captured, then prior literature, when we encounter it, remains frozen in time.

Mapping solutions work, but repay the effort several times over if the prior identifier and its mapping to the “new” identifier are captured as part of the process.
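Here is a minimal sketch of what capturing that looks like (my illustration; the field names and identifiers are hypothetical, not dbSNO's schema): keep the prior identifier, where it was used, and the mapping itself, not just the new identifier.

```python
# Hedged sketch (not dbSNO's schema): preserve the prior identifier,
# its place of usage, and the mapping -- not just the new identifier.
# All identifiers below are hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class IdentifierMapping:
    prior_id: str        # identifier as used in the source article
    source: str          # where the prior identifier appeared
    target_id: str       # e.g. a UniProtKB accession
    method: str          # how the correspondence was established

m = IdentifierMapping(
    prior_id="GAPDH_peptide_17",        # hypothetical
    source="PMID:00000000",             # hypothetical placeholder
    target_id="UniProtKB:P12345",       # hypothetical example accession
    method="sequence identity mapping",
)

# With the mapping preserved, older literature citing the prior
# identifier can be annotated to point at the aggregated entry.
print(f"{m.prior_id} ({m.source}) -> {m.target_id} via {m.method}")
```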

Abstract

Summary: S-nitrosylation (SNO), a selective and reversible protein post-translational modification that involves the covalent attachment of nitric oxide (NO) to the sulfur atom of cysteine, critically regulates protein activity, localization and stability. Due to its importance in regulating protein functions and cell signaling, a mass spectrometry-based proteomics method rapidly evolved to increase the dataset of experimentally determined SNO sites. However, there is currently no database dedicated to the integration of all experimentally verified S-nitrosylation sites with their structural or functional information. Thus, the dbSNO database is created to integrate all available datasets and to provide their structural analysis. Up to April 15, 2012, the dbSNO has manually accumulated >3000 experimentally verified S-nitrosylated peptides from 219 research articles using a text mining approach. To solve the heterogeneity among the data collected from different sources, the sequence identity of these reported S-nitrosylated peptides are mapped to the UniProtKB protein entries. To delineate the structural correlation and consensus motif of these SNO sites, the dbSNO database also provides structural and functional analyses, including the motifs of substrate sites, solvent accessibility, protein secondary and tertiary structures, protein domains and gene ontology.

Availability: The dbSNO is now freely accessible via http://dbSNO.mbc.nctu.edu.tw. The database content is regularly updated upon collecting new data obtained from continuously surveying research articles.

Contacts: francis@saturn.yu.edu.tw or yujuchen@gate.sinica.edu.tw.
