Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

November 16, 2011

“VCF annotation” with the NHLBI GO Exome Sequencing Project (JAX-WS)

Filed under: Annotation,Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 8:17 pm

“VCF annotation” with the NHLBI GO Exome Sequencing Project (JAX-WS) by Pierre Lindenbaum.

From the post:

The NHLBI Exome Sequencing Project (ESP) has released a web service to query their data. “The goal of the NHLBI GO Exome Sequencing Project (ESP) is to discover novel genes and mechanisms contributing to heart, lung and blood disorders by pioneering the application of next-generation sequencing of the protein coding regions of the human genome across diverse, richly-phenotyped populations and to share these datasets and findings with the scientific community to extend and enrich the diagnosis, management and treatment of heart, lung and blood disorders.”

In the current post, I’ll show how I’ve used this web service to annotate a VCF file with this information.

The web service provided by the ESP is based on the SOAP protocol.
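Pierre’s worked example is in Java (JAX-WS), but the shape of the task is simple enough to sketch in Python. A minimal sketch, assuming a lookup callable that wraps the ESP SOAP call (built, say, with a SOAP client library such as zeep against the ESP WSDL); the function names and pass-through annotation here are mine, not Pierre’s:

  def vcf_records(path):
      """Yield (chrom, pos, ref, alt) from the data lines of a VCF file."""
      with open(path) as fh:
          for line in fh:
              if line.startswith("#"):      # skip meta-information and header lines
                  continue
              chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
              yield chrom, int(pos), ref, alt

  def annotate(path, lookup):
      """Print each variant with whatever annotation `lookup` returns for it."""
      for chrom, pos, ref, alt in vcf_records(path):
          print(chrom, pos, ref, alt, lookup(chrom, pos))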

Important news/post for several reasons:

First and foremost, “for the potential to extend and enrich the diagnosis, management and treatment of heart, lung and blood disorders.”

Second, thanks to Pierre, we have a fully worked example of how to perform the annotation.

Last but not least, the NHLBI Exome Sequencing Project (ESP) did not try to go it alone on annotations. It did what it does well and then offered the data up for others to use and extend.

I can’t count the number of projects of varying sorts that I have seen that tried to do every feature, every annotation, every imaging, every transcription, on their own. All of them ended up being less than they could have been with greater openness.

I am not suggesting that vendors need to give away data. Vendors for the most part support all of us. It is disingenuous to pretend otherwise. So vendors making money means we get to pay our bills, buy books and computers, etc.

What I am suggesting is that vendors, researchers and users need to work (yelling at each other doesn’t count) towards commercially viable solutions that enable greater collaboration with regard to research and data.

Otherwise we will have impoverished data sets that are never quite what they could be, and vendors will charge many times over the real cost of developing data. Those two conditions don’t benefit anyone. “You, me, them.” (Blues Brothers) 😉

November 7, 2011

When Gamers Innovate


The problem (partially):

Typically, proteins have only one correct configuration. Trying to virtually simulate all of them to find the right one would require enormous computational resources and time.

On top of that there are factors concerning translational-regulation. As the protein chain is produced in a step-wise fashion on the ribosome, one end of a protein might start folding quicker and dictate how the opposite end should fold. Other factors to consider are chaperones (proteins which guide its misfolded partner into the right shape) and post-translation modifications (bits and pieces removed and/or added to the amino acids), which all make protein prediction even harder. That is why homology modelling or “machine learning” techniques tend to be more accurate. However, they all require similar proteins to be already analysed and cracked in the first place.

The solution:

Rather than locking another group of structural shamans in a basement to perform their biophysical black magic, the “Fold It” team created a game. It uses human brainpower, which is fuelled by high-octane logic and catalysed by giving it a competitive edge. Players challenge their three-dimensional problem-solving skills by trying to: 1) pack the protein 2) hide the hydrophobics and 3) clear the clashes.

Read the post or jump to the Foldit site.

Seems to me there are a lot of subject identity and relationship (association) issues that are a lot less complex than protein folding. Not that topic mappers should shy away from protein folding, but we should be more imaginative about our authoring interfaces. Yes?

November 5, 2011

Expression cartography of human tissues using self organizing maps

Filed under: Bioinformatics,Biomedical,Self Organizing Maps (SOMs),Self-Organizing — Patrick Durusau @ 6:39 pm

Expression cartography of human tissues using self organizing maps by Henry Wirth; Markus Löffler; Martin von Bergen; Hans Binder. (BMC Bioinformatics. 2011;12:306)

Abstract:

Parallel high-throughput microarray and sequencing experiments produce vast quantities of multidimensional data which must be arranged and analyzed in a concerted way. One approach to addressing this challenge is the machine learning technique known as self organizing maps (SOMs). SOMs enable a parallel sample- and gene-centered view of genomic data combined with strong visualization and second-level analysis capabilities. The paper aims at bridging the gap between the potency of SOM-machine learning to reduce dimension of high-dimensional data on one hand and practical applications with special emphasis on gene expression analysis on the other hand.

A nice introduction to self organizing maps (SOMs) in a bioinformatics context. Think of them as yet another way to discover subjects about which people want to make statements and to which data and analysis can be attached.
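If you want a feel for SOMs before tackling the paper, here is a toy sketch using the third-party minisom package (my choice, not the authors’); the random matrix stands in for an expression profile:

  import numpy as np
  from minisom import MiniSom   # third-party: pip install minisom

  rng = np.random.default_rng(42)
  data = rng.normal(size=(100, 50))          # toy matrix: 100 samples x 50 genes

  som = MiniSom(10, 10, input_len=50, sigma=1.0, learning_rate=0.5)
  som.random_weights_init(data)
  som.train_random(data, num_iteration=1000)

  # each sample maps to its best-matching unit, giving a 2-D "expression portrait"
  positions = np.array([som.winner(x) for x in data])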

November 4, 2011

Paper about “BioStar” published in PLoS Computational Biology

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 6:09 pm

Paper about “BioStar” published in PLoS Computational Biology by Pierre Lindenbaum.

I have mentioned Biostar.

Pierre links to the paper, a blog entry about the paper and has collected tweets about it.

Be forewarned about the slides if you are sensitive to remarks comparing twelve year olds and politicians. Personally I think twelve year olds have just been insulted. 😉

October 28, 2011

Network Modeling and Analysis in Health Informatics and Bioinformatics (NetMAHIB)

Filed under: Bioinformatics,Biomedical,Health care — Patrick Durusau @ 3:14 pm

Network Modeling and Analysis in Health Informatics and Bioinformatics (NetMAHIB) Editor-in-Chief: Reda Alhajj, University of Calgary.

From Springer, a new journal of health informatics and bioinformatics.

From the announcement:

NetMAHIB publishes original research articles and reviews reporting how graph theory, statistics, linear algebra and machine learning techniques can be effectively used for modelling and knowledge discovery in health informatics and bioinformatics. It aims at creating a synergy between these disciplines by providing a forum for disseminating the latest developments and research findings; hence results can be shared with readers across institutions, governments, researchers, students, and the industry. The journal emphasizes fundamental contributions on new methodologies, discoveries and techniques that have general applicability and which form the basis for network based modelling and knowledge discovery in health informatics and bioinformatics.

The NetMAHIB journal is proud to have an outstanding group of editors who widely and rigorously cover the multidisciplinary scope of the journal. They are known to be research leaders in the field of Health Informatics and Bioinformatics. Further, the NetMAHIB journal is characterized by providing thorough constructive reviews by experts in the field and by the reduced turn-around time which allows research results to be disseminated and shared on a timely basis. The target of the editors is to complete the first round of the refereeing process within about 8 to 10 weeks of submission. Accepted papers go to the online first list and are immediately made available for access by the research community.

October 25, 2011

Adding bed/wig data to dalliance genome browser

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 7:34 pm

Adding bed/wig data to dalliance genome browser

From the post:

I have been playing a bit with the dalliance genome browser. It is quite useful and I have started using it to generate links to send to researchers to show regions of interest we find from bioinformatics analyses.

I added a document to my github repo describing how to display a bed file in the browser. That rst is here and displayed inline below.

It uses the UCSC binaries for creating BigWig/BigBed files because dalliance can request a subset of the data without downloading the entire file given the correct apache configuration (also described below).

This will require a recent version of dalliance because there was a bug in the BigBed parsing until recently.

Dalliance Data Tutorial

dalliance is a web-based scrolling genome-browser. It can display data from remote DAS servers or local or remote BigWig or BigBed files.

This will cover how to set up an html page that links to remote DAS services. It will also show how to create and serve BigWig and BigBed files.

Obviously of interest to the bioinformatics community (who are no doubt already aware of it) but I wanted to point out the ability to display data from remote servers/data sets.
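For the record, the UCSC conversion step is a one-liner per file. A sketch (the file names are placeholders; the UCSC binaries and a chrom.sizes file for your genome build must already be on hand):

  import subprocess

  # build range-addressable binary tracks from text formats with the UCSC utilities
  subprocess.run(["wigToBigWig", "coverage.wig", "hg19.chrom.sizes", "coverage.bw"], check=True)
  subprocess.run(["bedToBigBed", "regions.bed", "hg19.chrom.sizes", "regions.bb"], check=True)

The payoff, as the post notes, is that a client can then fetch just the byte ranges it needs instead of the whole file.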

Humanizing Bioinformatics

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 7:33 pm

Humanizing Bioinformatics by Saaien Tist.

From the post:

I was invited last week to give a talk at this year’s meeting of the Graduate School Structure and Function of Biological Macromolecules, Bioinformatics and Modeling (SFMBBM). It ended up being a day with great talks, by some bright PhD students and postdocs. There were 2 keynotes (one by Prof Bert Poolman from Groningen (NL) and one by myself), and a panel discussion on what the future holds for people nearing the end of their PhDs.

My talk was titled “Humanizing Bioinformatics” and was received quite well (at least some people still laughed at my jokes (if you can call them that); even at the end). I put the slides up on slideshare, but I thought I’d explain things here as well, because those slides will probably not convey the complete story.

Let’s ruin the plot by mentioning it here: we need data visualization to counteract the alienation that’s happening between bioinformaticians and bright data miners on the one hand, and the user/clinician/biologist on the other. We need to make bioinformatics human again. (emphasis in original)

I just wish there had been a video recording of this presentation!

Questions:

  1. Do you agree with the issues that Saaien raises? Are there more that you would raise? 2-3 pages (no citations)
  2. Have “semantics” become what can be evaluated by a computer? Pick yes, no, undecided and cite web examples for your position. 2-3 pages
  3. How much do you trust the answers to your searches? (Classroom discussion question.)

October 19, 2011

MyBioSoftware

Filed under: Bioinformatics,Biomedical,Software — Patrick Durusau @ 3:16 pm

MyBioSoftware: Bioinformatics Software Blog

From the blog:

My Biosoftware Blog supplies free bioinformatics software for biology scientists, every day.

Impressive listing of bioinformatics software. Not my area (by training). It is one in which I am interested because of the rapid development of data analysis techniques, which may be applicable more broadly.

Question/Task: Select any two software packages in a category and document the output formats they support. It would be useful to have a chart of the formats supported in each category; it may uncover places where interchange isn’t easy or perhaps even possible. A quick sketch of building such a chart follows.
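This is stdlib-only Python; the package names and formats are made up to show the shape of the chart:

  import csv

  # hypothetical survey results: package -> output formats it supports
  support = {
      "package-a": {"FASTA", "GenBank"},
      "package-b": {"FASTA", "BED"},
  }
  formats = sorted(set().union(*support.values()))
  with open("format_chart.csv", "w", newline="") as fh:
      writer = csv.writer(fh)
      writer.writerow(["package"] + formats)
      for pkg, fmts in sorted(support.items()):
          writer.writerow([pkg] + ["yes" if f in fmts else "" for f in formats])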

Knime4Bio:…Next Generation Sequencing data with KNIME

Filed under: Bioinformatics,Biomedical,Data Mining — Patrick Durusau @ 3:15 pm

Knime4Bio:…Next Generation Sequencing data with KNIME by Pierre Lindenbaum, Solena Le Scouarnec, Vincent Portero and Richard Redon.

Abstract:

Analysing large amounts of data generated by next-generation sequencing (NGS) technologies is difficult for researchers or clinicians without computational skills. They are often compelled to delegate this task to computer biologists working with command line utilities. The availability of easy-to-use tools will become essential with the generalisation of NGS in research and diagnosis. It will enable investigators to handle much more of the analysis. Here, we describe Knime4Bio, a set of custom nodes for the KNIME (The Konstanz Information Miner) interactive graphical workbench, for the interpretation of large biological datasets. We demonstrate that this tool can be utilised to quickly retrieve previously published scientific findings.

Code: http://code.google.com/p/knime4bio/

While I applaud the trend towards “easy-to-use” software, I do worry about results that are returned by automated analysis, which of course “must be true.”

I am mindful of the four-year-old whose name was on a terrorist watch list and so delayed the departure of a plane. The ground personnel lacked the moral courage or judgement to act on what was clearly a case of mistaken identity.

As “bigdata” grows ever larger, I wonder if “easy” interfaces will really be facile interfaces, whose results we lack the courage (or skill?) to question.

October 18, 2011

Computational Omics and Systems Biology Group

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 2:41 pm

Computational Omics and Systems Biology Group

From the webpage:

Introduction

The Computational Omics and Systems Biology Group, headed by Prof. Dr. Lennart Martens, is part of the Department of Biochemistry of the Faculty of Medicine and Health Sciences of Ghent University, and the Department of Medical Protein Research of VIB, both in Ghent, Belgium.

The group has its roots in Ghent, but has active members all over Europe, and specializes in the management, analysis and integration of high-throughput data (as obtained from various Omics approaches) with an aim towards establishing solid data stores, processing methods and tools to enable downstream systems biology research.

A major source of open source software, standards and other work.

October 15, 2011

Making Sense of Unstructured Data in Medicine Using Ontologies – October 19th

Filed under: Bioinformatics,Biomedical,Ontology — Patrick Durusau @ 4:30 pm

From the email announcement:

The next NCBO Webinar will be presented by Dr. Nigam Shah from Stanford University on “Making Sense of Unstructured Data in Medicine Using Ontologies” at 10:00am PT, Wednesday, October 19. Below is information on how to join the online meeting via WebEx and accompanying teleconference. For the full schedule of the NCBO Webinar presentations see: http://www.bioontology.org/webinar-series.

ABSTRACT:

Changes in biomedical science, public policy, information technology, and electronic heath record (EHR) adoption have converged recently to enable a transformation in the delivery, efficiency, and effectiveness of health care. While analyzing structured electronic records have proven useful in many different contexts, the true richness and complexity of health records—roughly 80 percent—lies within the clinical notes, which are free-text reports written by doctors and nurses in their daily practice. We have developed a scalable annotation and analysis workflow that uses public biomedical ontologies and is based on the term recognition tools developed by the National Center for Biomedical Ontology (NCBO). This talk will discuss the applications of this workflow to 9.5 million clinical documents—from the electronic health records of approximately one million adult patients from the STRIDE Clinical Data Warehouse—to identify statistically significant patterns of drug use and to conduct drug safety surveillance. For the patterns of drug use, we validate the usage patterns learned from the data against FDA-approved indications as well as external sources of known off-label use such as Medi-Span. For drug safety surveillance, we show that drug–disease co-occurrences and the temporal ordering of drugs and disease mentions in clinical notes can be examined for statistical enrichment and used to detect potential adverse events.
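If you want to poke at the NCBO Annotator that underlies this workflow, a request looked roughly like the following at the time. The endpoint and parameter names are as I recall them from that era’s REST documentation (note the “longest only” parameter also mentioned in the BioPortal 3.2 release notes below), so verify them against the current NCBO docs before relying on this:

  import requests

  resp = requests.post(
      "http://rest.bioontology.org/obs/annotator",
      data={
          "textToAnnotate": "Patient started on warfarin for atrial fibrillation.",
          "apikey": "YOUR-NCBO-API-KEY",   # free key from bioontology.org
          "longestOnly": "true",
      },
  )
  print(resp.status_code, resp.text[:200])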

WEBEX DETAILS:
——————————————————-
To join the online meeting (Now from mobile devices!)
——————————————————-
1. Go to https://stanford.webex.com/stanford/j.php?ED=108527772&UID=0&PW=NZDdmNWNjOGMw&RT=MiM0
2. If requested, enter your name and email address.
3. If a password is required, enter the meeting password: ncbo
4. Click “Join”.

——————————————————-
To join the audio conference only
——————————————————-
To receive a call back, provide your phone number when you join the meeting, or call the number below and enter the access code.
Call-in toll number (US/Canada): 1-650-429-3300
Global call-in numbers: https://stanford.webex.com/stanford/globalcallin.php?serviceType=MC&ED=108527772&tollFree=0

Access code: 929 613 752

October 10, 2011

Bio4jExplorer

Filed under: Bio4j,Bioinformatics,Biomedical,Cloud Computing,Graphs — Patrick Durusau @ 6:17 pm

Bio4jExplorer: familiarize yourself with Bio4j nodes and relationships

From the post:

I just uploaded a new tool aimed to be used both as a reference manual and initial contact for Bio4j domain model: Bio4jExplorer

Bio4jExplorer allows you to:

  • Navigate through all nodes and relationships
  • Access the javadocs of any node or relationship
  • Graphically explore the neighbourhood of a node/relationship
  • Look up for the different indexes that may serve as an entry point for a node
  • Check incoming/outgoing relationships of a specific node
  • Check start/end nodes of a specific relationship

And take note:

For those interested in how this was done, on the server side I created an AWS SimpleDB database holding all the information about the model of Bio4j, i.e. everything regarding nodes, relationships, indexes (here you can check the program used for creating this database using the java aws sdk).

Meanwhile, in the client side I used Flare prefuse AS3 library for the graph visualization.
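The post’s server side uses the Java AWS SDK; just to make the idea concrete, here is the same pattern sketched in Python with boto 2.x. The domain name and attribute layout are my guesses, not Bio4jExplorer’s actual schema:

  import boto   # boto 2.x

  sdb = boto.connect_sdb()                   # AWS credentials from the environment
  domain = sdb.create_domain("bio4j-model")  # hypothetical domain name

  # one SimpleDB item per node type, with relationships and indexes as attributes
  domain.put_attributes("Protein", {
      "kind": "node",
      "relationships": "PROTEIN_ORGANISM,PROTEIN_DATASET",
      "indexes": "protein_accession_index",
  })
  print(domain.get_item("Protein")["relationships"])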

When people are this productive as well as a benefit to the community, I am deeply envious but glad for them (and the rest of us) at the same time. Simply must work harder. 😉

October 6, 2011

KDD and MUCMD 2011

Filed under: Bioinformatics,Biomedical,Data Mining,Knowledge Discovery — Patrick Durusau @ 5:33 pm

KDD and MUCMD 2011

An interesting review of KDD and MUCMD (Meaningful Use of Complex Medical Data) 2011:

At KDD I enjoyed Stephen Boyd’s invited talk about optimization quite a bit. However, the most interesting talk for me was David Haussler’s. His talk started out with a formidable load of biological complexity. About half-way through you start wondering, “can this be used to help with cancer?” And at the end he connects it directly to use with a call to arms for the audience: cure cancer. The core thesis here is that cancer is a complex set of diseases which can be disentangled via genetic assays, allowing attacking the specific signature of individual cancers. However, the data quantity and complex dependencies within the data require systematic and relatively automatic prediction and analysis algorithms of the kind that we are best familiar with.

The review cites a number of their favorite papers. Which ones are yours?

October 3, 2011

Automated extraction of domain-specific clinical ontologies – Weds Oct. 5th

Filed under: Bioinformatics,Biomedical,Ontology,SNOMED — Patrick Durusau @ 7:09 pm

Automated extraction of domain-specific clinical ontologies by Chimezie Ogbuji from Case Western Reserve University School of Medicine. 10 AM PT Weds Oct. 5, 2011.

Full NCBO Webinar schedule: http://www.bioontology.org/webinar-series

ABSTRACT:

A significant set of challenges in the use of large, source ontologies in the medical domain include: automated translation, customization of source ontologies, and performance issues associated with the use of logical reasoning systems to interpret the meaning of a domain captured in a formal knowledge representation.

SNOMED-CT and FMA are two reference ontologies that cover much of the domain of clinical medicine and motivate a better means for the re-use of such ontologies. In this presentation, the author will present a set of automated methods (and tools) for segmenting, merging, and surveying modules extracted from these ontologies for a specific domain.

I’m interested generally, but in particular in the merging aspects, for obvious reasons. Another reason to be interested is some research I encountered recently on “outliers” in reasoning systems. Apparently there is a class of reasoning systems that simply “fall over” if they encounter a concept they recognize (or “think” they do) only to find it has some property (what makes it an “outlier”) that they don’t expect. Seems rather fragile to me, but I haven’t finished running it to ground. I am curious how these methods and tools handle the “outlier” issue.

SPEAKER BIO:

Chimezie is a senior research associate in the Clinical Investigations Department of the Case Western Reserve University School of Medicine where he is responsible for managing, developing, and implementing Clinical and Translational Science Collaborative (CTSC) projects as well as clinical, biomedical, and administrative informatics projects for the Case Comprehensive Cancer Center.

His research interests are in applied ontology, knowledge representation, content repository infrastructure, and medical informatics. He has a BS in computer engineering from the University of Illinois and is a part-time PhD student in the Case Western School of Engineering. He most recently appeared as a guest editor in IEEE Internet Computing’s special issue on Personal Health Records in the August 2011 edition.

DETAILS:

——————————————————-
To join the online meeting (Now from mobile devices!)
——————————————————-
1. Go to https://stanford.webex.com/stanford/j.php?ED=107799137&UID=0&PW=NNjE3OWYzODk3&RT=MiM0
2. If requested, enter your name and email address.
3. If a password is required, enter the meeting password: ncbo
4. Click “Join”.

——————————————————-
To join the audio conference only
——————————————————-
To receive a call back, provide your phone number when you join the meeting, or call the number below and enter the access code.
Call-in toll number (US/Canada): 1-650-429-3300
Global call-in numbers: https://stanford.webex.com/stanford/globalcallin.php?serviceType=MC&ED=107799137&tollFree=0

Access code: 926 719 478

September 27, 2011

A Faster LZ77-Based Index

Filed under: Bioinformatics,Biomedical,Indexing — Patrick Durusau @ 7:19 am

A Faster LZ77-Based Index by Travis Gagie and Pawel Gawrychowski.

Abstract:

Suppose we are given an AVL-grammar with $r$ rules for a string $S[1..n]$ whose LZ77 parse consists of $z$ phrases. Then we can add $O(z \log \log z)$ words and obtain a compressed self-index for $S$ such that, given a pattern $P[1..m]$, we can list the occurrences of $P$ in $S$ in $O(m^2 + (m + \mathrm{occ}) \log \log n)$ time.

Not the best abstract I have ever read, at least in terms of attracting the audience most likely to be interested.

I would have started with: “Indexing of genomes, which are 99.9% the same, can be improved in terms of searching, response times and reporting of secondary occurrences.” Then follow with the technical description of the contribution. Don’t make people work for a reason to read the paper.

Any advancement in indexing, but particularly in an area like genomics, is important to topic maps.
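If the LZ77 parse is unfamiliar, here is a naive sketch of the factorization whose phrase count is the $z$ in the abstract: each phrase is the longest previously seen match plus one fresh character. Quadratic time, for illustration only:

  def lz77_parse(s):
      """Greedy LZ77 factorization; returns the list of phrases."""
      phrases, i = [], 0
      while i < len(s):
          best = 0
          for j in range(i):                 # naive scan over earlier start positions
              l = 0
              while i + l + 1 < len(s) and s[j + l] == s[i + l]:
                  l += 1
              best = max(best, l)
          phrases.append(s[i:i + best + 1])  # longest earlier match + one literal
          i += best + 1
      return phrases

  print(lz77_parse("abracadabra"))  # ['a', 'b', 'r', 'ac', 'ad', 'abra'], so z = 6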


Update: See the updated version of this paper: A Faster Grammar-Based Self-Index.

September 23, 2011

Top Scoring Pairs for Feature Selection in Machine Learning and Applications to Cancer Outcome Prediction

Filed under: Bioinformatics,Biomedical,Classifier,Machine Learning,Prediction — Patrick Durusau @ 6:15 pm

Top Scoring Pairs for Feature Selection in Machine Learning and Applications to Cancer Outcome Prediction by Ping Shi, Surajit Ray, Qifu Zhu and Mark A Kon.

BMC Bioinformatics 2011, 12:375 doi:10.1186/1471-2105-12-375 Published: 23 September 2011

Abstract:

Background

The widely used k top scoring pair (k-TSP) algorithm is a simple yet powerful parameter-free classifier. It owes its success in many cancer microarray datasets to an effective feature selection algorithm that is based on relative expression ordering of gene pairs. However, its general robustness does not extend to some difficult datasets, such as those involving cancer outcome prediction, which may be due to the relatively simple voting scheme used by the classifier. We believe that the performance can be enhanced by separating its effective feature selection component and combining it with a powerful classifier such as the support vector machine (SVM). More generally the top scoring pairs generated by the k-TSP ranking algorithm can be used as a dimensionally reduced subspace for other machine learning classifiers.

Results

We developed an approach integrating the k-TSP ranking algorithm (TSP) with other machine learning methods, allowing combination of the computationally efficient, multivariate feature ranking of k-TSP with multivariate classifiers such as SVM. We evaluated this hybrid scheme (k-TSP+SVM) in a range of simulated datasets with known data structures. As compared with other feature selection methods, such as a univariate method similar to Fisher’s discriminant criterion (Fisher), or a recursive feature elimination embedded in SVM (RFE), TSP is increasingly more effective than the other two methods as the informative genes become progressively more correlated, which is demonstrated both in terms of the classification performance and the ability to recover true informative genes. We also applied this hybrid scheme to four cancer prognosis datasets, in which k-TSP+SVM outperforms k-TSP classifier in all datasets, and achieves either comparable or superior performance to that using SVM alone. In concurrence with what is observed in simulation, TSP appears to be a better feature selector than Fisher and RFE in some of the cancer datasets.

Conclusions

The k-TSP ranking algorithm can be used as a computationally efficient, multivariate filter method for feature selection in machine learning. SVM in combination with k-TSP ranking algorithm outperforms k-TSP and SVM alone in simulated datasets and in some cancer prognosis datasets. Simulation studies suggest that as a feature selector, it is better tuned to certain data characteristics, i.e. correlations among informative genes, which is potentially interesting as an alternative feature ranking method in pathway analysis.

Knowing the tools that are already in use in bioinformatics will help you design topic map applications of interest to those in that field. And this is a very nice combination of methods to study on its own.
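The core of the method is easy to prototype. A toy sketch of the idea (not the authors’ implementation): score every gene pair by the between-class difference in rank-order probability, keep the top pairs as binary features, and hand them to an SVM:

  import numpy as np
  from itertools import combinations
  from sklearn.svm import SVC

  def tsp_scores(X, y):
      """Score pair (i, j) by |P(X_i < X_j | class 0) - P(X_i < X_j | class 1)|."""
      scores = {}
      for i, j in combinations(range(X.shape[1]), 2):
          less = X[:, i] < X[:, j]
          scores[(i, j)] = abs(less[y == 0].mean() - less[y == 1].mean())
      return scores

  rng = np.random.default_rng(0)
  X = rng.normal(size=(40, 20))              # toy data: 40 samples x 20 "genes"
  y = np.array([0] * 20 + [1] * 20)
  X[y == 1, 0] += 2.0                        # make gene 0 informative

  scores = tsp_scores(X, y)
  top = sorted(scores, key=scores.get, reverse=True)[:5]
  feats = np.column_stack([(X[:, i] < X[:, j]).astype(float) for i, j in top])
  clf = SVC(kernel="linear").fit(feats, y)   # top pairs as the reduced subspace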

September 21, 2011

CITRIS – Center for Information Technology Research in the Interest of Society

Filed under: Biomedical,Environment,Funding,Health care,Information Retrieval — Patrick Durusau @ 7:08 pm

CITRIS – Center for Information Technology Research in the Interest of Society

The mission statement:

The Center for Information Technology Research in the Interest of Society (CITRIS) creates information technology solutions for many of our most pressing social, environmental, and health care problems.

CITRIS was created to “shorten the pipeline” between world-class laboratory research and the creation of start-ups, larger companies, and whole industries. CITRIS facilitates partnerships and collaborations among more than 300 faculty members and thousands of students from numerous departments at four University of California campuses (Berkeley, Davis, Merced, and Santa Cruz) with industrial researchers from over 60 corporations. Together the groups are thinking about information technology in ways it’s never been thought of before.

CITRIS works to find solutions to many of the concerns that face all of us today, from monitoring the environment and finding viable, sustainable energy alternatives to simplifying health care delivery and developing secure systems for electronic medical records and remote diagnosis, all of which will ultimately boost economic productivity. CITRIS represents a bold and exciting vision that leverages one of the top university systems in the world with highly successful corporate partners and government resources.

I mentioned CITRIS as an aside (News: Summarization and Visualization) yesterday but then decided it needed more attention.

Its grants are limited to the four University of California campuses mentioned above. Shades of EU funding restrictions. Location has a hand in the selection process.

Still, the projects funded by CITRIS could likely profit from the use of topic maps and, as they say, a rising tide lifts all boats.

September 17, 2011

Got Hadoop?

Filed under: Bioinformatics,Biomedical,Hadoop — Patrick Durusau @ 8:12 pm

Got Hadoop?

This is going to require free registration at Genomeweb but I think it will be worth it. (Genomeweb also offers paid premium content but I haven’t tried any of it, yet.)

Nice overview of Hadoop in genome research.

Annoying in that it lists the following projects, sans hyperlinks. I have supplied the project listing with hyperlinks, just in case you are interested in Hadoop and genome research.

Crossbow: Whole genome resequencing analysis; SNP genotyping from short reads
Contrail: De novo assembly from short sequencing reads
Myrna: Ultrafast short read alignment and differential gene expression from large RNA-seq datasets
PeakRanger: Cloud-enabled peak caller for ChIP-seq data
Quake: Quality-aware detection and sequencing error correction tool
BlastReduce: High-performance short read mapping (superseded by CloudBurst)
CloudBLAST*: Hadoop implementation of NCBI’s BLAST
MrsRF: Algorithm for analyzing large evolutionary trees
*CloudBLAST was the only project without a webpage or similar source of information. This is a paper, perhaps the original paper on the technique. Searching for any of these techniques reveals a wealth of material on using Hadoop in bioinformatics.

Topic maps can capture your path through data (think of bread crumbs or string). So when today you think, “I should have gone left, rather than right,” you can retrace your steps and take another path. Try that with a Google search. If you are lucky, you may get the same ads. 😉

You can also share your bread crumbs or string with others, but that is a story for another time.

September 13, 2011

3rd Canadian Semantic Web Symposium

Filed under: Biomedical,Concept Detection,Ontology,Semantic Web — Patrick Durusau @ 7:17 pm

CSWS2011: Proceedings of the 3rd Canadian Semantic Web Symposium
Vancouver, British Columbia, Canada, August 5, 2011

An interesting set of papers! I suppose I can be forgiven for looking at the text mining (Hassanpour & Das) and heterogeneous information systems (Khan, Doucette, and Cohen) papers first. 😉 More comments to follow on those.

What are your favorite papers in this batch and why?

The whole proceedings can also be downloaded as a single PDF file.

Edited by:

Christopher J. O. Baker *
Helen Chen **
Ebrahim Bagheri ***
Weichang Du ****

* University of New Brunswick, Saint John, NB, Canada, Department of Computer Science & Applied Statistics
** University of Waterloo, Waterloo, ON, Canada, School of Public Health and Health Systems
*** Athabasca University, School of Computing and Information Systems
**** University of New Brunswick, NB, Canada, Faculty of Computer Science

Table of Contents

Full Papers

  1. The Social Semantic Subweb of Virtual Patient Support Groups
    Harold Boley, Omair Shafiq, Derek Smith, Taylor Osmun
  2. Leveraging SADI Semantic Web Services to Exploit Fish Ecotoxicology Data
    Matthew M. Hindle, Alexandre Riazanov, Edward S. Goudreau, Christopher J. Martyniuk, Christopher J. O. Baker

Short Papers

  1. Towards Evaluating the Impact of Semantic Support for Curating the Fungus Scientific Literature
    Marie-Jean Meurs, Caitlin Murphy, Nona Naderi, Ingo Morgenstern, Carolina Cantu, Shary Semarjit, Greg Butler, Justin Powlowski, Adrian Tsang, René Witte
  2. Ontology based Text Mining of Concept Definitions in Biomedical Literature
    Saeed Hassanpour, Amar K. Das
  3. Social and Semantic Computing in Support of Citizen Science
    Joel Sachs, Tim Finin
  4. Unresolved Issues in Ontology Learning
    Amal Zouaq, Dragan Gašević, Marek Hatala

Posters

  1. Towards Integration of Semantically Enabled Service Families in the Cloud
    Marko Bošković, Ebrahim Bagheri, Georg Grossmann, Dragan Gašević, Markus Stumptner
  2. SADI for GMOD: Semantic Web Services for Model Organism Databases
    Ben Vandervalk, Michel Dumontier, E Luke McCarthy, Mark D Wilkinson
  3. An Ontological Approach for Querying Distributed Heterogeneous Information Systems
    Atif Khan, John A. Doucette, Robin Cohen

Please see the CSWS2011 website for further details.

September 8, 2011

Bioportal 3.2

Filed under: Bioinformatics,Biomedical,Ontology — Patrick Durusau @ 5:50 pm

Bioportal 3.2

From the announcement:

The National Center for Biomedical Ontology is pleased to announce the release of BioPortal 3.2.

New features include updates to the Web interface and Web services:

Added Ontology Recommender feature, http://bioportal.bioontology.org/recommender
Added support for access control for viewing ontologies
Added link to subscribe to BioPortal Notes emails
Synchronized “Jump To” feature with ontology parsing and display
Added documentation on Ontology Groups
Annotator Web service – disabled use of “longest only” parameter when also selecting “ontologies to expand” parameter
Removed the metric “Number of classes without an author”
Handling of obsolete terms, part 1 – term name is grayed out and element is returned in Web service response for obsolete terms from OBO and RRF ontologies. This feature will be extended to cover OWL ontologies in a subsequent release.

Bug Fix

Fixed calculation of “Classes with no definition” metric
Added re-direct from old BioPortal URL format to new URL format to provide working links from archived search results

Firefox Extension for NCBO API Key:

To make it easier to test Web service calls from your browser, we have released the NCBO API Key Firefox Extension. This extension will automatically add your API Key to NCBO REST URLs any time you visit them in Firefox. The extension is available at Mozilla’s Add-On site. To use the extension, follow the installation directions, restart Firefox, and add your API Key into the “Options” dialog menu on the Add-Ons management screen. After that, the extension will automatically append your stored API Key any time you visit http://rest.bioontology.org.

Upcoming software license change:

The next release of NCBO software will be under the two-clause BSD license rather than under the currently used three-clause BSD license. This change should not affect anyone’s use of NCBO software and this change is to a less restrictive license. More information about these licenses is available at the site: http://www.opensource.org/licenses. Please contact support at bioontology.org with any questions concerning this change.

Even if you aren’t active in the bioontology area, you need to spend some time with this site.

September 6, 2011

Sage Bionetworks Synapse Project – Webinar – Weds. 7 Sept. 2011

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 7:02 pm

Sage Bionetworks Synapse Project – Webinar – Weds. 7 Sept. 2011

Call-in Details:

——————————————————-
To join the online meeting (Now from mobile devices!)
——————————————————-
1. Go to https://stanford.webex.com/stanford/j.php?ED=107799137&UID=0&PW=NNjE3OWYzODk3&RT=MiM0
2. If requested, enter your name and email address.
3. If a password is required, enter the meeting password: ncbo
4. Click “Join”.

——————————————————-
To join the audio conference only
——————————————————-
To receive a call back, provide your phone number when you join the meeting, or call the number below and enter the access code.
Call-in toll number (US/Canada): 1-650-429-3300
Global call-in numbers: https://stanford.webex.com/stanford/globalcallin.php?serviceType=MC&ED=107799137&tollFree=0

Access code: 926 719 478

Abstract:

The recent exponential growth of biological “omics” data has occurred concurrently with a decline in the number of New Molecular Entities approved by the FDA, proving that biological research productivity does not scale with biological data generation and the analysis and interpretation of genomic data is a bottleneck in the development of new treatments. Sage Bionetworks’ mission is to catalyze a cultural transition from the traditional single lab, single-company, and single-therapy R&D paradigm to a model with broad precompetitive collaboration on the analysis of large scale data in medical sciences. Part of Sage’s solution is Synapse, a platform for open, reproducible data-driven science, which will support the reusability of information facilitated by ontology-based services and applications directed at scientific researchers and data curators. Sage Bionetworks is actively pursuing the acquisition, curation, statistical quality control, and hosting of datasets that integrate both clinical phenotype and genomic data along with an intermediate molecular layer such as gene expression or proteomic data. We expect hosting these sorts of unique, integrative, high value datasets in the public domain on Synapse will seed a variety of analytical approaches to drive new treatments based on better understanding of disease states and the biological effects of existing drugs. In this webinar, Dr. Michael Kellen, Director of Technology at Sage Bionetworks will provide a demonstration of an alpha version of the Synapse platform, and discuss its application to clinical science.

Interesting claim about the decline in the number of New Molecular Entities (NMEs) approved by the FDA, see: NMEs approved by CDER. Approvals are on average about the same. But then applications for NMEs have to be filed in order to be approved.

Just for background reading, you might want to look at: New Chemical Entity over at Wikipedia.

Or, The Scope of New Chemical Entity Exclusivity and FDA’s “Umbrella” Exclusivity Policy

I don’t disagree that better data analysis tools are needed but remain puzzled what the FDA approval rate for NMEs has to do with the problem.

August 22, 2011

Bio-recipes (Bioinformatics recipes) in Darwin

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 7:43 pm

Bio-recipes (Bioinformatics recipes) in Darwin

If you are working on topic maps and bioinformatics, you are likely to find this a useful resource.

From the webpage:

Bio-recipes are a collection of Darwin example programs. They show how to solve standard problems in Bioinformatics. Each bio-recipe consists of an introduction, explanations, graphs, figures, and most importantly, Darwin commands (the input commands and the output that they produce) that solve the given problem.

Darwin is an interactive language of the same lineage as Maple designed to solve problems in Bioinformatics. It relies on a simple language for the interactive user, plus the infrastructure necessary for writing object oriented libraries, plus very efficient primitive operations. The primitive operations of Darwin are the most common and time consuming operations typical of bioinformatics, including linear algebra operations.

The reasons behind this particular format are the following.

  1. It is much easier to understand an algorithm or a procedure or even a theorem, when it is illustrated with a running example.
  2. The procedures, as written, may be run on different data and hence serve a useful purpose.
  3. It is an order of magnitude easier to modify a correct, existing program, than to write a new one from scratch. This is particularly true for non-computer scientists.
  4. The full examples show some features of the language and of the system that may not be known to the casual user of Darwin, hence they serve a tutorial purpose.

BTW, see also:

DARWIN – A Genetic Algorithm Programming Language

The Darwin Manual

August 18, 2011

BMC Bioinformatics

Filed under: Bioinformatics,Biomedical,Clustering — Patrick Durusau @ 6:49 pm

BMC Bioinformatics

From the webpage:

BMC Bioinformatics is an open access journal publishing original peer-reviewed research articles in all aspects of the development, testing and novel application of computational and statistical methods for the modeling and analysis of all kinds of biological data, as well as other areas of computational biology. BMC Bioinformatics (ISSN 1471-2105) is indexed/tracked/covered by PubMed, MEDLINE, BIOSIS, CAS, EMBASE, Scopus, ACM, CABI, Thomson Reuters (ISI) and Google Scholar.

Let me give you a sample of what you will find here:

MINE: Module Identification in Networks by Kahn Rhrissorrakrai and Kristin C Gunsalus. BMC Bioinformatics 2011, 12:192 doi:10.1186/1471-2105-12-192.

Abstract:

Graphical models of network associations are useful for both visualizing and integrating multiple types of association data. Identifying modules, or groups of functionally related gene products, is an important challenge in analyzing biological networks. However, existing tools to identify modules are insufficient when applied to dense networks of experimentally derived interaction data. To address this problem, we have developed an agglomerative clustering method that is able to identify highly modular sets of gene products within highly interconnected molecular interaction networks.

Medicine isn’t my field by profession (although I enjoy reading about it) but it doesn’t take much to see the applicability of an “agglomerative clustering method” to other highly interconnected networks.

Reading across domain specific IR publications can help keep you from re-inventing the wheel or perhaps sparking an idea for a better wheel of your own making.
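As a flavor of what “agglomerative clustering” means here, a toy sketch with SciPy follows. This is not MINE’s algorithm (their contribution is how modules are grown and scored within dense networks); it just shows hierarchical agglomeration over a similarity matrix:

  import numpy as np
  from scipy.spatial.distance import squareform
  from scipy.cluster.hierarchy import linkage, fcluster

  # toy confidence matrix for 6 gene products: two obvious modules {0,1,2}, {3,4,5}
  conf = np.array([
      [1.0, 0.9, 0.8, 0.1, 0.0, 0.1],
      [0.9, 1.0, 0.9, 0.0, 0.1, 0.0],
      [0.8, 0.9, 1.0, 0.1, 0.0, 0.1],
      [0.1, 0.0, 0.1, 1.0, 0.9, 0.8],
      [0.0, 0.1, 0.0, 0.9, 1.0, 0.9],
      [0.1, 0.0, 0.1, 0.8, 0.9, 1.0],
  ])
  dist = 1.0 - conf
  np.fill_diagonal(dist, 0.0)                      # squareform wants a zero diagonal
  Z = linkage(squareform(dist), method="average")  # agglomerative, average linkage
  print(fcluster(Z, t=2, criterion="maxclust"))    # -> [1 1 1 2 2 2]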

August 17, 2011

Virtual Cell Software Repository

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 6:49 pm

Virtual Cell Software Repository

From the webpage:

Developing large volume multi-scale systems dynamics interpretation technology is very important source for making virtual cell application systems. Also, this technology is focused on the core research topics in a post-genome era in order to maintain the national competitive power. It is new analysis technology which can analyze multi-scale from nano level to physiological level in system level. Therefore, if using excellent information technology and super computing power in our nation, we can hold a dominant position in the large volume multi-scale systems dynamics interpretation technology. In order to take independent technology, we need to research a field of study which have been not well known in the bio system informatics technology like the large volume multi-scale systems dynamics interpretation technology.

The purpose of virtual cell application systems is developing the analysis technology and service which can model bio application circuits based on super computing technology. For success of virtual cell application systems based on super computing power, we have researched large volume multi-scale systems dynamics technology as a core sub technology.

  • Developing analysis and modeling technology of multi-scale convergence information from nano level to physiological level
  • Developing protein structure modeling algorithm using multi-scale bio information
  • Developing quality and quantity character analysis technology of multi-scale networks
  • Developing protein modification search algorithm
  • Developing large volume multi-scale systems dynamics interpretation technology interpreting possible circumstances in complex parameter spaces

Amazing set of resources available here:

PSExplorer: Parameter Space Explorer

Mathematical models of biological systems often have a large number of parameters whose combinational variations can yield distinct qualitative behaviors. Since it is intractable to examine all possible combinations of parameters for nontrivial biological pathways, it is required to have a systematic way to explore the parameter space in a computational way so that distinct dynamic behaviors of a given pathway are estimated.

We present PSExplorer, an efficient computational tool to explore high dimensional parameter space of computational models for identifying qualitative behaviors and key parameters. The software supports input models in SBML format. It provides a friendly graphical user interface allowing users to vary model parameters and perform time-course simulations at ease. Various graphical plotting features helps users analyze the model dynamics conveniently. Its output is a tree structure that encapsulates the parameter space partitioning results in a form that is easy to visualize and provide users with additional information about important parameters and sub-regions with robust behaviors.

MONET: MOdularized NETwork learning

Although gene expression data has been continuously accumulated and meta-analysis approaches have been developed to integrate independent expression profiles into larger datasets, the amount of information is still insufficient to infer large scale genetic networks. In addition, global optimization such as Bayesian network inference, one of the most representative techniques for genetic network inference, requires tremendous computational load far beyond the capacity of moderate workstations.

MONET is a Cytoscape plugin to infer genome-scale networks from gene expression profiles. It alleviates the shortage of information by incorporating pre-existing annotations. The current version of MONET utilizes thousands of parallel computational cores in the supercomputing center in KISTI, Korea, to cope with the computational requirement for large scale genetic network inference.

RBSDesigner

RBS Designer was developed to computationally design synthetic ribosome binding sites (RBS) to control gene expression levels. Generally transcription processes are the major target for gene expression control, however, without considering translation processes the control could lead to unexpected expression results since translation efficiency is highly affected by nucleotide sequences nearby RBS such as coding sequences leading to distortion of RBS secondary structure. Such problems obscure the intuitive design of RBS nucleotides with a desired level of protein expression. We developed RBSDesigner based on a mathematical model on translation initiation to design synthetic ribosome binding sites that yield a desired level of expression of user-specified coding sequences.

SBN simulator: Switching Boolean Networks Simulator

Switching Boolean Networks Simulator(SBNsimulator) was developed to simulate large-scale signaling network. Boolean Networks is widely used in modeling signaling networks because of its straightforwardness, robustness, and compatibility with qualitative data. Signaling networks are not completely known yet in Biology. Because of this, there are gaps between biological reality and modeling such as inhibitor-only or activator-only in signaling networks. Synchronous update algorithm in threshold Boolean network has limitation which cannot sample differences in the speed of signal propagation. To overcome these limitation which are modeling anomaly and Limitation of synchronous update algorithm, we developed SBNsimulator. It can simulate how each node effect to target node. Therefore, It can say which node is important for signaling network.

MKEM: Multi-level Knowledge Emergence Model

Since Swanson proposed the Undiscovered Public Knowledge (UPK) model, there have been many approaches to uncover UPK by mining the biomedical literature. These earlier works, however, required substantial manual intervention to reduce the number of possible connections and are mainly applied to disease-effect relation. With the advancement in biomedical science, it has become imperative to extract and combine information from multiple disjoint researches, studies and articles to infer new hypotheses and expand knowledge. We propose MKEM, a Multi-level Knowledge Emergence Model, to discover implicit relationships using Natural Language Processing techniques such as Link Grammar and Ontologies such as Unified Medical Language System (UMLS) MetaMap. The contribution of MKEM is as follows: First, we propose a flexible knowledge emergence model to extract implicit relationships across different levels such as molecular level for gene and protein and Phenomic level for disease and treatment. Second, we employ MetaMap for tagging biological concepts. Third, we provide an empirical and systematic approach to discover novel relationships.

The system consists of two parts, the tagger and the extractor (which may require compilation).

A sentence of interest is given to the tagger which then proceeds to the creation of rule sets. The tagger stores this in a folder by the name of “ruleList”. These rule sets are then given by copying this folder to the extractor directory.

I blogged about an article on this project at: MKEM: a Multi-level Knowledge Emergence Model for mining undiscovered public knowledge.

BioData Mining

Filed under: Bioinformatics,Biomedical,Data Mining — Patrick Durusau @ 6:47 pm

BioData Mining

From the webpage:

BioData Mining is an open access, peer reviewed, online journal encompassing research on all aspects of data mining applied to high-dimensional biological and biomedical data, focusing on computational aspects of knowledge discovery from large-scale genetic, transcriptomic, genomic, proteomic, and metabolomic data.

What you would have seen since 1 July 2011:

An R Package Implementation of Multifactor Dimensionality Reduction

Hill-Climbing Search and Diversification within an Evolutionary Approach to Protein Structure Prediction

Detection of putative new mutacins by bioinformatic analysis using available web tools

Evolving hard problems: Generating human genetics datasets with a complex etiology

Taxon ordering in phylogenetic trees by means of evolutionary algorithms

Enjoy!

August 4, 2011

NCBI Handbook

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 6:20 pm

NCBI Handbook

From the website:

Bioinformatics consists of a computational approach to biomedical information management and analysis. It is being used increasingly as a component of research within both academic and industrial settings and is becoming integrated into both undergraduate and postgraduate curricula. The new generation of biology graduates is emerging with experience in using bioinformatics resources and, in some cases, programming skills.

The National Center for Biotechnology Information (NCBI) is one of the world’s premier Web sites for biomedical and bioinformatics research. Based within the National Library of Medicine at the National Institutes of Health, USA, the NCBI hosts many databases used by biomedical and research professionals. The services include PubMed, the bibliographic database; GenBank, the nucleotide sequence database; and the BLAST algorithm for sequence comparison, among many others.

Although each NCBI resource has online help documentation associated with it, there is no cohesive approach to describing the databases and search engines, nor any significant information on how the databases work or how they can be leveraged, for bioinformatics research on a larger scale. The NCBI Handbook is designed to address this information gap.

An extraordinary resource for learning about bioinformatics information sources.
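The handbook’s chapters on the Entrez Programming Utilities (E-utilities) are a good place to start, since a query is just an HTTP GET. A minimal sketch; the search term is mine, and you should check the current E-utilities documentation for the supported parameters:

  import requests

  r = requests.get(
      "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
      params={"db": "pubmed", "term": "topic maps bioinformatics", "retmode": "json"},
  )
  print(r.json()["esearchresult"]["idlist"])   # PubMed IDs matching the term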

July 31, 2011

Journal of Biomedical Semantics

Filed under: Bioinformatics,Biomedical,Searching,Semantics — Patrick Durusau @ 7:49 pm

Journal of Biomedical Semantics

From the webpage:

Journal of Biomedical Semantics addresses issues of semantic enrichment and semantic processing in the biomedical domain. The scope of the journal covers two main areas:

Infrastructure for biomedical semantics: focusing on semantic resources and repositories, meta-data management and resource description, knowledge representation and semantic frameworks, the Biomedical Semantic Web, and semantic interoperability.

Semantic mining, annotation, and analysis: focusing on approaches and applications of semantic resources; and tools for investigation, reasoning, prediction, and discoveries in biomedicine.

As of 31 July 2011, here are the titles of the “latest” articles:

A shortest-path graph kernel for estimating gene product semantic similarity Alvarez MA, Qi X and Yan C Journal of Biomedical Semantics 2011, 2:3 (29 July 2011)

Semantic validation of the use of SNOMED CT in HL7 clinical documents Heymans S, McKennirey M and Phillips J Journal of Biomedical Semantics 2011, 2:2 (15 July 2011)

Protein interaction sentence detection using multiple semantic kernels Polajnar T, Damoulas T and Girolami M Journal of Biomedical Semantics 2011, 2:1 (14 May 2011)

Foundations for a realist ontology of mental disease Ceusters W and Smith B Journal of Biomedical Semantics 2010, 1:10 (9 December 2010)

Simple tricks for improving pattern-based information extraction from the biomedical literature Nguyen QL, Tikk D and Leser U Journal of Biomedical Semantics 2010, 1:9 (24 September 2010)

The DBCLS BioHackathon: standardization and interoperability for bioinformatics web services and workflows Katayama T, Arakawa K, Nakao M, Ono K, Aoki-Kinoshita KF, Yamamoto Y, Yamaguchi A, Kawashima S et al. Journal of Biomedical Semantics 2010, 1:8 (21 August 2010)

Oh, did I mention this is an open access journal?

July 6, 2011

The Neo4j Rest API. My Notebook

Filed under: Bioinformatics,Biomedical,Java,Neo4j — Patrick Durusau @ 2:14 pm

The Neo4j Rest API. My Notebook

From the post:

Neo4j is an open-source graph engine implemented in Java. This post is my notebook for the Neo4j-server, a server combining a REST API and a webadmin application into a single stand-alone server.

Nothing new in this Neo4j summary but Pierre Lindenbaum profiles himself: “PhD in Virology, bioinformatics, genetics, science, geek, java.”

Someone worth watching in the Neo4j/topic map universe.
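For a taste of the REST API the notebook covers, node creation is a single POST. A minimal sketch, assuming a default server on localhost:7474 exposing the legacy /db/data endpoint of that era:

  import requests

  BASE = "http://localhost:7474/db/data"

  r = requests.post(BASE + "/node", json={"name": "BRCA1", "type": "gene"})
  r.raise_for_status()
  node_url = r.json()["self"]                   # URL of the newly created node

  print(requests.get(node_url).json()["data"])  # -> {'name': 'BRCA1', 'type': 'gene'}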

June 28, 2011

Big Data Genomics – How to efficiently store and retrieve mutation data

Filed under: Bioinformatics,Biomedical,Cassandra — Patrick Durusau @ 9:49 am

Big Data Genomics – How to efficiently store and retrieve mutation data by David Suvee.

About the post:

This blog post is the first one in a series of articles that describe the use of NoSQL databases to efficiently store and retrieve mutation data. Part one introduces the notion of mutation data and describes the conceptual use of the Cassandra NoSQL datastore.

From the post:

The only way to learn a new technology is by putting it into practice. Just try to find a suitable use case in your immediate working environment and give it a go. In my case, it was trying to efficiently store and retrieve mutation data through a variety of NoSQL data stores, including Cassandra, MongoDB and Neo4J.

Promises to be an interesting series of posts that focus on a common data set and problem!
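To make the Cassandra side concrete before the series digs in, here is a sketch using CQL via the DataStax Python driver. David’s post predates CQL, so the schema (partition by chromosome, cluster by position) is my guess at the same idea, not his design:

  from cassandra.cluster import Cluster   # pip install cassandra-driver

  session = Cluster(["127.0.0.1"]).connect()
  session.execute("""
      CREATE KEYSPACE IF NOT EXISTS genomics
      WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
  """)
  session.execute("""
      CREATE TABLE IF NOT EXISTS genomics.mutations (
          chrom text, pos int, ref text, alt text, sample text,
          PRIMARY KEY ((chrom), pos, sample)
      )
  """)
  session.execute(
      "INSERT INTO genomics.mutations (chrom, pos, ref, alt, sample) "
      "VALUES (%s, %s, %s, %s, %s)",
      ("chr17", 41244936, "G", "A", "patient-1"),
  )
  # range scan within one chromosome: partition key equality + clustering range
  for row in session.execute(
      "SELECT pos, ref, alt, sample FROM genomics.mutations "
      "WHERE chrom = %s AND pos >= %s AND pos <= %s",
      ("chr17", 41244000, 41245000),
  ):
      print(row.pos, row.ref, row.alt, row.sample)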

June 23, 2011

Bio4j – as an AWS snapshot

Filed under: Bio4j,Bioinformatics,Biomedical — Patrick Durusau @ 1:54 pm

Bio4j current release now available as an AWS snapshot

From the post:

For those using AWS (or willing to…) I just created a public snapshot containing the last version of Bio4j DB.

The snapshot details are the following:

  • Snapshot id: snap-25192d4c
  • Snapshot region: EU West (Ireland)
  • Snapshot size: 90 GB

The whole DB is under the folder ‘bio4jdb’.
In order to use it, just create a Bio4jManager instance and start navigating the graph!

Very cool!
