Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

May 2, 2012

Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform

Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform by Anthony J. Cox, Markus J. Bauer, Tobias Jakobi, and Giovanna Rosone.

Abstract:

Motivation

The Burrows-Wheeler transform (BWT) is the foundation of many algorithms for compression and indexing of text data, but the cost of computing the BWT of very large string collections has prevented these techniques from being widely applied to the large sets of sequences often encountered as the outcome of DNA sequencing experiments. In previous work, we presented a novel algorithm that allows the BWT of human genome scale data to be computed on very moderate hardware, thus enabling us to investigate the BWT as a tool for the compression of such datasets.

Results

We first used simulated reads to explore the relationship between the level of compression and the error rate, the length of the reads and the level of sampling of the underlying genome and compare choices of second-stage compression algorithm.

We demonstrate that compression may be greatly improved by a particular reordering of the sequences in the collection and give a novel ‘implicit sorting’ strategy that enables these benefits to be realised without the overhead of sorting the reads. With these techniques, a 45x coverage of real human genome sequence data compresses losslessly to under 0.5 bits per base, allowing the 135.3Gbp of sequence to fit into only 8.2Gbytes of space (trimming a small proportion of low-quality bases from the reads improves the compression still further).

This is more than 4 times smaller than the size achieved by a standard BWT-based compressor (bzip2) on the untrimmed reads, but an important further advantage of our approach is that it facilitates the building of compressed full text indexes such as the FM-index on large-scale DNA sequence collections.

Important work for several reasons.

First, if the human genome is thought of as “big data,” it opens the possibility that compressed full text indexes can be built for other instances of “big data.”

Second, indexing is similar to topic mapping in the sense that pointers to information about a particular subject are gathered to a common location. Indexes often account for synonyms (see also) and distinguish the use of the same word for different subjects (polysemy).

Third, depending on the granularity of tokenizing and indexing, index entries should be capable of recombination to create new index entries.

Source code for this approach:

Code to construct the BWT and SAP-array on large genomic data sets is part of the BEETL library, available as a GitHub repository at git@github.com:BEETL/BEETL.git.
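
If the transform itself is unfamiliar, here is a toy sketch of what the BWT does to a single short read. It is purely illustrative (quadratic, builds every rotation in memory); the BEETL algorithm above is designed precisely to avoid this kind of brute force on large collections.

```python
# Toy Burrows-Wheeler transform: sort all rotations of the string and take the
# last column. Illustrative only -- not the authors' algorithm.

def bwt(text: str, terminator: str = "$") -> str:
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

# Repetitive input produces long runs in the output, which is what makes the
# transformed reads so compressible.
print(bwt("ACGTACGTACGT"))
```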


April 17, 2012

Using the Disease ontology (DO) to map the genes involved in a category of disease

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 7:11 pm

Using the Disease ontology (DO) to map the genes involved in a category of disease by Pierre Lindenbaum.

Of particular interest if you are developing topic maps for bioinformatics.

The medical community has created a number of term mapping resources. In this particular case, the mapping runs from the DO (Disease Ontology) to OMIM (Online Mendelian Inheritance in Man) and to NCBI Gene (Gene).

April 15, 2012

Constructing Case-Control Studies With Hadoop

Filed under: Bioinformatics,Biomedical,Giraph,Hadoop,Medical Informatics — Patrick Durusau @ 7:13 pm

Constructing Case-Control Studies With Hadoop by Josh Wills.

From the post:

San Francisco seems to be having an unusually high number of flu cases/searches this April, and the Cloudera Data Science Team has been hit pretty hard. Our normal activities (working on Crunch, speaking at conferences, finagling a job with the San Francisco Giants) have taken a back seat to bed rest, throat lozenges, and consuming massive quantities of orange juice. But this bit of downtime also gave us an opportunity to focus on solving a large-scale data science problem that helps some of the people who help humanity the most: epidemiologists.

Case-Control Studies

A case-control study is a type of observational study in which a researcher attempts to identify the factors that contribute to a medical condition by comparing a set of subjects who have that condition (the ‘cases’) to a set of subjects who do not have the condition, but otherwise resemble the case subjects (the ‘controls’). They are useful for exploratory analysis because they are relatively cheap to perform, and have led to many important discoveries – most famously, the link between smoking and lung cancer.

Epidemiologists and other researchers now have access to data sets that contain tens of millions of anonymized patient records. Tens of thousands of these patient records may include a particular disease that a researcher would like to analyze. In order to find enough unique control subjects for each case subject, a researcher may need to execute tens of thousands of queries against a database of patient records, and I have spoken to researchers who spend days performing this laborious task. Although they would like to parallelize these queries across multiple machines, there is a constraint that makes this problem a bit more interesting: each control subject may only be matched with at most one case subject. If we parallelize the queries across the case subjects, we need to check to be sure that we didn’t assign a control subject to multiple cases. If we parallelize the queries across the control subjects, we need to be sure that each case subject ends up with a sufficient number of control subjects. In either case, we still need to query the data an arbitrary number of times to ensure that the matching of cases and controls we come up with is feasible, let alone optimal.

Analyzing a case-control study is a problem for a statistician. Constructing a case-control study is a problem for a data scientist.

Great walk through on constructing a case-control study, including the use of the Apache Giraph library.
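
As a rough illustration of the constraint Josh describes (each control matched to at most one case), here is a greedy matching sketch. The matching attributes (age band, gender) and the pool sizes are invented for the example; they are not the post’s actual criteria or code.

```python
from collections import defaultdict

def greedy_match(cases, controls, controls_per_case=2):
    # Pool of unused controls, keyed by the attributes we match on.
    pool = defaultdict(list)
    for control in controls:
        pool[(control["age_band"], control["gender"])].append(control["id"])

    matching = {}
    for case in cases:
        key = (case["age_band"], case["gender"])
        take = min(controls_per_case, len(pool[key]))
        matching[case["id"]] = [pool[key].pop() for _ in range(take)]  # each control used at most once
    return matching

cases = [{"id": "case-1", "age_band": "40-49", "gender": "F"}]
controls = [{"id": f"ctrl-{i}", "age_band": "40-49", "gender": "F"} for i in range(5)]
print(greedy_match(cases, controls))
```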

April 13, 2012

Operations, machine learning and premature babies

Filed under: Bioinformatics,Biomedical,Machine Learning — Patrick Durusau @ 4:40 pm

Operations, machine learning and premature babies: An astonishing connection between web ops and medical care. By Mike Loukides.

From the post:

Julie Steele and I recently had lunch with Etsy’s John Allspaw and Kellan Elliott-McCrea. I’m not sure how we got there, but we made a connection that was (to me) astonishing between web operations and medical care for premature infants.

I’ve written several times about IBM’s work in neonatal intensive care at the University of Toronto. In any neonatal intensive care unit (NICU), every baby is connected to dozens of monitors. And each monitor is streaming hundreds of readings per second into various data systems. They can generate alerts if anything goes severely out of spec, but in normal operation, they just generate a summary report for the doctor every half hour or so.

IBM discovered that by applying machine learning to the full data stream, they were able to diagnose some dangerous infections a full day before any symptoms were noticeable to a human. That’s amazing in itself, but what’s more important is what they were looking for. I expected them to be looking for telltale spikes or irregularities in the readings: perhaps not serious enough to generate an alarm on their own, but still, the sort of things you’d intuitively expect of a person about to become ill. But according to Anjul Bhambhri, IBM’s Vice President of Big Data, the telltale signal wasn’t spikes or irregularities, but the opposite. There’s a certain normal variation in heart rate, etc., throughout the day, and babies who were about to become sick didn’t exhibit the variation. Their heart rate was too normal; it didn’t change throughout the day as much as it should.

That observation strikes me as revolutionary. It’s easy to detect problems when something goes out of spec: If you have a fever, you know you’re sick. But how do you detect problems that don’t set off an alarm? How many diseases have early symptoms that are too subtle for a human to notice, and only accessible to a machine learning system that can sift through gigabytes of data?

The post goes on to discuss how our servers may exhibit behaviors that machine learning could recognize but that we can’t specify.

That may be Rumsfeld’s “unknown unknowns,” however much we all laughed at the time.

There are “unknown unknowns,” and tireless machine learning may be the only way to identify them.

In topic map lingo, I would say there are subjects that we haven’t yet learned to recognize.
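
For the curious, the “too normal” signal described above is easy to caricature in code: compute the spread of the readings in each window and flag windows whose spread falls well below the series’ usual level. The window size and threshold here are invented for illustration, not IBM’s.

```python
import statistics

def low_variation_windows(readings, window=60, threshold_ratio=0.5):
    """Indices of windows whose standard deviation is unusually low."""
    overall = statistics.pstdev(readings)
    flagged = []
    for start in range(0, len(readings) - window + 1, window):
        segment = readings[start:start + window]
        if statistics.pstdev(segment) < threshold_ratio * overall:
            flagged.append(start)  # suspiciously little variation in this window
    return flagged
```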

April 12, 2012

From Beaker to Bits: Graph Theory Yields Computational Model of Human Tissue

Filed under: Bioinformatics,Biomedical,Graphs — Patrick Durusau @ 7:04 pm

From Beaker to Bits: Graph Theory Yields Computational Model of Human Tissue

An all-too-rare example of how reaching across disciplinary lines can lead to fundamental breakthroughs in more than one area.

First step, alert any graph or data store people you know, along with any medical research types.

Second step, if you are in CS/Math, think about another department that interests you. If you are in other sciences or humanities, strike up a conversation with the CS/Math department types.

In both cases, don’t take “no” or lack of interest as an answer. Talk to the newest faculty or even faculty at other institutions. Or even established companies.

No guarantees that you will strike up a successful collaboration, much less have a successful result. But, we all know how successful a project that never begins will be, don’t we?

Here is a story of a collaborative project that persisted and succeeded:

Computer scientists and biologists in the Data Science Research Center at Rensselaer Polytechnic Institute have developed a rare collaboration between the two very different fields to pick apart a fundamental roadblock to progress in modern medicine. Their unique partnership has uncovered a new computational model called “cell graphs” that links the structure of human tissue to its corresponding biological function. The tool is a promising step in the effort to bring the power of computational science together with traditional biology to the fight against human diseases, such as cancer.

The discovery follows a more than six-year collaboration, breaking ground in both fields. The work will serve as a new method to understand and predict relationships between the cells and tissues in the human body, which is essential to detect, diagnose and treat human disease. It also serves as an important reminder of the power of collaboration in the scientific process.

The new research led by Professor of Biology George Plopper and Professor of Computer Science Bulent Yener is published in the March 30, 2012, edition of the journal PLoS One in a paper titled, “ Coupled Analysis of in Vitro and Histology Tissue Samples to Quantify Structure-Function Relationship.” They were joined in the research by Evrim Acar, a graduate student at Rensselaer in Yener’s lab currently at the University of Copenhagen. The research is funded by the National Institutes of Health and the Villum Foundation.

The new, purely computational tool models the relationship between the structure and function of different tissues in body. As an example of this process, the new paper analyzes the structure and function of healthy and cancerous brain, breast and bone tissues. The model can be used to determine computationally whether a tissue sample is cancerous or not, rather than relying on the human eye as is currently done by pathologists around the world each day. The objective technique can be used to eliminate differences of opinion between doctors and as a training tool for new cancer pathologists, according to Yener and Plopper. The tool also helps fill an important gap in biological knowledge, they said.

BTW, if you want to see all the details: Coupled Analysis of in Vitro and Histology Tissue Samples to Quantify Structure-Function Relationship

April 8, 2012

Indexing the content of Gene Ontology with apache SOLR

Filed under: Bioinformatics,Biomedical,Gene Ontology,Solr — Patrick Durusau @ 4:21 pm

Indexing the content of Gene Ontology with apache SOLR by Pierre Lindenbaum.

Pierre walks you through the use of Solr to index GeneOntology. As with all of his work, impressive!

Of course, one awesome post deserves another! So Pierre follows with:

Apache SOLR and GeneOntology: Creating the JQUERY-UI client (with autocompletion)

So you get to learn JQuery/UI stuff as well.
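
If you just want a feel for the indexing side before reading Pierre’s posts, here is a minimal sketch of pushing a couple of GO terms into Solr over HTTP. The core name (“go”), the field names and the URL are my assumptions for illustration, not Pierre’s setup.

```python
import requests

go_terms = [
    {"id": "GO:0008150", "name": "biological_process", "namespace": "biological_process"},
    {"id": "GO:0003674", "name": "molecular_function", "namespace": "molecular_function"},
]

# Solr's JSON update handler accepts a list of documents.
resp = requests.post("http://localhost:8983/solr/go/update?commit=true", json=go_terms)
resp.raise_for_status()
print("indexed", len(go_terms), "terms")
```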

April 2, 2012

The 1000 Genomes Project

The 1000 Genomes Project

If Amazon is hosting a single dataset > 200 TB, is your data “big data?” 😉

This merits quoting in full:

We're very pleased to welcome the 1000 Genomes Project data to Amazon S3. 

The original human genome project was a huge undertaking. It aimed to identify every letter of our genetic code, 3 billion DNA bases in total, to help guide our understanding of human biology. The project ran for over a decade, cost billions of dollars and became the corner stone of modern genomics. The techniques and tools developed for the human genome were also put into practice in sequencing other species, from the mouse to the gorilla, from the hedgehog to the platypus. By comparing the genetic code between species, researchers can identify biologically interesting genetic regions for all species, including us.

A few years ago there was a quantum leap in the technology for sequencing DNA, which drastically reduced the time and cost of identifying genetic code. This offered the promise of being able to compare full genomes from individuals, rather than entire species, leading to a much more detailed genetic map of where we, as individuals, have genetic similarities and differences. This will ultimately give us better insight into human health and disease.

The 1000 Genomes Project, initiated in 2008, is an international public-private consortium that aims to build the most detailed map of human genetic variation available, ultimately with data from the genomes of over 2,661 people from 26 populations around the world. The project began with three pilot studies that assessed strategies for producing a catalog of genetic variants that are present at one percent or greater in the populations studied. We were happy to host the initial pilot data on Amazon S3 in 2010, and today we're making the latest dataset available to all, including results from sequencing the DNA of approximately 1,700 people.

The data is vast (the current set weighs in at over 200Tb), so hosting the data on S3 which is closely located to the computational resources of EC2 means that anyone with an AWS account can start using it in their research, from anywhere with internet access, at any scale, whilst only paying for the compute power they need, as and when they use it. This enables researchers from laboratories of all sizes to start exploring and working with the data straight away. The Cloud BioLinux AMIs are ready to roll with the necessary tools and packages, and are a great place to get going.

Making the data available via a bucket in S3 also means that customers can crunch the information using Hadoop via Elastic MapReduce, and take advantage of the growing collection of tools for running bioinformatics job flows, such as CloudBurst and Crossbow.

You can find more information, the location of the data and how to get started using it on our 1000 Genomes web page, or from the project pages.

If that sounds like a lot of data, just imagine all of the recorded mathematical texts and the relationships between the concepts represented in those texts.

It is only in how we view it that data looks smooth or simple. Or complex.

March 28, 2012

GWAS Central

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 4:22 pm

GWAS Central

From the website:

GWAS Central (previously the Human Genome Variation database of Genotype-to-Phenotype information) is a database of summary level findings from genetic association studies, both large and small. We actively gather datasets from public domain projects, and encourage direct data submission from the community.

GWAS Central is built upon a basal layer of Markers that comprises all known SNPs and other variants from public databases such as dbSNP and the DBGV. Allele and genotype frequency data, plus genetic association significance findings, are added on top of the Marker data, and organised the same way that investigations are reported in typical journal manuscripts. Critically, no individual level genotypes or phenotypes are presented in GWAS Central – only group level aggregated (summary level) data. The largest unit in a data submission is a Study, which can be thought of as being equivalent to one journal article. This may contain one or more Experiments, one or more Sample Panels of test subjects, and one or more Phenotypes. Sample Panels may be characterised in terms of various Phenotypes, and they also may be combined and/or split into Assayed Panels. The Assayed Panels are used as the basis for reporting allele/genotype frequencies (in ‘Genotype Experiments’) and/or genetic association findings (in ‘Analysis Experiments’). Environmental factors are handled as part of the Sample Panel and Assayed Panel data structures.

Although I mentioned GWAS some time ago, I saw it mentioned in Christophe Lalanne’s Bag of Tweets for March 2012 and on taking another look, thought I should mention it again.

In part because, as the project reports above, this is an aggregation-level site, not one that reaches into the details of studies, which may or may not matter for some researchers. That aggregation leaves a gap for aggregation or analysis of the underlying data, plus mapping it to other data!
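
To keep the description above straight in my head, here is how I read the data model, sketched as plain data classes. The field names are my paraphrase, not GWAS Central’s actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SamplePanel:
    name: str
    phenotypes: List[str] = field(default_factory=list)

@dataclass
class Experiment:
    kind: str  # e.g. "Genotype Experiment" (frequencies) or "Analysis Experiment" (associations)
    assayed_panels: List[SamplePanel] = field(default_factory=list)

@dataclass
class Study:  # roughly equivalent to one journal article
    title: str
    sample_panels: List[SamplePanel] = field(default_factory=list)
    experiments: List[Experiment] = field(default_factory=list)
```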

Openfmri.org

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 4:22 pm

Openfmri.org

From the webpage:

OpenfMRI.org is a project dedicated to the free and open sharing of functional magnetic resonance imaging (fMRI) datasets, including raw data.

Now that’s a data set you don’t see every day!

Not to mention being one that would be ripe to link into medical literature, hospital/physician records, etc.

First seen in Christophe Lalanne’s Bag of Tweets for March, 2012.

March 18, 2012

Drug data reveal sneaky side effects

Filed under: Bioinformatics,Biomedical,Knowledge Economics,Medical Informatics — Patrick Durusau @ 8:54 pm

Drug data reveal sneaky side effects

From the post:

An algorithm designed by US scientists to trawl through a plethora of drug interactions has yielded thousands of previously unknown side effects caused by taking drugs in combination.

The work, published today in Science Translational Medicine [Tatonetti, N. P., Ye, P. P., Daneshjou, R. and Altman, R. B. Sci. Transl. Med. 4, 125ra31 (2012).], provides a way to sort through the hundreds of thousands of ‘adverse events’ reported to the US Food and Drug Administration (FDA) each year. “It’s a step in the direction of a complete catalogue of drug–drug interactions,” says the study’s lead author, Russ Altman, a bioengineer at Stanford University in California.

From later in the post:

The team then used this method to compile a database of 1,332 drugs and possible side effects that were not listed on the labels for those drugs. The algorithm came up with an average of 329 previously unknown adverse events for each drug — far surpassing the average of 69 side effects listed on most drug labels.

Double trouble

The team also compiled a similar database looking at interactions between pairs of drugs, which yielded many more possible side effects than could be attributed to either drug alone. When the data were broken down by drug class, the most striking effect was seen when diuretics called thiazides, often prescribed to treat high blood pressure and oedema, were used in combination with a class of drugs called selective serotonin reuptake inhibitors, used to treat depression. Compared with people who used either drug alone, patients who used both drugs were significantly more likely to experience a heart condition known as prolonged QT, which is associated with an increased risk of irregular heartbeats and sudden death.

A search of electronic medical records from Stanford University Hospital confirmed the relationship between these two drug classes, revealing a roughly 1.5-fold increase in the likelihood of prolonged QT when the drugs were combined, compared to when either drug was taken alone. Altman says that the next step will be to test this finding further, possibly by conducting a clinical trial in which patients are given both drugs and then monitored for prolonged QT.

This data could be marketed to drug companies, trial lawyers (both sides), medical malpractice insurers, etc. This is an example of the data marketing I mentioned in Knowledge Economics II.
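
To make the pairwise idea concrete, here is a back-of-the-envelope counting sketch over invented adverse-event reports: how often does an event turn up when both drugs are present versus when only one is? The actual method in the paper corrects for many biases that this ignores.

```python
# Invented FDA-style reports: (drugs in the report, events reported).
reports = [
    ({"thiazide"}, {"nausea"}),
    ({"ssri"}, {"insomnia"}),
    ({"thiazide", "ssri"}, {"prolonged QT"}),
    ({"thiazide", "ssri"}, {"prolonged QT", "nausea"}),
]

def qt_counts(drug_filter):
    hits = [events for drugs, events in reports if drug_filter(drugs)]
    return sum("prolonged QT" in events for events in hits), len(hits)

both = qt_counts(lambda drugs: {"thiazide", "ssri"} <= drugs)
one_only = qt_counts(lambda drugs: len(drugs & {"thiazide", "ssri"}) == 1)
print("both drugs (QT reports, total reports):", both)
print("one drug only (QT reports, total reports):", one_only)
```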

March 12, 2012

Bio4jExplorer, new features and design!

Filed under: Bio4j,Bioinformatics,Medical Informatics — Patrick Durusau @ 8:04 pm

Bio4jExplorer, new features and design!

Pablo Pareja Tobes writes:

I’m happy to announce a new set of features for our tool Bio4jExplorer plus some changes in its design. I hope this may help both potential and current users to get a better understanding of Bio4j DB structure and contents.

Among the new features:

  • Node & Relationship Properties
  • Node & Relationship Data Source
  • Relationships Name Property

It may take time but even with “big data,” the source of data (as an aspect of validity or trust) is going to become a requirement.

March 6, 2012

Extending the GATK for custom variant comparisons using Clojure

Filed under: Bioinformatics,Biomedical,Clojure,MapReduce — Patrick Durusau @ 8:09 pm

Extending the GATK for custom variant comparisons using Clojure by Brad Chapman.

From the post:

The Genome Analysis Toolkit (GATK) is a full-featured library for dealing with next-generation sequencing data. The open-source Java code base, written by the Genome Sequencing and Analysis Group at the Broad Institute, exposes a Map/Reduce framework allowing developers to code custom tools taking advantage of support for: BAM Alignment files through Picard, BED and other interval file formats through Tribble, and variant data in VCF format.

Here I’ll show how to utilize the GATK API from Clojure, a functional, dynamic programming language that targets the Java Virtual Machine. We’ll:

  • Write a GATK walker that plots variant quality scores using the Map/Reduce API.
  • Create a custom annotation that adds a mean neighboring base quality metric using the GATK VariantAnnotator.
  • Use the VariantContext API to parse and access variant information in a VCF file.

The Clojure variation library is freely available and is part of a larger project to provide variant assessment capabilities for the Archon Genomics XPRIZE competition.

Interesting data, commercial potential, cutting edge technology and subject identity issues galore. What more could you want?

March 5, 2012

Java Remote Method Invocation (RMI) for Bioinformatics

Filed under: Bioinformatics,Java,Remote Method Invocation (RMI) — Patrick Durusau @ 7:53 pm

Java Remote Method Invocation (RMI) for Bioinformatics by Pierre Lindenbaum.

From the post:

Java Remote Method Invocation (Java RMI) enables the programmer to create distributed Java technology-based to Java technology-based applications, in which the methods of remote Java objects can be invoked from other Java virtual machines*, possibly on different hosts.“[Oracle] In the current post a java client will send a java class to the server that will analyze a DNA sequence fetched from the NCBI, using the RMI technology.

Distributed computing, on both the client and the server, is likely to form part of a topic map solution. This example is drawn from bioinformatics but the principles are generally applicable.

February 23, 2012

Finding the lowest common ancestor of a set of NCBI taxonomy nodes with Bio4j

Filed under: Bio4j,Bioinformatics,Common Ancestor,Taxonomy — Patrick Durusau @ 4:51 pm

Finding the lowest common ancestor of a set of NCBI taxonomy nodes with Bio4j

Pablo Pareja writes:

I don’t know if you have ever heard of the lowest common ancestor problem in graph theory and computer science but it’s actually pretty simple. As its name says, it consists of finding the common ancestor for two different nodes which has the lowest level possible in the tree/graph.

Even though it is normally defined for only two nodes given it can easily be extended for a set of nodes with an arbitrary size. This is a quite common scenario that can be found across multiple fields and taxonomy is one of them.

The reason I’m talking about all this is because today I ran into the need to make use of such algorithm as part of some improvements in our metagenomics MG7 method. After doing some research looking for existing solutions, I came to the conclusion that I should implement my own – I couldn’t find any applicable implementation that was thought for more than just two nodes.

Important for its use with NCBI taxonomy nodes but another use case comes readily to mind.

What about overlapping markup?

Traditionally we represent markup elements as single nodes, despite their composition of start and end events for each “well-formed” element in the text stream.

But what if we represent start and end events as nodes in a graph with relationships both to each other and other nodes in the markup stream?

Can we then ask the question: which pair of start/end nodes is the ancestor of a given start or end node?

If they have the same ancestor then we have the uninteresting case of well-formed markup.

But what if they don’t have the same ancestor? What can the common ancestor method tell us about the structure of the markup?

Definitely a research topic.
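
Whether the nodes are NCBI taxonomy entries or start/end events in a markup stream, the extension of lowest common ancestor to a set of nodes is simple enough to sketch with parent pointers. This is the general idea only, not Pablo’s Bio4j implementation.

```python
def path_to_root(node, parent):
    path = []
    while node is not None:
        path.append(node)
        node = parent.get(node)
    return list(reversed(path))  # root first

def lowest_common_ancestor(nodes, parent):
    paths = [path_to_root(n, parent) for n in nodes]
    lca = None
    for level in zip(*paths):    # walk down from the root while all paths agree
        if len(set(level)) == 1:
            lca = level[0]
        else:
            break
    return lca

# Toy taxonomy, child -> parent.
parent = {"human": "primates", "chimp": "primates",
          "primates": "mammals", "mouse": "mammals", "mammals": None}
print(lowest_common_ancestor(["human", "chimp", "mouse"], parent))  # -> mammals
```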

February 16, 2012

Effectopedia

Filed under: Bioinformatics,Biomedical,Collaboration — Patrick Durusau @ 7:03 pm

Effectopedia – An Open Data Project for Collaborative Scientific Research, with the aim of reducing Animal Testing, by Velichka Dimitrova, Coordinator of the Open Economics Working Group, and Hristo Alajdov, Associate Professor at the Institute of Biomedical Engineering at the Bulgarian Academy of Sciences.

From the post:

One of the key problems in natural science research is the lack of effective collaboration. A lot of research is conducted by scientists from different disciplines, yet cross-discipline collaboration is rare. Even within a discipline, research is often duplicated, which wastes resources and valuable scientific potential. Furthermore, without a common framework and context, research that involves animal testing often becomes phenomenological and little or no general knowledge can be gained from it. The peer reviewed publishing process is also not very effective in stimulating scientific collaboration, mainly due to the loss of an underlying machine readable structure for the data and the duration of the process itself.

If research results were more effectively shared and re-used by a wider scientific community – including scientists with different disciplinary backgrounds – many of these problems could be addressed. We could hope to see a more efficient use of resources, an accelerated rate of academic publications, and, ultimately, a reduction in animal testing.

Effectopedia is a project of the International QSAR Foundation. Effectopedia itself is an open knowledge aggregation and collaboration tool that provides a means of describing adverse outcome pathways (AOPs)1 in an encyclopedic manner. Effectopedia defines internal organizational space which helps scientist with different backgrounds to know exactly where their knowledge belongs and aids them in identifying both the larger context of their research and the individual experts who might be actively interested in it. Using automated notifications when researchers create causal linkage between parts of the pathways, they can simultaneously create a valuable contact with a fellow researcher interested in the same topic who might have a different background or perspective towards the subject. Effectopedia allows creation of live scientific documents which are instantly open for focused discussions and feedback whilst giving credit to the original authors and reviewers involved. The review process is never closed and if new evidence arises it can be presented immediately, allowing the information in Effectopedia to remain current, while keeping track of its complete evolution.

Sounds interesting, but there is no link to the Effectopedia website. I followed links a bit and found: Effectopedia at SourceForge.

Apparently still in pre-alpha state.

I remember more than one shared-workspace project, so how do we decide whose identifications/terminology gets used?

Isn’t that the tough nut of collaboration? If scholars (given my background in biblical studies) decide to collaborate beyond their departments, they form projects, but ones that are less inclusive than all of the workers in a particular area. The end result is multiple projects with different identifications/terminologies. How do we bridge those gaps?

As you know, my suggestion is that everyone keeps their own identifications/terminologies.

I am curious, though: if everyone does keep their own identifications/terminologies, will they be able to read enough of another project’s content to recognize that it is meaningful for their own quest?

That is, a topic map author’s decision that two or more representatives stand for the same subject may not carry over to users of the topic map, who may not share that judgment.

February 14, 2012

On Approximating String Selection Problems with Outliers

Filed under: Algorithms,Bioinformatics,String Matching — Patrick Durusau @ 5:07 pm

On Approximating String Selection Problems with Outliers by Christina Boucher, Gad M. Landau, Avivit Levy, David Pritchard and Oren Weimann.

Abstract:

Many problems in bioinformatics are about finding strings that approximately represent a collection of given strings. We look at more general problems where some input strings can be classified as outliers. The Close to Most Strings problem is, given a set S of same-length strings, and a parameter d, find a string x that maximizes the number of “non-outliers” within Hamming distance d of x. We prove this problem has no PTAS unless ZPP=NP, correcting a decade-old mistake. The Most Strings with Few Bad Columns problem is to find a maximum-size subset of input strings so that the number of non-identical positions is at most k; we show it has no PTAS unless P=NP. We also observe Closest to k Strings has no EPTAS unless W[1]=FPT. In sum, outliers help model problems associated with using biological data, but we show the problem of finding an approximate solution is computationally difficult.

Just in case you need a break from graph algorithms, intractable and otherwise. 😉
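
For anyone who wants the Close to Most Strings objective in concrete terms, here is the scoring half of the problem: given a candidate centre string and a distance d, count the non-outliers. Finding a good centre is, as the paper shows, the hard part; the strings below are invented.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def non_outliers(centre, strings, d):
    """How many input strings lie within Hamming distance d of the candidate centre."""
    return sum(hamming(centre, s) <= d for s in strings)

strings = ["ACGT", "ACGA", "TTTT", "ACGG"]
print(non_outliers("ACGT", strings, d=1))  # 3 -- "TTTT" is the outlier
```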

February 8, 2012

PSEUDOMARKER: a powerful program for joint linkage…

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 5:13 pm

PSEUDOMARKER: a powerful program for joint linkage and/or linkage disequilibrium analysis on mixtures of singletons and related individuals. By Hiekkalinna T, Schäffer AA, Lambert B, Norrgrann P, Göring HH, Terwilliger JD.

Abstract:

A decade ago, there was widespread enthusiasm for the prospects of genome-wide association studies to identify common variants related to common chronic diseases using samples of unrelated individuals from populations. Although technological advancements allow us to query more than a million SNPs across the genome at low cost, a disappointingly small fraction of the genetic portion of common disease etiology has been uncovered. This has led to the hypothesis that less frequent variants might be involved, stimulating a renaissance of the traditional approach of seeking genes using multiplex families from less diverse populations. However, by using the modern genotyping and sequencing technology, we can now look not just at linkage, but jointly at linkage and linkage disequilibrium (LD) in such samples. Software methods that can look simultaneously at linkage and LD in a powerful and robust manner have been lacking. Most algorithms cannot jointly analyze datasets involving families of varying structures in a statistically or computationally efficient manner. We have implemented previously proposed statistical algorithms in a user-friendly software package, PSEUDOMARKER. This paper is an announcement of this software package. We describe the motivation behind the approach, the statistical methods, and software, and we briefly demonstrate PSEUDOMARKER’s advantages over other packages by example.

I didn’t set out to find this particular article but was trying to update references on Cri-Map, which is now somewhat dated software for:

… rapid, largely automated construction of multilocus linkage maps (and facilitate the attendant tasks of assessing support relative to alternative locus orders, generating LOD tables, and detecting data errors). Although originally designed to handle codominant loci (e.g. RFLPs) scored on pedigrees “without missing individuals”, such as CEPH or nuclear families, it can now (with some caveats described below) be used on general pedigrees, and some disease loci.

Just as background, you may wish to see:

CRI-MAP – Introduction

And, Multilocus linkage analysis

With multilocus linkage analysis, more than two loci are simultaneously considered for linkage. When mapping a disease gene relative to a group of markers with known intermarker recombination fractions, it is possible to perform parametric (lod score) as well as nonparametric analysis.

My interest being in the use of additional information (in the lead article “linkage and linkage disequilibrium”) in determining linkage issues.
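
As a reminder of what the parametric “lod score” analysis mentioned above computes, here is the toy two-point version: the log10 likelihood ratio of a recombination fraction theta against free recombination (0.5). PSEUDOMARKER’s joint linkage/LD models are, of course, far richer than this.

```python
from math import log10

def lod(recombinants, meioses, theta):
    """Two-point LOD: log10 of L(theta) / L(0.5) for r recombinants in n informative meioses."""
    r, n = recombinants, meioses
    return r * log10(theta) + (n - r) * log10(1 - theta) - n * log10(0.5)

# 2 recombinants out of 20 informative meioses at theta = 0.1 gives a LOD of about 3.2,
# which clears the conventional threshold of 3 for declaring linkage.
print(round(lod(2, 20, 0.1), 2))
```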

Not that every issue of subject identification needs to be, or should be, probabilistic or richly nuanced.

In a prison there are “free men” and prisoners.

Rather sharp and useful distinction. Doesn’t require a URL. Or a subject identifier. What does your use case require?

February 4, 2012

Clojure and XNAT: Introduction

Filed under: Bioinformatics,Clojure,Neuroinformatics,Regexes,XNAT — Patrick Durusau @ 3:38 pm

Clojure and XNAT: Introduction

Over the last two years, I’ve been using Clojure quite a bit for managing, testing, and exploratory development in XNAT. Clojure is a new member of the Lisp family of languages that runs in the Java Virtual Machine. Two features of Clojure that I’ve found particularly useful are seamless Java interoperability and good support for interactive development.

“Interactive development” is a term that may need some explanation: With many languages — Java, C, and C++ come to mind — you write your code, compile it, and then run your program to test. Most Lisps, including Clojure, have a different model: you start the environment, write some code, test a function, make changes, and rerun your test with the new code. Any state necessary for the test stays in memory, so each write/compile/test iteration is fast. Developing in Clojure feels a lot like running an interpreted environment like Matlab, Mathematica, or R, but Clojure is a general-purpose language that compiles to JVM bytecode, with performance comparable to plain old Java.

One problem that comes up again and again on the XNAT discussion group and in our local XNAT support is that received DICOM files land in the unassigned prearchive rather than the intended project. Usually when this happens, there’s a custom rule for project identification where the regular expression doesn’t quite match what’s in the DICOM headers. Regular expressions are a wonderfully concise way of representing text patterns, but this sentence is equally true if you replace “wonderfully concise” with “maddeningly cryptic.”

Interesting “introduction” that focuses on regular expressions.
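
The routing failure described above (a project-identification regex that almost, but not quite, matches the DICOM headers) is easy to probe offline before touching XNAT. This is a generic sketch with made-up header values, not XNAT’s actual rule syntax.

```python
import re

candidate_rules = {
    "strict": r"^NEURO_\d{4}$",
    "looser": r"(?i)^neuro[_-]?\d+",
}

sample_header_values = ["NEURO_0042", "neuro-42", "Neuro_2012_pilot"]

for name, pattern in candidate_rules.items():
    matches = [v for v in sample_header_values if re.search(pattern, v)]
    print(f"{name}: {pattern!r} matches {matches}")
```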

If you don’t know XNAT (I didn’t):

XNAT is an open source imaging informatics platform, developed by the Neuroinformatics Research Group at Washington University. It facilitates common management, productivity, and quality assurance tasks for imaging and associated data. Thanks to its extensibility, XNAT can be used to support a wide range of imaging-based projects.

Important neuroinformatics project based at Washington University, which has a history of very successful public technology projects.

Never hurts to learn more about any informatics project, particularly one in the medical sciences. With an introduction to Clojure as well, what more could you want?

February 3, 2012

Seal

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 4:53 pm

Seal

From the site:

Seal is a Hadoop-based distributed short read alignment and analysis toolkit. Currently Seal includes tools for: read demultiplexing, read alignment, duplicate read removal, sorting read mappings, and calculating statistics for empirical base quality recalibration. Seal scales, easily handling TB of data.

Features:

  • short read alignment (based on BWA)
  • duplicate read identification
  • sort read mappings
  • calculate empirical base quality recalibration tables
  • fast, scalable, reliable (runs on Hadoop)

Seal website with extensive documentation.

February 1, 2012

Bio4j: A pioneer graph based database…

Filed under: Bio4j,Bioinformatics,Neo4j — Patrick Durusau @ 4:35 pm

Bio4j: A pioneer graph based database for the integration of biological Big Data by Pablo Pareja Tobes.

Great slide deck by the principal developer for Bio4j.

Take a close look at slide 19 and tell me what it reminds you of?

😉

January 31, 2012

Inside the Variation Toolkit: Tools for Gene Ontology

Filed under: Bioinformatics,Biomedical,Gene Ontology — Patrick Durusau @ 4:33 pm

Inside the Variation Toolkit: Tools for Gene Ontology by Pierre Lindenbaum.

From the post:

GeneOntologyDbManager is a C++ tool that is part of my experimental Variation Toolkit.

This program is a set of tools for GeneOntology, it is based on the sqlite3 library.

Pierre walks through building and using his GeneOntologyDbManager.

Rather appropriate to mention an area (bioinformatics) that is exploding with information on the same day as GPU and database posts. Plus I am sure you will find the Gene Ontology useful for topic map purposes.

January 27, 2012

NOSQL for bioinformatics: Bio4j, a real world use case using Neo4j (Madrid, Spain)

Filed under: Bioinformatics,Neo4j,NoSQL — Patrick Durusau @ 4:35 pm

NOSQL for bioinformatics: Bio4j, a real world use case using Neo4j

Monday, January 30, 2012, 7:00 PM

From the meeting notice:

The world of data is changing. Big Data and NOSQL are bringing new ways of looking at and understanding your data. Prominent in the trend is Neo4j, a graph database that elevates relationships to first-class citizens, uniquely offering a way to model and query highly connected data.

This opens a whole new world of possibilities for a wide range of fields, and bioinformatics is no exception. Quite the opposite, this paradigm provides bioinformaticians with a powerful and intuitive framework for dealing with biological data which by nature is incredibly interconnected.

We’ll give a quick overview of the NOSQL world today, introducing then Neo4j in particular. Afterwards we’ll move to real use cases focusing in Bio4j project.

I would really love to see this presentation, particularly the Bio4j part.

But, I won’t be in Madrid this coming Monday.

If you are, don’t miss this presentation! Take good notes and blog about it. The rest of us would appreciate it!

January 11, 2012

Bio4j release 0.7 is out !

Filed under: Bioinformatics,Biomedical,Cypher,Graphs,Gremlin,Medical Informatics,Visualization — Patrick Durusau @ 8:02 pm

Bio4j release 0.7 is out !

A quick list of the new features:

  • Expasy Enzyme database integration
  • Node type indexing
  • Amazon web services Availability in all Regions
  • New CloudFormation templates
  • Bio4j REST server
  • Explore you database with the Data browser
  • Run queries with Cypher
  • Querying Bio4j with Gremlin

Wait! Did I say Cypher and Gremlin!?

Looks like this graph querying stuff is spreading. 🙂

Even if you are not working in bioinformatics, Bio4j is worth more than a quick look.

January 9, 2012

SIMI 2012 : Semantic Interoperability in Medical Informatics

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 1:48 pm

SIMI 2012 : Semantic Interoperability in Medical Informatics

Dates:

When May 27, 2012 – May 27, 2012
Where Heraklion (Crete), Greece
Submission Deadline Mar 4, 2012
Notification Due Apr 1, 2012
Final Version Due Apr 15, 2012

From the call for papers:

To gather data on potential application to new diseases and disorders is increasingly to be not only a means for evaluating the effectiveness of new medicine and pharmaceutical formulas but also for experimenting on existing drugs and their appliance to new diseases and disorders. Although the wealth of published non-clinical and clinical information is increasing rapidly, the overall number of new active substances undergoing regulatory review is gradually falling, whereas pharmaceutical companies tend to prefer launching modified versions of existing drugs, which present reduced risk of failure and can generate generous profits. In the meanwhile, market numbers depict the great difficulty faced by clinical trials in successfully translating basic research into effective therapies for the patients. In fact, success rates, from first dose in man in clinical trials to registration of the drug and release in the market, are only about 11% across indications. But, even if a treatment reaches the broad patient population through healthcare, it may prove not to be as effective and/or safe as indicated in the clinical research findings.

Within this context, bridging basic science to clinical practice comprises a new scientific challenge which can result in successful clinical applications with low financial cost. The efficacy of clinical trials, in combination with the mitigation of patients’ health risks, requires the pursuit of a number of aspects that need to be addressed ranging from the aggregation of data from various heterogeneous distributed sources (such as electronic health records – EHRs, disease and drug data sources, etc) to the intelligent processing of this data based on the study-specific requirements for choosing the “right” target population for the therapy and in the end selecting the patients eligible for recruitment.

Data collection poses a significant challenge for investigators, due to the non-interoperable heterogeneous distributed data sources involved in the life sciences domain. A great amount of medical information crucial to the success of a clinical trial could be hidden inside a variety of information systems that do not share the same semantics and/or structure or adhere to widely deployed clinical data standards. Especially in the case of EHRs, the wealth of information within them, which could provide important information and allow of knowledge enrichment in the clinical trial domain (during test of hypothesis generation and study design) as well as act as a fast and reliable bridge between study requirements for recruitment and patients who would like to participate in them, still remains unlinked from the clinical trial lifecycle posing restrictions in the overall process. In addition, methods for efficient literature search and hypothesis validation are needed, so that principal investigators can research efficiently on new clinical trial cases.

The goal of the proposed workshop is to foster exchange of ideas and offer a suitable forum for discussions among researchers and developers on great challenges that are posed in the effort of combining information underlying the large number of heterogeneous data sources and knowledge bases in life sciences, including:

  • Strong multi-level (semantic, structural, syntactic, interface) heterogeneity issues in clinical research and healthcare domains
  • Semantic interoperability both at schema and data/instance level
  • Handling of unstructured information, i.e., literature articles
  • Reasoning on the wealth of existing data (published findings, background knowledge on diseases, drugs, targets, Electronic Health Records) can boost and enhance clinical research and clinical care processes
  • Acquisition/extraction of new knowledge from published information and Electronic Health Records
  • Enhanced matching between clinicians’ as well as patients’ needs and available informational content

Apologies for the length of the quote, but this is a tough nut that simply saying “topic maps” isn’t going to solve. As described above, there is a set of domains, each with its own information gathering, processing and storage practices, none of which are going to change rapidly, or consistently.

Although I think topic maps can play a role in solving this sort of issue, it will be by being the “integration rain drop” that starts with some obvious integration issue and solves it and only it. It does not try to be a solution for every issue or requirement. Having solved one issue, it then spreads out to solve another.

The key is going to be the delivery of clear and practical advantages in concrete situations.

One approach could be to identify current semantic integration efforts (which tend to have global aspirations) and effect semantic mappings between those solutions. That has the advantage of allowing the advocates of those systems to continue their work, while a topic map offers other systems an integration of data drawn from those parts.
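
A deliberately tiny illustration of that “integration rain drop” approach, sketched below: map one overlapping concept between two invented coding systems and merge only the records that benefit, leaving both source systems untouched.

```python
# Every code and identifier here is invented for illustration.
crosswalk = {
    ("ICD-ish", "I10"): "subject:hypertension",
    ("local-ehr", "HTN"): "subject:hypertension",
}

records = [
    {"system": "ICD-ish", "code": "I10", "site": "hospital-A"},
    {"system": "local-ehr", "code": "HTN", "site": "hospital-B"},
]

merged = {}
for rec in records:
    subject = crosswalk.get((rec["system"], rec["code"]))
    if subject:  # integrate only what the mapping covers; ignore the rest for now
        merged.setdefault(subject, []).append(rec["site"])

print(merged)  # {'subject:hypertension': ['hospital-A', 'hospital-B']}
```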

International Symposium on Bioinformatics Research and Applications

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 1:30 pm

ISBRA 2012 : International Symposium on Bioinformatics Research and Applications

Dates:

When May 21, 2012 – May 23, 2012
Where Dallas, Texas
Submission Deadline Feb 6, 2012
Notification Due Mar 5, 2012
Final Version Due Mar 15, 2012

From the call for papers:

The International Symposium on Bioinformatics Research and Applications (ISBRA) provides a forum for the exchange of ideas and results among researchers, developers, and practitioners working on all aspects of bioinformatics and computational biology and their applications. Submissions presenting original research are solicited in all areas of bioinformatics and computational biology, including the development of experimental or commercial systems.

January 7, 2012

The Variation Toolkit

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 4:00 pm

The Variation Toolkit by Pierre Lindenbaum.

From the post:

During the last weeks, I’ve worked on an experimental C++ package named The Variation Toolkit (varkit). It was originally designed to provide some command lines equivalent to knime4bio but I’ve added more tools over time. Some of those tools are very simple-and-stupid ( fasta2tsv) , reinvent the wheel (“numericsplit“), are part of an answer to biostar, are some old tools (e.g. bam2wig) that have been moved to this package, but some others like “samplepersnp“, “groupbygene” might be useful to people.

The package is available at : http://code.google.com/p/variationtoolkit/.

See the post for documentation.

January 5, 2012

Interoperability Driven Integration of Biomedical Data Sources

Interoperability Driven Integration of Biomedical Data Sources by Douglas Teodoro, Rémy Choquet, Daniel Schober, Giovanni Mels, Emilie Pasche, Patrick Ruch, and Christian Lovis.

Abstract:

In this paper, we introduce a data integration methodology that promotes technical, syntactic and semantic interoperability for operational healthcare data sources. ETL processes provide access to different operational databases at the technical level. Furthermore, data instances have their syntax aligned according to biomedical terminologies using natural language processing. Finally, semantic web technologies are used to ensure common meaning and to provide ubiquitous access to the data. The system’s performance and solvability assessments were carried out using clinical questions against seven healthcare institutions distributed across Europe. The architecture managed to provide interoperability within the limited heterogeneous grid of hospitals. Preliminary scalability result tests are provided.

Appears in:

Studies in Health Technology and Informatics
Volume 169, 2011
User Centred Networked Health Care – Proceedings of MIE 2011
Edited by Anne Moen, Stig Kjær Andersen, Jos Aarts, Petter Hurlen
ISBN 978-1-60750-805-2

I have been unable to find a copy online, well, other than the publisher’s copy, at $20 for four pages. I have written to one of the authors requesting a personal use copy as I would like to report back on what it proposes.

January 3, 2012

Topical Classification of Biomedical Research Papers – Details

Filed under: Bioinformatics,Biomedical,Medical Informatics,MeSH,PubMed,Topic Maps — Patrick Durusau @ 5:11 pm

OK, I registered both on the site and for the contest.

From the Task:

Our team has invested a significant amount of time and effort to gather a corpus of documents containing 20,000 journal articles from the PubMed Central open-access subset. Each of those documents was labeled by biomedical experts from PubMed with several MeSH subheadings that can be viewed as different contexts or topics discussed in the text. With a use of our automatic tagging algorithm, which we will describe in details after completion of the contest, we associated all the documents with the most related MeSH terms (headings). The competition data consists of information about strengths of those bonds, expressed as numerical value. Intuitively, they can be interpreted as values of a rough membership function that measures a degree in which a term is present in a given text. The task for the participants is to devise algorithms capable of accurately predicting MeSH subheadings (topics) assigned by the experts, based on the association strengths of the automatically generated tags. Each document can be labeled with several subheadings and this number is not fixed. In order to ensure that participants who are not familiar with biomedicine, and with the MeSH ontology in particular, have equal chances as domain experts, the names of concepts and topical classifications are removed from data. Those names and relations between data columns, as well as a dictionary translating decision class identifiers into MeSH subheadings, can be provided on request after completion of the challenge.

Data format: The data set is provided in a tabular form as two tab-separated values files, namely trainingData.csv (the training set) and testData.csv (the test set). They can be downloaded only after a successful registration to the competition. Each row of those data files represents a single document and, in the consecutive columns, it contains integers ranging from 0 to 1000, expressing association strengths to corresponding MeSH terms. Additionally, there is a trainingLables.txt file, whose consecutive rows correspond to entries in the training set (trainingData.csv). Each row of that file is a list of topic identifiers (integers ranging from 1 to 83), separated by commas, which can be regarded as a generalized classification of a journal article. This information is not available for the test set and has to be predicted by participants.

It is worth noting that, due to nature of the considered problem, the data sets are highly dimensional – the number of columns roughly corresponds to the MeSH ontology size. The data sets are also sparse, since usually only a small fraction of the MeSH terms is assigned to a particular document by our tagging algorithm. Finally, a large number of data columns have little (or even none) non-zero values (corresponding concepts are rarely assigned to documents). It is up to participants to decide which of them are still useful for the task.

I am looking at it as an opportunity to learn a good bit about automatic text classification and what, if any, role that topic maps can play in such a scenario.

Suggestions as well as team members are most welcome!
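
For anyone joining, here is a minimal sketch of reading the files as the task describes them: tab-separated association strengths plus comma-separated label lists. The file paths are whatever you downloaded; the spelling “trainingLables.txt” follows the task description.

```python
import csv

def load_features(path):
    with open(path, newline="") as fh:
        return [[int(v) for v in row] for row in csv.reader(fh, delimiter="\t")]

def load_labels(path):
    with open(path) as fh:
        return [{int(t) for t in line.split(",")} for line in fh if line.strip()]

X = load_features("trainingData.csv")
y = load_labels("trainingLables.txt")
print(len(X), "documents,", len(X[0]), "feature columns,", len(y), "label sets")
```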

January 2, 2012

Topical Classification of Biomedical Research Papers

Filed under: Bioinformatics,Biomedical,Contest,Medical Informatics,MeSH,PubMed — Patrick Durusau @ 6:36 pm

JRS 2012 Data Mining Competition: Topical Classification of Biomedical Research Papers

From the webpage:

JRS 2012 Data Mining Competition: Topical Classification of Biomedical Research Papers, is a special event of Joint Rough Sets Symposium (JRS 2012, http://sist.swjtu.edu.cn/JRS2012/) that will take place in Chengdu, China, August 17-20, 2012. The task is related to the problem of predicting topical classification of scientific publications in a field of biomedicine. Money prizes worth 1,500 USD will be awarded to the most successful teams. The contest is funded by the organizers of the JRS 2012 conference, Southwest Jiaotong University, with support from University of Warsaw, SYNAT project and TunedIT.

Introduction: Development of freely available biomedical databases allows users to search for documents containing highly specialized biomedical knowledge. Rapidly increasing size of scientific article meta-data and text repositories, such as MEDLINE [1] or PubMed Central (PMC) [2], emphasizes the growing need for accurate and scalable methods for automatic tagging and classification of textual data. For example, medical doctors often search through biomedical documents for information regarding diagnostics, drugs dosage and effect or possible complications resulting from specific treatments. In the queries, they use highly sophisticated terminology, that can be properly interpreted only with a use of a domain ontology, such as Medical Subject Headings (MeSH) [3]. In order to facilitate the searching process, documents in a database should be indexed with concepts from the ontology. Additionally, the search results could be grouped into clusters of documents, that correspond to meaningful topics matching different information needs. Such clusters should not necessarily be disjoint since one document may contain information related to several topics. In this data mining competition, we would like to raise both of the above mentioned problems, i.e. we are interested in identification of efficient algorithms for topical classification of biomedical research papers based on information about concepts from the MeSH ontology, that were automatically assigned by our tagging algorithm. In our opinion, this challenge may be appealing to all members of the Rough Set Community, as well as other data mining practitioners, due to its strong relations to well-founded subjects, such as generalized decision rules induction [4], feature extraction [5], soft and rough computing [6], semantic text mining [7], and scalable classification methods [8]. In order to ensure scientific value of this challenge, each of participating teams will be required to prepare a short report describing their approach. Those reports can be used for further validation of the results. Apart from prizes for top three teams, authors of selected solutions will be invited to prepare a paper for presentation at JRS 2012 special session devoted to the competition. Chosen papers will be published in the conference proceedings.

Data sets became available today.

This is one of those “praxis” opportunities for topic maps.

Using Bio4j + Neo4j Graph-algo component…

Filed under: Bio4j,Bioinformatics,Biomedical,Neo4j — Patrick Durusau @ 3:00 pm

Using Bio4j + Neo4j Graph-algo component for finding protein-protein interaction paths

From the post:

Today I managed to find some time to check out the Graph-algo component from Neo4j and after playing with it plus Bio4j a bit, I have to say it seems pretty cool.

For those who don’t know what I’m talking about, here you have the description you can find in Neo4j wiki:

This is a component that offers implementations of common graph algorithms on top of Neo4j. It is mostly focused around finding paths, like finding the shortest path between two nodes, but it also contains a few different centrality measures, like betweenness centrality for nodes.

The algorithm for finding the shortest path between two nodes caught my attention and I started to wonder how could I give it a try applying it to the data included in Bio4j.

Suggestions of other data sets where shortest path would yield interesting results?

BTW, isn’t the shortest path an artifact of the basis for nearness between nodes? I am thinking that shortest path, when expressed between gene fragments as relatedness, would be different from physical distance. (see: Nearness key in microbe DNA swaps: Proximity trumps relatedness in influencing how often bacteria pick up each other’s genes.)
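
Here is that point in miniature: the same breadth-first shortest-path routine gives different answers depending on which relationship you choose to record as an edge. The toy graphs below (physical adjacency versus observed gene exchange) are invented.

```python
from collections import deque

def shortest_path(graph, start, goal):
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

physical_adjacency = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
gene_exchange = {"A": ["D"], "D": ["A", "B"], "B": ["D"], "C": []}

print(shortest_path(physical_adjacency, "A", "D"))  # ['A', 'B', 'C', 'D']
print(shortest_path(gene_exchange, "A", "D"))       # ['A', 'D']
```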

