Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

November 25, 2012

STAR: ultrafast universal RNA-seq aligner

Filed under: Bioinformatics,Genomics,String Matching — Patrick Durusau @ 9:32 am

STAR: ultrafast universal RNA-seq aligner
by Stephen Turner.

From the post:

There’s a new kid on the block for RNA-seq alignment.

Dobin, Alexander, et al. “STAR: ultrafast universal RNA-seq aligner.” Bioinformatics (2012).

Aligning RNA-seq data is challenging because reads can overlap splice junctions. Many other RNA-seq alignment algorithms (e.g. Tophat) are built on top of DNA sequence aligners. STAR (Spliced Transcripts Alignment to a Reference) is a standalone RNA-seq alignment algorithm that uses uncompressed suffix arrays and a mapping algorithm similar to those used in large-scale genome alignment tools to align RNA-seq reads to a genomic reference. STAR is over 50 times faster than any other previously published RNA-seq aligner, and outperforms other aligners in both sensitivity and specificity using both simulated and real (replicated) RNA-seq data.

I had a brief exchange of comments with Lars Marius Garshol on string matching recently. STAR is another example of a string processing approach you may be able to adapt to different circumstances.
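Not STAR itself, of course, but here is a minimal Python sketch of the core idea, exact seed lookup against an uncompressed suffix array, with a made-up "genome" and seed. Everything here is illustrative, not STAR's actual engine:

def build_suffix_array(text):
    """Naive O(n^2 log n) suffix array construction; fine for a toy example."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_seed(text, sa, seed):
    """Binary-search the suffix array for every exact occurrence of seed."""
    lo, hi = 0, len(sa)
    while lo < hi:                       # left edge of suffixes starting with seed
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(seed)] < seed:
            lo = mid + 1
        else:
            hi = mid
    matches = []
    while lo < len(sa) and text[sa[lo]:sa[lo] + len(seed)] == seed:
        matches.append(sa[lo])
        lo += 1
    return sorted(matches)

genome = "ACGTACGTGACCGTTACGTA"          # made-up toy "genome"
sa = build_suffix_array(genome)
print(find_seed(genome, sa, "ACGT"))     # [0, 4, 15]

STAR layers much more on top of this (as I understand it, seed clustering and stitching across splice junctions), but the suffix array lookup is the kernel you could reuse on non-biological strings.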

November 24, 2012

BIOKDD 2013: …Biological Knowledge Discovery and Data Mining

Filed under: Bioinformatics,Biomedical,Conferences,Data Mining,Knowledge Discovery — Patrick Durusau @ 11:24 am

BIOKDD 2013 : 4th International Workshop on Biological Knowledge Discovery and Data Mining

When Aug 26, 2013 – Aug 30, 2013
Where Prague, Czech Republic
Abstract Registration Due Apr 3, 2013
Submission Deadline Apr 10, 2013
Notification Due May 10, 2013
Final Version Due May 20, 2013

From the call for papers:

With the development of Molecular Biology during the last decades, we are witnessing an exponential growth of both the volume and the complexity of biological data. For example, the Human Genome Project provided the sequence of the 3 billion DNA bases that constitute the human genome. And, consequently, we are provided too with the sequences of about 100,000 proteins. Therefore, we are entering the post-genomic era: after having focused so many efforts on the accumulation of data, we have now to focus as much effort, and even more, on the analysis of these data. Analyzing this huge volume of data is a challenging task because, not only, of its complexity and its multiple and numerous correlated factors, but also, because of the continuous evolution of our understanding of the biological mechanisms. Classical approaches of biological data analysis are no longer efficient and produce only a very limited amount of information, compared to the numerous and complex biological mechanisms under study. From here comes the necessity to use computer tools and develop new in silico high performance approaches to support us in the analysis of biological data and, hence, to help us in our understanding of the correlations that exist between, on one hand, structures and functional patterns of biological sequences and, on the other hand, genetic and biochemical mechanisms. Knowledge Discovery and Data Mining (KDD) are a response to these new trends.

Topics of BIOKDD’13 workshop include, but not limited to:

Data Preprocessing: Biological Data Storage, Representation and Management (data warehouses, databases, sequences, trees, graphs, biological networks and pathways, …), Biological Data Cleaning (errors removal, redundant data removal, completion of missing data, …), Feature Extraction (motifs, subgraphs, …), Feature Selection (filter approaches, wrapper approaches, hybrid approaches, embedded approaches, …)

Data Mining: Biological Data Regression (regression of biological sequences…), Biological data clustering/biclustering (microarray data biclustering, clustering/biclustering of biological sequences, …), Biological Data Classification (classification of biological sequences…), Association Rules Learning from Biological Data, Text mining and Application to Biological Sequences, Web mining and Application to Biological Data, Parallel, Cloud and Grid Computing for Biological Data Mining

Data Postprocessing: Biological Nuggets of Knowledge Filtering, Biological Nuggets of Knowledge Representation and Visualization, Biological Nuggets of Knowledge Evaluation (calculation of the classification error rate, evaluation of the association rules via numerical indicators, e.g. measurements of interest, … ), Biological Nuggets of Knowledge Integration

Being held in conjunction with the 24th International Conference on Database and Expert Systems Applications – DEXA 2013.

In case you are wondering about BIOKDD, consider the BIOKDD Programme for 2012.

Or the DEXA program for 2012.

Looks like a very strong set of conferences and workshops.

November 23, 2012

BLAST – Basic Local Alignment Search Tool

Filed under: Bioinformatics,BLAST,Genomics — Patrick Durusau @ 11:27 am

BLAST – Basic Local Alignment Search Tool (Wikipedia)

From Wikipedia:

In bioinformatics, Basic Local Alignment Search Tool, or BLAST, is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences. A BLAST search enables a researcher to compare a query sequence with a library or database of sequences, and identify library sequences that resemble the query sequence above a certain threshold. Different types of BLASTs are available according to the query sequences. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence. The BLAST program was designed by Stephen Altschul, Warren Gish, Webb Miller, Eugene Myers, and David J. Lipman at the NIH and was published in the Journal of Molecular Biology in 1990.[1]

I found the uses of BLAST of particular interest:

Uses of BLAST

BLAST can be used for several purposes. These include identifying species, locating domains, establishing phylogeny, DNA mapping, and comparison.

Identifying species
With the use of BLAST, you can possibly correctly identify a species and/or find homologous species. This can be useful, for example, when you are working with a DNA sequence from an unknown species.
Locating domains
When working with a protein sequence you can input it into BLAST, to locate known domains within the sequence of interest.
Establishing phylogeny
Using the results received through BLAST you can create a phylogenetic tree using the BLAST web-page. Phylogenies based on BLAST alone are less reliable than other purpose-built computational phylogenetic methods, so should only be relied upon for “first pass” phylogenetic analyses.
DNA mapping
When working with a known species, and looking to sequence a gene at an unknown location, BLAST can compare the chromosomal position of the sequence of interest, to relevant sequences in the database(s).
Comparison
When working with genes, BLAST can locate common genes in two related species, and can be used to map annotations from one organism to another.

Not just because of the many uses of BLAST in genomics, but what about using similar techniques with other data sets?

Are they not composed of “sequences?”
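As a toy answer to my own question, here is a hedged sketch of the seed-and-extend idea BLAST popularized, applied to arbitrary token sequences rather than DNA. The scoring scheme and example data are invented for illustration and have nothing to do with BLAST's statistics:

def seed_and_extend(query, subject, k=3, match=1, mismatch=-1, drop=2):
    """Find k-length exact seeds shared by two sequences and greedily
    extend each seed to the right, stopping when the score falls more
    than `drop` below its best value (a crude X-drop rule)."""
    # Index every k-gram of the subject sequence.
    index = {}
    for j in range(len(subject) - k + 1):
        index.setdefault(tuple(subject[j:j + k]), []).append(j)

    hits = []
    for i in range(len(query) - k + 1):
        for j in index.get(tuple(query[i:i + k]), []):
            score = best = k * match
            length = k
            while i + length < len(query) and j + length < len(subject):
                score += match if query[i + length] == subject[j + length] else mismatch
                if score < best - drop:
                    break
                best = max(best, score)
                length += 1
            hits.append((best, i, j))
    return sorted(hits, reverse=True)

# Works on strings, token lists, or any sequence of hashable items.
print(seed_and_extend("the quick brown fox".split(),
                      "a quick brown dog and a fox".split(), k=2))

Swap the tokens for names, addresses or log events and the same seed-and-extend loop still applies.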

November 22, 2012

Developing a biocuration workflow for AgBase… [Authoring Interfaces]

Filed under: Bioinformatics,Biomedical,Curation,Genomics,Text Mining — Patrick Durusau @ 9:50 am

Developing a biocuration workflow for AgBase, a non-model organism database by Lakshmi Pillai, Philippe Chouvarine, Catalina O. Tudor, Carl J. Schmidt, K. Vijay-Shanker and Fiona M. McCarthy.

Abstract:

AgBase provides annotation for agricultural gene products using the Gene Ontology (GO) and Plant Ontology, as appropriate. Unlike model organism species, agricultural species have a body of literature that does not just focus on gene function; to improve efficiency, we use text mining to identify literature for curation. The first component of our annotation interface is the gene prioritization interface that ranks gene products for annotation. Biocurators select the top-ranked gene and mark annotation for these genes as ‘in progress’ or ‘completed’; links enable biocurators to move directly to our biocuration interface (BI). Our BI includes all current GO annotation for gene products and is the main interface to add/modify AgBase curation data. The BI also displays Extracting Genic Information from Text (eGIFT) results for each gene product. eGIFT is a web-based, text-mining tool that associates ranked, informative terms (iTerms) and the articles and sentences containing them, with genes. Moreover, iTerms are linked to GO terms, where they match either a GO term name or a synonym. This enables AgBase biocurators to rapidly identify literature for further curation based on possible GO terms. Because most agricultural species do not have standardized literature, eGIFT searches all gene names and synonyms to associate articles with genes. As many of the gene names can be ambiguous, eGIFT applies a disambiguation step to remove matches that do not correspond to this gene, and filtering is applied to remove abstracts that mention a gene in passing. The BI is linked to our Journal Database (JDB) where corresponding journal citations are stored. Just as importantly, biocurators also add to the JDB citations that have no GO annotation. The AgBase BI also supports bulk annotation upload to facilitate our Inferred from electronic annotation of agricultural gene products. All annotations must pass standard GO Consortium quality checking before release in AgBase.

Database URL: http://www.agbase.msstate.edu/

Another approach to biocuration. I will be posting on eGIFT separately, but do note this is a domain-specific tool.

The authors did not set out to create the universal curation tool but one suited to their specific data and requirements.

I think there is an important lesson here for semantic authoring interfaces. Word processors offer very generic interfaces but consequently little in the way of structure. Authoring annotated information requires more structure and that requires domain specifics.

Now there is an idea: create topic map authoring interfaces on top of a common skeleton, instead of hard coding interfaces for how users “should” use the tool.
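For a flavor of the eGIFT-style step of matching informative terms against GO term names and synonyms, here is a toy sketch. The identifiers and synonym table are invented (marked "TOY"), not real Gene Ontology content:

# Toy illustration only: the identifiers and synonyms below are invented,
# not real Gene Ontology content.
go_terms = {
    "GO:TOY0001": {"name": "mitochondrion inheritance",
                   "synonyms": {"mitochondrial inheritance"}},
    "GO:TOY0002": {"name": "dna repair",
                   "synonyms": {"repair of dna", "dna damage repair"}},
}

def match_iterm(iterm, ontology=go_terms):
    """Return ontology ids whose name or synonym equals the informative term."""
    needle = iterm.strip().lower()
    return [term_id for term_id, entry in ontology.items()
            if needle == entry["name"] or needle in entry["synonyms"]]

print(match_iterm("DNA repair"))                 # ['GO:TOY0002']
print(match_iterm("mitochondrial inheritance"))  # ['GO:TOY0001']

A real interface would put this lookup behind the curation screen, which is exactly the kind of domain-specific structure a generic word processor cannot offer.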

November 21, 2012

Prioritizing PubMed articles…

Prioritizing PubMed articles for the Comparative Toxicogenomic Database utilizing semantic information by Sun Kim, Won Kim, Chih-Hsuan Wei, Zhiyong Lu and W. John Wilbur.

Abstract:

The Comparative Toxicogenomics Database (CTD) contains manually curated literature that describes chemical–gene interactions, chemical–disease relationships and gene–disease relationships. Finding articles containing this information is the first and an important step to assist manual curation efficiency. However, the complex nature of named entities and their relationships make it challenging to choose relevant articles. In this article, we introduce a machine learning framework for prioritizing CTD-relevant articles based on our prior system for the protein–protein interaction article classification task in BioCreative III. To address new challenges in the CTD task, we explore a new entity identification method for genes, chemicals and diseases. In addition, latent topics are analyzed and used as a feature type to overcome the small size of the training set. Applied to the BioCreative 2012 Triage dataset, our method achieved 0.8030 mean average precision (MAP) in the official runs, resulting in the top MAP system among participants. Integrated with PubTator, a Web interface for annotating biomedical literature, the proposed system also received a positive review from the CTD curation team.

An interesting summary of entity recognition issues in bioinformatics occurs in this article:

The second problem is that chemical and disease mentions should be identified along with gene mentions. Named entity recognition (NER) has been a main research topic for a long time in the biomedical text-mining community. The common strategy for NER is either to apply certain rules based on dictionaries and natural language processing techniques (5–7) or to apply machine learning approaches such as support vector machines (SVMs) and conditional random fields (8–10). However, most NER systems are class specific, i.e. they are designed to find only objects of one particular class or set of classes (11). This is natural because chemical, gene and disease names have specialized terminologies and complex naming conventions. In particular, gene names are difficult to detect because of synonyms, homonyms, abbreviations and ambiguities (12,13). Moreover, there are no specific rules of how to name a gene that are actually followed in practice (14). Chemicals have systematic naming conventions, but finding chemical names from text is still not easy because there are various ways to express chemicals (15,16). For example, they can be mentioned as IUPAC names, brand names, generic names or even molecular formulas. However, disease names in literature are more standardized (17) compared with gene and chemical names. Hence, using terminological resources such as Medical Subject Headings (MeSH) and Unified Medical Language System (UMLS) Metathesaurus help boost the identification performance (17,18). But, a major drawback of identifying disease names from text is that they often use general English terms.

Having a common representative for a group of identifiers for a single entity should simplify the creation of mappings between entities.

Yes?
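A toy sketch of what that common representative might look like in code: collapse synonymous mentions to a canonical identifier and use sentence context to break ties. The lexicon, identifiers and context keywords below are all invented:

# Invented lexicon: mention -> candidate (canonical id, context keywords).
lexicon = {
    "p53":   [("GENE:TP53",  {"tumor", "cancer", "apoptosis"})],
    "park2": [("GENE:PRKN",  {"parkinson", "ubiquitin"})],
    "cat":   [("GENE:CAT",   {"catalase", "peroxide", "oxidative"}),
              ("ANIMAL:cat", {"feline", "pet"})],   # ambiguous mention
}

def normalize(mention, sentence):
    """Map a mention to one canonical id, using sentence context to
    break ties; return None when no candidate keyword appears."""
    words = set(sentence.lower().split())
    candidates = lexicon.get(mention.lower(), [])
    scored = [(len(words & keywords), cid) for cid, keywords in candidates]
    score, cid = max(scored, default=(0, None))
    return cid if score > 0 else None

print(normalize("CAT", "catalase activity protects against peroxide stress"))
# GENE:CAT
print(normalize("cat", "the cat is a common household pet"))
# ANIMAL:cat

The hard part, as the quoted passage makes clear, is building and maintaining that lexicon, not the lookup.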

November 19, 2012

Accelerating literature curation with text-mining tools:…

Filed under: Bioinformatics,Curation,Literature,Text Mining — Patrick Durusau @ 7:35 pm

Accelerating literature curation with text-mining tools: a case study of using PubTator to curate genes in PubMed abstracts by Chih-Hsuan Wei, Bethany R. Harris, Donghui Li, Tanya Z. Berardini, Eva Huala, Hung-Yu Kao and Zhiyong Lu.

Abstract:

Today’s biomedical research has become heavily dependent on access to the biological knowledge encoded in expert curated biological databases. As the volume of biological literature grows rapidly, it becomes increasingly difficult for biocurators to keep up with the literature because manual curation is an expensive and time-consuming endeavour. Past research has suggested that computer-assisted curation can improve efficiency, but few text-mining systems have been formally evaluated in this regard. Through participation in the interactive text-mining track of the BioCreative 2012 workshop, we developed PubTator, a PubMed-like system that assists with two specific human curation tasks: document triage and bioconcept annotation. On the basis of evaluation results from two external user groups, we find that the accuracy of PubTator-assisted curation is comparable with that of manual curation and that PubTator can significantly increase human curatorial speed. These encouraging findings warrant further investigation with a larger number of publications to be annotated.

Database URL: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/

Presentation on PubTator (slides, PDF).

Hmmm, curating abstracts. That sounds like annotating subjects in documents, doesn’t it? Or something very close. 😉

If we start off with a set of subjects, that eases topic map authoring because users are assisted by automatic creation of topic map machinery, triggered by identification of subjects and associations.

Users don’t have to start with bare ground to build a topic map.

Clever users build (and sell) forms, frames, components and modules that serve as the scaffolding for other topic maps.
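Here is a toy sketch of that triggering idea: given subject annotations of the kind a curation tool like PubTator emits, stamp out topic and association stubs automatically. The input format and association type are invented for illustration, not actual PubTator output or any topic map API:

# Invented input: the kind of output a curation tool might hand us,
# one (document, subject id, subject type) triple per annotation.
annotations = [
    ("PMID:12345", "GENE:TP53", "gene"),
    ("PMID:12345", "DISEASE:D009369", "disease"),
]

def topics_and_associations(annots):
    """Create one topic stub per distinct subject and one 'mentioned-in'
    association per annotation; merging by subject identifier comes free."""
    topics = {}
    assocs = []
    for doc, subject, kind in annots:
        topics.setdefault(subject, {"id": subject, "type": kind})
        topics.setdefault(doc, {"id": doc, "type": "document"})
        assocs.append({"type": "mentioned-in",
                       "roles": {"subject": subject, "document": doc}})
    return list(topics.values()), assocs

topics, assocs = topics_and_associations(annotations)
print(len(topics), "topics,", len(assocs), "associations")

The scaffolding idea above amounts to shipping the annotation format, the stub templates, or both.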

BioInformatics: A Data Deluge with Hadoop to the Rescue

Filed under: Bioinformatics,Cloudera,Hadoop,Impala — Patrick Durusau @ 4:10 pm

BioInformatics: A Data Deluge with Hadoop to the Rescue by Marty Lurie.

From the post:

Cloudera Cofounder and Chief Scientist Jeff Hammerbacher is leading a revolutionary project with Mount Sinai School of Medicine to apply the power of Cloudera’s Big Data platform to critical problems in predicting and understanding the process and treatment of disease.

“We are at the cutting edge of disease prevention and treatment, and the work that we will do together will reshape the landscape of our field,” said Dennis S. Charney, MD, Anne and Joel Ehrenkranz Dean, Mount Sinai School of Medicine and Executive Vice President for Academic Affairs, The Mount Sinai Medical Center. “Mount Sinai is thrilled to join minds with Cloudera.” (Please see http://www.cloudera.com/content/cloudera/en/about/press-center/press-releases/release.html?ReleaseID=1747809 for more details.)

Cloudera is active in many other areas of BioInformatics. Due to Cloudera’s market leadership in Big Data, many DNA mapping programs have specific installation instructions for CDH (Cloudera’s 100% open-source, enterprise-ready distribution of Hadoop and related projects). But rather than just tell you about Cloudera let’s do a worked example of BioInformatics data – specifically FAERS.

A sponsored piece by Cloudera, but it walks you through using Impala with the FDA data on adverse drug reactions (FAERS).

Demonstrates getting started with Impala isn’t hard. Which is true.

What’s lacking is a measure of the difficulty of good results.

Any old result, good or bad, probably isn’t of interest to most users.
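To make that concrete, here is a hedged Python/pandas sketch; the column names and rows are invented, and real FAERS data is spread over several tables:

import pandas as pd

# Invented rows in the shape of FAERS-like data: one report per row,
# with a case id, drug name and reaction.
reports = pd.DataFrame([
    {"caseid": 1, "drug": "drug_a", "reaction": "nausea"},
    {"caseid": 1, "drug": "drug_a", "reaction": "nausea"},   # duplicate report
    {"caseid": 2, "drug": "drug_a", "reaction": "headache"},
    {"caseid": 3, "drug": "drug_b", "reaction": "nausea"},
])

# "Any old result": raw row counts, inflated by duplicate reports.
naive = reports.groupby(["drug", "reaction"]).size()

# A slightly better result: count distinct cases per drug/reaction pair.
per_case = (reports.drop_duplicates(["caseid", "drug", "reaction"])
                    .groupby(["drug", "reaction"]).size())

print(naive)
print(per_case)

The naive and per-case counts already disagree on this tiny example; real FAERS analyses face deduplication, coding and denominator problems well beyond this. That gap is the measure of difficulty I have in mind.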

November 17, 2012

Visualising associations between paired `omics’ data sets

Visualising associations between paired `omics’ data sets by Ignacio González, Kim-Anh Lê Cao, Melissa J Davis and Sébastien Déjean.

Abstract:

Background

Each omics platform is now able to generate a large amount of data. Genomics, proteomics, metabolomics, interactomics are compiled at an ever increasing pace and now form a core part of the fundamental systems biology framework. Recently, several integrative approaches have been proposed to extract meaningful information. However, these approaches lack of visualisation outputs to fully unravel the complex associations between different biological entities.

Results

The multivariate statistical approaches ‘regularized Canonical Correlation Analysis’ and ‘sparse Partial Least Squares regression’ were recently developed to integrate two types of highly dimensional ‘omics’ data and to select relevant information. Using the results of these methods, we propose to revisit few graphical outputs to better understand the relationships between two ‘omics’ data and to better visualise the correlation structure between the different biological entities. These graphical outputs include Correlation Circle plots, Relevance Networks and Clustered Image Maps. We demonstrate the usefulness of such graphical outputs on several biological data sets and further assess their biological relevance using gene ontology analysis.

Conclusions

Such graphical outputs are undoubtedly useful to aid the interpretation of these promising integrative analysis tools and will certainly help in addressing fundamental biological questions and understanding systems as a whole.

Availability

The graphical tools described in this paper are implemented in the freely available R package mixOmics and in its associated web application.

Just in case you are looking for something a little more challenging this weekend than political feeds on Twitter. 😉

Is “higher dimensional” data everywhere? Just more obvious in the biological sciences?

If so, there are lessons here for manipulation/visualization of higher dimensional data in other areas as well.
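For readers who want to play before installing mixOmics, here is a minimal NumPy sketch of the simplest of these displays: a relevance network built by thresholding cross-correlations between two paired data blocks. The data is random with one planted association; this is not the rCCA/sPLS machinery the paper actually uses:

import numpy as np

rng = np.random.default_rng(0)
n_samples, n_genes, n_metabolites = 20, 5, 4

# Fake paired 'omics blocks measured on the same samples.
genes = rng.normal(size=(n_samples, n_genes))
metabolites = rng.normal(size=(n_samples, n_metabolites))
metabolites[:, 0] += genes[:, 2]          # plant one real association

# Cross-correlation matrix between the two blocks.
both = np.corrcoef(genes, metabolites, rowvar=False)
cross = both[:n_genes, n_genes:]

# Keep only the strong edges: that is the "relevance network".
threshold = 0.5
edges = [(f"gene_{i}", f"metab_{j}", round(cross[i, j], 2))
         for i in range(n_genes) for j in range(n_metabolites)
         if abs(cross[i, j]) >= threshold]
print(edges)

The planted gene/metabolite pair should survive the threshold; a few spurious edges may too, which is one reason the paper leans on regularized methods rather than raw correlations.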

RMol: …SD/Molfile structure information into R Objects

Filed under: Bioinformatics,Biomedical,Cheminformatics — Patrick Durusau @ 2:17 pm

RMol: A Toolset for Transforming SD/Molfile structure information into R Objects by Martin Grabner, Kurt Varmuza and Matthias Dehmer.

Abstract:

Background

The graph-theoretical analysis of molecular networks has a long tradition in chemoinformatics. As demonstrated frequently, a well designed format to encode chemical structures and structure-related information of organic compounds is the Molfile format. But when it comes to use modern programming languages for statistical data analysis in Bio- and Chemoinformatics, R as one of the most powerful free languages lacks tools to process R Molfile data collections and import molecular network data into R.

Results

We design an R object which allows a lossless information mapping of structural information from Molfiles into R objects. This provides the basis to use the RMol object as an anchor for connecting Molfile data collections with R libraries for analyzing graphs. Associated with the RMol objects, a set of R functions completes the toolset to organize, describe and manipulate the converted data sets. Further, we bypass R-typical limits for manipulating large data sets by storing R objects in bz-compressed serialized files instead of employing RData files.

Conclusions

By design, RMol is an R tool set without dependencies to other libraries or programming languages. It is useful to integrate into pipelines for serialized batch analysis by using network data and, therefore, helps to process sdf-data sets in R efficiently. It is freely available under the BSD licence. The script source can be downloaded from http://sourceforge.net/p/rmol-toolset.

Important work, not the least because of the explosion of interest in bio/cheminformatics.

If I understand the rationale for the software, it:

  1. enables use of existing R tools for graph/network analysis
  2. fits well into workflows with serialized pipelines
  3. reduces dependencies by extracting SD-File information
  4. avoids repetitive transformations by storing chemical and molecular network information in R objects

All of which are true but I have a nagging concern about the need for transformation.

Knowing the structure of Molfiles and the requirements of R tools for graph/network analysis, how are the results of transformation different from R tools viewing Molfiles “as if” they were composed of R objects?

The mapping is already well known because that is what RMol uses to create the results of transformation. Moreover, for any particular use, more data may be transformed than is required for a particular analysis.

Not to take anything away from very useful work, but the days of data transformation are numbered. As data sets grow in size, there will be fewer and fewer places to store a “transformed” data set.
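To make the “as if” question concrete, here is a hedged Python sketch of a lazy, read-only view that exposes a V2000 Molfile as an adjacency list without materializing a transformed copy. It ignores charges, property blocks and V3000 files, and the file name in the usage comment is hypothetical:

class MolView:
    """A lazy, read-only 'view' of a V2000 Molfile as a graph: nothing is
    parsed until an adjacency list is actually asked for, and nothing
    beyond the atom and bond blocks is touched. Simplified on purpose."""

    def __init__(self, path):
        self.path = path
        self._adjacency = None

    def adjacency(self):
        if self._adjacency is None:
            with open(self.path) as fh:
                lines = fh.read().splitlines()
            counts = lines[3].split()              # line 4 is the counts line
            n_atoms, n_bonds = int(counts[0]), int(counts[1])
            self.symbols = [line.split()[3]        # element symbol per atom
                            for line in lines[4:4 + n_atoms]]
            self._adjacency = {i: [] for i in range(n_atoms)}
            for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
                a, b = int(line.split()[0]) - 1, int(line.split()[1]) - 1
                self._adjacency[a].append(b)
                self._adjacency[b].append(a)
        return self._adjacency

# mol = MolView("caffeine.mol")    # hypothetical file name
# print(mol.adjacency(), mol.symbols)

Whether such views can be made fast enough is, I suspect, the practical argument for transformation.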

BTW, pay particular attention to the bibliography in this paper. Numerous references to follow if you are interested in this area.

November 13, 2012

The Power of Graphs for Analyzing Biological Datasets

Filed under: Bioinformatics,Biomedical,Graphs — Patrick Durusau @ 6:23 am

The Power of Graphs for Analyzing Biological Datasets by Davy Suvee.

Very good slide deck on using graphs with biological datasets.

May give you some ideas of what capabilities you need to offer in this area.

November 12, 2012

KEGG: Kyoto Encyclopedia of Genes and Genomes

Filed under: Bioinformatics,Biomedical,Genome — Patrick Durusau @ 8:01 pm

KEGG: Kyoto Encyclopedia of Genes and Genomes

From the webpage:

KEGG is a database resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies (See Release notes for new and updated features).

Anyone in biological research is probably already using KEGG. Take the opportunity to educate yourself about this resource. In particular how to use it with other resources.

The KEGG project, like any other project, needs funding. Consider passing this site along to funders interested in biological research resources.

I first saw this in a tweet by Anita de Waard.

An Ontological Representation of Biomedical Data Sources and Records [Data & Record as Subjects]

Filed under: Bioinformatics,Biomedical,Medical Informatics,Ontology,RDF — Patrick Durusau @ 7:27 pm

An Ontological Representation of Biomedical Data Sources and Records by Michael Bada, Kevin Livingston, and Lawrence Hunter.

Abstract:

Large RDF-triple stores have been the basis of prominent recent attempts to integrate the vast quantities of data in semantically divergent databases. However, these repositories often conflate data-source records, which are information content entities, and the biomedical concepts and assertions denoted by them. We propose an ontological model for the representation of data sources and their records as an extension of the Information Artifact Ontology. Using this model, we have consistently represented the contents of 17 prominent biomedical databases as a 5.6-billion RDF-triple knowledge base, enabling querying and inference over this large store of integrated data.

Recognition of the need to treat data containers as subjects, along with the data they contain, is always refreshing.

In particular because the evolution of data sources can be captured, as the authors remark:

Our ontology is fully capable of handling the evolution of data sources: If the schema of a given data set is changed, a new instance of the schema is simply created, along with the instances of the fields of the new schema. If the data sets of a data source change (or a new set is made available), an instance for each new data set can be created, along with instances for its schema and fields. (Modeling of incremental change rather than creation of new instances may be desirable but poses significant representational challenges.) Additionally, using our model, if a researcher wishes to work with multiple versions of a given data source (e.g., to analyze some aspect of multiple versions of a given database), an instance for each version of the data source can be created. If different versions of a data source consist of different data sets (e.g., different file organizations) and/or different schemas and fields, the explicit representation of all of these elements and their linkages will make the respective structures of the disparate data-source versions unambiguous. Furthermore, it may be the case that only a subset of a data source needs to be represented; in such a case, only instances of the data sets, schemas, and fields of interest are created.
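As a toy illustration of the pattern, one instance per data-source version plus instances for its schema and fields, here is an rdflib sketch. The namespace, class and property names are made up; the paper's actual model extends the Information Artifact Ontology, which is not reproduced here:

from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/datasource#")   # invented namespace
g = Graph()

def add_version(g, source, version, fields):
    """One instance per data-source version, plus its schema and fields."""
    src = EX[source]
    ver = EX[f"{source}_v{version}"]
    schema = EX[f"{source}_v{version}_schema"]
    g.add((src, RDF.type, EX.DataSource))
    g.add((ver, RDF.type, EX.DataSourceVersion))
    g.add((ver, EX.versionOf, src))
    g.add((schema, EX.schemaOf, ver))
    for name in fields:
        field = EX[f"{source}_v{version}_{name}"]
        g.add((field, EX.fieldOf, schema))
        g.add((field, EX.fieldName, Literal(name)))

add_version(g, "ToyDB", "2011", ["gene_id", "symbol"])
add_version(g, "ToyDB", "2012", ["gene_id", "symbol", "synonyms"])
print(g.serialize(format="turtle"))

Each new version simply becomes another instance, which is the point the authors make about evolution.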

I first saw this in a tweet by Anita de Waard.

November 9, 2012

Semantic Technologies — Biomedical Informatics — Individualized Medicine

Filed under: Bioinformatics,Biomedical,Medical Informatics,Ontology,Semantic Web — Patrick Durusau @ 11:14 am

Joint Workshop on Semantic Technologies Applied to Biomedical Informatics and Individualized Medicine (SATBI+SWIM 2012) (In conjunction with International Semantic Web Conference (ISWC 2012) Boston, Massachusetts, U.S.A. November 11-15, 2012)

If you are at ISWC, consider attending.

To help with that choice, here are the accepted papers:

Jim McCusker, Jeongmin Lee, Chavon Thomas and Deborah L. McGuinness. Public Health Surveillance Using Global Health Explorer. [PDF]

Anita de Waard and Jodi Schneider. Formalising Uncertainty: An Ontology of Reasoning, Certainty and Attribution (ORCA). [PDF]

Alexander Baranya, Luis Landaeta, Alexandra La Cruz and Maria-Esther Vidal. A Workflow for Improving Medical Visualization of Semantically Annotated CT-Images. [PDF]

Derek Corrigan, Jean Karl Soler and Brendan Delaney. Development of an Ontological Model of Evidence for TRANSFoRm Utilizing Transition Project Data. [PDF]

Amina Chniti, Abdelali BOUSSADI, Patrice DEGOULET, Patrick Albert and Jean Charlet. Pharmaceutical Validation of Medication Orders Using an OWL Ontology and Business Rules. [PDF]

November 4, 2012

Manual Gene Ontology annotation workflow

Filed under: Annotation,Bioinformatics,Curation,Ontology — Patrick Durusau @ 9:00 pm

Manual Gene Ontology annotation workflow at the Mouse Genome Informatics Database by Harold J. Drabkin, Judith A. Blake and for the Mouse Genome Informatics Database. Database (2012) 2012 : bas045 doi: 10.1093/database/bas045.

Abstract:

The Mouse Genome Database, the Gene Expression Database and the Mouse Tumor Biology database are integrated components of the Mouse Genome Informatics (MGI) resource (http://www.informatics.jax.org). The MGI system presents both a consensus view and an experimental view of the knowledge concerning the genetics and genomics of the laboratory mouse. From genotype to phenotype, this information resource integrates information about genes, sequences, maps, expression analyses, alleles, strains and mutant phenotypes. Comparative mammalian data are also presented particularly in regards to the use of the mouse as a model for the investigation of molecular and genetic components of human diseases. These data are collected from literature curation as well as downloads of large datasets (SwissProt, LocusLink, etc.). MGI is one of the founding members of the Gene Ontology (GO) and uses the GO for functional annotation of genes. Here, we discuss the workflow associated with manual GO annotation at MGI, from literature collection to display of the annotations. Peer-reviewed literature is collected mostly from a set of journals available electronically. Selected articles are entered into a master bibliography and indexed to one of eight areas of interest such as ‘GO’ or ‘homology’ or ‘phenotype’. Each article is then either indexed to a gene already contained in the database or funneled through a separate nomenclature database to add genes. The master bibliography and associated indexing provide information for various curator-reports such as ‘papers selected for GO that refer to genes with NO GO annotation’. Once indexed, curators who have expertise in appropriate disciplines enter pertinent information. MGI makes use of several controlled vocabularies that ensure uniform data encoding, enable robust analysis and support the construction of complex queries. These vocabularies range from pick-lists to structured vocabularies such as the GO. All data associations are supported with statements of evidence as well as access to source publications.

Semantic uniformity is achievable, in a limited enough sphere, provided you are willing to pay the price for it.

It has a high rate of return over less carefully curated content.

The project is producing high quality results, although hampered by a lack of resources.

My question is whether a similarly high quality of results could be achieved with less semantically consistent curation by distributed contributors.

Harnessing the community of those interested in such a resource, and refining those less semantically consistent entries into higher quality annotations.

Pointers to examples of such projects?

October 20, 2012

Bio4j 0.8, some numbers

Filed under: Bio4j,Bioinformatics,Genome,Graphs — Patrick Durusau @ 10:29 am

Bio4j 0.8, some numbers by Pablo Pareja Tobes.

Bio4j 0.8 was recently released and now it’s time to have a deeper look at its numbers (as you can see we are quickly approaching the 1 billion relationships and 100M nodes):

  • Number of Relationships: 717.484.649
  • Number of Nodes: 92.667.745
  • Relationship types: 144
  • Node types: 42

If Pablo gets tired of his brilliant career in bioinformatics he can always run for office in the United States with claims like: “…we are quickly approaching the 1 billion relationships….” 😉

Still, a stunning achievement!

See Pablo’s post for more analysis.

Pass the project along to anyone with doubts about graph databases.

October 19, 2012

Random Forest Methodology – Bioinformatics

Filed under: Bioinformatics,Biomedical,Random Forests — Patrick Durusau @ 3:47 pm

Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics by Anne-Laure Boulesteix, Silke Janitza, Jochen Kruppa, Inke R. König

(Boulesteix, A.-L., Janitza, S., Kruppa, J. and König, I. R. (2012), Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics. WIREs Data Mining Knowl Discov, 2: 493–507. doi: 10.1002/widm.1072)

Abstract:

The random forest (RF) algorithm by Leo Breiman has become a standard data analysis tool in bioinformatics. It has shown excellent performance in settings where the number of variables is much larger than the number of observations, can cope with complex interaction structures as well as highly correlated variables and return measures of variable importance. This paper synthesizes 10 years of RF development with emphasis on applications to bioinformatics and computational biology. Special attention is paid to practical aspects such as the selection of parameters, available RF implementations, and important pitfalls and biases of RF and its variable importance measures (VIMs). The paper surveys recent developments of the methodology relevant to bioinformatics as well as some representative examples of RF applications in this context and possible directions for future research.

Something to expand your horizons a bit.

And a new way to say “curse of dimensionality,” to-wit,

‘n ≪ p curse’

New to me anyway.

I was amused to read in the Wikipedia article on random forests that its disadvantages include:

Unlike decision trees, the classifications made by Random Forests are difficult for humans to interpret.

Turn about is fair play since many classifications made by humans are difficult for computers to interpret. 😉
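If you want to feel the ‘n ≪ p’ setting and the variable importance measures the survey discusses, here is a small scikit-learn sketch on synthetic data (not the R packages the paper reviews):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n, p = 60, 1000                           # n << p, as in many omics studies
X = rng.normal(size=(n, p))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # only 2 of 1000 features matter

forest = RandomForestClassifier(n_estimators=500, oob_score=True,
                                random_state=0).fit(X, y)
print("OOB accuracy:", round(forest.oob_score_, 2))

# Rank features by the forest's importance measure.
top = np.argsort(forest.feature_importances_)[::-1][:5]
print("top features by importance:", top)

With luck the two informative features float to the top of the ranking, which is roughly the part a human can still “interpret.”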

October 16, 2012

The “O” Word (Ontology) Isn’t Enough

Filed under: Bioinformatics,Biomedical,Gene Ontology,Genome,Medical Informatics,Ontology — Patrick Durusau @ 10:36 am

The Units Ontology makes reference to the Gene Ontology as an example of a successful web ontology effort.

As it should. The Gene Ontology (GO) is the only successful web ontology effort. A universe with one (1) inhabitant.

The GO has a number of differences from wannabe successful ontology candidates. (see the article below)

The first difference echoes loudly across the semantic engineering universe:

One of the factors that account for GO’s success is that it originated from within the biological community rather than being created and subsequently imposed by external knowledge engineers. Terms were created by those who had expertise in the domain, thus avoiding the huge effort that would have been required for a computer scientist to learn and organize large amounts of biological functional information. This also led to general acceptance of the terminology and its organization within the community. This is not to say that there have been no disagreements among biologists over the conceptualization, and there is of course a protocol for arriving at a consensus when there is such a disagreement. However, a model of a domain is more likely to conform to the shared view of a community if the modelers are within or at least consult to a large degree with members of that community.

Did you catch that first line?

One of the factors that account for GO’s success is that it originated from within the biological community rather than being created and subsequently imposed by external knowledge engineers.

Saying the “O” word, ontology, and promising it will benefit everyone if they will just listen to you, isn’t enough.

There are other factors to consider:

A Short Study on the Success of the Gene Ontology by Michael Bada, Robert Stevens, Carole Goble, Yolanda Gil, Michael Ashburner, Judith A. Blake, J. Michael Cherry, Midori Harris, Suzanna Lewis.

Abstract:

While most ontologies have been used only by the groups who created them and for their initially defined purposes, the Gene Ontology (GO), an evolving structured controlled vocabulary of nearly 16,000 terms in the domain of biological functionality, has been widely used for annotation of biological-database entries and in biomedical research. As a set of learned lessons offered to other ontology developers, we list and briefly discuss the characteristics of GO that we believe are most responsible for its success: community involvement; clear goals; limited scope; simple, intuitive structure; continuous evolution; active curation; and early use.

October 13, 2012

Bio4j 0.8 is here!

Filed under: Bio4j,Bioinformatics,Biomedical,Genome — Patrick Durusau @ 1:57 pm

Bio4j 0.8 is here! by Pablo Pareja Tobes.

You will find “5.488.000 new proteins and 3.233.000 genes” and other improvements!

Whether you are interested in graph databases (Neo4j), bioinformatics or both, this is welcome news!

October 12, 2012

PathNet: A tool for pathway analysis using topological information

Filed under: Bioinformatics,Biomedical,Genome,Graphs,Networks — Patrick Durusau @ 3:12 pm

PathNet: A tool for pathway analysis using topological information by Bhaskar Dutta, Anders Wallqvist and Jaques Reifman. (Source Code for Biology and Medicine 2012, 7:10 doi:10.1186/1751-0473-7-10)

Abstract:

Background

Identification of canonical pathways through enrichment of differentially expressed genes in a given pathway is a widely used method for interpreting gene lists generated from highthroughput experimental studies. However, most algorithms treat pathways as sets of genes, disregarding any inter- and intra-pathway connectivity information, and do not provide insights beyond identifying lists of pathways.

Results

We developed an algorithm (PathNet) that utilizes the connectivity information in canonical pathway descriptions to help identify study-relevant pathways and characterize non-obvious dependencies and connections among pathways using gene expression data. PathNet considers both the differential expression of genes and their pathway neighbors to strengthen the evidence that a pathway is implicated in the biological conditions characterizing the experiment. As an adjunct to this analysis, PathNet uses the connectivity of the differentially expressed genes among all pathways to score pathway contextual associations and statistically identify biological relations among pathways. In this study, we used PathNet to identify biologically relevant results in two Alzheimers disease microarray datasets, and compared its performance with existing methods. Importantly, PathNet identified deregulation of the ubiquitin-mediated proteolysis pathway as an important component in Alzheimers disease progression, despite the absence of this pathway in the standard enrichment analyses.

Conclusions

PathNet is a novel method for identifying enrichment and association between canonical pathways in the context of gene expression data. It takes into account topological information present in pathways to reveal biological information. PathNet is available as an R workspace image from http://www.bhsai.org/downloads/pathnet/.

Important work for genomics but also a reminder that a list of paths is just that, a list of paths.

The value-add and creative aspect of data analysis is in the scoring of those paths in order to wring more information from them.

How is it for you? Just lists of paths or something a bit more clever?
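Here is a toy sketch of the “a bit more clever” direction: score each gene by its own evidence plus a discounted share of its pathway neighbors’ evidence. The graph, scores and weighting are invented and only loosely in the spirit of PathNet’s direct-plus-indirect evidence, not its actual statistics:

# Toy pathway graph and -log10(p) evidence scores; all numbers invented.
neighbors = {
    "A": ["B", "C"],
    "B": ["A"],
    "C": ["A", "D"],
    "D": ["C"],
}
direct = {"A": 0.5, "B": 3.0, "C": 2.5, "D": 0.2}   # each gene's own evidence

def combined_score(gene, alpha=0.5):
    """Own evidence plus a discounted average of the neighbors' evidence."""
    nbrs = neighbors[gene]
    indirect = sum(direct[n] for n in nbrs) / len(nbrs) if nbrs else 0.0
    return direct[gene] + alpha * indirect

for gene in sorted(neighbors):
    print(gene, round(combined_score(gene), 2))
# Gene A gets pulled up by its strongly perturbed neighbors B and C,
# the kind of signal a plain gene-set test would miss.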

September 23, 2012

Working More Effectively With Statisticians

Filed under: Bioinformatics,Biomedical,Data Quality,Statistics — Patrick Durusau @ 10:33 am

Working More Effectively With Statisticians by Deborah M. Anderson. (Fall 2012 Newsletter of Society for Clinical Data Management, pages 5-8)

Abstract:

The role of the clinical trial biostatistician is to lend scientific expertise to the goal of demonstrating safety and efficacy of investigative treatments. Their success, and the outcome of the clinical trial, is predicated on adequate data quality, among other factors. Consequently, the clinical data manager plays a critical role in the statistical analysis of clinical trial data. In order to better fulfill this role, data managers must work together with the biostatisticians and be aligned in their understanding of data quality. This article proposes ten specific recommendations for data managers in order to facilitate more effective collaboration with biostatisticians.

See the article for the details but the recommendations are generally applicable to all data collection projects:

Recommendation #1: Communicate early and often with the biostatistician and provide frequent data extracts for review.

Recommendation #2: Employ caution when advising sites or interactive voice/web response (IVR/IVW) vendors on handling of randomization errors.

Recommendation #3: Collect the actual investigational treatment and dose group for each subject.

Recommendation #4: Think carefully and consult the biostatistician about the best way to structure investigational treatment exposure and accountability data.

Recommendation #5: Clarify in electronic data capture (EDC) specifications whether a question is only a “prompt” screen or whether the answer to the question will be collected explicitly in the database.

Recommendation #6: Recognize the most critical data items from a statistical analysis perspective and apply the highest quality standards to them.

Recommendation #7: Be alert to protocol deviations/violations (PDVs).

Recommendation #8: Plan for a database freeze and final review before database lock.

Recommendation #9: Archive a snapshot of the clinical database at key analysis milestones and at the end of the study.

Recommendation #10: Educate yourself about fundamental statistical principles whenever the opportunity arises.

I first saw this at John Johnson’s Data cleaning is harder than statistical analysis.

September 10, 2012

Mapping solution to heterogeneous data sources

Filed under: Bioinformatics,Biomedical,Genome,Heterogeneous Data,Mapping — Patrick Durusau @ 2:21 pm

dbSNO: a database of cysteine S-nitrosylation by Tzong-Yi Lee, Yi-Ju Chen, Cheng-Tsung Lu, Wei-Chieh Ching, Yu-Chuan Teng, Hsien-Da Huang and Yu-Ju Chen. (Bioinformatics (2012) 28 (17): 2293-2295. doi: 10.1093/bioinformatics/bts436)

OK, the title doesn’t jump out and say “mapping solution here!” 😉

Reading a bit further, you discover that text mining is used to locate sequences and that data is then mapped to “UniProtKB protein entries.”

The data set provides access to:

  • UniProt ID
  • Organism
  • Position
  • PubMed Id
  • Sequence

My concern is what happens, when X is mapped to a UniProtKB protein entry, to:

  • The prior identifier for X (in the article or source), and
  • The mapping from X to the UniProtKB protein entry?

If both of those are captured, then prior literature can be annotated upon rendering to point to later aggregation of information on a subject.

If the prior identifier, place of usage, the mapping, etc., are not captured, then prior literature, when we encounter it, remains frozen in time.

Mapping solutions work, but they repay the effort several times over if the prior identifier and its mapping to the “new” identifier are captured as part of the process.
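A minimal sketch of what capturing that looks like in practice: a record that keeps the prior identifier, the new identifier and the provenance of the mapping. The field names and example values are illustrative, not dbSNO's actual schema:

from dataclasses import dataclass, field
from datetime import date

@dataclass
class IdentifierMapping:
    """Keep the prior identifier alongside the one it was mapped to,
    plus enough provenance to re-check or re-run the mapping later."""
    prior_id: str            # identifier as used in the source article
    mapped_id: str           # e.g. a UniProtKB accession
    source: str              # where the prior identifier was seen
    method: str              # how the mapping was established
    mapped_on: date = field(default_factory=date.today)

m = IdentifierMapping(prior_id="GAPDH peptide VKVGVNGFGR",
                      mapped_id="UniProtKB:P04406",
                      source="PMID:12345678",      # placeholder citation
                      method="sequence identity mapping")
print(m)

With records like these, older literature can be re-annotated instead of staying frozen in time.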

Abstract

Summary: S-nitrosylation (SNO), a selective and reversible protein post-translational modification that involves the covalent attachment of nitric oxide (NO) to the sulfur atom of cysteine, critically regulates protein activity, localization and stability. Due to its importance in regulating protein functions and cell signaling, a mass spectrometry-based proteomics method rapidly evolved to increase the dataset of experimentally determined SNO sites. However, there is currently no database dedicated to the integration of all experimentally verified S-nitrosylation sites with their structural or functional information. Thus, the dbSNO database is created to integrate all available datasets and to provide their structural analysis. Up to April 15, 2012, the dbSNO has manually accumulated >3000 experimentally verified S-nitrosylated peptides from 219 research articles using a text mining approach. To solve the heterogeneity among the data collected from different sources, the sequence identity of these reported S-nitrosylated peptides are mapped to the UniProtKB protein entries. To delineate the structural correlation and consensus motif of these SNO sites, the dbSNO database also provides structural and functional analyses, including the motifs of substrate sites, solvent accessibility, protein secondary and tertiary structures, protein domains and gene ontology.

Availability: The dbSNO is now freely accessible via http://dbSNO.mbc.nctu.edu.tw. The database content is regularly updated upon collecting new data obtained from continuously surveying research articles.

Contacts: francis@saturn.yu.edu.tw or yujuchen@gate.sinica.edu.tw.

Reveal—visual eQTL analytics [Statistics of Identity/Association]

Filed under: Bioinformatics,Biomedical,Genome — Patrick Durusau @ 1:36 pm

Reveal—visual eQTL analytics by Günter Jäger, Florian Battke and Kay Nieselt. (Bioinformatics (2012) 28 (18): i542-i548. doi: 10.1093/bioinformatics/bts382)

Abstract

Motivation: The analysis of expression quantitative trait locus (eQTL) data is a challenging scientific endeavor, involving the processing of very large, heterogeneous and complex data. Typical eQTL analyses involve three types of data: sequence-based data reflecting the genotypic variations, gene expression data and meta-data describing the phenotype. Based on these, certain genotypes can be connected with specific phenotypic outcomes to infer causal associations of genetic variation, expression and disease.

To this end, statistical methods are used to find significant associations between single nucleotide polymorphisms (SNPs) or pairs of SNPs and gene expression. A major challenge lies in summarizing the large amount of data as well as statistical results and to generate informative, interactive visualizations.

Results: We present Reveal, our visual analytics approach to this challenge. We introduce a graph-based visualization of associations between SNPs and gene expression and a detailed genotype view relating summarized patient cohort genotypes with data from individual patients and statistical analyses.

Availability: Reveal is included in Mayday, our framework for visual exploration and analysis. It is available at http://it.inf.uni-tuebingen.de/software/reveal/.

Contact: guenter.jaeger@uni-tuebingen.de

Interesting work on a number of fronts, not the least of which is “…analysis of expression quantitative trait locus (eQTL) data.”

Its use of statistical methods to discover “significant associations,” interactive visualizations and processing of “large, heterogeneous and complex data” are of more immediate interest to me.

Wikipedia is evidence for subjects (including relationships) that can be usefully identified using URLs. But that is only a fraction of all the subjects and relationships we may want to include in our topic maps.

An area I need to work up for my next topic map course is probabilistic identification of subjects and their relationships. What statistical techniques are useful for what fields? Or even what subjects within what fields? What are the processing tradeoffs versus certainty of identification?

Suggestions/comments?
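As a starting point for that discussion, the simplest building block of an eQTL analysis is a single SNP-versus-expression regression; here is a hedged SciPy sketch on simulated genotype dosages, nothing to do with Reveal's implementation:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 100
genotype = rng.integers(0, 3, size=n)              # 0/1/2 copies of the minor allele
expression = 0.4 * genotype + rng.normal(size=n)   # planted effect plus noise

# One SNP, one gene: slope and p-value of the simple regression.
fit = stats.linregress(genotype, expression)
print(f"beta = {fit.slope:.2f}, p = {fit.pvalue:.2e}")

Multiply that by millions of SNP/gene pairs and the multiple-testing and visualization problems the paper addresses follow immediately.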

September 9, 2012

Reverse engineering of gene regulatory networks from biological data [self-conscious?]

Filed under: Bioinformatics,Biomedical,Networks — Patrick Durusau @ 4:25 pm

Reverse engineering of gene regulatory networks from biological data by Li-Zhi Liu, Fang-Xiang Wu, Wen-Jun Zhang. (Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, Volume 2, Issue 5, pages 365–385, September/October 2012)

Abstract:

Reverse engineering of gene regulatory networks (GRNs) is one of the most challenging tasks in systems biology and bioinformatics. It aims at revealing network topologies and regulation relationships between components from biological data. Owing to the development of biotechnologies, various types of biological data are collected from experiments. With the availability of these data, many methods have been developed to infer GRNs. This paper firstly provides an introduction to the basic biological background and the general idea of GRN inferences. Then, different methods are surveyed from two aspects: models that those methods are based on and inference algorithms that those methods use. The advantages and disadvantages of these models and algorithms are discussed.

As you might expect, heterogeneous data is one topic of interest in this paper:

Models Based on Heterogeneous Data

Besides the dimensionality problem, the data from microarray experiments always contain many noises and measurement errors. Therefore, an accurate network can hardly be obtained due to the limited information in microarray data. With the development of technologies, a large amount of other diverse types of genomic data are collected. Many researchers are motivated to study GRNs by combining these data with microarray data. Because different types of the genomic data reflect different aspects of underlying networks, the inferences of GRNs based on the integration of different types of data are expected to provide more accurate and reliable results than based on microarray data alone. However, effectively integrating heterogeneous data is currently a hot research topic and a nontrivial task because they are generally collected along with much noise and related to each other in a complex way. (emphasis added)

Truth be known, high dimensionality and heterogeneous data are more accurate reflections of the objects of our study.

Conversely, the lower the dimensions of a model or the greater the homogeneity of the data, the less accurate they become.

Are we creating less accurate reflections to allow for the inabilities of our machines?

Will that make our machines less self-conscious about their limitations?

Or will that make us less self-conscious about our machines’ limitations?

September 8, 2012

Customizing the java classes for the NCBI generated by XJC

Filed under: Bioinformatics,Java — Patrick Durusau @ 4:28 pm

Customizing the java classes for the NCBI generated by XJC by Pierre Lindenbaum.

From the post:

Reminder: XJC is the Java XML Binding Compiler. It automates the mapping between XML documents and Java objects:

(mapping graphic omitted)

The code generated by XJC allows you to:

  • Unmarshal XML content into a Java representation
  • Access and update the Java representation
  • Marshal the Java representation of the XML content into XML content

This post caught my eye because Pierre is adding an “equals” method.

It is a string equivalence test and for the data in question that makes sense.

Your “equivalence” test might be more challenging.
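For instance, here is a toy Python sketch (not Pierre's Java) of an equivalence test that normalizes before comparing; the identifiers and normalization rules are invented for illustration:

class Accession:
    """An identifier whose equality test normalizes before comparing:
    case, surrounding whitespace and a version suffix are ignored."""

    def __init__(self, value):
        self.value = value

    def _key(self):
        return self.value.strip().upper().split(".")[0]

    def __eq__(self, other):
        return isinstance(other, Accession) and self._key() == other._key()

    def __hash__(self):
        return hash(self._key())

print(Accession("nm_000546.6") == Accession("NM_000546"))   # True
print(Accession("NM_000546") == Accession("NM_000547"))     # False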

Bioinformatics Tools in Haskell

Filed under: Bioinformatics,Haskell — Patrick Durusau @ 3:46 pm

Bioinformatics Tools in Haskell by Udo Stenzel.

From the post:

This is a collection of miscellaneous stuff that deals mostly with high-throughput sequencing data. I took some of my throw-away scripts that had developed a life of their own, separated out a library, and cleaned up the rest. Everything is licensed under the GPL and naturally comes without any warranty.

Most of the stuff here is written in Haskell. The natural way to run these programs is to install the Haskell Platform, which may be as easy as running ‘apt-get install haskell-platform‘, e.g. on Debian Testing aka “Squeeze”. Instead, you can install the Glasgow Haskell Compiler and Cabal individually. After that, download, unpack and ‘cabal install‘ Biohazard first, then install whatever else you need.

If you don’t want to become a Haskell programmer (you really should), you can still download the binary packages (for Linux on ix86_64) and hope that they work. You’ll probably need to install Gnu MP (‘apt-get install libgmp-dev‘ might do it). If the binaries don’t work for you, I don’t care; use the source instead.

Good for bioinformatics and I suspect for learning Haskell in high-throughput situations.

Speculation: How will processing change when there are only “high-throughput data streams”?

That is, there isn’t any “going back” to find legacy data; you just wait for it to reappear in the stream.

Or imagine streams of “basic” data that don’t change much, alongside other data streams that are “new” or rapidly changing.

If that sounds wasteful of bandwidth, imagine bandwidth increasing at the same rate as local storage, so that the incoming data connection at your home computer is 1 TB or higher.

Would you really need local storage at all?

September 7, 2012

EU-ADR Web Platform

Filed under: Bioinformatics,Biomedical,Drug Discovery,Medical Informatics — Patrick Durusau @ 10:29 am

EU-ADR Web Platform

I was disappointed to not find the UMLS concepts and related terms mapping for participants in the EU-ADR project.

I did find these workflows at the EU-ADR Web Platform:

MEDLINE ADR

In the filtering process of well known signals, the aim of the “MEDLINE ADR” workflow is to automate the search of publications related to ADRs corresponding to a given drug/adverse event association. To do so, we defined an approach based on the MeSH thesaurus, using the subheadings «chemically induced» and «adverse effects» with the “Pharmacological Action” knowledge. Using a threshold of ≥3 extracted publications, the automated search method presented a sensitivity of 93% and a specificity of 97% on the true positive and true negative sets (WP 2.2). We then determined a threshold number of extracted publications ≥ 3 to confirm the knowledge of this association in the literature. This approach offers the opportunity to automatically determine if an ADR (association of a drug and an adverse event) has already been described in MEDLINE. However, the causality relationship between the drug and an event may be judged only by an expert reading the full text article and determining if the methodology of this article was correct and if the association is statistically significant.

MEDLINE Co-occurrence

The “MEDLINE Co-occurrence” workflow performs a comprehensive data processing operation, searching the given Drug-Event combination in the PubMed database. Final workflow results include a final score, measuring found drugs relevance regarding the initial Drug-Event pair, as well as pointers to web pages for the discovered drugs.

DailyMed

The “DailyMed” workflow performs a comprehensive data processing operation, searching the given Drug-Event combination in the DailyMed database. Final workflow results include a final score, measuring found drugs relevance regarding the initial Drug-Event pair, as well as pointers to web pages for the discovered drugs.

DrugBank

The “DrugBank” workflow performs a comprehensive data processing operation, searching the given Drug-Event combination in the DrugBank database. Final workflow results include a final score, measuring found drugs relevance regarding the initial Drug-Event pair, as well as pointers to web pages for the discovered drugs.

Substantiation

The “Substantiation” workflow tries to establish a connection between the clinical event and the drug through a gene or protein, by identifying the proteins that are targets of the drug and are also associated with the event. In addition it also considers information about drug metabolites in this process. In such cases it can be argued that the binding of the drug to the protein would lead to the observed event phenotype. Associations between the event and proteins are found by querying our integrated gene-disease association database (Bauer-Mehren, et al., 2010). As this database provides annotations of the gene-disease associations to the articles reporting the association and in case of text-mining derived associations even the exact sentence, the article or sentence can be studied in more detail in order to inspect the supporting evidence for each gene-disease association. It has to be mentioned that our gene-disease association database also contains information about genetic variants or SNPs and their association to diseases or adverse drug events. The methodology for providing information about the binding of a drug (or metabolite) to protein targets is reported in deliverable 4.2, and includes extraction from different databases (annotated chemical libraries) and application of prediction methods based on chemical similarity.

A glimpse of what is state of the art today and a basis for building better tools for tomorrow.
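To see the shape of the “MEDLINE ADR” filter, here is a hedged sketch using Biopython's Entrez E-utilities: count PubMed articles indexed with the drug's “adverse effects” and the event's “chemically induced” subheadings and apply the ≥3 threshold. The query syntax, the drug/event example and the email address are my assumptions, not the EU-ADR platform's actual code, and the call goes out over the network:

from Bio import Entrez

Entrez.email = "you@example.org"   # placeholder; NCBI asks for a real address

def known_adr(drug, event, threshold=3):
    """Count PubMed articles indexed with drug/adverse effects and
    event/chemically induced, and flag the pair as 'already described'
    when the count reaches the threshold."""
    query = ('"{}/adverse effects"[MeSH Terms] AND '
             '"{}/chemically induced"[MeSH Terms]').format(drug, event)
    handle = Entrez.esearch(db="pubmed", term=query, retmax=0)
    count = int(Entrez.read(handle)["Count"])
    handle.close()
    return count, count >= threshold

print(known_adr("rofecoxib", "myocardial infarction"))

As the workflow description notes, a count is only a filter; judging causality still takes an expert reading the full text.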

Harmonization of Reported Medical Events in Europe

Filed under: Bioinformatics,Biomedical,Health care,Medical Informatics — Patrick Durusau @ 10:00 am

Harmonization process for the identification of medical events in eight European healthcare databases: the experience from the EU-ADR project by Paul Avillach, et al. (J Am Med Inform Assoc, doi:10.1136/amiajnl-2012-000933)

Abstract

Objective Data from electronic healthcare records (EHR) can be used to monitor drug safety, but in order to compare and pool data from different EHR databases, the extraction of potential adverse events must be harmonized. In this paper, we describe the procedure used for harmonizing the extraction from eight European EHR databases of five events of interest deemed to be important in pharmacovigilance: acute myocardial infarction (AMI); acute renal failure (ARF); anaphylactic shock (AS); bullous eruption (BE); and rhabdomyolysis (RHABD).

Design The participating databases comprise general practitioners’ medical records and claims for hospitalization and other healthcare services. Clinical information is collected using four different disease terminologies and free text in two different languages. The Unified Medical Language System was used to identify concepts and corresponding codes in each terminology. A common database model was used to share and pool data and verify the semantic basis of the event extraction queries. Feedback from the database holders was obtained at various stages to refine the extraction queries.

….

Conclusions The iterative harmonization process enabled a more homogeneous identification of events across differently structured databases using different coding based algorithms. This workflow can facilitate transparent and reproducible event extractions and understanding of differences between databases.

Not to be overly critical, but the one thing left out of the abstract is any hint about the “…procedure used for harmonizing the extraction…”, which is what interests me.

The workflow diagram from figure 2 is worth transposing into HTML markup:

  • Event definition
    • Choice of the event
    • Event Definition Form (EDF) containing the medical definition and diagnostic criteria for the event
  • Concepts selection and projection into the terminologies
    • Search for Unified Medical Language System (UMLS) concepts corresponding to the medical definition as reported in the EDF
    • Projection of UMLS concepts into the different terminologies used in the participating databases
    • Publication on the project’s forum of the first list of UMLS concepts and corresponding codes and terms for each terminology
  • Revision of concepts and related terms
    • Feedback from database holders about the list of concepts with corresponding codes and related terms that they have previously used to identify the event of interest
    • Report on literature review on search criteria being used in previous observational studies that explored the event of interest
    • Text mining in database to identify potentially missing codes through the identification of terms associated with the event in databases
    • Conference call for finalizing the list of concepts
    • Search for new UMLS concepts from the proposed terms
    • Final list of UMLS concepts and related codes posted on the forum
  • Translation of concepts and coding algorithms into queries
    • Queries in each database were built using:
      1. the common data model;
      2. the concept projection into different terminologies; and
      3. the chosen algorithms for event definition
    • Query Analysis
      • Database holders extract data on the event of interest using codes and free text from pre-defined concepts and with database-specific refinement strategies
      • Database holders calculate incidence rates and comparisons are made among databases
      • Database holders compare search queries via the forum

At least for non-members, the EU-ADR website does not appear to offer access to the mapping between UMLS concepts and the related codes in each terminology. That mapping could be used to increase the accessibility of any database using those codes.
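To make the projection step above concrete, here is a minimal sketch of mapping a single UMLS concept (CUI) to codes in several source vocabularies by scanning MRCONSO.RRF from a local UMLS installation. The file path is a placeholder, the vocabulary abbreviations are ordinary UMLS source abbreviations (SABs) that would need to be adjusted to the terminologies actually used in the project, and this is not the EU-ADR code.

```python
from collections import defaultdict

MRCONSO = "/path/to/umls/META/MRCONSO.RRF"   # placeholder path to a local UMLS copy

def project_cui(cui, vocabularies=("ICD9CM", "ICD10", "ICPC", "RCD")):
    """Map one CUI to {source vocabulary: {(code, term), ...}}."""
    codes = defaultdict(set)
    with open(MRCONSO, encoding="utf-8") as fh:
        for line in fh:
            row = line.rstrip("\n").split("|")
            # MRCONSO.RRF columns: 0=CUI, 1=LAT, 11=SAB (source), 13=CODE, 14=STR
            if row[0] == cui and row[1] == "ENG" and row[11] in vocabularies:
                codes[row[11]].add((row[13], row[14]))
    return dict(codes)

# Example: project the Myocardial Infarction concept into the listed vocabularies.
print(project_cui("C0027051"))
```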

September 6, 2012

Human Ignorance, Deep and Profound

Filed under: Bioinformatics,Biomedical,Graphs,Networks — Patrick Durusau @ 3:33 am

Scientists have discovered over 4 million gene switches, formerly known as “junk” (a scientific shorthand for “we don’t know what this means”), in the human genome. From the New York Times article Bits of Mystery DNA, Far From ‘Junk,’ Play Crucial Role (Gina Kolata):

…The human genome is packed with at least four million gene switches that reside in bits of DNA that once were dismissed as “junk” but that turn out to play critical roles in controlling how cells, organs and other tissues behave. The discovery, considered a major medical and scientific breakthrough, has enormous implications for human health because many complex diseases appear to be caused by tiny changes in hundreds of gene switches.

The findings, which are the fruit of an immense federal project involving 440 scientists from 32 laboratories around the world, will have immediate applications for understanding how alterations in the non-gene parts of DNA contribute to human diseases, which may in turn lead to new drugs. They can also help explain how the environment can affect disease risk. In the case of identical twins, small changes in environmental exposure can slightly alter gene switches, with the result that one twin gets a disease and the other does not.

As scientists delved into the “junk” — parts of the DNA that are not actual genes containing instructions for proteins — they discovered a complex system that controls genes. At least 80 percent of this DNA is active and needed. The result of the work is an annotated road map of much of this DNA, noting what it is doing and how. It includes the system of switches that, acting like dimmer switches for lights, control which genes are used in a cell and when they are used, and determine, for instance, whether a cell becomes a liver cell or a neuron.

Reminds me of the discovery that glial cells aren’t just packing material to support neurons. We were missing about half the human brain by size.

While I find both discoveries exciting, I am also mindful that we are not getting any closer to complete knowledge.

Rather, they open up opportunities to correct prior mistakes and, at some future time, to discover present ones.

PS: As you probably suspect, relationships between gene switches are extremely complex. New graph databases/algorithms anyone?
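To give that PS a concrete shape: switches (enhancers) and the genes they regulate form a directed, attributed graph, which is exactly the kind of structure graph databases and graph algorithms are built for. Every name and edge below is invented, and NetworkX stands in for a real graph database.

```python
import networkx as nx

# Hypothetical regulatory relationships: switch -> gene, annotated by tissue.
g = nx.DiGraph()
g.add_edge("switch_1", "gene_A", tissue="liver")
g.add_edge("switch_1", "gene_B", tissue="liver")
g.add_edge("switch_2", "gene_A", tissue="neuron")

# Which switches regulate gene_A, and in which tissue?
for switch in g.predecessors("gene_A"):
    print(switch, g[switch]["gene_A"]["tissue"])
```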

August 31, 2012

Next Generation Sequencing, GNU-Make and .INTERMEDIATE

Filed under: Bioinformatics,Genome — Patrick Durusau @ 9:41 am

Next Generation Sequencing, GNU-Make and .INTERMEDIATE by Pierre Lindenbaum.

From the post:

I gave a crash course about NGS to a few colleagues today. For my demonstration I wrote a simple Makefile. Basically, it downloads a subset of the human chromosome 22, indexes it with bwa, generates a set of fastqs with wgsim, aligns the fastqs, generates the *.sai, the *.sam, the *.bam, sorts the bam and calls the SNPs with mpileup.

An illustration that there is plenty of life left in GNU Make.

Plus an interesting tip on the use of the .INTERMEDIATE special target in make scripts.

As a starting point, consider Make (software).
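For comparison, here is a rough Python transcription of the pipeline described in the post; it is not the author's Makefile. File names are invented and the exact bwa/samtools/bcftools flags are version-dependent (those shown follow the 2012-era command-line interfaces), so treat the commands as placeholders. Note what is lost relative to Make: declaring intermediate products such as the *.sai and *.sam files under the .INTERMEDIATE special target lets GNU Make delete them automatically once they are no longer needed, which a plain script does not do for you.

```python
import subprocess

# Placeholder file names; flags shown follow the 2012-era tool interfaces.
commands = [
    "bwa index chr22.fa",                             # index the reference
    "wgsim chr22.fa reads_1.fq reads_2.fq",           # simulate paired-end reads
    "bwa aln chr22.fa reads_1.fq > reads_1.sai",      # align each end
    "bwa aln chr22.fa reads_2.fq > reads_2.sai",
    "bwa sampe chr22.fa reads_1.sai reads_2.sai reads_1.fq reads_2.fq > aln.sam",
    "samtools view -bS aln.sam > aln.bam",            # SAM -> BAM
    "samtools sort aln.bam aln.sorted",               # old-style output prefix
    "samtools mpileup -uf chr22.fa aln.sorted.bam | bcftools view - > variants.vcf",
]

for cmd in commands:
    subprocess.run(cmd, shell=True, check=True)
```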

August 30, 2012

MyMiner: a web application for computer-assisted biocuration and text annotation

Filed under: Annotation,Bioinformatics,Biomedical,Classification — Patrick Durusau @ 10:35 am

MyMiner: a web application for computer-assisted biocuration and text annotation by David Salgado, Martin Krallinger, Marc Depaule, Elodie Drula, Ashish V. Tendulkar, Florian Leitner, Alfonso Valencia and Christophe Marcelle. (Bioinformatics (2012) 28(17): 2285-2287. doi:10.1093/bioinformatics/bts435)

Abstract:

Motivation: The exponential growth of scientific literature has resulted in a massive amount of unstructured natural language data that cannot be directly handled by means of bioinformatics tools. Such tools generally require structured data, often generated through a cumbersome process of manual literature curation. Herein, we present MyMiner, a free and user-friendly text annotation tool aimed to assist in carrying out the main biocuration tasks and to provide labelled data for the development of text mining systems. MyMiner allows easy classification and labelling of textual data according to user-specified classes as well as predefined biological entities. The usefulness and efficiency of this application have been tested for a range of real-life annotation scenarios of various research topics.

Availability: http://myminer.armi.monash.edu.au.

Contacts: david.salgado@monash.edu and christophe.marcelle@monash.edu

Supplementary Information: Supplementary data are available at Bioinformatics online.

A useful tool and good tutorial materials.

I could easily see something similar for CS research (unless such already exists).
