Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 18, 2015

UK Bioinformatics Landscape

Filed under: Bioinformatics,Topic Maps — Patrick Durusau @ 4:16 pm

UK Bioinformatics Landscape

Two of the four known challenges in the UK bioinformatics landscape could be addressed by topic maps:

  • Data integration and management of omics data, enabling the use of “big data” across thematic areas and to facilitate data sharing;
  • Open innovation, or pre-competitive approaches in particular to data sharing and data standardisation

I say they could be addressed by topic maps, but I’m not sure what else you would use to address data integration issues robustly. If you don’t mind paying to migrate data whenever terminology changes enough to impair your effectiveness, and then paying again for every future migration, I suppose that is one solution.

Given the choice, I suspect many people would like to exit the wheel of ETL.

March 10, 2015

NIH-led effort launches Big Data portal for Alzheimer’s drug discovery

Filed under: BigData,Bioinformatics,Medical Informatics,Open Science — Patrick Durusau @ 6:23 pm

NIH-led effort launches Big Data portal for Alzheimer’s drug discovery

From the post:

A National Institutes of Health-led public-private partnership to transform and accelerate drug development achieved a significant milestone today with the launch of a new Alzheimer’s Big Data portal — including delivery of the first wave of data — for use by the research community. The new data sharing and analysis resource is part of the Accelerating Medicines Partnership (AMP), an unprecedented venture bringing together NIH, the U.S. Food and Drug Administration, industry and academic scientists from a variety of disciplines to translate knowledge faster and more successfully into new therapies.

The opening of the AMP-AD Knowledge Portal and release of the first wave of data will enable sharing and analyses of large and complex biomedical datasets. Researchers believe this approach will ramp up the development of predictive models of Alzheimer’s disease and enable the selection of novel targets that drive the changes in molecular networks leading to the clinical signs and symptoms of the disease.

“We are determined to reduce the cost and time it takes to discover viable therapeutic targets and bring new diagnostics and effective therapies to people with Alzheimer’s. That demands a new way of doing business,” said NIH Director Francis S. Collins, M.D., Ph.D. “The AD initiative of AMP is one way we can revolutionize Alzheimer’s research and drug development by applying the principles of open science to the use and analysis of large and complex human data sets.”

Developed by Sage Bionetworks, a Seattle-based non-profit organization promoting open science, the portal will house several waves of Big Data to be generated over the five years of the AMP-AD Target Discovery and Preclinical Validation Project by multidisciplinary academic groups. The academic teams, in collaboration with Sage Bionetworks data scientists and industry bioinformatics and drug discovery experts, will work collectively to apply cutting-edge analytical approaches to integrate molecular and clinical data from over 2,000 postmortem brain samples.

Big data and open science, now that sounds like a winning combination:

Because no publication embargo is imposed on the use of the data once they are posted to the AMP-AD Knowledge Portal, it increases the transparency, reproducibility and translatability of basic research discoveries, according to Suzana Petanceska, Ph.D., NIA’s program director leading the AMP-AD Target Discovery Project.

“The era of Big Data and open science can be a game-changer in our ability to choose therapeutic targets for Alzheimer’s that may lead to effective therapies tailored to diverse patients,” Petanceska said. “Simply stated, we can work more effectively together than separately.”

Imagine that, academics who aren’t hoarding data for recruitment purposes.

Works for me!

Does it work for you?

February 10, 2015

Biological Dirty Data

Filed under: Bioinformatics — Patrick Durusau @ 10:46 am

Reagent and laboratory contamination can critically impact sequence-based microbiome analyses by Susannah J Salter, et al. (BMC Biology 2014, 12:87 doi:10.1186/s12915-014-0087-z)

Abstract:

Background

The study of microbial communities has been revolutionised in recent years by the widespread adoption of culture independent analytical techniques such as 16S rRNA gene sequencing and metagenomics. One potential confounder of these sequence-based approaches is the presence of contamination in DNA extraction kits and other laboratory reagents.

Results

In this study we demonstrate that contaminating DNA is ubiquitous in commonly used DNA extraction kits and other laboratory reagents, varies greatly in composition between different kits and kit batches, and that this contamination critically impacts results obtained from samples containing a low microbial biomass. Contamination impacts both PCR-based 16S rRNA gene surveys and shotgun metagenomics. We provide an extensive list of potential contaminating genera, and guidelines on how to mitigate the effects of contamination.

Conclusions

These results suggest that caution should be advised when applying sequence-based techniques to the study of microbiota present in low biomass environments. Concurrent sequencing of negative control samples is strongly advised.

I first saw this in a tweet by Nick Loman, asking

what’s the point in publishing stuff like biomedcentral.com/1741-7007/12/87 if no-ones gonna act on it

I’m not sure what Nick’s criterion is for “no-ones gonna act on it,” but perhaps softly saying that results could be improved with better control for contamination isn’t a stark enough statement of the issue. Try:

Reagent and Laboratory Contamination – Garbage In, Garbage Out

Uncontrolled and/or unaccounted-for contamination is certainly garbage in, and results that contain such contamination fit my notion of garbage out.

Phrasing it as a choice between producing garbage and producing quality research frames the issue in a way that creates an impetus for change. Yes?

February 9, 2015

Newly Discovered Networks among Different Diseases…

Filed under: Bioinformatics,Medical Informatics,Networks — Patrick Durusau @ 4:32 pm

Newly Discovered Networks among Different Diseases Reveal Hidden Connections by Veronique Greenwood and Quanta Magazine.

From the post:

Stefan Thurner is a physicist, not a biologist. But not long ago, the Austrian national health insurance clearinghouse asked Thurner and his colleagues at the Medical University of Vienna to examine some data for them. The data, it turned out, were the anonymized medical claims records—every diagnosis made, every treatment given—of most of the nation, which numbers some 8 million people. The question was whether the same standard of care could continue if, as had recently happened in Greece, a third of the funding evaporated. But Thurner thought there were other, deeper questions that the data could answer as well.

In a recent paper in the New Journal of Physics, Thurner and his colleagues Peter Klimek and Anna Chmiel started by looking at the prevalence of 1,055 diseases in the overall population. They ran statistical analyses to uncover the risk of having two diseases together, identifying pairs of diseases for which the percentage of people who had both was higher than would be expected if the diseases were uncorrelated—in other words, a patient who had one disease was more likely than the average person to have the other. They applied statistical corrections to reduce the risk of drawing false connections between very rare and very common diseases, as any errors in diagnosis will get magnified in such an analysis. Finally, the team displayed their results as a network in which the diseases are nodes that connect to one another when they tend to occur together.

The style of analysis has uncovered some unexpected links. In another paper, published on the scientific preprint site arxiv.org, Thurner’s team confirmed a controversial connection between diabetes and Parkinson’s disease, as well as unique patterns in the timing of when diabetics develop high blood pressure. The paper in the New Journal of Physics generated additional connections that they hope to investigate further.
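The core of the co-occurrence test is simple enough to sketch. Below is a minimal, hypothetical Python version of the idea: count how often each pair of diseases appears in the same patient, compare that to the count expected if the diseases were independent, and keep the pairs that stand out. The toy data and the crude threshold are mine, not the statistical corrections used by Klimek, Chmiel and Thurner.

```python
from itertools import combinations
from collections import Counter

# Toy claims data (hypothetical): patient id -> set of diagnosed diseases.
patients = {
    "p1": {"diabetes", "hypertension"},
    "p2": {"diabetes", "hypertension", "parkinsons"},
    "p3": {"hypertension"},
    "p4": {"diabetes", "parkinsons"},
    "p5": {"asthma"},
}

n = len(patients)
prevalence = Counter(d for ds in patients.values() for d in ds)
pair_counts = Counter(
    pair for ds in patients.values() for pair in combinations(sorted(ds), 2)
)

# Keep a disease pair when its observed co-occurrence exceeds the count expected
# under independence (prevalence_a * prevalence_b / n) by a crude margin.
# A real analysis needs proper tests plus corrections for rare/common diseases.
edges = []
for (a, b), observed in pair_counts.items():
    expected = prevalence[a] * prevalence[b] / n
    if observed > 1.5 * expected:
        edges.append((a, b, observed, expected))

for a, b, obs, exp in edges:
    print(f"{a} -- {b}: observed {obs}, expected {exp:.2f}")
```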

Every medical claim for almost eight (8) million people would make a very dense graph. Yes?

When you look at the original papers, notice that the researchers did not create a graph that held all their data. In the New Journal of Physics paper, only the diseases appear, to demonstrate their clustering; the patients do not appear at all. In the arxiv.org paper, another means is used to demonstrate the risk of specific diseases occurring with the two types (DM1, DM2) of diabetes.

I think the lesson here is that even when data is “network” data, that fact alone doesn’t determine how the data should be presented or analyzed.

The National Centre for Biotechnology Information (NCBI) is part…

Filed under: Bioinformatics,DOI,R — Patrick Durusau @ 4:05 pm

The National Centre for Biotechnology Information (NCBI) is part…

The National Centre for Biotechnology Information (NCBI) is part of the National Institutes of Health’s National Library of Medicine, and most well-known for hosting Pubmed, the go-to search engine for biomedical literature – every (Medline-indexed) publication goes up there.

On a separate but related note, one thing I’m constantly looking to do is get DOIs for papers on demand. Most recently I found a package for R, knitcitations that generates bibliographies automatically from DOIs in the text, which worked quite well for a 10 page literature review chock full of references (I’m a little allergic to Mendeley and other clunky reference managers).

The “Digital Object Identifier”, as the name suggests, uniquely identifies a research paper (and recently it’s being co-opted to reference associated datasets). There’re lots of interesting and troublesome exceptions which I’ve mentioned previously, but in the vast majority of cases any paper published in at least the last 10 years or so will have one.

Although NCBI Pubmed does a great job of cataloguing biomedical literature, another site, doi.org provides a consistent gateway to the original source of the paper. You only need to append the DOI to “dx.doi.org/” to generate a working redirection link.

Last week the NCBI posted a webinar detailing the inner workings of Entrez Direct, the command line interface for Unix computers (GNU/Linux, and Macs; Windows users can fake it with Cygwin). It revolves around a custom XML parser written in Perl (typical for bioinformaticians) encoding subtle ‘switches’ to tailor the output just as you would from the web service (albeit with a fair portion more of the inner workings on show).

I’ve pieced together a basic pipeline, which has a function to generate citations for knitcitations from files listing basic bibliographic information, and in the final piece of the puzzle now have a custom function (or several) that does its best to find a single unique article matching the author, publication year, and title of a paper systematically, to find DOIs for entries in such a table.

BTW, the correct Github Gist link is: https://gist.github.com/lmmx/3c9406c4ec2c42b82158

The link in:

The scripts below are available here, I’ll update them on the GitHub Gist if I make amendments.

is broken.

A clever utility, although I am more in need of one for published CS literature. 😉

The copy to clipboard feature would be perfect for pasting into blog posts.
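If you want to replicate the DOI-matching step without the original scripts, here is a minimal sketch using the public CrossRef works API to find the best match on author, year and title, then building the dx.doi.org link mentioned above. The query.bibliographic parameter is my reading of CrossRef’s documented API, not part of the post’s own pipeline, so treat it as an assumption to verify.

```python
import requests

def find_doi(author, year, title):
    """Best-effort DOI lookup for a single paper via the CrossRef works API."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": f"{author} {year} {title}", "rows": 1},
        timeout=30,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    return items[0]["DOI"] if items else None

doi = find_doi("Weber", "1997", "Human whole-genome shotgun sequencing")
if doi:
    # dx.doi.org resolves the DOI to the publisher's copy, as noted in the post.
    print("https://dx.doi.org/" + doi)
```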

February 4, 2015

ChEMBL 20 incorporates the Pistoia Alliance’s HELM annotation

Filed under: Bioinformatics,Chemistry,Medical Informatics — Patrick Durusau @ 4:35 pm

ChEMBL 20 incorporates the Pistoia Alliance’s HELM annotation by Richard Holland.

From the post:

The European Bioinformatics Institute (EMBL-EBI) has released version 20 of ChEMBL, the database of compound bioactivity data and drug targets. ChEMBL now incorporates the Hierarchical Editing Language for Macromolecules (HELM), the macromolecular representation standard recently released by the Pistoia Alliance.

HELM can be used to represent simple macromolecules (e.g. oligonucleotides, peptides and antibodies), complex entities (e.g. those with unnatural amino acids) or conjugated species (e.g. antibody-drug conjugates). Including the HELM notation for ChEMBL’s peptide-derived drugs and compounds will, in future, enable researchers to query that content in new ways, for example in sequence- and chemistry-based searches.

Initially created at Pfizer, HELM was released as an open standard with an accompanying toolkit through a Pistoia Alliance initiative, funded and supported by its member organisations. EMBL-EBI joins the growing list of HELM adopters and contributors, which include Biovia, ACD Labs, Arxspan, Biomax, BMS, ChemAxon, eMolecules, GSK, Lundbeck, Merck, NextMove, Novartis, Pfizer, Roche, and Scilligence. All of these organisations have either built HELM-based infrastructure, enabled HELM import/export in their tools, initiated projects for the incorporation of HELM into their workflows, published content in HELM format, or supplied funding or in-kind contributions to the HELM project.

More details:

The European Bioinformatics Institute

HELM project (open source, download, improve)

Pistoia Alliance

Another set of subjects ripe for annotation with topic maps!

January 20, 2015

Databases of Biological Databases (yes, plural)

Filed under: Bioinformatics,Biology,Medical Informatics — Patrick Durusau @ 4:15 pm

Mick Watson points out in a tweet today that there are at least two databases of biological databases.

Metabase

MetaBase is a user-contributed list of all the biological databases available on the internet. Currently there are 1,802 entries, each describing a different database. The databases are described in a semi-structured way by using templates and entries can carry various user comments and annotations (see a random entry). Entries can be searched, listed or browsed by category.

The site uses the same MediaWiki technology that powers Wikipedia, probably the best known user-contributed resource on the internet. The Mediawiki system allows users to participate on many different levels, ranging from authors and editors to curators and designers.

Database description

MetaBase aims to be a flexible, user-driven (user-created) resource for the biological database community.

The main focus of MetaBase is summarised below:

  • As a basic requirement, MB contains a list of databases, URLs and descriptions of the most commonly used biological databases currently available on the internet.
  • The system should be flexible, allowing users to contribute, update and maintain the data in different ways.
  • In the future we aim to generate more communication between the database developer and user communities.

A larger, more ambitious list of aims is given here.

The first point was achieved using data taken from the Molecular Biology Database Collection. Secondly, MetaBase has been implemented using MediaWiki. The final point will take longer, and is dependent on the community uptake of MB…

DBD – Database of Biological Databases

The DBD (Database of Biological Databases) team is R.R. Siva Kiran and MVN Setty, Department of Biotechnology, MS Ramaiah Institute of Technology, MSR Nagar, Bangalore, India, and G. Hanumantha Rao, Center for Biotechnology, Department of Chemical Engineering, Andhra University, Visakhapatnam-530003, India. DBD consists of 1,200 database entries covering a wide range of databases useful for biological researchers.

Be aware that the DBD database reports its last update as 30-July-2008. I have written to confirm whether that is the correct date.

Assuming it is, has anyone validated the links in the DBD database and/or compared them to the links in MetaBase? That seems like a worthwhile service to the community.
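The link check itself is a small script. A minimal sketch (the URLs are placeholders, and some servers refuse HEAD requests, so a production run would fall back to GET):

```python
import requests

# Hypothetical sample of database URLs pulled from DBD/MetaBase entries.
urls = [
    "https://www.ncbi.nlm.nih.gov/",
    "https://www.example.org/defunct-biodb/",
]

for url in urls:
    try:
        # Some servers reject HEAD; a fuller script would retry with GET.
        status = requests.head(url, allow_redirects=True, timeout=15).status_code
        print(f"{status}\t{url}")
    except requests.RequestException as err:
        print(f"DEAD\t{url}\t{err}")
```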

January 14, 2015

ExAC Browser (Beta) | Exome Aggregation Consortium

Filed under: Bioinformatics,Genome,Genomics — Patrick Durusau @ 4:38 pm

ExAC Browser (Beta) | Exome Aggregation Consortium

From the webpage:

The Exome Aggregation Consortium (ExAC) is a coalition of investigators seeking to aggregate and harmonize exome sequencing data from a wide variety of large-scale sequencing projects, and to make summary data available for the wider scientific community.

The data set provided on this website spans 61,486 unrelated individuals sequenced as part of various disease-specific and population genetic studies. The ExAC Principal Investigators and groups that have contributed data to the current release are listed here.

All data here are released under a Fort Lauderdale Agreement for the benefit of the wider biomedical community – see the terms of use here.

Sign up for our mailing list for future release announcements here.

“Big data” is so much more than “likes,” “clicks,” “visits,” “views,” etc.

I first saw this in a tweet by Mark McCarthy.

December 29, 2014

Bioclojure: a functional library for the manipulation of biological sequences

Filed under: Bioinformatics,Clojure,Functional Programming — Patrick Durusau @ 12:05 pm

Bioclojure: a functional library for the manipulation of biological sequences by Jordan Plieskatt, Gabriel Rinaldi, Paul J. Brindley, Xinying Jia, Jeremy Potriquet, Jeffrey Bethony, and Jason Mulvenna.

Abstract:

Motivation: BioClojure is an open-source library for the manipulation of biological sequence data written in the language Clojure. BioClojure aims to provide a functional framework for the processing of biological sequence data that provides simple mechanisms for concurrency and lazy evaluation of large datasets.

Results: BioClojure provides parsers and accessors for a range of biological sequence formats, including UniProtXML, Genbank XML, FASTA and FASTQ. In addition, it provides wrappers for key analysis programs, including BLAST, SignalP, TMHMM and InterProScan, and parsers for analyzing their output. All interfaces leverage Clojure’s functional style and emphasize laziness and composability, so that BioClojure, and user-defined, functions can be chained into simple pipelines that are thread-safe and seamlessly integrate lazy evaluation.

Availability and implementation: BioClojure is distributed under the Lesser GPL, and the source code is freely available from GitHub (https://github.com/s312569/clj-biosequence).

Contact: jason.mulvenna@qimberghofer.edu.au or jason.mulvenna@qimr.edu.au

The introduction to this article is a great cut-n-paste “case for Clojure in bioinformatics.”

Functional programming is a programming style that treats computation as the evaluation of mathematical functions (Hudak, 1989). In its purest form, functional programming removes the need for variable assignment by using immutable data structures that eliminate the use of state and side effects (Backus, 1978). This ensures that functions will always return the same value given the same input. This greatly simplifies debugging and testing, as individual functions can be assessed in isolation regardless of a global state. Immutability also greatly simplifies concurrency and facilitates leveraging of multi-core computing facilities with little or no modifications to functionally written code. Accordingly, as a programming style, functional programming offers advantages for software development, including (i) brevity, (ii) simple handling of concurrency and (iii) seamless integration of lazy evaluation, simplifying the handling of large datasets. Clojure is a Lisp variant that encourages a functional style of programming by providing immutable data structures, functions as first-class objects and uses recursive iteration as opposed to state-based looping (Hickey, 2008). Clojure is built on the Java virtual machine (JVM), and thus, applications developed using BioClojure can be compiled into Java byte code and ran on any platform that runs the JVM. Moreover, libraries constructed using Clojure can be called in Java programs and, conversely, Java classes and methods can be called from Clojure programs, making available a large number of third-party Java libraries. BioClojure aims to leverage the tools provided by Clojure to provide a functional interface with biological sequence data and associated programs. BioClojure is similar in intent to other bioinformatics packages such as BioPerl (Stajich et al., 2002), BioPython (Cock et al., 2009), Bio++ (Dutheil et al., 2006) and BioJava (Prlić et al., 2012) but differs from these bioinformatics software libraries in its embrace of the functional style. With the decreasing cost of biological analyses, for example, next-generation sequencing, biologists are dealing with greater amounts of data, and BioClojure is an attempt to provide tools, emphasizing concurrency and lazy evaluation, for manipulating these data.
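The laziness and composability the authors describe are easier to see in code than in prose. BioClojure does this in Clojure; purely as a language-neutral illustration, here is the same idea sketched with Python generators, which also evaluate lazily and compose into pipelines without holding an entire sequence file in memory. The tiny FASTA reader and filter below are my own toy, not BioClojure’s API.

```python
from typing import Iterator, Tuple

def read_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Lazily yield (header, sequence) pairs; nothing is read until consumed."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
        if header is not None:
            yield header, "".join(chunks)

def longer_than(records, min_len):
    """Composable, lazy filter step."""
    return ((h, s) for h, s in records if len(s) >= min_len)

# Steps chain like functions; evaluation happens only as results are consumed.
# "proteins.fasta" is a hypothetical input file.
# for header, seq in longer_than(read_fasta("proteins.fasta"), 300):
#     print(header, len(seq))
```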

I like the introduction as a form of evangelism, but using Clojure and BioClojure in bioinformatics to demonstrate their advantages is the better form of promotion.

Evangelism works best when results are untestable, not so well when results can be counted and measured.

December 19, 2014

BigDataScript: a scripting language for data pipelines

Filed under: Bioinformatics,Data Pipelines — Patrick Durusau @ 8:34 pm

BigDataScript: a scripting language for data pipelines by Pablo Cingolani, Rob Sladek, and Mathieu Blanchette.

Abstract:

Motivation: The analysis of large biological datasets often requires complex processing pipelines that run for a long time on large computational infrastructures. We designed and implemented a simple script-like programming language with a clean and minimalist syntax to develop and manage pipeline execution and provide robustness to various types of software and hardware failures as well as portability.

Results: We introduce the BigDataScript (BDS) programming language for data processing pipelines, which improves abstraction from hardware resources and assists with robustness. Hardware abstraction allows BDS pipelines to run without modification on a wide range of computer architectures, from a small laptop to multi-core servers, server farms, clusters and clouds. BDS achieves robustness by incorporating the concepts of absolute serialization and lazy processing, thus allowing pipelines to recover from errors. By abstracting pipeline concepts at programming language level, BDS simplifies implementation, execution and management of complex bioinformatics pipelines, resulting in reduced development and debugging cycles as well as cleaner code.

Availability and implementation: BigDataScript is available under open-source license at http://pcingola.github.io/BigDataScript.
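The “lazy processing” in the abstract, meaning a restarted pipeline redoes only the work whose results are missing, is worth making concrete. Here is a minimal, BDS-free sketch of that one idea in Python, with invented task names; BDS builds this, plus serialization and hardware abstraction, into the language itself.

```python
import os
import subprocess

def run_step(cmd, output):
    """Run a pipeline step only if its output is missing (the lazy part)."""
    if os.path.exists(output):
        print(f"skip: {output} already exists")
        return
    subprocess.run(cmd, check=True)  # raise on failure so the pipeline stops here

# Hypothetical three-step pipeline; rerunning it after a crash redoes only
# the steps whose outputs are missing.
run_step(["align_reads", "sample.fastq", "-o", "sample.bam"], "sample.bam")
run_step(["call_variants", "sample.bam", "-o", "sample.vcf"], "sample.vcf")
run_step(["annotate", "sample.vcf", "-o", "sample.annotated.vcf"], "sample.annotated.vcf")
```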

How would you compare this pipeline proposal to: XProc 2.0: An XML Pipeline Language?

I prefer XML solutions because I can reliably point to an element or attribute to endow it with explicit semantics.

While explicit semantics is my hobby horse, it may not be yours. Curious how you view this specialized language for bioinformatics pipelines?

I first saw this in a tweet by Pierre Lindenbaum.

December 17, 2014

History & Philosophy of Computational and Genome Biology

Filed under: Bioinformatics,Biology,Genome — Patrick Durusau @ 8:35 pm

History & Philosophy of Computational and Genome Biology by Mark Boguski.

A nice collection of books and articles on computational and genome biology. It concludes with this anecdote:

Despite all of the recent books and biographies that have come out about the Human Genome Project, I think there are still many good stories to be told. One of them is the origin of the idea for whole-genome shotgun and assembly. I recall a GRRC (Genome Research Review Committee) review that took place in late 1996 or early 1997 where Jim Weber proposed a whole-genome shotgun approach. The review panel, at first, wanted to unceremoniously “NeRF” (Not Recommend for Funding) the grant but I convinced them that it deserved to be formally reviewed and scored, based on Jim’s pioneering reputation in the area of genetic polymorphism mapping and its impact on the positional cloning of human disease genes and the origins of whole-genome genotyping. After due deliberation, the GRRC gave the Weber application a non-fundable score (around 350 as I recall) largely on the basis of Weber’s inability to demonstrate that the “shotgun” data could be assembled effectively.

Some time later, I was giving a ride to Jim Weber who was in Bethesda for a meeting. He told me why his grant got a low score and asked me if I knew any computer scientists that could help him address the assembly problem. I suggested he talk with Gene Myers (I knew Gene and his interests well since, as one of the five authors of the BLAST algorithm, he was a not infrequent visitor to NCBI).

The following May, Weber and Myers submitted a “perspective” for publication in Genome Research entitled “Human whole-genome shotgun sequencing”. This article described computer simulations which showed that assembly was possible and was essentially a rebuttal to the negative review and low priority score that came out of the GRRC. The editors of Genome Research (including me at the time) sent the Weber/Myers article to Phil Green (a well-known critic of shotgun sequencing) for review. Phil’s review was extremely detailed and actually longer than the Weber/Myers paper itself! The editors convinced Phil to allow us to publish his critique entitled “Against a whole-genome shotgun” as a point-counterpoint feature alongside the Weber-Myers article in the journal.

The rest, as they say, is history, because only a short time later, Craig Venter (whose office at TIGR had requested FAX copies of both the point and counterpoint as soon as they were published) and Mike Hunkapiller announced their shotgun sequencing and assembly project and formed Celera. They hired Gene Myers to build the computational capabilities and assemble their shotgun data which was first applied to the Drosophila genome as practice for tackling a human genome which, as is now known, was Venter’s own. Three of my graduate students (Peter Kuehl, Jiong Zhang and Oxana Pickeral) and I participated in the Drosophila annotation “jamboree” (organized by Mark Adams of Celera and Gerry Rubin) working specifically on an analysis of the counterparts of human disease genes in the Drosophila genome. Other aspects of the Jamboree are described in a short book by one of the other participants, Michael Ashburner.

The same types of stories exist not only from the early days of computer science but from every year since. Stories that will capture the imaginations of potential CS majors as well as illuminate areas where computer science can or can’t be useful.

How many of those stories have you captured?

I first saw this in a tweet by Neil Saunders.

December 7, 2014

Overlap and the Tree of Life

Filed under: Bioinformatics,Biology,XML — Patrick Durusau @ 9:43 am

I encountered a wonderful example of “overlap” in the markup sense today while reading about resolving conflicts in constructing a comprehensive tree of life.

[Figure: overlap and the tree of life]

The authors use a graph database which allows them to study various hypotheses on the resolutions of conflicts.

Their graph database, opentree-treemachine, is available on GitHub, https://github.com/OpenTreeOfLife/treemachine, as is the source to all the project’s software, https://github.com/OpenTreeOfLife.

There’s a thought for Balisage 2015. Is the processing of overlapping markup a question of storing documents with overlapping markup in graph databases and then streaming the non-overlapping results of a query to an XML processor?

And visualizing overlapping results or alternative resolutions to overlapping results via a graph database.

The question of which overlapping syntax to use then becomes a matter of convenience and of the amount of information captured, as opposed to attempts to fashion syntaxes that cheat XML processors and/or to develop new means for processing XML.

Perhaps graph databases can make overlapping markup the default case for document processing, just as overlap is the default case in documents themselves (single-tree documents being rare outliers).
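To make that concrete, here is a toy sketch, with plain Python dictionaries standing in for the graph database, of two hierarchies sharing the same text nodes; querying either one streams out a single-tree fragment an XML processor can digest.

```python
# Shared leaf text nodes.
text = {"t1": "FIRST WITCH ", "t2": "When shall we three ", "t3": "meet again?"}

# Two independent hierarchies over the same leaves; the overlap lives in the
# graph, not in any single serialized tree.
hierarchies = {
    "speeches": {"speech-1": ["t1", "t2", "t3"]},
    "lines":    {"line-1": ["t1", "t2"], "line-2": ["t3"]},
}

def as_xml(hierarchy: str) -> str:
    """Stream one hierarchy as a well-formed, single-tree XML fragment."""
    parts = [f"<{hierarchy}>"]
    for node, leaves in hierarchies[hierarchy].items():
        content = "".join(text[t] for t in leaves)
        parts.append(f'  <node id="{node}">{content}</node>')
    parts.append(f"</{hierarchy}>")
    return "\n".join(parts)

print(as_xml("speeches"))
print(as_xml("lines"))
```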

Remind me to send a note to Michael Sperberg-McQueen and friends about this idea.

BTW, the details of the article that led me down this path:

Synthesis of phylogeny and taxonomy into a comprehensive tree of life by Steven A. Smith, et al.

Abstract:

Reconstructing the phylogenetic relationships that unite all biological lineages (the tree of life) is a grand challenge of biology. However, the paucity of readily available homologous character data across disparately related lineages renders direct phylogenetic inference currently untenable. Our best recourse towards realizing the tree of life is therefore the synthesis of existing collective phylogenetic knowledge available from the wealth of published primary phylogenetic hypotheses, together with taxonomic hierarchy information for unsampled taxa. We combined phylogenetic and taxonomic data to produce a draft tree of life—the Open Tree of Life—containing 2.3 million tips. Realization of this draft tree required the assembly of two resources that should prove valuable to the community: 1) a novel comprehensive global reference taxonomy, and 2) a database of published phylogenetic trees mapped to this common taxonomy. Our open source framework facilitates community comment and contribution, enabling a continuously updatable tree when new phylogenetic and taxonomic data become digitally available. While data coverage and phylogenetic conflict across the Open Tree of Life illuminates significant gaps in both the underlying data available for phylogenetic reconstruction and the publication of trees as digital objects, the tree provides a compelling starting point from which we can continue to improve through community contributions. Having a comprehensive tree of life will fuel fundamental research on the nature of biological diversity, ultimately providing up-to-date phylogenies for downstream applications in comparative biology, ecology, conservation biology, climate change studies, agriculture, and genomics.

A project with a great deal of significance beyond my interest in overlap in markup documents. Highly recommended reading. The resolution of conflicts in trees here involves an evaluation of data, much as you would for merging in a topic map.

Unlike the authors, I see no difficulty in super trees being rich enough in underlying data to permit direct use of trees for resolution of conflicts. But you would have to design the trees from the start with those capabilities, or have topic map-like merging capabilities, so you are not limited by early and necessarily preliminary data design decisions.

Enjoy!

I first saw this in a tweet by Ross Mounce.

November 29, 2014

VSEARCH

Filed under: Bioinformatics,Clustering,Merging,Topic Maps — Patrick Durusau @ 4:15 pm

VSEARCH: Open and free 64-bit multithreaded tool for processing metagenomic sequences, including searching, clustering, chimera detection, dereplication, sorting, masking and shuffling

From the webpage:

The aim of this project is to create an alternative to the USEARCH tool developed by Robert C. Edgar (2010). The new tool should:

  • have open source code with an appropriate open source license
  • be free of charge, gratis
  • have a 64-bit design that handles very large databases and much more than 4GB of memory
  • be as accurate or more accurate than usearch
  • be as fast or faster than usearch

We have implemented a tool called VSEARCH which supports searching, clustering, chimera detection, dereplication, sorting and masking (commands --usearch_global, --cluster_smallmem, --cluster_fast, --uchime_ref, --uchime_denovo, --derep_fulllength, --sortbysize, --sortbylength and --maskfasta, as well as almost all their options).

VSEARCH stands for vectorized search, as the tool takes advantage of parallelism in the form of SIMD vectorization as well as multiple threads to perform accurate alignments at high speed. VSEARCH uses an optimal global aligner (full dynamic programming Needleman-Wunsch), in contrast to USEARCH which by default uses a heuristic seed and extend aligner. This results in more accurate alignments and overall improved sensitivity (recall) with VSEARCH, especially for alignments with gaps.

The same option names as in USEARCH version 7 have been used in order to make VSEARCH an almost drop-in replacement.
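Since the full dynamic-programming aligner is the heart of the accuracy claim above, a bare-bones Needleman-Wunsch may be useful for readers who have not seen one. The scoring values are arbitrary and this is the textbook recurrence in plain Python, not VSEARCH’s vectorized implementation.

```python
def needleman_wunsch(a: str, b: str, match=1, mismatch=-1, gap=-2) -> int:
    """Return the optimal global alignment score of a and b (score only)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]

print(needleman_wunsch("GATTACA", "GCATGCU"))  # every cell is filled: no seeds, no heuristics
```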

The reconciliation of differing characteristics is the only way that merging in topic maps differs from the clustering found in bioinformatics programs like VSEARCH. In both cases the result is a cluster of items deemed “similar” on some basis, which in topic maps is then subject to further processing.

Scaling isn’t easy in bioinformatics but it hasn’t been found daunting either.

There is much to be learned from projects such as VSEARCH to inform the processing of topic maps.

I first saw this in a tweet by Torbjørn Rognes.

November 17, 2014

Programming in the Life Sciences

Filed under: Bioinformatics,Life Sciences,Medical Informatics,Programming,Science — Patrick Durusau @ 4:06 pm

Programming in the Life Sciences by Egon Willighagen.

From the first post in this series, Programming in the Life Sciences #1: a six day course (October, 2013):

Our department will soon start the course Programming in the Life Sciences for a group of some 10 students from the Maastricht Science Programme. This is the first time we give this course, and over the next weeks I will be blogging about this course. First, some information. These are the goals, to use programming to:

  • have the ability to recognize various classes of chemical entities in pharmacology and to understand the basic physical and chemical interactions.
  • be familiar with technologies for web services in the life sciences.
  • obtain experience in using such web services with a programming language.
  • be able to select web services for a particular pharmacological question.
  • have sufficient background for further, more advanced, bioinformatics data analyses.

So, this course will be a mix of things. I will likely start with a lecture or two about scientific programming, such as the importance of reproducibility, licensing, documentation, and (unit) testing. To achieve these learning goals we have set a problem. The description is:


    In the life sciences the interactions between chemical entities are of key interest. Not only do these play an important role in the regulation of gene expression, and therefore all cellular processes, they are also one of the primary approaches in drug discovery. Pharmacology is the science that studies the action of drugs, and for many common drugs, this means studying the interaction of small organic molecules and protein targets.
    And with the increasing information in the life sciences, automation becomes increasingly important. Big data and small data alike, provide challenges to integrate data from different experiments. The Open PHACTS platform provides web services to support pharmacological research and in this course you will learn how to use such web services from programming languages, allowing you to link data from such knowledge bases to other platforms, such as those for data analysis.

So, it becomes pretty clear what the students will be doing. They only have six days, so it won’t be much. It’s just to learn them the basic skills. The students are in their 3rd year at the university, and because of the nature of the programme they follow, a mixed background in biology, mathematics, chemistry, and physics. So, I have a good hope they will surprise me in what they will get done.

Pharmacology is the basic topic: drug-protein interaction, but the students are free to select a research question. In fact, I will not care that much what they like to study, as long as they do it properly. They will start with Open PHACTS’ Linked Data API, but here too, they are free to complement data from the OPS cache with additional information. I hope they do.

Now, regarding the technology they will use. The default will be JavaScript, and in the next week I will hack up demo code showing the integration of ops.js and d3.js. Let’s see how hard it will be; it’s new to me too. But, if the students already are familiar with another programming language and prefer to use that, I won’t stop them.

(For the Dutch readers, would #mscpils be a good tag?)

Egon’s blogging has carried on through quite a few of those “next weeks,” and the life sciences, to say nothing of his readers, are all the better for it! His most recent post is titled: Programming in the Life Sciences #20: extracting data from JSON.

Definitely a series to catch or to pass along for anyone involved in life sciences.

Enjoy!

November 14, 2014

Exemplar Public Health Datasets

Filed under: Bioinformatics,Dataset,Medical Informatics,Public Data — Patrick Durusau @ 2:45 pm

Exemplar Public Health Datasets, edited by Jonathan Tedds.

From the post:

This special collection contains papers describing exemplar public health datasets published as part of the Enhancing Discoverability of Public Health and Epidemiology Research Data project commissioned by the Wellcome Trust and the Public Health Research Data Forum.

The publication of the datasets included in this collection is intended to promote faster progress in improving health, better value for money and higher quality science, in accordance with the joint statement made by the forum members in January 2011.

Submission to this collection is by invitation only, and papers have been peer reviewed. The article template and instructions for submission are available here.

Data for analysis as well as examples of best practices for public health datasets.

Enjoy!

I first saw this in a tweet by Christophe Lalanne.

November 11, 2014

Computational drug repositioning through heterogeneous network clustering

Filed under: Bioinformatics,Biomedical,Drug Discovery — Patrick Durusau @ 3:49 pm

Computational drug repositioning through heterogeneous network clustering by Wu C, Gudivada RC, Aronow BJ, Jegga AG. (BMC Syst Biol. 2013;7 Suppl 5:S6. doi: 10.1186/1752-0509-7-S5-S6. Epub 2013 Dec 9.)

Abstract:

BACKGROUND:

Given the costly and time consuming process and high attrition rates in drug discovery and development, drug repositioning or drug repurposing is considered as a viable strategy both to replenish the drying out drug pipelines and to surmount the innovation gap. Although there is a growing recognition that mechanistic relationships from molecular to systems level should be integrated into drug discovery paradigms, relatively few studies have integrated information about heterogeneous networks into computational drug-repositioning candidate discovery platforms.

RESULTS:

Using known disease-gene and drug-target relationships from the KEGG database, we built a weighted disease and drug heterogeneous network. The nodes represent drugs or diseases while the edges represent shared gene, biological process, pathway, phenotype or a combination of these features. We clustered this weighted network to identify modules and then assembled all possible drug-disease pairs (putative drug repositioning candidates) from these modules. We validated our predictions by testing their robustness and evaluated them by their overlap with drug indications that were either reported in published literature or investigated in clinical trials.

CONCLUSIONS:

Previous computational approaches for drug repositioning focused either on drug-drug and disease-disease similarity approaches whereas we have taken a more holistic approach by considering drug-disease relationships also. Further, we considered not only gene but also other features to build the disease drug networks. Despite the relative simplicity of our approach, based on the robustness analyses and the overlap of some of our predictions with drug indications that are under investigation, we believe our approach could complement the current computational approaches for drug repositioning candidate discovery.
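As a rough illustration of the workflow in the Results section (build a weighted drug/disease network from shared features, cluster it, then read candidate drug-disease pairs out of each module), here is a small sketch using networkx. The toy feature sets and the choice of greedy modularity clustering are mine, not the paper’s KEGG-derived data or method.

```python
from itertools import combinations
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical shared-feature sets (genes/pathways/phenotypes) per drug or disease.
features = {
    "drug:metformin":  {"AMPK", "gluconeogenesis"},
    "drug:aspirin":    {"PTGS2", "inflammation"},
    "disease:T2DM":    {"AMPK", "gluconeogenesis", "insulin"},
    "disease:stroke":  {"PTGS2", "inflammation"},
}

G = nx.Graph()
for a, b in combinations(features, 2):
    shared = features[a] & features[b]
    if shared:
        G.add_edge(a, b, weight=len(shared))  # edge weight = number of shared features

# Cluster the weighted network and list drug-disease pairs inside each module
# as putative repositioning candidates.
for module in greedy_modularity_communities(G, weight="weight"):
    drugs = [n for n in module if n.startswith("drug:")]
    diseases = [n for n in module if n.startswith("disease:")]
    for d in drugs:
        for dis in diseases:
            print("candidate:", d, "->", dis)
```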

A reminder that data clustering isn’t just of academic interest but is useful in highly remunerative fields as well. 😉

There is a vast amount of literature on data clustering, but I don’t know whether there is a collection of data clustering patterns.

That is, a work that summarizes, domain by domain, where data clustering has been used and the similarities on which the clustering was performed.

In this article, the clustering was described as:

The nodes represent drugs or diseases while the edges represent shared gene, biological process, pathway, phenotype or a combination of these features.

Has that been used elsewhere in medical research?

Not that clustering should be limited to prior patterns, but prior patterns could stimulate new ones to be applied.

Thoughts?

October 23, 2014

Avoiding “Hive” Confusion

Filed under: BigData,Bioinformatics,Genomics,Hive — Patrick Durusau @ 6:46 pm

Depending on your community, when you hear “Hive,” you think “Apache Hive:”

The Apache Hive ™ data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

But, there is another “Hive,” which handles large datasets:

High-performance Integrated Virtual Environment (HIVE) is a specialized platform being developed/implemented by Dr. Simonyan’s group at FDA and Dr. Mazumder’s group at GWU where the storage library and computational powerhouse are linked seamlessly. This environment provides web access for authorized users to deposit, retrieve, annotate and compute on HTS data and analyze the outcomes using web-interface visual environments appropriately built in collaboration with research scientists and regulatory personnel.

I ran across this potential source of confusion earlier today and haven’t run it completely to ground but wanted to share some of what I have found so far.

Inside the HIVE, the FDA’s Multi-Omics Compute Architecture by Aaron Krol.

From the post:

“HIVE is not just a conventional virtual cloud environment,” says Simonyan. “It’s a different system that virtualizes the services.” Most cloud systems store data on multiple servers or compute units until users want to run a specific application. At that point, the relevant data is moved to a server that acts as a node for that computation. By contrast, HIVE recognizes which storage nodes contain data selected for analysis, then transfers executable code to those nodes, a relatively small task that allows computation to be performed wherever the data is stored. “We make the computations on exactly the machines where the data is,” says Simonyan. “So we’re not moving the data to the computational unit, we are moving computation to the data.”

When working with very large packets of data, cloud computing environments can sometimes spend more time on data transfer than on running code, making this “virtualized services” model much more efficient. To function, however, it relies on granular and readily-accessed metadata, so that searching for and collecting together relevant data doesn’t consume large quantities of compute time.

HIVE’s solution is the honeycomb data model, which stores raw NGS data and metadata together on the same network. The metadata — information like the sample, experiment, and run conditions that produced a set of NGS reads — is stored in its own tables that can be extended with as many values as users need to record. “The honeycomb data model allows you to put the entire database schema, regardless of how complex it is, into a single table,” says Simonyan. The metadata can then be searched through an object-oriented API that treats all data, regardless of type, the same way when executing search queries. The aim of the honeycomb model is to make it easy for users to add new data types and metadata fields, without compromising search and retrieval.

Popular consumption piece so next you may want to visit the HIVE site proper.

From the webpage:

HIVE is a cloud-based environment optimized for the storage and analysis of extra-large data, like Next Generation Sequencing data, Mass Spectroscopy files, Confocal Microscopy Images and others.

HIVE uses a variety of advanced scientific and computational visualization graphics; to get the MOST from your HIVE experience you must use a supported browser. These include Internet Explorer 8.0 or higher (Internet Explorer 9.0 is recommended), Google Chrome, Mozilla Firefox and Safari.

A few exemplary analytical outputs are displayed below for your enjoyment. But before you can take advantage of all that HIVE has to offer and create these objects for yourself, you’ll need to register.

With A framework for organizing cancer-related variations from existing databases, publications and NGS data using a High-performance Integrated Virtual Environment (HIVE) by Tsung-Jung Wu, et al., you are starting to approach the computational issues of interest for data integration.

From the article:

The aforementioned cooperation is difficult because genomics data are large, varied, heterogeneous and widely distributed. Extracting and converting these data into relevant information and comparing results across studies have become an impediment for personalized genomics (11). Additionally, because of the various computational bottlenecks associated with the size and complexity of NGS data, there is an urgent need in the industry for methods to store, analyze, compute and curate genomics data. There is also a need to integrate analysis results from large projects and individual publications with small-scale studies, so that one can compare and contrast results from various studies to evaluate claims about biomarkers.

See also: High-performance Integrated Virtual Environment (Wikipedia) for more leads to the literature.

Heterogeneous data is still at large and people are building solutions. Rather than either/or, what do you think topic maps could bring as a value-add to this project?

I first saw this in a tweet by ChemConnector.

October 14, 2014

Cryptic genetic variation in software:…

Filed under: Bioinformatics,Bugs,Programming,Software — Patrick Durusau @ 6:18 pm

Cryptic genetic variation in software: hunting a buffered 41 year old bug by Sean Eddy.

From the post:

In genetics, cryptic genetic variation means that a genome can contain mutations whose phenotypic effects are invisible because they are suppressed or buffered, but under rare conditions they become visible and subject to selection pressure.

In software code, engineers sometimes also face the nightmare of a bug in one routine that has no visible effect because of a compensatory bug elsewhere. You fix the other routine, and suddenly the first routine starts failing for an apparently unrelated reason. Epistasis sucks.

I’ve just found an example in our code, and traced the origin of the problem back 41 years to the algorithm’s description in a 1973 applied mathematics paper. The algorithm — for sampling from a Gaussian distribution — is used worldwide, because it’s implemented in the venerable RANLIB software library still used in lots of numerical codebases, including GNU Octave. It looks to me that the only reason code has been working is that a compensatory “mutation” has been selected for in everyone else’s code except mine.

…
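For readers who have never looked inside a Gaussian sampler, here is the textbook Marsaglia polar method in a few lines of Python. It is not the 1973 algorithm or the RANLIB code Eddy dissects, just context for the kind of routine in which such a bug can hide.

```python
import math
import random

def gaussian_pair(rng=random.random):
    """Two independent standard normal deviates via the Marsaglia polar method."""
    while True:
        u = 2.0 * rng() - 1.0
        v = 2.0 * rng() - 1.0
        s = u * u + v * v
        if 0.0 < s < 1.0:  # reject points outside the unit circle
            factor = math.sqrt(-2.0 * math.log(s) / s)
            return u * factor, v * factor

print(gaussian_pair())
```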

A bug hunting story to read and forward! Sean just bagged a forty-one (41) year old bug. What’s the oldest bug you have ever found?

When you reach the crux of the problem, you will understand why ambiguous, vague, incomplete and poorly organized standards annoy me to no end.

No guarantees of unambiguous results but if you need extra eyes on IT standards you know where to find me.

I first saw this in a tweet by Neil Saunders.

October 13, 2014

The Dirty Little Secret of Cancer Research

Filed under: BigData,Bioinformatics,Genomics,Medical Informatics — Patrick Durusau @ 8:06 pm

The Dirty Little Secret of Cancer Research by Jill Neimark.

From the post:


Across different fields of cancer research, up to a third of all cell lines have been identified as imposters. Yet this fact is widely ignored, and the lines continue to be used under their false identities. As recently as 2013, one of Ain’s contaminated lines was used in a paper on thyroid cancer published in the journal Oncogene.

“There are about 10,000 citations every year on false lines—new publications that refer to or rely on papers based on imposter (human cancer) cell lines,” says geneticist Christopher Korch, former director of the University of Colorado’s DNA Sequencing Analysis & Core Facility. “It’s like a huge pyramid of toothpicks precariously and deceptively held together.”

For all the worry about “big data,” where is the concern over “big bad data?”

Or is “big data” too big for correctness of the data to matter?

Once you discover that a paper is based on “imposter (human cancer) cell lines,” how do you pass that information along to anyone who attempts to cite the article?

In other words, where do you write down that data about the paper, where the paper is the subject in question?

And how do you propagate that data across a universe of citations?

The post ends on a high note of current improvements but it is far from settled how to prevent reliance on compromised research.

I first saw this in a tweet by Dan Graur.

October 10, 2014

Recognizing patterns in genomic data

Filed under: Bioinformatics,Biomedical,Genomics,Medical Informatics — Patrick Durusau @ 3:13 pm

Recognizing patterns in genomic data – New visualization software uncovers cancer subtypes from a vast repository of biomedical information by Stephanie Dutchen.

From the post:

Much of biomedical research these days is about big data—collecting and analyzing vast, detailed repositories of information about health and disease. These data sets can be treasure troves for investigators, often uncovering genetic mutations that drive a particular kind of cancer, for example.

Trouble is, it’s impossible for humans to browse that much data, let alone make any sense of it.

“It’s [StratomeX] a tool to help you make sense of the data you’re collecting and find the right questions to ask,” said Nils Gehlenborg, research associate in biomedical informatics at HMS and co-senior author of the correspondence in Nature Methods. “It gives you an unbiased view of patterns in the data. Then you can explore whether those patterns are meaningful.”

The software, called StratomeX, was developed to help researchers distinguish subtypes of cancer by crunching through the incredible amount of data gathered as part of The Cancer Genome Atlas, a National Institutes of Health–funded initiative. Identifying distinct cancer subtypes can lead to more effective, personalized treatments.

When users input a query, StratomeX compares tumor data at the molecular level that was collected from hundreds of patients and detects patterns that might indicate significant similarities or differences between groups of patients. The software presents those connections in an easy-to-grasp visual format.

“It helps you make meaningful distinctions,” said co-first author Alexander Lex, a postdoctoral researcher in the Pfister group.

Other than the obvious merits of this project, note the role of software as the assistant to the user. It crunches the numbers in a specific domain and presents those results in a meaningful fashion.

It is up to the user to decide which patterns are useful and which are not. Shades of “recommending” other instances of the “same” subject?

StratomeX is available for download.

I first saw this in a tweet by Harvard SEAS.

October 9, 2014

Programming for Biologists

Filed under: Bioinformatics,Biology,Programming,Teaching — Patrick Durusau @ 3:26 pm

Programming for Biologists by Ethan White.

From the post:

This is the website for Ethan White’s programming and database management courses designed for biologists. At the moment there are four courses being taught during Fall 2014.

The goal of these courses is to teach biologists how to use computers more effectively to make their research easier. We avoid a lot of the theory that is taught in introductory computer science classes in favor of covering more of the practical side of programming that is necessary for conducting research. In other words, the purpose of these courses is to teach you how to drive the car, not prepare you to be a mechanic.

Hmmm, less theory of engine design and more driving lessons? 😉

Despite my qualms about turn-key machine learning solutions, more people want to learn to drive a car than want to design an engine.

Should we teach topic maps the “right way” or should we teach them to drive?

I first saw this in a tweet by Christophe Lalanne.

October 6, 2014

Bioinformatics tools extracted from a typical mammalian genome project

Filed under: Archives,Bioinformatics,Documentation,Preservation — Patrick Durusau @ 7:55 pm

Bioinformatics tools extracted from a typical mammalian genome project

From the post:

In this extended blog post, I describe my efforts to extract the information about bioinformatics-related items from a recent genome sequencing paper, and the larger issues this raises in the field. It’s long, and it’s something of a hybrid between a blog post and a paper format, just to give it some structure for my own organization. A copy of this will also be posted at FigShare with the full data set. Huge thanks to the gibbon genome project team for a terrific paper and extensively-documented collection of their processes and resources. The issues I wanted to highlight are about the access to bioinformatics tools in general and are not specific to this project at all, but are about the field.

A must read if you are interested in useful preservation of research and data. The paper focuses on needed improvements in bioinformatics but the issues raised are common to all fields.

How well does your field perform when compared to bioinformatics?

October 2, 2014

NCBI webinar on E-Utilities October 15th

Filed under: Bioinformatics,Medical Informatics — Patrick Durusau @ 3:58 pm

NCBI webinar on E-Utilities October 15th

From the post:

On October 15th, NCBI will have a webinar entitled “An Introduction to NCBI’s E-Utilities, an NCBI API.” E-Utilities is a tool to assist programmers in accessing, searching and retrieving a wide variety of data from NCBI servers.

This presentation will introduce you to the Entrez Programming Utilities (E-Utilities), the public API for the NCBI Entrez system that includes 40 databases such as Pubmed, PMC, Gene, Genome, GEO and dbSNP. After covering the basic functions and URL syntax of the E-utilities, we will then demonstrate these functions using Entrez Direct, a set of UNIX command line programs that allow you to incorporate E-utility calls easily into simple shell scripts.

Click here to register.

Thought you might find this interesting for populating topic maps out of NCBI servers.
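If you would rather hit the E-utilities directly than go through Entrez Direct, a minimal sketch in Python looks like the following. The esearch/efetch URL pattern is the documented one, but check the parameters against NCBI’s current documentation before building anything on it.

```python
import requests

BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# esearch: find PubMed IDs matching a query.
ids = requests.get(
    f"{BASE}/esearch.fcgi",
    params={"db": "pubmed", "term": "topic maps bioinformatics", "retmode": "json"},
    timeout=30,
).json()["esearchresult"]["idlist"]

# efetch: pull abstracts for those IDs as plain text.
if ids:
    abstracts = requests.get(
        f"{BASE}/efetch.fcgi",
        params={"db": "pubmed", "id": ",".join(ids), "rettype": "abstract", "retmode": "text"},
        timeout=30,
    ).text
    print(abstracts[:500])
```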

October 1, 2014

FOAM (Functional Ontology Assignments for Metagenomes):…

Filed under: Bioinformatics,Genomics,Ontology — Patrick Durusau @ 4:43 pm

FOAM (Functional Ontology Assignments for Metagenomes): a Hidden Markov Model (HMM) database with environmental focus by Emmanuel Prestat, et al. (Nucl. Acids Res. (2014) doi: 10.1093/nar/gku702 )

Abstract:

A new functional gene database, FOAM (Functional Ontology Assignments for Metagenomes), was developed to screen environmental metagenomic sequence datasets. FOAM provides a new functional ontology dedicated to classify gene functions relevant to environmental microorganisms based on Hidden Markov Models (HMMs). Sets of aligned protein sequences (i.e. ‘profiles’) were tailored to a large group of target KEGG Orthologs (KOs) from which HMMs were trained. The alignments were checked and curated to make them specific to the targeted KO. Within this process, sequence profiles were enriched with the most abundant sequences available to maximize the yield of accurate classifier models. An associated functional ontology was built to describe the functional groups and hierarchy. FOAM allows the user to select the target search space before HMM-based comparison steps and to easily organize the results into different functional categories and subcategories. FOAM is publicly available at http://portal.nersc.gov/project/m1317/FOAM/.

Aside from its obvious importance for genomics and bioinformatics, I mention this because the authors point out:

A caveat of this approach is that we did not consider the quality of the tree in the tree-splitting step (i.e. weakly supported branches were equally treated as strongly supported ones), producing models of different qualities. Nevertheless, we decided that the approach of rational classification is better than no classification at all. In the future, the groups could be recomputed, or split more optimally when more data become available (e.g. more KOs). From each cluster related to the KO in process, we extracted the alignment from which HMMs were eventually built.

I take that to mean that this “ontology” represents no unchanging ground truth but rather an attempt to enhance the “…screening of environmental metagenomic and metatranscriptomic sequence datasets for functional genes.”

As more information is gained, the present “ontology” can and will change. Those future changes create the necessity to map those changes and the facts that drove them.
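
One way to meet that necessity is to record each old-term/new-term mapping together with the evidence that drove the change. A minimal Python sketch, with the terms, the split and the reason all invented purely for illustration:

    # Two hypothetical versions of a functional ontology; the labels are made up.
    old_version = {"KO:K00001": "alcohol dehydrogenase group A"}
    new_version = {"KO:K00001a": "alcohol dehydrogenase, clade A",
                   "KO:K00001b": "alcohol dehydrogenase, clade B"}

    mappings = []

    def record_split(old_id, new_ids, reason):
        # Record that one old term was split into several new terms, and why.
        for new_id in new_ids:
            mappings.append({
                "old": old_id, "old_label": old_version[old_id],
                "new": new_id, "new_label": new_version[new_id],
                "relation": "split-into", "reason": reason,
            })

    record_split("KO:K00001", ["KO:K00001a", "KO:K00001b"],
                 reason="additional KOs allowed a finer tree split")

    for m in mappings:
        print(f'{m["old"]} -> {m["new"]} ({m["relation"]}): {m["reason"]}')

Keeping the reason alongside the mapping is what lets a later reader judge whether an old classification can still be trusted for their purpose.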

I first saw this in a tweet by Jonathan Eisen.

September 18, 2014

2015 Medical Subject Headings (MeSH) Now Available

Filed under: Bioinformatics,Biomedical,MARC,MeSH — Patrick Durusau @ 10:24 am

2015 Medical Subject Headings (MeSH) Now Available

From the post:

Introduction to MeSH 2015
The Introduction to MeSH 2015 is now available, including information on its use and structure, as well as recent updates and availability of data.

MeSH Browser
The default year in the MeSH Browser remains 2014 MeSH for now, but the alternate link provides access to 2015 MeSH. The MeSH Section will continue to provide access via the MeSH Browser for two years of the vocabulary: the current year and an alternate year. Sometime in November or December, the default year will change to 2015 MeSH and the alternate link will provide access to the 2014 MeSH.

Download MeSH
Download 2015 MeSH in XML and ASCII formats. Also available for 2015 from the same MeSH download page are:

  • Pharmacologic Actions (Forthcoming)
  • New Headings with Scope Notes
  • MeSH Replaced Headings
  • MeSH MN (tree number) changes
  • 2015 MeSH in MARC format

Enjoy!
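
If you grab the XML download, here is a minimal Python sketch for reading descriptor records out of it. The element names (DescriptorRecord, DescriptorUI, DescriptorName/String, TreeNumberList/TreeNumber) reflect my reading of the MeSH descriptor format; check them against the 2015 release before relying on this:

    import xml.etree.ElementTree as ET

    def iter_descriptors(path):
        # Yield (ui, name, tree_numbers) for each DescriptorRecord in the file.
        for _, record in ET.iterparse(path):
            if record.tag != "DescriptorRecord":
                continue
            ui = record.findtext("DescriptorUI")
            name = record.findtext("DescriptorName/String")
            trees = [t.text for t in record.findall("TreeNumberList/TreeNumber")]
            yield ui, name, trees
            record.clear()   # keep memory use flat on a large file

    for ui, name, trees in iter_descriptors("desc2015.xml"):
        print(ui, name, ";".join(trees))
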

September 9, 2014

PLOS Resources on Ebola

Filed under: Bioinformatics,Open Access,Open Data — Patrick Durusau @ 7:09 pm

PLOS Resources on Ebola by Virginia Barbour and PLOS Collections.

From the post:

The current Ebola outbreak in West Africa probably began in Guinea in 2013, but it was only recognized properly in early 2014 and shows, at the time of writing, no sign of subsiding. The continuous human-to-human transmission of this new outbreak virus has become increasingly worrisome.

Analyses thus far of this outbreak mark it as the most serious in recent years and the effects are already being felt far beyond those who are infected and dying; whole communities in West Africa are suffering because of its negative effects on health care and other infrastructures. Globally, countries far removed from the outbreak are considering their local responses, were Ebola to be imported; and the ripple effects on the normal movement of trade and people are just becoming apparent.

A great collection of PLOS resources on Ebola.

Even usually closed sources are making Ebola information available for free:

Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak (Science DOI: 10.1126/science.1259657). This is the gene-sequencing report that establishes that one (1) person, infected by eating bush meat, was the source of all the subsequent Ebola infections.

So much for needing highly specialized labs to “weaponize” biological agents. One infection is likely to result in > 20,000 deaths. You do the math.

I first saw this in a tweet by Alex Vespignani.

A document classifier for medicinal chemistry publications trained on the ChEMBL corpus

A document classifier for medicinal chemistry publications trained on the ChEMBL corpus by George Papadatos, et al. (Journal of Cheminformatics 2014, 6:40)

Abstract:

Background

The large increase in the number of scientific publications has fuelled a need for semi- and fully automated text mining approaches in order to assist in the triage process, both for individual scientists and also for larger-scale data extraction and curation into public databases. Here, we introduce a document classifier, which is able to successfully distinguish between publications that are ‘ChEMBL-like’ (i.e. related to small molecule drug discovery and likely to contain quantitative bioactivity data) and those that are not. The unprecedented size of the medicinal chemistry literature collection, coupled with the advantage of manual curation and mapping to chemistry and biology make the ChEMBL corpus a unique resource for text mining.

Results

The method has been implemented as a data protocol/workflow for both Pipeline Pilot (version 8.5) and KNIME (version 2.9) respectively. Both workflows and models are freely available at: ftp://ftp.ebi.ac.uk/pub/databases/chembl/text-mining. These can be readily modified to include additional keyword constraints to further focus searches.

Conclusions

Large-scale machine learning document classification was shown to be very robust and flexible for this particular application, as illustrated in four distinct text-mining-based use cases. The models are readily available on two data workflow platforms, which we believe will allow the majority of the scientific community to apply them to their own data.

While the abstract mentions “the triage process,” it fails to capture the main goal of this paper:

…the main goal of our project diverges from the goal of the tools mentioned. We aim to meet the following criteria: ranking and prioritising the relevant literature using a fast and high performance algorithm, with a generic methodology applicable to other domains and not necessarily related to chemistry and drug discovery. In this regard, we present a method that builds upon the manually collated and curated ChEMBL document corpus, in order to train a Bag-of-Words (BoW) document classifier.

In more detail, we have employed two established classification methods, namely Naïve Bayesian (NB) and Random Forest (RF) approaches [12]-[14]. The resulting classification score, henceforth referred to as ‘ChEMBL-likeness’, is used to prioritise relevant documents for data extraction and curation during the triage process.

In other words, the focus of this paper is a classifier to help prioritize curation of papers. I take that as being different from classifiers used at other stages or for other purposes in the curation process.
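
To make the "ChEMBL-likeness" idea concrete, here is a minimal Bag-of-Words Naive Bayes sketch in Python using scikit-learn. It is not the authors' Pipeline Pilot/KNIME workflow, and the toy abstracts and labels below are invented:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    abstracts = [
        "IC50 values were determined for a series of kinase inhibitors",
        "Synthesis and SAR of novel antagonists with nanomolar potency",
        "A survey of hospital administration costs in rural clinics",
        "Qualitative interviews on patient attitudes toward telemedicine",
    ]
    labels = [1, 1, 0, 0]   # 1 = 'ChEMBL-like', 0 = not

    # Bag-of-Words features feeding a Naive Bayes classifier.
    clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
    clf.fit(abstracts, labels)

    # Rank new documents by their probability of being ChEMBL-like,
    # i.e. prioritise them for curation.
    new_docs = ["Binding affinity of substituted pyrimidines against CDK2"]
    scores = clf.predict_proba(new_docs)[:, 1]
    print(sorted(zip(scores, new_docs), reverse=True))

The ranking step is the point: curators see the highest-scoring documents first, which is exactly the triage use the authors describe.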

I first saw this in a tweet by ChemConnector.

August 28, 2014

Thou Shalt Share!

Filed under: Bioinformatics,Genome,Genomics,Open Data — Patrick Durusau @ 1:46 pm

NIH Tells Genomic Researchers: ‘You Must Share Data’ by Paul Basken.

From the post:

Scientists who use government money to conduct genomic research will now be required to quickly share the data they gather under a policy announced on Wednesday by the National Institutes of Health.

The data-sharing policy, which will take effect with grants awarded in January, will give agency-financed researchers six months to load any genomic data they collect—from human or nonhuman subjects—into a government-established database or a recognized alternative.

NIH officials described the move as the latest in a series of efforts by the federal government to improve the efficiency of taxpayer-financed research by ensuring that scientific findings are shared as widely as possible.

“We’ve gone from a circumstance of saying, ‘Everybody should share data,’ to now saying, in the case of genomic data, ‘You must share data,’” said Eric D. Green, director of the National Human Genome Research Institute at the NIH.

A step in the right direction!

Waiting for other government funding sources and private funders (including in the humanities) to take the same step.

I first saw this in a tweet by Kevin Davies.

August 26, 2014

Test Your Analysis With Random Numbers

Filed under: Bioinformatics,Data Analysis,Statistics — Patrick Durusau @ 12:55 pm

A critical reanalysis of the relationship between genomics and well-being by Nicholas J. L. Brown, et al. (Proc Natl Acad Sci USA, doi: 10.1073/pnas.1407057111)

Abstract:

Fredrickson et al. [Fredrickson BL, et al. (2013) Proc Natl Acad Sci USA 110(33):13684–13689] claimed to have observed significant differences in gene expression related to hedonic and eudaimonic dimensions of well-being. Having closely examined both their claims and their data, we draw substantially different conclusions. After identifying some important conceptual and methodological flaws in their argument, we report the results of a series of reanalyses of their dataset. We first applied a variety of exploratory and confirmatory factor analysis techniques to their self-reported well-being data. A number of plausible factor solutions emerged, but none of these corresponded to Fredrickson et al.’s claimed hedonic and eudaimonic dimensions. We next examined the regression analyses that purportedly yielded distinct differential profiles of gene expression associated with the two well-being dimensions. Using the best-fitting two-factor solution that we identified, we obtained effects almost twice as large as those found by Fredrickson et al. using their questionable hedonic and eudaimonic factors. Next, we conducted regression analyses for all possible two-factor solutions of the psychometric data; we found that 69.2% of these gave statistically significant results for both factors, whereas only 0.25% would be expected to do so if the regression process was really able to identify independent differential gene expression effects. Finally, we replaced Fredrickson et al.’s psychometric data with random numbers and continued to find very large numbers of apparently statistically significant effects. We conclude that Fredrickson et al.’s widely publicized claims about the effects of different dimensions of well-being on health-related gene expression are merely artifacts of dubious analyses and erroneous methodology. (emphasis added)

To see the details you will need a subscription to the Proceedings of the National Academy of Sciences.

However, you can take this data analysis lesson from the abstract:

If your data can be replaced with random numbers and still yield statistically significant results, stop the publication process. Something is seriously wrong with your methodology.
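
Here is a minimal Python sketch of that sanity check: swap the real predictor for random numbers, rerun the regression many times and count how often it still comes out "significant." The sample size and trial count are arbitrary choices; roughly 5% "significant" runs is what chance alone delivers, and much more than that points at the methodology rather than the data:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_samples, n_trials = 80, 1000
    outcome = rng.normal(size=n_samples)   # stand-in for the measured outcome

    significant = 0
    for _ in range(n_trials):
        random_predictor = rng.normal(size=n_samples)
        _, _, _, p_value, _ = stats.linregress(random_predictor, outcome)
        significant += p_value < 0.05

    print(f"{significant / n_trials:.1%} of random-number runs were 'significant'")
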

I first saw this in a tweet by WvSchaik.

August 23, 2014

MeSH on Demand Tool:…

Filed under: Authoring Topic Maps,Bioinformatics,Medical Informatics,MeSH — Patrick Durusau @ 9:43 am

MeSH on Demand Tool: An Easy Way to Identify Relevant MeSH Terms by Dan Cho.

From the post:

Currently, the MeSH Browser allows for searches of MeSH terms, text-word searches of the Annotation and Scope Note, and searches of various fields for chemicals. These searches assume that users are familiar with MeSH terms and using the MeSH Browser.

Wouldn’t it be great if you could find MeSH terms directly from your text such as an abstract or grant summary? MeSH on Demand has been developed in close collaboration among MeSH Section, NLM Index Section, and the Lister Hill National Center for Biomedical Communications to address this need.

Using MeSH on Demand

Use MeSH on Demand to find MeSH terms relevant to your text up to 10,000 characters. One of the strengths of MeSH on Demand is its ease of use without any prior knowledge of the MeSH vocabulary and without any downloads.

Now there’s a clever idea!

Imagine extending it just a bit so that it produces topics for the subjects it detects in your text, along with associations between the text and its author. I would call that assisted topic map authoring. You?
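
Here is a minimal Python sketch of what that assisted authoring step could look like. The detected terms, document and author below are invented stand-ins for what MeSH on Demand would return:

    # Pretend output from term detection plus document metadata (all invented).
    detected_terms = ["Hemorrhagic Fever, Ebola", "Disease Outbreaks"]
    doc = {"id": "doc-001", "title": "Draft abstract on the 2014 outbreak",
           "author": "J. Researcher"}

    # One topic per detected subject, plus topics for the document and its author.
    topics = [{"id": f"topic-{i}", "subject": term}
              for i, term in enumerate(detected_terms)]
    topics.append({"id": "topic-doc", "subject": doc["title"]})
    topics.append({"id": "topic-author", "subject": doc["author"]})

    # Associations: the document discusses each detected subject and has an author.
    associations = [{"type": "discusses",
                     "roles": {"document": "topic-doc", "subject": t["id"]}}
                    for t in topics[:len(detected_terms)]]
    associations.append({"type": "authored-by",
                         "roles": {"document": "topic-doc", "author": "topic-author"}})

    for a in associations:
        print(a["type"], a["roles"])

A human author would still merge, prune and scope these, but the tedious first pass is done by the detector.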

I followed a tweet by Michael Hoffman, which led to MeSH on Demand Update: How to Find Citations Related to Your Text, which describes an enhancement to MeSH on Demand that finds ten relevant citations based on your text.

The enhanced version mimics the traditional method of writing court opinions. A judge writes his decision and then a law clerk finds cases that support the positions taken in the opinion. You really thought it worked some other way? 😉
