Archive for the ‘Bioinformatics’ Category

PubMed Commons to be Discontinued

Friday, February 2nd, 2018

PubMed Commons to be Discontinued

From the post:

PubMed Commons has been a valuable experiment in supporting discussion of published scientific literature. The service was first introduced as a pilot project in the fall of 2013 and was reviewed in 2015. Despite low levels of use at that time, NIH decided to extend the effort for another year or two in hopes that participation would increase. Unfortunately, usage has remained minimal, with comments submitted on only 6,000 of the 28 million articles indexed in PubMed.

While many worthwhile comments were made through the service during its 4 years of operation, NIH has decided that the low level of participation does not warrant continued investment in the project, particularly given the availability of other commenting venues.

Comments will still be available; see the post for details.

Good time for the reminder that even negative results from an experiment are valuable.

Even more so in this case because discussion/comment facilities are non-trivial components of a content delivery system. Time and resources not spent on comment facilities can be directed elsewhere.

Where do discussions of medical articles take place and can they be used to automatically annotate published articles?

Data Science Bowl 2018 – Spot Nuclei. Speed Cures.

Tuesday, January 16th, 2018

Spot Nuclei. Speed Cures.

From the webpage:

The 2018 Data Science Bowl offers our most ambitious mission yet: Create an algorithm to automate nucleus detection and unlock faster cures.

Compete on Kaggle

Three months. $100,000.

Even if you “lose,” think of the experience you will gain. No losers.


PS: Just thinking out loud, but if:

This dataset contains a large number of segmented nuclei images. The images were acquired under a variety of conditions and vary in the cell type, magnification, and imaging modality (brightfield vs. fluorescence). The dataset is designed to challenge an algorithm’s ability to generalize across these variations.

isn’t the ability to generalize, with its attendant loss of performance, a downside?

Why not use the best algorithm for a specified set of data conditions, “merging” algorithms so to speak, so that scientists always have the best algorithm for their specific data sets?

So outside the contest, perhaps the conditions under which the images were acquired are the most important subjects, and each set of conditions should be matched to the algorithm that performs best under it.
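That routing idea can be sketched as a simple dispatcher. Everything below (the function names, the modality keys) is hypothetical, a shape for the idea rather than a working segmenter:

```python
def segment_brightfield(image):
    """Placeholder for a segmenter tuned to brightfield images."""
    raise NotImplementedError

def segment_fluorescence(image):
    """Placeholder for a segmenter tuned to fluorescence images."""
    raise NotImplementedError

# Map recognized imaging conditions to specialist algorithms.
SEGMENTERS = {
    "brightfield": segment_brightfield,
    "fluorescence": segment_fluorescence,
}

def segment_nuclei(image, modality):
    """Dispatch to the best-known segmenter for the image's modality."""
    try:
        return SEGMENTERS[modality](image)
    except KeyError:
        raise ValueError(f"no segmenter registered for {modality!r}") from None
```

Recognizing the acquisition conditions then becomes the interesting problem; the dispatch itself is trivial.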

Anyone interested in collaborating on a topic map entry?

A Guide to Reproducible Code in Ecology and Evolution

Thursday, December 7th, 2017

A Guide to Reproducible Code in Ecology and Evolution by the British Ecological Society.

Natalie Cooper, Natural History Museum, UK, and Pen-Yuan Hsing, Durham University, UK, write in the introduction:

The way we do science is changing — data are getting bigger, analyses are getting more complex, and governments, funding agencies and the scientific method itself demand more transparency and accountability in research. One way to deal with these changes is to make our research more reproducible, especially our code.

Although most of us now write code to perform our analyses, it is often not very reproducible. We have all come back to a piece of work we have not looked at for a while and had no idea what our code was doing or which of the many “final_analysis” scripts truly was the final analysis! Unfortunately, the number of tools for reproducibility and all the jargon can leave new users feeling overwhelmed, with no idea how to start making their code more reproducible. So, we have put together this guide to help.

A Guide to Reproducible Code covers all the basic tools and information you will need to start making your code more reproducible. We focus on R and Python, but many of the tips apply to any programming language. Anna Krystalli introduces some ways to organise files on your computer and to document your workflows. Laura Graham writes about how to make your code more reproducible and readable. François Michonneau explains how to write reproducible reports. Tamora James breaks down the basics of version control. Finally, Mike Croucher describes how to archive your code. We have also included a selection of helpful tips from other scientists.

True reproducibility is really hard. But do not let this put you off. We would not expect anyone to follow all of the advice in this booklet at once. Instead, challenge yourself to add one more aspect to each of your projects. Remember, partially reproducible research is much better than completely non-reproducible research.

Good luck!
… (emphasis in original)

Not counting front and back matter, 39 pages total. A lot to grasp in one reading but if you don’t already have reproducible research habits, keep a copy of this publication on top of your desk. Yes, on top of the incoming mail, today’s newspaper, forms and chart requests from administrators, etc. On top means just that, on top.

At some future date, when the pages are too worn, creased, folded, dog eared and annotated to be read easily, reprint it and transfer your annotations to a clean copy.

I first saw this in David Smith’s The British Ecological Society’s Guide to Reproducible Science.

PS: The same rules apply to data science.
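As a minimal gesture toward the guide's advice, here is a sketch in Python (the guide itself covers both R and Python) of fixing a random seed and recording the environment alongside your results. The file name and report fields are my own choices, not the guide's:

```python
import json
import platform
import random
import sys

SEED = 20171207  # fix the seed so the analysis is repeatable

def environment_report():
    """Capture the details someone would need to rerun this analysis."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "seed": SEED,
    }

random.seed(SEED)
sample = [random.randint(1, 100) for _ in range(5)]

# Write the report next to the results, not in a README you'll forget.
with open("environment.json", "w") as fh:
    json.dump(environment_report(), fh, indent=2)
```

Partially reproducible beats non-reproducible; even this much would have saved many a "final_analysis" archaeology session.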

A Primer for Computational Biology

Thursday, November 9th, 2017

A Primer for Computational Biology by Shawn T. O’Neil.

From the webpage:

A Primer for Computational Biology aims to provide life scientists and students the skills necessary for research in a data-rich world. The text covers accessing and using remote servers via the command-line, writing programs and pipelines for data analysis, and provides useful vocabulary for interdisciplinary work. The book is broken into three parts:

  1. Introduction to Unix/Linux: The command-line is the “natural environment” of scientific computing, and this part covers a wide range of topics, including logging in, working with files and directories, installing programs and writing scripts, and the powerful “pipe” operator for file and data manipulation.
  2. Programming in Python: Python is both a premier language for learning and a common choice in scientific software development. This part covers the basic concepts in programming (data types, if-statements and loops, functions) via examples of DNA-sequence analysis. This part also covers more complex subjects in software development such as objects and classes, modules, and APIs.
  3. Programming in R: The R language specializes in statistical data analysis, and is also quite useful for visualizing large datasets. This third part covers the basics of R as a programming language (data types, if-statements, functions, loops and when to use them) as well as techniques for large-scale, multi-test analyses. Other topics include S3 classes and data visualization with ggplot2.
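As a flavor of the part 2 material, basic programming via DNA-sequence examples, a GC-content calculation is the customary first exercise (this is my illustration, not an excerpt from the book):

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    # True counts as 1, so summing membership tests counts G/C bases.
    return sum(base in "GC" for base in seq) / len(seq)
```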

Pass along to life scientists and students.

This isn’t a primer that separates the CS material from domain-specific examples and prose; adapting it to another domain is a matter of re-writing.

I assume an adaptable primer wasn’t the author’s intention, so that isn’t a criticism but an observation: basic material is written over and over again, needlessly.

I first saw this in a tweet by Christophe Lalanne.

DNA Injection Attack (Shellcode in Data)

Thursday, August 10th, 2017

BioHackers Encoded Malware in a String of DNA by Andy Greenberg.

From the post:

WHEN BIOLOGISTS SYNTHESIZE DNA, they take pains not to create or spread a dangerous stretch of genetic code that could be used to create a toxin or, worse, an infectious disease. But one group of biohackers has demonstrated how DNA can carry a less expected threat—one designed to infect not humans nor animals but computers.

In new research they plan to present at the USENIX Security conference on Thursday, a group of researchers from the University of Washington has shown for the first time that it’s possible to encode malicious software into physical strands of DNA, so that when a gene sequencer analyzes it the resulting data becomes a program that corrupts gene-sequencing software and takes control of the underlying computer. While that attack is far from practical for any real spy or criminal, it’s one the researchers argue could become more likely over time, as DNA sequencing becomes more commonplace, powerful, and performed by third-party services on sensitive computer systems. And, perhaps more to the point for the cybersecurity community, it also represents an impressive, sci-fi feat of sheer hacker ingenuity.

“We know that if an adversary has control over the data a computer is processing, it can potentially take over that computer,” says Tadayoshi Kohno, the University of Washington computer science professor who led the project, comparing the technique to traditional hacker attacks that package malicious code in web pages or an email attachment. “That means when you’re looking at the security of computational biology systems, you’re not only thinking about the network connectivity and the USB drive and the user at the keyboard but also the information stored in the DNA they’re sequencing. It’s about considering a different class of threat.”

Very high marks for imaginative delivery but at its core, this is shellcode in data.

Shellcode in an environment the authors describe as follows:

Our results, and particularly our discovery that bioinformatics software packages do not seem to be written with adversaries in mind, suggest that the bioinformatics pipeline has to date not received significant adversarial pressure.

(Computer Security, Privacy, and DNA Sequencing: Compromising Computers with Synthesized DNA, Privacy Leaks, and More)

Question: Can you name any data pipelines that have been subjected to adversarial pressure?

The reading of DNA and transposition into machine format reminds me that a data pipeline could ingest apparently non-hostile data and, as a result of transformations/processing, produce hostile data at some point in the data stream.

Transformation into shellcode, now that’s a very interesting concept.

Can You Replicate Your Searches?

Thursday, February 16th, 2017

A comment at PubMed raises the question of replicating reported literature searches:

From the comment:

Melissa Rethlefsen

I thank the authors of this Cochrane review for providing their search strategies in the document Appendix. Upon trying to reproduce the Ovid MEDLINE search strategy, we came across several errors. It is unclear whether these are transcription errors or represent actual errors in the performed search strategy, though likely the former.

For instance, in line 39, the search is “tumour bed.kw.ti.ab” [quotes not in original]. The correct syntax would be “tumour bed.kw,ti,ab” [no quotes]. The same is true for line 41, where the commas are replaced with periods.

In line 42, the search is “Breast Neoplasms /” [quotes not in original]. It is not entirely clear what the authors meant here, but likely they meant to search the MeSH heading Breast Neoplasms with the subheading radiotherapy. If that is the case, the search should have been “Breast Neoplasms/rt” [no quotes].

In lines 43 and 44, it appears as though the authors were trying to search for the MeSH term “Radiotherapy, Conformal” with two different subheadings, which they spell out and end with a subject heading field search (i.e., Radiotherapy, Conformal/adverse effects.sh). In Ovid syntax, however, the correct search syntax would be “Radiotherapy, Conformal/ae” [no quotes] without the subheading spelled out and without the extraneous .sh.

In line 47, there is another minor error, again with .sh being extraneously added to the search term “Radiotherapy/” [quotes not in original].

Though these errors are minor and are highly likely to be transcription errors, when attempting to replicate this search, each of these lines produces an error in Ovid. If a searcher is unaware of how to fix these problems, the search becomes unreplicable. Because the search could not have been completed as published, it is unlikely this was actually how the search was performed; however, it is a good case study to examine how even small details matter greatly for reproducibility in search strategies.

A great reminder that replication of searches is a non-trivial task and that search engines are literal to the point of idiocy.
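Machine-checking search strategies before publication would catch errors of this kind. A hypothetical checker (the field-list convention below is my reading of Ovid syntax, not an official grammar, and it ignores subject-heading lines entirely) could flag the comma/period swaps described in the comment:

```python
import re

# Ovid-style field lists: a period introduces the fields, which are
# two-letter codes separated by commas, e.g.  tumour bed.kw,ti,ab.
FIELDS_RE = re.compile(r"[a-z]{2}(,[a-z]{2})*\.?")

def has_valid_fields(line: str) -> bool:
    """Return True if everything after the first period in the search
    line is a well-formed comma-separated field list."""
    term, dot, fields = line.partition(".")
    return bool(dot) and FIELDS_RE.fullmatch(fields) is not None
```

A real validator would run the strategy against the live interface, but even this much would have flagged several of the lines above.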

Unmet Needs for Analyzing Biological Big Data… [Data Integration #1 – Spells Market Opportunity]

Wednesday, February 15th, 2017

Unmet Needs for Analyzing Biological Big Data: A Survey of 704 NSF Principal Investigators by Lindsay Barone, Jason Williams, David Micklos.


In a 2016 survey of 704 National Science Foundation (NSF) Biological Sciences Directorate principal investigators (BIO PIs), nearly 90% indicated they are currently or will soon be analyzing large data sets. BIO PIs considered a range of computational needs important to their work, including high performance computing (HPC), bioinformatics support, multi-step workflows, updated analysis software, and the ability to store, share, and publish data. Previous studies in the United States and Canada emphasized infrastructure needs. However, BIO PIs said the most pressing unmet needs are training in data integration, data management, and scaling analyses for HPC, acknowledging that data science skills will be required to build a deeper understanding of life. This portends a growing data knowledge gap in biology and challenges institutions and funding agencies to redouble their support for computational training in biology.

In particular, the needs topic maps can address rank #1, #2, #6, #7, and #10, or as the authors found:

A majority of PIs—across bioinformatics/other disciplines, larger/smaller groups, and the four NSF programs—said their institutions are not meeting nine of 13 needs (Figure 3). Training on integration of multiple data types (89%), on data management and metadata (78%), and on scaling analysis to cloud/HP computing (71%) were the three greatest unmet needs. High performance computing was an unmet need for only 27% of PIs—with similar percentages across disciplines, different sized groups, and NSF programs.

or graphically (figure 3):

So, cloud, distributed, parallel, pipelining, etc., processing is insufficient?

Pushing undocumented and unintegratable data at ever increasing speeds is impressive but gives no joy?

This report will provoke another round of Esperanto fantasies, that is, the creation of “universal” vocabularies which, if used by everyone and back-mapped to all existing literature, would solve the problem.

The number of such fantasies and the cost/delay of back-mapping to legacy data defeat all such efforts. Those defeats haven’t prevented repeated funding of such fantasies in the past and present, and no doubt won’t in the future.

Perhaps those defeats are a question of scope.

That is rather than even attempting some “universal” interchange of data, why not approach it incrementally?

I suspect the PIs surveyed each had some particular data set in mind when they mentioned data integration (which is itself a very broad term).

Why not seek out, develop and publish data integrations in particular instances, as opposed to attempting to theorize what might work for data yet unseen?

The need topic maps were meant to meet remains unmet, with no signs of lessening.

Opportunity knocks. Will we answer?

Staying Current in Bioinformatics & Genomics: 2017 Edition

Wednesday, February 1st, 2017

Staying Current in Bioinformatics & Genomics: 2017 Edition by Stephen Turner.

From the post:

A while back I wrote this post about how I stay current in bioinformatics & genomics. That was nearly five years ago. A lot has changed since then. A few links are dead. Some of the blogs or Twitter accounts I mentioned have shifted focus or haven’t been updated in years (guilty as charged). The way we consume media has evolved — Google thought they could kill off RSS (long live RSS!), there are many new literature alert services, preprints have really taken off in this field, and many more scientists are engaging via social media than before.

People still frequently ask me how I stay current and keep a finger on the pulse of the field. I’m not claiming to be able to do this well — that’s a near-impossible task for anyone. Five years later and I still run our bioinformatics core, and I’m still mostly focused on applied methodology and study design rather than any particular phenotype, model system, disease, or specific method. It helps me to know that transcript-level estimates improve gene-level inferences from RNA-seq data, and that there’s software to help me do this, but the details underlying kmer shredding vs pseudoalignment to a transcriptome de Bruijn graph aren’t as important to me as knowing that there’s a software implementation that’s well documented, actively supported, and performs well in fair benchmarks. As such, most of what I pay attention to is applied/methods-focused.

What follows is a scattershot, noncomprehensive guide to the people, blogs, news outlets, journals, and aggregators that I lean on in an attempt to stay on top of things. I’ve inevitably omitted some key resources, so please don’t be offended if you don’t see your name/blog/Twitter/etc. listed here (drop a link in the comments!). Whatever I write here now will be out of date in no time, so I’ll try to write an update post every year instead of every five.
… (emphasis in original)

Pure gold as is always the case with Stephen’s posts.

Stephen spends an hour every day scanning his list of resources.

Taking his list as a starting point, what capabilities would you build into a dashboard to facilitate that daily review?
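One obvious capability: keyword filtering over the RSS feeds Stephen lists, so an hour of scanning shrinks to minutes. A sketch using only the Python standard library (the feed structure is assumed to be plain RSS 2.0):

```python
import xml.etree.ElementTree as ET

def matching_items(rss_xml: str, keywords):
    """Return (title, link) pairs whose titles mention any keyword."""
    root = ET.fromstring(rss_xml)
    hits = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        if any(k.lower() in title.lower() for k in keywords):
            hits.append((title, item.findtext("link", default="")))
    return hits
```

Feed the output into a daily digest and the dashboard is half built; ranking, deduplication across feeds, and preprint alerts are the harder half.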

Top considerations for creating bioinformatics software documentation

Wednesday, January 18th, 2017

Top considerations for creating bioinformatics software documentation by Mehran Karimzadeh and Michael M. Hoffman.


Investing in documenting your bioinformatics software well can increase its impact and save your time. To maximize the effectiveness of your documentation, we suggest following a few guidelines we propose here. We recommend providing multiple avenues for users to use your research software, including a navigable HTML interface with a quick start, useful help messages with detailed explanation and thorough examples for each feature of your software. By following these guidelines, you can assure that your hard work maximally benefits yourself and others.


You have written a new software package far superior to any existing method. You submit a paper describing it to a prestigious journal, but it is rejected after Reviewer 3 complains they cannot get it to work. Eventually, a less exacting journal publishes the paper, but you never get as many citations as you expected. Meanwhile, there is not even a single day when you are not inundated by emails asking very simple questions about using your software. Your years of work on this method have not only failed to reap the dividends you expected, but have become an active irritation. And you could have avoided all of this by writing effective documentation in the first place.

Academic bioinformatics curricula rarely train students in documentation. Many bioinformatics software packages lack sufficient documentation. Developers often prefer spending their time elsewhere. In practice, this time is often borrowed, and by ducking work to document their software now, developers accumulate ‘documentation debt’. Later, they must pay off this debt, spending even more time answering user questions than they might have by creating good documentation in the first place. Of course, when confronted with inadequate documentation, some users will simply give up, reducing the impact of the developer’s work.
… (emphasis in original)

Take to heart the authors’ observation on automatic generation of documentation:

The main disadvantage of automatically generated documentation is that you have less control of how to organize the documentation effectively. Whether you used a documentation generator or not, however, there are several advantages to an HTML web site compared with a PDF document. Search engines will more reliably index HTML web pages. In addition, users can more easily navigate the structure of a web page, jumping directly to the information they need.

I would replace “…less control…” with “…virtually no meaningful control…” over the organization of the documentation.

Think about it for a second. You write short comments, sometimes even incomplete sentences, as thoughts occur to you in a code or data context.

An automated tool gathers those comments, even incomplete sentences, rips them out of their original context and strings them one after the other.

Do you think that provides a meaningful narrative flow for any reader? Including yourself?

Your documentation doesn’t have to be great literature, but as Karimzadeh and Hoffman point out, good documentation can make the difference between use and adoption of your software, and your hard work being ignored.
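One of the authors' concrete recommendations, useful help messages with examples, costs almost nothing to follow. A sketch (the tool name and flags are invented for illustration):

```python
import argparse

def build_parser():
    """Build a CLI parser whose --help doubles as a quick start."""
    parser = argparse.ArgumentParser(
        prog="nucfind",  # hypothetical tool name
        description="Detect nuclei in microscopy images.",
        epilog="Example: nucfind --modality fluorescence input.tif",
    )
    parser.add_argument("image", help="input image file")
    parser.add_argument(
        "--modality",
        choices=["brightfield", "fluorescence"],
        default="fluorescence",
        help="imaging modality (default: %(default)s)",
    )
    return parser
```

The epilog example is the part users actually read; a worked invocation answers most of the emails before they are sent.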

Ping me if you want to take your documentation to the next level.

Applied Computational Genomics Course at UU: Spring 2017

Thursday, January 12th, 2017

Applied Computational Genomics Course at UU: Spring 2017 by Aaron Quinlan.

I initially noticed this resource from posts on the two-part Introduction to Unix (part 1) and Introduction to Unix (part 2).

Both may be too elementary for you, but they are something you can pass on to others. They also give you an idea of the Unix skill level required for the rest of the course.

From the GitHub page:

This course will provide a comprehensive introduction to fundamental concepts and experimental approaches in the analysis and interpretation of experimental genomics data. It will be structured as a series of lectures covering key concepts and analytical strategies. A diverse range of biological questions enabled by modern DNA sequencing technologies will be explored including sequence alignment, the identification of genetic variation, structural variation, and ChIP-seq and RNA-seq analysis. Students will learn and apply the fundamental data formats and analysis strategies that underlie computational genomics research. The primary goal of the course is for students to be grounded in theory and leave the course empowered to conduct independent genomic analyses. (emphasis in the original)

I take it successful completion will also enable you to intelligently question genomic analyses by others.

The explosive growth of genomics makes that a valuable skill in public discussions as well something nice for your toolbox.

Peer Review Fails, Again.

Saturday, April 23rd, 2016

One in 25 papers contains inappropriately duplicated images, screen finds by Cat Ferguson.

From the post:

Elisabeth Bik, a microbiologist at Stanford, has for years been a behind-the-scenes force in scientific integrity, anonymously submitting reports on plagiarism and image duplication to journal editors. Now, she’s ready to come out of the shadows.

With the help of two editors at microbiology journals, she has conducted a massive study looking for image duplication and manipulation in 20,621 published papers. Bik and co-authors Arturo Casadevall and Ferric Fang (a board member of our parent organization) found 782 instances of inappropriate image duplication, including 196 published papers containing “duplicated figures with alteration.” The study is being released as a pre-print on bioRxiv.

I don’t know which is the sadder news about this paper: that three (3) journals have so far refused to publish it, or that peer reviewers of the original papers missed the duplication.

Journals being in the business of publishing, not the business of publishing correct results, their refusal to publish an article that establishes the poor quality of their publications is perhaps understandable. Not acceptable, but understandable.

Unless the joke is on the reading public and other researchers. Publications are just that, publications. They may or may not resemble any experiment or experience that can be duplicated by others. Rely on published results at your own peril.

Transparent access to all data and not peer review is the only path to solving this problem.

Open-Source Sequence Clustering Methods Improve the State Of the Art

Wednesday, February 24th, 2016

Open-Source Sequence Clustering Methods Improve the State Of the Art by Evguenia Kopylova et al.


Sequence clustering is a common early step in amplicon-based microbial community analysis, when raw sequencing reads are clustered into operational taxonomic units (OTUs) to reduce the run time of subsequent analysis steps. Here, we evaluated the performance of recently released state-of-the-art open-source clustering software products, namely, OTUCLUST, Swarm, SUMACLUST, and SortMeRNA, against current principal options (UCLUST and USEARCH) in QIIME, hierarchical clustering methods in mothur, and USEARCH’s most recent clustering algorithm, UPARSE. All the latest open-source tools showed promising results, reporting up to 60% fewer spurious OTUs than UCLUST, indicating that the underlying clustering algorithm can vastly reduce the number of these derived OTUs. Furthermore, we observed that stringent quality filtering, such as is done in UPARSE, can cause a significant underestimation of species abundance and diversity, leading to incorrect biological results. Swarm, SUMACLUST, and SortMeRNA have been included in the QIIME 1.9.0 release.

IMPORTANCE Massive collections of next-generation sequencing data call for fast, accurate, and easily accessible bioinformatics algorithms to perform sequence clustering. A comprehensive benchmark is presented, including open-source tools and the popular USEARCH suite. Simulated, mock, and environmental communities were used to analyze sensitivity, selectivity, species diversity (alpha and beta), and taxonomic composition. The results demonstrate that recent clustering algorithms can significantly improve accuracy and preserve estimated diversity without the application of aggressive filtering. Moreover, these tools are all open source, apply multiple levels of multithreading, and scale to the demands of modern next-generation sequencing data, which is essential for the analysis of massive multidisciplinary studies such as the Earth Microbiome Project (EMP) (J. A. Gilbert, J. K. Jansson, and R. Knight, BMC Biol 12:69, 2014).

Bioinformatics has specialized clustering issues but improvements in clustering algorithms are likely to have benefits for others.

Not to mention garage gene hackers, who may benefit more directly.
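For readers outside the field, the core of greedy de novo OTU picking, the approach these tools refine, fits in a few lines. The identity measure below is a naive positional match for equal-length reads, which is an assumption on my part; real tools align sequences first:

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions (toy measure; real tools align)."""
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_cluster(reads, threshold=0.97):
    """Assign each read to the first centroid within the identity
    threshold, or make it a new centroid. Order-dependent, which is
    exactly the weakness the newer algorithms address."""
    centroids = []
    clusters = {}
    for read in reads:
        for c in centroids:
            if identity(read, c) >= threshold:
                clusters[c].append(read)
                break
        else:
            centroids.append(read)
            clusters[read] = [read]
    return clusters
```

The order-dependence is worth noticing: shuffle the reads and the centroids change, one source of the spurious OTUs the paper measures.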

Is Failing to Attempt to Replicate, “Just Part of the Whole Science Deal”?

Tuesday, February 16th, 2016

Genomeweb posted this summary of Stuart Firestein’s op-ed on failure to replicate:

Failure to replicate experiments is just part of the scientific process, writes Stuart Firestein, author and former chair of the biology department at Columbia University, in the Los Angeles Times. The recent worries over a reproducibility crisis in science are overblown, he adds.

“Science would be in a crisis if it weren’t failing most of the time,” Firestein writes. “Science is full of wrong turns, unconsidered outcomes, omissions and, of course, occasional facts.”

Failures to repeat experiments and the struggle to figure out what went wrong has also fed a number of discoveries, he says. For instance, in 1921, biologist Otto Loewi studied beating hearts from frogs in saline baths, one with the vagus nerve removed and one with it still intact. When the solution from the heart with the nerve still there was added to the other bath, that heart also slowed, suggesting that the nerve secreted a chemical that slowed the contractions.

However, Firestein notes Loewi and other researchers had trouble replicating the results for nearly six years. But that led the researchers to find that seasons can affect physiology and that temperature can affect enzyme function: Loewi’s first experiment was conducted at night and in the winter, while the follow-up ones were done during the day in heated buildings or on warmer days. This, he adds, also contributed to the understanding of how synapses fire, a finding for which Loewi shared the 1936 Nobel Prize.

“Replication is part of [the scientific] process, as open to failure as any other step,” Firestein adds. “The mistake is to think that any published paper or journal article is the end of the story and a statement of incontrovertible truth. It is a progress report.”

You will need to read Firestein’s comments in full (just part of the scientific process) to appreciate my concerns.

For example, Firestein says:

Absolutely not. Science is doing what it always has done — failing at a reasonable rate and being corrected. Replication should never be 100%. Science works beyond the edge of what is known, using new, complex and untested techniques. It should surprise no one that things occasionally come out wrong, even though everything looks correct at first.

I don’t know, would you say an 85% failure-to-replicate rate is significant? Drug development: Raise standards for preclinical cancer research, C. Glenn Begley & Lee M. Ellis, Nature 483, 531–533 (29 March 2012) doi:10.1038/483531a. Or over half of psychology studies? Over half of psychology studies fail reproducibility test. Just to name two studies on replication.

I think we can agree with Firestein that replication isn’t at 100% but at what level are the attempts to replicate?

From what Firestein says,

“Replication is part of [the scientific] process, as open to failure as any other step,” Firestein adds. “The mistake is to think that any published paper or journal article is the end of the story and a statement of incontrovertible truth. It is a progress report.”

Systematic attempts at replication (and its failure) should be part and parcel of science.

Except… it obviously isn’t.

If it were, there would have been no earth shaking announcements that fundamental cancer research experiments could not be replicated.

Failures to replicate would have been spread out over the literature and gradually resolved with better data, methods, if not both.

Failure to replicate is a legitimate part of the scientific method.

Not attempting to replicate, “I won’t look too closely at your results if you don’t look too closely at mine,” isn’t.

There’s an ugly word for avoiding looking too closely at your own results or those of others.

Tackling Zika

Thursday, February 11th, 2016

F1000Research launches rapid, open, publishing channel to help scientists tackle Zika

From the post:

ZAO provides a platform for scientists and clinicians to publish their findings and source data on Zika and its mosquito vectors within days of submission, so that research, medical and government personnel can keep abreast of the rapidly evolving outbreak.

The channel provides diamond-access: it is free to access and articles are published free of charge. It also accepts articles on other arboviruses such as Dengue and Yellow Fever.

The need for the channel is clearly evidenced by a recent report on the global response to the Ebola virus by the Harvard-LSHTM (London School of Hygiene & Tropical Medicine) Independent Panel.

The report listed ‘Research: production and sharing of data, knowledge, and technology’ among its 10 recommendations, saying: “Rapid knowledge production and dissemination are essential for outbreak prevention and response, but reliable systems for sharing epidemiological, genomic, and clinical data were not established during the Ebola outbreak.”

Dr Megan Coffee, an infectious disease clinician at the International Rescue Committee in New York, said: “What’s published six months, or maybe a year or two later, won’t help you – or your patients – now. If you’re working on an outbreak, as a clinician, you want to know what you can know – now. It won’t be perfect, but working in an information void is even worse. So, having a way to get information and address new questions rapidly is key to responding to novel diseases.”

Dr. Coffee is also a co-author of an article published in the channel today, calling for rapid mobilisation and adoption of open practices in an important strand of the Zika response: drug discovery –

Sean Ekins, of Collaborative Drug Discovery, and lead author of the article, which is titled ‘Open drug discovery for the Zika virus’, said: “We think that we would see rapid progress if there was some call for an open effort to develop drugs for Zika. This would motivate members of the scientific community to rally around, and centralise open resources and ideas.”

Another co-author, of the article, Lucio Freitas-Junior of the Brazilian Biosciences National Laboratory, added: “It is important to have research groups working together and sharing data, so that scarce resources are not wasted in duplication. This should always be the case for neglected diseases research, and even more so in the case of Zika.”

Rebecca Lawrence, Managing Director, F1000, said: “One of the key conclusions of the recent Harvard-LSHTM report into the global response to Ebola was that rapid, open data sharing is essential in disease outbreaks of this kind and sadly it did not happen in the case of Ebola.

“As the world faces its next health crisis in the form of the Zika virus, F1000Research has acted swiftly to create a free, dedicated channel in which scientists from across the globe can share new research and clinical data, quickly and openly. We believe that it will play a valuable role in helping to tackle this health crisis.”


For more information:

Andrew Baud, Tala (on behalf of F1000), +44 (0) 20 3397 3383 or +44 (0) 7775 715775

Excellent news for researchers but a direct link to the new channel would have been helpful as well: Zika & Arbovirus Outbreaks (ZAO).

See this post: The Zika & Arbovirus Outbreaks channel on F1000Research by Thomas Ingraham.

News organizations should note that as of today, 11 February 2016, ZAO offers 9 articles, 16 posters and 1 set of slides. Those numbers are likely to increase rapidly.

Oh, did I mention the ZAO channel is free?

Unlike some journals, payment, prestige, privilege, are not pre-requisites for publication.

Useful research on Zika & Arboviruses is the only requirement.

I know, sounds like a dangerous precedent but defeating a disease like Zika will require taking risks.

Fast search of thousands of short-read sequencing experiments [NEW! Sequence Bloom Tree]

Monday, February 8th, 2016

Fast search of thousands of short-read sequencing experiments by Brad Solomon & Carl Kingsford.

Abstract from the “official” version at Nature Biotechnology (2016):

The amount of sequence information in public repositories is growing at a rapid rate. Although these data are likely to contain clinically important information that has not yet been uncovered, our ability to effectively mine these repositories is limited. Here we introduce Sequence Bloom Trees (SBTs), a method for querying thousands of short-read sequencing experiments by sequence, 162 times faster than existing approaches. The approach searches large data archives for all experiments that involve a given sequence. We use SBTs to search 2,652 human blood, breast and brain RNA-seq experiments for all 214,293 known transcripts in under 4 days using less than 239 MB of RAM and a single CPU. Searching sequence archives at this scale and in this time frame is currently not possible using existing tools.

That will set you back $32 for the full text and PDF.

Or, you can try the unofficial version:


Enormous databases of short-read RNA-seq sequencing experiments such as the NIH Sequence Read Archive (SRA) are now available. However, these collections remain difficult to use due to the inability to search for a particular expressed sequence. A natural question is which of these experiments contain sequences that indicate the expression of a particular sequence such as a gene isoform, lncRNA, or uORF. However, at present this is a computationally demanding question at the scale of these databases.

We introduce an indexing scheme, the Sequence Bloom Tree (SBT), to support sequence-based querying of terabase-scale collections of thousands of short-read sequencing experiments. We apply SBT to the problem of finding conditions under which query transcripts are expressed. Our experiments are conducted on a set of 2652 publicly available RNA-seq experiments contained in the NIH for the breast, blood, and brain tissues, comprising 5 terabytes of sequence. SBTs of this size can be queried for a 1000 nt sequence in 19 minutes using less than 300 MB of RAM, over 100 times faster than standard usage of SRA-BLAST and 119 times faster than STAR. SBTs allow for fast identification of experiments with expressed novel isoforms, even if these isoforms were unknown at the time the SBT was built. We also provide some theoretical guidance about appropriate parameter selection in SBT and propose a sampling-based scheme for potentially scaling SBT to even larger collections of files. While SBT can handle any set of reads, we demonstrate the effectiveness of SBT by searching a large collection of blood, brain, and breast RNA-seq files for all 214,293 known human transcripts to identify tissue-specific transcripts.

The implementation used in the experiments below is in C++ and is available as open source at ~ckingsf/software/bloomtree.
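To make the idea concrete, here is a toy sketch of the SBT query scheme in Python (my own illustration, not the authors' C++ implementation, and the parameter values are invented): each leaf holds a Bloom filter over the k-mers of one sequencing experiment, each internal node holds the bitwise union of its children's filters, and a query prunes any subtree whose node misses too many of the query's k-mers.

```python
import hashlib

def kmers(seq, k=5):
    """All length-k substrings of a sequence (toy k; real SBTs use ~20)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

class BloomFilter:
    """Toy Bloom filter: m bits in one integer, h hashes via salted blake2b."""
    def __init__(self, m=1024, h=3):
        self.m, self.h, self.bits = m, h, 0

    def _positions(self, item):
        for seed in range(self.h):
            d = hashlib.blake2b(item.encode(), salt=seed.to_bytes(16, "big"))
            yield int.from_bytes(d.digest()[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        return all(self.bits >> p & 1 for p in self._positions(item))

class SBTNode:
    """Node of a toy Sequence Bloom Tree: an internal node's filter is the
    bitwise union of the filters of every experiment below it."""
    def __init__(self, name=None, children=()):
        self.name, self.children = name, list(children)
        self.bf = BloomFilter()
        for c in self.children:
            self.bf.bits |= c.bf.bits  # union of children's filters

    @classmethod
    def leaf(cls, name, reads):
        node = cls(name)
        for r in reads:
            for km in kmers(r):
                node.bf.add(km)
        return node

def query(node, seq, theta=0.9):
    """Return experiments whose filter contains at least a fraction theta
    of the query's k-mers; a node that fails prunes its whole subtree."""
    kms = kmers(seq)
    hits = sum(km in node.bf for km in kms)
    if hits < theta * len(kms):
        return []                      # entire subtree pruned
    if not node.children:
        return [node.name]
    return [n for c in node.children for n in query(c, seq, theta)]
```

The pruning is what buys the reported speed: most of the thousands of experiments are rejected near the root without ever touching their individual filters.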

You will probably be interested in review comments by C. Titus Brown, Thoughts on Sequence Bloom Trees.

As of today, the exact string “Sequence Bloom Tree” gathers only 207 “hits” so the literature is still small enough to be read.

Don’t delay too long in pursuing this new search technique!

I first saw this in a tweet by Stephen Turner.

Unpronounceable — why can’t people give bioinformatics tools sensible names?

Monday, November 16th, 2015

Unpronounceable — why can’t people give bioinformatics tools sensible names? by Keith Bardnam.

From the post:

Okay, so many of you know that I have a bit of an issue with bioinformatics tools with names that are formed from very tenuous acronyms or initialisms. I’ve handed out many JABBA awards for cases of ‘Just Another Bogus Bioinformatics Acronym’. But now there is another blight on the landscape of bioinformatics nomenclature…that of unpronounceable names.

If you develop bioinformatics tools, you would hopefully want to promote those tools to others. This could be in a formal publication, or at a conference presentation, or even over a cup of coffee with a colleague. In all of these situations, you would hope that the name of your bioinformatics tool should be memorable. One way of making it memorable is to make it pronounceable. Surely, that’s not asking that much? And yet…

The examples Keith recites are quite amusing and you can find more at the JABBA awards.

He also includes some helpful advice on naming:

There is a lot of bioinformatics software in this world. If you choose to add to this ever growing software catalog, then it will be in your interest to make your software easy to discover and easy to promote. For your own sake, and for the sake of any potential users of your software, I strongly urge you to ask yourself the following five questions:

  1. Is the name memorable?
  2. Does the name have one obvious pronunciation?
  3. Could I easily spell the name out to a journalist over the phone?
  4. Is the name of my database tool free from any needless mixed capitalization?
  5. Have I considered whether my software name is based on such a tenuous acronym or intialism that it will probably end up receiving a JABBA award?

To which I would add:

6. Have you searched the name in popular Internet search engines?

I read a fair amount of computing news, and little is more annoying than searching for a new “name” only to find it has 10 million “hits.” Any results relevant to the new usage are buried somewhere in that long list.

Two-word names fare better, and three-word names better still. That is, if you want people to find your project, paper, or software.

If not, then by all means use one of the most popular child name lists. You will know where to find your work, but the rest of us won’t.

The Gene Hackers [Chaos Remains King]

Tuesday, November 10th, 2015

The Gene Hackers by Michael Specter.

From the post:

It didn’t take Zhang or other scientists long to realize that, if nature could turn these molecules into the genetic equivalent of a global positioning system, so could we. Researchers soon learned how to create synthetic versions of the RNA guides and program them to deliver their cargo to virtually any cell. Once the enzyme locks onto the matching DNA sequence, it can cut and paste nucleotides with the precision we have come to expect from the search-and-replace function of a word processor. “This was a finding of mind-boggling importance,” Zhang told me. “And it set off a cascade of experiments that have transformed genetic research.”

With CRISPR, scientists can change, delete, and replace genes in any animal, including us. Working mostly with mice, researchers have already deployed the tool to correct the genetic errors responsible for sickle-cell anemia, muscular dystrophy, and the fundamental defect associated with cystic fibrosis. One group has replaced a mutation that causes cataracts; another has destroyed receptors that H.I.V. uses to infiltrate our immune system.

The potential impact of CRISPR on the biosphere is equally profound. Last year, by deleting all three copies of a single wheat gene, a team led by the Chinese geneticist Gao Caixia created a strain that is fully resistant to powdery mildew, one of the world’s most pervasive blights. In September, Japanese scientists used the technique to prolong the life of tomatoes by turning off genes that control how quickly they ripen. Agricultural researchers hope that such an approach to enhancing crops will prove far less controversial than using genetically modified organisms, a process that requires technicians to introduce foreign DNA into the genes of many of the foods we eat.

The technology has also made it possible to study complicated illnesses in an entirely new way. A few well-known disorders, such as Huntington’s disease and sickle-cell anemia, are caused by defects in a single gene. But most devastating illnesses, among them diabetes, autism, Alzheimer’s, and cancer, are almost always the result of a constantly shifting dynamic that can include hundreds of genes. The best way to understand those connections has been to test them in animal models, a process of trial and error that can take years. CRISPR promises to make that process easier, more accurate, and exponentially faster.

Deeply compelling read on the stellar career of Feng Zhang and his use of “clustered regularly interspaced short palindromic repeats” (CRISPR) for genetic engineering.

If you are up for the technical side, try PubMed on CRISPR at 2,306 “hits” as of today.

If not, continue with Michael’s article. You will get enough background to realize this is a very profound moment in the development of genetic engineering.

A profound moment that can be made all the more valuable by linking its results to the results (not articles or summaries of articles) of prior research.

Proposals for repackaging data in some yet-to-be-invented format are a non-starter from my perspective. That is more akin to the EU science/WPA projects than a realistic prospect for value-add.

Let’s start with the assumption that when held in electronic format, data has its native format as a given. Nothing we can change about that part of the problem of access.

Whether labbooks, databases, triple stores, etc.

That one assumption reduces worries about corrupting the original data and introduces a sense of “tinkering” with existing data interfaces. (Watch for a post tomorrow on the importance of “tinkering.”)

Hmmm, nodes anyone?

PS: I am not overly concerned about genetic “engineering.” My money is riding on chaos in genetics and environmental factors.

Text Mining Meets Neural Nets: Mining the Biomedical Literature

Wednesday, October 28th, 2015

Text Mining Meets Neural Nets: Mining the Biomedical Literature by Dan Sullivan.

From the webpage:

Text mining and natural language processing employ a range of techniques from syntactic parsing, statistical analysis, and more recently deep learning. This presentation presents recent advances in dense word representations, also known as word embedding, and their advantages over sparse representations, such as the popular term frequency-inverse document frequency (tf-idf) approach. It also discusses convolutional neural networks, a form of deep learning that is proving surprisingly effective in natural language processing tasks. Reference papers and tools are included for those interested in further details. Examples are drawn from the bio-medical domain.

Basically an abstract for the 58 slides you will find here:

The best thing about these slides is the wealth of additional links to other resources. There is only so much you can say on a slide so links to more details should be a standard practice.

Slide 53, Formalize a Mathematical Model of Semantics, seems a bit ambitious to me, considering that mathematics is a subset of natural language. It is difficult to see how the lesser could model the greater.

You could create a mathematical model of some semantics and say it was all that is necessary, but that’s been done before. Always strive to make new mistakes.
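For readers who want the abstract's sparse-versus-dense contrast in concrete form, here is a toy tf-idf representation in Python (the documents and vocabulary are invented for illustration). Two documents that share no vocabulary get a similarity of exactly zero, which is precisely the weakness dense word embeddings address:

```python
import math
from collections import Counter

# Toy corpus (invented): three short biomedical "documents"
docs = [
    "gene expression in tumor cells".split(),
    "tumor suppressor gene mutation".split(),
    "neural network text mining".split(),
]

def tfidf(doc):
    """Sparse representation: one weight per word the document contains,
    implicitly zero for every other word in the vocabulary."""
    n = len(docs)
    vec = {}
    for w, c in Counter(doc).items():
        df = sum(w in d for d in docs)            # document frequency
        vec[w] = (c / len(doc)) * math.log(n / df)
    return vec

def cosine(u, v):
    dot = sum(u.get(w, 0) * v.get(w, 0) for w in set(u) | set(v))
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

sim_related = cosine(tfidf(docs[0]), tfidf(docs[1]))    # share "gene", "tumor"
sim_unrelated = cosine(tfidf(docs[0]), tfidf(docs[2]))  # no shared words -> 0.0
```

A dense embedding would place "tumor" and "cancer" near each other even with no literal word overlap; tf-idf, as above, cannot.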

Big Data to Knowledge (Biomedical)

Tuesday, July 28th, 2015

Big Data to Knowledge (BD2K) Development of Software Tools and Methods for Biomedical Big Data in Targeted Areas of High Need (U01).


Open Date (Earliest Submission Date) September 6, 2015

Letter of Intent Due Date(s) September 6, 2015

Application Due Date(s) October 6, 2015

Scientific Merit Review February 2016

Advisory Council Review May 2016

Earliest Start Date July 2016

From the webpage:

The purpose of this BD2K Funding Opportunity Announcement (FOA) is to solicit development of software tools and methods in the three topic areas of Data Privacy, Data Repurposing, and Applying Metadata, all as part of the overall BD2K initiative. While this FOA is intended to foster new development, submissions consisting of significant adaptations of existing methods and software are also invited.

The instructions say to submit early so that corrections to your application can be suggested. (Take the advice.)

Topic maps, particularly with customized subject identity rules, are a nice fit to the detailed requirements you will find at the grant site.

Ping me if you are interested in discussing why you should include topic maps in your application.

The European Bioinformatics Institute

Friday, July 10th, 2015

The European Bioinformatics Institute

From the webpage:

EMBL-EBI provides freely available data from life science experiments, performs basic research in computational biology and offers an extensive user training programme, supporting researchers in academia and industry.

I have mentioned the European Bioinformatics Institute before in connection with specific projects but never as a stand-alone entry. The Institute is part of the European Molecular Biology Laboratory.

More from the webpage:

Our mission

  • To provide freely available data and bioinformatics services to all facets of the scientific community
  • To contribute to the advancement of biology through basic investigator-driven research
  • To provide advanced bioinformatics training to scientists at all levels
  • To help disseminate cutting-edge technologies to industry
  • To coordinate biological data provision throughout Europe

This should be high on your list of bioinformatics bookmarks!

Biomedical citation maker

Thursday, July 9th, 2015

Biomedical citation maker

I encountered this yesterday while writing about PMID, PMCID and DOI identifiers.

From the webpage:

This page creates biomedical citations to use when for writing on websites. In addition, the citation maker can optionally quote the conclusion of an article and help present numeric results from clinical studies that report diagnosis and treatment.

There is also a bookmarklet if you are interested.

Automatic formatting of citations is the only way to ensure quality citation practices. Variations in citations can cause others to miss the cited article or related materials.

I don’t regard correct citations as an excessive burden. Some authors do.
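As an illustration of why automation helps, assembling a PubMed-style citation from structured fields is entirely mechanical. The helper below is hypothetical, not the citation maker's actual API; the example data is the Diehl citation used elsewhere on this blog:

```python
def pubmed_citation(authors, title, journal, year_issue, pmid):
    """Build a PubMed-style citation string from structured fields.
    (Hypothetical helper: the real tool resolves the fields from a PMID.)"""
    return f"{authors}. {title}. {journal}. {year_issue}. PubMed PMID: {pmid}."

c = pubmed_citation(
    "Diehl SJ",
    "Incorporating health literacy into adult basic education: "
    "from life skills to life saving",
    "N C Med J",
    "2007 Sep-Oct;68(5):336-9. Review",
    "18183754",
)
```

Generating the string from fields means punctuation, ordering, and the PMID suffix can never drift between citations, which hand-typing cannot guarantee.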

PMID-PMCID-DOI Mappings (monthly update)

Wednesday, July 8th, 2015

PMID-PMCID-DOI Mappings (monthly update)

Dario Taraborelli tweets:

All PMID-PMCID-DOI mappings known by @EuropePMC_news, refreshed monthly

The file is listed at 150MB, but be aware that it decompresses to 909MB+, approximately 25.6 million lines.

In case you are unfamiliar with PMID/PMCID:

PMID and PMCID are not the same thing.

PMID is the unique identifier number used in PubMed. They are assigned to each article record when it enters the PubMed system, so an in press publication will not have one unless it is issued as an electronic pre-pub. The PMID# is always found at the end of a PubMed citation.

Example of PMID#: Diehl SJ. Incorporating health literacy into adult basic education: from life skills to life saving. N C Med J. 2007 Sep-Oct;68(5):336-9. Review. PubMed PMID: 18183754.

PMCID is the unique identifier number used in PubMed Central. People are usually looking for this number in order to comply with the NIH Public Access Regulations. We have a webpage that gathers information to guide compliance. You can find it here: (broken link) [updated link:]

A PMCID# is assigned after an author manuscript is deposited into PubMed Central. Some journals will deposit for you. Is this your publication? What is the journal?

PMCID#s can be found at the bottom of an article citation in PubMed, but only for articles that have been deposited in PubMed Central.

Example of a PMCID#: Ishikawa H, Kiuchi T. Health literacy and health communication. Biopsychosoc Med. 2010 Nov 5;4:18. PubMed PMID: 21054840; PubMed Central PMCID: PMC2990724.

From: how do I find the PMID (is that the same as the PMCID?) for in press publications?

If I were converting this into a topic map, I would use the PMID, PMCID, and DOI entries as subject identifiers. (PMIDs and PMCIDs can be expressed as hrefs.)
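A minimal sketch of that conversion in Python, assuming the download is a CSV with a PMID,PMCID,DOI header (the column names are an assumption; check the actual file before relying on them):

```python
import csv

def load_mappings(path):
    """Index a PMID-PMCID-DOI mapping file (assumed CSV with a
    PMID,PMCID,DOI header) by PMID. PMCID or DOI may be empty."""
    by_pmid = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            by_pmid[row["PMID"]] = {
                "pmcid": row["PMCID"] or None,
                "doi": row["DOI"] or None,
            }
    return by_pmid

def subject_identifiers(pmid, mapping):
    """Express the identifiers as resolvable hrefs, suitable for use
    as topic map subject identifiers."""
    ids = [f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/"]
    rec = mapping.get(pmid, {})
    if rec.get("pmcid"):
        ids.append(f"https://www.ncbi.nlm.nih.gov/pmc/articles/{rec['pmcid']}/")
    if rec.get("doi"):
        ids.append(f"https://doi.org/{rec['doi']}")
    return ids
```

With all three identifiers expressed as hrefs on one topic, a citation using any one of PMID, PMCID, or DOI merges with citations using the others.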

Medical Sieve [Information Sieve]

Sunday, June 28th, 2015

Medical Sieve

An effort to capture anomalies from medical imaging, package those with other data, and deliver it for use by clinicians.

If you think of each medical image as representing a large amount of data, the underlying idea is to filter out all but the most relevant data, so that clinicians are not confronting an overload of information.

In network terms, rather than displaying all of the current connections to a network (the ever-popular eye-candy view of connections), it displays only those connections that differ from all the rest.

The same technique could be usefully applied in a number of “big data” areas.

From the post:

Medical Sieve is an ambitious long-term exploratory grand challenge project to build a next generation cognitive assistant with advanced multimodal analytics, clinical knowledge and reasoning capabilities that is qualified to assist in clinical decision making in radiology and cardiology. It will exhibit a deep understanding of diseases and their interpretation in multiple modalities (X-ray, Ultrasound, CT, MRI, PET, Clinical text) covering various radiology and cardiology specialties. The project aims at producing a sieve that filters essential clinical and diagnostic imaging information to form anomaly-driven summaries and recommendations that tremendously reduce the viewing load of clinicians without negatively impacting diagnosis.

Statistics show that eye fatigue is a common problem with radiologists as they visually examine a large number of images per day. An emergency room radiologist may look at as many as 200 cases a day, and some of these imaging studies, particularly lower body CT angiography, can be as many as 3000 images per study. Due to the volume overload, and limited amount of clinical information available as part of imaging studies, diagnosis errors, particularly relating to coincidental diagnosis cases, can occur. With radiologists also being a scarce resource in many countries, it will be even more important to reduce the volume of data to be seen by clinicians, particularly when they have to be sent over low bandwidth teleradiology networks.

MedicalSieve is an image-guided informatics system that acts as a medical sieve filtering the essential clinical information physicians need to know about the patient for diagnosis and treatment planning. The system gathers clinical data about the patient from a variety of enterprise systems in hospitals including EMR, pharmacy, labs, ADT, and radiology/cardiology PACs systems using HL7 and DICOM adapters. It then uses sophisticated medical text and image processing, pattern recognition and machine learning techniques guided by advanced clinical knowledge to process clinical data about the patient to extract meaningful summaries indicating the anomalies. Finally, it creates advanced summaries of imaging studies capturing the salient anomalies detected in various viewpoints.

Medical Sieve is leading the way in diagnostic interpretation of medical imaging datasets guided by clinical knowledge with many first-time inventions including (a) the first fully automatic spatio-temporal coronary stenosis detection and localization from 2D X-ray angiography studies, (b) novel methods for highly accurate benign/malignant discrimination in breast imaging, and (c) the first automated production of the AHA-guideline 17-segment model for cardiac MRI diagnosis.

For more details on the project, please contact Tanveer Syeda-Mahmood.

You can watch a demo of our Medical Sieve Cognitive Assistant Application here.

Curious: how would you specify the exclusions of information, so that you could replicate the “filtered” view of the data?

Replication is a major issue in publicly funded research these days. No reason for that to be any different for data science.
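One answer to that question: make the filter itself a declarative, serialisable artifact and publish its fingerprint alongside every filtered view. A sketch, with invented field names (the real system's anomaly scores and modality codes will differ):

```python
import hashlib
import json

# Declarative filter specification (field names invented for illustration):
# keep records from the named modalities whose anomaly score clears a threshold.
spec = {"min_anomaly_score": 0.8, "modalities": ["CT", "MRI"]}

def apply_filter(records, spec):
    """Produce the 'sieved' view: only sufficiently anomalous records
    from the modalities the spec names."""
    return [r for r in records
            if r["modality"] in spec["modalities"]
            and r["anomaly_score"] >= spec["min_anomaly_score"]]

def spec_fingerprint(spec):
    """Hash the canonical JSON form of the spec. Storing this hash with
    the filtered view lets anyone replicate, or audit, the sieve."""
    canonical = json.dumps(spec, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()
```

Because the spec is data rather than code buried in a pipeline, the same filtered view can be regenerated, and verified against the fingerprint, by anyone with the source records.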


Memantic Is Online!

Monday, June 1st, 2015


I first blogged about the Memantic paper: Memantic: A Medical Knowledge Discovery Engine in March of this year and am very happy to now find it online!

From the about page:

Memantic captures relationships between medical concepts by mining biomedical literature and organises these relationships visually according to a well-known medical ontology. For example, a search for “Vitamin B12 deficiency” will yield a visual representation of all related diseases, symptoms and other medical entities that Memantic has discovered from the 25 million medical publications and abstracts mentioned above, as well as a number of medical encyclopaedias.

The user can explore a relationship of interest (such as the one between “Vitamin B12 deficiency” and “optic neuropathy”, for instance) by clicking on it, which will bring up links to all the scientific texts that have been discovered to support that relationship. Furthermore, the user can select the desired type of related concepts — such as “diseases”, “symptoms”, “pharmacological agents”, “physiological functions”, and so on — and use it as a filter to make the visualisation even more concise. Finally, the related concepts can be semantically grouped into an expandable tree hierarchy to further reduce screen clutter and to let the user quickly navigate to the relevant area of interest.

Concisely organising related medical entities without duplication

Memantic first presents all medical terms related to the query concept and then groups publications by the presence of each such term in addition to the query itself. The hierarchical nature of this grouping allows the user to quickly establish previously unencountered relationships and to drill down into the hierarchy to only look at the papers concerning such relationships. Contrast this with the same search performed on Google, where the user normally gets a number of links, many of which have the same title; the user has to go through each link to see if it contains any novel information that is relevant to their query.

Keeping the index of relationships up-to-date

Memantic perpetually renews its index by continuously mining the biomedical literature, extracting new relationships and adding supporting publications to the ones already discovered. The key advantage of Memantic’s user interface is that novel relationships become apparent to the user much quicker than on standard search engines. For example, Google may index a new research paper that exposes a previously unexplored connection between a particular drug and the disease that is being searched for by the user. However, Google may not assign that paper the sufficient weight for it to appear in the first few pages of the search results, thus making it invisible to the people searching for the disease who do not persevere in clicking past those initial pages.

To get a real feel for what the site is capable of, you need to create an account (free) and try it for yourself.

I am not a professional medical researcher, but I was able to duplicate some prior research I have done on edge-case conditions fairly quickly. Whether that was due to the interface and its techniques or to my knowledge of the subject area is hard to say.

The interface alone is worth the visit.

Do give Memantic a spin! I think you will like what you find.

GOBLET: The Global Organisation for Bioinformatics Learning, Education and Training

Thursday, April 16th, 2015

GOBLET: The Global Organisation for Bioinformatics Learning, Education and Training by Teresa K. Atwood, et al. (PLOS Published: April 9, 2015 DOI: 10.1371/journal.pcbi.1004143)


In recent years, high-throughput technologies have brought big data to the life sciences. The march of progress has been rapid, leaving in its wake a demand for courses in data analysis, data stewardship, computing fundamentals, etc., a need that universities have not yet been able to satisfy—paradoxically, many are actually closing “niche” bioinformatics courses at a time of critical need. The impact of this is being felt across continents, as many students and early-stage researchers are being left without appropriate skills to manage, analyse, and interpret their data with confidence. This situation has galvanised a group of scientists to address the problems on an international scale. For the first time, bioinformatics educators and trainers across the globe have come together to address common needs, rising above institutional and international boundaries to cooperate in sharing bioinformatics training expertise, experience, and resources, aiming to put ad hoc training practices on a more professional footing for the benefit of all.

Great background on GOBLET.

One of the functions of GOBLET is to share training materials in bioinformatics and that is well underway. The Training Portal has eighty-nine (89) sets of training materials as of today, ranging from Pathway and Network Analysis 2014 Module 1 – Introduction to Gene Lists to Parsing data records using Python programming and points in between!

If your training materials aren’t represented, perhaps it is time for you to correct that oversight.


I first saw this in a tweet by Mick Watson.

Photoshopping Science? Where Was Peer Review?

Sunday, April 5th, 2015

Too Much to be Nothing? by Leonid Schneider.

From the post:

(March 24th, 2015) Already at an early age, Olivier Voinnet had achieved star status among plant biologists – until suspicions arose last year that more than 30 of his publications contained dubious images. Voinnet’s colleagues are shocked – and demand an explanation.

Several months ago, a small group of international plant scientists set themselves the task of combing through the relevant literature for evidence of potential data manipulation. They posted their discoveries on the post-publication peer review platform PubPeer. As one of these anonymous scientists (whose real name is known to Laborjournal/Lab Times) explained, all this detective work was accomplished simply by taking a good look at the published figures. Soon, the scientists stumbled on something unexpected: putative image manipulations in the papers of one of the most eminent scientists in the field, Sir David Baulcombe. Even more strikingly, all these suspicious publications (currently seven, including papers in Cell, PNAS and EMBO J) featured his former PhD student, Olivier Voinnet, as first or co-author.

Baulcombe’s research group at The Sainsbury Laboratory (TSL) in Norwich, England, has discovered nothing less than RNA interference (RNAi) in plants, the famous viral defence mechanism, which went on to revolutionise biomedical research as a whole and the technology of controlled gene silencing in particular. Olivier Voinnet himself also prominently contributed to this discovery, which certainly helped him, then only 33 years old, to land a research group leader position at the CNRS Institute for Plant Molecular Biology in Strasbourg, in his native country, France. During his time in Strasbourg, Voinnet won many prestigious prizes and awards, such as the ERC Starting Grant and the EMBO Young Investigator Award, plus the EMBO Gold Medal. Finally, at the end of 2010, the Swiss Federal Institute of Technology (ETH) in Zürich appointed the 38-year-old EMBO Member as Professor of RNA biology. Shortly afterwards, Voinnet was awarded the well-endowed Max Rössler Prize of the ETH.

Disturbing news from the plant sciences of evidence of photo manipulation in published articles.

The post examines the charges at length and indicates what is or is not known at this juncture. Investigations are underway and reports from those investigation will appear in the future.

A step that could be taken now, since the articles in question (about 20) have been published, would be for the journals to disclose the peer reviewers who failed to catch the photo manipulation.

The premise of peer review is holding an author responsible for the content of their article so it is only fair to hold peer reviewers responsible for articles approved by their reviews.

Peer review isn’t much of a gate keeper if it is unable to discover false information or even patterns of false information prior to publication.

I haven’t been reading Lab Times on a regular basis but it looks like I need to correct that oversight.

Classifying Plankton With Deep Neural Networks

Monday, March 23rd, 2015

Classifying Plankton With Deep Neural Networks by Sander Dieleman.

From the post:

The National Data Science Bowl, a data science competition where the goal was to classify images of plankton, has just ended. I participated with six other members of my research lab, the Reservoir lab of prof. Joni Dambre at Ghent University in Belgium. Our team finished 1st! In this post, we’ll explain our approach.

The ≋ Deep Sea ≋ team consisted of Aäron van den Oord, Ira Korshunova, Jeroen Burms, Jonas Degrave, Lionel Pigou, Pieter Buteneers and myself. We are all master students, PhD students and post-docs at Ghent University. We decided to participate together because we are all very interested in deep learning, and a collaborative effort to solve a practical problem is a great way to learn.

There were seven of us, so over the course of three months, we were able to try a plethora of different things, including a bunch of recently published techniques, and a couple of novelties. This blog post was written jointly by the team and will cover all the different ingredients that went into our solution in some detail.


This blog post is going to be pretty long! Here’s an overview of the different sections. If you want to skip ahead, just click the section title to go there.


The problem

The goal of the competition was to classify grayscale images of plankton into one of 121 classes. They were created using an underwater camera that is towed through an area. The resulting images are then used by scientists to determine which species occur in this area, and how common they are. There are typically a lot of these images, and they need to be annotated before any conclusions can be drawn. Automating this process as much as possible should save a lot of time!

The images obtained using the camera were already processed by a segmentation algorithm to identify and isolate individual organisms, and then cropped accordingly. Interestingly, the size of an organism in the resulting images is proportional to its actual size, and does not depend on the distance to the lens of the camera. This means that size carries useful information for the task of identifying the species. In practice it also means that all the images in the dataset have different sizes.

Participants were expected to build a model that produces a probability distribution across the 121 classes for each image. These predicted distributions were scored using the log loss (which corresponds to the negative log likelihood or equivalently the cross-entropy loss).

This loss function has some interesting properties: for one, it is extremely sensitive to overconfident predictions. If your model predicts a probability of 1 for a certain class, and it happens to be wrong, the loss becomes infinite. It is also differentiable, which means that models trained with gradient-based methods (such as neural networks) can optimize it directly – it is unnecessary to use a surrogate loss function.

Interestingly, optimizing the log loss is not quite the same as optimizing classification accuracy. Although the two are obviously correlated, we paid special attention to this because it was often the case that significant improvements to the log loss would barely affect the classification accuracy of the models.
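The log loss properties described above are easy to see in a few lines of code. This is a minimal illustrative implementation (not the competition's official scoring script), with the standard clipping trick that keeps a single overconfident wrong prediction from producing an infinite loss:

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-15):
    """Multi-class log loss: mean negative log probability of the true class.

    y_true: integer class labels, shape (n,)
    y_pred: predicted probabilities, shape (n, n_classes)
    Probabilities are clipped away from 0 and 1 so the loss stays finite.
    """
    y_pred = np.clip(y_pred, eps, 1 - eps)
    n = len(y_true)
    return -np.mean(np.log(y_pred[np.arange(n), y_true]))

probs = np.array([[0.98, 0.01, 0.01]])

# Confident and correct: a small loss.
print(log_loss(np.array([0]), probs))

# The same confidence on the wrong class is penalized roughly 200x harder,
# which is why overconfident models are so dangerous under this metric.
print(log_loss(np.array([1]), probs))
```

Without the clipping, predicting exactly 0 for the true class would make `np.log` return `-inf`, matching the "loss becomes infinite" behavior the post describes.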

This rocks!

Code is coming soon to Github!

Certainly of interest to marine scientists but also to anyone in bio-medical imaging.

The problem of too much data and too few experts is a common one.

What I don’t recall seeing are releases of pre-trained classifiers. Is the art developing too quickly for that to be a viable product? Just curious.

I first saw this in a tweet by Angela Zutavern.

Memantic: A Medical Knowledge Discovery Engine

Saturday, March 21st, 2015

Memantic: A Medical Knowledge Discovery Engine by Alexei Yavlinsky.


We present a system that constructs and maintains an up-to-date co-occurrence network of medical concepts based on continuously mining the latest biomedical literature. Users can explore this network visually via a concise online interface to quickly discover important and novel relationships between medical entities. This enables users to rapidly gain contextual understanding of their medical topics of interest, and we believe this constitutes a significant user experience improvement over contemporary search engines operating in the biomedical literature domain.

Alexei takes advantage of prior work on medical literature to index and display searches of medical literature in an “economical” way that can enable researchers to discover new relationships in the literature without being overwhelmed by bibliographic detail.

You will need to check my summary against the article but here is how I would describe Memantic:

Memantic indexes medical literature and records the co-occurrences of terms in every text. Those terms are mapped into a standard medical ontology (which reduces screen clutter). When a search is performed, the results are displayed as nodes based on the medical ontology and include relationships established by the co-occurrences found during indexing. This enables users to find relationships without the necessity of searching through multiple articles or deduping their search results manually.
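To make my summary concrete, here is a toy sketch of the co-occurrence indexing idea. The term-to-concept mapping below is purely hypothetical (Memantic uses a standard medical ontology for this step), but it shows why mapping to shared concepts before counting matters:

```python
from collections import Counter
from itertools import combinations

# Toy "ontology mapping": surface terms -> canonical concept names.
# Illustrative only; in Memantic this role is played by a medical ontology.
CONCEPT_OF = {
    "heart attack": "Myocardial Infarction",
    "myocardial infarction": "Myocardial Infarction",
    "aspirin": "Aspirin",
    "clot": "Thrombus",
}

def cooccurrences(documents):
    """Count how often each pair of concepts appears in the same document."""
    counts = Counter()
    for doc in documents:
        # Deduplicate within a document by mapping terms to concepts first.
        concepts = {CONCEPT_OF[t] for t in doc if t in CONCEPT_OF}
        for pair in combinations(sorted(concepts), 2):
            counts[pair] += 1
    return counts

docs = [
    ["heart attack", "aspirin"],
    ["myocardial infarction", "aspirin", "clot"],
]
network = cooccurrences(docs)
# ("Aspirin", "Myocardial Infarction") is counted in both documents even
# though the surface terms differ -- the ontology mapping merges them.
```

The resulting pair counts are exactly the edge weights of a co-occurrence network like the one Memantic visualizes.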

As I understand it, Memantic is as much an effort at efficient visualization as it is an improvement in search technique.

Very much worth a slow read over the weekend!

I first saw this in a tweet by Sami Ghazali.

PS: I tried viewing the videos listed in the paper but wasn’t able to get any sound? Maybe you will have better luck.

UK Bioinformatics Landscape

Wednesday, March 18th, 2015

UK Bioinformatics Landscape

Two of the four known challenges in the UK bioinformatics landscape could be addressed by topic maps:

  • Data integration and management of omics data, enabling the use of “big data” across thematic areas and to facilitate data sharing;
  • Open innovation, or pre-competitive approaches in particular to data sharing and data standardisation

I say could be addressed by topic maps because I’m not sure what else you would use to address data integration issues, at least robustly. If you don’t mind paying to migrate data when terminology changes enough to impair your effectiveness, and continuing to pay for every future migration, I suppose that is one solution.

Given the choice, I suspect many people would like to exit the wheel of ETL.

NIH-led effort launches Big Data portal for Alzheimer’s drug discovery

Tuesday, March 10th, 2015

NIH-led effort launches Big Data portal for Alzheimer’s drug discovery

From the post:

A National Institutes of Health-led public-private partnership to transform and accelerate drug development achieved a significant milestone today with the launch of a new Alzheimer’s Big Data portal — including delivery of the first wave of data — for use by the research community. The new data sharing and analysis resource is part of the Accelerating Medicines Partnership (AMP), an unprecedented venture bringing together NIH, the U.S. Food and Drug Administration, industry and academic scientists from a variety of disciplines to translate knowledge faster and more successfully into new therapies.

The opening of the AMP-AD Knowledge Portal and release of the first wave of data will enable sharing and analyses of large and complex biomedical datasets. Researchers believe this approach will ramp up the development of predictive models of Alzheimer’s disease and enable the selection of novel targets that drive the changes in molecular networks leading to the clinical signs and symptoms of the disease.

“We are determined to reduce the cost and time it takes to discover viable therapeutic targets and bring new diagnostics and effective therapies to people with Alzheimer’s. That demands a new way of doing business,” said NIH Director Francis S. Collins, M.D., Ph.D. “The AD initiative of AMP is one way we can revolutionize Alzheimer’s research and drug development by applying the principles of open science to the use and analysis of large and complex human data sets.”

Developed by Sage Bionetworks, a Seattle-based non-profit organization promoting open science, the portal will house several waves of Big Data to be generated over the five years of the AMP-AD Target Discovery and Preclinical Validation Project by multidisciplinary academic groups. The academic teams, in collaboration with Sage Bionetworks data scientists and industry bioinformatics and drug discovery experts, will work collectively to apply cutting-edge analytical approaches to integrate molecular and clinical data from over 2,000 postmortem brain samples.

Big data and open science, now that sounds like a winning combination:

Because no publication embargo is imposed on the use of the data once they are posted to the AMP-AD Knowledge Portal, it increases the transparency, reproducibility and translatability of basic research discoveries, according to Suzana Petanceska, Ph.D., NIA’s program director leading the AMP-AD Target Discovery Project.

“The era of Big Data and open science can be a game-changer in our ability to choose therapeutic targets for Alzheimer’s that may lead to effective therapies tailored to diverse patients,” Petanceska said. “Simply stated, we can work more effectively together than separately.”

Imagine that, academics who aren’t hoarding data for recruitment purposes.

Works for me!

Does it work for you?