Archive for the ‘Life Sciences’ Category

Data Analysis for the Life Sciences… [No Ur-Data Analysis Book?]

Thursday, September 24th, 2015

Data Analysis for the Life Sciences – a book completely written in R markdown by Rafael Irizarry.

From the post:

Data analysis is now part of practically every research project in the life sciences. In this book we use data and computer code to teach the necessary statistical concepts and programming skills to become a data analyst. Following in the footsteps of Stat Labs, instead of showing theory first and then applying it to toy examples, we start with actual applications and describe the theory as it becomes necessary to solve specific challenges. We use simulations and data analysis examples to teach statistical concepts. The book includes links to computer code that readers can use to program along as they read the book.

It includes the following chapters: Inference, Exploratory Data Analysis, Robust Statistics, Matrix Algebra, Linear Models, Inference for High-Dimensional Data, Statistical Modeling, Distance and Dimension Reduction, Practical Machine Learning, and Batch Effects.

Have you ever wondered about the growing proliferation of data analysis books?

The absence of one Ur-Data Analysis book that everyone could read and use?

I have a longer post coming on a this idea but if each discipline has the need for its own view on data analysis, it is really surprising that no one system of semantics satisfies all communities?

In other words, is the evidence of heterogeneous semantics so strong that we should abandon attempts at uniform semantics and focus on communicating across systems of semantics?

I’m sure there are other examples of where every niche has its own vocabulary, tables in relational databases or column headers in spreadsheets for example.

What is your favorite example of heterogeneous semantics?

Assuming heterogeneous semantics are here to stay (they have been around since the start of human to human communication, possibly earlier), what solution do you suggest?

I first saw this in a tweet by Christophe Lalanne.

Ephemeral identifiers for life science data

Wednesday, May 27th, 2015

10 Simple rules for design, provision, and reuse of persistent identifiers for life science data by Julie A. McMurray, et al. (35 others).

From the introduction:

When we interact, we use names to identify things. Usually this works well, but there are many familiar pitfalls. For example , the “morning star” and “evening star” are both names for the planet Venus. “The luminiferous ether” is a name for an entity which no one still thinks exists. There are many women named “Margaret”, some of whom go by “Maggie” and some of whom have changed their surnames. We use everyday conversational mechanisms to work around these problems successfully. Naming problems have plagued the life sciences since Linnaeus pondered the Norway spruce; in the much larger conversation that underlies the life sciences, problems with identifiers (Box 1) impede the flow and integrity of information. This is especially challenging within “synthesis research” disciplines such as systems biology, translational medicine, and ecology. Implementation – driven initiatives such as ELIXIR , BD2K, and others (Text S1) have therefore been actively working to understand and address underlying problems with identifiers.

Good, global-scale, persistent identifier design is harder than it appears, and is essential for data to be Findable, Accessible, Interoperable, and Reusable (Data FAIRport principles [1]). Digital entities (e.g., files), physical entities (e.g., biosamples), and descriptive entities (e.g., ‘mitosis’) have different requirements for identifiers. Identifiers are further complicated by imprecise terminology and different forms (Box 1).

Of the identifier forms, Local Resource Identifiers (LRI) and their corresponding full Uniform Resource Identifiers (URIs) are still among the most commonly used and most problematic identifiers in the bio-data ecosystem. Other forms of identifiers such as Uniform Resource Name (URNs) are less impactful because of their current lack of uptake. Here, we build on emerging conventions and existing general recommendations [2,3] and summarise the identifier characteristics most important to optimising the flow and integrity of life-science data (Table 1). We propose actions to take in the identifier ‘green field’ and offer guidance for using real-world identifiers from diverse sources.

Truth be told, global, persistent identifier design is overreaching.

First, some identifiers are more widely used than others, but there are no globally accepted identifiers of any sort.

Second, “persistent” is undefined. Present identifiers (curies or URIs) have not persisted pre-Web identifiers. On what basis would you claim that future generations will persist our identifiers?

However, systems expect to be able to make references by single, opaque, identifiers and so the hunt goes on for a single identifier.

The more robust and in fact persistent approach is to have a bag of identifiers for any subject, where each identifier itself has a bag of properties associated with it.

That avoids the exclusion of old identifiers and hence historical records and avoids pre-exclusion of future identifiers, which come into use long after our identifier is no long the most popular one.

Systems can continue to use a single identifier, locally as it were but software where semantic integration is important, should use sets of identifiers to facilitate integration across data sources.

Programming in the Life Sciences

Monday, November 17th, 2014

Programming in the Life Sciences by Egon Willighagen.

From the first post in this series, Programming in the Life Sciences #1: a six day course (October, 2013):

Our department will soon start the course Programming in the Life Sciences for a group of some 10 students from the Maastricht Science Programme. This is the first time we give this course, and over the next weeks I will be blogging about this course. First, some information. These are the goals, to use programming to:

  • have the ability to recognize various classes of chemical entities in pharmacology and to understand the basic physical and chemical interactions.
  • be familiar with technologies for web services in the life sciences.
  • obtain experience in using such web services with a programming language.
  • be able to select web services for a particular pharmacological question.
  • have sufficient background for further, more advanced, bioinformatics data analyses.

So, this course will be a mix of things. I will likely start with a lecture or too about scientific programming, such as the importance of reproducibility, licensing, documentation, and (unit) testing. To achieve these learning goals we have set a problem. The description is:


    In the life sciences the interactions between chemical entities is of key interest. Not only do these play an important role in the regulation of gene expression, and therefore all cellular processes, they are also one of the primary approaches in drug discovery. Pharmacology is the science studies the action of drugs, and for many common drugs, this is studying the interaction of small organic molecules and protein targets.
    And with the increasing information in the life sciences, automation becomes increasingly important. Big data and small data alike, provide challenges to integrate data from different experiments. The Open PHACTS platform provides web services to support pharmacological research and in this course you will learn how to use such web services from programming languages, allowing you to link data from such knowledge bases to other platforms, such as those for data analysis.

So, it becomes pretty clear what the students will be doing. They only have six days, so it won’t be much. It’s just to learn them the basic skills. The students are in their 3rd year at the university, and because of the nature of the programme they follow, a mixed background in biology, mathematics, chemistry, and physics. So, I have a good hope they will surprise me in what they will get done.

Pharmacology is the basic topic: drug-protein interaction, but the students are free to select a research question. In fact, I will not care that much what they like to study, as long as they do it properly. They will start with Open PHACTS’ Linked Data API, but here too, they are free to complement data from the OPS cache with additional information. I hope they do.

Now, regarding the technology they will use. The default will be JavaScript, and in the next week I will hack up demo code showing the integration of ops.js and d3.js. Let’s see how hard it will be; it’s new to me too. But, if the students already are familiar with another programming language and prefer to use that, I won’t stop them.

(For the Dutch readers, would #mscpils be a good tag?)

For quite a few “next weeks,” Egon’s blogging has gone on and life sciences, to say nothing of his readers, are all better off for it! His most recent post is titled: Programming in the Life Sciences #20: extracting data from JSON.

Definitely a series to catch or to pass along for anyone involved in life sciences.

Enjoy!