Archive for the ‘Replication’ Category

A Guide to Reproducible Code in Ecology and Evolution

Thursday, December 7th, 2017

A Guide to Reproducible Code in Ecology and Evolution by British Ecological Society.

Natalie Cooper, Natural History Museum, UK, and Pen-Yuan Hsing, Durham University, UK, write in the introduction:

The way we do science is changing — data are getting bigger, analyses are getting more complex, and governments, funding agencies and the scientific method itself demand more transparency and accountability in research. One way to deal with these changes is to make our research more reproducible, especially our code.

Although most of us now write code to perform our analyses, it is often not very reproducible. We have all come back to a piece of work we have not looked at for a while and had no idea what our code was doing or which of the many “final_analysis” scripts truly was the final analysis! Unfortunately, the number of tools for reproducibility and all the jargon can leave new users feeling overwhelmed, with no idea how to start making their code more reproducible. So, we have put together this guide to help.

A Guide to Reproducible Code covers all the basic tools and information you will need to start making your code more reproducible. We focus on R and Python, but many of the tips apply to any programming language. Anna Krystalli introduces some ways to organise files on your computer and to document your workflows. Laura Graham writes about how to make your code more reproducible and readable. François Michonneau explains how to write reproducible reports. Tamora James breaks down the basics of version control. Finally, Mike Croucher describes how to archive your code. We have also included a selection of helpful tips from other scientists.

True reproducibility is really hard. But do not let this put you off. We would not expect anyone to follow all of the advice in this booklet at once. Instead, challenge yourself to add one more aspect to each of your projects. Remember, partially reproducible research is much better than completely non-reproducible research.

Good luck!
… (emphasis in original)

Not counting front and back matter, 39 pages total. A lot to grasp in one reading, but if you don’t already have reproducible research habits, keep a copy of this publication on top of your desk. Yes, on top of the incoming mail, today’s newspaper, forms and chart requests from administrators, etc. On top means just that, on top.

At some future date, when the pages are too worn, creased, folded, dog-eared and annotated to be read easily, reprint it and transfer your annotations to a clean copy.

I first saw this in David Smith’s The British Ecological Society’s Guide to Reproducible Science.

PS: The same rules apply to data science.

A Docker tutorial for reproducible research [Reproducible Reporting In The Future?]

Wednesday, November 15th, 2017

R Docker tutorial: A Docker tutorial for reproducible research.

From the webpage:

This is an introduction to Docker designed for participants with knowledge about R and RStudio. The introduction is intended to be helping people who need Docker for a project. We first explain what Docker is and why it is useful. Then we go into the details on how to use it for a reproducible transportable project.

Six lessons, instructions for installing Docker, plus zip/tar ball of the materials. What more could you want?

Science has paid lip service to the idea of replication of results for centuries, but with the sharing of data and analysis, reproducible research is becoming a reality.

Is reproducible reporting in the near future? Reporters preparing their analysis and releasing raw data and their extraction methods?

Or will selective releases of data, when raw data is released at all, continue to be the norm?

Please let @ICIJorg know how you feel about data hoarding, #ParadisePapers, #PanamaPapers, when data and code sharing are becoming the norm in science.

Is Failing to Attempt to Replicate, “Just Part of the Whole Science Deal”?

Tuesday, February 16th, 2016

Genomeweb posted this summary of Stuart Firestein’s op-ed on failure to replicate:

Failure to replicate experiments is just part of the scientific process, writes Stuart Firestein, author and former chair of the biology department at Columbia University, in the Los Angeles Times. The recent worries over a reproducibility crisis in science are overblown, he adds.

“Science would be in a crisis if it weren’t failing most of the time,” Firestein writes. “Science is full of wrong turns, unconsidered outcomes, omissions and, of course, occasional facts.”

Failures to repeat experiments and the struggle to figure out what went wrong has also fed a number of discoveries, he says. For instance, in 1921, biologist Otto Loewi studied beating hearts from frogs in saline baths, one with the vagus nerve removed and one with it still intact. When the solution from the heart with the nerve still there was added to the other bath, that heart also slowed, suggesting that the nerve secreted a chemical that slowed the contractions.

However, Firestein notes Loewi and other researchers had trouble replicating the results for nearly six years. But that led the researchers to find that seasons can affect physiology and that temperature can affect enzyme function: Loewi’s first experiment was conducted at night and in the winter, while the follow-up ones were done during the day in heated buildings or on warmer days. This, he adds, also contributed to the understanding of how synapses fire, a finding for which Loewi shared the 1936 Nobel Prize.

“Replication is part of [the scientific] process, as open to failure as any other step,” Firestein adds. “The mistake is to think that any published paper or journal article is the end of the story and a statement of incontrovertible truth. It is a progress report.”

To appreciate my concerns, you will need to read Firestein’s comments in full: just part of the scientific process.

For example, Firestein says:


Absolutely not. Science is doing what it always has done — failing at a reasonable rate and being corrected. Replication should never be 100%. Science works beyond the edge of what is known, using new, complex and untested techniques. It should surprise no one that things occasionally come out wrong, even though everything looks correct at first.

I don’t know, would you say an 89% failure-to-replicate rate (47 of 53 landmark studies could not be confirmed) is significant? See Drug development: Raise standards for preclinical cancer research, C. Glenn Begley & Lee M. Ellis, Nature 483, 531–533 (29 March 2012), doi:10.1038/483531a. Or over half of psychology studies? See Over half of psychology studies fail reproducibility test. Those are just two studies on replication.

I think we can agree with Firestein that replication isn’t at 100%, but how often is replication even attempted?

From what Firestein says,

“Replication is part of [the scientific] process, as open to failure as any other step,” Firestein adds. “The mistake is to think that any published paper or journal article is the end of the story and a statement of incontrovertible truth. It is a progress report.”

Systematic attempts at replication (and its failure) should be part and parcel of science.

Except that they obviously are not.

If they were, there would have been no earth-shaking announcements that fundamental cancer research experiments could not be replicated.

Failures to replicate would have been spread out over the literature and gradually resolved with better data, better methods, or both.

Failure to replicate is a legitimate part of the scientific method.

Not attempting to replicate, “I won’t look too closely at your results if you don’t look too closely at mine,” isn’t.

There is an ugly word for avoiding looking too closely at your own results or those of others.

Why Use Make

Thursday, November 12th, 2015

Why Use Make by Mike Bostock.

From the post:

I love Make. You may think of Make as merely a tool for building large binaries or libraries (and it is, almost to a fault), but it’s much more than that. Makefiles are machine-readable documentation that make your workflow reproducible.

To illustrate with a recent example: yesterday Kevin and I needed to update a six-month old graphic on drought to accompany a new article on thin snowpack in the West. The article was already on the homepage, so the clock was ticking to republish with new data as soon as possible.

Shamefully, I hadn’t documented the data-transformation process, and it’s painfully easy to forget details over six months: I had a mess of CSV and GeoJSON data files, but not the exact source URL from the NCDC; I was temporarily confused as to the right Palmer drought metric (Drought Severity Index or Z Index?) and the corresponding categorical thresholds; finally, I had to resurrect the code to calculate drought coverage area.

Despite these challenges, we republished the updated graphic without too much delay. But I was left thinking how much easier it could have been had I simply recorded the process the first time as a makefile. I could have simply typed make in the terminal and be done!

Remember how science has been losing the ability to replicate experiments due to computers? How Computers Broke Science… [Soon To Break Businesses …]

Suppose you are trying to remember, and explain to an opponent’s attorney, the process you went through in processing data. After about 3 hours of sharp questioning, how clear do you think you will be? Will you really remember every step? The source of every file?

Had you documented your workflow in a makefile, you could read from it and say exactly what happened, in what order and with what sources. You do need to run the workflow through Make every time if you want anyone to believe the makefile represents what actually happened.

You will be on more solid ground than trying to remember which files, the dates on those files, their content, etc.
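
For readers who have not used Make, the core rule it applies is simple: a target file is rebuilt only when it is missing or older than one of its sources. The sketch below imitates that rule in Python purely as an illustration. It is not Make itself, and the names `needs_rebuild`, `build`, `uppercase_recipe`, `raw.csv` and `clean.csv` are hypothetical, invented for this example.

```python
import tempfile
from pathlib import Path
from typing import Callable, List, Sequence


def needs_rebuild(target: Path, sources: Sequence[Path]) -> bool:
    """Return True if the target is missing or older than any source."""
    if not target.exists():
        return True
    target_mtime = target.stat().st_mtime
    return any(src.stat().st_mtime > target_mtime for src in sources)


def build(
    target: Path,
    sources: Sequence[Path],
    recipe: Callable[[Sequence[Path], Path], None],
    log: List[str],
) -> bool:
    """Run the recipe only when needed, and log what happened either way."""
    if needs_rebuild(target, sources):
        recipe(sources, target)
        names = ", ".join(src.name for src in sources)
        log.append(f"built {target.name} from {names}")
        return True
    log.append(f"{target.name} is up to date")
    return False


def uppercase_recipe(sources: Sequence[Path], target: Path) -> None:
    """Hypothetical transformation step: concatenate sources in upper case."""
    target.write_text("".join(src.read_text() for src in sources).upper())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "raw.csv"      # hypothetical source file
        clean = Path(tmp) / "clean.csv"  # hypothetical target file
        raw.write_text("station,index\nA,-2.5\n")
        log: List[str] = []
        build(clean, [raw], uppercase_recipe, log)  # target missing, so it is built
        build(clean, [raw], uppercase_recipe, log)  # nothing changed, so it is skipped
        print("\n".join(log))
```

The log is the part that matters here: every target is tied to named sources and a named recipe, which is the kind of record you would want to read from under questioning.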

Mike concludes his post with:

So do your future self and coworkers a favor, and use Make!

Let’s modify that to read:

So do your future self, coworkers, and lawyer a favor, and use Make!

I first saw this in a tweet by Christophe Lalanne.

How Computers Broke Science… [Soon To Break Businesses …]

Tuesday, November 10th, 2015

How Computers Broke Science — and What We can do to Fix It by Ben Marwick.

From the post:

Reproducibility is one of the cornerstones of science. Made popular by British scientist Robert Boyle in the 1660s, the idea is that a discovery should be reproducible before being accepted as scientific knowledge.

In essence, you should be able to produce the same results I did if you follow the method I describe when announcing my discovery in a scholarly publication. For example, if researchers can reproduce the effectiveness of a new drug at treating a disease, that’s a good sign it could work for all sufferers of the disease. If not, we’re left wondering what accident or mistake produced the original favorable result, and would doubt the drug’s usefulness.

For most of the history of science, researchers have reported their methods in a way that enabled independent reproduction of their results. But, since the introduction of the personal computer — and the point-and-click software programs that have evolved to make it more user-friendly — reproducibility of much research has become questionable, if not impossible. Too much of the research process is now shrouded by the opaque use of computers that many researchers have come to depend on. This makes it almost impossible for an outsider to recreate their results.

Recently, several groups have proposed similar solutions to this problem. Together they would break scientific data out of the black box of unrecorded computer manipulations so independent readers can again critically assess and reproduce results. Researchers, the public, and science itself would benefit.

Whether you are looking for specific proposals to make computed results capable of replication or quotes to support that idea, this is a good first stop.

FYI for business analysts: how are you going to replicate the results of computer runs to establish your “due diligence” before critical business decisions?

What looked like a science or academic issue has liability implications!

Changing a few variables in a spreadsheet, or in a more complex machine learning algorithm, can make you look criminally negligent, if not criminal.

The computer illiteracy/incompetence of prosecutors and litigants is only going to last so long. Prepare defensive audit trails to enable the replication of your actual* computer-based business analysis.

*I offer advice on techniques for such audit trails. The audit trails you choose to build are up to you.
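
As one illustration of what such an audit trail might record, here is a minimal Python sketch that appends one JSON line per analysis step, capturing a UTC timestamp, the SHA-256 hash of each input file, and the parameters used. It is an assumption about what a useful trail could contain, not a legal standard, and the names `sha256_of`, `record_run`, `audit.jsonl` and `sales.csv` are hypothetical.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict, Sequence


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large inputs need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_run(
    log_path: Path,
    step: str,
    inputs: Sequence[Path],
    parameters: Dict[str, Any],
) -> Dict[str, Any]:
    """Append one audit entry (as a JSON line) describing an analysis step."""
    entry = {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "step": step,
        "inputs": {str(path): sha256_of(path) for path in inputs},
        "parameters": parameters,
    }
    with log_path.open("a", encoding="utf-8") as log:
        log.write(json.dumps(entry, sort_keys=True) + "\n")
    return entry


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "sales.csv"       # hypothetical input file
        data.write_text("region,total\nwest,100\n")
        audit = Path(tmp) / "audit.jsonl"    # hypothetical log location
        entry = record_run(audit, "load sales data", [data], {"threshold": 0.5})
        print(json.dumps(entry, indent=2))
```

Because the hash changes whenever an input file changes, a later reader can check whether the file on disk is the one the analysis actually used. The log only helps if every run goes through `record_run`, the same caveat that applies to a makefile.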