Archive for the ‘Research Methods’ Category

A Guide to Reproducible Code in Ecology and Evolution

Thursday, December 7th, 2017

A Guide to Reproducible Code in Ecology and Evolution by British Ecological Society.

Natalie Cooper, Natural History Museum, UK and Pen-Yuan Hsing, Durham University, UK, write in the introduction:

The way we do science is changing — data are getting bigger, analyses are getting more complex, and governments, funding agencies and the scientific method itself demand more transparency and accountability in research. One way to deal with these changes is to make our research more reproducible, especially our code.

Although most of us now write code to perform our analyses, it is often not very reproducible. We have all come back to a piece of work we have not looked at for a while and had no idea what our code was doing or which of the many “final_analysis” scripts truly was the final analysis! Unfortunately, the number of tools for reproducibility and all the jargon can leave new users feeling overwhelmed, with no idea how to start making their code more reproducible. So, we have put together this guide to help.

A Guide to Reproducible Code covers all the basic tools and information you will need to start making your code more reproducible. We focus on R and Python, but many of the tips apply to any programming language. Anna Krystalli introduces some ways to organise files on your computer and to document your workflows. Laura Graham writes about how to make your code more reproducible and readable. François Michonneau explains how to write reproducible reports. Tamora James breaks down the basics of version control. Finally, Mike Croucher describes how to archive your code. We have also included a selection of helpful tips from other scientists.

True reproducibility is really hard. But do not let this put you off. We would not expect anyone to follow all of the advice in this booklet at once. Instead, challenge yourself to add one more aspect to each of your projects. Remember, partially reproducible research is much better than completely non-reproducible research.

Good luck!
… (emphasis in original)

Not counting front and back matter, 39 pages total. A lot to grasp in one reading but if you don’t already have reproducible research habits, keep a copy of this publication on top of your desk. Yes, on top of the incoming mail, today’s newspaper, forms and chart requests from administrators, etc. On top means just that, on top.

At some future date, when the pages are too worn, creased, folded, dog eared and annotated to be read easily, reprint it and transfer your annotations to a clean copy.

I first saw this in David Smith’s The British Ecological Society’s Guide to Reproducible Science.

PS: The same rules apply to data science.

Calling Bullshit in the Age of Big Data (Syllabus)

Friday, January 13th, 2017

Calling Bullshit in the Age of Big Data by Carl T. Bergstrom and Jevin West.

From the about page:

The world is awash in bullshit. Politicians are unconstrained by facts. Science is conducted by press release. So-called higher education often rewards bullshit over analytic thought. Startup culture has elevated bullshit to high art. Advertisers wink conspiratorially and invite us to join them in seeing through all the bullshit, then take advantage of our lowered guard to bombard us with second-order bullshit. The majority of administrative activity, whether in private business or the public sphere, often seems to be little more than a sophisticated exercise in the combinatorial reassembly of bullshit.

We’re sick of it. It’s time to do something, and as educators, one constructive thing we know how to do is to teach people. So, the aim of this course is to help students navigate the bullshit-rich modern environment by identifying bullshit, seeing through it, and combatting it with effective analysis and argument.

What do we mean, exactly, by the term bullshit? As a first approximation, bullshit is language intended to persuade by impressing and overwhelming a reader or listener, with a blatant disregard for truth and logical coherence.

While bullshit may reach its apogee in the political sphere, this isn’t a course on political bullshit. Instead, we will focus on bullshit that comes clad in the trappings of scholarly discourse. Traditionally, such highbrow nonsense has come couched in big words and fancy rhetoric, but more and more we see it presented instead in the guise of big data and fancy algorithms — and these quantitative, statistical, and computational forms of bullshit are those that we will be addressing in the present course.

Of course an advertisement is trying to sell you something, but do you know whether the TED talk you watched last night is also bullshit — and if so, can you explain why? Can you see the problem with the latest New York Times or Washington Post article fawning over some startup’s big data analytics? Can you tell when a clinical trial reported in the New England Journal or JAMA is trustworthy, and when it is just a veiled press release for some big pharma company?

Our aim in this course is to teach you how to think critically about the data and models that constitute evidence in the social and natural sciences.

Learning Objectives

Our learning objectives are straightforward. After taking the course, you should be able to:

  • Remain vigilant for bullshit contaminating your information diet.
  • Recognize said bullshit whenever and wherever you encounter it.
  • Figure out for yourself precisely why a particular bit of bullshit is bullshit.
  • Provide a statistician or fellow scientist with a technical explanation of why a claim is bullshit.
  • Provide your crystals-and-homeopathy aunt or casually racist uncle with an accessible and persuasive explanation of why a claim is bullshit.

We will be astonished if these skills do not turn out to be among the most useful and most broadly applicable of those that you acquire during the course of your college education.

A great syllabus and impressive set of readings, although I must confess my disappointment that Is There a Text in This Class? The Authority of Interpretive Communities and Doing What Comes Naturally: Change, Rhetoric, and the Practice of Theory in Literary and Legal Studies, both by Stanley Fish, weren’t on the list.

Bergstrom and West are right about the usefulness of this “class” but I would use Fish and other literary critics to push your sensitivity to “bullshit” a little further than the readings indicate.

All communication is an attempt to persuade within a social context. If you share a context with a speaker, you are far more likely to recognize and approve of their use of “evidence” to make their case. If you don’t share such a context, say a person claiming a particular interpretation of the Bible due to divine revelation, their case doesn’t sound like it has any evidence at all.

It’s a subtle point but one known in the legal, literary and philosophical communities for a long time. That it’s new to scientists and/or data scientists speaks volumes about the lack of humanities education in science majors.

Moral Machine [Research Design Failure]

Tuesday, October 4th, 2016

Moral Machine

From the webpage:

Welcome to the Moral Machine! A platform for gathering a human perspective on moral decisions made by machine intelligence, such as self-driving cars.

We show you moral dilemmas, where a driverless car must choose the lesser of two evils, such as killing two passengers or five pedestrians. As an outside observer, you judge which outcome you think is more acceptable. You can then see how your responses compare with those of other people.

If you’re feeling creative, you can also design your own scenarios, for you and others to browse, share, and discuss.

The first time I recall hearing this type of discussion was over thirty years ago when a friend, taking an ethics class, related the following problem:

You are driving a troop transport with twenty soldiers in the back and are about to enter a one lane bridge. You see a baby sitting in the middle of the bridge. Do you swerve, going down an embankment, killing all on board or do you go straight?

A lively college classroom discussion erupted and continued for the entire class. Various theories and justifications were offered, etc. When the class bell rang, the professor announced the child perished 59 minutes, 59 seconds ago.

As you may guess, not a single person in the class called out “Swerve” when the question was posed.

The exercise was to illustrate that many “moral” decisions are made at the limits of human reaction time. Typically, between 150 and 300 milliseconds. (Speedy Science: How Fast Can You React? is a great activity from Scientific American to test your reaction time.)

The examples in MIT’s Moral Machine perpetuate the myth that moral decisions are the result of reflection and consideration of multiple factors.

Considered moral decisions do exist. Dietrich Bonhoeffer deciding to participate in a conspiracy to assassinate Adolf Hitler. Lyndon Johnson supporting civil rights in the South. But those are not the subject of the “Moral Machine.”

Nor is the “Moral Machine” even a useful simulation of what a driven and/or driverless car would confront. Visibility isn’t an issue as it often is, there are no distractions, no smart phones ringing, no conflicting input from passengers, etc.

In short, the “Moral Machine” creates a fictional choice, about which to solicit your “moral” advice, under conditions you will never experience.

Separating pedestrians from vehicles (once suggested by Buckminster Fuller I think) is a far more useful exercise than college level discussion questions.

File Organization and Naming – Practical Tip

Wednesday, August 3rd, 2016


Daily morning mantra Hell!

More like a cover for the keyboard that has to be removed every morning!

Or make that the passphrase for your screensaver.

How’s your file organization/naming practice?

Software Carpentry Bug BBQ (June 13th, 2016)

Sunday, June 5th, 2016

Software Carpentry Bug BBQ

From the post:

Software Carpentry is having a Bug BBQ on June 13th

Software Carpentry is aiming to ship a new version (5.4) of the Software Carpentry lessons by the end of June. To help get us over the finish line we are having a Bug BBQ on June 13th to squash as many bugs as we can before we publish the lessons. The June 13th Bug BBQ is also an opportunity for you to engage with our world-wide community. For more info about the event, read-on and visit our Bug BBQ website.

How can you participate? We’re asking you, members of the Software Carpentry community, to spend a few hours on June 13th to wrap up outstanding tasks to improve the lessons. Ahead of the event, the lesson maintainers will be creating milestones to identify all the issues and pull requests that need to be resolved before we wrap up version 5.4. In addition to specific fixes laid out in the milestones, we also need help to proofread and bugtest the

Where will this be? Join in from where you are: No need to go anywhere – if you’d like to participate remotely, start by having a look at the milestones on the website to see what tasks are still open, and send a pull request with your ideas to the corresponding repo. If you’d like to get together with other people working on these lessons live, we have created this map for live sites that are being organized. And if there’s no site listed near you, organize one yourself and let us know you are doing that here so that we can add your site to the map!

The Bug BBQ is going to be a great chance to get the community together, get our latest lessons over the finish line, and wrap up a product that gives you and all our contributors credit for your hard work with a citable object – we will be minting a DOI for this on publication.

A community BBQ that is open to everyone, dietary restrictions or not!

And the organizers have removed distance as a consideration for “attending.”

For those of us on non-BBQ diets, a unique opportunity to participate with others in the community for a worthy cause.

Mark your calendars today!

Reproducible Research Resources for Research(ing) Parasites

Friday, June 3rd, 2016

Reproducible Research Resources for Research(ing) Parasites by Scott Edmunds.

From the post:

Two new research papers on scabies and tapeworms published today showcase a new collaboration with This demonstrates a new way to share scientific methods that allows scientists to better repeat and build upon these complicated studies on difficult-to-study parasites. It also highlights a new means of writing all research papers with citable methods that can be updated over time.

While there has been recent controversy (and hashtags in response) from some of the more conservative sections of the medical community calling those who use or build on previous data “research parasites”, as data publishers we strongly disagree with this. And also feel it is unfair to drag parasites into this when they can teach us a thing or two about good research practice. Parasitology remains a complex field given the often extreme differences between parasites, which all fall under the umbrella definition of an organism that lives in or on another organism (host) and derives nutrients at the host’s expense. Published today in GigaScience are articles on two parasitic organisms, scabies and on the tapeworm Schistocephalus solidus. Not only are both papers in parasitology, but the way in which these studies are presented showcase a new collaboration with that provides a unique means for reporting the Methods that serves to improve reproducibility. Here the authors take advantage of their open access repository of scientific methods and a collaborative protocol-centered platform, and we for the first time have integrated this into our submission, review and publication process. We now also have a groups page on the portal where our methods can be stored.

A great example of how sharing data advances research.

Of course, that assumes that one of your goals is to advance research and not solely yourself, your funding and/or your department.

Such self-centered as opposed to research-centered individuals do exist, but I would not malign true parasites by describing them as such, even colloquially.

The days of science data hoarders are numbered and one can only hope that the same is true for the “gatekeepers” of humanities data, manuscripts and artifacts.

The only known contribution of hoarders or “gatekeepers” has been to the retarding of their respective disciplines.

Given the choice of advancing your field along with yourself, or only yourself, which one will you choose?

How to Read a Paper

Saturday, October 17th, 2015

How to Read a Paper by S. Keshav.


Researchers spend a great deal of time reading research papers. However, this skill is rarely taught, leading to much wasted effort. This article outlines a practical and efficient three-pass method for reading research papers. I also describe how to use this method to do a literature survey.

Sean Cribbs mentions this paper in: The Refreshingly Rewarding Realm of Research Papers but it is important enough for a separate post.

You should keep a copy of it at hand until the three-pass method becomes habit.

Other resources that Keshav mentions:

T. Roscoe, Writing Reviews for Systems Conferences

H. Schulzrinne, Writing Technical Articles

G.M. Whitesides, Whitesides’ Group: Writing a Paper (updated URL)

All three are fairly short and well worth your time to read and re-read.

Experienced writers as well!

After more than thirty years of professional writing I still benefit from well-written writing/editing advice.

The Refreshingly Rewarding Realm of Research Papers

Wednesday, October 14th, 2015

From the description:

Sean Cribbs teaches us how to read and implement research papers – and translate what they describe into code. He covers examples of research implementations he’s been involved in and the relationships he’s built with researchers in the process.

A bit longer description at:

Have you ever run into a thorny problem that makes your code slow or complicated, for which there is no obvious solution? Have you ever needed a data structure that your language’s standard library didn’t provide? You might need to implement a research paper!

While much of research in Computer Science doesn’t seem relevant to your everyday web application, all of those tools and techniques you use daily originally came from research! In this talk we’ll learn why you might want to read and implement research papers, how to read them for relevant information, and how to translate what they describe into code and test the results. Finally, we’ll discuss examples of research implementation I’ve been involved in and the relationships I’ve built with researchers in the process.

As you might imagine, I think this rocks!

The Economics of Reproducibility in Preclinical Research

Wednesday, June 10th, 2015

The Economics of Reproducibility in Preclinical Research by Leonard P. Freedman, Iain M. Cockburn, Timothy S. Simcoe. PLOS Published: June 9, 2015 DOI: 10.1371/journal.pbio.1002165.


Low reproducibility rates within life science research undermine cumulative knowledge production and contribute to both delays and costs of therapeutic drug development. An analysis of past studies indicates that the cumulative (total) prevalence of irreproducible preclinical research exceeds 50%, resulting in approximately US$28,000,000,000 (US$28B)/year spent on preclinical research that is not reproducible—in the United States alone. We outline a framework for solutions and a plan for long-term improvements in reproducibility rates that will help to accelerate the discovery of life-saving therapies and cures.

The authors find four categories of irreproducibility:

(1) study design, (2) biological reagents and reference materials, (3) laboratory protocols, and (4) data analysis and reporting.

But only address “(1) study design, (2) biological reagents and reference materials.”

Once again, documentation doesn’t make the cut. 🙁

I find that curious because, judging just from the flood of social media data, people in general spend a good part of every day capturing and transmitting information. Where is the pain point between that activity and formal documentation that makes the latter into an anathema?

Documentation, among other things, could lead to higher reproducibility rates for medical and other research areas, to say nothing of saving data scientists time puzzling out data and/or programmers debugging old code.

Tips on Digging into Scholarly Research Journals

Tuesday, April 14th, 2015

Tips on Digging into Scholarly Research Journals by Gary Price.

Gary gives a great guide to using JournalTOCs, a free service that provides tables of contents and abstracts where available for thousands of academic journals.

Subject to the usual warning about reading critically, academic journals can be a rich source of analysis and data.


Research Reports by U.S. Congress and UK House of Commons

Sunday, April 12th, 2015

Research Reports by U.S. Congress and UK House of Commons by Gary Price.

Gary’s post covers the Congressional Research Service (CRS) (US) and the House of Commons Library Research Service (UK).

Truly amazing, I know, for an open and transparent government like the United States Government, but CRS reports are not routinely made available to the public, so we have to rely on the kindness of strangers to make them available. Gary reports:

The good news is that Steven Aftergood, director of the Government Secrecy Project at the Federation of American Scientists (FAS), gets ahold of many of these reports and shares them on the FAS website.

The House of Commons Library Research Service appears to not mind officially sharing its research with anyone with web access.

Unlike some government agencies and publications, the CRS and LRS enjoy reputations for high quality scholarship and accuracy. You still need to evaluate their conclusions and the evidence cited or not, but outright deception and falsehood aren’t part of their traditions.

Barkan, Bintliff, and Whisner’s Fundamentals of Legal Research, 10th

Monday, April 6th, 2015

Barkan, Bintliff, and Whisner’s Fundamentals of Legal Research, 10th by Steven M Barkan; Barbara Bintliff; Mary Whisner. (ISBN-13: 9781609300562)


This classic textbook has been updated to include the latest methods and resources. Fundamentals of Legal Research provides an authoritative introduction and guide to all aspects of legal research, integrating electronic and print sources. The Tenth Edition includes chapters on the true basics (case reporting, statutes, and so on) as well as more specialized chapters on legislative history, tax law, international law, and the law of the United Kingdom. A new chapter addresses Native American tribal law. Chapters on the research process, legal writing, and citation format help integrate legal research into the larger process of solving legal problems and communicating the solutions. This edition includes an updated glossary of research terms and revised tables and appendixes. Because of its depth and breadth, this text is well suited for advanced legal research classes; it is a book that students will want to retain for future use. Moreover, it has a place on librarians’ and attorneys’ ready reference shelves. Barkan, Bintliff and Whisner’s Assignments to Fundamentals of Legal Research complements the text.

I haven’t seen this volume in hard copy but if you are interested in learning what connections researchers are looking for with search tools, law is a great place to start.

The purpose of legal research isn’t to find the most popular “fact” (Google), or to find every term for a “fact” ever tweeted (Twitter), but rather to find facts and their relationships to other facts, which flesh out to a legal view of a situation in context.

If you think about it, putting legislation, legislative history, court records and decisions, along with non-primary sources online, is barely a start towards making that information “accessible.” A necessary first step but not sufficient for meaningful access.

On the Shoulders of Giants: The Growing Impact of Older Articles

Friday, November 7th, 2014

On the Shoulders of Giants: The Growing Impact of Older Articles by Alex Verstak, et al.


In this paper, we examine the evolution of the impact of older scholarly articles. We attempt to answer four questions. First, how often are older articles cited and how has this changed over time. Second, how does the impact of older articles vary across different research fields. Third, is the change in the impact of older articles accelerating or slowing down. Fourth, are these trends different for much older articles.

To answer these questions, we studied citations from articles published in 1990-2013. We computed the fraction of citations to older articles from articles published each year as the measure of impact. We considered articles that were published at least 10 years before the citing article as older articles. We computed these numbers for 261 subject categories and 9 broad areas of research. Finally, we repeated the computation for two other definitions of older articles, 15 years and older and 20 years and older.

There are three conclusions from our study. First, the impact of older articles has grown substantially over 1990-2013. In 2013, 36% of citations were to articles that are at least 10 years old; this fraction has grown 28% since 1990. The fraction of older citations increased over 1990-2013 for 7 out of 9 broad areas and 231 out of 261 subject categories.

Second, the increase over the second half (2002-2013) was double the increase in the first half (1990-2001).

Third, the trend of a growing impact of older articles also holds for even older articles. In 2013, 21% of citations were to articles >= 15 years old with an increase of 30% since 1990 and 13% of citations were to articles >= 20 years old with an increase of 36%.

Now that finding and reading relevant older articles is about as easy as finding and reading recently published articles, significant advances aren’t getting lost on the shelves and are influencing work worldwide for years after.

Deeply encouraging results!

If indexing and retrieval could operate at a sub-article level, following chains of research across the literature would be even easier.
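The measure described in the abstract, the fraction of each year's citations that go to articles at least 10 (or 15, or 20) years older than the citing article, is simple enough to sketch. What follows is only an illustration of that definition, not the authors' code: the function name and the sample citation pairs are made up for the example.

```python
from collections import defaultdict


def older_citation_fraction(citations, min_age=10):
    """Map each citing year to the fraction of its citations that point to
    articles published at least `min_age` years earlier."""
    totals = defaultdict(int)
    older = defaultdict(int)
    for citing_year, cited_year in citations:
        totals[citing_year] += 1
        if citing_year - cited_year >= min_age:
            older[citing_year] += 1
    return {year: older[year] / totals[year] for year in totals}


# Made-up (citing_year, cited_year) pairs, for illustration only.
sample = [
    (2013, 2012), (2013, 2001), (2013, 1990), (2013, 2010),
    (1990, 1985), (1990, 1975), (1990, 1989), (1990, 1988),
]

fractions_10 = older_citation_fraction(sample, min_age=10)
fractions_20 = older_citation_fraction(sample, min_age=20)
print(fractions_10)
print(fractions_20)
```

Changing `min_age` reproduces the paper's alternative definitions of "older" without touching the rest of the calculation.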

How to Make More Published Research True

Tuesday, October 21st, 2014

How to Make More Published Research True by John P. A. Ioannidis. (DOI: 10.1371/journal.pmed.1001747)

If you think the title is provocative, check out the first paragraph:

The achievements of scientific research are amazing. Science has grown from the occupation of a few dilettanti into a vibrant global industry with more than 15,000,000 people authoring more than 25,000,000 scientific papers in 1996–2011 alone [1]. However, true and readily applicable major discoveries are far fewer. Many new proposed associations and/or effects are false or grossly exaggerated [2],[3], and translation of knowledge into useful applications is often slow and potentially inefficient [4]. Given the abundance of data, research on research (i.e., meta-research) can derive empirical estimates of the prevalence of risk factors for high false-positive rates (underpowered studies; small effect sizes; low pre-study odds; flexibility in designs, definitions, outcomes, analyses; biases and conflicts of interest; bandwagon patterns; and lack of collaboration) [3]. Currently, an estimated 85% of research resources are wasted [5]. (footnote links omitted, emphasis added)

I doubt anyone can disagree with the need for reform in scientific research, but it is one thing to call for reform in general versus the specific.

The following story depends a great deal on cultural context, Southern religious cultural context, but I will tell the story and then attempt to explain if necessary.

One Sunday morning service the minister was delivering a powerful sermon on sins that his flock could avoid. He touched on drinking and smoking at length and as he ended each of those, an older woman in the front pew would “Amen!” very loudly. The same response was given to his condemnation of smoking. Finally, the sermon touched on dipping snuff and chewing tobacco. Dead silence from the older woman on the front row. The sermon ended some time later, hymns were sung and the congregation was dismissed.

As the congregation exited the church, the minister stood at the door, greeting one and all. Finally the older woman from the front pew appeared and the minister greeted her warmly. She had, after all, appeared to enjoy most of his sermon. After some small talk, the minister did say: “You liked most of my sermon but you became very quiet when I mentioned dipping snuff and chewing tobacco. If you don’t mind, can you tell me what was different about that part?” To which the old woman replied: “I was very happy while you were preaching but then you went to meddling.”

So long as the minister was talking about the “sins” that she did not practice, that was preaching. When the minister started talking about “sins” she committed, like dipping snuff or chewing tobacco, that was “meddling.”

I suspect that Ioannidis’ preaching will find widespread support but when you get down to actual projects and experiments, well, you have gone to “meddling.”

In order to root out waste, it will be necessary to map out who benefits from such projects, who supported them, who participated, and their relationships to others and other projects.

Considering that universities are rumored to get at least fifty (50) to sixty (60) percent of grants as administrative overhead, they are unlikely to be your allies in creating such mappings or reducing waste in any way. Appeals to funders may be effective, save that some funders, like the NIH, have an investment in the research structure as it exists.

Whatever the odds of change, naming names, charting relationships over time and interests in projects is at least a step down the road to useful rather than remunerative scientific research.

Topic maps excel at modeling relationships, whether known at the outset of your tracking or lately discovered, unexpectedly.

PS: With a topic map you can skip endless committee meetings with each project to agree on how to track that project and their methodologies for waste, should any waste exist. Yes, the first line of a tar baby (in its traditional West African sense) defense by universities and others: let’s have a pre-meeting to plan our first meeting, etc.

Tools for Reproducible Research [Reproducible Mappings]

Saturday, April 19th, 2014

Tools for Reproducible Research by Karl Broman.

From the post:

A minimal standard for data analysis and other scientific computations is that they be reproducible: that the code and data are assembled in a way so that another group can re-create all of the results (e.g., the figures in a paper). The importance of such reproducibility is now widely recognized, but it is still not so widely practiced as it should be, in large part because many computational scientists (and particularly statisticians) have not fully adopted the required tools for reproducible research.

In this course, we will discuss general principles for reproducible research but will focus primarily on the use of relevant tools (particularly make, git, and knitr), with the goal that the students leave the course ready and willing to ensure that all aspects of their computational research (software, data analyses, papers, presentations, posters) are reproducible.

As you already know, there is a great deal of interest in making scientific experiments reproducible in fact as well as in theory.

At the same time, there has been an increasing interest in reproducible data analysis as it concerns the results from reproducible experiments.

One logically follows on from the other.

Of course, reproducible data analysis, as far as any combination of data from different sources goes, would simply follow, cookie-cutter fashion, the combining of data in a reported experiment.

But what if a user wants to replicate the combining (mapping) of data with other data? From different sources? That could be followed by rote by others but they would not know the underlying basis for the choices made in the mapping.

Experiments take a great deal of effort to identify the substances used in an experiment. When data is combined from different sources, why not do the same for the data?
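One way to picture "doing the same for the data" is to keep the basis for each mapping decision alongside the mapping itself, so that someone replicating the merge can see why two fields were treated as the same thing. The sketch below is only an illustration under that assumption: the `FieldMapping` class, the lab names, the field names and the recorded reasons are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldMapping:
    source: str        # dataset the field comes from (hypothetical names below)
    source_field: str  # field name as it appears in that dataset
    target_field: str  # field name used in the combined data
    basis: str         # why the mapper believes the two fields mean the same thing


def apply_mappings(record, source, mappings):
    """Rename the fields of one record using the mappings declared for its source."""
    renames = {m.source_field: m.target_field for m in mappings if m.source == source}
    return {renames.get(key, key): value for key, value in record.items()}


mappings = [
    FieldMapping("lab_a", "cmpd", "compound_id",
                 "lab_a codebook states the column holds CAS registry numbers"),
    FieldMapping("lab_b", "chem_ref", "compound_id",
                 "confirmed with lab_b staff: CAS number, not an internal ID"),
]

merged = [
    apply_mappings({"cmpd": "50-00-0", "dose": 1.5}, "lab_a", mappings),
    apply_mappings({"chem_ref": "50-00-0", "dose": 2.0}, "lab_b", mappings),
]
print(merged)
```

The merge itself could be followed by rote without the `basis` field; recording it is what lets a later reader question or confirm each choice.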

I first saw this in a tweet by Yihui Xie.

Reproducible Research/(Mapping?)

Thursday, April 17th, 2014

Implementing Reproducible Research edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng.

From the webpage:

In many of today’s research fields, including biomedicine, computational tools are increasingly being used so that the results can be reproduced. Researchers are now encouraged to incorporate software, data, and code in their academic papers so that others can replicate their research results. Edited by three pioneers in this emerging area, this book is the first one to explore this groundbreaking topic. It presents various computational tools useful in research, including cloud computing, data repositories, virtual machines, R’s Sweave function, XML-based programming, and more. It also discusses legal issues and case studies.

There is a growing concern over the ability of scientists to reproduce the published results of other scientists. The Economist rang one of many alarm bells when it published: Trouble at the lab [Data Skepticism].

From the introduction to Reproducible Research:

Literate statistical programming is a concept introduced by Rossini that builds on the idea of literate programming as described by Donald Knuth. With literate statistical programming, one combines the description of a statistical analysis and the code for doing the statistical analysis into a single document. Subsequently, one can take the combined document and produce either a human-readable document (i.e. PDF) or a machine readable code file. An early implementation of this concept was the Sweave system of Leisch which uses R as its programming language and LATEX as its documentation language. Yihui Xie describes his knitr package which builds substantially on Sweave and incorporates many new ideas developed since the initial development of Sweave. Along these lines, Tanu Malik and colleagues describe the Science Object Linking and Embedding framework for creating interactive publications that allow authors to embed various aspects of computational research in document, creating a complete research compendium. Tools
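The "machine readable code file" half of that idea can be sketched in a few lines. Sweave documents mark code chunks noweb-style, opening with a `<<...>>=` line and closing with a line containing only `@`. The toy function below pulls the code out of such a document; it is a simplified illustration written in Python, not how Sweave or knitr are implemented, and it ignores chunk options and chunk reuse. The sample document is made up.

```python
def tangle(literate_text):
    """Collect the code lines found inside noweb-style chunks.

    A chunk starts at a line of the form <<...>>= and ends at a line
    containing only @. Everything outside chunks is treated as prose.
    """
    code_lines = []
    in_chunk = False
    for line in literate_text.splitlines():
        stripped = line.strip()
        if not in_chunk and stripped.startswith("<<") and stripped.endswith(">>="):
            in_chunk = True
        elif in_chunk and stripped == "@":
            in_chunk = False
        elif in_chunk:
            code_lines.append(line)
    return "\n".join(code_lines)


# A made-up literate document mixing prose and one R code chunk.
document = """\
We summarise the measurements before modelling them.
<<summary>>=
x <- c(1, 2, 3)
mean(x)
@
The mean is reported in the next section.
"""

code_only = tangle(document)
print(code_only)
```

The prose stays with the code in the one source document; the extracted file is a by-product that can always be regenerated.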

Of course, we all cringe when we read that a drug company can reproduce only 1/4 of 67 “seminal” studies.

What has me curious is why we don’t have the same reaction when enterprise IT systems require episodic remapping, which requires the mappers to relearn what was known at the time of the last remapping? We all know that enterprise (and other) IT systems change and evolve, but practically speaking, no effort is made to capture the knowledge that would reduce the time and expense of every future remapping.

We can see the expense and danger of science not being reproducible, but when our own enterprise data mappings are not reproducible, that’s just the way things are.

Take inspiration from the movement towards reproducible science and work towards reproducible semantic mappings.

I first saw this in a tweet by Victoria Stodden.

12 Steps for Research Programming

Thursday, March 13th, 2014

How effective is your research programming workflow? by Philip Guo.

From the post:

For my Ph.D. dissertation, I investigated research programming, a common type of programming activity where people write computer programs to obtain insights from data. Millions of professionals in fields ranging from science, engineering, business, finance, public policy, and journalism, as well as numerous students and computer hobbyists, all perform research programming on a daily basis.

Inspired by The Joel Test for rating software engineering teams, here is my informal “Philip test” to determine whether your research programming workflow is effective:

  1. Do you have reliable ways of taking, organizing, and reflecting on notes as you’re working?
  2. Do you have reliable to-do lists for your projects?
  3. Do you write scripts to automate repetitive tasks?
  4. Are your scripts, data sets, and notes backed up on another computer?
  5. Can you quickly identify errors and inconsistencies in your raw data sets?
  6. Can you write scripts to acquire and merge together data from different sources and in different formats?
  7. Do you use version control for your scripts?
  8. If you show analysis results to a colleague and they offer a suggestion for improvement, can you adjust your script, re-run it, and produce updated results within an hour?
  9. Do you use assert statements and test cases to sanity check the outputs of your analyses?
  10. Can you re-generate any intermediate data set from the original raw data by running a series of scripts?
  11. Can you re-generate all of the figures and tables in your research paper by running a single command?
  12. If you got hit by a bus, can one of your lab-mates resume your research where you left off with less than a week of delay?
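Items 9 and 10 in the list above can be sketched together: a short chain of functions that regenerates every intermediate result from the raw data, with assert statements as sanity checks along the way. This is only an illustration under stated assumptions. The data, the column names and the function names (`load_raw`, `clean`, `summarise`, `run_pipeline`) are hypothetical and do not come from Philip's post.

```python
import csv
import io
import statistics

# Hypothetical raw data; in practice this would be a backed-up file.
RAW_CSV = """site,count
A,12
B,7
C,
D,9
"""


def load_raw(text):
    """Step 1: parse the raw CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Step 2: drop rows with a missing count and convert counts to int."""
    return [
        {"site": row["site"], "count": int(row["count"])}
        for row in rows
        if row["count"].strip()
    ]


def summarise(rows):
    """Step 3: compute the summary that would feed a table or figure."""
    counts = [row["count"] for row in rows]
    return {"n": len(counts), "mean": statistics.mean(counts)}


def run_pipeline(text):
    """Regenerate every intermediate result from the raw data in one call."""
    raw = load_raw(text)
    cleaned = clean(raw)
    # Sanity checks in the spirit of item 9.
    assert len(cleaned) <= len(raw), "cleaning must never add rows"
    assert all(row["count"] >= 0 for row in cleaned), "counts cannot be negative"
    summary = summarise(cleaned)
    assert summary["n"] == len(cleaned), "summary must cover every cleaned row"
    return summary


if __name__ == "__main__":
    print(run_pipeline(RAW_CSV))
```

Saved as a script, the whole chain runs with a single command, which is also a small step towards item 11.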

Philip suggests a starting point in his post.

His post alone is pure gold, I would say.

Came to this by following a tweet by Neil Saunders that pointed to: How effective is my research programming workflow? The Philip Test – Part 1 and from there I found the link to Philip’s post.

This sounds a lot like the recent controversy over the ability to duplicate research published in scientific journals. Can someone else replicate your results?

Office of Incisive Analysis

Wednesday, March 12th, 2014

Office of Incisive Analysis Office Wide – Broad Agency Announcement (BAA) IARPA-BAA-14-02
BAA Release Date: March 10, 2014

FedBizOpps Reference

IARPA-BAA-14-02 with all Supporting Documents

From the webpage:


IARPA invests in high-risk, high-payoff research that has the potential to provide our nation with an overwhelming intelligence advantage over future adversaries. This BAA solicits abstracts/proposals for Incisive Analysis.

IA focuses on maximizing insights from the massive, disparate, unreliable and dynamic data that are – or could be – available to analysts, in a timely manner. We are pursuing new sources of information from existing and novel data, and developing innovative techniques that can be utilized in the processes of analysis. IA programs are in diverse technical disciplines, but have common features: (a) Create technologies that can earn the trust of the analyst user by providing the reasoning for results; (b) Address data uncertainty and provenance explicitly.

The following topics (in no particular order) are of interest to IA:

  • Methods for estimation and communication of uncertainty and risk;
  • Methods for understanding the process of analysis and potential impacts of technology;
  • Methods for measuring and improving human judgment and human reasoning;
  • Multidisciplinary approaches to processing noisy audio and speech;
  • Methods and approaches to quantifiable representations of uncertainty simultaneously accounting for multiple types of uncertainty;
  • Discovering, tracking and sorting emerging events and participating entities found in reports;
  • Accelerated system development via machine learning;
  • Testable methods for identifying individuals’ intentions;
  • Methods for developing understanding of how knowledge and ideas are transmitted and change within groups, organizations, and cultures;
  • Methods for analysis of social, cultural, and linguistic data;
  • Methods to construct and evaluate speech recognition systems in languages without a formalized orthography;
  • Multidisciplinary approaches to assessing linguistic data sets;
  • Mechanisms for detecting intentionally falsified representations of events and/or personas;
  • Methods for understanding and managing massive, dynamic data in images, video, and speech;
  • Analysis of massive, unreliable, and diverse data;
  • Methods to make machine learning more useful and automatic;
  • 4D geospatial/temporal representations to facilitate change detection and analysis;
  • Novel approaches for mobile augmented reality applied to analysis and collection;
  • Methods for assessments of relevancy and reliability of new data;
  • Novel approaches to data and knowledge management facilitating discovery, retrieval and manipulation of large volumes of information to provide greater access to interim analytic and processing products.

This announcement seeks research ideas for topics that are not addressed by emerging or ongoing IARPA programs or other published IARPA solicitations. It is primarily, but not solely, intended for early stage research that may lead to larger, focused programs through a separate BAA in the future, so periods of performance generally will not exceed 12 months.

Offerors should demonstrate that their proposed effort has the potential to make revolutionary, rather than incremental, improvements to intelligence capabilities. Research that primarily results in evolutionary improvement to the existing state of practice is specifically excluded.

Contracting Office Address:
Office of Incisive Analysis
Intelligence Advanced Research Projects Activity
Office of the Director of National Intelligence
Washington, DC 20511
Fax: 301-851-7673

Primary Point of Contact:

The “topics … of interest” that caught my eye for topic maps are:

  • Methods for measuring and improving human judgment and human reasoning;
  • Discovering, tracking and sorting emerging events and participating entities found in reports;
  • Methods for developing understanding of how knowledge and ideas are transmitted and change within groups, organizations, and cultures;
  • Methods for analysis of social, cultural, and linguistic data;
  • Novel approaches to data and knowledge management facilitating discovery, retrieval and manipulation of large volumes of information to provide greater access to interim analytic and processing products.

I am thinking of capturing the insights of users, as they use and add content to a topic map, as “evolutionary change.”


Business Information Key Resources

Friday, February 21st, 2014

Business Information Key Resources by Karen Blakeman.

From the post:

On one of my recent workshops I was asked if I used Google as my default search tool, especially when conducting business research. The short answer is “It depends”. The long answer is that it depends on the topic and type of information I am looking for. Yes, I do use Google a lot but if I need to make sure that I have covered as many sources as possible I also use Google alternatives such as Bing, Millionshort, Blekko etc. On the other hand and depending on the type of information I require I may ignore Google and its ilk altogether and go straight to one or more of the specialist websites and databases.

Here are just a few of the free and pay-per-view resources that I use.

Starting points for research are a matter of subject, cost, personal preference, recommendations from others, etc.

What are your favorite starting points for business information?

Medical research—still a scandal

Sunday, February 9th, 2014

Medical research—still a scandal by Richard Smith.

From the post:

Twenty years ago this week the statistician Doug Altman published an editorial in the BMJ arguing that much medical research was of poor quality and misleading. In his editorial entitled, “The Scandal of Poor Medical Research,” Altman wrote that much research was “seriously flawed through the use of inappropriate designs, unrepresentative samples, small samples, incorrect methods of analysis, and faulty interpretation.” Twenty years later I fear that things are not better but worse.

Most editorials like most of everything, including people, disappear into obscurity very fast, but Altman’s editorial is one that has lasted. I was the editor of the BMJ when we published the editorial, and I have cited Altman’s editorial many times, including recently. The editorial was published in the dawn of evidence based medicine as an increasing number of people realised how much of medical practice lacked evidence of effectiveness and how much research was poor. Altman’s editorial with its concise argument and blunt, provocative title crystallised the scandal.

Why, asked Altman, is so much research poor? Because “researchers feel compelled for career reasons to carry out research that they are ill equipped to perform, and nobody stops them.” In other words, too much medical research was conducted by amateurs who were required to do some research in order to progress in their medical careers.

Ethics committees, who had to approve research, were ill equipped to detect scientific flaws, and the flaws were eventually detected by statisticians, like Altman, working as firefighters. Quality assurance should be built in at the beginning of research not the end, particularly as many journals lacked statistical skills and simply went ahead and published misleading research.

If you are thinking things are better today, consider a further comment from Richard:

The Lancet has this month published an important collection of articles on waste in medical research. The collection has grown from an article by Iain Chalmers and Paul Glasziou in which they argued that 85% of expenditure on medical research ($240 billion in 2010) is wasted. In a very powerful talk at last year’s peer review congress John Ioannidis showed that almost none of thousands of research reports linking foods to conditions are correct and how around only 1% of thousands of studies linking genes with diseases are reporting linkages that are real. His famous paper “Why most published research findings are false” continues to be the most cited paper of PLoS Medicine.

Not that I think open access would be a panacea for poor research quality but at least it would provide the opportunity for discovery.

All this talk about medical research reminds me of DARPA’s Big Mechanism program. Assuming the research data on pathways is no better and no worse than the data mapping genes to diseases, DARPA will be spending $42 million to mine data with 1% accuracy.

A better use of those “Big Mechanism” dollars would be to test solutions to produce better medical research for mining.

1% sounds like low-grade ore to me.

Docear 1.0 (stable),…

Thursday, October 17th, 2013

Docear 1.0 (stable), a new video, new manual, new homepage, new details page, … by Joeran Beel.

From the post:

It’s been almost two years since we released the first private Alpha of Docear and today, October 17 2013, Docear 1.0 (stable) is finally available for Windows, Mac, and Linux to download. We are really proud of what we accomplished in the past years and we think that Docear is better than ever. In addition to all the enhancements we made during the past years, we completely rewrote the manual with step-by-step instructions including an overview of supported PDF viewers, we changed the homepage, we created a new video, and we made the features & details page much more comprehensive. For those who already use Docear 1.0 RC4, there are not many changes (just a few bug fixes). For new users, we would like to explain what Docear is and what makes it so special.

Docear is a unique solution to academic literature management that helps you to organize, create, and discover academic literature. The three most distinct features of Docear are:

  1. A single-section user-interface that differs significantly from the interfaces you know from Zotero, JabRef, Mendeley, Endnote, … and that allows a more comprehensive organization of your electronic literature (PDFs) and the annotations you created (i.e highlighted text, comments, and bookmarks).
  2. A ‘literature suite concept’ that allows you to draft and write your own assignments, papers, theses, books, etc. based on the annotations you previously created.
  3. A research paper recommender system that allows you to discover new academic literature.

Aside from Docear’s unique approach, Docear offers many features more. In particular, we would like to point out that Docear is free, open source, not evil, and Docear gives you full control over your data. Docear works with standard PDF annotations, so you can use your favorite PDF viewer. Your reference data is directly stored as BibTeX (a text-based format that can be read by almost any other reference manager). Your drafts and folders are stored in Freeplane’s XML format, again a text-based format that is easy to process and understood by several other applications. And although we offer several online services such as PDF metadata retrieval, backup space, and online viewer, we do not force you to register. You can just install Docear on your computer, without any registration, and use 99% of Docear’s functionality.

But let’s get back to Docear’s unique approach for literature management…
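The remark that BibTeX is a text-based format readable by almost any other tool is easy to check. The sketch below reads one simple entry with two regular expressions. The entry itself is invented for illustration, `parse_entry` is a hypothetical helper rather than anything Docear provides, and nested braces, quoted values and string macros are deliberately not handled.

```python
import re

# Hypothetical BibTeX entry, invented for illustration only.
ENTRY = """@article{example2013,
  author = {A. Author},
  title = {An Example Title},
  year = {2013}
}"""

# "@type{key," at the start of the entry.
HEADER = re.compile(r"@(\w+)\{([^,]+),")
# "name = {value}" pairs whose values contain no nested braces.
FIELD = re.compile(r"(\w+)\s*=\s*\{([^{}]*)\}")


def parse_entry(text):
    """Parse one simple entry into a dictionary of its fields."""
    header = HEADER.match(text.strip())
    if header is None:
        raise ValueError("not a BibTeX entry")
    fields = {name.lower(): value for name, value in FIELD.findall(text)}
    return {"type": header.group(1), "key": header.group(2), **fields}


record = parse_entry(ENTRY)
print(record["title"])  # prints An Example Title
```

For real reference libraries a dedicated BibTeX parser is the safer choice. The point here is only that nothing about the storage format locks the data in.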

Impressive “academic literature management” package!

I have done a lot of research over the years, but largely without the aid of citation management software. Perhaps it is time to try a new approach.

From just scanning the documentation, it does not appear that I can share my Docear annotations with another user.

Unless we were fortunate enough to have used the same terminology the same way while doing our research.

That is to say any research project I undertake will result in the building of a silo that is useful to me, but that others will have to duplicate.

If true (I only scanned the documentation), that is an observation and not a criticism.

I will keep track of my experience with a view towards suggesting changes that could make Docear more transparent.


Saturday, September 28th, 2013

MANTRA: Free, online course on how to manage digital data by Sarah Dister.

From the post:

Research Data MANTRA is a free, online course with guidelines on how to manage the data you collect throughout your research. The course is particularly appropriate for those who work or are planning to work with digital data.

Once you have finalized the course, you will:

  • Be aware of the risk of data loss and data protection requirements.
  • Know how to store and transport your data safely and securely (backup and encryption).
  • Have experience in using data in software packages such as R, SPSS, NVivo, or ArcGIS.
  • Recognise the importance of good research data management practice in your own context.
  • Be able to devise a research data management plan and apply it throughout the project’s life.
  • Be able to organise and document your data efficiently during the course of your project.
  • Understand the benefits of sharing data and how to do it legally and ethically.

Data management may not be as sexy as “big data” but without it, there would be no “big data” to make some of us sexy. 😉

NSA — Untangling the Web: A Guide to Internet Research

Wednesday, May 15th, 2013

NSA — Untangling the Web: A Guide to Internet Research

A Freedom of Information Act (FOIA) request caused the NSA to disgorge its guide to web research, which is some six years out of date.

From the post:

The National Security Agency just released “Untangling the Web,” an unclassified how-to guide to Internet search. It’s a sprawling document, clocking in at over 650 pages, and is the product of many years of research and updating by a NSA information specialist whose name is redacted on the official release, but who is identified as Robyn Winder of the Center for Digital Content on the Freedom of Information Act request that led to its release.

It’s a droll document on many levels. First and foremost, it’s funny to think of officials who control some of the most sophisticated supercomputers and satellites ever invented turning to a .pdf file for tricks on how to track down domain name system information on an enemy website. But “Untangling the Web” isn’t for code-breakers or wire-tappers. The target audience seems to be staffers looking for basic factual information, like the preferred spelling of Kazakhstan, or telephonic prefix information for East Timor.

I take it as guidance on how “good” your application or service needs to be to pitch to the government.

I keep thinking that, to attract government attention, an application needs to fall just short of solving P = NP.

On the contrary, the government needs spell checkers, phone information and no doubt lots of other dull information, quickly.

Perhaps an app that signals fresh doughnuts from bakeries within X blocks would be just the thing. 😉

Google’s Hybrid Approach to Research [Lessons For Topic Map Research?]

Friday, November 2nd, 2012

Google’s Hybrid Approach to Research by Alfred Spector, Peter Norvig, and Slav Petrov.

From the start of the article:

In this Viewpoint, we describe how we organize computer science research at Google. We focus on how we integrate research and development and discuss the benefits and risks of our approach. The challenge in organizing R&D is great because CS is an increasingly broad and diverse field. It combines aspects of mathematical reasoning, engineering methodology, and the empirical approaches of the scientific method. The empirical components are clearly on the upswing, in part because the computer systems we construct have become so large that analytic techniques cannot properly describe their properties, because the systems now dynamically adjust to the difficult-to-predict needs of a diverse user community, and because the systems can learn from vast datasets and large numbers of interactive sessions that provide continuous feedback.

We have also noted that CS is an expanding sphere, where the core of the field (theory, operating systems, and so forth) continues to grow in depth, while the field keeps expanding into neighboring application areas. Research results come not only from universities, but also from companies, both large and small. The way research results are disseminated is also evolving and the peer-reviewed paper is under threat as the dominant dissemination method. Open source releases, standards specifications, data releases, and novel commercial systems that set new standards upon which others then build are increasingly important.

This seems particularly useful:

Thus, we have structured the Google environment as one where new ideas can be rapidly verified by small teams through large-scale experiments on real data, rather than just debated. The small-team approach benefits from the services model, which enables a few engineers to create new systems and put them in front of users.

Particularly in terms of research and development for topic maps.

I confess to a fondness for the “…just debated” side but point out that developers aren’t users, whether for interface requirements or software capabilities.

Selling what you have debated or written isn’t the same thing as selling what customers want. You can verify that lesson with the Semantic Web folks.

Semantic impedance is going to grow along with “big data.”

Topic maps need to be poised to deliver a higher ROI in resolving semantic impedance than ad hoc solutions. And to deliver that ROI in the context of “big data” tools.

Research dead ahead.

Book Review – “Universal Methods of Design”

Saturday, September 1st, 2012

Book Review – “Universal Methods of Design” by Cyd Harrell.

From the review:

I’ve never been one to use a lot of inspirational tools, like decks of design method cards. Day to day, I figure I have a very solid understanding of core practices and can make others up if I need to. But I’ve also been the leader of a fast-paced team that has been asked to solve all kinds of difficult problems through research and design, so sticking to my personal top five techniques was never an option. After all, only the most basic real-world research goals can be attained without combining and evolving methods.

So I was quite intrigued when I received a copy of Bella Martin and Bruce Hanington’s Universal Methods of Design, which presents summaries of 100 different research and analysis methods as two-page spreads in a nice, large-format hardback. Could this be the ideal reference for a busy research team with a lot of chewy problems to solve?

In short: yes. It functions as a great reference when we hear of a method none of us is familiar with, but more importantly it’s an excellent “unsticker” when we run into a challenge in the design or analysis of a study. I have a few quibbles with organization that I’ll get to in a minute, but in general this is a book that every research team should have on hand.

See the review for Cyd’s quibble.

For a copy near you, see: “Universal Methods of Design.”

Data-Intensive Librarians for Data-Intensive Research

Friday, August 10th, 2012

Data-Intensive Librarians for Data-Intensive Research by Chelcie Rowell.

From the post:

A packed house heard Tony Hey and Clifford Lynch present on The Fourth Paradigm: Data-Intensive Research, Digital Scholarship and Implications for Libraries at the 2012 ALA Annual Conference.

Jim Gray coined The Fourth Paradigm in 2007 to reflect a movement toward data-intensive science. Adapting to this change would, Gray noted, require an infrastructure to support the dissemination of both published work and underlying research data. But the return on investment for building the infrastructure would be to accelerate the transformation of raw data to recombined data to knowledge.

In outlining the current research landscape, Hey and Lynch underscored how right Gray was.

Hey led the audience on a whirlwind tour of how scientific research is practiced in the Fourth Paradigm. He showcased several projects that manage data from capture to curation to analysis and long-term preservation. One example he mentioned was the Dataverse Network Project that is working to preserve diverse scholarly outputs from published work to data, images and software.

Lynch reflected on the changing nature of the scientific record and the different collaborative structures that will be needed to define, generate and preserve that record. He noted that we tend to think of the scholarly record in terms of published works. In light of data-intensive science, Lynch said the definition must be expanded to include the datasets which underlie results and the software required to render data.

I wasn’t able to find a video of the presentations and/or slides but while you wait for those to appear, you can consult the homepages of Lynch and Hey for related materials.

Librarians already have searching and bibliographic skills, which are appropriate to the Fourth Paradigm.

What if they were to add big data design, if not processing, skills to their resumes?

What if articles in professional journals carried a byline in addition to the authors: Librarian(s): ?

NSF, NIH to Hold Webinar on Big Data Solicitation

Monday, April 30th, 2012

NSF, NIH to Hold Webinar on Big Data Solicitation by Erwin Gianchandani.

Guidance on BIGDATA Solicitation

<= $25 Million Webinar: Tuesday, May 8th, from 11am to 12pm ET. Registration closes 11:59pm PDT on Monday, May 7th.

From the post:

Late last month, the Administration unveiled a $200 million Big Data R&D Initiative, committing new funding to improve “our ability to extract knowledge and insights from large and complex collections of digital data.” The initiative includes a joint solicitation by the National Science Foundation (NSF) and National Institutes of Health (NIH), providing up to $25 million for Core Techniques and Technologies for Advancing Big Data Science and Engineering (BIGDATA). Now NSF and NIH have announced a webinar “to describe the goals and focus of the BIGDATA solicitation, help investigators understand its scope, and answer any questions potential Principal Investigators (PIs) may have.” The webinar will take place next week — on Tuesday, May 8th, from 11am to 12pm ET.

So, how clever are you really?

(The post has links to other materials you probably need to read before the webinar.)

Google in the World of Academic Research (Lead by Example?)

Thursday, April 5th, 2012

Google in the World of Academic Research by Whitney Grace.

From the post:

Librarians, teachers, and college professors all press their students not to use Google to research their projects, papers, and homework, but it is a dying battle. All students have to do is type in a few key terms and millions of results are displayed. The average student or person, for that matter, is not going to scour through every single result. If they do not find what they need, they simply rethink their initial key words and hit the search button again.

The Hindu recently wrote about, “Of Google and Scholarly Search,” the troubles researchers face when they only use Google and makes several suggestions for alternate search engines and databases.

The perennial complaint (academics used to debate the perennial philosophy, now the perennial complaint).

Is Google responsible for superficial searching and consequently superficial results?

Or do superficial Google results reflect our failure to train students in “doing” research?

What models of research behavior do students have to follow?

In my next course, I will do a research problem by example. Good as well as bad results. What worked and what didn’t. And yes, Google will be in the mix of methods.

Why not? With four- and five-word queries and domain knowledge, I get pretty good results from Google. You?

Research Tip: Conference Proceedings (ACM DL)

Monday, January 2nd, 2012

To verify the expansion of the acronyms for Jeff Huang’s Best Paper Awards in Computer Science [2011], I used the ACM Digital Library.

If the conference is listed under conferences in the Digital Library, following the link results in a listing of the top ten (10) paper downloads in the last six (6) weeks and the top ten (10) “most cited article” listings.

Be aware it isn’t always the most recent papers that are the most downloaded.

Another way to keep abreast of what is of interest in a particular area of computing.

Lifting the veil on my “system”

Sunday, December 11th, 2011

Lifting the veil on my “system” by Meredith Farkas.

From the post:

I am a huge fan of research log and research process reflection assignments. Because research is a means to an end (the paper) and because people are often doing it in a rush, there is little reflection on process. What worked? What didn’t? What can I take from this experience for the next time I have to do something similar? Because this reflection is not usually written into the curriculum, students don’t learn enough from their mistakes or even the good things they did. Having a research log helps students become better researchers in the future and, most importantly, helps them to develop a “system” that works for them.

I definitely remember the many years that I did not have a system for research and writing. Most reference librarians have probably encountered a frantic student who realizes just before his/her paper is due that s/he can’t track down some of the sources they need to cite. Yeah, that was me (though I would have been too embarrassed to come to the reference desk). I probably never followed the same path twice and wasted a lot of time doing things over again because I wasn’t organized. Looking back, I wish a nice librarian had provided a session for me on developing a system for finding, organizing, reading and synthesizing information, because I wasted a lot of time and sweat needlessly.

What do you think? Would a topic mapping tool do better? Worse? About the same?

While you are at it, give Meredith some feedback as well.