Spoiler Alert: This paper discusses a possible find of anthrax and plague DNA in the New York Subway. It concludes that the signal came from a related but harmless strain the software did not consider, from random sequencing error, or both. Either way, it is a textbook example of the need for data skepticism.
Searching for anthrax in the New York City subway metagenome by Robert A. Petit, III, Matthew Ezewudo, Sandeep J. Joseph, and Timothy D. Read.
From the introduction:
In February 2015 Chris Mason and his team published an in-depth analysis of metagenomic data (environmental shotgun DNA sequence) from samples isolated from public surfaces in the New York City (NYC) subway system. Along with a ton of really interesting findings, the authors claimed to have detected DNA from the bacterial biothreat pathogens Bacillus anthracis (which causes anthrax) and Yersinia pestis (causes plague) in some of the samples. This predictably led to a huge interest from the press and scientists on social media. The authors followed up with a re-analysis of the data on microbe.net, where they showed some results that suggested the tools that they were using for species identification overcalled anthrax and plague.
…
The NYC subway metagenome study raised very timely questions about using unbiased DNA sequencing for pathogen detection. We were interested in this dataset as soon as the publication appeared and started looking deeper into why the analysis software gave false positive results and indeed what exactly was found in the subway samples. We decided to wrap up the results of our preliminary analysis and put it on this site. This report focuses on the results for B. anthracis but we also did some preliminary work on Y. pestis and may follow up on this later.
…
The article gives a detailed accounting of the tools and issues involved in identifying DNA fragments from pathogens. It is hard-core science, but it also illustrates how iffy hard-core science can be. Sure, you have the data, but that doesn't mean you will reach the correct conclusion from it.
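To make the overcalling problem concrete: one common family of species-identification tools matches short, fixed-length substrings (k-mers) of each sequencing read against a reference database. Here is a minimal sketch, using made-up toy sequences (not real genomes, and not the authors' actual pipeline), of why such a tool can report a pathogen when the reads actually come from a harmless near relative:

```python
# Toy illustration of pathogen overcalling: the sequences below are
# made up and far shorter than real genomes, but the failure mode is
# the same one that can flag B. anthracis in mixed environmental reads.

def kmers(seq, k=8):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

# Two hypothetical "genomes": the harmless relative differs from the
# pathogen at a single base, mimicking how similar B. cereus group
# members are to B. anthracis across most of the chromosome.
anthracis = "ATGCGTACGTTAGCCGATACGGATTACGCTAGGCTAACGT"
cereus    = "ATGCGTACGTTAGCCGATACGGATTACGCTAGGCTTACGT"

# The reference database only knows about the pathogen.
db = {"B. anthracis": kmers(anthracis)}

def classify(read, database, k=8):
    """Count how many of the read's k-mers hit each reference."""
    read_kmers = kmers(read, k)
    return {name: len(read_kmers & ref) for name, ref in database.items()}

# A sequencing read drawn from the HARMLESS relative...
read = cereus[5:30]
print(classify(read, db))  # {'B. anthracis': 18}
# ...matches every "anthrax" k-mer, so a naive classifier reports
# B. anthracis even though no pathogen DNA was present.
```

Because B. anthracis is nearly identical to its harmless relatives over most of the genome, a database that omits those relatives, or a classifier that ignores the few truly discriminating markers, will happily report anthrax. That is the kind of pitfall Petit et al. dig into.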
The authors also mention a follow-up post by Chris Mason, the author of the original paper, entitled:
The long road from Data to Wisdom, and from DNA to Pathogen by Christopher Mason.
From the introduction:
There is an oft-cited progression from Data to Information to Knowledge to Wisdom (DIKW). Just because you have data does not mean you have information: it takes some processing to get quality information, and even good information is not necessarily knowledge, and knowledge often requires context or application to become wisdom.
And from his conclusion:
But, perhaps the bigger issue is social. I confess I grossly underestimated how the press would sensationalize these results, and even the Department of Health (DOH) did not believe it, claiming it simply could not be true. We sent the MTA and the DOH our first draft upon submission in October 2014, the raw and processed data, as well as both of our revised drafts in December 2014 and January 2015, and we did get some feedback, but they also had other concerns at the time (Ebola guy in the subway). This is also different from what they normally do (PCR for specific targets), so we both learned from each other. Yet, upon publication, it was clear that Twitter and blogs provided some of the same scrutiny as the three reviewers during the two rounds of peer review. But, they went even deeper and dug into the raw data, within hours of the paper coming online, and I would argue that online reviewers have become an invaluable part of scientific publishing. Thus, published work is effectively a living entity before (bioRxiv), during (online), and after publication (WSJ, Twitter, and others), and online voices constitute a critical, ensemble 4th reviewer.
Going forward, the transparency of the methods, annotations, algorithms, and techniques has never been more essential. To this end, we have detailed our work in the supplemental methods, but we have also posted complete pipelines in this blog post on how to go from raw data to annotated species on Galaxy. Even better, the precise virtual machines and instantiation of what was run on a server need to be tracked and guaranteed to be 100% reproducible. For example, for our .vcf characterizations of the human alleles, we have set up our entire pipeline on Arvados/Curoverse, free to use, so that anyone can take a .vcf file and run the exact same ancestry analyses and get the exact same results. Eventually, tools like this can automate and normalize computational aspects of metagenomics work, which is an increasingly important component of genomics.
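Mason's point about tracking "the precise virtual machines and instantiation of what was run" is easy to illustrate at a small scale. Here is a minimal sketch (hypothetical, not his Arvados/Curoverse setup; the file names are placeholders) of one ingredient of reproducibility: fingerprinting the exact inputs and outputs of a run so anyone can verify that re-running the pipeline yields byte-identical results:

```python
# Minimal provenance sketch: hash every input and output of a pipeline
# run so two runs can be compared byte-for-byte.

import hashlib
import json
import sys

def sha256_of(path, chunk=1 << 20):
    """Stream a file and return its SHA-256 hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def manifest(inputs, outputs):
    """Record content digests plus the interpreter version for one run."""
    return {
        "python": sys.version,
        "inputs": {p: sha256_of(p) for p in inputs},
        "outputs": {p: sha256_of(p) for p in outputs},
    }

# Hypothetical usage -- file names are placeholders, not Mason's data:
#   m = manifest(["sample.vcf"], ["ancestry_report.json"])
#   print(json.dumps(m, indent=2, sort_keys=True))
```

Two runs that produce matching manifests are, at least at the file level, the same computation; any divergence points to a change in data, code, or environment.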
Mason’s
Data –> Information –> Knowledge –> Wisdom (DIKW)
sounds like:
evidence-based data science
to me.
Another quick point: Chris Mason and his team made all their data available for others to review, and Chris states that informal review was a valuable contributor to the scientific process.
That is an illustration of the value of transparency. Contrast that with the Obama administration's default position of opacity. Which one do you think serves a fact-finding process better?
Perhaps that is the answer. The Obama administration isn't interested in a fact-finding process. It has found the "facts" that it wants and reached its desired conclusions. What is there left to question or discuss?