Archive for the ‘Data Auditing’ Category

Data Auditing and Contamination in Genome Databases

Thursday, October 2nd, 2014

Contamination of genome databases highlight the need for data auditing trails.


Abundant Human DNA Contamination Identified in Non-Primate Genome Databases by Mark S. Longo, Michael J. O’Neill, Rachel J. O’Neill ( (Longo MS, O’Neill MJ, O’Neill RJ (2011) Abundant Human DNA Contamination Identified in Non-Primate Genome Databases. PLoS ONE 6(2): e16410. doi:10.1371/journal.pone.0016410) (herein, Longo.

During routine screens of the NCBI databases using human repetitive elements we discovered an unlikely level of nucleotide identity across a broad range of phyla. To ascertain whether databases containing DNA sequences, genome assemblies and trace archive reads were contaminated with human sequences, we performed an in depth search for sequences of human origin in non-human species. Using a primate specific SINE, AluY, we screened 2,749 non-primate public databases from NCBI, Ensembl, JGI, and UCSC and have found 492 to be contaminated with human sequence. These represent species ranging from bacteria (B. cereus) to plants (Z. mays) to fish (D. rerio) with examples found from most phyla. The identification of such extensive contamination of human sequence across databases and sequence types warrants caution among the sequencing community in future sequencing efforts, such as human re-sequencing. We discuss issues this may raise as well as present data that gives insight as to how this may be occurring.

Mining of public sequencing databases supports a non-dietary origin for putative foreign miRNAs: underestimated effects of contamination in NGS. by Tosar JP, Rovira C, Naya H, Cayota A. (RNA. 2014 Jun;20(6):754-7. doi: 10.1261/rna.044263.114. Epub 2014 Apr 11.)

The report that exogenous plant miRNAs are able to cross the mammalian gastrointestinal tract and exert gene-regulation mechanism in mammalian tissues has yielded a lot of controversy, both in the public press and the scientific literature. Despite the initial enthusiasm, reproducibility of these results was recently questioned by several authors. To analyze the causes of this unease, we searched for diet-derived miRNAs in deep-sequencing libraries performed by ourselves and others. We found variable amounts of plant miRNAs in publicly available small RNA-seq data sets of human tissues. In human spermatozoa, exogenous RNAs reached extreme, biologically meaningless levels. On the contrary, plant miRNAs were not detected in our sequencing of human sperm cells, which was performed in the absence of any known sources of plant contamination. We designed an experiment to show that cross-contamination during library preparation is a source of exogenous RNAs. These contamination-derived exogenous sequences even resisted oxidation with sodium periodate. To test the assumption that diet-derived miRNAs were actually contamination-derived, we sought in the literature for previous sequencing reports performed by the same group which reported the initial finding. We analyzed the spectra of plant miRNAs in a small RNA sequencing study performed in amphioxus by this group in 2009 and we found a very strong correlation with the plant miRNAs which they later reported in human sera. Even though contamination with exogenous sequences may be easy to detect, cross-contamination between samples from the same organism can go completely unnoticed, possibly affecting conclusions derived from NGS transcriptomics.

Whether the contamination of these databases is significant or not, is a matter for debate. See the comments to Longo.

Even if errors are “easy to spot,” the question remains for both users and curators of these databases, how to provide data auditing for corrections/updates?

At a minimum, one would expect to know:

  • Database/dataset values for any given date?
  • When values changed?
  • What values changed?
  • Who changed those values?
  • On what basis were the changes made?
  • Comments on the changes
  • Links to literature concerning the changes
  • Do changes have an “audit” trail that includes both the original and new values?

If there is no “audit” trail, on what basis would I “trust” the data on a particular date?

Suggestions on current correction practices?

I first saw this in a post by Mick Watson.