Archive for the ‘DNA’ Category

DNA Injection Attack (Shellcode in Data)

Thursday, August 10th, 2017

BioHackers Encoded Malware in a String of DNA by Andy Greenberg.

From the post:

WHEN BIOLOGISTS SYNTHESIZE DNA, they take pains not to create or spread a dangerous stretch of genetic code that could be used to create a toxin or, worse, an infectious disease. But one group of biohackers has demonstrated how DNA can carry a less expected threat—one designed to infect not humans nor animals but computers.

In new research they plan to present at the USENIX Security conference on Thursday, a group of researchers from the University of Washington has shown for the first time that it’s possible to encode malicious software into physical strands of DNA, so that when a gene sequencer analyzes it the resulting data becomes a program that corrupts gene-sequencing software and takes control of the underlying computer. While that attack is far from practical for any real spy or criminal, it’s one the researchers argue could become more likely over time, as DNA sequencing becomes more commonplace, powerful, and performed by third-party services on sensitive computer systems. And, perhaps more to the point for the cybersecurity community, it also represents an impressive, sci-fi feat of sheer hacker ingenuity.

“We know that if an adversary has control over the data a computer is processing, it can potentially take over that computer,” says Tadayoshi Kohno, the University of Washington computer science professor who led the project, comparing the technique to traditional hacker attacks that package malicious code in web pages or an email attachment. “That means when you’re looking at the security of computational biology systems, you’re not only thinking about the network connectivity and the USB drive and the user at the keyboard but also the information stored in the DNA they’re sequencing. It’s about considering a different class of threat.”

Very high marks for imaginative delivery but at its core, this is shellcode in data.

Shellcode in an environment the authors describe as follows:

Our results, and particularly our discovery that bioinformatics software packages do not seem to be written with adversaries in mind, suggest that the bioinformatics pipeline has to date not received significant adversarial pressure.

(Computer Security, Privacy, and DNA Sequencing: Compromising Computers with Synthesized DNA, Privacy Leaks, and More)

Question: Can you name any data pipelines that have been subjected to adversarial pressure?

The reading of DNA and transposition into machine format reminds me that a data pipeline could ingest apparently non-hostile data and as a result of transformations/processing, produce hostile data at some point in the data stream.

Transformation into shellcode, now that’s a very interesting concept.

Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform

Wednesday, May 2nd, 2012

Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform by Anthony J. Cox, Markus J. Bauer, Tobias Jakobi, and Giovanna Rosone.



The Burrows-Wheeler transform (BWT) is the foundation of many algorithms for compression and indexing of text data, but the cost of computing the BWT of very large string collections has prevented these techniques from being widely applied to the large sets of sequences often encountered as the outcome of DNA sequencing experiments. In previous work, we presented a novel algorithm that allows the BWT of human genome scale data to be computed on very moderate hardware, thus enabling us to investigate the BWT as a tool for the compression of such datasets.


We first used simulated reads to explore the relationship between the level of compression and the error rate, the length of the reads and the level of sampling of the underlying genome and compare choices of second-stage compression algorithm.

We demonstrate that compression may be greatly improved by a particular reordering of the sequences in the collection and give a novel `implicit sorting’ strategy that enables these benefits to be realised without the overhead of sorting the reads. With these techniques, a 45x coverage of real human genome sequence data compresses losslessly to under 0.5 bits per base, allowing the 135.3Gbp of sequence to fit into only 8.2Gbytes of space (trimming a small proportion of low-quality bases from the reads improves the compression still further).

This is more than 4 times smaller than the size achieved by a standard BWT-based compressor (bzip2) on the untrimmed reads, but an important further advantage of our approach is that it facilitates the building of compressed full text indexes such as the FM-index on large-scale DNA sequence collections.

Important work for several reasons.

First, if the human genome is thought of as “big data,” it opens the possibility that compressed full text indexes can be build for other instances of “big data.”

Second, indexing is similar to topic mapping in the sense that pointers to information about a particular subject are gathered to a common location. Indexes often account for synonyms (see also) and distinguish the use of the same word for different subjects (polysemy).

Third, depending on the granularity of tokenizing and indexing, index entries should be capable of recombination to create new index entries.

Source code for this approach:

Code to construct the BWT and SAP-array on large genomic data sets is part of the BEETL library, available as a github respository at