## Archive for the ‘Compression’ Category

### Vulnerable 7-Zip As Poster Child For Open Source

Friday, May 13th, 2016

From the post:

But users be warned. Cisco Talos recently discovered multiple vulnerabilities in 7-Zip that are more serious than regular security flaws. As explained in a blog post by Marcin Noga and Jaeson Schultz, two members of the Cisco Talos Security Intelligence & Research Group:

“These type of vulnerabilities are especially concerning since vendors may not be aware they are using the affected libraries. This can be of particular concern, for example, when it comes to security devices or antivirus products. 7-Zip is supported on all major platforms, and is one of the most popular archive utilities in-use today. Users may be surprised to discover just how many products and appliances are affected.”

Cisco Talos has identified two flaws in particular. The first (CVE-2016-2335) is an out-of-bounds read vulnerability that exists in the way 7-Zip handles Universal Disk Format (UDF) files. An attacker could potentially exploit this vulnerability to achieve arbitrary code execution.

The “many products and appliances” link results in:

If you use the suggested search string:

Every instance of software running a vulnerable 7-Zip library is subject to this hack. A number likely larger than the total 2,490,000 shown by these two searches.

For open source software, you can check to see if it has been upgraded to 7-Zip, version 16.0.

If you have non-open source software, how are you going to check for the upgrade?

Given the lack of liability under the usual EULA, are you really going to take a vendor’s word for the upgrade?

The vulnerable 7-Zip library is a great poster child for open source software.

Not only for the discovery of flaws but to verify vendors have properly patched those flaws.

### How gzip uses Huffman coding

Monday, February 23rd, 2015

How gzip uses Huffman coding by Julia Evans.

From the post:

I wrote a blog post quite a while ago called gzip + poetry = awesome where I talked about how the gzip compression program uses the LZ77 algorithm to identify repetitions in a piece of text.

In case you don’t know what LZ77 is (I sure didn’t), here’s the video from that post that gives you an example of gzip identifying repetitions in a poem!

Julia goes beyond the video to illustrate how Huffman encoding is used by gzip to compress a text.

She includes code, pointers to other resources, basically all you need to join her in exploring the topic at hand. An education style that many manuals and posts would do well to adopt.
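For readers who want something to poke at before following the link, here is a minimal sketch of textbook Huffman coding in Python. It is not Julia's code and not gzip's actual format: DEFLATE combines LZ77 with length-limited canonical Huffman codes, which this sketch ignores. The function names (`huffman_codes`, `encode`, `decode`) are made up for illustration.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Return a dict mapping each symbol in `text` to its Huffman bit string."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # Degenerate case: a single distinct symbol still needs one bit.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, tie_breaker, {symbol: code_so_far}).
    # The unique tie_breaker keeps the dicts from ever being compared.
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; prefix their codes with 0 and 1.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def encode(text, codes):
    """Concatenate the code of every symbol in `text`."""
    return "".join(codes[ch] for ch in text)


def decode(bits, codes):
    """Invert `encode`; works because Huffman codes are prefix-free."""
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)


codes = huffman_codes("abracadabra")
bits = encode("abracadabra", codes)
print(codes, len(bits))
```

Frequent symbols end up with short codes and rare ones with long codes, which is the whole trick. The exact bit patterns depend on how ties are broken, but the total encoded length does not.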

### Compressed Text Indexes: From Theory to Practice!

Friday, October 3rd, 2014

Abstract:

A compressed full-text self-index represents a text in a compressed form and still answers queries efficiently. This technology represents a breakthrough over the text indexing techniques of the previous decade, whose indexes required several times the size of the text. Although it is relatively new, this technology has matured up to a point where theoretical research is giving way to practical developments. Nonetheless this requires significant programming skills, a deep engineering effort, and a strong algorithmic background to dig into the research results. To date only isolated implementations and focused comparisons of compressed indexes have been reported, and they missed a common API, which prevented their re-use or deployment within other applications.

The goal of this paper is to fill this gap. First, we present the existing implementations of compressed indexes from a practitioner’s point of view. Second, we introduce the Pizza&Chili site, which offers tuned implementations and a standardized API for the most successful compressed full-text self-indexes, together with effective testbeds and scripts for their automatic validation and test. Third, we show the results of our extensive experiments on these codes with the aim of demonstrating the practical relevance of this novel and exciting technology.

A bit dated (2007) but definitely worth your attention. The “cited-by” results from the ACM Digital Library will bring you up to date.

BTW, I was pleased to find the Pizza&Chili Corpus: Compressed Indexes and their Testbeds. Both the Italian and Chilean mirrors are still online!

I have seen document links survive that long but rarely an online testbed.

### Kolmogorov Complexity – A Primer

Thursday, December 6th, 2012

Kolmogorov Complexity – A Primer by Jeremy Kun.

From the post:

Previously on this blog (quite a while ago), we’ve investigated some simple ideas of using randomness in artistic design (psychedelic art, and earlier randomized css designs), and measuring the complexity of such constructions. Here we intend to give a more thorough and rigorous introduction to the study of the complexity of strings. This naturally falls into the realm of computability theory and complexity theory, and so we refer the novice reader to our other primers on the subject (Determinism and Finite Automata, Turing Machines, and Complexity Classes; but Turing machines will be the most critical to this discussion).

Jeremy lays the groundwork necessary for a later post in this series (covering machine learning).

Digest this for a couple of days and I will point out the second post.
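While digesting, one hands-on way to get a feel for the idea: Kolmogorov complexity itself is uncomputable, but the output size of any real compressor gives an upper bound on it (up to a constant for the decompressor). The sketch below is my own illustration, not taken from Jeremy's post. It uses Python's standard `zlib` module, and the helper name `compressed_size` is made up.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed form of `data` (level 9)."""
    return len(zlib.compress(data, 9))


# A highly regular string: "ab" repeated 5,000 times (10,000 bytes).
regular = b"ab" * 5000

# Pseudo-random bytes of the same length; the seed is arbitrary.
rng = random.Random(0)
noisy = bytes(rng.getrandbits(8) for _ in range(10000))

print(len(regular), compressed_size(regular))
print(len(noisy), compressed_size(noisy))
```

The regular string shrinks to a tiny fraction of its length, while the pseudo-random one barely shrinks at all. Note the catch: the pseudo-random bytes come from a short program plus a seed, so their true complexity is low. zlib simply cannot find that description, which is why compression only ever bounds the complexity from above.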

### Compressive Genomics [Compression as Merging]

Wednesday, July 11th, 2012

Compressive genomics by Po-Ru Loh, Michael Baym, and Bonnie Berger (Nature Biotechnology 30, 627–630 (2012) doi:10.1038/nbt.2241)

From the introduction:

In the past two decades, genomic sequencing capabilities have increased exponentially[cites omitted] outstripping advances in computing power[cites omitted]. Extracting new insights from the data sets currently being generated will require not only faster computers, but also smarter algorithms. However, most genomes currently sequenced are highly similar to ones already collected[cite omitted]; thus, the amount of new sequence information is growing much more slowly.

Here we show that this redundancy can be exploited by compressing sequence data in such a way as to allow direct computation on the compressed data using methods we term ‘compressive’ algorithms. This approach reduces the task of computing on many similar genomes to only slightly more than that of operating on just one. Moreover, its relative advantage over existing algorithms will grow with the accumulation of genomic data. We demonstrate this approach by implementing compressive versions of both the Basic Local Alignment Search Tool (BLAST)[cite omitted] and the BLAST-Like Alignment Tool (BLAT)[cite omitted], and we emphasize how compressive genomics will enable biologists to keep pace with current data.

Software available at: Compression-accelerated BLAST and BLAT.

A new line of attack on searching “big data.”

Making “big data” into “smaller data” and enabling analysis of it while still “smaller data.”

Enabling the searching of highly similar genomes by compression is a form of merging, isn’t it? That is, a sequence (read subject) that occurs multiple times over similar genomes is given a single representative, while preserving its relationship to all the individual genome instances.
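A toy sketch of that "single representative" idea, in Python. This is not the authors' compressive BLAST/BLAT algorithm. It assumes fixed-size, non-overlapping blocks and exact duplicates only, so it misses matches that straddle a block boundary. The names `build_merged_store` and `search` are hypothetical.

```python
from collections import defaultdict


def build_merged_store(genomes, block=8):
    """Map each distinct block to every (genome_id, offset) where it occurs."""
    store = defaultdict(list)
    for gid, seq in genomes.items():
        for off in range(0, len(seq), block):
            store[seq[off:off + block]].append((gid, off))
    return store


def search(store, query):
    """Scan each distinct block once, then expand hits to all its occurrences."""
    hits = []
    for chunk, places in store.items():
        pos = chunk.find(query)
        while pos != -1:
            hits.extend((gid, off + pos) for gid, off in places)
            pos = chunk.find(query, pos + 1)
    return sorted(hits)


# Two made-up "genomes" that differ only in their last base.
genomes = {
    "g1": "ACGTACGTTTGACCAA",
    "g2": "ACGTACGTTTGACCAT",
}
store = build_merged_store(genomes)
print(len(store))             # distinct blocks actually stored
print(search(store, "GTAC"))  # hits reported per original genome
```

The shared block is stored and scanned once, yet every hit is still reported against each genome it came from. The search cost tracks the amount of distinct sequence rather than the total amount of sequence.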

What makes merging computationally tractable here, while topic map systems, at least some of them, are reported to have scalability issues (see Scalability of Topic Map Systems by Marcel Hoyer)?

What other examples of computationally tractable merging would you suggest? Including different merging approaches/algorithms. Thinking it might be a useful paper/study to work from scalable merging examples towards less scalable ones. Perhaps to discover what choices have an impact on scalability.

### Elegant exact string match using BWT

Thursday, May 17th, 2012

Elegant exact string match using BWT by Santhosh Kumar.

From the post:

This post describes an elegant and fast algorithm to perform exact string match. Why another string matching algorithm? To answer the question, let’s first understand the problem we are trying to solve.

In short, the problem is to match billions of short strings (about 50-100 characters long) to a text which is 3 billion characters long. The 3 billion character string (also called reference) is known ahead and is fixed (at least for a species). The shorter strings (also called reads) are generated as a result of an experiment. The problem arises due to the way the sequencing technology works, which in its current form, breaks the DNA into small fragments and ‘reads’ them. The information about where the fragments came from is lost and hence the need to ‘map’ them back to the reference sequence.

We need an algorithm that allows repeatedly searching on a text as fast as possible. We are allowed to perform some preprocessing on the text once if that will help us achieve this goal. BWT search is one such algorithm. It requires a one-time preprocessing of the reference to build an index, after which the query time is of the order of the length of the query (instead of the reference).

Burrows Wheeler transform is a reversible string transformation that has been widely used in data compression. However the application of BWT to perform string matching was discovered fairly recently in this paper. This technique is the topic of this post. Before we get to the searching application, a little background on how BWT is constructed and some properties of BWT.

Complete with careful illustrations of the operation of the Burrows Wheeler transform (BWT).
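To make the "query time is of the order of the length of the query" claim concrete, here is a minimal sketch of BWT backward search in Python. It is not Santhosh's code. It builds the BWT naively from sorted rotations (fine for toy strings, hopeless for a 3 billion character reference), assumes `$` as an end marker that does not occur in the text, and only counts matches rather than locating them. All function names are made up.

```python
from collections import Counter


def bwt(text):
    """Naive BWT: last column of the sorted rotations of text + '$'."""
    assert "$" not in text, "'$' is reserved as the end marker"
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def build_index(bwt_str):
    """Return (C, occ) tables used by backward search.

    C[c]      = number of characters in the text strictly smaller than c
    occ[c][i] = number of occurrences of c in bwt_str[:i]
    """
    counts = Counter(bwt_str)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    occ = {c: [0] * (len(bwt_str) + 1) for c in counts}
    for i, ch in enumerate(bwt_str):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ


def count_matches(pattern, C, occ, n):
    """Count occurrences of `pattern` using one step per pattern character."""
    lo, hi = 0, n  # half-open range over the sorted rotations
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


reference = "banana"
b = bwt(reference)
C, occ = build_index(b)
print(b, count_matches("ana", C, occ, len(b)))
```

The loop in `count_matches` runs once per character of the pattern and never touches the reference itself. All the expensive work sits in the one-time preprocessing, which is exactly the trade the post describes.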

A separate post, still to come, will detail how to find the BWT index of a long string efficiently.

Definitely a series to follow.

### Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform

Wednesday, May 2nd, 2012

Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform by Anthony J. Cox, Markus J. Bauer, Tobias Jakobi, and Giovanna Rosone.

Abstract:

Motivation

The Burrows-Wheeler transform (BWT) is the foundation of many algorithms for compression and indexing of text data, but the cost of computing the BWT of very large string collections has prevented these techniques from being widely applied to the large sets of sequences often encountered as the outcome of DNA sequencing experiments. In previous work, we presented a novel algorithm that allows the BWT of human genome scale data to be computed on very moderate hardware, thus enabling us to investigate the BWT as a tool for the compression of such datasets.

Results

We first used simulated reads to explore the relationship between the level of compression and the error rate, the length of the reads and the level of sampling of the underlying genome and compare choices of second-stage compression algorithm.

We demonstrate that compression may be greatly improved by a particular reordering of the sequences in the collection and give a novel ‘implicit sorting’ strategy that enables these benefits to be realised without the overhead of sorting the reads. With these techniques, a 45x coverage of real human genome sequence data compresses losslessly to under 0.5 bits per base, allowing the 135.3Gbp of sequence to fit into only 8.2Gbytes of space (trimming a small proportion of low-quality bases from the reads improves the compression still further).

This is more than 4 times smaller than the size achieved by a standard BWT-based compressor (bzip2) on the untrimmed reads, but an important further advantage of our approach is that it facilitates the building of compressed full text indexes such as the FM-index on large-scale DNA sequence collections.
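As a quick sanity check on the abstract's numbers (my arithmetic, not the authors'), and assuming "Gbytes" means $10^9$ bytes and "Gbp" means $10^9$ bases:

$$
\frac{8.2 \times 10^{9}\ \text{bytes} \times 8\ \text{bits/byte}}{135.3 \times 10^{9}\ \text{bases}}
= \frac{65.6}{135.3}
\approx 0.485\ \text{bits per base}
$$

That agrees with the "under 0.5 bits per base" claim. For comparison, a plain fixed-width encoding of the four bases A, C, G, T needs 2 bits per base.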

Important work for several reasons.

First, if the human genome is thought of as “big data,” it opens the possibility that compressed full text indexes can be built for other instances of “big data.”

Second, indexing is similar to topic mapping in the sense that pointers to information about a particular subject are gathered to a common location. Indexes often account for synonyms (see also) and distinguish the use of the same word for different subjects (polysemy).

Third, depending on the granularity of tokenizing and indexing, index entries should be capable of recombination to create new index entries.

Source code for this approach:

Code to construct the BWT and SAP-array on large genomic data sets is part of the BEETL library, available as a GitHub repository at `git@github.com:BEETL/BEETL.git`.