Compression Info

Compression Overview

Data of any sort, but most particularly multimedia data, require large amounts of storage space and tax the bandwidth of most transport systems.

Compression assists in remedying this situation: data is compressed to be stored or transmitted, and then decompressed when it is ready to use.
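The compress-then-decompress cycle can be sketched with Python's standard zlib module, which implements the ZLIB/DEFLATE formats listed under Specifications. The sample payload and the compression level are assumptions chosen for illustration only.

```python
import zlib

# Repetitive sample payload (an assumption, chosen so the saving is visible).
original = b"compression reduces storage and bandwidth. " * 50

# Compress for storage or transmission; level 9 asks for the smallest output.
compressed = zlib.compress(original, 9)

# Decompress when the data is needed again; the round trip is lossless.
restored = zlib.decompress(compressed)

print(len(original), "bytes before,", len(compressed), "bytes after")
```

Highly repetitive input like this shrinks dramatically; data that is already compressed or random will not.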

With the explosion of internet use and increased data transmission to and from space, the search for fast, efficient compression has intensified.

Numerous compression schemes have been proposed and implemented; some are burdened with patent or royalty issues, while others may be freely used.

Compression Factors

Modern compression generally employs three steps:

Elimination of noise and insignificant data

The capture of data, especially automated capture, introduces extraneous information, artifacts, and information that might not be needed by a target recipient -- transmission of such information results in wasted bandwidth. Such noise can often be eliminated by simple filtering.

Examples are: voice data -- frequencies outside the band of normal human speech can be eliminated; visual data -- variations that humans are insensitive to can be clipped or truncated.
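As a rough illustration of the voice example, the sketch below band-limits a signal by zeroing every frequency bin outside a chosen range. The naive DFT, the helper names, the 8000 Hz sample rate and the 300-3400 Hz band (the traditional telephone voice band) are all assumptions made for this example; a real system would use an FFT library or a proper filter design.

```python
import cmath
import math

def dft(samples):
    """Naive discrete Fourier transform (O(n^2), fine for short examples)."""
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse transform, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def band_limit(samples, sample_rate, low_hz, high_hz):
    """Zero every frequency bin outside [low_hz, high_hz]."""
    n = len(samples)
    spectrum = dft(samples)
    for k in range(n):
        # Bins k and n-k form the positive/negative pair for one frequency.
        freq = min(k, n - k) * sample_rate / n
        if not (low_hz <= freq <= high_hz):
            spectrum[k] = 0
    return idft(spectrum)

rate = 8000   # samples per second (assumed)
count = 160   # gives 50 Hz per bin, so both tones fall exactly on a bin
speech = [math.sin(2 * math.pi * 1000 * t / rate) for t in range(count)]
hum = [0.5 * math.sin(2 * math.pi * 50 * t / rate) for t in range(count)]
noisy = [s + h for s, h in zip(speech, hum)]

clean = band_limit(noisy, rate, 300, 3400)  # the 50 Hz hum is removed
```

Because both test tones sit exactly on DFT bins, the hum is removed cleanly here; with real recordings the out-of-band energy leaks across bins and the result is only approximate.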

Reduction of least-significant data

Humans are often able to "fill in" missing data in a well-known context; we can fill in missing text, understand speech with audio gaps, construct edges and color where none are present.

Lossy compression relies on knowing what parts of the "significant" data can be reduced, while maintaining "reasonable" presentation. Humans are most sensitive to variations in brightness, less so to changes in hue, and least sensitive to color saturation; by collapsing colors of similar saturation, far fewer bits of data are required to represent a given color.
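One way to picture this is to store brightness at full precision and color at reduced precision. The sketch below is an illustration, not the method of any particular codec: it swaps in the common luma/chroma split (using the JPEG/JFIF conversion coefficients) rather than a hue/saturation model, and the function names and the choice of 4 chroma bits are assumptions.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert 8-bit RGB to luma (Y) and two chroma channels (Cb, Cr)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

def clamp8(value):
    """Round to the nearest integer and clamp to the 0..255 range."""
    return max(0, min(255, int(round(value))))

def pack_pixel(r, g, b, chroma_bits=4):
    """Keep 8 bits of luma but only chroma_bits bits per chroma channel."""
    y, cb, cr = rgb_to_ycbcr(r, g, b)
    shift = 8 - chroma_bits
    return clamp8(y), clamp8(cb) >> shift, clamp8(cr) >> shift

# Two nearly identical reds: brightness differs slightly, color codes collapse.
first = pack_pixel(200, 30, 30)
second = pack_pixel(200, 34, 30)
```

With 4 bits per chroma channel each pixel needs 16 bits instead of 24, and similar colors share the same chroma codes while their brightness is still distinguished.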

In general, humans are most sensitive to low-frequency changes (gradual changes over distance or time), and least sensitive to high-frequency variations (fine detail). Wavelet filtering provides lossy compression that collapses high-frequency data.
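A minimal sketch of the idea, assuming the simplest wavelet (Haar) and a single decomposition level: the signal is split into pairwise averages, which carry the slowly varying trend, and pairwise differences, which carry the rapid variation. The lossy step zeroes differences below a threshold. The function names, the sample signal and the threshold are assumptions for illustration.

```python
def haar_step(signal):
    """One Haar level: pairwise averages and pairwise half-differences."""
    pairs = list(zip(signal[0::2], signal[1::2]))
    averages = [(a + b) / 2 for a, b in pairs]
    details = [(a - b) / 2 for a, b in pairs]
    return averages, details

def inverse_haar_step(averages, details):
    """Rebuild the signal from averages and differences."""
    out = []
    for avg, det in zip(averages, details):
        out.extend([avg + det, avg - det])
    return out

def lossy_haar(signal, threshold):
    """Zero every difference coefficient smaller than the threshold."""
    averages, details = haar_step(signal)
    kept = [d if abs(d) >= threshold else 0 for d in details]
    return averages, kept

signal = [10, 12, 11, 9, 40, 42, 41, 80]   # even length assumed
averages, kept = lossy_haar(signal, threshold=2)
approx = inverse_haar_step(averages, kept)
```

Most of the difference coefficients become zero, which a later redundancy-removal stage can store very cheaply, while the one large jump in the signal is preserved.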

Collapse of redundant data

Having eliminated noise and least significant data, compressors then focus on removing redundant information and generating signatures of data patterns.

Communication data often contains words or symbols that can be represented by a reference to a sample of that data in an agreed-upon dictionary. This reference is generally much smaller than the original sample; dictionary encoding is the basis of fax compression and many other modern compression schemes.

Such samples are often repeated in a data stream; run-length encoding provides simple compression by representing the data via a reference to the sample and a number indicating the repeat count.
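Dictionary encoding can be sketched as follows, assuming whitespace-separated words and a dictionary that sender and receiver both hold. The helper names are hypothetical, and practical schemes build the dictionary adaptively instead of sharing it up front.

```python
def build_dictionary(words):
    """Assign each distinct word a small integer code, in first-seen order."""
    codes = {}
    for word in words:
        codes.setdefault(word, len(codes))
    return codes

def encode(text, codes):
    """Replace every word with its dictionary reference."""
    return [codes[word] for word in text.split()]

def decode(references, codes):
    """Look each reference up in the shared dictionary."""
    words = {index: word for word, index in codes.items()}
    return " ".join(words[index] for index in references)

message = "the cat saw the dog and the dog saw the cat"
codes = build_dictionary(message.split())
references = encode(message, codes)
```

Eleven words collapse to eleven small integers drawn from a five-entry dictionary; the saving grows with the length of the samples and how often they recur.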
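A minimal sketch of this encoding on a string of symbols, storing each run as a (sample, repeat count) pair; the function names are assumptions for this example.

```python
from itertools import groupby

def rle_encode(data):
    """Collapse each run of identical symbols into (symbol, count)."""
    return [(symbol, len(list(run))) for symbol, run in groupby(data)]

def rle_decode(pairs):
    """Expand (symbol, count) pairs back into the original string."""
    return "".join(symbol * count for symbol, count in pairs)

row = "WWWWWWWWBBBWWWW"       # e.g. one scan line of black and white pixels
encoded = rle_encode(row)
```

The scheme only pays off when runs are long; data with few repeats can grow, since every symbol then carries its own count.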

Data often also includes patterns of samples. Patterns can often be approximated by defining curves that include the significant data points of the sample. In complex patterns, such curves can generally be represented by far fewer data points than the original sample. Signature compression has the additional advantage that it can fabricate "reasonable" data when zooming in or slowing down data presentation beyond the limits of its original significance -- this is particularly useful in rendering compressed textures in 3D models.
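The sketch below uses a piecewise-linear approximation as a simple stand-in for the curve fitting described above (real schemes typically fit smoother curves). It keeps only the samples that a straight line between their neighbours cannot predict within a tolerance, then interpolates between the kept points, including at fractional positions the original never sampled. The function names and the tolerance are assumptions, and at least two samples are assumed.

```python
def simplify(samples, tolerance):
    """Return the indices of the samples worth keeping."""
    def recurse(lo, hi):
        worst, worst_err = None, tolerance
        for i in range(lo + 1, hi):
            # Value the straight line from lo to hi predicts at position i.
            predicted = samples[lo] + (samples[hi] - samples[lo]) * (i - lo) / (hi - lo)
            err = abs(samples[i] - predicted)
            if err > worst_err:
                worst, worst_err = i, err
        if worst is None:
            return [lo]            # the line is good enough for this span
        return recurse(lo, worst) + recurse(worst, hi)
    return recurse(0, len(samples) - 1) + [len(samples) - 1]

def interpolate(indices, values, x):
    """Linearly interpolate the kept points at position x (may be fractional)."""
    points = list(zip(indices, values))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x is outside the sampled range")

samples = [0, 1, 2, 3, 4, 3, 2, 1, 0]
kept = simplify(samples, tolerance=0.1)
values = [samples[i] for i in kept]
between = interpolate(kept, values, 2.5)   # a position with no original sample
```

Nine samples reduce to three control points, and the same three points answer queries at any resolution, which is the "zooming in" behaviour the paragraph describes.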

A more computationally expensive version of pattern encoding is fractal compression, where samples are replicated recursively at various scales, rotations and translations. This provides excellent compression for some data.

Specifications

ZLIB

DEFLATE

GZIP 4.3

Other References

comp.compression.research FAQ

PKWare ZIP Libraries

Data Archiving