What part do math and probability play in the compression of text?

This hardball was tossed at mrlucky's cranium by Rick Bowser's client, David Block. He asked me to explain what part math and probability play in the compression of text, such as in ZIP files. He wondered if the computer has done a lot of reading and can predict the language that follows, or what? (You know, "Uh-oh," says Pkzip, "it's Hemingway. Lotta short punchy sentences coming up!")

Well, careful reading has convinced me that I don't really want to touch this question with a ten-bit word, but here goes!

When data is compressed, the goal is to reduce redundancy, leaving only the informational content. Information theory has a slew of formulae and algorithms to compute things like the entropy of a source and the probability of occurrence of each of its symbols. Wanna see one? SUM(i=1 to n) [-p(a(i)) lg p(a(i))] is the formula to compute the entropy of the source, called H.

lg, by the way, represents the base 2 logarithm.
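
If you'd rather see that formula earn its keep than squint at it, here's a minimal Python sketch (my own illustration, not anything lifted out of Pkzip) that tallies symbol frequencies in a string and computes H in bits per symbol:

```python
import math
from collections import Counter

def entropy(text):
    """H = sum over symbols of -p(a_i) * lg p(a_i), in bits per symbol."""
    counts = Counter(text)
    n = len(text)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

# A lopsided source carries little information per symbol...
print(entropy("aaaaaaab"))   # about 0.54 bits/symbol -- mostly redundancy
# ...while eight equally likely letters need a full 3 bits each.
print(entropy("abcdefgh"))   # 3.0 bits/symbol -- nothing to squeeze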

The goal in statistical data compression techniques is to use probability to reduce the average code length used to represent the symbols of an alphabet. Huffman compression and Shannon-Fano coding are specific examples of so-called optimal code-generating schemes. They are used in most software archiving tools.
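
To make "optimal code generation" a little less abstract, here's a toy Huffman coder in Python. This is a sketch of the idea, not the battle-hardened version inside any real archiver: it repeatedly pairs up the two least-frequent symbols until a single tree remains, then reads each symbol's bit string off the branches, so common symbols end up with short codes.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a prefix code: frequent symbols get short bit strings, rare ones long."""
    counts = Counter(text)
    # Heap of (weight, tiebreaker, tree); a tree is either a symbol or a [left, right] pair.
    heap = [(w, i, sym) for i, (sym, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, i, [t1, t2]))  # merge the two lightest trees
        i += 1
    codes = {}
    def walk(tree, prefix=""):
        if isinstance(tree, list):          # internal node: branch left (0) and right (1)
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                               # leaf: record this symbol's bit string
            codes[tree] = prefix or "0"     # lone-symbol edge case
    walk(heap[0][2])
    return codes

message = "this is an example of a huffman tree"
codes = huffman_codes(message)
encoded = "".join(codes[ch] for ch in message)
print(len(encoded), "bits instead of", 8 * len(message))
```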

Math is also key in "arithmetic coding", which uses a different, equally baffling scheme to represent an entire message as a single number carved out of nested probability intervals. When coupled with a suitable technique for building a model of the data to be compressed (such as a table of standard letter frequencies, or an adaptive model that recomputes frequencies based on the actual message), compression is achieved. DMC (Dynamic Markov Coding) and PPM (Prediction by Partial Matching) are two effective modeling tools for arithmetic coding.
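
For the brave, here's a bite-sized arithmetic coder in Python. It uses a static frequency table as its model and exact fractions instead of the clever fixed-precision arithmetic real implementations rely on, so treat it as an illustration of the principle only: each symbol narrows an interval, and any number inside the final interval names the whole message.

```python
from fractions import Fraction
from collections import Counter

def build_model(text):
    """Static model: give each symbol a slice of [0, 1) sized by its frequency."""
    counts = Counter(text)
    total = len(text)
    intervals, low = {}, Fraction(0)
    for sym, c in sorted(counts.items()):
        p = Fraction(c, total)
        intervals[sym] = (low, low + p)
        low += p
    return intervals

def encode(text, intervals):
    """Narrow [low, high) once per symbol; any value left inside identifies the message."""
    low, high = Fraction(0), Fraction(1)
    for sym in text:
        s_low, s_high = intervals[sym]
        width = high - low
        low, high = low + width * s_low, low + width * s_high
    return (low + high) / 2     # pick a representative point in the final interval

def decode(value, intervals, length):
    """Replay the narrowing: whichever slice the value falls in is the next symbol."""
    out = []
    low, high = Fraction(0), Fraction(1)
    for _ in range(length):
        width = high - low
        target = (value - low) / width
        for sym, (s_low, s_high) in intervals.items():
            if s_low <= target < s_high:
                out.append(sym)
                low, high = low + width * s_low, low + width * s_high
                break
    return "".join(out)

msg = "abracadabra"
model = build_model(msg)
assert decode(encode(msg, model), model, len(msg)) == msg
```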

Substitutional compressors can actually be understood by mere mortals. The concept here is to replace an occurrence of a particular phrase or group of bytes in a piece of data with a reference to a previous occurrence of that phrase. Jacob Ziv and Abraham Lempel are the two fellows who proposed, in 1977 and 1978, the two schemes that remain the main tools for this class of data compression.

LZ78-based schemes (the most common of which is LZW compression) work by entering phrases into a "dictionary" and then, when a repeat occurrence of a phrase is found, outputting the dictionary index instead of the phrase itself. The miracle of this technique is the fact that the dictionary is recreated by the decoding process, rather than being explicitly transmitted. Refinements of LZ78 use so-called "least recently used" (LRU) algorithms to manage the dictionary.
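
Here's a toy LZW pair in Python to show off that miracle: the decoder rebuilds the very same dictionary from nothing but the stream of indices. (A sketch of the idea only, with none of the variable code-width or dictionary-reset machinery a real tool would add.)

```python
def lzw_encode(data):
    """Grow a phrase dictionary as we go; emit an index whenever a phrase repeats."""
    dictionary = {bytes([i]): i for i in range(256)}   # seed with all single bytes
    phrase, out = b"", []
    for byte in data:
        candidate = phrase + bytes([byte])
        if candidate in dictionary:
            phrase = candidate                  # keep extending the current phrase
        else:
            out.append(dictionary[phrase])      # emit the longest known phrase
            dictionary[candidate] = len(dictionary)
            phrase = bytes([byte])
    if phrase:
        out.append(dictionary[phrase])
    return out

def lzw_decode(codes):
    """Rebuild the same dictionary on the fly -- nothing but indices ever travels."""
    dictionary = {i: bytes([i]) for i in range(256)}
    prev = dictionary[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:                                   # the one tricky case: the code refers
            entry = prev + prev[:1]             # to the phrase currently being built
        out.append(entry)
        dictionary[len(dictionary)] = prev + entry[:1]
        prev = entry
    return b"".join(out)

text = b"TOBEORNOTTOBEORTOBEORNOT"
assert lzw_decode(lzw_encode(text)) == text
```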

LZ77-based schemes perform similar tracking of data by sliding a fixed-size "window" over the data, seeking matches in a more dynamic fashion. The sliding-window technique automatically creates the LRU effect that must be managed explicitly in LZ78 schemes.
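
And a toy LZ77-style coder, again a sketch rather than anything production-grade: it looks back through a window behind the current position and emits (offset, length, next-byte) triples, with the window size and match limit picked arbitrarily for illustration.

```python
def lz77_encode(data, window=4096, max_len=32):
    """Emit (offset, length, next_byte) triples against a sliding window of past data."""
    i, out = 0, []
    while i < len(data):
        best_off, best_len = 0, 0
        start = max(0, i - window)
        for j in range(start, i):               # brute-force search for the longest match
            length = 0
            while (length < max_len and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        next_byte = data[i + best_len] if i + best_len < len(data) else None
        out.append((best_off, best_len, next_byte))
        i += best_len + 1
    return out

def lz77_decode(triples):
    """Copy 'length' bytes from 'offset' back, then append the literal next byte."""
    out = bytearray()
    for off, length, nxt in triples:
        for _ in range(length):
            out.append(out[-off])
        if nxt is not None:
            out.append(nxt)
    return bytes(out)

blob = b"yadda yadda yadda, the window spots the repeats"
assert lz77_decode(lz77_encode(blob)) == blob
```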

Continuing development of this technology consists of combining some or all of these models to net increased compression with faster decoding.

Math and probability, therefore, are at the heart of data compression, but not in terms of the analysis of meaning or form. The same techniques have validity in analyzing sources of pollution along a river, or leaks in a pipeline. Their elegance speaks to far more fundamental concepts than simple words and phrases.
