Compression

Chris Angelico rosuav at gmail.com
Thu Jul 14 07:07:26 EDT 2016


On Thu, Jul 14, 2016 at 6:16 PM, Steven D'Aprano
<steve+comp.lang.python at pearwood.info> wrote:
> How about some really random data?
>
> py> import codecs, random, string
> py> data = ''.join(random.choice(string.ascii_letters) for i in range(21000))
> py> len(codecs.encode(data, 'bz2'))
> 15220
>
> That's actually better than I expected: it's found some redundancy and saved
> about a quarter of the space.
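[A quick back-of-the-envelope check of that result - my sketch, not from either email: 21000 symbols drawn uniformly from 52 letters carry log2(52) ≈ 5.7 bits of information each, so no compressor can do much better than about 15 KB here.]

```python
import math

# Information-theoretic lower bound for the random-letter experiment:
# 21000 symbols drawn uniformly from an alphabet of 52 equally likely letters.
symbols = 21000
alphabet = 52
bits_per_symbol = math.log2(alphabet)       # about 5.70 bits per letter
bound_bytes = symbols * bits_per_symbol / 8
print(round(bound_bytes))                   # ~14964; bz2's 15220 is within ~2% of optimal
```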

What it found was an imbalance in the frequencies of byte values - you
used 52 values lots of times, and the other 204 never. Huffman coding
means those 52 values will get fairly short codes, and if you happened
to have just one or two other byte values, they'd be represented by
longer codes. It's like Morse code - by encoding some letters with
very short sequences (dot followed by end-of-letter for E, dash
followed by end-of-letter for T) and others with much longer sequences
(dash-dot-dot-dash-EOL for X), it manages a fairly compact
representation of typical English text. The average Morse sequence
length for a letter is 3.19, but on real-world data... well, I used
the body of your email as sample text (yes, I'm aware it's not all
English), and calculated a weighted average of 2.60. (Non-alphabetics
are ignored, and the text is case-folded.) Using the entire text of
Gilbert & Sullivan's famous operettas, or the text of "The Beauty
Stone", or the wikitext source of the Wikipedia article on Morse code,
gave similar results (ranging from 2.56 to 2.60); interestingly, a
large slab of Lorem Ipsum skewed the numbers slightly lower (2.52),
not higher as I was afraid it would, being more 'random'.
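[The weighted average Chris describes can be reproduced along these lines - a sketch of the method, not his actual script. The per-letter lengths below are the standard ITU dot/dash counts, with the end-of-letter gap not counted, which may differ slightly from the table he used.]

```python
# Dot/dash counts for the 26 ITU Morse letters (end-of-letter gap not counted).
MORSE_LEN = {
    'a': 2, 'b': 4, 'c': 4, 'd': 3, 'e': 1, 'f': 4, 'g': 3, 'h': 4,
    'i': 2, 'j': 4, 'k': 3, 'l': 4, 'm': 2, 'n': 2, 'o': 3, 'p': 4,
    'q': 4, 'r': 3, 's': 3, 't': 1, 'u': 3, 'v': 4, 'w': 3, 'x': 4,
    'y': 4, 'z': 4,
}

def weighted_morse_length(text: str) -> float:
    """Average Morse length per letter, weighted by frequency in `text`.

    Non-alphabetics are ignored and the text is case-folded, as in the post.
    """
    lengths = [MORSE_LEN[c] for c in text.lower() if c in MORSE_LEN]
    return sum(lengths) / len(lengths)

# English-like text leans heavily on short-coded letters like E and T,
# which is what pulls the weighted figure below the unweighted mean:
print(weighted_morse_length("the quick brown fox jumps over the lazy dog"))
```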

Further example: os.urandom() returns arbitrary byte values, and (in
theory, at least) has equal probability of returning every possible
value. Base 64 encoding that data makes three bytes come out as four.
Check this out:

>>> import base64, codecs, os
>>> data = os.urandom(21000)
>>> len(base64.b64encode(data)) # just to make sure
28000
>>> len(codecs.encode(data, 'bz2'))
21458
>>> len(codecs.encode(base64.b64encode(data), 'bz2'))
21290

When you remove the redundancy in b64-encoded data, you basically...
get back what you started with. (Curiously, several repeated
os.urandom() runs showed results consistent with the above - 214xx for
direct 'compression' vs 212xx for b64-then-compress. But in both
cases, the output is larger than the 21000 bytes of input.)
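[A rough way to see why - again a sketch of mine, not from the thread: measure the zeroth-order byte entropy of each stream. Raw urandom runs at close to 8 bits/byte, so it's already incompressible, while base64 output uses only 64 symbol values and so carries about 6 bits/byte - exactly the 4:3 expansion that bz2 then claws back.]

```python
import base64
import collections
import math
import os

def byte_entropy(data: bytes) -> float:
    """Zeroth-order Shannon entropy of a byte stream, in bits per byte."""
    counts = collections.Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

raw = os.urandom(21000)
b64 = base64.b64encode(raw)     # 28000 bytes, all from a 64-symbol alphabet

print(byte_entropy(raw))        # close to 8.0 bits/byte
print(byte_entropy(b64))        # close to 6.0 bits/byte (= log2 of 64 symbols)
```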

ChrisA


