Compression

Steven D'Aprano steve+comp.lang.python at pearwood.info
Thu Jul 14 04:16:44 EDT 2016


I thought I'd experiment with some of Python's compression utilities. First I 
thought I'd try compressing some extremely non-random data:


py> import codecs
py> data = "something non-random."*1000
py> len(data)
21000
py> len(codecs.encode(data, 'bz2'))
93
py> len(codecs.encode(data, 'zip'))
99


Those are really good results. Both the bz2 and zlib compressors have been able 
to squeeze out nearly all of the redundancy in the data.
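(An aside: the 'bz2' and 'zip' codecs used above are bytes-to-bytes codecs, and on 
Python 3 they operate on bytes rather than str (and, if I remember correctly, go by 
names like 'bz2_codec'). The simplest way to repeat the experiment there is to call 
the bz2 and zlib modules directly; something along these lines should do it:

import bz2
import zlib

data = b"something non-random." * 1000     # 21000 bytes of very repetitive data
print(len(data))
print(len(bz2.compress(data)))             # should shrink to well under 1% of the input
print(len(zlib.compress(data)))            # likewise tiny

The exact byte counts depend on the compression level and library version, but both 
should come out at around a hundred bytes.)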

What if we shuffle the data so it is more random?

py> import random
py> data = list(data)
py> random.shuffle(data)
py> data = ''.join(data)
py> len(data); len(codecs.encode(data, 'bz2'))
21000
10494
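Shuffling destroys the repeated phrase, but it leaves the character frequencies 
untouched, and bz2 can still exploit those. A rough way to see how much room that 
leaves is a zero-order (frequency-only) entropy estimate; a quick sketch, treating 
each character as independent:

import math
from collections import Counter

def zero_order_entropy(s):
    # Shannon entropy in bits per symbol, from character frequencies alone.
    counts = Counter(s)
    n = float(len(s))
    return -sum((c / n) * math.log(c / n, 2) for c in counts.values())

For the shuffled string this works out to roughly 3.7 bits per character, which for 
21000 characters puts the floor a little under 10 KB -- so 10494 bytes is not far 
off the best any compressor could manage here.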


How about some really random data?

py> import string
py> data = ''.join(random.choice(string.ascii_letters) for i in range(21000))
py> len(codecs.encode(data, 'bz2'))
15220
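For scale: each character is drawn uniformly from 52 letters, so it carries about 
5.7 bits of information (log2 of 52) but occupies a full 8-bit byte. Roughly 29% of 
each byte is pure structural redundancy, and the best conceivable result is easy to 
estimate:

import math

bits_per_char = math.log(52, 2)            # about 5.7 bits of real information per letter
floor_bytes = 21000 * bits_per_char / 8    # a little under 15 KB

which is not far below the 15220 bytes that bz2 managed.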

That's actually better than I expected: it's found some redundancy and saved 
about a quarter of the space. What if we try compressing data which has already 
been compressed?

py> cdata = codecs.encode(data, 'bz2')
py> len(cdata); len(codecs.encode(cdata, 'bz2'))
15220
15688

There's no shrinkage at all; compression has actually increased the size. That's to 
be expected: the output of a good compressor looks essentially random, so there is 
nothing left to squeeze out, and the second pass still adds its own headers and 
bookkeeping.


What if we use some data which is random, but heavily biased?

py> values = string.ascii_letters + ("AAAAAABB")*100
py> data = ''.join(random.choice(values) for i in range(21000))
py> len(data); len(codecs.encode(data, 'bz2'))
21000
5034
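The bias is severe: in values, 'A' makes up about 70% of the symbols and 'B' about 
24%, with the other 50 letters sharing the rest. Reusing the zero_order_entropy() 
sketch from earlier on the values string gives a rough floor:

bias_bits = zero_order_entropy(values)     # roughly 1.4 bits per character
floor_bytes = 21000 * bias_bits / 8        # a little under 4 KB

so 5034 bytes is in the right ballpark, though some way above that estimate.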



So we can see that the bz2 compressor is capable of making use of deviations 
from uniformity, but the more random the initial data is, the less effective it 
will be.


-- 
Steve



