[Tutor] Possible Useless Python Challenge. [huffman encoding]

Mon, 7 May 2001 22:56:14 -0700

Daniel,
You rock! I thought that LZH and LHA was an old dead compression algorithm
that only hackers still used  ;)

-----Original Message-----
From: tutor-admin@python.org [mailto:tutor-admin@python.org]On Behalf Of
Daniel Yoo
Sent: Monday, May 07, 2001 10:05 PM
To: Rob Andrews
Cc: Deirdre Saoirse Moen; tutor@python.org
Subject: Re: [Tutor] Possible Useless Python Challenge. [huffman encoding]

On Mon, 7 May 2001, Rob Andrews wrote:

> Sometimes I barely understand a word of what's currently being discussed
> and still think it sounds nifty. This is such a time.
>
> Deirdre Saoirse Moen wrote:
> >
> > >On Sun, 6 May 2001, Tesla Coil wrote:
> > >
> > >>  KiSS data sets are bundled using LZH compression,
> > >>  if there's a module for that, I haven't located it.
> > >
> > >I haven't found an LZH decompressor for Python.  In fact, the GnomeKISS
> > >people are depending on a separate LHA extraction program, so perhaps a
> > >Python KISS program should delegate the handling of LZH compression to
the
> > >LHA utility.
> >
> > I've written both Lempel-Ziv and Huffman compressors and
> > decompressors, but not LZH per se and not in Python. Given that
> > Python has all the requisite bitwise operators, it would be possible,
> > but it'd probably be better to write a wrapper for an existing set of
> > C functions.

*laugh* I have a little bit of code that almost implements Huffman
encoding.  (VanL, this uses some of the tree code I mentioned earlier.)
I was writing this during the afternoon.  Coincidence, huh?

I'll try to give the flavor of what Huffman encoding is.  We need to give
a little background though.  [note, this is LONG.]

When we represent letters --- one character strings --- Python (and mostly
every other computer language) internally uses a numeric code called
ASCII: American Standard Code for Information Interchange, to encode each
character.

Our computers are bit crunchers, so this means that every character we
store needs to be saved as a number.  For example, we could look at the
string:

    "hello world"

as a list of ASCII numbers:

###
>>> map(ord, 'hello world')
[104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]
###

but this is still one step removed from how computers REALLY represent
characters; they represent them, not just in numeric form, but in crude,
binary, base-two!  Ones and zeros, and that's really weird when we first
see it.  Here's a small program that will try to reveal what's underneath:

###
>>> def binaryStringHelper(n):
...     if n == 0: return ''
...     else: return binaryStringHelper(n >> 1) + str(n % 2)
...
>>> def toBinaryString(n):
...     if n == 0: return string.zfill(0, 8)
...     return string.zfill(int(binaryStringHelper(n)), 8)
...
>>> pprint.pprint(map(toBinaryString, range(10)))
['00000000',          ## ...stands for 0
 '00000001',          ## ...stands for 1
 '00000010',          ##               2
 '00000011',          ##               3
 '00000100',          ##             ...
 '00000101',
 '00000110',
 '00000111',
 '00001000',          ##               8
 '00001001']          ##               9
###

This is doing some hacky bit manipulation which isn't really important:
the important thing is that this program above allows us to visualize all
those one's and zero's that the computer actually uses to represent a
character.  We can apply this on our string:

###
>>> pprint.pprint(map(toBinaryString, map(ord, 'hello world')))
['01101000',
 '01100101',
 '01101100',
 '01101100',
 '01101111',
 '00100000',
 '01110111',
 '01101111',
 '01110010',
 '01101100',
 '01100100']
###

and this is how computers today truly store 'hello world' in its memory,
as streams of ones and zeros.  Every character is eight bits, so any
string of length 'n' will take up '8*n' bits.  Our small sentence, 'hello
world', takes up 88 bits!

Ok, this doesn't sound as world-threatening as we're making it out to be,
but what is neat about Huffman encoding is that we can usually do much
better than 88 bits.

What Huffman realized is that we're not taking advantage of a certain
curious property about human language: we use certain words over and over,
certain letters over and over, and our language is fraught with
repetitious repetition. What if we don't always represent a character by 8
bits, but by something that depends on how frequently the letter occurs in
our language?  Instead of having every letter be represented by eight
bits, we can make a "variable length" encoding.  That is, perhaps we can
use the following code:

    'h' ===> 0000
    'w' ===> 0001
    'd' ===> 0010
    'e' ===> 0011
    'o' ===> 01
    'r' ===> 100
    ' ' ===> 101
    'l' ===> 11

That is, instead of using all eight bits on every letter, we can use a
shorter binary sequence for really frequent characters, and longer codes
for letters that we don't use as often.  For letters that don't occur at
all, we don't even need to give them a code.  Huffman's encoding gives us
a way of figuring out really good codes that do this for us.  What this
means is, that, instead of representing 'hello world' as:

###
>>> string.join(map(toBinaryString, map(ord, 'hello world')), '')
'011010000110010101101100011011000110111100100000011101110110111101110010011
0110001100100'
###

we can do it like this:

###
>>> code = {'h' : '0000', 'w' : '0001', 'd' : '0010', 'e' : '0011',
            'o' : '01',   'r' : '101',  ' ' : '101',  'l' : '11'}
>>> string.join(map(lambda c: code[c], 'hello world'), '')
'00000011111101101000101101110010'
####

which is a heck of a lot shorter than what we had previously.  This is the
heart of Huffman encoding: to find a binary representation for each number
so that a message doesn't take so much space to store.

Of course this is handwavy, because whoever wants to read our message
needs to have a copy of the dictionary!  They need to be able to convert
our binary message back into something readable, so we usually send off
the dictionary alongside the encoded text.  When a message gets large
enough, though, that overhead becomes negigible compared to the amount of
space we save.

Finally, the code itself needs to guarantee that, given a bit string,
there's only one way to rehydrate it back to it's nice spongy, wordy form.
Not coinciently, Huffman encoding guarantees this for us.  Really smart
guy, and really neat algorithm.

Does this make sense?  I get the feeling I talked too much on this one
already.  *grin*

The incomplete code to do Huffman encoding should be attached to this
message.  As soon as I get it in a nicer form, I'll send it off to Useless
Python.