Comments on base52 encoder/decoder ?

Bengt Richter bokr at oz.net
Mon Jan 6 17:18:27 EST 2003


In a recent thread (Re: Simple encryption proposal. Comments ?)
I described a base52 encoding that I had used previously to encode binary
in pure alpha [A-Za-z]*, using three characters for every two bytes to
encode arbitrary-length binary byte strings.
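
As a rough illustration of the idea (not the actual module code), here is a
minimal Python sketch. Since 52**3 = 140608 exceeds 2**16 = 65536, any pair of
bytes fits in three letters. The alphabet order and the names ALPHABET,
encode_pair and decode_triple are assumptions made for illustration only; the
real module orders its alphabet differently, so this will not reproduce the
encoded strings shown in this post.

```python
import string

# Assumed alphabet order (A-Z then a-z); the real module's order differs.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase  # 52 letters


def encode_pair(hi, lo):
    """Encode two bytes (each 0-255) as three base52 letters."""
    n = (hi << 8) | lo                    # 16-bit value of the byte pair
    d0, rest = divmod(n, 52 * 52)         # most significant base52 digit
    d1, d2 = divmod(rest, 52)             # remaining two digits
    return ALPHABET[d0] + ALPHABET[d1] + ALPHABET[d2]


def decode_triple(chars):
    """Decode three base52 letters back into a (hi, lo) byte pair."""
    n = 0
    for ch in chars:
        n = n * 52 + ALPHABET.index(ch)
    if n >= 1 << 16:
        # Values above 65535 are the "spare" codes, not plain byte pairs.
        raise ValueError("not a plain byte-pair code")
    return n >> 8, n & 0xFF
```

The spare values rejected in decode_triple are the ones the rest of this post
puts to use.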

Paul Rubin responded with a Python implementation. I thanked him and
suggested a minor Python change for a minor speed improvement, and he
said he wasn't really trying for speed and that a serious implementation,
if done, should probably be done in C in the binascii module.

I took binascii.c and made a test module that had the functionality
described (though differing from Paul's implementation in minor ways:
ordering of code alphabet, handling of odd byte). That's the background.

Since then, I got to thinking that there were a _lot_ of spare codes
in the base52 space, and I added some optional parameters to take
advantage.

What it boils down to is that there are enough codes for two sets
of encodings plus 9023 special codes. You can think of the two sets as a
single-bit attribute attached to arbitrary subsequences of binary bytes, at
no cost in extra encoding characters compared with encoding them separately
as before. The 9023 integer codes can also be inserted, at a cost of 3 code
characters apiece. You can think of them as available escape codes.
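
The size of the spare code space can be checked with a little arithmetic. The
first two figures below follow directly from the alphabet size; the last step
assumes that each of the two sets also reserves 256 codes for an odd trailing
byte, which is a guess about the allocation and not something stated here. It
lands within one of the figure quoted above.

```python
ALPHABET_SIZE = 52

triples = ALPHABET_SIZE ** 3          # distinct 3-letter codes: 140608
pair_codes = 2 ** 16                  # values of one 2-byte pair: 65536

# Codes left after two complete sets of byte-pair encodings.
left_after_pairs = triples - 2 * pair_codes       # 9536

# Assumption: each set also needs 256 codes for a lone trailing byte.
left_after_odd = left_after_pairs - 2 * 256       # 9024

print(triples, left_after_pairs, left_after_odd)
```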

An example follows, with some explanation along the way.
First we'll jump ahead ;-)

 >>> from base52x import a2b_base52 as a2b
 >>> from base52x import b2a_base52 as b2a

 >>> b2a('plain')+b2a('meta',1)+b2a('tagged',123)
 'waoKNdJzCietjNNwkpLAxJpDJfI'

The second (code) argument can be 0-9025, with 0 selecting the default base52 encoding.
A code of 1 is associated with every character in the 'meta' string, and a code of
2 to 9025 becomes a special 3-character code prefixed to the associated string, which
is itself encoded with code 0.

Decoding also has an optional 'mode' parameter. The default, mode 0, expects to
decode a default-encoded string and returns just the string. Mode 3, however, returns
a list with strings prefixed by encoding codes:

 >>> a2b('waoKNdJzCietjNNwkpLAxJpDJfI',3)
 [0, 'plain', 1, 'meta', 123, 0, 'tagged']
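
A flat list like the one above can be regrouped into one record per string
with a few lines of Python. This is only a sketch: group_segments is a
hypothetical helper, not part of base52x, and it assumes that integers of 2 or
more are special codes while a 0 or 1 directly before a string is that
string's attribute.

```python
def group_segments(decoded):
    """Regroup a flat [code, string, ...] list into per-string records.

    Returns a list of (special_codes, attribute, string) tuples.
    Special codes not followed by any string are dropped in this sketch.
    """
    segments = []
    specials = []      # special codes (>= 2) seen since the last string
    attr = None        # most recent 0/1 attribute
    for item in decoded:
        if isinstance(item, int):
            if item >= 2:
                specials.append(item)
            else:
                attr = item
        else:
            segments.append((specials, attr, item))
            specials = []
            attr = None
    return segments


segments = group_segments([0, 'plain', 1, 'meta', 123, 0, 'tagged'])
# segments == [([], 0, 'plain'), ([], 1, 'meta'), ([123], 0, 'tagged')]
```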

Modes 1 and 2 return lists of just the code 0 or just the code 1 encoded
strings, respectively, and leave out special codes.

 >>> a2b('waoKNdJzCietjNNwkpLAxJpDJfI',1)
 ['plain', 'tagged']

 >>> a2b('waoKNdJzCietjNNwkpLAxJpDJfI',2)
 ['meta']

but:

 >>> a2b('waoKNdJzCietjNNwkpLAxJpDJfI',0)
 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
 base52x.Error: Unexpected metadata in mode 0 base52x string

Mode 0 insists on pure code 0 encoding. We can grab the first part
of the above, which is pure code 0:

 >>> a2b('waoKNdJzC',0)
 'plain'

So that's basically it. Of course you can substitute arbitrary binary
for 'plain', 'meta' and 'tagged' above, and 2-9025 in place of the 123,
and you can concatenate them arbitrarily, and also insert arbitrary
non-alphabet characters (spaces, newlines, etc.) in the encoded text. They
will be ignored.
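
The effect of ignoring non-alphabet characters is the same as filtering the
text before decoding. The sketch below shows that filter in Python;
strip_non_alpha is a hypothetical name used for illustration, since the module
does this internally.

```python
import re


def strip_non_alpha(text):
    """Drop every character outside A-Z and a-z."""
    return re.sub(r"[^A-Za-z]", "", text)


# Spaces and newlines disappear, leaving only code characters.
cleaned = strip_non_alpha("waoKNd JzC\n")
# cleaned == 'waoKNdJzC'
```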

Any ideas for uses? E.g., I could see packing .gif file names in the meta part
and the binary in the plain part, or encoding mixed unicode and other
kinds of strings and stuff by keeping type info in metadata prefixes.

If this sounds useful, I'm willing to make a version to incorporate into
binascii. I think I'll add an optional line length parameter for b2a, defaulting
to 78 for insertion of \n at those points. Zero will mean no inserted linefeeds.
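
The proposed line-length behaviour might look like the following Python
sketch. The name wrap_encoded and the choice not to append a trailing newline
are assumptions; only the default of 78 and the meaning of zero come from the
description above.

```python
def wrap_encoded(encoded, width=78):
    """Insert a newline after every `width` characters; 0 means no wrapping."""
    if width <= 0:
        return encoded
    return "\n".join(
        encoded[i:i + width] for i in range(0, len(encoded), width)
    )
```

Because the decoder ignores non-alphabet characters, the inserted newlines
would not need to be removed before decoding.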

BTW, the above is now in C.

Regards,
Bengt Richter



