Comments on base52 encoder/decoder ?

Wed Jan 8 12:30:20 EST 2003

On 08 Jan 2003 06:12:36 -0800, Paul Rubin <phr-n2002b at NOSPAMnightsong.com> wrote:

>bokr at oz.net (Bengt Richter) writes:
>> What it boils down to is that there are enough codes for two sets
>> of encodings plus 9023 special codes. You can think of it as a single
>> bit attribute for arbitrary subsequences of binary bytes with no extra
>> encoding characters vs doing them separately as before, and 9023 integer
>> codes can be inserted also at a cost of 3 code characters apiece. You
>> can think of them as available escape codes.
>
>I'd vote for getting rid of the fancy stuff with the integer codes,
>and also making the odd-character the last character in the encoded
>string rather than the first.  The reason is that lets you encode a
>stream where you don't know the length in advance.  

You can encode a stream where you don't know the length, but it's up
to you to concatenate the chunks returned by the conversion. When you
pass a string to b2a_base52, that length is by definition known, though
it's not known how many times you will repeat the operation. Anything
beyond that amounts to asking for a stateful filter, which would need a
close() method.
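
For what it's worth, such a stateful filter might look roughly like the
sketch below. The class name PairingEncoder and its write()/close() methods
are made up for illustration and are not part of base52x. The encode argument
is whatever one-shot encoder you hand it (b2a_base52 would be the intended
one), assumed here to return a str. The wrapper holds back a trailing odd byte
so that only the final flush can ever be odd-length.

```python
class PairingEncoder:
    """Hypothetical stateful wrapper around a one-shot encoder function."""

    def __init__(self, encode):
        self._encode = encode   # one-shot encoder returning str (assumption)
        self._held = b""        # at most one byte carried between writes

    def write(self, data):
        """Encode all whole byte pairs available; hold back an odd byte."""
        data = self._held + data
        cut = len(data) - (len(data) % 2)   # largest even prefix length
        self._held = data[cut:]
        return self._encode(data[:cut]) if cut else ""

    def close(self):
        """Flush the held-back byte, if there is one."""
        data, self._held = self._held, b""
        return self._encode(data) if data else ""
```

The caller still concatenates whatever write() and close() return. The only
difference from the stateless version is that odd bytes get paired up across
calls instead of each costing a separate odd-byte code.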

In the non-stateful situation, it doesn't matter if the odd byte is done at the
beginning or end (except doing it at the beginning makes the code a little shorter ;-).

Of course, if you encode large amounts of data by passing length-one strings
to b2a_base52, it will be very inefficient. But you wouldn't do that normally.

You can concatenate arbitrary encodings and decode them as one input, and
get what you expect. If all your pieces are even length, the combined encoding
is identical to what you'd get encoding the whole thing in one call. If you
have, e.g., two odd chunks, it just means the whole could have been encoded in
3 fewer code characters had the chunks been combined first. E.g.,

 >>> from base52x import b2a_base52 as b2a
 >>> from base52x import a2b_base52 as a2b
 >>>

Two minimal odd chunks:
 >>> b2a('1')+b2a('2')
 'wZdwZe'

Decode combination:
 >>> a2b('wZdwZe')
 '12'

Encode even equivalent:
 >>> b2a('12')
 'EiK'

And decode:
 >>> a2b('EiK')
 '12'

And non-alpha is ignored:

 >>> a2b("""
 ... E
 ...  i
 ...   K
 ...
 ... ;-)
 ...
 ... """)
 '12'

IOW, in general, a2b(b2a(x)+b2a(y))==(x+y)
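
That identity is easy to spot-check mechanically. A rough sketch follows:
check_concat is a hypothetical helper, not something in base52x. Since
base52x may not be installed wherever this gets run, the demo plugs in hex
from the standard binascii module as a stand-in codec. Hex has the same
concatenation property, though of course its output looks nothing like
base52.

```python
import binascii
import itertools

def check_concat(b2a, a2b, samples):
    """Return the (x, y) pairs for which a2b(b2a(x) + b2a(y)) != x + y."""
    return [(x, y)
            for x, y in itertools.product(samples, repeat=2)
            if a2b(b2a(x) + b2a(y)) != x + y]

# Stand-in codec (assumption): hexlify/unhexlify instead of b2a/a2b_base52.
samples = [b"", b"1", b"2", b"12", b"123"]
failures = check_concat(binascii.hexlify, binascii.unhexlify, samples)
print(failures)  # an empty list means the identity held for every pair
```

To try it against the real thing, pass b2a_base52 and a2b_base52 in place of
the binascii functions, with samples of whatever string type they expect.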

BTW, how about an option to insert \n after every line_len characters?
And if so, should a final \n be guaranteed present, or guaranteed absent?
Or should all of that be left to the outside too?
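
If it is left to the outside, the wrapping is only a few lines anyway. A
sketch, where wrap_lines, its default of 76, and the final_newline flag are
all hypothetical choices rather than anything in base52x. It leans on the
decoder skipping non-alpha characters, as in the example above, so the
wrapped text should decode unchanged.

```python
def wrap_lines(encoded, line_len=76, final_newline=True):
    """Break an encoded string into lines of at most line_len characters."""
    if line_len <= 0:
        raise ValueError("line_len must be positive")
    lines = [encoded[i:i + line_len]
             for i in range(0, len(encoded), line_len)]
    text = "\n".join(lines)
    # Only add the trailing newline when there is something to terminate.
    if final_newline and text:
        text += "\n"
    return text
```

Making the trailing newline a flag sidesteps the guaranteed-or-not question
by pushing it to the caller, which may or may not count as an answer.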

Regards,
Bengt Richter