Working with bytes.

Mon Apr 5 05:29:14 EDT 2004

Piet van Oostrum <piet at cs.uu.nl> wrote:

>AV>  

[snip]

>Which includes quite a few NON-ASCII characters. 
>So what is ASCII-compliant about it?
>You can't store 7 bits per byte and still be ASCII-compliant. At least if
>you don't want to include control characters.

Thanks, and yes, you are right. I thought that getting rid of control
codes just meant switching to the high-bit codes, but of course the
control codes sit in the low range (0-31) and can't be removed that
way. Worse than that: high-bit codes are not ASCII-compliant at all!

However, the code below always sets the 8th and 7th bits to 0 and 1
respectively, so every output byte lies in the range 64-127 and it
should produce ASCII-compliant output using 6 bits per byte.
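
The range claim is easy to check. The following is only a minimal sketch
(written for Python 3, and the variable names are made up for the
illustration): it enumerates every value with bit 8 clear and bit 7 set
and splits them into printable and non-printable codes.

```python
# Sketch: which byte values can appear when bit 8 is 0 and bit 7 is 1?
# The six low bits are free, so the pattern is 01xxxxxx.
values = [0b01000000 | payload for payload in range(64)]

lowest, highest = min(values), max(values)

# Printable ASCII runs from 32 (space) to 126 (~).
printable = [v for v in values if 32 <= v <= 126]
non_printable = [v for v in values if not 32 <= v <= 126]

print(lowest, highest)      # bounds of the output range
print(non_printable)        # codes that are not printable ASCII
```

All 64 values are 7-bit ASCII, but one of them, 127 (DEL), is formally a
control character, so strictly speaking 63 of the 64 codes are printable.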

I wonder whether it would be possible to use more than six bits per
byte but fewer than seven? There seem to be some character codes left
over (32-63, for instance), and these could be used too?
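
As a rough illustration of what "between six and seven" could mean: there
are 95 printable ASCII codes (32-126), which puts the ceiling at
log2(95), about 6.57 bits per character. Base-85 encodings get close by
mapping 4 bytes to 5 characters. The sketch below assumes Python 3.4 or
later, where the standard library has base64.b85encode; it is not the
scheme used in the script in this post, just one existing technique.

```python
import base64
import math

# Theoretical ceiling if every printable ASCII code (32..126) is usable.
printable_codes = range(32, 127)
upper_bound = math.log2(len(printable_codes))

# Base-85 packs 4 input bytes (32 bits) into 5 output characters.
data = bytes(range(20))
encoded = base64.b85encode(data)
bits_per_char = 8 * len(data) / len(encoded)

# The encoding is reversible.
assert base64.b85decode(encoded) == data

print(round(upper_bound, 2), bits_per_char)
```

Going beyond 6.4 bits per character would need a larger alphabet and
bigger groups, at the cost of more arithmetic per group.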

Anton

from itertools import islice

def _bits(i):
    return [('01'[i>>j & 1]) for j in range(8)][::-1] 

_table = dict([(chr(i),_bits(i)) for i in range(256)])

def _bitstream(bytes):
    for byte in bytes:
        for bit in _table[byte]:
            yield bit

def _drop_first_two(gen):
    """ skip the two marker bits of each byte, pass on the other six """
    while 1:
        gen.next()
        gen.next()
        for x in islice(gen,6):
            yield x

def sixes(bytes):
    """ stream normal bytes to bytes where bits 8,7  are 0,1 """
    gen = _bitstream(bytes)
    while 1:
        R = list(islice(gen,6))
        if not R:  break
        s = '01'+ "".join(R) + '0' * (6-len(R))
        yield chr(int(s,2))

def eights(bytes,n):
    """ the reverse of the sixes function :-| """
    gen = _bitstream(bytes)
    df = _drop_first_two(gen)
    for i in xrange(n):
        s =  ''.join(islice(df,8))
        yield chr(int(s,2))

def test():
    from random import randint
    size = 20
    R = [chr(randint(0,255)) for i in xrange(size)]
    bytes = ''.join(R)
    sx =  ''.join(sixes(bytes))
    check = ''.join(eights(sx,size))
    assert  check == bytes
    print sx

if __name__ == '__main__': 
    test()

output:

VMtdh[LII~Qexdyg}xFRhXRIVx
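
The script above is Python 2 (gen.next(), xrange, the print statement).
For anyone on Python 3, the same round trip can be sketched on bytes
objects with plain string slicing. This is an assumed port, not a tested
drop-in replacement, and the names sixes3 and eights3 are hypothetical.

```python
def sixes3(data: bytes) -> bytes:
    """Pack the bits of data six at a time into bytes of the form 01xxxxxx."""
    bits = "".join(format(b, "08b") for b in data)
    bits += "0" * (-len(bits) % 6)      # zero-pad the last group to six bits
    return bytes(0b01000000 | int(bits[i:i + 6], 2)
                 for i in range(0, len(bits), 6))


def eights3(packed: bytes, n: int) -> bytes:
    """Reverse of sixes3; n is the number of original bytes."""
    # Keep only the six payload bits of each packed byte.
    bits = "".join(format(b & 0b00111111, "06b") for b in packed)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, 8 * n, 8))


if __name__ == "__main__":
    original = bytes(range(256))
    packed = sixes3(original)
    assert eights3(packed, len(original)) == original
    print(packed[:16])
```

As in the original, the length n has to be passed separately, because the
zero padding at the end is not otherwise distinguishable from data.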