"More About Unicode in Python 2 and 3"

Steven D'Aprano steve+comp.lang.python at pearwood.info
Mon Jan 6 11:24:19 EST 2014


Roy Smith wrote:

> In article <mailman.5001.1388976943.18130.python-list at python.org>,
>  Chris Angelico <rosuav at gmail.com> wrote:
> 
>> It can't be both things. It's either bytes or it's text.
> 
> I've never used Python 3, so forgive me if these are naive questions.
> Let's say you had an input stream which contained the following hex
> values:
> 
> $ hexdump data
> 0000000 d7 a8 a3 88 96 95
> 
> That's EBCDIC for "Python".  What would I write in Python 3 to read that
> file and print it back out as utf-8 encoded Unicode?

There's no one EBCDIC encoding. Like the so-called "extended ASCII"
or "ANSI" encodings that followed, IBM had many different versions of
EBCDIC customised for different machines and markets -- only even more
poorly documented. But since the characters in that sample are all US
English letters, any EBCDIC dialect ought to do it:

py> b = b'\xd7\xa8\xa3\x88\x96\x95'
py> b.decode('CP500')
'Python'
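
To illustrate the "any EBCDIC dialect" point, here is a small sketch that
decodes the same bytes with three of the EBCDIC code pages shipped with
CPython (cp037, cp500 and cp1140). The helper name decode_with_each is
made up for this example, and three code pages are not a survey of every
EBCDIC variant ever produced.

```python
# The six bytes from the hexdump in the question.
data = b'\xd7\xa8\xa3\x88\x96\x95'

def decode_with_each(raw, codec_names=('cp037', 'cp500', 'cp1140')):
    """Return a dict mapping each codec name to the text it decodes to.

    Hypothetical helper, for illustration only.
    """
    return {name: raw.decode(name) for name in codec_names}

for name, text in decode_with_each(data).items():
    print(name, ascii(text))

# Punctuation is where the dialects tend to disagree, so compare how
# two of the code pages decode a single non-letter byte.
print(ascii(b'\x4a'.decode('cp037')), ascii(b'\x4a'.decode('cp500')))
```

The letters decode identically under all three, which is why the choice
of CP500 above is not critical for this particular input.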


To read it from a file:

text = open("somefile", encoding='CP500').read()

And to print out the UTF-8 encoded bytes (print shows them as a bytes
literal, b'Python', not as raw output):

print(text.encode('utf-8'))
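
Putting the two steps together, a minimal end-to-end sketch might look
like the following. It assumes the file really is CP500, and the function
name ebcdic_file_to_utf8 and the temporary file are only there to make
the example self-contained. Writing to sys.stdout.buffer sends the raw
UTF-8 bytes rather than their repr.

```python
import os
import sys
import tempfile

def ebcdic_file_to_utf8(path, source_encoding='cp500'):
    """Read a text file in the given EBCDIC code page, return UTF-8 bytes.

    Hypothetical helper; source_encoding is an assumption about the data.
    """
    with open(path, encoding=source_encoding) as f:
        text = f.read()          # decoded to a Unicode str here
    return text.encode('utf-8')  # re-encoded as UTF-8 bytes here

# Create a stand-in for the "data" file from the question.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'\xd7\xa8\xa3\x88\x96\x95')

try:
    utf8_bytes = ebcdic_file_to_utf8(tmp.name)
    # sys.stdout.buffer is the binary layer under the text stream; it
    # can be missing if stdout has been replaced, so fall back to print.
    binary_out = getattr(sys.stdout, 'buffer', None)
    if binary_out is not None:
        binary_out.write(utf8_bytes + b'\n')
        binary_out.flush()
    else:
        print(utf8_bytes)
finally:
    os.remove(tmp.name)
```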



> Or, how about a slightly different example:
> 
> $ hexdump data
> 0000000 43 6c 67 75 62 61
> 
> That's "Python" in rot-13 encoded ascii.  How would I turn that into
> cleartext Unicode in Python 3?


In Python 3.3, you can do this:

py> b = b'\x43\x6c\x67\x75\x62\x61'
py> s = b.decode('ascii')
py> print(s)
Clguba
py> import codecs
py> codecs.decode(s, 'rot-13')
'Python'

(This may not work in Python 3.1 or 3.2, since rot13 and assorted other
string-to-string and byte-to-byte codecs were mistakenly removed. I say
mistakenly, not in the sense of "by accident", but in the sense of "it was
an error of judgement". Somebody was under the misapprehension that the
codec machinery could only work on Unicode <-> bytes.)

If you don't want to use the codec, you can do it by hand:

def rot13(astring):
    result = []
    for c in astring:
        i = ord(c)
        if ord('a') <= i <= ord('m') or ord('A') <= i <= ord('M'):
            i += 13
        elif ord('n') <= i <= ord('z') or ord('N') <= i <= ord('Z'):
            i -= 13
        result.append(chr(i))
    return ''.join(result)

But why would you want to do it the slow way?
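
If you do want a by-hand version without the per-character loop, one
option is a translation table built with str.maketrans. This is only a
sketch: the names ROT13_TABLE and rot13_translate are invented for the
example, and it handles the ASCII letters only, leaving everything else
untouched.

```python
import string

_lower = string.ascii_lowercase
_upper = string.ascii_uppercase

# Map each ASCII letter to the letter 13 places further on, wrapping
# around at the end of the alphabet. Built once, reused on every call.
ROT13_TABLE = str.maketrans(
    _lower + _upper,
    _lower[13:] + _lower[:13] + _upper[13:] + _upper[:13],
)

def rot13_translate(text):
    """Apply rot-13 to a str using the precomputed table."""
    return text.translate(ROT13_TABLE)

sample = b'\x43\x6c\x67\x75\x62\x61'.decode('ascii')
print(rot13_translate(sample))
```

Since rot-13 is its own inverse, the same function both encodes and
decodes.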



-- 
Steven



