"More About Unicode in Python 2 and 3"

Sun Jan 5 23:51:09 EST 2014

On Mon, Jan 6, 2014 at 3:24 PM, Roy Smith <roy at panix.com> wrote:
> I've never used Python 3, so forgive me if these are naive questions.
> Let's say you had an input stream which contained the following hex
> values:
>
> $ hexdump data
> 0000000 d7 a8 a3 88 96 95
>
> That's EBCDIC for "Python".  What would I write in Python 3 to read that
> file and print it back out as utf-8 encoded Unicode?

*deletes the two paragraphs that used to be here* Turns out Python 3
_does_ have an EBCDIC decoder... but it's not called EBCDIC.

>>> b"\xd7\xa8\xa3\x88\x96\x95".decode("cp500")
'Python'

This codec seems like a good candidate for an alias, either "ebcdic" or
"EBCDIC". I didn't know the decoder existed until I googled the
problem and saw someone else's solution.

To print that out as UTF-8, just decode and then encode:

>>> b"\xd7\xa8\xa3\x88\x96\x95".decode("cp500").encode("utf-8")
b'Python'
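
If the goal is to actually emit those UTF-8 bytes rather than just look at their repr, one option is to write them to a binary stream. This is only a sketch: write_utf8 is a name I made up for illustration, and the in-memory buffer stands in for wherever the output really goes.

```python
import io

def write_utf8(text, stream):
    """Encode text as UTF-8 and write the bytes to a binary stream."""
    stream.write(text.encode("utf-8"))

# Decode the EBCDIC bytes from the example above into a str.
text = b"\xd7\xa8\xa3\x88\x96\x95".decode("cp500")

# Demonstrate against an in-memory buffer so the result can be inspected.
buf = io.BytesIO()
write_utf8(text + "\n", buf)

# In a real script (after "import sys"), the terminal's underlying byte
# stream could be the target instead:
#     write_utf8(text + "\n", sys.stdout.buffer)
```

Writing to sys.stdout.buffer sidesteps whatever encoding the terminal is configured with, which may or may not be what you want.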

For files on disk, you can instead open them with the encodings
specified, and the file objects do the decoding and encoding for you.

# Read the file as cp500 (EBCDIC) text, write it back out as UTF-8.
with open("data", encoding="cp500") as infile:
    with open("data_utf8", "w", encoding="utf-8") as outfile:
        outfile.write(infile.read())

Of course, this is assuming that Unicode has a perfect mapping for
every EBCDIC character. I'm not familiar enough with EBCDIC to be sure
that that's true, but I strongly suspect it is. And if it's not,
you'll get an exception somewhere along the way, so you'll know
something's gone wrong. (In theory, a "transcode" function might be
able to give you a warning before it even sees your data -
transcode("utf-8", "iso-8859-3") could alert you to the possibility
that not everything in the source character set can be encoded. But
that's a pretty esoteric requirement.)
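
To make the failure mode concrete, here is a minimal sketch of the decode-then-encode step wrapped up as a function. The name transcode and its argument order are hypothetical (there is no such stdlib function), and this version only finds out about unencodable characters once it meets one in the data.

```python
def transcode(data, source, target):
    """Decode bytes from `source`, then re-encode them as `target`."""
    return data.decode(source).encode(target)

# ASCII letters exist in both character sets, so this converts cleanly.
ok = transcode(b"Python", "utf-8", "iso-8859-3")

# U+65E5 (a CJK ideograph) has no ISO-8859-3 equivalent, so encode() raises.
try:
    transcode("\u65e5".encode("utf-8"), "utf-8", "iso-8859-3")
    failed = False
except UnicodeEncodeError:
    failed = True
```

Passing errors="replace" to encode() would substitute "?" instead of raising, at the cost of silently losing data.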

> Or, how about a slightly different example:
>
> $ hexdump data
> 0000000 43 6c 67 75 62 61
>
> That's "Python" in rot-13 encoded ascii.  How would I turn that into
> cleartext Unicode in Python 3?

That's one of the points that's under dispute. Is rot13 a
bytes<->bytes encoding, or is it str<->str, or is it bytes<->str? The
issue isn't clear. Personally, I think it makes good sense as a
str<->str translation, which would mean that the process would be
somewhat thus:

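>>> rot13 = {}
>>> for i in range(13):
...     # Swap each letter in A-M with its partner in N-Z, and vice versa.
...     rot13[65+i] = 65+i+13
...     rot13[65+i+13] = 65+i
...     # Same again for lowercase a-m and n-z.
...     rot13[97+i] = 97+i+13
...     rot13[97+i+13] = 97+i
...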

>>> data = b"\x43\x6c\x67\x75\x62\x61" # is there an easier way to turn a hex dump into a bytes literal?
>>> data.decode("ascii").translate(rot13)
'Python'
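
On the hex dump question: bytes.fromhex() accepts the space-separated byte pairs as they appear in the dump. And str.maketrans() can build the same kind of translation table without the loop. A sketch of both, with variable names of my own choosing:

```python
import string

# bytes.fromhex() skips the spaces between byte pairs, so the hexdump
# payload can be pasted in directly.
data = bytes.fromhex("43 6c 67 75 62 61")

lower = string.ascii_lowercase
upper = string.ascii_uppercase
# Map each letter to the one 13 places along, wrapping around the alphabet.
rot13 = str.maketrans(lower + upper,
                      lower[13:] + lower[:13] + upper[13:] + upper[:13])

# Decode the bytes to str first, then translate codepoint to codepoint.
result = data.decode("ascii").translate(rot13)
```

Since rot13 is its own inverse, applying the same table to the result gives back the original text.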

This is treating rot13 as a translation of Unicode codepoints to other
Unicode codepoints, which is different from an encode operation (which
takes abstract Unicode data and produces concrete bytes) or a decode
operation (which does the reverse). But this is definitely a grey
area. It's common for cryptographic algorithms to work with bytes,
meaning that their "decoded" text is still bytes. (Or even less than
bytes. The famous Enigma machines from World War II worked with the 26
letters as their domain and range.) Should the Python codecs module
restrict itself to the job of translating between bytes and str, or is
it a tidy place to put those other translations as well?
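
For what it's worth, the codecs module does ship a "rot_13" codec, exposed as a str<->str transform through codecs.encode() and codecs.decode() rather than through the str.encode() method. A short sketch; the LookupError behaviour shown here is what I'd expect on Python 3.4 and later, so treat that part as an assumption on older versions.

```python
import codecs

# The rot_13 codec maps str to str, so it takes text, not bytes.
text = codecs.decode("Clguba", "rot_13")

# str.encode() is reserved for str->bytes encodings and refuses rot_13.
try:
    "Clguba".encode("rot_13")
    rejected = False
except LookupError:
    rejected = True
```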

ChrisA