a question about Chinese characters in a Python Program

Paul Boddie paul at boddie.org.uk
Mon Oct 20 10:45:22 EDT 2008


On 20 Oct, 15:30, est <electronix... at gmail.com> wrote:
>
> Thanks for the long comment Paul, but it didn't help massive errors in
> Python encoding.
>
> IMHO it's even better to output wrong encodings than to halt the
> WHOLE damn program with an exception

I disagree. Maybe I'll now get round to uploading an amusing pictorial
example of this strategy just to illustrate where it can lead. CJK
characters may be more demanding to deal with than various European
characters, but I've seen public advertisements (admittedly aimed at
IT course applicants) which made jokes about stuff like "Ã¥" and "Ã¸"
appearing in documents instead of the intended European characters, so
it's fairly safe to say that people do care what gets written out from
computer programs.

> When debugging encoding problems, the solution is simple. If
> characters display wrong, switch to another encoding, one of them must
> be right.
>
> But it's tiring in Python to deal with encodings, you have to wrap
> EVERY SINGLE character expression with try ... except ... just imagine
> what a pain it is.

If everything is in Unicode then you don't have to think about
encodings. I recommend using things like codecs.open to ensure that
input and output produce and consume Unicode objects when dealing
with files.
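
For example, a minimal Python 2 sketch (the filename and sample text
are just placeholders):

import codecs

text = u'\u4e2d\u6587'              # two CJK characters

out = codecs.open('example.txt', 'w', encoding='utf-8')
out.write(text)                     # encoded to UTF-8 bytes on write
out.close()

inp = codecs.open('example.txt', 'r', encoding='utf-8')
restored = inp.read()               # decoded back to a unicode object
inp.close()

assert restored == text
assert isinstance(restored, unicode)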

> Just like the example I gave in Google Groups, u'\ue863' can NEVER be
> encoded into '\xfe\x9f'. Not a chance, because Python REFUSES to handle
> a byte that is greater than range(128).

Aside from the matter of which encoding you'd need to use to convert
u'\ue863' into '\xfe\x9f', it has nothing to do with any implicit byte
value range. To get from a Unicode object to a sequence of bytes
(since that is the external representation of the text for other
programs), Python has to perform a conversion. As a safe (but
obviously conservative) default, Python only attempts to convert each
Unicode character to a byte value using the ASCII character value
table, which is only defined for characters 0 to 127 - there's no such
thing as "8-bit ASCII".

Python doesn't attempt to automatically convert using other character
tables (encodings, in other words), since there is quite a large
possibility that the result, if not produced for the correct encoding,
will not produce the desired visual effect. If I start with, say,
character "ø" and encode it using UTF-8, I get a sequence of bytes
which, if interpreted by a program expecting ISO-8859-1, will appear
as "Ã¸". If I encode the character using ISO-8859-1 and then feed the
resulting byte sequence to a program expecting UTF-8, it will probably
either complain or produce an incorrect visual effect. The reason why
ASCII is safer (although not entirely safe) is because many encodings
support ASCII as a subset of themselves.
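
The effect is easy to reproduce (Python 2):

c = u'\xf8'                                  # "ø"

utf8_bytes = c.encode('utf-8')               # '\xc3\xb8'
mojibake = utf8_bytes.decode('iso-8859-1')   # u'\xc3\xb8', i.e. "Ã¸"

latin1_bytes = c.encode('iso-8859-1')        # '\xf8'
try:
    latin1_bytes.decode('utf-8')
except UnicodeDecodeError:
    print 'not valid UTF-8'                  # the UTF-8 consumer complains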

> Strangely the 'mbcs' encoding system can. Does 'mbcs' have magic or
> something? But it's Windows-specific

I thought Microsoft used some UTF-16 variant. That would explain how
it can handle more or less everything.

> Dealing with character encodings is really simple. AFAIK early
> encoding before Unicode, although they have many names, are all based
> on hacks. Take Chinese characters as an example. They are called
> GB2312 encoding, in fact it is totally compatible with range(256)
> ANSI. (There are minor issues like displaying half of a wide character
> as a question mark, but at least it's readable.) If you just output a
> series of bytes, it IS GB2312. The same is true with BIG5, JIS,
> etc.

From the Wikipedia page, it appears that you need to convert GB2312
values to EUC-CN by a relatively straightforward process, and can then
output the resulting byte sequence in an ASCII-compatible way: only
the byte values greater than 127 would produce nonsense for anyone
using a program not expecting EUC-CN. UTF-8 has some similar
properties, but as I
noted above, you wouldn't want to read most of the output if your
program wasn't expecting UTF-8.
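
Python's bundled codec illustrates this (a Python 2 sketch using the
standard 'gb2312' codec, which produces the EUC-CN byte form):

s = u'abc\u4e2d\u6587'            # ASCII followed by two CJK characters

data = s.encode('gb2312')
print repr(data)                  # 'abc\xd6\xd0\xce\xc4'

# ASCII passes through unchanged; each CJK character becomes two
# bytes, both above 127.
assert data.startswith('abc')
assert all(ord(b) > 127 for b in data[3:])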

> Like I said, str() should NOT throw an exception BY DESIGN, it's a
> basic language standard. str() is not only a convert-to-string
> function, but also a serialization in most cases (e.g. socket). My
> simple suggestion is: If it's a unicode character, output as UTF-8;
> otherwise just output byte array, please do not encode it with really
> stupid range(128) ASCII. It's not guessing, it's totally wrong.

I think it's unfortunate that "str" is now potentially unreliable for
certain uses, but to just output an arbitrary byte sequence (unless by
byte array you mean a representation of the numeric values) is the
wrong thing to do unless you don't care about the output, in which
case you could just as well use "repr" instead. I think the output of
"str" vs. "unicode" especially with regard to Unicode objects was
discussed extensively on the python-dev mailing list at one point.
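
For the cases where readable output doesn't matter, "repr" is always
safe (Python 2):

u = u'\u4e2d\u6587'

print repr(u)   # u'\u4e2d\u6587' - pure ASCII, never raises

try:
    str(u)      # would need the default ASCII codec
except UnicodeEncodeError:
    print 'str() fails where repr() does not'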

I don't disagree that people sometimes miss a way of having Python or
some library "do the right thing" when writing stuff out. I could
imagine a wrapper for Python accepting UTF-8 whose purpose is to
"blank out" characters which the console cannot handle, and people
might use this wrapper explicitly because that is the "right thing"
for them. Indeed, such a program may already exist for a more general
audience since I imagine that it could be fairly useful.
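
A sketch of what such a wrapper might look like, using the codecs
machinery; falling back to ASCII when the console reports no encoding
is just my assumption:

import codecs
import sys

# Wrap stdout so Unicode output is encoded for the console, with
# unencodable characters replaced instead of raising an exception.
encoding = sys.stdout.encoding or 'ascii'
sys.stdout = codecs.getwriter(encoding)(sys.stdout, errors='replace')

print u'ASCII text and \ue863 together'   # anything the console cannot
                                          # represent becomes '?'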

Paul


