a question about Chinese characters in a Python Program

est electronixtar at gmail.com
Mon Oct 20 09:30:09 EDT 2008


On Oct 20, 6:47 pm, Paul Boddie <p... at boddie.org.uk> wrote:
> On 20 Okt, 07:32, est <electronix... at gmail.com> wrote:
>
>
>
> > Personally I call it a serious bug in python
>
> Normally I'd entertain the possibility of bugs in Python, but your
> reasoning is a bit thin (in http://bugs.python.org/issue3648): "Why
> can't Python just define ascii to range(256)"
>
> I do accept that it can be awkward to output text to the console, for
> example, but you have to consider that the console might not be
> configured to display any character you can throw at it. My console is
> configured for ISO-8859-15 (something like your magical "ascii to
> range(256)" only where someone has to decide what those 256 characters
> actually are), but that isn't going to help me display CJK characters.
> A solution might be to generate UTF-8 and then get the user to display
> the output in an appropriately configured application, but even then
> someone has to say that it's UTF-8 and not some other encoding that's
> being used. As discussed in another recent thread, Python 2.x does
> make some reasonable guesses about such matters to the extent that
> it's possible automatically (without magical knowledge).
>
> There is also the problem about use of the "str" built-in function or
> any operation where some Unicode object may be converted to a plain
> string. It is now recommended that you only convert to plain strings
> when you need to produce a sequence of bytes (for output, for
> example), and that you indicate how the Unicode values are encoded as
> bytes (by specifying an encoding). Python 3.x doesn't really change
> this: it just makes the Unicode/text vs. bytes distinction more
> obvious.
>
> Paul

Thanks for the long comment, Paul, but it doesn't address the massive
problems with encoding in Python.

IMHO it's better to output text in the wrong encoding than to halt the
WHOLE damn program with an exception.

When debugging encoding problems, the solution is simple: if the
characters display wrongly, switch to another encoding; one of them
must be right.
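That trial-and-error approach can be sketched directly in Python. The sample bytes below are a hypothetical GBK-encoded snippet, not data from the original thread:

```python
# Try a few candidate codecs against bytes of unknown encoding;
# whichever decodes without error AND reads sensibly is the right one.
data = b'\xd6\xd0\xce\xc4'  # hypothetical sample: "Chinese" in GBK bytes

for codec in ('utf-8', 'gbk', 'big5', 'latin-1'):
    try:
        print(codec, '->', data.decode(codec))
    except UnicodeDecodeError:
        print(codec, '-> failed')
```

Note that codecs like latin-1 decode any byte sequence without error, so "it didn't raise" alone doesn't prove the guess was right; you still have to look at the result.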

But dealing with encodings in Python is tiring: you have to wrap EVERY
SINGLE string expression in try ... except ... Just imagine what a
pain that is.
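For what it's worth, the try/except wrapping can usually be avoided: the codecs accept an errors argument that substitutes or drops unencodable characters instead of raising. A minimal sketch:

```python
text = u'caf\xe9 \ue863'  # mixes a Latin-1 letter and a private-use character

# 'replace' turns each unencodable character into b'?'
print(text.encode('ascii', errors='replace'))

# 'ignore' silently drops unencodable characters
print(text.encode('ascii', errors='ignore'))
```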

Just like the example I gave on Google Groups, u'\ue863' can NEVER be
encoded into '\xfe\x9f'. Not a chance, because Python REFUSES to
handle any byte value outside range(128).
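The failure mode in question can be reproduced in a couple of lines; encoding the same character with an explicitly chosen codec such as UTF-8 works fine:

```python
text = u'\ue863'               # a private-use code point

try:
    text.encode('ascii')       # the ASCII codec only covers range(128)
except UnicodeEncodeError as exc:
    print('ascii failed:', exc)

print(text.encode('utf-8'))    # an explicitly named codec succeeds
```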

Strangely, the 'mbcs' codec can. Does 'mbcs' have magic or something?
But it's Windows-specific.

Dealing with character encodings is really simple. AFAIK the early
encodings that predate Unicode, although they go by many names, are
all based on hacks. Take Chinese characters as an example. The
encoding is called GB2312, but in fact it is totally compatible with
range(256) ANSI. (There are minor issues, like half of a
wide character displaying as a question mark ?, but at least it's
readable.) If you just output the bytes one after another, it IS
GB2312. The same is true of BIG5, JIS, etc.


Like I said, str() should NOT throw an exception BY DESIGN; that's a
basic language standard. str() is not only a convert-to-string
function, but also a serialization in most cases (e.g. sockets). My
simple suggestion is: if it's a unicode string, output it as UTF-8;
otherwise just output the byte array. Please do not encode it with the
really stupid range(128) ASCII codec. It's not guessing, it's totally
wrong.
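That suggested behaviour can be sketched as an ordinary helper function (a hypothetical to_bytes, not anything in the stdlib), assuming the text/bytes split as Python 3 draws it:

```python
def to_bytes(value):
    """Hypothetical serializer: text goes out as UTF-8, bytes pass through."""
    if isinstance(value, bytes):
        return value                   # already a byte array: output untouched
    if isinstance(value, str):
        return value.encode('utf-8')   # unicode text: encode as UTF-8
    return str(value).encode('utf-8')  # anything else: stringify, then UTF-8

print(to_bytes(u'\ue863'))    # UTF-8 bytes, no exception raised
print(to_bytes(b'\xfe\x9f'))  # raw bytes pass through unchanged
```

A wrapper like this gives the never-raise behaviour argued for above without changing the language itself; the cost is that the receiver must know (or guess) that text fields are UTF-8.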


