a question about Chinese characters in a Python Program
Ben Finney
bignose+hates-spam at benfinney.id.au
Mon Oct 20 16:27:47 EDT 2008
est <electronixtar at gmail.com> writes:
> IMHO it's even better to output wrong encodings rather than halt the
> WHOLE damn program by an exception
I can't agree with this. The correct thing to do in the face of
ambiguity is for Python to refuse to guess.
> When debugging encoding problems, the solution is simple. If
> characters display wrong, switch to another encoding, one of them
> must be right.
That's debugging a problem not in the program but in the *data*, and
Python helps with that by making such problems apparent as early as
feasible.
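To illustrate, here is a minimal sketch in Python 3 syntax (which
postdates this thread; the helper name try_decode and the sample data
are made up for the example). The same bytes are rejected outright by
one codec, silently accepted as garbage by another, and decoded
correctly only by the encoding they were actually written in:

```python
# Sample data: two Chinese characters (U+4E2D U+6587) encoded as UTF-8,
# standing in for bytes that arrive from a file or socket.
raw = "\u4e2d\u6587".encode("utf-8")

def try_decode(data, encoding):
    """Hypothetical helper: return decoded text, or None if the bytes
    are not valid in the given encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

as_ascii = try_decode(raw, "ascii")     # invalid bytes: the error surfaces at once
as_latin1 = try_decode(raw, "latin-1")  # no error, but the result is mojibake
as_utf8 = try_decode(raw, "utf-8")      # the encoding the data really uses
```

Note that "switch to another encoding until it looks right" gives no
signal in the latin-1 case: the decode succeeds, and only a human
looking at the text can tell it is wrong.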
> But it's tiring in python to deal with encodings, you have to wrap
> EVERY SINGLE character expression with try ... except ... just imagine
> what pain it is.
That sounds like a rather poor program design. Much better to sanitise
the inputs to the program at a few well-defined points, and know from
that point that the program is dealing internally with Unicode.
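A rough sketch of that design, again in Python 3 syntax and with
invented function names (read_text, shout, write_text), assuming the
encoding of each input and output is known or declared at the boundary:

```python
def read_text(raw_bytes, encoding):
    """Input boundary: decode once, with an explicitly stated encoding.
    Raises UnicodeDecodeError if the data does not match it."""
    return raw_bytes.decode(encoding)

def shout(text):
    """Internal logic: operates on Unicode text only and never has to
    think about encodings."""
    return text.upper()

def write_text(text, encoding):
    """Output boundary: encode once, for a destination whose encoding
    is known."""
    return text.encode(encoding)

incoming = "caf\u00e9".encode("latin-1")     # pretend this came from a legacy file
text = read_text(incoming, "latin-1")        # bytes -> Unicode at the edge
outgoing = write_text(shout(text), "utf-8")  # Unicode -> bytes at the edge
```

Only the two boundary functions can raise an encoding error, so those
are the only places that need error handling.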
> Dealing with character encodings is really simple.
Given that your solutions are baroque and complicated, I don't think
even you yourself can believe that statement.
> Like I said, str() should NOT throw an exception BY DESIGN, it's a
> basic language standard.
Any code should throw an exception if the input is both ambiguous and
invalid by the documented specification.
> str() is not only a convert to string function, but also a
> serialization in most cases.(e.g. socket) My simple suggestion is:
> If it's a unicode character, output as UTF-8; otherwise just output
> byte array, please do not encode it with really stupid range(128)
> ASCII. It's not guessing, it's totally wrong.
Your assumption would require that UTF-8 be a lowest *common*
denominator for most output devices Python will be connected to.
That's simply not the case; the lowest common denominator is still
ASCII.
I yearn for a future where all output devices can be assumed, in the
absence of other information, to understand a common Unicode encoding
(e.g. UTF-8), but we're not there yet and it would be a grave mistake
for Python to falsely behave as though we were.
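As a small illustration (Python 3 syntax; the sample string is
arbitrary), the program can still produce output for an ASCII-only
device or a UTF-8 one, but only by saying explicitly which it wants
rather than having the language assume:

```python
text = "D\u00e9j\u00e0 Vu"  # contains two non-ASCII characters

# Strict ASCII refuses to guess what to do with the accented letters.
try:
    text.encode("ascii")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# An explicit, lossless fallback for a device known to accept only ASCII.
escaped = text.encode("ascii", errors="backslashreplace")

# An explicit choice of UTF-8 for a device known to understand it.
as_utf8 = text.encode("utf-8")
```

Both of the last two calls succeed because the caller has supplied the
missing information; neither depends on a default.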
--
\ “I went to a fancy French restaurant called ‘Déjà Vu’. The head |
`\ waiter said, ‘Don't I know you?’” —Steven Wright |
_o__) |
Ben Finney