a question about Chinese characters in a Python Program

Steven D'Aprano steve at REMOVE-THIS-cybersource.com.au
Mon Oct 20 11:46:28 EDT 2008


On Mon, 20 Oct 2008 06:30:09 -0700, est wrote:

> Like I said, str() should NOT throw an exception BY DESIGN, it's a basic
> language standard.

int() is also a basic language standard, but it is perfectly acceptable 
for int() to raise an exception if you ask it to convert something into 
an integer that can't be converted:

int("cat")

What else would you expect int() to do but raise an exception?

If you ask str() to convert something into a string which can't be 
converted, then what else should it do other than raise an exception? 
Whatever answer you give, somebody else will argue it should do another 
thing. Maybe I want failed characters replaced with '?'. Maybe Fred wants 
failed characters deleted altogether. Susan wants UTF-16. George wants 
Latin-1.
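
To make that concrete, here is a rough sketch of those competing wishes, 
using only the codecs and error handlers that encode() already accepts. 
The sample text and the try_encode helper are made up for illustration, 
and the u"" literals are there only so the snippet runs on both 2.x and 3.x:

```python
# Hypothetical sample: two Chinese characters, a space, then "caf" + e-acute.
text = u"\u4e2d\u6587 caf\xe9"

def try_encode(s, encoding, errors="strict"):
    """Return the encoded bytes, or the exception name if encoding fails."""
    try:
        return s.encode(encoding, errors)
    except UnicodeEncodeError as exc:
        return type(exc).__name__

results = {
    "ascii/strict": try_encode(text, "ascii"),              # raises
    "ascii/replace": try_encode(text, "ascii", "replace"),  # '?' for failures
    "ascii/ignore": try_encode(text, "ascii", "ignore"),    # failures dropped
    "latin-1/strict": try_encode(text, "latin-1"),          # raises on Chinese
    "utf-8": try_encode(text, "utf-8"),
    "utf-16": try_encode(text, "utf-16"),                   # includes a BOM
}
```

Every one of those results is a defensible answer for somebody, which is 
why str() can't silently pick one on your behalf.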

The simple fact is that there is no 1:1 mapping from the 100,000+ Unicode 
characters to the 256 values a single byte can hold, so there *must* be an 
encoding, otherwise you don't know which characters map to which bytes.
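
A small sketch of why the encoding can't be left out (the variable names 
are mine, and the two codecs are just convenient examples): the same 
character becomes different bytes, and the same byte becomes different 
characters, depending on which encoding you name.

```python
e_acute = u"\xe9"  # LATIN SMALL LETTER E WITH ACUTE

# One character, two different byte sequences.
as_latin1 = e_acute.encode("latin-1")  # a single byte, 0xE9
as_utf8 = e_acute.encode("utf-8")      # two bytes, 0xC3 0xA9

# One byte, two different characters.
raw = b"\xe9"
as_western = raw.decode("latin-1")      # U+00E9, e with acute
as_cyrillic = raw.decode("iso-8859-5")  # U+0449, Cyrillic shcha
```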

ASCII has the advantage of being the lowest common denominator. Perhaps 
it doesn't make too many people very happy, but it makes everyone equally 
unhappy.



> str() is not only a convert to string function, but
> also a serialization in most cases.(e.g. socket) My simple suggestion
> is: If it's a unicode character, output as UTF-8; 

Why UTF-8? That will never do. I want it output as UCS-4.


> otherwise just output
> byte array, please do not encode it with really stupid range(128) ASCII.
> It's not guessing, it's totally wrong.

If you start with a byte string, you can always get a byte string:

>>> s = '\x96 \xa0 \xaa'  # not ASCII characters
>>> s
'\x96 \xa0 \xaa'
>>> str(s)
'\x96 \xa0 \xaa'
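
The trouble only starts when you cross between bytes and text, because 
that is the step that needs an encoding. A minimal sketch, assuming 2.6 
or later for the b"" literals (the variable names are mine):

```python
raw = b"\x96 \xa0 \xaa"  # the same non-ASCII bytes as above

# Bytes to bytes: nothing to decide, nothing can fail.
same = bytes(raw)

# Bytes to text with the ASCII codec: these bytes are out of range.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Bytes to text and back with an explicitly chosen codec: works,
# but only because the encoding was stated.
as_text = raw.decode("latin-1")
round_trip = as_text.encode("latin-1")
```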



-- 
Steven



