a question about Chinese characters in a Python Program
Steven D'Aprano
steve at REMOVE-THIS-cybersource.com.au
Mon Oct 20 11:46:28 EDT 2008
On Mon, 20 Oct 2008 06:30:09 -0700, est wrote:
> Like I said, str() should NOT throw an exception BY DESIGN, it's a basic
> language standard.
int() is also a basic language standard, but it is perfectly acceptable
for int() to raise an exception if you ask it to convert something into
an integer that can't be converted:
>>> int("cat")
What else would you expect int() to do but raise an exception?
If you ask str() to convert something into a string which can't be
converted, then what else should it do other than raise an exception?
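The parallel can be sketched in a few lines. This is only an illustration, written for Python 3, where the implicit ASCII conversion that Python 2's str() applies to unicode objects has to be spelled as an explicit .encode("ascii"). The helper name try_convert is hypothetical and not part of any library.

```python
def try_convert(func, value):
    """Call func(value); return the result, or the name of the exception raised."""
    try:
        return func(value)
    except ValueError as exc:
        # UnicodeEncodeError is a subclass of ValueError, so it lands here too.
        return type(exc).__name__


def to_ascii(text):
    """Explicit stand-in for the implicit ASCII conversion discussed above."""
    return text.encode("ascii")


# int() cannot turn "cat" into an integer, so it raises.
int_result = try_convert(int, "cat")

# ASCII cannot represent Chinese characters, so the encoder raises as well.
ascii_result = try_convert(to_ascii, "\u4e2d\u6587")

# Both conversions succeed when the input is actually convertible.
ok_int = try_convert(int, "42")
ok_ascii = try_convert(to_ascii, "cat")

print(int_result, ascii_result, ok_int, ok_ascii)
```

In both cases the function refuses to guess and leaves the decision to the caller.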
Whatever answer you give, somebody else will argue it should do another
thing. Maybe I want failed characters replaced with '?'. Maybe Fred wants
failed characters deleted altogether. Susan wants UTF-16. George wants
Latin-1.
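Each of those wishes corresponds to a different explicit call. The following is a rough sketch in Python 3 syntax (an assumption on my part; the thread itself is about Python 2), using only the standard error handlers and codecs of str.encode(). The sample text is made up for the illustration.

```python
# Sample text: 'cafe' with an e-acute, a space, and one Chinese character.
text = "caf\u00e9 \u4e2d"

# "Replace failed characters with '?'"
replaced = text.encode("ascii", errors="replace")

# "Delete failed characters altogether"
deleted = text.encode("ascii", errors="ignore")

# "I want UTF-16" (little-endian, without a byte-order mark)
utf16 = text.encode("utf-16-le")

# "I want Latin-1": fine for the e-acute, but the Chinese character
# still cannot be represented, so this raises too.
try:
    latin1 = text.encode("latin-1")
except UnicodeEncodeError:
    latin1 = None

print(replaced, deleted, len(utf16), latin1)
```

None of these results is more "correct" than the others, which is why the choice has to be made explicitly rather than by str().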
The simple fact is that there is no 1:1 mapping from all 65,000+ Unicode
characters to the 256 bytes used by byte strings, so there *must* be an
encoding, otherwise you don't know which characters map to which bytes.
ASCII has the advantage of being the lowest common denominator. Perhaps
it doesn't make too many people very happy, but it makes everyone equally
unhappy.
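To make the "which characters map to which bytes" point concrete, here is a small sketch, again assuming Python 3 and its standard codecs. The particular character and code pages are arbitrary choices for the example.

```python
char = "\u00e9"  # e-acute

# The same character becomes different bytes under different encodings.
as_latin1 = char.encode("latin-1")    # one byte
as_utf8 = char.encode("utf-8")        # two bytes
as_utf32 = char.encode("utf-32-be")   # four bytes, UCS-4 style

# The same byte becomes different characters under different encodings.
raw = b"\xe9"
from_latin1 = raw.decode("latin-1")   # e-acute
from_cp1251 = raw.decode("cp1251")    # a Cyrillic letter

print(as_latin1, as_utf8, as_utf32)
print(from_latin1 == from_cp1251)
```

Without knowing the encoding, the byte 0xE9 on its own says nothing about which character was meant.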
> str() is not only a convert-to-string function, but
> also a serialization in most cases (e.g. socket). My simple suggestion
> is: if it's a unicode character, output as UTF-8;
Why UTF-8? That will never do. I want it output as UCS-4.
> otherwise just output the
> byte array. Please do not encode it with really stupid range(128) ASCII.
> It's not guessing, it's totally wrong.
If you start with a byte string, you can always get a byte string:
>>> s = '\x96 \xa0 \xaa' # not ASCII characters
>>> s
'\x96 \xa0 \xaa'
>>> str(s)
'\x96 \xa0 \xaa'
--
Steven