unicode strings and strings mix
Martin v. Löwis
loewis at informatik.hu-berlin.de
Tue Jun 18 07:33:32 EDT 2002
Roman Suzi <rnd at onego.ru> writes:
> What I wanted to ask is that it should be considered UNSAFE
> to mix strings and Unicode-strings without explicit {en,de}code
> methods, because the program must not depend on locale setting.
Indeed. Changing the system encoding is evil, for this precise reason.
> For example, a developer is using "ASCII" 128-255 for his national
> alphabet (Oleg pointed siteconfig.py) and writes a program which runs just
> fine for him. However, the same program will fail for almost everyone
> else.
Correct. The I18N people in Python are apparently divided as to how
serious the impact of such a change is. Oleg (and others) point to it
as a solution, and for some uses it is indeed one. I (and others)
strongly advise against changing the system encoding, since it means
that you write a program that works fine on one installation, but
fails on another. Explicit is better than implicit.
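As a rough sketch of what explicit conversion looks like: the snippet
below is written for Python 3, where bytes stands in for the 8-bit
string and str for the Unicode string. The sample word and the choice
of koi8-r and cp1251 are assumptions made for illustration only.

```python
# Python 3 sketch: bytes stands in for the 8-bit string, str for the
# Unicode string. Sample text and encodings are illustrative assumptions.
text = "\u041f\u0440\u0438\u0432\u0435\u0442"  # a Russian word, as Unicode

# Explicit encode: the program, not the installation, names the encoding.
raw = text.encode("koi8-r")

# Explicit decode with the same encoding restores the original text.
roundtrip = raw.decode("koi8-r")

# Decoding the same bytes with a different 8-bit encoding raises no
# error but yields different characters. This is the risk of relying
# on an implicit, installation-dependent conversion.
misread = raw.decode("cp1251")

print(roundtrip == text, misread == text)
```

Because the encoding is named at every conversion, the result does not
change from one installation to the next.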
> I think, it is a source for very subtle errors in Python programs as it
> easily makes them not 8-bit clean!
The current default for the system encoding ("ascii") was chosen
precisely to prevent such errors in the default Python installation.
ASCII covers only the bytes 0-127 (anything above that is *not* part
of the American Standard Code for Information Interchange).
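To illustrate the 0-127 boundary, here is a small Python 3 sketch
(bytes again standing in for the 8-bit string). The helper name
is_ascii is hypothetical, not something from the standard library.

```python
def is_ascii(raw):
    """Return True if every byte in raw decodes as ASCII (hypothetical helper)."""
    try:
        raw.decode("ascii")
    except UnicodeDecodeError:
        # Raised as soon as a byte outside 0-127 is encountered.
        return False
    return True


# Every byte value 0-127 is accepted by the "ascii" codec.
low_ok = is_ascii(bytes(range(128)))

# No byte value 128-255 is accepted on its own.
high_ok = any(is_ascii(bytes([b])) for b in range(128, 256))

print(low_ok, high_ok)
```

With "ascii" as the default, a string using bytes 128-255 for a
national alphabet fails loudly instead of being converted by guesswork.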
ASCII happens to be a subset of most other encodings, so if you have a
string consisting only of characters below 128, you can most likely
convert it to Unicode correctly - that's why "ascii" is a reasonably
safe encoding as a default. There are exceptions, though - iso-2022-jp
uses only 7-bit bytes (with escape sequences to switch character
sets), so text in it could be mistaken for ASCII.
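Both halves of that claim can be checked with a short Python 3 sketch.
The list of encodings and the sample strings are assumptions picked
for illustration.

```python
# Pure-ASCII bytes decode to the same text under many encodings.
plain = b"hello, world"
encodings = ("ascii", "latin-1", "utf-8", "koi8-r", "cp1251")
decoded = [plain.decode(enc) for enc in encodings]
all_same = all(d == "hello, world" for d in decoded)

# The exception: iso-2022-jp output contains only bytes below 128,
# so the "ascii" codec accepts it without complaint.
japanese = "\u65e5\u672c\u8a9e"  # three kanji, given as escapes
raw = japanese.encode("iso-2022-jp")
seven_bit = all(b < 128 for b in raw)

# Decoding as ASCII succeeds, but the result is escape sequences and
# unrelated ASCII characters, not the original text.
wrong = raw.decode("ascii")

print(all_same, seven_bit, wrong == japanese)
```

So "only bytes below 128" is a strong hint that data is ASCII, not a
proof; an encoding that is known should still be named explicitly.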
Regards,
Martin