unicode strings and strings mix

Martin v. Löwis loewis at informatik.hu-berlin.de
Tue Jun 18 07:33:32 EDT 2002


Roman Suzi <rnd at onego.ru> writes:

> What I wanted to ask is that it should be considered UNSAFE
> to mix strings and Unicode-strings without explicit {en,de}code
> methods, because the program must not depend on locale setting.

Indeed. Changing the system encoding is evil, for this precise reason.

> For example, a developer is using "ASCII" 128-255 for his national
> alphabet (Oleg pointed siteconfig.py) and writes a program which runs just
> fine for him. However, the same program will fail for almost everyone
> else.

Correct. The I18N people in Python are apparently divided as to how
serious the impact of such a change is. Oleg (and others) suggest
changing the system encoding as a solution, and for some usages it
indeed is one. I (and others) strongly advise against changing the
system encoding, since it means that you write a program that works
fine on one installation, but fails on another. Explicit is better
than implicit.
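
A minimal sketch of the explicit approach, assuming Python 3 (where
bytes and str are separate types, unlike the Python discussed in this
thread). The helper names to_text and to_bytes and the koi8-r sample
are illustrative assumptions, not anything from the standard library
or from Roman's program:

```python
# Hypothetical helpers: the caller always names the encoding, so the
# result never depends on a per-installation default.

def to_text(raw: bytes, encoding: str) -> str:
    """Decode a byte string with an explicitly named encoding."""
    return raw.decode(encoding)


def to_bytes(text: str, encoding: str) -> bytes:
    """Encode a Unicode string with an explicitly named encoding."""
    return text.encode(encoding)


# Sample data (assumption): Cyrillic text stored as koi8-r bytes.
original = "Привет"
raw = to_bytes(original, "koi8-r")

# Decoding with the encoding the data was written in round-trips.
round_tripped = to_text(raw, "koi8-r")

# Decoding the same bytes with a different encoding does not fail,
# it just silently yields different text.
misread = to_text(raw, "latin-1")
```

Here round_tripped equals the original text, while misread is a
different string of the same length. That silent difference is what a
changed system encoding would produce on someone else's installation.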

> I think, it is a source for very subtle errors in Python programs as it
> easily makes them not 8-bit clean!

The current default for the system encoding ("ascii") was chosen
precisely to prevent such errors in the default Python
installation. ASCII only covers the bytes 0-127 (anything above that
is *not* part of the American Standard Code for Information
Interchange), so mixing a Unicode string with a byte string that
contains anything above 127 raises an error instead of silently
guessing an encoding.
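
The 0-127 boundary can be checked directly. This is a small sketch
assuming Python 3; decodes_as_ascii is a made-up helper name used only
for illustration:

```python
def decodes_as_ascii(raw: bytes) -> bool:
    """Return True if every byte in raw is valid ASCII (0-127)."""
    try:
        raw.decode("ascii")
    except UnicodeDecodeError:
        # The "ascii" codec refuses any byte above 127 rather than guessing.
        return False
    return True


# Every single byte value below 128 decodes; every value from 128 up fails.
accepted = [b for b in range(256) if decodes_as_ascii(bytes([b]))]
rejected = [b for b in range(256) if not decodes_as_ascii(bytes([b]))]
```

With this, accepted holds exactly the values 0 through 127 and
rejected holds 128 through 255, so national-alphabet bytes produce an
error instead of a guess.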

ASCII happens to be a subset of most other encodings, so if you have a
string consisting only of bytes below 128, you can most likely
convert it to Unicode correctly - that's why "ascii" is a reasonably
safe encoding as a default. There are exceptions, though - iso-2022-jp
uses only bytes below 128 (it switches character sets with escape
sequences), so such data could be mistaken for ASCII.
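
The iso-2022-jp exception can be sketched as follows, assuming Python 3
and its bundled "iso2022_jp" codec; the sample text is an arbitrary
choice for illustration:

```python
# Sample text (assumption): three kanji, none of them ASCII characters.
text = "日本語"

# iso-2022-jp represents them using escape sequences and 7-bit bytes only.
encoded = text.encode("iso2022_jp")
all_seven_bit = all(byte < 128 for byte in encoded)

# Because every byte is below 128, the "ascii" codec accepts the data
# without complaint, but the result is escape-sequence noise, not the text.
as_ascii = encoded.decode("ascii")

# Only decoding with the right encoding recovers the original.
as_japanese = encoded.decode("iso2022_jp")
```

Here the ASCII decode succeeds without any error, yet as_ascii differs
from text, while as_japanese matches it. So "decodes as ASCII" does not
prove that the data really is ASCII.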

Regards,
Martin


