unicode strings and strings mix

Roman Suzi rnd at onego.ru
Tue Jun 18 05:36:13 EDT 2002


On 18 Jun 2002, Martin v. Löwis wrote:

>Gerhard Häring <gerhard at bigfoot.de> writes:
>
>> 'x' and 'A' are in the ASCII range, so this shouldn't produce an
>> exception. I also cannot reproduce it with sys.getdefaultencoding() ==
>> "ascii".

Sorry. The message was in KOI8-R, but I just wanted to have
characters with the high bit set.

>These were not 'x' and 'A', but '\xd7\xc1\xd7\xc1\xd7'. Since the
>article was posted in KOI8-R, Roman probably meant those bytes to
>denote CYRILLIC SMALL LETTER VE and CYRILLIC SMALL LETTER A,
>respectively.
>
>Of course, when Python adds strings, it can't possibly know that this
>is how the byte string was meant to be interpreted, so you need to
>write
>
>unichr(0x3345) + unicode('\xd7\xc1\xd7\xc1\xd7', 'koi8-r')
>
>The result string cannot be represented in KOI8-R, though, since it
>contains SQUARE MAHHA.

That was chosen at random.
What I wanted to ask is whether it should be considered UNSAFE
to mix strings and Unicode strings without explicit {en,de}code
calls, because a program must not depend on the locale setting.
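
A minimal sketch of the explicit approach, using the bytes from the quoted
example. It assumes a Python version that accepts both the b'' and u''
literal prefixes; the variable names are only illustrative.

```python
# Bytes as they appeared in the KOI8-R message quoted above.
raw = b'\xd7\xc1\xd7\xc1\xd7'

# Explicit decode: the codec is named in the code, not taken from the locale.
text = raw.decode('koi8-r')

# Both operands are Unicode now, so no implicit conversion is involved.
mixed = u'\u3345' + text

# Explicit encode back to bytes; UTF-8 can represent every character here.
encoded = mixed.encode('utf-8')

# KOI8-R has no code for U+3345 (SQUARE MAHHA), so this encode fails.
try:
    mixed.encode('koi8-r')
    koi8_ok = True
except UnicodeError:
    koi8_ok = False
```

With the codec spelled out at both ends, the result does not change when the
program is moved to a machine with a different default encoding.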

For example, a developer uses the "ASCII" codes 128-255 for his national
alphabet (Oleg pointed to siteconfig.py) and writes a program which runs
just fine for him. However, the same program will fail for almost everyone
else.
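
To illustrate why the result depends on whose machine runs the program, here
is a small sketch that decodes the same two bytes with three codecs. The
choice of codecs is an assumption made for the example; any set of 8-bit
encodings would show the same effect.

```python
# The same two bytes, with the high bit set, read under three 8-bit codecs.
raw = b'\xd7\xc1'

as_koi8 = raw.decode('koi8-r')        # CYRILLIC SMALL LETTER VE, A
as_cp1251 = raw.decode('cp1251')      # CYRILLIC CAPITAL LETTER CHE, BE
as_latin1 = raw.decode('iso-8859-1')  # MULTIPLICATION SIGN, A WITH ACUTE

# Every decode succeeds, but each yields different characters.
all_differ = len({as_koi8, as_cp1251, as_latin1}) == 3
```

None of these calls fails, so a wrong guess about the encoding goes unnoticed
until someone looks at the text.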

I think this is a source of very subtle errors in Python programs, as it
easily makes them not 8-bit clean! Any stray

u"unicode-string" + "simple \234string"

can raise UnicodeError.
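
The failure can be reproduced with a rough model of the implicit conversion.
The helper mix() below is hypothetical: it only spells out the decode step
that otherwise happens silently, assuming the default encoding is "ascii" as
mentioned in the quoted text. The iso-8859-1 call at the end is an arbitrary
choice to show that naming a codec removes the error.

```python
def mix(unicode_part, byte_part, encoding='ascii'):
    # Hypothetical helper: makes the hidden decode step visible.
    # 'ascii' stands in for the interpreter's default encoding.
    return unicode_part + byte_part.decode(encoding)

# Pure ASCII bytes decode without trouble.
ok = mix(u"unicode-string", b" plain ascii")

# A single byte above 127 (\234 is 0x9C) makes the ASCII decode fail.
try:
    mix(u"unicode-string", b"simple \234string")
    failed = False
except UnicodeError:
    failed = True

# Naming an 8-bit codec explicitly avoids the exception.
recovered = mix(u"unicode-string", b"simple \234string", 'iso-8859-1')
```

The program works as long as only ASCII data passes through it, which is why
the error tends to appear late and on someone else's input.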

What do you think?

Sincerely yours, Roman Suzi
-- 
\_ Russia \_ Karelia \_ Petrozavodsk \_ rnd at onego.ru \_
\_ Tuesday, June 18, 2002 \_ Powered by Linux RedHat 7.2 \_
\_ "Laughing stock: cattle with a sense of humour." \_





