[Python-Dev] decoding errors when comparing strings

M.-A. Lemburg mal@lemburg.com
Sat, 15 Jul 2000 19:15:05 +0200


Fredrik Lundh wrote:
> 
> with the locale aware encoding mechanisms switched off
> (sys.getdefaultencoding() == "ascii"), I stumbled upon some
> interesting behaviour:
> 
> first something that makes sense:
> 
>     >>> u"abc" == "abc"
>     1
> 
>     >>> u"åäö" == "abc"
>     0
> 
> but how about this one:
> 
>     >>> u"abc" == "åäö"
>     Traceback (most recent call last):
>       File "<stdin>", line 1, in ?
>     UnicodeError: ASCII decoding error: ordinal not in range(128)
> 
> or this one:
> 
>     >>> u"åäö" == "åäö"
>     Traceback (most recent call last):
>       File "<stdin>", line 1, in ?
>     UnicodeError: ASCII decoding error: ordinal not in range(128)
> 
> ignoring implementation details for a moment, is this really the
> best we can do?

This is simply because, on your Latin-1 platform, "ä" and u"ä"
map to the same ordinals. The unicode-escape codec (which the
Python parser uses) treats each single character in the whole
8-bit range as a Unicode ordinal, so u"ä" really maps to
unichr(ord("ä")).
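
To make the ordinal mapping concrete, here is a small sketch. It is
written for a modern Python 3 interpreter rather than the 2.0 one
discussed in this thread, so the names (bytes, str, chr) are not the
ones used above, and the byte values assume the Latin-1 encoding of
the three Swedish letters:

```python
# Sketch only: Python 3 syntax, not the Python 2.0 interpreter from the thread.
raw = b"\xe5\xe4\xf6"  # assumed Latin-1 bytes of the three non-ASCII letters

# Latin-1 maps every byte value to the code point with the same number,
# which is the "same ordinals" effect described above.
as_latin1 = raw.decode("latin-1")
same_ordinals = [ord(ch) for ch in as_latin1] == list(raw)

# The ASCII codec rejects the very same bytes, which is where the
# "ordinal not in range(128)" error in the quoted session comes from.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```

With a different default encoding the first step would no longer be an
identity mapping, so the coincidence is specific to Latin-1.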

The alternative would be to force the use of escapes for non-ASCII
characters in Unicode literals and to issue an error for any
unescaped non-ASCII character.

BTW, I have a feeling that we should mask the decoding errors
during compares and return 0 instead...

...otherwise the current dictionary implementation would bomb (it
doesn't do any compare error checking!) whenever a Unicode string
happens to have the same hash value as an 8-bit string key. (I can't
test this right now, but this is what should happen according to the
C sources.)
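
A rough sketch of what "masking" could look like, again in modern
Python 3 rather than in the C sources referred to above. The names
masked_equal and CollidingKey are hypothetical and exist only for
this illustration; the constant hash is a deliberate trick to force
the collision case, not something a real key type would do:

```python
def masked_equal(text, raw, encoding="ascii"):
    """Compare a str with bytes; undecodable bytes count as 'not equal'."""
    try:
        return text == raw.decode(encoding)
    except UnicodeDecodeError:
        # Mask the decoding error and report inequality instead.
        return False


class CollidingKey:
    """Hypothetical wrapper whose constant hash forces lookups into __eq__."""

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return 0  # every key collides on purpose

    def __eq__(self, other):
        if not isinstance(other, CollidingKey):
            return NotImplemented
        a, b = self.value, other.value
        if isinstance(a, str) and isinstance(b, bytes):
            return masked_equal(a, b)
        if isinstance(a, bytes) and isinstance(b, str):
            return masked_equal(b, a)
        return a == b


# An 8-bit key that cannot be decoded as ASCII, probed with a text key.
table = {CollidingKey(b"\xe5\xe4\xf6"): "8-bit key"}
found = table.get(CollidingKey("abc"), "miss")
```

Because the compare never raises, the colliding lookup simply falls
through to a miss instead of surfacing a decoding error from inside
the dictionary code.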

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/