[Python-Dev] unicode hell/mixing str and unicode as dictionary keys

Mon Aug 7 17:41:39 CEST 2006

David Hopwood wrote:
> I disagree. Unicode strings should always be considered distinct from
> non-ASCII byte strings. Implicitly encoding or decoding in order to
> perform a comparison is a bad idea; it is expensive and will often do
> the wrong thing.

That's a pretty irrelevant position at this point; Python has had
the notion of a system encoding since Unicode was introduced,
and we are not going to remove that just before a release candidate
of Python 2.5.

The question at hand is not whether certain objects should compare
unequal, but whether comparing them should raise an exception.
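
To make the two candidate behaviours concrete, here is a rough sketch in
Python 3 syntax. The function names are made up for illustration, and
"ascii" merely stands in for the system encoding; this is not the
interpreter's actual comparison code.

```python
def equal_or_raise(text, raw, encoding="ascii"):
    """Behaviour A: decode implicitly and let a decoding failure propagate."""
    return text == raw.decode(encoding)


def equal_or_unequal(text, raw, encoding="ascii"):
    """Behaviour B: treat undecodable bytes as simply unequal to any text."""
    try:
        return text == raw.decode(encoding)
    except UnicodeDecodeError:
        return False


# Pure-ASCII bytes compare equal under both behaviours.
assert equal_or_raise("abc", b"abc")
assert equal_or_unequal("abc", b"abc")

# Non-ASCII bytes: behaviour B answers False, behaviour A raises.
assert equal_or_unequal("\xe4", b"\xe4") is False
try:
    equal_or_raise("\xe4", b"\xe4")
except UnicodeDecodeError:
    raised = True
else:
    raised = False
```

Under behaviour A, a dictionary lookup that happens to compare such a pair
of keys would fail with the same exception, which is what the subject line
is about.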

>>> Which of the two conversions is selected is arbitrary; [...]
>
> It would not be arbitrary. In the common case where the byte encoding
> uses "precomposed" characters, using "U.encode(system_encoding) == B"
> will tend to succeed in more cases than "B.decode(system_encoding) == U",
> because alternative representations of the same abstract character in
> Unicode will be mapped to the same precomposed character.

No, they won't (although they should, perhaps):

py> u'o\u0308'.encode("latin-1")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u0308' in position 1: ordinal not in range(256)
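
For the decomposed form to encode, it would have to be normalized
explicitly first; the codec does not do that on its own. A minimal sketch
in Python 3 syntax, using unicodedata.normalize from the standard library
(the variable names are only illustrative):

```python
import unicodedata

decomposed = "o\u0308"    # 'o' followed by COMBINING DIAERESIS
precomposed = "\u00f6"    # LATIN SMALL LETTER O WITH DIAERESIS

# Encoding the decomposed form directly fails, as in the session above.
try:
    decomposed.encode("latin-1")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# After NFC normalization the two forms are the same string,
# and that string has a Latin-1 representation.
composed = unicodedata.normalize("NFC", decomposed)
encoded = composed.encode("latin-1")

# Decoding the byte does not help either: it yields the precomposed
# form, which is not equal to the decomposed string.
decoded_matches = encoded.decode("latin-1") == decomposed
```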

In addition, it's also possible to find encodings (e.g. iso-2022) where
different byte sequences decode to the same Unicode string.
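
As a sketch of that point (Python 3 syntax, and assuming the iso2022_jp
codec accepts a redundant escape sequence that re-selects ASCII, as
CPython's codec is understood to do):

```python
plain = b"abc"
# ESC ( B designates ASCII, which is already the initial state,
# so the escape sequence changes nothing about the decoded text.
with_escape = b"\x1b(Babc"

same_text = plain.decode("iso2022_jp") == with_escape.decode("iso2022_jp")
same_bytes = plain == with_escape
```

Here the decoded strings agree while the byte strings differ, so a
comparison done by encoding the Unicode side can match at most one of the
two byte strings.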

Regards,
Martin