[Python-Dev] Re: Unicode and comparisons

M.-A. Lemburg mal@lemburg.com
Tue, 04 Apr 2000 18:47:51 +0200


"Martin v. Loewis" wrote:
> 
> > Question: is this behaviour acceptable or should I go even further
> > and mask decoding errors during compares and contains tests too ?
> 
> I always thought it is a core property of cmp that it works between
> all objects.

It does, but not necessarily without exceptions. I could easily
mask the decoding errors too and then have cmp() work exactly
as it does for strings, but the outcome may differ from what the
user expected because of the failed conversion. The sort order
may then look quite unsorted...
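
To make the trade-off concrete, here is a rough sketch of what
"masking" could look like. It is written for Python 3, where text
and byte strings are never compared implicitly, so the conversion
is spelled out by hand. The name masked_cmp and the fallback rule
(order by type name, then repr) are assumptions made purely for
illustration, not what the interpreter does:

```python
from functools import cmp_to_key


def masked_cmp(a, b):
    """Three-way compare of str/bytes values that never raises on bad UTF-8.

    Hypothetical helper: the fallback ordering is an arbitrary choice.
    """
    def as_text(value):
        # Byte strings are decoded as UTF-8; a failure is masked as None.
        if isinstance(value, bytes):
            try:
                return value.decode("utf-8")
            except UnicodeDecodeError:
                return None
        return value

    ta, tb = as_text(a), as_text(b)
    if ta is None or tb is None:
        # Conversion failed: fall back to an arbitrary but exception-free
        # ordering by type name, then by repr.
        ka = (type(a).__name__, repr(a))
        kb = (type(b).__name__, repr(b))
    else:
        ka, kb = ta, tb
    return (ka > kb) - (ka < kb)


# b"a\xe4\xf6\xfc" is Latin-1 encoded text and is not valid UTF-8.
items = ["1", b"a\xe4\xf6\xfc"]
result = sorted(items, key=cmp_to_key(masked_cmp))
```

The sort no longer raises, but the undecodable byte string is
placed by the fallback rule rather than by its characters, which
is exactly why the result can look unsorted to the user.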

> Because of that,
> 
> >>> x=[u'1','aäöü']
> >>> x.sort()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeError: UTF-8 decoding error: invalid data
> 
> fails. As always in cmp, I'd expect to get a consistent outcome here
> (ie. cmp should give a total order on objects).
> 
> OTOH, I'm not so sure why cmp between plain and unicode strings needs
> to perform UTF-8 conversion? IOW, why is it desirable that
> 
> >>> 'a' == u'a'
> 1

This is needed to enhance interoperability between Unicode
and normal strings. Note that they also have the same hash
value (provided both contain only ASCII characters), making
them interchangeable in dictionaries:

>>> d={u'a':1}
>>> d['a'] = 2
>>> d[u'a']
2
>>> d['a']
2

This is by design.
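
The rule behind this is that two keys address the same dictionary
slot only if they compare equal and hash equal. The session above
cannot be reproduced on Python 3, where text and byte strings
never compare equal, so the sketch below uses int and float as an
analogue of the same rule. The helper name interchangeable is an
assumption for illustration only:

```python
def interchangeable(a, b):
    """True if a and b would address the same dictionary slot."""
    return a == b and hash(a) == hash(b)


# int and float follow the rule: equal values, equal hashes.
d = {1: "first"}
d[1.0] = "second"      # updates the existing entry, the key stays 1

same_slot = interchangeable(1, 1.0)
# On Python 3, text and byte strings compare unequal, so they are
# distinct keys regardless of their hash values.
str_bytes = interchangeable("a", b"a")
```

Here d ends up with a single entry, just as the Unicode and plain
string keys share one entry in the example above.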
 
> Anyway, I'm not objecting to that outcome - I only think that, to get
> cmp consistent, it may be necessary to drop this result. If it is not
> necessary, the better.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/