'20' <= 100

Steven Taschuk staschuk at telusplanet.net
Fri May 2 11:33:55 EDT 2003


Quoth John Roth:
> "Grant Edwards" <grante at visi.com> wrote in message
> news:slrnbb3hve.13k.grante at tuxtop.visi.com...
  [...]
> > Isn't there also something about comparing two strings of
> > different encodings also raising an exception?
> 
> Hadn't heard that one, and I don't understand why it would
> do so. Unless I've missed something completely, strings
> don't carry encoding information with them. They're either
> unicode or single byte, and the latter can always be converted
> to the former for comparison purposes.

Almost true; a normal string might not be convertible to a Unicode
string, and in such cases the comparison raises an exception.

Example:

    >>> iso8859_1 = '\xfc' # latin small letter u with diaeresis in latin-1
    >>> unicode = u'\xfc' # same letter in unicode
    >>> iso8859_1 == unicode
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeError: ASCII decoding error: ordinal not in range(128)

Here the conversion to Unicode fails because Python assumes that
iso8859_1 is encoded in ASCII (the default encoding), but it
contains a byte which is not valid ASCII.  <denzel>Boom.</denzel>
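
Spelled out by hand, the implicit conversion amounts to roughly the
following.  This is only a sketch, written for a later Python that has
b'...' literals rather than the interpreter used in the session above;
the helper name try_ascii is made up for illustration.

```python
# Sketch only: assumes a Python with b'...' literals (2.6 or later).
raw = b'\xfc'  # the single byte from the example: u with diaeresis in Latin-1

def try_ascii(data):
    """Return the decoded text, or the UnicodeError raised by ASCII decoding."""
    try:
        return data.decode('ascii')
    except UnicodeError as exc:
        # 0xFC is outside range(128), so ASCII decoding cannot succeed.
        return exc

result = try_ascii(raw)
print(type(result).__name__)
```

A byte string containing only ASCII, such as b'x', decodes without
complaint, which is why the simple cases compare fine.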

This can be "solved" if you know the encoding of the string:

    >>> iso8859_1.decode('iso-8859-1')
    u'\xfc'
    >>> iso8859_1.decode('iso-8859-1') == unicode
    1

But suppose your normal string actually *isn't* encoded
characters?  What if it's just some bytes?  Then this doesn't
work.  You could instead go the other way:

    >>> bytes = '\x00\xff\x00\xff'
    >>> bytes == unicode.encode('utf-8')
    0

... if you can pick some encoding for the Unicode string that does
what you want.

But in general, it's not at all clear what it means for an
arbitrary sequence of bytes (a normal string) and an arbitrary
sequence of characters (a Unicode string) to be equal, much less
what it means to compare them for order.  The techniques above
give different answers depending on what encodings you use.
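
To make the dependence on encoding concrete, here is a small sketch.
It assumes a later Python with b'...' literals, and the helper
equal_under is hypothetical; the two codec names are standard ones,
chosen only because they disagree about byte 0xFC.

```python
# Sketch only: the same byte and the same character, compared under two encodings.
raw = b'\xfc'
text = u'\xfc'  # latin small letter u with diaeresis

def equal_under(data, chars, encoding):
    """Decode data with the given encoding, then compare as text."""
    return data.decode(encoding) == chars

# Latin-1 maps byte 0xFC to U+00FC; cp437 maps it to a different character.
print(equal_under(raw, text, 'iso-8859-1'))
print(equal_under(raw, text, 'cp437'))
```

Neither answer is more correct than the other unless you know which
encoding the bytes were actually written in.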

Those who want comparisons never to raise exceptions might be
happy with a solution in which, say, all normal strings are < all
Unicode strings.  This makes all the comparisons determinate, at
least, but it would lead to counterintuitive behaviour in simple
cases: we'd have, for example, 'x' < u'x', while naïvely one would
expect 'x' == u'x' (which is true at present).
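
For illustration, such an ordering can be imitated with a sort key.
This is a sketch of the hypothetical scheme, not anything Python
does; bytes_first_key is a made-up name, and the code assumes a later
Python with b'...' literals.

```python
# Sketch only: every byte string sorts before every Unicode string.
def bytes_first_key(s):
    """Hypothetical key: (is it Unicode?, the value itself)."""
    return (isinstance(s, type(u'')), s)

items = [u'x', b'y', u'a', b'x']
ordered = sorted(items, key=bytes_first_key)
print(ordered)
```

Under this key b'x' lands before u'a' and u'x', so the comparison is
always determinate, but b'x' and u'x' are no longer treated as equal.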

-- 
Steven Taschuk                             staschuk at telusplanet.net
"I may be wrong but I'm positive."  -- _Friday_, Robert A. Heinlein




