Is 0 > None??

Alex Martelli aleax at aleax.it
Wed Sep 5 05:47:57 EDT 2001


"Lulu of the Lotus-Eaters" <mertz at gnosis.cx> wrote in message
news:mailman.999624965.18872.python-list at python.org...
    ...
> Still, the solutions to the conversion limits seem somewhat awkward.  I

Yes, I agree transparent Unicode conversion IS somewhat awkward -- I'm
not aware of any language having provided a better solution while
keeping both single-byte and Unicode strings, though (newer languages
such as Java, supporting Unicode-only, have it easier!-).

> can change my own 'site.py' or 'sitecustomize.py', but I cannot count on
> it being changed on a user's machine (and it wouldn't be very good
> manners to go mucking with it just to run my program).  I guess the best
> solution is to add something like:
>
>     def cleancoding(fleep, mycodec='latin-1'):
>         try: return unicode(fleep, mycodec)
>         except TypeError: return fleep
>     lst[:] = map(cleancoding, lst)
>
> To all my old programs.  But adding it everywhere necessary is a bit
> cumbersome, and probably slows programs down a bit.

Sure.  It seems to me that wrapping things in a try/except UnicodeError:
at some level, and only providing such "cleaning" of heterogeneous
lists if and when needed, is going to be quite a bit faster.
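
A minimal sketch of that "clean only when needed" idea follows.  It is
written for a present-day Python 3, which is an assumption on top of this
thread: there the mixed list holds bytes and str objects, and sorting
them raises TypeError rather than the UnicodeError discussed here.  The
names clean_coding and sort_mixed are hypothetical, and 'latin-1' is just
the default carried over from the quoted cleancoding().

```python
def clean_coding(item, codec="latin-1"):
    """Decode bytes to str with the given codec; leave str untouched."""
    if isinstance(item, bytes):
        return item.decode(codec)
    return item


def sort_mixed(items, codec="latin-1"):
    """Sort in place, paying for the conversion only if the plain sort fails."""
    try:
        items.sort()
    except TypeError:
        # Python 3 refuses to order bytes against str (assumption: this
        # stands in for the UnicodeError case from the thread).
        # Normalise every element, then retry the sort once.
        items[:] = [clean_coding(x, codec) for x in items]
        items.sort()
    return items


homogeneous = sort_mixed([b"b", b"a"])           # fast path, stays bytes
mixed = sort_mixed([b"b", "a", b"\x80e"])        # falls back to cleaning
```

A homogeneous list never touches the conversion code, so the common
case costs one extra try block and nothing more.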


> Moreover, it is not clear that forcing everything in lists into a
> 'latin-1' encoding is really semantically correct either.  There is no
> real harm in it; but if I am working with a binary format, 'chr(128)' is

So you're sorting heterogeneous strings that include both "strings
in a binary format" AND other strings that are to be interpreted as
strings of characters instead?  Because Unicode strings are most
definitely of the latter persuasion, NOT supporting arbitrary blobs
of bytes (if nothing else, because they can't be made of an ODD
number of bytes, but there's more to it than that:-).

So how do you DISTINGUISH between the two apparently-identical kinds
of strings, when sorting or otherwise bulk-operating on your highly
heterogeneous lists, quite apart from Unicode issues?  Getting a
better understanding of that may well help in deciding on a strategy
which will make the Unicode problems go away too.  For example,
TODAY, how do you want "\x80e" to compare to "\x80e" when the
former means "Uppercase-C-cedilla e" and the latter means "two
bytes, 128 then 101"?  Presumably, despite the deep semantic rift,
you want them to compare equal, since there's no way Python is
ever going to be able to distinguish, when just plain strings are
all you give to Python to be sorted.  Well then, hadn't this better
be extended to comparing-equal to u"\u0080e" too, then?  In this
case, widening these plain strings to Unicode with 'latin-1' or
other similar encoding (e.g. 'iso-8859-1') may well be semantically
correct -- or as "correct" as this deep heterogeneity will allow.
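
To make the widening concrete, here is a small sketch, again assuming a
present-day Python 3 where the plain string is a bytes object.  The use
of 'cp437' is my assumption: it is only one example of an encoding in
which byte 128 is an uppercase C-cedilla, and nothing in this thread
names a specific one.

```python
import codecs

raw = b"\x80e"  # two bytes: 128, then 101

# latin-1 maps every byte value N to the code point U+00NN, so the
# widening is lossless and round-trips back to the same bytes.
widened = raw.decode("latin-1")
same_as_unicode_literal = (widened == "\u0080e")
round_trips = (widened.encode("latin-1") == raw)

# In Python's codec registry both spellings resolve to one codec.
same_codec = (codecs.lookup("latin-1").name
              == codecs.lookup("iso-8859-1").name)

# Under an encoding where byte 128 is C-cedilla (cp437 is an assumed
# example), the very same bytes decode to a different character.
as_text = raw.decode("cp437")
differs_from_widened = (as_text != widened)
```

The bytes are identical in both readings; only the codec chosen at
decode time decides which character they become.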


Alex





