Suspected Unicode problem when reading text from Excel

Thu Sep 6 13:06:56 EDT 2001

Thu, 06 Sep 2001 16:04:05 GMT, Fredrik Lundh <fredrik at pythonware.com> wrote:

> the default encoding affects *all* operations.

Well, only str, unicode, and some str/unicode methods.

> making operations like L.sort() and str(S) dependent on the host
> platform is a really lousy idea.

They already depend on the system configuration if one changes the
default encoding in sitecustomize.

L.sort() would not depend on the platform if Unicode strings and byte
strings always compared unequal, which is necessary for sort() to
reliably work on sequences containing mixed str/unicode objects at all.

The only setting which makes mixed comparisons defined and transitive,
without changing the current meaning of comparisons for str<->str and
unicode<->unicode, and keeping 'xyz' == u'xyz', is conversion using
Latin1. But we agree that it would be a bad idea to use Latin1 always.

With ASCII, some mixed strings cannot be compared at all. Other
encodings don't preserve comparisons across string flavors, and thus
comparisons are not transitive.
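
To make the transitivity argument concrete, here is a minimal sketch. It
assumes a modern Python 3 interpreter, where bytes and str stand in for
the byte strings and Unicode strings discussed here. The helper
mixed_less is hypothetical: it only models a mixed comparison that
converts the byte string with a chosen encoding, and is not how any
interpreter implements it.

```python
def mixed_less(x, y, encoding):
    """Hypothetical model of a mixed '<': decode bytes only when the other side is text."""
    if isinstance(x, bytes) and isinstance(y, bytes):
        return x < y              # byte string vs byte string: plain byte order
    if isinstance(x, bytes):
        x = x.decode(encoding)
    if isinstance(y, bytes):
        y = y.decode(encoding)
    return x < y                  # code point order


a = b"\xb1"    # U+0105 in ISO-8859-2, U+00B1 in Latin1
b = b"\xe9"    # U+00E9 in both encodings
u = "\u00f0"   # a Unicode string


def chain(encoding):
    # (a < b, b < u, u < a): all three true means the ordering has a cycle
    return (mixed_less(a, b, encoding),
            mixed_less(b, u, encoding),
            mixed_less(u, a, encoding))


latin2_chain = chain("iso-8859-2")   # (True, True, True): a < b < u < a
latin1_chain = chain("latin-1")      # (True, True, False): no cycle

# With ASCII the mixed comparison cannot be carried out at all.
try:
    mixed_less(a, u, "ascii")
    ascii_comparable = True
except UnicodeDecodeError:
    ascii_comparable = False
```

Latin1 avoids the cycle because byte value N always converts to code
point N, so byte order and code point order agree.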

That's why I propose that mixed str/unicode comparisons always yield
unequal. It's not ideal, but the alternatives simply don't work: right
now I can't use L.sort() to reliably sort a mixed list, and that needs
fixing.
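
One way to picture a sort that never converts between flavors is the
sketch below. It assumes Python 3 (bytes and str in place of str and
unicode), and the key function flavor_key is hypothetical. Placing all
byte strings before all Unicode strings is an extra assumption of the
sketch; the proposal itself only requires that mixed pairs never
compare equal.

```python
def flavor_key(s):
    # Hypothetical sort key: byte strings sort before text strings,
    # and values are compared only within the same flavor.
    return (isinstance(s, str), s)


mixed = ["b", b"\xb1", "a", b"abc", "\u0105"]
ordered = sorted(mixed, key=flavor_key)
```

Because the first element of the key already differs for mixed pairs,
no encoding is ever consulted and the result is the same on every
system.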

I expect str(unicode_object) and unicode(str_object) to depend on
the system, because different systems represent the same character
differently. There is no single byte string which represents the
given Unicode character on all systems, so we have a choice: give
correct results which depend on the system, give wrong results,
or fail.
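
The three choices can be shown with a single character. This is only a
sketch, written for Python 3 with explicit encode() calls standing in
for the implicit conversion; the three encodings are examples I picked,
not a claim about what any particular system uses.

```python
ch = "\u0105"  # LATIN SMALL LETTER A WITH OGONEK

# Correct results that depend on the system's encoding.
encodings = ["iso-8859-2", "cp1250", "utf-8"]
encoded = {enc: ch.encode(enc) for enc in encodings}

# A wrong result: the character is silently replaced.
lossy = ch.encode("ascii", "replace")

# Or a failure.
try:
    ch.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```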

> (I think it's more likely that setdefaultencoding will go away in
> a not too distant future.  like -U, it was added to make it easy
> for the python-dev team to experiment; not for daily use by
> application developers...)

Ok, so what would be the default encoding?

> what you really want is to set stdout/stderr/stdin to convert
> unicode strings based on the current locale, but leave all other
> operations alone.

No, not only stdout/stderr/stdin, but all files, sockets, filenames -
unless explicitly set otherwise. A file should ideally maintain its
conversion and transparently use it for Unicode strings sent to it,
and similarly for reading.
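
A rough sketch of a stream that carries its own conversion, using the
stream writers and readers from the standard codecs module. It is
written for Python 3, an in-memory io.BytesIO stands in for a real file
or socket, and ISO-8859-2 is just an example encoding.

```python
import codecs
import io

raw = io.BytesIO()                               # stands in for a file or socket
writer = codecs.getwriter("iso-8859-2")(raw)     # the stream remembers its encoding
writer.write("za\u017c\u00f3\u0142\u0107")       # Unicode text goes in ...
written = raw.getvalue()                         # ... encoded bytes come out

reader = codecs.getreader("iso-8859-2")(io.BytesIO(written))
read_back = reader.read()                        # bytes are decoded on the way back
```

The caller only ever handles Unicode strings; the encoding is chosen
once, when the stream is wrapped.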

The default should depend on the locale - this is what the locale is
for! The very purpose of setting charset in the locale is that a single
switch tells all programs what the preferred encoding is, so the admin
doesn't need to go to each program, look into its documentation and
see how to tell it which encoding it should use by default.
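
As an illustration of taking the default from the locale, the sketch
below asks the standard locale module for the preferred encoding. It
assumes Python 3; the function name default_text_encoding and the
fallback to ASCII are my own choices for the example, not something any
interpreter does by itself.

```python
import codecs
import locale


def default_text_encoding():
    """Hypothetical helper: the locale's preferred encoding, or ASCII as a fallback."""
    name = locale.getpreferredencoding(False)
    try:
        codecs.lookup(name)       # make sure Python knows this codec
    except LookupError:
        return "ascii"
    return name


preferred = default_text_encoding()
sample = "abc".encode(preferred)  # plain letters are representable in any usual locale charset
```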

ASCII is *not* enough for everyone.

-- 
 __("<  Marcin Kowalczyk * qrczak at knm.org.pl http://qrczak.ids.net.pl/
 \__/
  ^^                      SYGNATURA ZASTĘPCZA
QRCZAK