[I18n-sig] Perhaps the locale should matter?

Fri, 05 May 2000 11:54:48 -0400

Here's a different idea that seeks a further compromise between the
7-bit and 8-bit camps.

I just realized that the existence of methods like islower() etc. on
Unicode really forces the encoding issue for Unicode strings -- these
don't contain arbitrary sequences of 16-bit quantities, they contain
Unicode characters, with some of the associated semantics.  (How much
is open to debate, see Fredrik's post about the four levels of Unicode
conformance.)

If we apply this to 8-bit strings, we see that the locale plays an
important role.  With the default ("C") locale, islower() etc. only
take ASCII into account, everything else is not considered a letter or
digit or space.  However in many other locales (for the LC_CTYPE
category), islower() etc. assume a specific character encoding!  (This
is all completely up to the C library's locale interpretation, Python
doesn't add anything except an API.)  I've only tested this for a few
European locales; these all seem to assume Latin-1.
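
To make the point above concrete, here is a minimal sketch of how one
8-bit value classifies differently depending on the assumed encoding.
It is written for a current Python 3, where bytes methods always use
ASCII-only rules (roughly what the "C" locale gives) and where the
Latin-1 interpretation has to be requested by decoding explicitly.  It
models the two interpretations; it does not call the C library's
locale-aware islower().

```python
# One 8-bit value, 0xE9, classified under two interpretations.
raw = b"\xe9"

# bytes methods in Python 3 use ASCII-only rules, much like the "C"
# locale: 0xE9 is not an ASCII letter, so it is not lowercase.
ascii_view = raw.islower()

# Decoded as Latin-1, 0xE9 is U+00E9 (e with acute accent), which is
# a lowercase letter.
latin1_view = raw.decode("latin-1").islower()

print(ascii_view, latin1_view)  # False True
```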

I wonder if we could make the default conversion from 8-bit to Unicode
depend on the locale?  This would be a compromise between my ASCII
proposal and the Latin-1 proposal.  My reasoning is that the locale is
an existing Python feature.  Code that is broken when the locale
differs from the default has been broken for a long time.  We might
not *like* a global setting for this kind of feature, but: "We've
already got one!"  [Imitates thick French accent.]

If the program explicitly set the locale, it is a clear signal that it
is interested in manipulating characters in a particular locale, and
we might as well honor this.

Problem: I have no idea how to go from the locale setting (a
two-character language abbreviation) to a specific character encoding
-- but that might conceivably be a fixed table.
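
A rough sketch of what such a fixed table could look like, and how the
default 8-bit to Unicode conversion might consult it.  Everything here
is hypothetical: the table entries, the names LANGUAGE_TO_ENCODING,
encoding_for_locale() and to_unicode(), and the choice to fall back to
ASCII for the "C" locale and for unknown languages are assumptions made
for illustration, not an existing Python API.

```python
# Hypothetical fixed table: language abbreviation -> 8-bit encoding.
# The entries are illustrative assumptions, not a vetted mapping.
LANGUAGE_TO_ENCODING = {
    "fr": "latin-1",
    "de": "latin-1",
    "ru": "iso8859_5",
    "el": "iso8859_7",
}


def encoding_for_locale(locale_name):
    """Return the assumed encoding for a name like 'fr' or 'fr_FR'."""
    if locale_name in ("C", "POSIX"):
        return "ascii"
    # Keep only the language part: 'fr_FR.ISO8859-1' -> 'fr'.
    language = locale_name.split("_")[0].split(".")[0].lower()
    # Unknown languages fall back to ASCII (an assumption).
    return LANGUAGE_TO_ENCODING.get(language, "ascii")


def to_unicode(data, locale_name):
    """Decode 8-bit data using the encoding implied by the locale."""
    return data.decode(encoding_for_locale(locale_name))


print(encoding_for_locale("fr_FR"))           # latin-1
print(ascii(to_unicode(b"caf\xe9", "fr_FR")))  # 'caf\xe9'
```

With the "C" locale the same call, to_unicode(b"caf\xe9", "C"), raises
UnicodeDecodeError, which matches the ASCII proposal's behavior for
non-ASCII bytes.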

--Guido van Rossum (home page: http://www.python.org/~guido/)