Some information about locale (was Re: [Python-Dev] repr vs. str and locales again)

Fredrik Lundh <effbot@telia.com>
Mon, 22 May 2000 17:37:01 +0200


Guido van Rossum <guido@python.org> wrote:
> > Peter Funk wrote:
> > > AFAIK locale and friends conform to POSIX.1.  Calling this obsolescent...
> > > hmmm... may offend a *LOT* of people.  Try this on comp.os.linux.advocacy ;-)
> > 
> > you're missing the point -- now that we've added unicode support to
> > Python, the old 8-bit locale *ctype* stuff no longer works.  while some
> > platforms implement a wctype interface, it's not widely available, and it's
> > not always unicode.
> 
> Huh?  We were talking strictly 8-bit strings here.  The locale support
> hasn't changed there.

I meant that the locale support, even though it's part of POSIX, isn't
good enough for unicode support...

> > so in order to provide platform-independent unicode support, Python 1.6
> > comes with unicode-aware and fully portable replacements for the ctype
> > functions.
> 
> For those who only need Latin-1 or another 8-bit ASCII superset, the
> Unicode stuff is overkill.

why?

besides, overkill or not:

> > the code is already in there...

> > note that this leaves us with four string flavours in 1.6:
> > 
> > - 8-bit binary arrays.  may contain binary goop, or text in some strange
> >   encoding.  upper, strip, etc should not be used.
> 
> These are not strings.

depends on who you're asking, of course:

>>> b = fetch_binary_goop()
>>> type(b)
<type 'string'>
>>> dir(b)
['capitalize', 'center', 'count', 'endswith', 'expandtabs', ...

> > - 8-bit text strings using the system encoding.  upper, strip, etc works
> >   as long as the locale is properly configured.
> > 
> > - 8-bit unicode text strings.  upper, strip, etc may work, as long as the
> >   system encoding is a subset of unicode -- which means US ASCII or
> >   ISO Latin 1.
> 
> This is a figment of your imagination.  You can use 8-bit text strings
> to contain Latin-1, but you have to set your locale to match.

if that's a supported feature (instead of being deprecated in favour
of unicode), maybe we should base the default unicode/string
conversions on the locale too?

background:

until now, I've been convinced that the goal should be to have two
"string-like" types: binary arrays for binary goop (including encoded
text), and a Unicode-based string type for text.  afaik, that's the
solution used in Tcl and Perl, and it's also "conceptually compatible"
with things like Java, Windows NT, and XML (and everything else from
the web universe).
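
as a rough illustration of that two-type model (a sketch only: it assumes
Python 3, which postdates this thread and keeps `bytes` and `str` as
distinct types; the sample word is made up for the example):

```python
# assumption: Python 3 semantics, not the 1.6 string types discussed here
raw = "sm\u00f6rg\u00e5s".encode("latin-1")  # binary array holding encoded text
text = raw.decode("latin-1")                 # unicode string holding the text

# the binary array only knows about bytes; its length is a byte count
print(type(raw).__name__, len(raw))

# text operations such as upper() belong to the unicode type and need no locale
print(type(text).__name__, text.upper())
```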

given that, it has been clear to me that anything that is not compatible
with this model should be removed as soon as possible (and deprecated
as soon as we understand why it won't fly under the new scheme).

but if backwards compatibility is more important than a minimalistic
design, maybe we need three different "string-like" types:

-- binary arrays (still implemented by the 8-bit string type in 1.6)

-- 8-bit old-style strings (using the "system encoding", as defined
   by the locale.  if the locale is not set, they're assumed to contain
   ASCII)

-- unicode strings (possibly using a "polymorphic" internal representation)

this also solves the default conversion problem: use the locale
environment variables to determine the default encoding, and call
sys.set_string_encoding from site.py (see my earlier post for details).
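
a minimal sketch of the lookup step, under stated assumptions: the helper
name `encoding_from_locale_env` is hypothetical, the LC_ALL / LC_CTYPE / LANG
precedence follows POSIX, and a locale without an explicit codeset simply
falls back to ASCII (a real version would probably consult an alias table).
the call to sys.set_string_encoding itself is left out, since that name only
exists in the 1.6 builds discussed here:

```python
import os

def encoding_from_locale_env(environ=None, default="ascii"):
    """Guess a default encoding from the POSIX locale environment variables."""
    if environ is None:
        environ = os.environ
    # POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = environ.get(name)
        if not value:
            continue
        # the "C"/"POSIX" locale means no locale is configured: assume ASCII
        if value in ("C", "POSIX"):
            return default
        # locale names look like language[_territory][.codeset][@modifier]
        codeset = value.partition(".")[2].partition("@")[0]
        return codeset or default
    return default

print(encoding_from_locale_env())
```

site.py could then hand the returned name to whatever function sets the
default string encoding.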

what have I missed this time?

</F>

PS. shouldn't sys.set_string_encoding be sys.setstringencoding?

>>> dir(sys)
... 'set_string_encoding', 'setcheckinterval', 'setprofile', 'settrace', ...

looks a little strange...