[Python-3000] string module trimming

Guido van Rossum guido at python.org
Thu Apr 19 00:08:29 CEST 2007


On 4/18/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> On 4/18/07, Guido van Rossum <guido at python.org> wrote:
> > On 4/18/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> > > On 4/17/07, Guido van Rossum <guido at python.org> wrote:
> > > > The locale module doesn't deal with Unicode, only with 8-bit characters (not
> > > > multi-byte characters). You'll lose this anyway. Certainly
> > > > string.letters is not going to provide this functionality.
>
> > > But for languages in Latin1, 8-bit characters are sufficient --
> > > anything with more than 8 bits is by definition not a (local) letter.
>
> > Latin-1 is just another encoding (and not a very useful one given that
> > it can't encode all of Unicode). I don't want to define a feature that
> > only works for Latin-1.
>
> Today, string.letters works most easily with ASCII supersets, and is
> effectively limited to 8-bit encodings. Once everything is Unicode, I
> don't think that 8-bit restriction should apply any more.

But we already went over this. There are over 40K letters in Unicode.
It simply makes no sense to have a string.letters approaching that
size.
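
A rough way to check the scale involved, assuming a Python 3
interpreter and the standard unicodedata module. The helper name
count_letters is made up for illustration, and the exact figure depends
on the Unicode database version bundled with the interpreter:

```python
import sys
import unicodedata


def count_letters():
    """Count code points whose general category is a letter (Lu, Ll, Lt, Lm, Lo)."""
    return sum(
        1
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)).startswith("L")
    )


letter_count = count_letters()
print("Unicode database version:", unicodedata.unidata_version)
print("Code points classified as letters:", letter_count)

# A per-character test avoids materializing any such string at all.
print("é".isalpha(), "字".isalpha(), "3".isalpha())
```

Asking each character whether it is a letter scales to the whole
repertoire, whereas a precomputed string of every letter does not.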

> > > I won't swear that localizations currently replace string.letters with
> > > the appropriately ordered (slight) superset, but it is a valid use
> > > case, and string* (or text*) is clearly the right place.
>
> > The right solution for locale-dependent collation for sure isn't
> > having a string containing all the letters in the right order. There
> > are plenty of languages where that approach doesn't even work.
>
> Theoretically, English is one of those non-working languages. (Names
> in bibliographic entries are supposed to be alphabetized according to
> language of origin.)
>
> In practice, ordered-list-of-chars works well enough, often enough.
> It often works better than sorting by code point, which is the only
> obvious alternative.
>
> Unless I missed it (and I may have), Unicode itself sort of ducks the
> question about how to sort strings.  Python really needs to provide
> *an* answer, but I'm not sure it is possible to provide the (single)
> correct answer.

The Unicode standard certainly has a solution, but it is complicated
and I don't believe it is currently implemented in core Python.
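
To make the gap concrete, here is a small sketch comparing plain code
point order with a crude accent- and case-insensitive key. The
crude_key helper is hypothetical and is emphatically not the Unicode
collation algorithm; it ignores locale-specific rules entirely and is
shown only to illustrate why code point order alone falls short:

```python
import unicodedata

words = ["zebra", "apple", "Äpfel", "Zulu"]

# Default str comparison orders by code point: uppercase ASCII first,
# then lowercase ASCII, then anything outside ASCII.
by_code_point = sorted(words)
print(by_code_point)  # ['Zulu', 'apple', 'zebra', 'Äpfel']


def crude_key(s):
    """Hypothetical helper: strip combining marks, then casefold.

    This only approximates dictionary order for a few Western
    languages and is wrong for many others.
    """
    decomposed = unicodedata.normalize("NFKD", s)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


by_crude_key = sorted(words, key=crude_key)
print(by_crude_key)  # ['Äpfel', 'apple', 'zebra', 'Zulu']
```

A key function like this has the same basic weakness as an ordered
list of characters: it encodes one fixed ordering, while the correct
ordering differs from language to language.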

> string.letters is one workaround, and I don't think we should remove
> it until a better solution (or workaround) is available.

I disagree. The correct solution is to implement the Unicode support
for locale-specific sorting.

Remember that the locale module supports only a single, global locale
at a time. This renders it totally useless in many apps that need to
handle more than one locale at once (such as web servers).
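
The global nature of the setting can be seen in a short sketch,
assuming Python 3 and only the portable "C" locale (other locale names
vary by platform and are deliberately avoided here). The helper name
collate_setting_in_other_thread is invented for the example:

```python
import locale
import threading


def collate_setting_in_other_thread():
    """Query LC_COLLATE from a separate thread and return what it sees."""
    seen = []
    worker = threading.Thread(
        # setlocale() with no second argument only queries the setting.
        target=lambda: seen.append(locale.setlocale(locale.LC_COLLATE))
    )
    worker.start()
    worker.join()
    return seen[0]


saved = locale.setlocale(locale.LC_COLLATE)  # remember the current setting
try:
    # Changing the locale in this thread...
    locale.setlocale(locale.LC_COLLATE, "C")
    # ...is visible in every other thread, because the state is per process.
    seen_elsewhere = collate_setting_in_other_thread()
finally:
    locale.setlocale(locale.LC_COLLATE, saved)  # restore the original setting

print(seen_elsewhere)
```

A server handling requests for different locales in different threads
would have each setlocale() call change the setting for all of them.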

--
--Guido van Rossum (home page: http://www.python.org/~guido/)