[Python-3000] string module trimming

Jason Orendorff jason.orendorff at gmail.com
Thu Apr 19 20:51:24 CEST 2007


On 4/18/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> On 4/18/07, Guido van Rossum <guido at python.org> wrote:
> > On 4/18/07, Jim Jewett <jimjjewett at gmail.com> wrote:
>
> > > Today, string.letters works most easily with ASCII supersets, and is
> > > effectively limited to 8-bit encodings.  Once everything is unicode, I
> > > don't think that 8-bit restriction should apply any more.
>
> > But we already went over this. There are over 40K letters in Unicode.
> > It simply makes no sense to have a string.letters approaching that
> > size.
>
> Agreed.  But there aren't 40K (alphabetic) letters in any particular
> locale.  Most individual languages will have less than 100.

But isn't the hip thing these days to use UTF-8 as the character set,
as in "LC_ALL=en_US.UTF-8"?  I picked a random Linux machine
and this happened to be the default LANG.  A quick C test revealed
that iswalpha() thinks there are 45,974 letters in that locale.

Anyway, let's discuss your use cases.


One: UIs that involve displaying all the letters.

> There are also reasons to want only "local" letters.  For example, in
> a French interface, I might want to include the extra French letters,
> but not the Greek.

But POSIX locales don't claim to provide this information,
as far as I can tell.

It seems like for a quick throwaway program, you're better off
ignoring languages.  And for even a slightly serious program,
string.letters is wrong in a dozen ways.

Seriously, a table of alphabets that's saner than string.letters
is pretty trivial to write:

  alphabets = {
      'en': list("ABCDEFGHIJKLMNOPQRSTUVWXYZ"),
      'es': ("A B C Ch D E F G H I J K L Ll M "
          + "N \u00d1 O P Q R S T U V W X Y Z").split(),
      ...
      }

...and you can go from there.


Two: Collation.

Collation can be done right: provide a function text.sort_key()
that converts a str into an opaque thing that has the desired
ordering.  You would use it like so:

  names.sort(key=text.sort_key)
  records.sort(key=lambda rec: text.sort_key(rec.title))

Jython can implement this using java.text.Collator.  IronPython can
use CultureInfo.CompareInfo.GetSortKey().  CPython can call
LCMapString() on Windows and wcsxfrm() on POSIX, falling back on just
returning the string itself.

-j


More information about the Python-3000 mailing list