Is unicode.lower() locale-independent?

John Machin sjmachin at lexicon.net
Sat Jan 12 20:46:02 EST 2008


On Jan 13, 9:49 am, Carl Banks <pavlovevide... at gmail.com> wrote:
> On Sat, 12 Jan 2008 13:51:18 -0800, John Machin wrote:
> > On Jan 12, 11:26 pm, Torsten Bronger <bron... at physik.rwth-aachen.de>
> > wrote:
> >> Hallöchen!
>
> >> Fredrik Lundh writes:
> >> > Robert Kern wrote:
>
> >> >>> However it appears from your bug ticket that you have a much
> >> >>> narrower problem (case-shifting a small known list of English words
> >> >>> like VOID) and can work around it by writing your own
> >> >>> locale-independent casing functions. Do you still need to find out
> >> >>> whether Python unicode casings are locale-dependent?
>
> >> >> I would still like to know. There are other places where .lower() is
> >> >> used in numpy, not to mention the rest of my code.
>
> >> > "lower" uses the informative case mappings provided by the Unicode
> >> > character database; see
>
> >> >    http://www.unicode.org/Public/4.1.0/ucd/UCD.html
>
> >> > afaik, changing the locale has no influence whatsoever on Python's
> >> > Unicode subsystem.
>
> >> Slightly off-topic because it's not part of the Unicode subsystem, but
> >> I was once irritated that the non-breaking space (codepoint xa0 I
> >> think) was included in string.whitespace.  I cannot reproduce it on
> >> my current system anymore, but I was pretty sure it occurred with a
> >> fr_FR.UTF-8 locale.  Is this possible?  And who is to blame, or must my
> >> program cope with such things?
>
> > The NO-BREAK SPACE is treated as whitespace in the Python unicode
> > subsystem. As for str objects, the default "C" locale doesn't know it
> > exists; otherwise AFAIK if the character set for the locale has it, it
> > will be treated as whitespace.
>
> > You were irritated because non-break SPACE was included in
> > string.whiteSPACE? Surely not! It seems eminently logical to me.
>
> To me it seems the point of a non-breaking space is to have something
> that's printed as whitespace but not treated as it.

To me it seems the point of a no-break space is that it's treated as a
space in all respects except that it doesn't "break".

>
> > Perhaps
> > you were irritated because str.split() ignored the "no-break"? If like
> > me you had been faced with removing trailing spaces from text columns in
> > databases, you surely would have been delighted that str.rstrip()
> > removed the trailing-padding-for-nicer-layout no-break spaces that the
> > users had copy/pasted from some clown's website :-)
>
> > What was the *real* cause of your irritation?
>
> If you want to use str.split() to split words, you will foil the user who
> wants to not break at a certain point.

Which was exactly my point -- but this would happen only rarely or not
at all in my universe (names, addresses, product descriptions, etc in
databases).
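
To make the trade-off concrete, here is a minimal sketch, assuming
Python 3 (where str is Unicode, so the no-break space always counts as
whitespace regardless of locale). The helper name split_keep_nbsp is
hypothetical, invented for illustration only; it splits on ASCII
whitespace and leaves U+00A0 inside the words.

```python
import re

NBSP = "\u00a0"  # NO-BREAK SPACE

def split_keep_nbsp(text):
    """Split on ASCII whitespace only, leaving no-break spaces inside words."""
    # Hypothetical helper: the character class lists only ASCII whitespace,
    # so U+00A0 is never treated as a separator.
    return [word for word in re.split(r"[ \t\n\r\f\v]+", text) if word]

sample = "10" + NBSP + "km north"

# Plain split() treats the no-break space as a separator.
print(sample.split())            # ['10', 'km', 'north']

# The ASCII-only variant keeps "10<NBSP>km" together.
print(split_keep_nbsp(sample))   # ['10\xa0km', 'north']
```

Which behaviour is wanted depends on the data: for word-wrapping text
the second form respects the author's intent, while for database
columns the first is usually what one wants.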

>
> Your use of rstrip() is a lot more specialized, if you ask me.

Not very specialised at all in my universe -- a standard
transformation that one normally applies to database text is to remove
all leading and trailing whitespace, and compress runs of 1 or more
whitespace characters to a single normal space. Your comment seems to
imply that trailing non-break spaces are significant and should be
preserved ...
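
The transformation described above can be sketched in one line. This
is an illustration only, assuming Python 3, where str.split() with no
argument splits on runs of Unicode whitespace (no-break space
included); the function name normalize_whitespace is made up for this
example.

```python
def normalize_whitespace(text):
    """Strip leading/trailing whitespace and collapse internal runs to one space."""
    # split() with no argument splits on runs of whitespace and discards
    # the empty strings at either end, so join() rebuilds a clean string.
    return " ".join(text.split())

dirty = "  Acme\u00a0 \t Widgets\u00a0\u00a0"
print(repr(normalize_whitespace(dirty)))   # 'Acme Widgets'
```

Note that this also turns any internal no-break space into a normal
space, which is the intended effect for database text but would be
wrong where the "no-break" property matters.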


