[issue3511] Incorrect charset range handling with ignore case flag?

Jeffrey C. Jacobs report at bugs.python.org
Wed Sep 24 15:15:04 CEST 2008


Jeffrey C. Jacobs <timehorse at users.sourceforge.net> added the comment:

I think this is even more complicated when you consider that
localization may be an issue.  Consider "Á": is this grammatically before
"A" or after "a"?  From a character set point of view, it is typically
after "a", but when the locale is taken into account, all that changes
is the relative ordering, so that Á sorts next to A and before B.
But when this is done, does that mean that [9-Á] is going to
cover ALL uppercase and ALL lowercase and ALL characters with ord from
91 to 96 and 123 to 127 and all kinds of other UNICODE symbols?  And how
will this affect case-insensitivity?
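
To make the question concrete, here is a minimal Python 3 sketch of what
the range means when it is read purely as a code point interval, with no
locale and no flags.  The names A_ACUTE, by_code_point and covered are
only illustrative, and nothing here shows how the IGNORECASE flag changes
the result.

```python
import re

A_ACUTE = "\u00c1"  # the character Á, code point 193

# Sorting by code point puts Á after every ASCII letter, not next to A.
by_code_point = sorted(["a", "B", A_ACUTE, "A"])

# Without any flags, the class is just the code point interval 57..193.
pattern = re.compile("[9-" + A_ACUTE + "]")
covered = [chr(cp) for cp in range(256) if pattern.fullmatch(chr(cp))]

print(by_code_point)
print(ord("9"), ord(A_ACUTE), len(covered))
# The printable ASCII part of the range: letters plus the symbols between them.
print("".join(ch for ch in covered if ch.isascii() and ch.isprintable()))
```

Read this way, the class takes in every ASCII letter of both cases along
with the punctuation that sits between the digit and letter blocks.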

In a sense, I think it may only be safe to say that character class
ranges are ONLY appropriate over Alphabetic character ranges or numeric
character ranges, since the order of the ASCII symbols between 0 and 47,
58 and 64, 91 and 96, and 123 and 127, though well-defined, is
nonetheless implementation dependent.  When we bring UNICODE into this, things
get even more befuddled with some Latin characters in Latin-1, some in
Latin-2, Cyrillic, Hebrew, Arabic, Chinese, Japanese and Korean
character sets just to name a few of the most common!  And how does a
total ordering of characters apply to them?
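
For reference, the ASCII gaps mentioned above can be listed with a short
Python 3 sketch.  It assumes the usual layout (digits at 48 to 57,
uppercase at 65 to 90, lowercase at 97 to 122); gap_symbols is a
hypothetical helper, not part of any library.

```python
def gap_symbols(low, high):
    """Return the printable ASCII characters with low <= ord <= high."""
    return "".join(
        chr(cp) for cp in range(low, high + 1) if chr(cp).isprintable()
    )

# Symbols between the digits and the uppercase letters.
gap_after_digits = gap_symbols(58, 64)
# Symbols between the uppercase and lowercase letters.
gap_after_upper = gap_symbols(91, 96)
# Symbols after the lowercase letters (127 is DEL, which is not printable).
gap_after_lower = gap_symbols(123, 127)

print(gap_after_digits)
print(gap_after_upper)
print(gap_after_lower)
```

Any range that starts in one alphanumeric block and ends in another
silently includes one or more of these groups.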

In the end, I think it's just dangerous to define character group ranges
that span the gap BETWEEN numbers and alphabetics.  Instead, I think a
better solution is simply to implement Emacs / Perl style named
character classes as in issue 2636 sub-item 8.
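
As a rough stand-in for named classes, and not the syntax proposed in
issue 2636, the existing \w, \d and \W escapes can be combined in Python 3
so that no explicit range endpoints are needed.  The names ALPHA, ALNUM,
is_alpha and is_alnum below are hypothetical.

```python
import re

# Word characters minus digits and underscore: roughly "alphabetic".
ALPHA = re.compile(r"[^\W\d_]")
# Word characters minus underscore: roughly "alphanumeric".
ALNUM = re.compile(r"[^\W_]")

def is_alpha(ch):
    """True if the single character ch is a letter (Unicode-aware for str)."""
    return ALPHA.fullmatch(ch) is not None

def is_alnum(ch):
    """True if the single character ch is a letter or a digit."""
    return ALNUM.fullmatch(ch) is not None

for ch in ["A", "a", "\u00c1", "9", "[", "_"]:
    print(repr(ch), is_alpha(ch), is_alnum(ch))
```

Because these classes are defined by character properties rather than by
endpoints, the symbols that sit between the blocks never enter the match.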

I do agree this is a problem, but as I see it, the solution may not be
that simple, especially in a UNICODE world.

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue3511>
_______________________________________

