Correct handling of case in Unicode and regexps

Sat Feb 23 13:12:41 EST 2013

On 2013-02-23 17:51, Devin Jeanpierre wrote:
> On Sat, Feb 23, 2013 at 12:41 PM, MRAB <python at mrabarnett.plus.com>
> wrote:
>> Getting full case folding to work can be tricky. There's always
>> going to be a limit to what's worth doing.
>>
>> There are also areas where it's not clear what the result should
>> be. You've already mentioned matching 's' against 'ß' (fails) and
>> matching 'ss' against 'ß' (succeeds), but how about matching
>> '(s)(s)' against 'ß' (fails)?
>>
>> For the record, Perl also says that 'ss' matches 'ß', but 's+' does
>> not.
>
> I would find it helpful to know the exact rules. The regex module
> docs say that it works, but don't say what it means to "work".
>
The basic rule is that a series of characters in the regex must match a
series of whole characters in the text, with no partial matches on
either side.

For example, 'ss' can match 'ß', but 's' can't match 'ß' because that
would be matching part of 'ß'.
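
As a rough illustration (not taken from the regex module's source),
Python's own str.casefold() shows the expansion that lets 'ss' line up
with 'ß'. The standard-library re module is included only for contrast:
it does one-to-one case mapping, so it cannot make this match at all.

```python
import re

# Full case folding expands the single character 'ß' to the
# two-character sequence 'ss'.
folded = "ß".casefold()
print(folded)

# The standard-library re module applies only simple (one-to-one) case
# mapping, so under IGNORECASE the two-character pattern 'ss' cannot
# match the one-character text 'ß'.
stdlib_match = re.fullmatch("ss", "ß", re.IGNORECASE)
print(stdlib_match)
```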

In a regex like 's+', you're asking it to match one or more repetitions
of 's', but that would mean that 's' would have to match part of 'ß' in
the first iteration and the remainder of 'ß' in the second iteration.
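
The rule can be sketched with a toy matcher. This is only a model of
the behaviour described above, under the assumption that a literal may
consume nothing smaller than one whole character of text; the names
match_literal and match_one_or_more are made up for the example and are
not part of the regex module.

```python
def match_literal(literal, text, pos=0):
    """Match a literal at text[pos:] using full case folding.

    Returns the end position in the text, or None on failure. Text
    characters are consumed whole, never in part.
    """
    target = literal.casefold()
    folded = ""
    end = pos
    # Fold one whole text character at a time until enough has been
    # gathered to compare against the folded literal.
    while len(folded) < len(target) and end < len(text):
        folded += text[end].casefold()
        end += 1
    # Overshooting (e.g. 's' against 'ß', which folds to 'ss') fails.
    return end if folded == target else None


def match_one_or_more(literal, text):
    """Toy full match of 'literal+': every iteration must itself
    consume whole text characters."""
    pos = 0
    count = 0
    while pos < len(text):
        end = match_literal(literal, text, pos)
        if end is None or end == pos:
            return False
        pos = end
        count += 1
    return count >= 1


print(match_literal("ss", "ß"))      # 'ss' lines up with all of 'ß'
print(match_literal("s", "ß"))       # 's' would cover only part of 'ß'
print(match_one_or_more("s", "ß"))   # no iteration can take half of 'ß'
print(match_one_or_more("s", "sS"))  # each iteration takes one character
```

In this model 's+' fails against 'ß' for the same reason '(s)(s)' does:
each piece of the pattern would have to stop in the middle of a
character.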

Although it's theoretically possible to do that, the matching code is
already difficult enough, and the cost outweighs the potential benefit.

If you'd like to have a go at implementing it, the code _is_ open
source. :-)