Correct handling of case in unicode and regexps

Sat Feb 23 15:23:29 EST 2013

On 2013-02-23 18:57, Devin Jeanpierre wrote:
> On Sat, Feb 23, 2013 at 1:12 PM, MRAB <python at mrabarnett.plus.com>
> wrote:
>> The basic rule is that a series of characters in the regex must
>> match a series of characters in the text, with no partial matches
>> in either.
>>
>> For example, 'ss' can match 'ß', but 's' can't match 'ß' because
>> that would be matching part of 'ß'.
>>
>> In a regex like 's+', you're asking it to match one or more
>> repetitions of 's', but that would mean that 's' would have to
>> match part of 'ß' in the first iteration and the remainder of 'ß'
>> in the second iteration.
>
> That makes sense. I'll have to think about this and run some tests
> through regex, as well.
>
>> Although it's theoretically possible to do that, the code is
>> already difficult enough. The cost outweighs the potential
>> benefit.
>>
>> If you'd like to have a go at implementing it, the code _is_ open
>> source. :-)
>
> Actually, the reason it's relevant to me is that I'm reimplementing
> the re module using a more automata theoretic approach (it's my
> second attack at the problem). Also, I've read the _sre source code
> and it's unpleasant. Is regex much better?
>
I like to think so! ;-)

Part of the problem may be that it also supports fuzzy (approximate)
matching.

> At least the way I'm planning on going about it, supporting this is
> easier, as long as one can figure out what it means to match halfway
> inside a ß. Since case folding is a homomorphism*, I can case fold
> the regex** and case fold the input and then I'm done. Case folding
> of the input can be done character by character, and to emulate the
> regex module behavior I'd need to check at certain places whether or
> not I'm in the middle of a casefolding expansion, and fail if so. On
> the other hand, if I don't emulate the regex module's behavior in at
> least some cases, I'd need to figure out what the value of a match
> of 's' against 'ß' would be.
>
It seems like a reasonable restriction to me that start and end
positions should be integers, i.e. forbid partial matches within a
character.

This means that matching '(ss)' against 'ß' would be OK, as would '(s+)'
against 'ß', but '(s)' or '(s)(s)' against 'ß' would not, otherwise you
could get:

>>> match('(s)', 'ß').span(0)
(0, 0.5)
>>> match('(s)(s)', 'ß').span(0, 1, 2)
((0, 1), (0, 0.5), (0.5, 1)).

> [*]  i.e. it can be done character by character (see Unicode 3.13
> Default Case Algorithms) [**] Not as trivial as it sounds, but still
>  easy. [ßa-z] goes to e.g. [a-z]|ss (not [ssa-z]).
>