[Python-Dev] Re: re.split on empty patterns

Sun Aug 8 00:36:17 CEST 2004

"A.M. Kuchling" <amk at amk.ca> writes:
> >>> re.split('x*', 'abxxxcdefxxx')
> ['ab', 'cdef', '']
> >>> re.split('x*', 'abxxxcdefxxx', emptyok=True)
> ['', 'a', 'b', '', 'c', 'd', 'e', 'f', '', '']
> 
> (I think the result of the second match points up a bug in the patch;
> the empty strings in the middle seem wrong to me.  Assume that gets
> fixed.)

I believe it's correct in the sense that it matches the design I had in mind.
Of course, this could be the wrong design choice.

I think (some of) the alternatives can be stated this way:

1.  Empty matches are not considered at all for splitting.  (present behavior)

2.  Empty matches are only considered when they are not adjacent to non-empty
    matches.  (what you have in mind, I think)

3.  Empty matches are always considered.  (the patch behavior)

(It's understood that matches that fall within other matches are not
considered.  For empty matches, an adjacent match doesn't count as "within".)
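
To make the three alternatives concrete, here is a minimal sketch built on re.finditer. The helper name split_with_policy and its policy argument are made up for illustration and are not part of the patch. It assumes a recent Python 3, where finditer reports an empty match that sits directly after a non-empty one.

```python
import re

def split_with_policy(pattern, string, policy):
    """Split `string` on `pattern` under alternative 1, 2 or 3 (hypothetical helper)."""
    matches = list(re.finditer(pattern, string))

    # Positions where some non-empty match starts or ends; an empty match
    # at one of these positions is "adjacent" to a non-empty match.
    nonempty_edges = set()
    for m in matches:
        if m.start() != m.end():
            nonempty_edges.update((m.start(), m.end()))

    pieces = []
    start = 0  # start of the piece currently being collected
    for m in matches:
        if m.start() == m.end():
            if policy == 1:
                continue  # alternative 1: never split on an empty match
            if policy == 2 and m.start() in nonempty_edges:
                continue  # alternative 2: skip empty matches next to non-empty ones
        pieces.append(string[start:m.start()])
        start = m.end()
    pieces.append(string[start:])
    return pieces

print(split_with_policy('x*', 'abxxxcdefxxx', 1))  # ['ab', 'cdef', '']
print(split_with_policy('x*', 'abxxxcdefxxx', 2))
print(split_with_policy('x*', 'abxxxcdefxxx', 3))
```

Under these assumptions, policy 1 reproduces the present behavior quoted above and policy 3 reproduces the emptyok=True result, including the empty strings in the middle.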

I would say that alternatives 2 and 3 are better than 1 because they retain
more information, and I would argue that alternative 3 is better than
alternative 2 for the same reason.  I don't have a killer example, but here is
a somewhat abstract one that shows the difference between 2 and 3:

    # alternative 2:
    re.structmatch(r'xxx|(?=abc)', 'zzxxxabczz') --> ['zz', 'abczz']
    re.structmatch(r'xxx|(?=abc)', 'zzxxxbbczz') --> ['zz', 'bbczz']

    # alternative 3:
    re.structmatch(r'xxx|(?=abc)', 'zzxxxabczz') --> ['zz', '', 'abczz']
    re.structmatch(r'xxx|(?=abc)', 'zzxxxbbczz') --> ['zz', 'bbczz']

In English, the third alternative allows you to notice that there were two
adjacent matches, even if the second match was empty.  With the second
alternative, this is missed.
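
The "two adjacent matches" can be seen directly by listing the match spans. This is only an illustration: match_spans is a hypothetical helper, and it assumes a recent Python 3 re module.

```python
import re

PATTERN = r'xxx|(?=abc)'

def match_spans(string):
    """Return the (start, end) span of every match of PATTERN in `string`."""
    return [m.span() for m in re.finditer(PATTERN, string)]

# 'xxx' matches at 2..5, then the lookahead matches empty at 5.
print(match_spans('zzxxxabczz'))  # [(2, 5), (5, 5)]

# Only 'xxx' matches; nothing follows it that the lookahead accepts.
print(match_spans('zzxxxbbczz'))  # [(2, 5)]
```

The empty span (5, 5) is the match that alternative 3 turns into the extra '' in the result and that alternative 2 discards.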

Of course, because of the match algorithm, we cannot notice an empty match
immediately followed by a non-empty match, so there's kind of an asymmetry
here.  With alternative 2 that asymmetry wouldn't be present, since we'd fail
to notice empty matches on either side.

Alternative 2 does have the advantage of matching the expectations of a naive
user a little better.  Alternative 3 is more powerful, but perhaps a little
less obvious.

(If you're reading this, what do you think?)

> Anyway, we therefore can't just make this the default in 2.4.  We
> could trigger a warning when emptyok is not supplied and a split
> pattern results in a zero-length match; users could supply
> emptyok=False to avoid the warning.  Patterns that never have a
> zero-length match would never get the warning.  2.5 could then set
> emptyok to True.  

I like this compromise.  A variant would be to warn if the pattern *could*
match empty, rather than warning when it *does* match empty.  I'm not sure
whether it would be easy to determine this, though.
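
A rough sketch of the two warning variants, under loud assumptions: warn_if_empty_match and matches_empty_string are hypothetical names, emptyok is the keyword proposed in the patch (not an argument of the existing re.split), and the choice of FutureWarning is mine. The second helper is only a conservative approximation of "could match empty": it catches patterns like 'x*' but misses ones like '(?=abc)', which match empty only in context.

```python
import re
import warnings

def warn_if_empty_match(pattern, string, emptyok=None):
    """Variant A: warn when the pattern *does* match empty in this string.

    Returns True if a warning was issued. Passing emptyok explicitly
    (True or False) silences the warning, as in the proposed compromise.
    """
    if emptyok is not None:
        return False
    for m in re.finditer(pattern, string):
        if m.start() == m.end():
            warnings.warn(
                "split pattern produced a zero-length match; "
                "pass emptyok explicitly",
                FutureWarning,
                stacklevel=2,
            )
            return True  # stop at the first empty match
    return False

def matches_empty_string(pattern):
    """Variant B (approximate): True if the pattern matches the empty string."""
    return re.fullmatch(pattern, '') is not None

warn_if_empty_match('x*', 'abxxxcdefxxx', emptyok=False)  # silent
print(matches_empty_string('x*'))       # True
print(matches_empty_string('x+'))       # False
print(matches_empty_string('(?=abc)'))  # False, although it can match empty
```

Variant B runs once per pattern rather than once per match, which is why it would sidestep the per-match cost, but a complete answer would need an analysis of the parsed pattern.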

> Note: raising the warning might cause a serious performance hit for
> patterns that get zero-length matches a lot, which would make 2.4
> slower in certain cases.

The above variant would ameliorate this, since the check would be made once
per pattern rather than once per match, though in theory no one should be
using split patterns that can match empty anyway.

Cheers,
Mike