[Python-Dev] Re: re.split on empty patterns

Mike Coleman mkc at mathdogs.com
Sun Aug 8 00:44:06 CEST 2004


Tim Peters <tim.peters at gmail.com> writes:
> Yes it is!  Or, more accurately, it can be, when it's intended to
> match an empty string.  It's a bit fuzzy because regexps are so
> error-prone, and writing a regexp that matches an empty string by
> accident is easy.

I won't dispute that regexps are error-prone (I proved it myself recently),
but I would point out that any call to the current re.split with a pattern
that can match the empty string is probably kind of fuzzy anyway.  Consider,
for example, that re.search doesn't have an option that says "find matches
but skip any empty matches".
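
A minimal sketch of what emulating such an option by hand might look like.
The helper name search_nonempty is hypothetical and not part of the re
module.  It only retries from later positions, so it is an approximation: a
non-empty alternative that is shadowed by an empty match at the same position
is still missed, which is the kind of fuzziness described above.

```python
import re

def search_nonempty(pattern, string):
    """Hypothetical helper: first non-empty match, scanning left to right.

    Approximation only: if the engine's first choice at a position is an
    empty match, other alternatives at that position are never tried.
    """
    regex = re.compile(pattern)
    pos = 0
    while pos <= len(string):
        m = regex.search(string, pos)
        if m is None:
            return None          # no match of any kind from here on
        if m.end() > m.start():
            return m             # non-empty match: done
        pos = m.end() + 1        # empty match: retry one character later
    return None

# re.search itself happily returns the empty match at position 0.
plain = re.search(r'x*', 'aaxx')
skipped = search_nonempty(r'x*', 'aaxx')
print(plain.span(), skipped.span())
```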

Along the same lines, I think re.findall doesn't really do what its doc
claims.  The doc says

    Empty matches are included in the result unless they touch the beginning
    of another match.

but, for example

    >>> re.findall(r'(?=.)|b', 'abcabc')
    ['', '', '', '', '', '']
    [24592 refs]
    >>> re.findall(r'b|(?=.)', 'abcabc')
    ['', 'b', '', '', 'b', '']
    [24601 refs]

with a recent Python.  It's not clear what "another match" would be here.  The
only way its beginning could be touched by an empty match would be if they
both started in the same position.  So presumably that other match could only
be discovered by asking the RE engine to "find me the first match in this
position that's not empty".  It looks like the *actual* method being used,
though, is more like the method used in my patch:

    for each position that's not interior to some previous match:
        find the first match at this position (if any)
        and add it to the list

We could find the "longest" match instead of the "first" match, but that would
be a lot more expensive in general, and not necessarily easier to understand.
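
The loop above can be sketched in Python as follows.  This is an illustration
under stated assumptions, not the actual re.findall implementation: the name
findall_first_match is hypothetical, it anchors each attempt with the compiled
pattern's match(string, pos) method, and it always returns the whole matched
text (it ignores the group handling that re.findall does).

```python
import re

def findall_first_match(pattern, string):
    """Hypothetical sketch of the 'first match at each position' method."""
    regex = re.compile(pattern)
    results = []
    pos = 0
    while pos <= len(string):
        # Anchored attempt: the engine's first match starting exactly at pos.
        m = regex.match(string, pos)
        if m is None:
            pos += 1
            continue
        results.append(m.group())
        # Skip positions interior to this match.  The end of the match is
        # not interior, so it is tried next; an empty match advances by one.
        pos = m.end() if m.end() > pos else pos + 1
    return results

print(findall_first_match(r'(?=.)|b', 'abcabc'))  # ['', '', '', '', '', '']
print(findall_first_match(r'b|(?=.)', 'abcabc'))  # ['', 'b', '', '', 'b', '']
```

Under these assumptions the sketch reproduces both transcripts shown earlier:
the order of the alternatives decides whether 'b' or the empty lookahead wins
at a given position, because only the first match found there is kept.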


> In the meantime, "emptyok" is an odd
> name since it's always "ok" to have an empty match.  "split0=True"
> reads better to me, since the effect is to split on a 0-length match. 
> split_on_empty_match would be too wordy.

I was thinking "ok" in the sense of "okay to split on them" (empty matches).
I don't have any strong preference on the name, though--I'd just like to get
the feature in relatively soon (with semantics that everyone's happy with).

('split0' does look slightly '3133t', though.  :-)

Cheers,
Mike


