[Python-Dev] re.split on empty patterns

A.M. Kuchling amk at amk.ca
Sat Aug 7 16:51:42 CEST 2004


The re.split() function ignores zero-length pattern matches.  Patch
#988761 adds an emptyok flag to split() that causes zero-length matches
to trigger a split.  For example:

>>> re.split(r'\b', 'this is a sentence')  # does nothing; \b is always length 0
['this is a sentence']
>>> re.split(r'\b', 'this is a sentence', emptyok=True)
['', 'this', ' ', 'is', ' ', 'a', ' ', 'sentence', '']
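
For anyone who wants to experiment without applying the patch, here is
a rough emulation built on re.finditer().  The name split_emptyok is
made up for this sketch; it is not the patch's code, and it ignores
capturing groups and maxsplit, which the real split() handles.

```python
import re

def split_emptyok(pattern, string):
    """Split string at every match of pattern, zero-length ones included.

    Hypothetical helper: only approximates the behaviour proposed for
    emptyok=True (no capturing-group or maxsplit support).
    """
    pieces = []
    last = 0  # index just past the previous match
    for m in re.finditer(pattern, string):
        # Everything between the previous match and this one is a piece,
        # even when this match itself is empty.
        pieces.append(string[last:m.start()])
        last = m.end()
    # Whatever follows the final match is the last piece.
    pieces.append(string[last:])
    return pieces

print(split_emptyok(r'\b', 'this is a sentence'))
```

The same helper can be tried with the other zero-length assertions,
such as r'(?m)$'.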

Without the patch, the various zero-length assertions are
pretty useless as split patterns; with it, they can serve a purpose
with split():

>>> re.split(r'(?m)$', 'line1\nline2\n', emptyok=True)
['line1', '\nline2', '\n', '']
>>> # Split file into sections
>>> re.split("(?m)(?=^[[])", """[section1]
foo=bar

[section2]
coyote=wiley
""", emptyok=True)
['', '[section1]\nfoo=bar\n\n', '[section2]\ncoyote=wiley\n']

Zero-length matches often result in a '' at the beginning or end, or
between characters, but I think users can handle that.  IMHO this
feature is clearly useful, and I would be happy to commit the patch
as-is.
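
As a small illustration of "users can handle that", the stray empty
strings can be dropped with a list comprehension.  This is only a
sketch: the input list is copied from the section-splitting result
above, and the variable names are made up.

```python
# Result of the section-splitting example, copied by hand.
pieces = ['', '[section1]\nfoo=bar\n\n', '[section2]\ncoyote=wiley\n']

# Keep only the non-empty pieces; '' is falsy, so "if p" filters it out.
sections = [p for p in pieces if p]

print(sections)
```

Filtering afterwards keeps split() itself simple, at the cost of one
extra pass over the result.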

Question: do we want to make this option the new default?  Existing
patterns that can produce zero-length matches would change their
meanings:

>>> re.split('x*', 'abxxxcdefxxx')
['ab', 'cdef', '']
>>> re.split('x*', 'abxxxcdefxxx', emptyok=True)
['', 'a', 'b', '', 'c', 'd', 'e', 'f', '', '']

(I think the result of the second call points to a bug in the patch;
the empty strings in the middle seem wrong to me.  Assume that gets
fixed.)

Because of that change in meaning, we can't just make this the
default in 2.4.  We could trigger a warning when emptyok is not
supplied and a split pattern results in a zero-length match; users
could supply emptyok=False to avoid the warning.  Patterns that never
have a zero-length match would never get the warning.  2.5 could then
make emptyok default to True.
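
To make the transition scheme concrete, here is a sketch of the
warning logic written as a pure-Python wrapper.  Everything in it is
an assumption for illustration: the function name, the use of
FutureWarning, and the wording of the message are not taken from the
patch, and capturing groups and maxsplit are ignored.

```python
import re
import warnings

def split_with_warning(pattern, string, emptyok=None):
    """Hypothetical transition wrapper; emptyok=None means "not supplied"."""
    pieces = []
    last = 0
    warned = False
    for m in re.finditer(pattern, string):
        if m.start() == m.end():  # zero-length match
            if emptyok is None and not warned:
                # Warn once per call, and only if the caller expressed
                # no preference either way.
                warnings.warn(
                    "split pattern produced a zero-length match; "
                    "pass emptyok explicitly",
                    FutureWarning, stacklevel=2)
                warned = True
            if not emptyok:
                continue  # current behaviour: ignore empty matches
        pieces.append(string[last:m.start()])
        last = m.end()
    pieces.append(string[last:])
    return pieces

# An explicit emptyok=False keeps today's result and stays silent.
print(split_with_warning('x*', 'abxxxcdefxxx', emptyok=False))
```

A pattern such as ',' never matches the empty string, so it never
reaches the warning branch regardless of how emptyok is passed.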

Note: raising the warning might cause a serious performance hit for
patterns that get zero-length matches a lot, which would make 2.4
slower in certain cases.

Thoughts?  Does this need a PEP?

--amk



