[Python-Dev] Regular expressions: splitting on zero-width patterns

Tue Nov 28 15:04:49 EST 2017

The two largest problems in the re module [1] are splitting on zero-width 
patterns and complete and correct support of the Unicode standard. Both 
are solved in the regex module [2]. regex has many other features, but 
they are less important.

I want to describe the problem of splitting on zero-width patterns. It 
was already discussed on Python-Dev 13 years ago [3] and maybe later. 
See also issues [4], [5], [6], [7], [8].

In short, it doesn't work. Splitting on the pattern r'\b' doesn't split 
the text at word boundaries, and splitting on the pattern 
r'\s+|(?<=-)' splits the text on whitespace, but does not split 
hyphenated words as expected.

In Python 3.4 and earlier:

 >>> re.split(r'\b', 'Self-Defence Class')
['Self-Defence Class']
 >>> re.split(r'\s+|(?<=-)', 'Self-Defence Class')
['Self-Defence', 'Class']
 >>> re.split(r'\s*', 'Self-Defence Class')
['Self-Defence', 'Class']

Note that splitting on r'\s*' (0 or more whitespace characters) actually 
splits on r'\s+' (1 or more whitespace characters). Splitting on patterns 
that can only match the empty string (like r'\b' or r'(?<=-)') never 
worked.

Since Python 3.5, splitting on a pattern that can only match the empty 
string raises a ValueError (this never worked), and splitting on a 
pattern that can match the empty string as well as non-empty strings 
emits a FutureWarning. This gave developers time to replace patterns 
such as r'\s*' with r'\s+', as they should have been written.
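
To find patterns that fall into either category, one can check whether a 
pattern yields an empty match on representative input. The following is 
only a rough sketch: matches_empty_on() is a hypothetical helper, not 
part of the re module, and it inspects a single sample string, so a 
False result is not a proof.

```python
import re

def matches_empty_on(pattern, sample):
    """Return True if `pattern` produces at least one empty match on `sample`.

    Hypothetical helper: it only inspects one sample string, so a False
    result does not prove that the pattern can never match the empty string.
    """
    return any(m.start() == m.end() for m in re.finditer(pattern, sample))

sample = 'Self-Defence Class'
for pattern in (r'\s*', r'\s+', r'\b', r'(?<=-)', r'\s+|(?<=-)'):
    # Every pattern except r'\s+' has an empty match somewhere in the sample.
    print(pattern, matches_empty_on(pattern, sample))
```

Only r'\s+' reports False here, which is why it is the safe replacement 
for r'\s*' when the intent is to split on whitespace.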

Now I have created a final patch [9] that makes re.split() split on 
zero-width patterns.

 >>> re.split(r'\b', 'Self-Defence Class')
['', 'Self', '-', 'Defence', ' ', 'Class', '']
 >>> re.split(r'\s+|(?<=-)', 'Self-Defence Class')
['Self-', 'Defence', 'Class']
 >>> re.split(r'\s*', 'Self-Defence Class')
['', 'S', 'e', 'l', 'f', '-', 'D', 'e', 'f', 'e', 'n', 'c', 'e', 'C', 
'l', 'a', 's', 's', '']

In the last case the result differs greatly from the previous result, 
and this is likely not what the author wanted. But users have had two 
Python releases to fix their code. FutureWarning is not silenced by 
default.

Because these patterns produced errors or warnings in the last two 
releases, we don't need an additional parameter for compatibility.

But the problem was not just with re.split(). Other functions also 
did not work well with patterns that can match the empty string.

 >>> re.findall(r'^|\w+', 'Self-Defence Class')
['', 'elf', 'Defence', 'Class']
 >>> list(re.finditer(r'^|\w+', 'Self-Defence Class'))
[<re.Match object; span=(0, 0), match=''>, <re.Match object; span=(1, 
4), match='elf'>, <re.Match object; span=(5, 12), match='Defence'>, 
<re.Match object; span=(13, 18), match='Class'>]
 >>> re.sub(r'(^|\w+)', r'<\1>', 'Self-Defence Class')
'<>S<elf>-<Defence> <Class>'

After matching the empty string, the following character is skipped 
and is not included in the next match. My patch fixes these 
functions too.

 >>> re.findall(r'^|\w+', 'Self-Defence Class')
['', 'Self', 'Defence', 'Class']
 >>> list(re.finditer(r'^|\w+', 'Self-Defence Class'))
[<re.Match object; span=(0, 0), match=''>, <re.Match object; span=(0, 
4), match='Self'>, <re.Match object; span=(5, 12), match='Defence'>, 
<re.Match object; span=(13, 18), match='Class'>]
 >>> re.sub(r'(^|\w+)', r'<\1>', 'Self-Defence Class')
'<><Self>-<Defence> <Class>'

I think this change doesn't need preliminary warnings, because it changes 
the behavior of more rarely used patterns. No re tests were broken. 
I needed to add new tests to detect the behavior change.
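
Code that has to run on interpreters with and without the patch could 
probe for the behavior at run time. This is a minimal sketch based on 
the r'^|\w+' example above. nonempty_match_follows_empty() is a 
hypothetical name, and the probe assumes that the findall() result shown 
above is what distinguishes the two behaviors.

```python
import re

def nonempty_match_follows_empty():
    # Hypothetical probe: with the patch, 'Self' can be matched at
    # position 0 right after the empty match of '^' at position 0.
    # Without the patch the 'S' is skipped and 'elf' is matched instead.
    return re.findall(r'^|\w+', 'Self') == ['', 'Self']

print(nonempty_match_follows_empty())
```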

But there is one spoonful of tar in the barrel of honey. I didn't expect 
this, but the change broke a pattern used with re.sub() in the 
doctest module: r'(?m)^\s*?$'. This was fixed by replacing it with 
r'(?m)^[^\S\n]+?$'. I hope that such cases are pretty rare and think 
this is an avoidable breakage.
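
The replacement pattern can neither match the empty string nor consume 
a newline, so its result does not depend on how empty matches are 
handled. A small sketch of what it does, using a made-up sample text 
(this is not the actual doctest code):

```python
import re

# Pattern quoted above as the replacement used in doctest.
BLANK_RUN = re.compile(r'(?m)^[^\S\n]+?$')

# Hypothetical sample: the second line holds only spaces and the
# fourth line holds only a tab.
got = 'first\n   \nsecond\n\t\nthird\n'

# Whitespace-only lines are emptied, but every newline is kept.
cleaned = BLANK_RUN.sub('', got)
print(repr(cleaned))
```

The number of lines is unchanged; only the whitespace inside otherwise 
blank lines is removed.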

The new behavior of re.split() matches the behavior of regex.split() 
with the VERSION1 flag, and the new behavior of re.findall() and 
re.finditer() matches the behavior of the corresponding functions in the 
regex module (independently of the version flag). But the new behavior 
of re.sub() doesn't exactly match the behavior of regex.sub() with any 
version flag. It differs from the old behavior, as you can see in the 
example above, but is closer to it than to regex.sub() with VERSION1. 
This avoided breaking existing tests for re.sub().

 >>> regex.sub(r'(\W+|(?<=-))', r':', 'Self-Defence Class')
'Self:Defence:Class'
 >>> regex.sub(r'(?V1)(\W+|(?<=-))', r':', 'Self-Defence Class')
'Self::Defence:Class'
 >>> re.sub(r'(\W+|(?<=-))', r':', 'Self-Defence Class')
'Self:Defence:Class'

Like re.split(), re.sub() never matches the empty string adjacent to the 
previous match. re.findall() and re.finditer() only refuse to match the 
empty string adjacent to a previous empty-string match. In the regex 
module regex.sub() is mutually consistent with regex.findall() and 
regex.finditer() (with the VERSION1 flag), but regex.split() is not 
consistent with them. In the re module re.split() and re.sub() will be 
mutually consistent, as will re.findall() and re.finditer(). This is 
more backward compatible. And I don't know of any reason to prefer the 
behavior of re.findall() and re.finditer() over the behavior of 
re.split() in this corner case.

It would be nice to get this change into 3.7.0a3 for wider testing. 
Please review the patch [9] or share your thoughts about this change.

[1] https://docs.python.org/3/library/re.html
[2] https://pypi.python.org/pypi/regex/
[3] https://mail.python.org/pipermail/python-dev/2004-August/047272.html
[4] https://bugs.python.org/issue852532
[5] https://bugs.python.org/issue988761
[6] https://bugs.python.org/issue1647489
[7] https://bugs.python.org/issue3262
[8] https://bugs.python.org/issue25054
[9] https://github.com/python/cpython/pull/4471