Strange re behavior: normal?

Fri Aug 15 09:20:00 EDT 2003

Robin Munn wrote:
> How is re.split supposed to work? This wasn't at all what I expected:
> 
> [rmunn at localhost ~]$ python
> Python 2.2.2 (#1, Jan 12 2003, 12:07:20)
> [GCC 3.2] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> 
> >>> import re
> >>> re.split(r'\W+', 'a b c d')
> 
> ['a', 'b', 'c', 'd']
> 
> >>> # Expected result.
> 
> ...
> 
> >>> re.split(r'\b', 'a b c d')
> 
> ['a b c d']
> 
> >>> # Huh?
> 
> 
> Since \b matches the empty string, but only at the beginning and end of
> a word, I would have expected re.split(r'\b', 'a b c d') to produce
> either:
> 
>     ['', 'a', ' ', 'b', ' ', 'c', ' ', 'd', '']
> 
> or:
> 
>     ['a', ' ', 'b', ' ', 'c', ' ', 'd']
> 
> But I didn't expect that re.split(r'\b', 'a b c d') would yield no splits
> whatsoever. The module doc says "split(pattern, string[, maxsplit = 0]):
> split string by the occurrences of pattern". re.findall() seems to think
> that \b occurs eight times in 'a b c d':
> 
> 
> >>> re.findall(r'\b', 'a b c d')
> 
> ['', '', '', '', '', '', '', '']
> 
> So why doesn't re.split() think so? I'm puzzled.
> 

It looks like re.split is not splitting on zero-length matches.  I get 
similar behavior (in Python 2.2.2) if I try:

   re.split('x*', 'a b c d')

or

   re.split('(?=c)', 'a b c d')

I don't have the Python source handy to verify this hypothesis, though.

If this is correct, it should at least be documented.
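
In the meantime, here is a rough sketch of a workaround, assuming the hypothesis above holds. The helper name split_including_empty is my own invention and not part of the re module. It walks the matches reported by re.finditer and slices the string between them, so zero-length matches count as split points too:

```python
import re

def split_including_empty(pattern, string):
    """Split string at every match of pattern, zero-length ones included.

    Hypothetical helper for illustration; not part of the re module.
    """
    pieces = []
    last = 0
    for m in re.finditer(pattern, string):
        # Text between the end of the previous match and the start of this one.
        pieces.append(string[last:m.start()])
        last = m.end()
    # Whatever is left after the final match.
    pieces.append(string[last:])
    return pieces

print(split_including_empty(r'\b', 'a b c d'))
# ['', 'a', ' ', 'b', ' ', 'c', ' ', 'd', '']
```

That gives the first of the two results Robin expected. Dropping the empty strings at either end, if they are unwanted, gives the second.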

David