Strange re behavior: normal?
David C. Fox
davidcfox at post.harvard.edu
Fri Aug 15 09:20:00 EDT 2003
Robin Munn wrote:
> How is re.split supposed to work? This wasn't at all what I expected:
>
> [rmunn at localhost ~]$ python
> Python 2.2.2 (#1, Jan 12 2003, 12:07:20)
> [GCC 3.2] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
>
> >>> import re
> >>> re.split(r'\W+', 'a b c d')
>
> ['a', 'b', 'c', 'd']
>
> >>> # Expected result.
>
> ...
>
> >>> re.split(r'\b', 'a b c d')
>
> ['a b c d']
>
> >>> # Huh?
>
>
> Since \b matches the empty string, but only at the beginning and end of
> a word, I would have expected re.split(r'\b', 'a b c d') to produce
> either:
>
> ['', 'a', ' ', 'b', ' ', 'c', ' ', 'd', '']
>
> or:
>
> ['a', ' ', 'b', ' ', 'c', ' ', 'd']
>
> But I didn't expect that re.split(r'\b', 'a b c d') would yield no splits
> whatsoever. The module doc says "split(pattern, string[, maxsplit = 0]):
> split string by the occurrences of pattern". re.findall() seems to think
> that \b occurs eight times in 'a b c d':
>
>
> >>> re.findall(r'\b', 'a b c d')
>
> ['', '', '', '', '', '', '', '']
>
> So why doesn't re.split() think so? I'm puzzled.
>
It looks like re.split is not splitting on zero-length matches. I get
similar behavior (in Python 2.2.2) if I try:
    re.split('x*', 'a b c d')
or
    re.split('(?=c)', 'a b c d')
I don't have the Python source handy to verify this hypothesis, though.
If this is correct, it should at least be documented.
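
If you need the splits at zero-length matches anyway, one possible
workaround is to walk the matches yourself and slice the string between
them. The following is only a sketch: split_including_empty is a
hypothetical helper name, not part of the re module, and it relies on
re.finditer reporting the empty matches the same way re.findall does
above. (As far as I know, later Python releases, 3.7 and up, changed
re.split so that it does split on empty matches.)

```python
import re

def split_including_empty(pattern, string):
    """Split string at every match of pattern, including empty matches.

    Hypothetical helper for illustration; not part of the re module.
    """
    pieces = []
    last = 0  # end of the previous match
    for m in re.finditer(pattern, string):
        # Keep the text between the previous match and this one.
        pieces.append(string[last:m.start()])
        last = m.end()
    # Keep whatever follows the final match.
    pieces.append(string[last:])
    return pieces

print(split_including_empty(r'\b', 'a b c d'))
print(split_including_empty(r'(?=c)', 'a b c d'))
print(split_including_empty(r'\W+', 'a b c d'))
```

With r'\b' this gives the first of the two results Robin expected, with
empty strings at both ends, because \b also matches at the very start
and end of the string. For a pattern that never matches the empty
string, such as r'\W+', it gives the same pieces as re.split.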
David