Strange re behavior: normal?

Thu Aug 14 16:33:05 EDT 2003

How is re.split supposed to work? This wasn't at all what I expected:

[rmunn@localhost ~]$ python
Python 2.2.2 (#1, Jan 12 2003, 12:07:20)
[GCC 3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> re.split(r'\W+', 'a b c d')
['a', 'b', 'c', 'd']
>>> # Expected result.
...
>>> re.split(r'\b', 'a b c d')
['a b c d']
>>> # Huh?

Since \b matches the empty string, but only at the beginning and end of
a word, I would have expected re.split(r'\b', 'a b c d') to produce
either:

    ['', 'a', ' ', 'b', ' ', 'c', ' ', 'd', '']

or:

    ['a', ' ', 'b', ' ', 'c', ' ', 'd']

But I didn't expect that re.split(r'\b', 'a b c d') would yield no splits
whatsoever. The module doc says "split(pattern, string[, maxsplit = 0]):
split string by the occurrences of pattern". re.findall() seems to think
that \b occurs eight times in 'a b c d':

>>> re.findall(r'\b', 'a b c d')
['', '', '', '', '', '', '', '']

So why doesn't re.split() think so? I'm puzzled.
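
For reference, here is a minimal sketch of what I was after. It assumes
only the documented behavior that re.split() includes the text of any
capturing group in its result, and that re.finditer() reports empty
matches. The names text, boundaries and pieces are just illustrative. It
does not explain why splitting on \b alone does nothing; it only shows
where the boundaries are and one way to get the second list above:

```python
import re

text = 'a b c d'

# Each \b match is empty, so its start position is the boundary itself.
boundaries = [m.start() for m in re.finditer(r'\b', text)]

# Split on runs of non-word characters; the capturing group keeps
# each separator in the result instead of discarding it.
pieces = re.split(r'(\W+)', text)

print(boundaries)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(pieces)      # ['a', ' ', 'b', ' ', 'c', ' ', 'd']
```

The capturing-group form avoids depending on how re.split() treats a
pattern that matches the empty string.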

-- 
Robin Munn <rmunn at pobox.com> | http://www.rmunn.com/ | PGP key 0x6AFB6838
-----------------------------+-----------------------+----------------------
"Remember, when it comes to commercial TV, the program is not the product.
YOU are the product, and the advertiser is the customer." - Mark W. Schumann