regex \b behaviour in python

Thu Jun 19 18:50:53 EDT 2008

On Jun 19, 8:46 pm, André Malo <auch-ic... at g-kein-spam.com> wrote:
> * Walter Cruz wrote:
> > irb(main):001:0>"walter ' cruz".split(/\b/)
> > => ["walter", " ' ", "cruz"]
>
> > and in php:
>
> > Array
> > (
> >     [0] =>
> >     [1] => walter
> >     [2] =>  '
> >     [3] => cruz
> >     [4] =>
> > )
>
> > But in python the behaviour of \b is different from ruby or php.
>
> My python here does the same, actually:
>
> $ cat foo.py
> import re
>
> x = "walter ' cruz"
> s = 0
> r = []
> for m in re.finditer(r'\b', x):
>     p = m.start()
>     if s != p:
>         r.append(x[s:p])
>         s = p
>
> print r
>
> $ python2.4 foo.py
> ['walter', " ' ", 'cruz']
> $ python2.5 foo.py
> ['walter', " ' ", 'cruz']
> $
>
Another way is:

>>> re.split(r"(\W+)", "walter ' cruz")
['walter', " ' ", 'cruz']

\W+ matches the non-word characters, and the capturing parentheses
cause them to be returned as well.
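
To make the effect of the parentheses concrete, here is a small
side-by-side sketch. The variable names are only for illustration;
the calls are plain re.split:

```python
import re

text = "walter ' cruz"

# Without a capturing group the separators are discarded.
without_group = re.split(r"\W+", text)

# With a capturing group the matched separators are kept in the result.
with_group = re.split(r"(\W+)", text)

print(without_group)  # ['walter', 'cruz']
print(with_group)     # ['walter', " ' ", 'cruz']
```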

I'm surprised that splitting on \b doesn't work as expected, so it
might be that re.split has been defined only to split on one or more
characters. Is it something that should be 'fixed'?
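
Whatever re.split does with a zero-width pattern, the finditer loop
quoted above can be wrapped into a small helper that does not depend
on it. This is only a sketch: split_on_empty is a made-up name, not
something in the re module, and it drops empty pieces rather than
returning them as php does. Unlike the quoted loop, it also keeps any
text left over after the last match:

```python
import re

def split_on_empty(pattern, text):
    """Split text wherever pattern matches, zero-width matches included.

    Hypothetical helper, not part of the re module. Empty pieces are
    skipped, and the matched text itself is not returned.
    """
    pieces = []
    start = 0
    for m in re.finditer(pattern, text):
        pos = m.start()
        if pos != start:
            pieces.append(text[start:pos])
        # For a zero-width match, m.end() == m.start().
        start = m.end()
    # Keep whatever follows the last match.
    if start < len(text):
        pieces.append(text[start:])
    return pieces

print(split_on_empty(r"\b", "walter ' cruz"))  # ['walter', " ' ", 'cruz']
```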