regex \b behaviour in python

MRAB google at mrabarnett.plus.com
Thu Jun 19 18:50:53 EDT 2008


On Jun 19, 8:46 pm, André Malo <auch-ic... at g-kein-spam.com> wrote:
> * Walter Cruz wrote:
> > irb(main):001:0>"walter ' cruz".split(/\b/)
> > => ["walter", " ' ", "cruz"]
>
> > and in php:
>
> > Array
> > (
> >     [0] =>
> >     [1] => walter
> >     [2] =>  '
> >     [3] => cruz
> >     [4] =>
> > )
>
> > But in python the behaviour of \b is different from ruby or php.
>
> My python here does the same, actually:
>
> $ cat foo.py
> import re
>
> x = "walter ' cruz"
> s = 0
> r = []
> for m in re.finditer(r'\b', x):
>     p = m.start()
>     if s != p:
>         r.append(x[s:p])
>         s = p
>
> print r
>
> $ python2.4 foo.py
> ['walter', " ' ", 'cruz']
> $ python2.5 foo.py
> ['walter', " ' ", 'cruz']
> $
>
Another way is:

>>> re.split(r"(\W+)", "walter ' cruz")
['walter', " ' ", 'cruz']

\W+ matches runs of non-word characters, and the capturing parentheses
cause those separators to be returned as well.
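
To make the effect of the parentheses concrete, here is a small sketch
comparing the two forms. The variable names are only for illustration;
the only library call used is re.split.

```python
import re

text = "walter ' cruz"

# Without a capturing group, the separators are discarded.
without_group = re.split(r"\W+", text)

# With a capturing group, each matched separator is kept in the result.
with_group = re.split(r"(\W+)", text)

print(without_group)  # ['walter', 'cruz']
print(with_group)     # ['walter', " ' ", 'cruz']
```

The second form gives the same list as the finditer loop quoted above.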

I'm surprised that splitting on \b doesn't work as expected. It might
be that re.split has been defined to split only on matches of one or
more characters, so a zero-width match such as \b never splits. Is
that something that should be 'fixed'?
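
In the meantime, one way to get the word-boundary pieces without relying
on how re.split treats zero-width matches is to collect the alternating
runs directly with re.findall. This is only a sketch: the helper name
split_on_boundaries is made up for illustration, and it assumes you want
every run of word and non-word characters with no empty strings at the
ends.

```python
import re

def split_on_boundaries(text):
    # Hypothetical helper: return alternating runs of word characters
    # and non-word characters, which is what splitting at every \b
    # position would produce (minus any empty leading/trailing items).
    return re.findall(r"\w+|\W+", text)

print(split_on_boundaries("walter ' cruz"))  # ['walter', " ' ", 'cruz']
```

Because every character is either \w or \W, joining the pieces gives
back the original string.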


