Strange re behavior: normal?

Sun Aug 17 14:40:31 EDT 2003

Michael Janssen wrote:

> Well, I believe it's a good choice not to split a string by an empty
> string, but when you really want to (version with empty results at the
> start and end omitted):
>
> def boundary_split(s):
>      back = []
>      while 1:
>          try:
>              # r'.\b' and +1 prevents endless loop
>              pos = re.search(r'.\b', s, re.DOTALL).start()+1
>          except AttributeError:
>              if s: back.append(s)
>              break
>          back.append(s[:pos])
>          s = s[pos:]
>      return back

note that \b is defined in terms of \w and \W, so you can replace the
above with:

import re

def boundary_split(text):
    return re.findall(r"\w+|\W+", text)
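
a quick way to convince yourself that the two versions agree is to run both on a few strings. the sketch below is only a sanity check, not a proof: loop_split is a transcription of the quoted function under a made-up name, and the sample strings are arbitrary.

```python
import re

def loop_split(s):
    # transcription of the quoted loop version (hypothetical name)
    back = []
    while True:
        match = re.search(r".\b", s, re.DOTALL)
        if match is None:
            # no boundary left: keep whatever remains and stop
            if s:
                back.append(s)
            break
        pos = match.start() + 1
        back.append(s[:pos])
        s = s[pos:]
    return back

def boundary_split(text):
    return re.findall(r"\w+|\W+", text)

# arbitrary sample strings, chosen for illustration only
samples = ["", "spam", "spam, eggs", "  spam\neggs!  ", "a-b_c d"]
for sample in samples:
    assert loop_split(sample) == boundary_split(sample)
```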

> What's the good of splitting by boundaries? Someone else wanted this a
> few days ago on tutor and I can't figure out a reason by now.

the function extracts the words from a text, but includes the non-word
parts in the list as well (unlike, e.g., text.split() and
re.findall(r"\w+", text)).
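
to make the difference concrete, here's a small comparison of the three approaches on one string. the sample text and the variable names are made up for illustration.

```python
import re

text = "Hello, world! ok"  # arbitrary sample

by_whitespace = text.split()
words_only = re.findall(r"\w+", text)
words_and_gaps = re.findall(r"\w+|\W+", text)

print(by_whitespace)    # ['Hello,', 'world!', 'ok']
print(words_only)       # ['Hello', 'world', 'ok']
print(words_and_gaps)   # ['Hello', ', ', 'world', '! ', 'ok']

# nothing is dropped, so joining the parts gives back the original text
assert "".join(words_and_gaps) == text
```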

might be useful if you're writing some kind of text filter.

    for part in re.findall(r"\w+|\W+", text):
        ...
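
as a rough sketch of such a filter, the loop body can change the word parts and pass everything else through untouched. the function name shout and its targets argument are invented for this example.

```python
import re

def shout(text, targets):
    # hypothetical filter: upper-case the words listed in targets
    out = []
    for part in re.findall(r"\w+|\W+", text):
        if part.lower() in targets:
            part = part.upper()
        out.append(part)
    # separators were kept in the list, so a plain join restores the layout
    return "".join(out)

print(shout("spam and eggs, spam!", {"spam"}))  # SPAM and eggs, SPAM!
```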

here's an alternative pattern, which might be easier to use:

    for word, sep in re.findall(r"(\w+)(\W*)", text):
        ...
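
a small illustration of what the paired form returns, assuming an arbitrary sample string. one thing to watch for: any non-word text in front of the first word isn't captured by this pattern.

```python
import re

text = "one, two;  three"  # arbitrary sample
pairs = re.findall(r"(\w+)(\W*)", text)
print(pairs)  # [('one', ', '), ('two', ';  '), ('three', '')]

# hypothetical filter: capitalize each word, keep separators as they are
result = "".join(word.capitalize() + sep for word, sep in pairs)
print(result)  # One, Two;  Three

# caveat: non-word text before the first word is not captured
leading = re.findall(r"(\w+)(\W*)", "  spam")
print(leading)  # [('spam', '')]
```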

</F>

PS. for proper support of non-ASCII text, prefix the pattern with (?u)
for ISO-8859-1 or Unicode strings, or (?L) to support localized text
(locale.setlocale).