re.split() not keeping matched text

Sun Jul 25 16:55:21 EDT 2004

On Sun, 25 Jul 2004, Robert Oschler wrote:

> Given the following program:
> 
> --------------
> 
> import re
> 
> x = "The dog ran. The cat eats! The bird flies? Done."
> l = re.split("[.?!]", x)
> 
> for s in l:
>   print s.strip()
> # for
> ---------------

> I want to keep the punctuation marks.
> 
> Where am I going wrong here?

re.split() throws away whatever the pattern matches, so the punctuation 
must not be part of the match.  What you need is some magic with (?<=...), 
the 'look-behind assertion':

re.split(r'(?<=[.?!])\s*', x)

What this regex is saying is "match a run of whitespace (possibly empty) 
that follows one of [.?!]".  This way, it will not consume the punctuation, 
but will consume the whitespace (thus killing two birds with one stone by 
obviating the need for the subsequent s.strip()).
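
To see what the pattern actually consumes, here is a minimal sketch that 
lists the matches with re.finditer() instead of splitting.  The variable 
name "matched" is only for illustration.

```python
import re

x = "The dog ran. The cat eats! The bird flies? Done."

# The look-behind only checks that the previous character is one of . ? !
# It is not part of the match, so each match holds whitespace only.
matched = [m.group() for m in re.finditer(r'(?<=[.?!])\s*', x)]

for text in matched:
    print(repr(text))
```

The last match is the empty string: nothing follows the final period, and 
\s* is allowed to match zero characters.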

Unfortunately, there is a slight bug: if the punctuation is not followed 
by whitespace, re.split won't split there, because the regex matches a 
zero-length string and re.split ignores empty matches.  There is a patch 
to fix this (SF #988761, see the end of the message for a link), but until 
then, you can avoid the problem by requiring at least one whitespace 
character:

re.split(r'(?<=[.?!])\s+', x)

This won't split at end-of-sentence marks that are not followed by 
whitespace, but that may be preferable behaviour anyway (e.g. if you're 
parsing Python documentation, where names such as re.split contain a 
period that does not end a sentence).
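
For reference, a small self-contained sketch of the \s+ version applied to 
the original string.  The second string, "tight", is a made-up input added 
only to show what happens when a period is not followed by whitespace.

```python
import re

pattern = r'(?<=[.?!])\s+'

x = "The dog ran. The cat eats! The bird flies? Done."
sentences = re.split(pattern, x)
for s in sentences:
    # No strip() needed: the whitespace was consumed by the split.
    print(s)

# Hypothetical input: the first period has no whitespace after it,
# so no split happens at that point.
tight = "No space here.Next sentence. Last one."
parts = re.split(pattern, tight)
print(parts)
```

Since \s+ never matches an empty string, this form does not depend on how 
re.split treats zero-length matches.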

Hope this helps.

Patch #988761: 
http://sourceforge.net/tracker/index.php?func=detail&aid=988761&group_id=5470&atid=305470