re.sub does not replace all occurrences

Christoph Krammer redtiger84 at googlemail.com
Tue Aug 7 13:28:24 EDT 2007


Hello everybody,

I wanted to use re.sub to strip all HTML tags out of a given string. I
learned that there are better ways to do this without the re module,
but I would like to know why my code is not working. I use the
following:

import re

def stripHtml(source):
  source = re.sub("[\n\r\f]", " ", source)
  source = re.sub("<.*?>", "", source, re.S | re.I | re.M)
  source = re.sub("&(#[0-9]{1,3}|[a-z]{3,6});", "", source, re.I)
  return source

But the result still has some tags in it. When I run the second
substitution multiple times, all tags disappear, but since HTML tags
cannot overlap, I do not understand this behavior. It even makes a
difference whether I pass the re.I (IGNORECASE) option. Without this
option, some tags containing only capital letters (like </FONT>) were
kept in the string after one processing run but removed after
multiple runs.

Perhaps someone can tell me why this regex is behaving like this.
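
A minimal sketch that reproduces the symptom, under one assumption about the cause: the signature is re.sub(pattern, repl, string, count=0, flags=0), so a flag value passed as the fourth positional argument is bound to count and limits the number of replacements. The sample input and the helper names below are hypothetical and only serve as illustration.

```python
import re

TAG = "<.*?>"

# Hypothetical input with 30 tags, more than the numeric value of the flags.
html = "<b>x</b>" * 15

# re.S is 16, re.I is 2 and re.M is 8, so the combined value is 26.
flag_value = int(re.S | re.I | re.M)

def strip_tags_as_posted(source):
    # Assumed equivalent of re.sub(TAG, "", source, re.S | re.I | re.M):
    # the fourth positional parameter is count, so at most 26 tags are removed.
    return re.sub(TAG, "", source, count=flag_value)

def strip_tags_with_flags(source):
    # Passing the flags by keyword leaves count at 0, meaning "replace all".
    return re.sub(TAG, "", source, flags=re.S | re.I | re.M)

partial = strip_tags_as_posted(html)    # 4 tags survive the first run
second = strip_tags_as_posted(partial)  # a second run removes the rest
full = strip_tags_with_flags(html)      # all tags removed in one run

print(flag_value, partial.count("<"), second.count("<"), full.count("<"))
```

If this is indeed the cause, it would also account for the re.I observation: dropping re.I changes the limit from 26 to 24, so a different set of tags is left over, regardless of their case. The entity substitution on the third line would likewise be limited to 2 replacements, since re.I has the value 2.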

Thanks and regards,
 Christoph



