re.sub does not replace all occurrences

Marc 'BlackJack' Rintsch bj_666 at gmx.net
Tue Aug 7 13:49:42 EDT 2007


On Tue, 07 Aug 2007 10:28:24 -0700, Christoph Krammer wrote:

> Hello everybody,
> 
> I wanted to use re.sub to strip all HTML tags out of a given string. I
> learned that there are better ways to do this without the re module,
> but I would like to know why my code is not working. I use the
> following:
> 
> def stripHtml(source):
>   source = re.sub("[\n\r\f]", " ", source)
>   source = re.sub("<.*?>", "", source, re.S | re.I | re.M)
>   source = re.sub("&(#[0-9]{1,3}|[a-z]{3,6});", "", source, re.I)
>   return source
> 
> But the result still has some tags in it. When I call the second line
> multiple times, all tags disappear, but since HTML tags cannot be
> overlapping, I do not understand this behavior. There is even a
> difference when I omit the re.I (IGNORECASE) option. Without this
> option, some tags containing only capital letters (like </FONT>) were
> kept in the string when doing one processing run but removed when
> doing multiple runs.

Can you give some example HTML where it fails?
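
One thing worth checking in the meantime, independent of the input: the
fourth positional argument of re.sub is `count`, not `flags`. If that is
what happens here, `re.S | re.I | re.M` is read as the integer 26 and
`re.I` as 2, so only that many replacements are made per call, which
would fit the "works when run several times" symptom. Below is a minimal
sketch of that reading. The name `strip_html` and the sample string are
made up for illustration, and the `flags=` keyword assumes Python 2.7 /
3.1 or newer; on older versions, compile the pattern with re.compile
and call its .sub method instead.

```python
import re

def strip_html(source):
    # Hypothetical rewrite of the quoted stripHtml: flags are passed by
    # keyword so they cannot be mistaken for the `count` argument.
    source = re.sub(r"[\n\r\f]", " ", source)
    source = re.sub(r"<.*?>", "", source, flags=re.S | re.I | re.M)
    source = re.sub(r"&(#[0-9]{1,3}|[a-z]{3,6});", "", source, flags=re.I)
    return source

# Made-up sample with four tags.
sample = "<B>a</B><I>b</I>"

# What a positional re.I amounts to: count=2, so two tags are removed
# and the rest are left in place.
limited = re.sub(r"<.*?>", "", sample, count=int(re.I))

# With flags passed by keyword there is no limit on replacements.
cleaned = strip_html(sample)
```

Dropping re.I from the positional call changes the integer that ends up
in `count`, which could also explain why omitting that option changed
how many tags survived.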

Ciao,
	Marc 'BlackJack' Rintsch


