re.sub does not replace all occurrences

Neil Cerutti horpner at yahoo.com
Tue Aug 7 14:18:43 EDT 2007


On 2007-08-07, Christoph Krammer <redtiger84 at googlemail.com> wrote:
> Hello everybody,
>
> I wanted to use re.sub to strip all HTML tags out of a given string. I
> learned that there are better ways to do this without the re module,
> but I would like to know why my code is not working. I use the
> following:
>
> def stripHtml(source):
>   source = re.sub("[\n\r\f]", " ", source)
>   source = re.sub("<.*?>", "", source, re.S | re.I | re.M)
>   source = re.sub("&(#[0-9]{1,3}|[a-z]{3,6});", "", source, re.I)
>   return source
>
> But the result still has some tags in it. When I call the
> second line multiple times, all tags disappear, but since HTML
> tags cannot be overlapping, I do not understand this behavior.
> There is even a difference when I omit the re.I (IGNORECASE)
> option. Without this option, some tags containing only capital
> letters (like </FONT>) were kept in the string when doing one
> processing run but removed when doing multiple runs.
>
> Perhaps someone can tell me why this regex is behaving like
> this.

>>> import re
>>> help(re.sub)
Help on function sub in module re:

sub(pattern, repl, string, count=0)
    Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
    replacement repl.  repl can be either a string or a callable;
    if a callable, it's passed the match object and must return
    a replacement string to be used.

And from the Python Library Reference for re.sub:

    The pattern may be a string or an RE object; if you need to
    specify regular expression flags, you must use a RE object,
    or use embedded modifiers in a pattern; for example,
    "sub("(?i)b+", "x", "bbbb BBBB")" returns 'x x'. 

    The optional argument count is the maximum number of pattern
    occurrences to be replaced; count must be a non-negative
    integer. If omitted or zero, all occurrences will be
    replaced. Empty matches for the pattern are replaced only
    when not adjacent to a previous match, so "sub('x*', '-',
    'abc')" returns '-a-b-c-'. 

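In other words, the fourth argument to sub is count, not a set of
re flags. The flag constants are plain integers (re.I is 2, re.M
is 8, re.S is 16), so re.S | re.I | re.M is accepted as count=26
and only the first 26 tags are removed on each call. Dropping
re.I changes the limit to 24, which is why the output differed,
and the re.I on the entity line limits that call to 2
replacements.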
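
Here is a rough sketch of both the problem and one possible fix,
using compiled RE objects as the Library Reference suggests. The
names strip_html, TAG_RE, ENTITY_RE, NEWLINE_RE and the sample
strings are my own, made up for illustration; the patterns are
the ones from your post.

```python
import re

# The flag constants are integers, so sub() accepts them as a count.
flags_as_count = int(re.S | re.I | re.M)  # 16 + 2 + 8

html = "<b>x</b>" * 20  # 40 tags in total

# Same effect as passing the flags in the count position:
# replacement stops after flags_as_count matches.
capped = re.sub("<.*?>", "", html, count=flags_as_count)

# Attach the flags to compiled patterns instead, so they cannot be
# mistaken for count. For "<.*?>" only re.S has any effect; the
# other two flags are kept only to mirror the original call.
NEWLINE_RE = re.compile(r"[\n\r\f]")
TAG_RE = re.compile(r"<.*?>", re.S | re.I | re.M)
ENTITY_RE = re.compile(r"&(#[0-9]{1,3}|[a-z]{3,6});", re.I)


def strip_html(source):
    """Strip tags and simple entities; count defaults to 0 (replace all)."""
    source = NEWLINE_RE.sub(" ", source)
    source = TAG_RE.sub("", source)
    source = ENTITY_RE.sub("", source)
    return source


# Hypothetical sample input, not taken from the original post.
sample = "<P><FONT SIZE=2>one</FONT>\n<b>two</b>&nbsp;&AMP;three</P>"
cleaned = strip_html(sample)
```

Embedded modifiers such as "(?is)<.*?>" should work as well.
Python releases newer than the one quoted above also accept a
flags keyword argument to re.sub, if your version has it.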

-- 
Neil Cerutti


