Stripping HTML with RE

Tue Nov 9 18:20:06 EST 2004

Steveo <stephen_p_barrett <at> hotmail.com> writes:
> 
> I wanted to allow all H1 and H2 tags so I changed it to:
> 
> re.compile("<[^H1|^H2]*?>")
> 
> This seemed to work, but it also allowed the HTML tag (basically anything
> with an H or a 1 or a 2).  How can I get this to strip all tags except
> H1 and H2?  Any help you could give would be great.

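The square brackets make a character class, so [^H1|^H2] just means "any one
character other than H, 1, 2, | or ^", which is why a tag like <HTML> slips
through.  You probably want a lookahead assertion instead.  From the docs at
http://docs.python.org/lib/re-syntax.html: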

(?!...)
    Matches if ... doesn't match next. This is a negative lookahead assertion.
    For example, Isaac (?!Asimov) will match 'Isaac ' only if it's not followed
    by 'Asimov'.
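
To see the docs' example in action, here is a small sketch. The variable
names are just mine for illustration; the point is that the lookahead only
checks what follows and never consumes it.

```python
import re

# Negative lookahead: match 'Isaac ' only when 'Asimov' does not follow.
pattern = re.compile(r'Isaac (?!Asimov)')

# No match: the lookahead sees 'Asimov' and rejects this position.
asimov = pattern.search('Isaac Asimov')

# Match: 'Newton' is checked by the lookahead but not consumed,
# so the matched text is just 'Isaac '.
newton = pattern.search('Isaac Newton')
```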

So I would write your example something like:

>>> re.sub(r'</?(?!H1|H2|/H1|/H2)[^>]*>', r'', '<a>sdfsa</a>')
'sdfsa'
>>> re.sub(r'</?(?!H1|H2|/H1|/H2)[^>]*>', r'', '<H1>sdfsa</a>')
'<H1>sdfsa'
>>> re.sub(r'</?(?!H1|H2|/H1|/H2)[^>]*>', r'', '<H1>sdfsa</H2>')
'<H1>sdfsa</H2>'

(I was too lazy to compile the re, but of course that's what you'd normally want
to do.)
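
A compiled version might look something like the sketch below. The names
STRIP_TAGS and strip_tags are made up for this example. I've also made two
changes that are my own assumptions about what you want, not part of the
pattern above: a \b after the tag name so that only tags named exactly H1 or
H2 are kept, and re.IGNORECASE so that <h1> and <h2> are kept too.

```python
import re

# Hypothetical names, chosen for illustration only.
# (?!/?(?:H1|H2)\b) rejects a tag whose name is exactly H1 or H2,
# whether it is an opening or a closing tag.
# re.IGNORECASE (an assumption about the desired behavior) also keeps
# lowercase <h1> and <h2>.
STRIP_TAGS = re.compile(r'<(?!/?(?:H1|H2)\b)[^>]*>', re.IGNORECASE)

def strip_tags(html):
    """Remove every tag except H1 and H2 (opening or closing)."""
    return STRIP_TAGS.sub('', html)
```

Like the one-liners above, this assumes no attribute value contains a '>'
character; for messier HTML a real parser is the safer choice.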

Steve