Stripping HTML with RE
Steven Bethard
steven.bethard at gmail.com
Tue Nov 9 18:20:06 EST 2004
Steveo <stephen_p_barrett <at> hotmail.com> writes:
>
> I wanted to allow all H1 and H2 tags so I changed it to:
>
> re.compile("<[^H1|^H2]*?>")
>
> This seemed to work, but it also allowed the HTML tag (basically anything
> with an H or a 1 or a 2). How can I get this to strip all tags except
> H1 and H2? Any help you could give would be great.
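Inside square brackets, [^H1|^H2] is a negated character class, not an
alternation: it matches any single character other than 'H', '1', '2', '|'
or '^'. That's why every tag containing an H, a 1 or a 2 was kept.
You probably want a lookahead assertion instead. From the docs at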
http://docs.python.org/lib/re-syntax.html:
    (?!...)
    Matches if ... doesn't match next. This is a negative lookahead assertion.
    For example, Isaac (?!Asimov) will match 'Isaac ' only if it's not followed by
    'Asimov'.
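To see the docs' example in action, here is a small sketch. The name
`pattern` and the test strings are just illustrations, not anything from the
docs. Note that the lookahead only checks what follows; it does not consume
any of it.

```python
import re

# Negative lookahead: match 'Isaac ' only when 'Asimov' does not follow.
pattern = re.compile(r'Isaac (?!Asimov)')

newton = pattern.search('Isaac Newton')
asimov = pattern.search('Isaac Asimov')

print(newton.group())   # the match is just 'Isaac ', 'Newton' is not consumed
print(asimov)           # None, because 'Asimov' follows
```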
So I would write your example something like:
>>> re.sub(r'</?(?!H1|H2|/H1|/H2)[^>]*>', r'', '<a>sdfsa</a>')
'sdfsa'
>>> re.sub(r'</?(?!H1|H2|/H1|/H2)[^>]*>', r'', '<H1>sdfsa</a>')
'<H1>sdfsa'
>>> re.sub(r'</?(?!H1|H2|/H1|/H2)[^>]*>', r'', '<H1>sdfsa</H2>')
'<H1>sdfsa</H2>'
(I was too lazy to compile the re, but of course that's what you'd normally want
to do.)
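For what it's worth, a compiled version might look roughly like this. The
names `TAG_RE` and `strip_tags` are made up for the sketch; the pattern is the
same one used in the examples above. Two caveats worth knowing: it is
case-sensitive as written, and it keeps any tag whose name merely starts with
H1 or H2.

```python
import re

# Same pattern as in the interactive examples, compiled once for reuse.
# As written it is case-sensitive, so a lowercase '<h1>' would be stripped.
TAG_RE = re.compile(r'</?(?!H1|H2|/H1|/H2)[^>]*>')

def strip_tags(text):
    """Remove every tag except those whose name starts with H1 or H2."""
    return TAG_RE.sub('', text)

print(strip_tags('<a>sdfsa</a>'))
print(strip_tags('<H1>sdfsa</a>'))
print(strip_tags('<HTML><H1>x</H1></HTML>'))
```

If lowercase tags should be kept too, passing re.IGNORECASE as the second
argument to re.compile is one option.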
Steve