Stripping HTML with RE

Steven Bethard steven.bethard at gmail.com
Tue Nov 9 18:28:27 EST 2004


I wrote:
> >>> re.sub(r'</?(?!H1|H2|/H1|/H2)[^>]*>', r'', '<a>sdfsa</a>')
> 'sdfsa'

Maybe slightly better:

>>> import re
>>> re.sub(r'<(?!/?(?:H1|H2))[^>]*>', r'', '<a>sdfsa</a>')
'sdfsa'
>>> re.sub(r'<(?!/?(?:H1|H2))[^>]*>', r'', '<H1>sdfsa</a>')
'<H1>sdfsa'
>>> re.sub(r'<(?!/?(?:H1|H2))[^>]*>', r'', '<H1>sdfsa</H2>')
'<H1>sdfsa</H2>'
>>> re.sub(r'<(?!/?(?:H1|H2))[^>]*>', r'', '<H2>sdfsa</H2>')
'<H2>sdfsa</H2>'

The only change is the grouping: the optional slash now sits inside the
negative lookahead, in front of a single (?:H1|H2) group, so I only have to
write H1 and H2 once.
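
If the list of tags to keep might change, the same pattern could be built
from a tuple of names. This is only a rough sketch: the function name
strip_tags_except and its keep parameter are made up for illustration, and
like the pattern above it is case-sensitive and no substitute for a real
HTML parser.

```python
import re

def strip_tags_except(html, keep=("H1", "H2")):
    """Remove every tag from html except those whose name starts with an entry in keep."""
    # Hypothetical helper, not part of the re module.
    if not keep:
        # With nothing to keep, an empty alternation would make the
        # lookahead always fail, so strip every tag directly instead.
        return re.sub(r"<[^>]*>", "", html)
    # re.escape guards against tag names containing regex metacharacters.
    names = "|".join(re.escape(name) for name in keep)
    # Same shape as the pattern above: the optional slash is inside the
    # negative lookahead, so each tag name is written only once.
    pattern = re.compile(r"<(?!/?(?:%s))[^>]*>" % names)
    return pattern.sub("", html)
```

With the default keep it should behave like the calls above, e.g.
strip_tags_except('<H1>sdfsa</a>') gives '<H1>sdfsa'.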

Steve



