Stripping HTML tags from a string

William Park parkw at better.net
Wed May 2 16:56:17 EDT 2001


On Wed, May 02, 2001 at 06:34:57PM +0000, Colin Meeks wrote:
> I know I've seen this somewhere before, but can't find it now I want
> it.  Does anybody know how to strip all HTML tags from a string. I
> imagine I would use a regular expression, but am not fully up to speed
> on these yet.
> 
> i.e "<P>Hello<P><FONT FACE="Arial">This is really cool</FONT> isn't
> it<BR>The End" would give me "Hello This is really cool isn't it The
> End" I would like to replace all <P> and <BR> with a space as this
> would result in something that is more readable.

Since others gave solutions using 'sgmllib', here is a solution using
're' as requested:

>>> import re
>>> pat = re.compile(r'<P\b|<BR\b', re.I)
>>> def func(x):
...     if pat.match(x.group()) is None: return ''
...     else: return ' '
...
>>> s = """<P>Hello<P><FONT FACE="Arial">This is really cool</FONT> isn't
... it<BR>The End"""
>>> re.sub('<[^>]*>', func, s)
" Hello This is really cool isn't\012it The End"

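For reuse outside the interpreter, the same idea can be wrapped in a
function. This is only a sketch: the names strip_tags, SPACE_TAGS and
ANY_TAG are made up for illustration, and the final whitespace-collapsing
step is an addition (not part of the session above) so that the result
matches the single-line output asked for in the question.

```python
import re

# Tags that become a space instead of disappearing (same pattern as above).
SPACE_TAGS = re.compile(r'<P\b|<BR\b', re.IGNORECASE)
# Anything that looks like a tag: '<', then any run of non-'>' characters, then '>'.
ANY_TAG = re.compile(r'<[^>]*>')

def strip_tags(html):
    """Drop tags, turning <P> and <BR> into a space, then collapse whitespace."""
    def replace(match):
        # match.group() is the whole tag text, e.g. '<BR>' or '</FONT>'.
        return ' ' if SPACE_TAGS.match(match.group()) else ''
    text = ANY_TAG.sub(replace, html)
    # Assumption: runs of whitespace (including newlines) should become one space.
    return ' '.join(text.split())

s = '<P>Hello<P><FONT FACE="Arial">This is really cool</FONT> isn\'t\nit<BR>The End'
print(strip_tags(s))
```

With the sample string this prints: Hello This is really cool isn't it The End

Note that a pattern like '<[^>]*>' is fooled by a '>' inside an attribute
value or a comment, so for arbitrary real-world HTML a parser-based
approach is the safer choice.
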
--William Park, Open Geometry Consulting, Mississauga, Ontario, Canada.
  8 CPUs, Linux, Python, LaTeX, vim, mutt



