Remove HTML tags (except anchor tag) from a string using regular expressions

Tue Feb 1 11:03:43 EST 2005

On Tue, Feb 01, 2005 at 01:03:31PM +0100, Nico Grubert wrote:
> Hello,
> 
> I want to remove all html tags from a string "content" except
> <a ...>xxx</a>.
> 
> My script reads like this:
> 
> ###
> import re
> content = re.sub('<([^!>]([^>]|\n)*)>', '', content)
> ###
> 
> It works fine. It removes all html tags from "content".
> Unfortunately, this also removes <a ...>xxx</a> occurrences.
> Any idea how to modify this to remove all html tags except <a ...>xxx</a>?

I'm not sure what the outer parentheses are there for; as far as I can see,

    <([^!>]([^>]|\n)*)>

is the same as

    <[^!>](?:[^>]|\n)*>

for doing a re.sub; capturing parentheses are only needed if you
actually use the groups later on.
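
A quick way to check that equivalence, as a sketch only: the sample string
is made up, and the third pattern is an extra simplification that relies on
a negated character class such as [^>] already matching a newline, which
would make the "|\n" alternative redundant.

```python
import re

# Made-up sample input; any HTML fragment would do.
sample = '<p class="x">Hello <b>big\nworld</b></p><!-- kept -->'

# The two patterns discussed above.
capturing = re.compile(r'<([^!>]([^>]|\n)*)>')
non_capturing = re.compile(r'<[^!>](?:[^>]|\n)*>')

# Assumed simplification (not from the post): [^>] already matches "\n",
# so the alternation can be dropped entirely.
simplified = re.compile(r'<[^!>][^>]*>')

# Substitute with the empty string, exactly as in the original script.
results = [p.sub('', sample) for p in (capturing, non_capturing, simplified)]

for r in results:
    print(repr(r))
```

All three substitutions should give the same string here; the comment at the
end of the sample survives because none of the patterns match a tag that
starts with "!".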

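Try this, which uses a negative lookahead to leave alone any tag that
starts with "a" plus whitespace, with "/a", or with "!":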

    <(?!(?:a\s|/a|!))[^>]*>
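
Here is how that pattern might be used, as a rough sketch with made-up
sample input. The second pattern, keep_anchors_strict, is a hypothetical
variant and not part of the suggestion above: since "/a" on its own also
protects closing tags such as </abbr>, it requires the closing ">" after
"/a", accepts a bare <a>, and ignores case.

```python
import re

# The pattern from the post: skip tags that start with "a" plus whitespace,
# with "/a", or with "!" (comments, doctype); remove every other tag.
keep_anchors = re.compile(r'<(?!(?:a\s|/a|!))[^>]*>')

# Hypothetical stricter variant (an assumption, not from the post):
# only <a ...>, <a> and </a> are protected, in any letter case.
keep_anchors_strict = re.compile(r'<(?!a[\s>]|/a\s*>|!)[^>]*>', re.IGNORECASE)

# Made-up sample input.
content = ('<p>See <a href="http://example.com/">this <b>link</b></a>, '
           '<abbr>e.g.</abbr></p>')

loose = keep_anchors.sub('', content)
strict = keep_anchors_strict.sub('', content)

print(loose)
print(strict)
```

With this sample, the pattern from the post keeps the anchor but also leaves
a stray </abbr> behind, while the stricter variant removes it.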

-- 
John Lenton (john at grulic.org.ar)