Remove HTML tags (except anchor tag) from a string using regular expressions
John Lenton
john at grulic.org.ar
Tue Feb 1 11:03:43 EST 2005
On Tue, Feb 01, 2005 at 01:03:31PM +0100, Nico Grubert wrote:
> Hello,
>
> I want to remove all html tags from a string "content" except <a
> ...>xxx</a>.
>
> My script reads like this:
>
> ###
> import re
> content = re.sub('<([^!>]([^>]|\n)*)>', '', content)
> ###
>
> It works fine. It removes all html tags from "content".
> Unfortunately, this also removes <a ...>xxx</a> occurrences.
> Any idea how to modify this to remove all html tags except <a ...>xxx</a>?
I'm not sure what the outer parentheses are there for; as far as I can see,
<([^!>]([^>]|\n)*)>
is the same as
<[^!>](?:[^>]|\n)*>
when used with re.sub; the grouping parentheses are only needed if you
actually use the groups later on.
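A quick way to check that claim is to run both patterns over the same input and compare the results. This is only a sketch: the `sample` string and the variable names are made up for illustration and are not from the original post.

```python
import re

# The original pattern, with capturing groups.
grouped = re.compile(r'<([^!>]([^>]|\n)*)>')
# The same pattern with the outer group dropped and the inner one non-capturing.
ungrouped = re.compile(r'<[^!>](?:[^>]|\n)*>')

# Hypothetical test input: a tag with attributes, a newline, and a comment.
sample = '<p class="x">Hello\n<b>world</b></p><!-- note -->'

stripped_grouped = grouped.sub('', sample)
stripped_ungrouped = ungrouped.sub('', sample)

# Both substitutions remove the same text; the comment survives because
# [^!>] refuses a "!" right after the "<".
print(stripped_grouped == stripped_ungrouped)  # True
print(repr(stripped_grouped))                  # 'Hello\nworld<!-- note -->'
```

Note that in Python a negated class such as [^>] already matches a newline, so the |\n alternative should not change what either pattern matches.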
Try this:
<(?!(?:a\s|/a|!))[^>]*>
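Here is a small sketch of that pattern in use, plus two edge cases worth knowing about. The input strings and names (`keep_anchors`, `strict`) are invented for illustration, and the stricter variant at the end is my own assumption about what a tighter lookahead might look like, not part of the suggestion above.

```python
import re

# The suggested pattern: skip any tag that starts with "a" plus whitespace,
# with "/a", or with "!"; remove every other tag.
keep_anchors = re.compile(r'<(?!(?:a\s|/a|!))[^>]*>')

content = '<p>Hello <a href="x">link</a> <b>bold</b></p>'
cleaned = keep_anchors.sub('', content)
print(cleaned)  # Hello <a href="x">link</a> bold

# Edge case 1: a bare <a> has no whitespace after the "a", so it is removed,
# while its closing tag is kept.
bare = keep_anchors.sub('', '<a>x</a>')
print(bare)  # x</a>

# Edge case 2: "/a" also matches the start of other closing tags, such as
# </abbr>, so those are kept as well.
abbr = keep_anchors.sub('', '<abbr>HTML</abbr>')
print(abbr)  # HTML</abbr>

# Hypothetical stricter variant (an assumption, not from the thread): require
# whitespace or ">" right after an optional "/" and the "a".
strict = re.compile(r'<(?!/?a[\s>]|!)[^>]*>')
strict_result = strict.sub('', '<a>x</a> <abbr>HTML</abbr>')
print(strict_result)  # <a>x</a> HTML
```

Both patterns are case-sensitive as written, so something like <A HREF=...> would be stripped unless re.IGNORECASE is passed. For HTML that is not well behaved, a real parser is probably a safer bet than any regular expression.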