Clean "Durty" strings

irstas at gmail.com
Mon Apr 2 15:20:26 EDT 2007


On Apr 2, 10:08 pm, Michael Hoffman <cam.ac... at mh391.invalid> wrote:
> irs... at gmail.com wrote:
> > But it could be that he just wants all HTML tags to disappear, like in
> > his example. Code like this might be sufficient then:
> > re.sub(r'<[^>]+>', '', s).
>
> Won't work for, say, this:
>
> <img src="src" alt="<text>">
> --
> Michael Hoffman

True, but is that legal? I think the alt attribute needs to use &lt;
and &gt;. Although I know what you're going to reply: that
BeautifulSoup probably parses it even if it's invalid HTML. And I'd
agree that using BeautifulSoup is a better solution than custom
regexps.
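
For illustration, here is a minimal sketch of the difference on Michael's
example. It assumes Python 3 and uses the standard library's
html.parser.HTMLParser as a stand-in for BeautifulSoup, so that it runs
without installing anything. The helper names (strip_tags_regex,
strip_tags_parser, _TextCollector) are made up for this sketch and are not
from any library.

```python
import re
from html.parser import HTMLParser

# The regexp from earlier in the thread: drop anything that looks like <...>.
TAG_RE = re.compile(r'<[^>]+>')


def strip_tags_regex(s):
    """Naive tag removal; stops each match at the first '>' it sees."""
    return TAG_RE.sub('', s)


class _TextCollector(HTMLParser):
    """Hypothetical helper: keeps only the text found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags_parser(s):
    """Tag removal with a real parser, which understands quoted attributes."""
    collector = _TextCollector()
    collector.feed(s)
    collector.close()  # flush any text still buffered by the parser
    return ''.join(collector.chunks)


sample = '<img src="src" alt="<text>">'
print(repr(strip_tags_regex(sample)))   # the regexp leaves '">' behind
print(repr(strip_tags_parser(sample)))
```

The regexp ends its match at the '>' inside the alt value, so the tail of
the tag leaks into the output. The parser treats the quoted attribute value
as a single unit. BeautifulSoup would be the more forgiving choice for
really broken markup.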



