Stripping scripts from HTML with regular expressions

Stefan Behnel stefan_ml at behnel.de
Fri Apr 11 05:36:14 EDT 2008


Michel Bouwmans wrote:
> I don't think HTMLParser was doing anything wrong here. I needed to parse an
> HTML document, but it contained script blocks with document.write calls in
> them. I only care about the content outside these blocks, but HTMLParser
> chokes on such a block when it isn't wrapped in HTML comment markers and
> tries to parse the contents of the document.write calls. ;)

At the risk of repeating myself: using the right tool for the job is generally
a good idea.

http://codespeak.net/lxml/lxmlhtml.html#cleaning-up-html
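
A minimal sketch of what that could look like, assuming lxml is installed. The
Cleaner class is the one described on the page linked above. In newer lxml
releases it lives in the separate lxml_html_clean package, so the import may
fail. The fallback to etree.strip_elements is a substitution made for this
sketch, not something the linked page prescribes. SAMPLE and strip_scripts are
names invented for illustration.

```python
import lxml.html
from lxml import etree

try:
    # The cleaner described on the linked page. Newer lxml releases ship it
    # in the separate lxml_html_clean package, so this import can fail.
    from lxml.html.clean import Cleaner
except ImportError:
    Cleaner = None

# Hypothetical input: a script block that is not wrapped in comment markers.
SAMPLE = """<html><body>
<p>Visible text</p>
<script type="text/javascript">document.write("<p>injected</p>");</script>
<p>More text</p>
</body></html>"""


def strip_scripts(html_text):
    """Return the text content of html_text with <script> blocks removed."""
    doc = lxml.html.fromstring(html_text)
    if Cleaner is not None:
        # A Cleaner instance is callable on a parsed tree and cleans it in
        # place. Its defaults also drop other things, such as comments.
        Cleaner(scripts=True, javascript=True)(doc)
    else:
        # Assumed fallback: remove only the <script> elements themselves,
        # keeping any text that follows them.
        etree.strip_elements(doc, "script", with_tail=False)
    return doc.text_content()


text = strip_scripts(SAMPLE)
print(text)
```

The parser treats the script body as opaque text rather than markup, so the
document.write argument never becomes part of the tree, and removing the
script element removes all of it.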

Stefan


