Stripping scripts from HTML with regular expressions

Michel Bouwmans mfb.chikazuku at gmail.com
Thu Apr 10 13:17:52 EDT 2008


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Reedick, Andrew wrote:

>> -----Original Message-----
>> From: python-list-bounces+jr9445=att.com at python.org [mailto:python-
>> list-bounces+jr9445=att.com at python.org] On Behalf Of Michel Bouwmans
>> Sent: Wednesday, April 09, 2008 5:44 PM
>> To: python-list at python.org
>> Subject: RE: Stripping scripts from HTML with regular expressions
>> 
>> 
>> Thanks! That did the trick. :) I was trying to use HTMLParser but that
>> choked on the script-blocks that didn't contain comment-indicators. Guess
>> I can now move on with this script, thank you.
>> 
> 
> 
> Soooo.... you asked for help with a regex workaround, but didn't ask for
> help with the original problem, namely HTMLParser?  ;-)
> 
> 
> 

I don't think HTMLParser was doing anything wrong here. I needed to parse an
HTML document, but it contained script blocks with document.write calls in
them. I only care about the content outside these blocks, but HTMLParser
chokes on such a block when it isn't wrapped in HTML comment markers,
because it then tries to parse the markup inside the document.write calls. ;)
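
For anyone finding this later, here is a rough sketch of the approach:
strip the script blocks with a regular expression first, then hand what is
left to the parser. The pattern below is not the one posted earlier in the
thread (that one isn't quoted here), and the names strip_scripts,
TextCollector and visible_text are made up for illustration. It is written
against Python 3's html.parser module, and a regex like this will probably
miss odd cases such as a literal "</script>" inside a JavaScript string.

```python
import re
from html.parser import HTMLParser

# Assumed pattern: one whole <script ...>...</script> block.
# IGNORECASE covers <SCRIPT>, DOTALL lets the body span several lines,
# and the lazy .*? stops at the first closing tag.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>",
                       re.IGNORECASE | re.DOTALL)


def strip_scripts(html):
    """Return html with every script block removed."""
    return SCRIPT_RE.sub("", html)


class TextCollector(HTMLParser):
    """Hypothetical helper: collects non-empty text nodes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def visible_text(html):
    """Strip script blocks, then parse what remains."""
    parser = TextCollector()
    parser.feed(strip_scripts(html))
    parser.close()
    return parser.chunks


sample = (
    "<html><body><p>Before</p>"
    "<script type=\"text/javascript\">"
    "document.write('<b>hi</b>');"
    "</script>"
    "<p>After</p></body></html>"
)
```

Calling visible_text(sample) parses only the markup outside the script
block, so the parser never sees the document.write contents.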

MFB
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)

iD8DBQFH/kvEDpaqHmOKFdQRAgHgAJ4s2YUN6yynUS+8aunhVUR94rs2yQCgrn94
tAFx/dylzEI0TclRDSTRbJI=
=k8SN
-----END PGP SIGNATURE-----


