Stripping scripts from HTML with regular expressions

Michel Bouwmans mfb.chikazuku at gmail.com
Wed Apr 9 17:43:45 EDT 2008


Reedick, Andrew wrote:

> 
> 
>> -----Original Message-----
>> From: python-list-bounces+jr9445=att.com at python.org [mailto:python-
>> list-bounces+jr9445=att.com at python.org] On Behalf Of Michel Bouwmans
>> Sent: Wednesday, April 09, 2008 3:38 PM
>> To: python-list at python.org
>> Subject: Stripping scripts from HTML with regular expressions
>> 
>> Hey everyone,
>> 
>> I'm trying to strip all script blocks from an HTML file using a regex.
>> 
>> I tried the following in Python:
>> 
>> testfile = open('testfile')
>> testhtml = testfile.read()
>> regex = re.compile('<script\b[^>]*>(.*?)</script>', re.DOTALL)
> 
> 
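> Aha! \b is being interpreted as a backspace character. In a normal
> (non-raw) string literal, Python processes the escape before the re
> module ever sees the pattern:
>   \b ASCII Backspace (BS)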
> 
> Always use a raw string with regexes:
> regex = re.compile(r'<script\b[^>]*>(.*?)</script>', re.DOTALL)
> 
> Your regex should now work.

Thanks! That did the trick. :) I was trying to use HTMLParser but that
choked on the script-blocks that didn't contain comment-indicators. Guess I
can now move on with this script, thank you.

MFB
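
For anyone finding this thread later, here is a minimal sketch of the
difference between the two patterns. The sample HTML string and the
variable names (html, plain, raw, stripped) are made up for illustration
and are not taken from the original test file.

```python
import re

# Hypothetical sample input, standing in for the contents of 'testfile'.
html = ('<p>before</p>'
        '<script type="text/javascript">\nalert("hi");\n</script>'
        '<p>after</p>')

# Without the r prefix, Python turns \b into a single backspace
# character (0x08) while parsing the string literal.
plain = re.compile('<script\b[^>]*>(.*?)</script>', re.DOTALL)

# With the r prefix, the backslash survives and the re module
# treats \b as a word boundary.
raw = re.compile(r'<script\b[^>]*>(.*?)</script>', re.DOTALL)

print(len('\b'), len(r'\b'))         # 1 2
print(plain.sub('', html) == html)   # True: the pattern never matched
stripped = raw.sub('', html)
print(stripped)                      # <p>before</p><p>after</p>
```

Note that the pattern is case-sensitive as written, so a <SCRIPT> tag
would need re.IGNORECASE added to the flags.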


