Stripping scripts from HTML with regular expressions

Reedick, Andrew jr9445 at ATT.COM
Wed Apr 9 16:26:44 EDT 2008



> -----Original Message-----
> From: python-list-bounces+jr9445=att.com at python.org [mailto:python-
> list-bounces+jr9445=att.com at python.org] On Behalf Of Michel Bouwmans
> Sent: Wednesday, April 09, 2008 3:38 PM
> To: python-list at python.org
> Subject: Stripping scripts from HTML with regular expressions
> 
> Hey everyone,
> 
> I'm trying to strip all script blocks from an HTML file using regex.
> 
> I tried the following in Python:
> 
> testfile = open('testfile')
> testhtml = testfile.read()
> regex = re.compile('<script\b[^>]*>(.*?)</script>', re.DOTALL)


Aha! In a normal string literal, Python interprets \b as a backspace character, so the regex engine never sees a word-boundary assertion:
  \b ASCII Backspace (BS)

Always use a raw string with regexes:
	regex = re.compile(r'<script\b[^>]*>(.*?)</script>', re.DOTALL)

Your regex should now work.
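
To see the difference for yourself, here is a minimal sketch (Python 3 syntax). The sample HTML string and the variable names are made up for illustration; only the two patterns come from the messages above.

```python
import re

# Hypothetical sample input, not taken from the original testfile.
html = ('<p>before</p>'
        '<script type="text/javascript">\nalert("hi");\n</script>'
        '<p>after</p>')

# Without the r prefix, Python converts \b to a backspace (0x08)
# while parsing the string literal, before re ever sees the pattern.
plain = '<script\b[^>]*>(.*?)</script>'
# With the r prefix, the backslash and the b both reach the regex
# engine, which reads them as a word-boundary assertion.
raw = r'<script\b[^>]*>(.*?)</script>'

print('\x08' in plain)   # the plain pattern contains a backspace
print('\x08' in raw)     # the raw pattern does not

broken = re.compile(plain, re.DOTALL)
fixed = re.compile(raw, re.DOTALL)

# The broken pattern wants a literal backspace after "<script",
# so it finds nothing to remove in ordinary HTML.
print(broken.sub('', html) == html)
# The raw-string pattern removes the whole script block.
print(fixed.sub('', html))
```

The plain pattern leaves the input untouched, while the raw one strips the block and leaves only the two paragraphs.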


