Stripping scripts from HTML with regular expressions

Wed Apr 9 16:11:05 EDT 2008

> -----Original Message-----
> From: python-list-bounces+jr9445=att.com at python.org [mailto:python-
> list-bounces+jr9445=att.com at python.org] On Behalf Of Michel Bouwmans
> Sent: Wednesday, April 09, 2008 3:38 PM
> To: python-list at python.org
> Subject: Stripping scripts from HTML with regular expressions
> 
> Hey everyone,
> 
> I'm trying to strip all script-blocks from a HTML-file using regex.
> 
> I tried the following in Python:
> 
> testfile = open('testfile')
> testhtml = testfile.read()
> regex = re.compile('<script\b[^>]*>(.*?)</script>', re.DOTALL)
> result = regex.sub('', blaat)
> print result
> 
> This strips far more away than just the script-blocks. Am I missing
> something from the regex-implementation from Python or am I doing
> something else wrong?
> 

[Insert obligatory comment about using an HTML-specific parser
(HTMLParser) instead of regexes.]
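
For what it's worth, the parser route could look roughly like the sketch
below. It assumes Python 3, where the class lives in html.parser (in
Python 2 the module is called HTMLParser). The names ScriptStripper and
strip_scripts are made up for illustration. It re-emits everything except
script elements, writes end tags in lower case, and drops processing
instructions, so treat it as a starting point rather than a faithful
round trip.

```python
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Re-emit the document, skipping <script> elements and their contents."""

    def __init__(self):
        # convert_charrefs=False so entities such as &amp; are reported
        # separately and can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            self.in_script = True
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>; a self-closing script is dropped.
        if tag != 'script':
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == 'script':
            self.in_script = False
        else:
            self.out.append('</%s>' % tag)

    def handle_data(self, data):
        # The parser hands script bodies over as plain data; skip them.
        if not self.in_script:
            self.out.append(data)

    def handle_entityref(self, name):
        self.out.append('&%s;' % name)

    def handle_charref(self, name):
        self.out.append('&#%s;' % name)

    def handle_comment(self, data):
        self.out.append('<!--%s-->' % data)

    def handle_decl(self, decl):
        self.out.append('<!%s>' % decl)


def strip_scripts(html):
    parser = ScriptStripper()
    parser.feed(html)
    parser.close()
    return ''.join(parser.out)


page = ('<p>a</p><script type="text/javascript">var x = "<b>";</script>'
        '<p>b &amp; c</p>')
print(strip_scripts(page))
```

Unlike a regex, the parser is not confused by a "<" or ">" inside the
script body, because it treats everything up to the closing script tag as
opaque data.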

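Actually your regex didn't appear to strip anything: in a normal
(non-raw) string '\b' is a backspace character, not a word boundary, so
the pattern will not match an ordinary <script> tag.  You probably saw
stuff disappear because blaat != testhtml, i.e. you read the file into
one variable and ran the substitution on another: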
	testhtml = testfile.read()
	result = regex.sub('', blaat)
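
A quick way to see why the pattern from the question matches nothing is to
compare it with the same pattern written as a raw string. This is only a
small sketch; the sample markup and the variable names are made up for
illustration.

```python
import re

html = '<p>keep</p><script type="text/javascript">alert(1)</script>'

# Without the r prefix, Python turns \b into a backspace (0x08)
# before the regex engine ever sees the pattern.
plain = '<script\b[^>]*>(.*?)</script>'
# With the r prefix, \b reaches the regex engine as a word boundary.
raw = r'<script\b[^>]*>(.*?)</script>'

plain_result = re.sub(plain, '', html, flags=re.DOTALL)
raw_result = re.sub(raw, '', html, flags=re.DOTALL)

print('\x08' in plain)       # True: the pattern contains a backspace
print(plain_result == html)  # True: nothing was stripped
print(raw_result)            # <p>keep</p>
```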

Try this:

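import re

# Read the whole file; 'a.html' is just a sample file name.
testfile = open('a.html')
testhtml = testfile.read()
testfile.close()

# Raw string, so \b reaches the regex engine as a word boundary.
# [^>]* also matches a bare <script> tag that has no attributes.
regex = re.compile(r'<script\b[^>]*>(.*?)</script>', re.DOTALL)
result = regex.sub('', testhtml)

print(result)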
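
If you want to check a pattern like this without a file on disk, you can
run it against an inline sample. The sketch below assumes you also want to
catch upper-case tags, tags without attributes, and a closing tag with a
stray space. The helper name strip_scripts_regex, the extra flags, and the
sample markup are my own additions for illustration.

```python
import re

# Hypothetical helper; re.IGNORECASE and the \s* in the closing tag are
# suggestions on top of the pattern discussed in this thread.
SCRIPT_RE = re.compile(r'<script\b[^>]*>.*?</script\s*>',
                       re.DOTALL | re.IGNORECASE)


def strip_scripts_regex(html):
    """Return html with every <script>...</script> block removed."""
    return SCRIPT_RE.sub('', html)


sample = ('<html><head><script>a()</script></head>\n'
          '<body><SCRIPT src="x.js">\n</SCRIPT ><p>text</p></body></html>')
cleaned = strip_scripts_regex(sample)
print(cleaned)
```

This is still a regex, so a "</script>" inside a JavaScript string literal
will end the match early.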