Stripping scripts from HTML with regular expressions
Reedick, Andrew
jr9445 at ATT.COM
Wed Apr 9 16:11:05 EDT 2008
> -----Original Message-----
> From: python-list-bounces+jr9445=att.com at python.org [mailto:python-
> list-bounces+jr9445=att.com at python.org] On Behalf Of Michel Bouwmans
> Sent: Wednesday, April 09, 2008 3:38 PM
> To: python-list at python.org
> Subject: Stripping scripts from HTML with regular expressions
>
> Hey everyone,
>
> I'm trying to strip all script-blocks from a HTML-file using regex.
>
> I tried the following in Python:
>
> testfile = open('testfile')
> testhtml = testfile.read()
> regex = re.compile('<script\b[^>]*>(.*?)</script>', re.DOTALL)
> result = regex.sub('', blaat)
> print result
>
> This strips far more away than just the script-blocks. Am I missing
> something from the regex-implementation from Python or am I doing
> something else wrong?
>
[Insert obligatory comment about using an HTML-specific parser
(HTMLParser) instead of regexes.]
Actually, your regex doesn't appear to strip anything. You probably saw
stuff disappear because the substitution ran on blaat, not on the
testhtml you had just read:
testhtml = testfile.read()
result = regex.sub('', blaat)
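As for why the pattern itself matches nothing: assuming it was typed exactly as quoted (an ordinary string literal, no r prefix), Python converts \b into a backspace character before the regex engine sees it. A minimal sketch of the difference, using a made-up sample string rather than the original test file:

```python
import re

# Made-up sample input, not the poster's actual file.
html = '<p>before</p><script type="text/javascript">alert(1)</script><p>after</p>'

# Ordinary string literal: Python turns \b into a backspace character (0x08)
# before the regex engine ever sees the pattern.
plain = re.compile('<script\b[^>]*>(.*?)</script>', re.DOTALL)

# Raw string literal: \b reaches the regex engine as a word boundary.
raw = re.compile(r'<script\b[^>]*>(.*?)</script>', re.DOTALL)

plain_result = plain.sub('', html)  # nothing matches, so the text is unchanged
raw_result = raw.sub('', html)      # the script block is removed

print(plain_result == html)  # True
print(raw_result)            # <p>before</p><p>after</p>
```

So with the r prefix, the pattern from the question would probably have worked as intended.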
Try this:
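import re
# Read the whole file; the with block closes it afterwards.
with open('a.html') as testfile:
    testhtml = testfile.read()
# Raw string, so \b reaches the regex engine as a word boundary and the
# pattern matches both <script> and <script ...> opening tags.
regex = re.compile(r'<script\b[^>]*>.*?</script>', re.DOTALL)
result = regex.sub('', testhtml)
print(result)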
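For completeness, here is a rough sketch of the parser-based route mentioned at the top of this reply. It assumes Python 3's html.parser module; the ScriptStripper class and strip_scripts function are hypothetical names made up for this example, and the output is lightly normalized (end tags come back lowercased, processing instructions are dropped), so treat it as a starting point rather than a finished tool.

```python
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Hypothetical helper: re-emits the markup it is fed, minus <script> blocks."""

    def __init__(self):
        # convert_charrefs=False so references such as &amp; are passed to
        # handle_entityref/handle_charref and can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == 'script':          # tag names arrive lowercased
            self.in_script = True
        else:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>: emit the original text once.
        if tag != 'script':
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == 'script':
            self.in_script = False
        else:
            self.parts.append('</%s>' % tag)

    def handle_data(self, data):
        # Script bodies arrive here as plain data; skip them.
        if not self.in_script:
            self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append('&%s;' % name)

    def handle_charref(self, name):
        self.parts.append('&#%s;' % name)

    def handle_comment(self, data):
        self.parts.append('<!--%s-->' % data)

    def handle_decl(self, decl):
        self.parts.append('<!%s>' % decl)


def strip_scripts(html):
    """Return html with every <script>...</script> block removed."""
    parser = ScriptStripper()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)


print(strip_scripts('<p>a</p><script type="text/javascript">alert(1)</script><p>b</p>'))
```

The parser tracks when it is inside a script element, so it does not depend on how the opening tag is spelled or whether it carries attributes.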