Stripping scripts from HTML with regular expressions

Paul McGuire ptmcg at austin.rr.com
Thu Apr 10 11:21:05 EDT 2008


On Apr 9, 2:38 pm, Michel Bouwmans <mfb.chikaz... at gmail.com> wrote:
> Hey everyone,
>
> I'm trying to strip all script-blocks from an HTML-file using regex.
>
> I tried the following in Python:
>
> testfile = open('testfile')
> testhtml = testfile.read()
> regex = re.compile('<script\b[^>]*>(.*?)</script>', re.DOTALL)
> result = regex.sub('', blaat)
> print result
>
> This strips far more away than just the script-blocks. Am I missing
> something from the regex-implementation from Python or am I doing something
> else wrong?
>
> greetz
> MFB
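
Two things in the quoted snippet stand out.  In a non-raw string
literal, '\b' is a backspace character rather than a regex word
boundary, and sub() is called on `blaat` although the file contents
were read into `testhtml`.  A minimal regex-only sketch with both
points addressed might look like the following (Python 3 syntax; the
function name and the sample string are made up for illustration, and
a regex like this is not a robust HTML parser):

```python
import re

# Raw string so that \b reaches the regex engine as a word boundary
# instead of being turned into a backspace character by Python.
# IGNORECASE and the optional whitespace before the final '>' are
# additions that the quoted snippet did not have.
SCRIPT_RE = re.compile(r'<script\b[^>]*>.*?</script\s*>',
                       re.DOTALL | re.IGNORECASE)

def strip_scripts_regex(html):
    """Return html with every <script>...</script> block removed."""
    return SCRIPT_RE.sub('', html)

sample = ('<p>a</p><script>one()</script><p>b</p>'
          '<SCRIPT src="x.js"></SCRIPT><p>c</p>')
result = strip_scripts_regex(sample)
print(result)
```

The non-greedy `.*?` together with re.DOTALL keeps each match inside a
single script block, even when the block spans several lines.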

This pyparsing-based HTML stripper
(http://pyparsing.wikispaces.com/space/showimage/htmlStripper.py)
strips *all* HTML tags, scripts, and comments.  To strip only the
scripts, change this line:

firstPass = (htmlComment | scriptBody | commonHTMLEntity |
             anyTag | anyClose ).transformString(targetHTML)

to:

firstPass = scriptBody.transformString(targetHTML)
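
For readers without the script at hand, here is a self-contained
sketch of how a `scriptBody` expression of this kind could be built.
The definition below is an assumption written for illustration, not a
quote from htmlStripper.py, and `strip_scripts` is a made-up helper
name.  It needs the pyparsing package installed:

```python
from pyparsing import makeHTMLTags, SkipTo, replaceWith

# Expressions for the opening and closing <script> tags; the opening
# tag expression also accepts attributes such as type="...".
scriptOpen, scriptClose = makeHTMLTags("script")

# A script block is the opening tag, everything up to the closing
# tag, and the closing tag itself.
scriptBody = scriptOpen + SkipTo(scriptClose) + scriptClose

# Replace every matched block with an empty string.
scriptBody.setParseAction(replaceWith(""))

def strip_scripts(targetHTML):
    """Return targetHTML with its script blocks removed."""
    return scriptBody.transformString(targetHTML)

sample = ('<p>before</p>'
          '<script type="text/javascript">alert("x < y");</script>'
          '<p>after</p>')
firstPass = strip_scripts(sample)
print(firstPass)
```

transformString() leaves everything that does not match `scriptBody`
unchanged, so the surrounding markup passes through as-is.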

-- Paul



More information about the Python-list mailing list