Trying to find regex for any script in an html source

Paul McGuire ptmcg at austin.rr._bogus_.com
Wed Dec 21 18:25:58 EST 2005


"28tommy" <28tommy at gmail.com> wrote in message
news:1135200684.657201.263980 at f14g2000cwb.googlegroups.com...
> Hi,
> I'm trying to find scripts in html source of a page retrieved from the
> web.
> I'm trying to use the following rule:
>
> match = re.compile('<script [re.DOTALL]+ src=[re.DOTALL]+>')
>
     <snip>
>

28tommy -

pyparsing includes a built-in helper, makeHTMLTags, that defines HTML tag
expressions and handles tag attributes automatically.  You can also tell
pyparsing to *not* accept tags found inside HTML comments, something that is
not so easy with re's (your target HTML pages may not have comments, so I
don't know if this is of much interest to you).  Finally, accessing the
results is very easy, especially for getting at the values of attributes
defined in the opening tag.  See the following example.

Note - pyparsing is considered by some to be "way overkill" for simple HTML
scraping, and is probably 20-100X slower than regular expressions.  But as
quick text processing and extraction tools go, it makes it pretty easy to put
together fairly complex match expressions, without the noisy typography of
regular expressions.
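
For comparison, here is roughly what a working re version of the original
rule could look like.  In the posted pattern, [re.DOTALL]+ is a character
class that matches the literal characters r, e, ., D, O, T, A and L; it does
not turn on the DOTALL flag, which would have to be passed to re.compile
instead.  This is only a sketch: the name SCRIPT_SRC, the pattern itself and
the example.com URLs are assumptions made up for illustration, and unlike the
pyparsing grammar it will happily match tags inside HTML comments.

```python
import re

# Hypothetical pattern: an opening <script ...> tag carrying a quoted src
# attribute. [^>]* keeps the match inside one tag and already spans
# newlines, so re.DOTALL is not needed here.
SCRIPT_SRC = re.compile(
    r'<script\b[^>]*?\bsrc\s*=\s*["\']([^"\']*)["\'][^>]*>',
    re.IGNORECASE,
)

# Placeholder data, not taken from a real page
html = '''
<script language="JavaScript1.2"
src="http://example.com/a.js"
type="text/javascript"></script>
<!--
<script src="http://example.com/commented.js"></script>
-->
'''

# findall returns the captured src values, including the commented-out one
srcs = SCRIPT_SRC.findall(html)
print(srcs)
```

The commented-out script shows up in the result, which is the case the
htmlComment handling in the pyparsing example is meant to avoid.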

Download pyparsing at http://pyparsing.sourceforge.net.

-- Paul


from pyparsing import makeHTMLTags, htmlComment

data = """
<script language="JavaScript1.2"
src="http://i.cnn.net/cnn/.element/ssi/js/1.3/mainVideoMod.js"
type="text/javascript"></script>

<!--
<script language="JavaScript1.2"
src="http://i.cnn.net/cnn/.element/ssi/js/1.3/notSureAboutThisScript.js"
type="text/javascript"></script>
-->

<script language="JavaScript1.2"
src="http://i.cnn.net/cnn/.element/ssi/js/1.3/anotherScript.js"
type="text/javascript"></script>
"""

# next three lines define grammar for <script> and </script>,
# plus arbitrary HTML attributes on <script>, plus detection and
# ignoring of any matching expression that might be found inside
# an HTML comment
scriptStart,scriptEnd = makeHTMLTags("script")
expr = scriptStart + scriptEnd
expr.ignore(htmlComment)

# use the grammar to scan the data string
# for each match, return matching tokens as a ParseResults object
# - supports list-, dictionary-, and object-style token access
for toks,start,end in expr.scanString(data):
    print toks.startScript
    print toks.startScript[0]
    print toks.startScript.keys()
    print "src =", toks.startScript["src"]
    print "src =", toks.startScript.src
    print


====================
['script', ['language', 'JavaScript1.2'], ['src',
'http://i.cnn.net/cnn/.element/ssi/js/1.3/mainVideoMod.js'], ['type',
'text/javascript'], False]
script
['src', 'type', 'language', 'empty']
src = http://i.cnn.net/cnn/.element/ssi/js/1.3/mainVideoMod.js
src = http://i.cnn.net/cnn/.element/ssi/js/1.3/mainVideoMod.js

['script', ['language', 'JavaScript1.2'], ['src',
'http://i.cnn.net/cnn/.element/ssi/js/1.3/anotherScript.js'], ['type',
'text/javascript'], False]
script
['src', 'type', 'language', 'empty']
src = http://i.cnn.net/cnn/.element/ssi/js/1.3/anotherScript.js
src = http://i.cnn.net/cnn/.element/ssi/js/1.3/anotherScript.js
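
If installing pyparsing is not an option, roughly the same extraction can be
sketched in Python 3 with the standard library's html.parser module.  This is
a different technique from the grammar above, not a pyparsing feature, and
the class name ScriptSrcCollector and the sample data are made up for
illustration.  HTMLParser hands comments to handle_comment, so a script tag
inside a comment never reaches handle_starttag.

```python
from html.parser import HTMLParser


class ScriptSrcCollector(HTMLParser):
    """Collect the src attribute of every <script> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names; attrs is a list of (name, value) pairs
        if tag == "script":
            src = dict(attrs).get("src")
            if src is not None:
                self.sources.append(src)


# Placeholder data shaped like the example above
sample = """
<script language="JavaScript1.2"
src="http://example.com/js/main.js"
type="text/javascript"></script>

<!--
<script src="http://example.com/js/disabled.js"></script>
-->

<script src="http://example.com/js/another.js"></script>
"""

collector = ScriptSrcCollector()
collector.feed(sample)
collector.close()
print(collector.sources)
```

Only the two scripts outside the comment are collected, matching the two
results printed by the pyparsing version.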




