Trying to find regex for any script in an html source

Mitja Trampus nun at example.com
Wed Dec 21 17:01:06 EST 2005


28tommy wrote:
> Hi,
> I'm trying to find scripts in html source of a page retrieved from the
> web.
> I'm trying to use the following rule:
> 
> match = re.compile('<script [re.DOTALL]+ src=[re.DOTALL]+>')
> 
> I'm testing it on a page that includes the following source:
> 
> <script language="JavaScript1.2"
> src="http://i.cnn.net/cnn/.element/ssi/js/1.3/mainVideoMod.js"
> type="text/javascript"></script>
> 
> But I get - 'None' as my result.
> Here's (in words) what I'm trying to do: '<script ' followed by any
> type and number of characters, then ' src=' followed by any type and
> number of characters, and finally '>'
> 
> What am I doing wrong?

Several things.
First, re.DOTALL is a flag, a _parameter_ to be passed to 
the compile function, not something you stick inside the RE 
itself (inside brackets it is just a set of literal 
characters):
re.compile('<script .+ src=.+>',re.DOTALL)
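
To make the difference concrete, here is a minimal sketch (the test strings are made up for illustration). It shows that [re.DOTALL] inside a pattern is only a character class, and that the flag does its job when passed as the second argument:

```python
import re

# Inside brackets, "re.DOTALL" is just a set of literal characters:
# r, e, ., D, O, T, A, L. It is not interpreted as a flag.
char_class = re.compile('[re.DOTALL]+')
class_match = char_class.fullmatch('DOTALL.re')   # only characters from the set
class_miss = char_class.fullmatch('language')     # contains other characters

# Passed as the second argument, re.DOTALL lets "." match newlines too.
text = 'a\nb'
without_flag = re.search('a.b', text)
with_flag = re.search('a.b', text, re.DOTALL)

print(class_match is not None, class_miss is None)
print(without_flag is None, with_flag is not None)
```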

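Second, this still won't match your example above, because 
the RE demands a space right before src=, while in your 
example src= comes right after a newline (it would also fail 
if src appeared immediately after script). So you probably 
want something like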
re.compile('<script .*src=.+>',re.DOTALL)
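
A quick way to check both patterns is to run them against the tag quoted above. This is only a sketch: the variable names are mine, and I am assuming the tag is broken across lines exactly as it appears in the quoted message:

```python
import re

# The tag from the question, with the line breaks as quoted.
html = ('<script language="JavaScript1.2"\n'
        'src="http://i.cnn.net/cnn/.element/ssi/js/1.3/mainVideoMod.js"\n'
        'type="text/javascript"></script>')

strict = re.compile('<script .+ src=.+>', re.DOTALL)
relaxed = re.compile('<script .*src=.+>', re.DOTALL)

# search() scans the whole string; match() would only try position 0.
strict_result = strict.search(html)    # no ' src=' with a leading space here
relaxed_result = relaxed.search(html)

print(strict_result)
print(relaxed_result is not None)
```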

Third, * and + are _greedy_ by default, which means they 
will "eat up" as many characters as possible. Try it and see 
what I mean. The solution is to use the non-greedy variants 
of * and +, that is *? and +?
re.compile('<script .*?src=.+?>',re.DOTALL)
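
The effect is easiest to see on a page with more than one script tag. In the sketch below the page content and the file names a.js and b.js are invented for illustration:

```python
import re

# Hypothetical page with two script tags and some text in between.
page = ('<script language="JavaScript1.2"\n'
        'src="a.js"></script>\n'
        '<p>some text</p>\n'
        '<script src="b.js"></script>')

greedy = re.compile('<script .*src=.+>', re.DOTALL)
lazy = re.compile('<script .*?src=.+?>', re.DOTALL)

# Greedy: one match running from the first '<script ' to the last '>'.
greedy_hits = greedy.findall(page)
# Non-greedy: one match per opening tag, each ending at its own '>'.
lazy_hits = lazy.findall(page)

print(len(greedy_hits), len(lazy_hits))
for hit in lazy_hits:
    print(repr(hit))
```

The greedy version swallows everything between the two tags, including the paragraph, which is rarely what you want when extracting scripts.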

All this and more at
http://docs.python.org/lib/module-re.html
and, I'm sure, several online tutorials. To RTFM is never a 
bad idea.


