Trying to find regex for any script in an html source

Mike Meyer mwm at mired.org
Sat Dec 24 15:03:25 EST 2005


"28tommy" <28tommy at gmail.com> writes:
> Hi,
> I'm trying to find scripts in html source of a page retrieved from the
> web.
> I'm trying to use the following rule:
>
> match = re.compile('<script [re.DOTALL]+ src=[re.DOTALL]+>')
>
> I'm testing it on a page that includes the following source:
>
> <script language="JavaScript1.2"
> src="http://i.cnn.net/cnn/.element/ssi/js/1.3/mainVideoMod.js"
> type="text/javascript"></script>
>
> But I get - 'None' as my result.
> Here's (in words) what I'm trying to do: '<script ' followed by any
> type and a number of characters, and then followed by ' src=' followed
> by any type and a number of characters, and then finished by '>'
>
> What am I doing wrong?

Trying to use an RE to parse HTML. While possible, it's not nearly as
easy as it looks, and there are lots of gotchas.
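
To make one of those gotchas concrete: in the quoted pattern,
[re.DOTALL]+ is a character class that matches only the literal
characters r, e, ., D, O, T, A and L. It is not the re.DOTALL flag, so
the pattern fails as soon as it reaches the lowercase "l" of
"language". The sketch below assumes Python 3; the names sample, broken
and SCRIPT_SRC are made up for illustration. It shows a pattern that
happens to match the quoted sample, and then a case where that same
pattern is fooled.

```python
import re

# The tag from the question, spread over three lines.
sample = '''<script language="JavaScript1.2"
src="http://i.cnn.net/cnn/.element/ssi/js/1.3/mainVideoMod.js"
type="text/javascript"></script>'''

# The original pattern: [re.DOTALL]+ is a character class, not a flag.
broken = re.compile('<script [re.DOTALL]+ src=[re.DOTALL]+>')
print(broken.search(sample))  # no match

# Illustrative pattern (an assumption, not a general solution):
# [^>] and \s both match newlines, so no DOTALL flag is needed.
SCRIPT_SRC = re.compile(
    r'<script\b[^>]*?\bsrc\s*=\s*"([^"]*)"[^>]*>',
    re.IGNORECASE,
)

found = SCRIPT_SRC.search(sample)
print(found.group(1))

# Gotcha: the pattern also "finds" a script that is commented out.
commented = '<!-- <script src="old.js"></script> -->'
fooled = SCRIPT_SRC.search(commented)
print(fooled.group(1))
```

The last case is the sort of thing that makes REs a poor fit here: the
pattern has no idea it is looking inside a comment.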

Paul has already pointed out that PyParsing comes with an HTML parser. If
your HTML is well-formed, you can use HTMLParser in the standard
library. If your HTML comes from the web at large (meaning much of it
was written by the people who handed in code that didn't compile for
their programming assignments), you'll want to try something like
BeautifulSoup.
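
As a rough sketch of the HTMLParser route: the code below assumes
Python 3, where the class lives in html.parser (in Python 2 the module
is called HTMLParser). The class name ScriptSrcCollector and the sample
page are made up for illustration. It collects the src attribute of
each script tag it sees.

```python
from html.parser import HTMLParser


class ScriptSrcCollector(HTMLParser):
    """Collect the src attribute of every <script> start tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag and attribute names;
        # attrs is a list of (name, value) pairs.
        if tag == "script":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


# Hypothetical page built around the tag from the question.
page = '''<html><head>
<script language="JavaScript1.2"
src="http://i.cnn.net/cnn/.element/ssi/js/1.3/mainVideoMod.js"
type="text/javascript"></script>
</head><body></body></html>'''

collector = ScriptSrcCollector()
collector.feed(page)
collector.close()
print(collector.sources)
```

Attributes split across lines are handled by the parser, so none of
the whitespace juggling an RE would need shows up here.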

        <mike
-- 
Mike Meyer <mwm at mired.org>			http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.


