Regular Expression question

Paul McGuire ptmcg at austin.rr._bogus_.com
Wed Jun 7 19:22:03 EDT 2006


<ken.carlino at gmail.com> wrote in message
news:1149714949.542234.148800 at y43g2000cwc.googlegroups.com...
> Hi,
> I am new to Python regular expressions. I would like to use them to get
> an attribute of an HTML element from an HTML file.
>
> for example, I was able to read the html file using this:
>     req = urllib2.Request(url=acaURL)
>     f = urllib2.urlopen(req)
>
>     data = f.read()
>
> My question is: how can I get just the src attribute value of an img
> tag?
> something like this:
> (.*)<img src="href of the image source">(.*)
>
> I need to get the href of the image source.
>
> Thanks.
>

As Fredrik pointed out, regular expressions are not the only tool for this
job.  Here's a pyparsing solution that scans the page for <img> tags and
reads the src attribute from each match.

-- Paul


# Python 3 port of the original listing (which used Python 2's
# urllib.urlopen and the print statement)
import urllib.request

import pyparsing

# define HTML tag format using makeHTMLTags helper
# (we don't really care about the ending </img> tag,
# even though makeHTMLTags returns definitions for both
# starting and ending tag patterns)
imgStartTag, dummy = pyparsing.makeHTMLTags("img")

# get HTML source from some web site; read() returns bytes in Python 3,
# so decode to text before scanning (UTF-8 is assumed here)
htmlPage = urllib.request.urlopen("http://www.yahoo.com")
htmlSource = htmlPage.read().decode("utf-8", errors="replace")
htmlPage.close()

# scan HTML source, printing SRC attribute from each <img> tag
for tokens, start, end in imgStartTag.scanString(htmlSource):
    print(tokens.src)


Prints:

http://us.i1.yimg.com/us.yimg.com/i/ww/beta/edit_plink.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/125.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/13441.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/136.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/beta/y3.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/ml.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/my.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/msgn.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/v5_mail_t2.gif
http://us.i1.yimg.com/us.yimg.com/i/mntl/aut/06q2/hea_0411.gif
http://us.i1.yimg.com/us.yimg.com/i/mntl/aut/06q2/img_0607.jpg
http://us.i1.yimg.com/us.yimg.com/i/ww/news/2006/06/07/0607notorious_big.jpg
http://us.i1.yimg.com/us.yimg.com/i/ww/beta/news/video.gif
http://us.i1.yimg.com/us.yimg.com/i/buzz/2006/06/wholefoodssmall.jpg
http://us.i1.yimg.com/us.yimg.com/i/mntl/msg/06q2/img_im.jpg
http://us.i1.yimg.com/us.yimg.com/i/ww/trfc_bckt.gif
http://us.i1.yimg.com/us.yimg.com/i/mntl/sh/04q2/camera.gif
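
For comparison, the same extraction can be sketched with only the Python 3
standard library.  This is not the pyparsing approach above but a swapped-in
alternative based on html.parser, run against a small inline HTML string so
that it needs no network access.  The names ImgSrcCollector, img_sources and
sample are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Hypothetical helper: collects the src value of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling us,
        # so <IMG SRC=...> and <img src=...> are handled alike.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)


def img_sources(html_source):
    """Return the src values of all <img> tags, in document order."""
    collector = ImgSrcCollector()
    collector.feed(html_source)
    collector.close()
    return collector.sources


# Made-up sample markup, standing in for a downloaded page
sample = """<html><body>
<IMG SRC="http://example.com/a.gif" alt="a">
<p>text <img alt='b' src='/img/b.jpg' /></p>
<img alt="no source here">
</body></html>"""

for src in img_sources(sample):
    print(src)
```

Self-closing tags such as <img ... /> are covered because the default
handle_startendtag forwards to handle_starttag.  Tags without a src attribute
are simply skipped.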




