Regular Expression question

Paul McGuire ptmcg at austin.rr._bogus_.com
Thu Jun 8 00:49:34 EDT 2006


"Frank Potter" <could.net at gmail.com> wrote in message
news:mailman.6720.1149730752.27775.python-list at python.org...
> pyparsing is cool.
> but use only re is also OK
> # -*- coding: UTF-8 -*-
> import urllib2
> html=urllib2.urlopen(ur"http://www.yahoo.com/").read()
>
> import re
> r=re.compile('<img\s+src="(?P<image>[^"]+)"[^>]*>',re.IGNORECASE)
> for m in r.finditer(html):
>     print m.group('image')
>

Ouch - this fails to match any <img> tag that has some other attribute, such
as "height" or "width", before the "src" attribute.  www.yahoo.com has
several such tags.
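
To make the failure concrete, here is a minimal sketch that runs the quoted pattern against two made-up <img> tags instead of the live Yahoo page. The sample markup and the names IMG_SRC_FIRST, SAMPLE_HTML and find_images are assumptions for illustration only; the pattern itself is the one from the quoted post, written as a raw string.

```python
import re

# Pattern from the quoted post (raw string added so "\s" is unambiguous).
IMG_SRC_FIRST = re.compile(r'<img\s+src="(?P<image>[^"]+)"[^>]*>', re.IGNORECASE)

# Hypothetical sample markup, not taken from www.yahoo.com.
SAMPLE_HTML = (
    '<img src="first.gif" alt="src comes first">\n'
    '<img width="40" height="20" src="later.gif">\n'
)

def find_images(html):
    """Return the src values the pattern manages to capture."""
    return [m.group("image") for m in IMG_SRC_FIRST.finditer(html)]

# Only the tag whose first attribute is src is reported;
# the second tag is silently skipped.
print(find_images(SAMPLE_HTML))
```

The second tag is missed because the pattern requires "src" to follow "<img" and whitespace directly, so any leading attribute defeats it.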

On the other hand, pyparsing's makeHTMLTags defines a starting tag
expression that looks for (conceptually):

    < tagname ZeroOrMore(attrname '=' value) Optional('/') >

and does not assume that the first attribute is "src", or anything else for
that matter.

The returned results make the tag attributes accessible as object attributes
or dictionary keys.
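
As a rough sketch of what that looks like in practice, the following uses makeHTMLTags and scanString on a few made-up tags. The sample markup and the names SAMPLE_HTML and find_images are assumptions for illustration, and pyparsing must be installed separately. Newer pyparsing releases also offer the same calls under the names make_html_tags and scan_string.

```python
from pyparsing import makeHTMLTags

# Hypothetical sample markup: src appears first, last, and in upper case.
SAMPLE_HTML = (
    '<img src="first.gif" alt="src comes first">\n'
    '<img width="40" height="20" src="later.gif">\n'
    '<IMG HEIGHT="20" SRC="upper.gif" />\n'
)

# makeHTMLTags returns a pair of expressions: opening tag and closing tag.
img_start, img_end = makeHTMLTags("img")

def find_images(html):
    """Collect the src attribute of every <img> tag, wherever it appears."""
    images = []
    for tokens, start, end in img_start.scanString(html):
        # Attributes are available as dictionary keys (tokens["src"])
        # or as object attributes (tokens.src); get() skips tags without src.
        src = tokens.get("src")
        if src is not None:
            images.append(src)
    return images

print(find_images(SAMPLE_HTML))
```

Attribute order does not matter here, and the tag and attribute names are matched without regard to case.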

-- Paul




