Regular Expression question

Neil Cerutti horpner at yahoo.com
Mon Aug 21 09:46:44 EDT 2006


On 2006-08-21, stevebread at yahoo.com <stevebread at yahoo.com> wrote:
> Hi, I am having some difficulty trying to create a regular expression.
>
> Consider:
>
><tag1 name="john"/>  <br/> <tag2 value="adj__tall__"/>
><tag1 name="joe"/>
><tag1 name="jack"/>
><tag2 value="adj__short__"/>
>
> Whenever a tag1 is followed by a tag2, I want to retrieve the
> values of the tag1:name and tag2:value attributes. So my end
> result here should be
>
> john, tall
> jack, short
>
> Ideas?

It seems to me that an HTML parser might be a better solution.

Here's a slapped-together example. It uses a simple two-state
machine: it waits for a tag1, then pairs it with the next tag2.

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.state = "get name"
        self.name_attrs = None
        self.result = {}
        
    def handle_starttag(self, tag, attrs):
        if self.state == "get name":
            if tag == "tag1":
                self.name_attrs = attrs
                self.state = "found name"
        elif self.state == "found name":
            if tag == "tag2":
                name = None
                for attr in self.name_attrs:
                    if attr[0] == "name":
                        name = attr[1]
                adj = None
                for attr in attrs:
                    # Values look like "adj__tall__": drop the "adj__"
                    # prefix and the "__" suffix.
                    if attr[0] == "value" and attr[1][:3] == "adj":
                        adj = attr[1][5:-2]
                if name is None or adj is None:
                    print "Markup error: expected attributes missing."
                else:
                    self.result[name] = adj
                self.state = "get name"
            elif tag == "tag1":
                # A new tag1 overrides the old one
                self.name_attrs = attrs
    
p = MyHTMLParser()
p.feed("""
    <tag1 name="john"/>  <br/> <tag2 value="adj__tall__"/>
    <tag1 name="joe"/>
    <tag1 name="jack"/>
    <tag2 value="adj__short__"/>
""")
p.close()  # flush any buffered input before reading the result
print repr(p.result)
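
For comparison with the parser above, here is roughly what the
regular expression originally asked for could look like. It's only
a sketch, written for Python 3, and it assumes the markup always
looks like the sample: double-quoted attributes, name and value as
the first attribute of their tags, and self-closing tags. PAIR_RE
and find_pairs are names made up for this example.

```python
import re

# Assumption: markup shaped exactly like the sample in the question.
PAIR_RE = re.compile(
    r'<tag1\s+name="([^"]*)"\s*/>'           # a tag1 and its name
    r'(?:(?!<tag[12]\b).)*?'                 # skip anything that is not a tag1/tag2
    r'<tag2\s+value="adj__([^"]*?)__"\s*/>',  # the tag2 that follows
    re.DOTALL,                               # let "." cross line breaks
)


def find_pairs(text):
    """Return (name, adjective) tuples for each tag1 directly followed by a tag2."""
    return PAIR_RE.findall(text)
```

On the sample input this finds the john/tall and jack/short pairs,
but any change in quoting or attribute order breaks it, which is
why I'd still reach for the parser.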

There's probably a better way to search for attributes in attrs
than "for attr in attrs", but I didn't think of it, and the
example I found on the net used the same idiom. The format of
attrs seems strange: it's a list of (name, value) pairs. Why
isn't it a dictionary?
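
One way around the attribute loop is to pass attrs through dict(),
since it is a list of (name, value) pairs. A minimal sketch,
assuming Python 3 (where the module is html.parser rather than
HTMLParser) and no repeated attribute names within a tag. The names
AttrDictParser, pending_name, pairs and extract_pairs are made up
for this example, and unlike the code above it silently skips a
tag2 without an adj value instead of printing an error.

```python
from html.parser import HTMLParser


class AttrDictParser(HTMLParser):
    """Hypothetical variant of the parser above that uses dict(attrs)."""

    def __init__(self):
        super().__init__()
        self.pending_name = None  # name from the most recent unmatched tag1
        self.pairs = []           # (name, adjective) tuples in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # list of (name, value) pairs -> dictionary
        if tag == "tag1":
            # A new tag1 overrides any earlier unmatched one.
            self.pending_name = attrs.get("name")
        elif tag == "tag2" and self.pending_name is not None:
            value = attrs.get("value") or ""
            if value.startswith("adj__") and value.endswith("__"):
                self.pairs.append((self.pending_name, value[5:-2]))
            self.pending_name = None  # go back to waiting for a tag1


def extract_pairs(markup):
    """Feed markup to the parser and return the collected pairs."""
    parser = AttrDictParser()
    parser.feed(markup)
    parser.close()
    return parser.pairs
```

Keeping the pairs in a list rather than a dictionary preserves
their order and allows the same name to appear more than once.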

-- 
Neil Cerutti
Sermon Outline: I. Delineate your fear II. Disown your fear III.
Displace your rear --Church Bulletin Blooper


