SGML parsing tags and keeping track

Heiko Wundram me+python at modelnine.org
Tue May 2 17:49:05 EDT 2006


On Tuesday 02 May 2006 20:38, hapaboy2059 at gmail.com wrote:
> could i make a global variable and keep track of each tag count?
>
> Also how would i make a list or dictionary of tags that is found?
> how can i handle any tag that is given?

The following snippet does what you want:

>>>
from sgmllib import SGMLParser

class MyParser(SGMLParser):

    def __init__(self):
        SGMLParser.__init__(self)
        self.tagcount = {}
        self.links = set()

    # Tag count handling
    # ------------------

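    def handle_starttag(self,tag,method,args):
        # Called for tags that have a start_<tag> or do_<tag> method.
        self.tagcount[tag] = self.tagcount.get(tag,0) + 1
        method(args)

    def unknown_starttag(self,tag,args):
        # Called for every tag that has no such method.
        self.tagcount[tag] = self.tagcount.get(tag,0) + 1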

    # Argument handling
    # -----------------

    def start_a(self,args):
        self.links.update([value for name, value in args if name == "href"])

parser = MyParser()
parser.feed(open("test.html").read()) # Insert your data source here...
parser.close()

print parser.tagcount
print parser.links
>>>

See the sgmllib documentation for more info on handle_starttag and
unknown_starttag. The counting logic in handle_starttag might just as well
have been implemented in start_a, but if you want argument handling for more
tags, it's best to keep the counting in this one central place.
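
For anyone reading this on Python 3, where sgmllib no longer exists, roughly the same idea can be sketched with html.parser.HTMLParser. This is not the snippet above, just a minimal sketch under a few assumptions: the class name TagCounter and the inline sample string are made up for illustration, and the start_<tag> lookup is done by hand with getattr, since HTMLParser does not dispatch that way itself. Because HTMLParser calls handle_starttag for every start tag, no separate unknown_starttag should be needed.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start tags and collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.tagcount = Counter()
        self.links = set()

    # Tag count handling
    def handle_starttag(self, tag, attrs):
        # HTMLParser calls this for every start tag, so one counter suffices.
        self.tagcount[tag] += 1
        # Hand-rolled imitation of sgmllib's start_<tag> dispatch
        # (an assumption of this sketch, not an HTMLParser feature).
        method = getattr(self, "start_" + tag, None)
        if method is not None:
            method(attrs)

    # Argument handling
    def start_a(self, attrs):
        # attrs is a list of (name, value) pairs; skip href without a value.
        self.links.update(value for name, value in attrs
                          if name == "href" and value)


# Hypothetical sample input; replace with your own data source.
sample = ('<html><body><p>One</p>'
          '<p><a href="a.html">A</a> <a href="b.html">B</a></p>'
          '</body></html>')

parser = TagCounter()
parser.feed(sample)
parser.close()

print(dict(parser.tagcount))
print(sorted(parser.links))
```

Adding a start_img or similar method would then work the same way as start_a, with the counting still done in one place.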

--- Heiko.


