SGML parsing tags and keeping track
Heiko Wundram
me+python at modelnine.org
Tue May 2 17:49:05 EDT 2006
On Tuesday 02 May 2006 20:38, hapaboy2059 at gmail.com wrote:
> could i make a global variable and keep track of each tag count?
>
> Also how would i make a list or dictionary of tags that is found?
> how can i handle any tag that is given?
The following snippet does what you want:
>>>
from sgmllib import SGMLParser

class MyParser(SGMLParser):

    def __init__(self):
        SGMLParser.__init__(self)
        self.tagcount = {}
        self.links = set()

    # Tag count handling
    # ------------------

    # Called for tags that have a start_<tag> or do_<tag> method.
    def handle_starttag(self, tag, method, args):
        self.tagcount[tag] = self.tagcount.get(tag, 0) + 1
        method(args)

    # Called for every tag without such a method.
    def unknown_starttag(self, tag, args):
        self.tagcount[tag] = self.tagcount.get(tag, 0) + 1

    # Argument handling
    # -----------------

    def start_a(self, args):
        self.links.update([value for name, value in args if name == "href"])

parser = MyParser()
parser.feed(file("test.html").read()) # Insert your data source here...
parser.close()

print parser.tagcount
print parser.links
>>>
See the sgmllib documentation for more info on handle_starttag and
unknown_starttag. The counting logic in handle_starttag might just as well
have been implemented in start_a, but if you want argument handling for more
tags, it's best to keep the counting in this one central place.
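
If sgmllib is not available to you (it was removed in Python 3), the same idea can be sketched with html.parser.HTMLParser. This is only a rough equivalent, not a drop-in replacement: there, handle_starttag receives every start tag, so no separate unknown_starttag is needed. The class name TagCounter and the sample input are made up for illustration.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count every start tag and collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.tagcount = {}
        self.links = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser calls this for every start tag, so one counter suffices.
        self.tagcount[tag] = self.tagcount.get(tag, 0) + 1
        if tag == "a":
            # attrs is a list of (name, value) pairs; value may be None.
            self.links.update(
                value for name, value in attrs
                if name == "href" and value is not None
            )


# Hypothetical sample input; insert your own data source here.
sample = (
    '<html><body><p>One</p>'
    '<p><a href="http://example.org/">x</a></p></body></html>'
)

parser = TagCounter()
parser.feed(sample)
parser.close()

print(parser.tagcount)
print(parser.links)
```

Handling for other tags would go into the same handle_starttag method, which keeps the counting in one central place just as above.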
--- Heiko.