Newbie ? -- SGML metadata extraction

Adonis adonisv at DELETETHISTEXTearthlink.net
Mon Jan 16 19:01:44 EST 2006


ProvoWallis wrote:

<snip>

From what I gather, here is a quick solution. Better ones are probably
on the way, but I think this accomplishes the idea.

Some helpful links:
http://docs.python.org/lib/module-sgmllib.html
http://docs.python.org/lib/module-HTMLParser.html
http://docs.python.org/lib/module-htmllib.html

---

from HTMLParser import HTMLParser

data = """<main-section no="1">

<form id="graphic_1.tif">
<form id="graphic_2.tif">

<main-section no="2">

<form id="graphic_3.tif">

<main-section no="3">

<form id="graphic_4.tif">
<form id="graphic_5.tif">
<form id="graphic_6.tif">
"""

class ParseForms(HTMLParser):

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            # attrs is a list of (attribute, value) tuples;
            # convert it to a dictionary for easier attribute access
            print "form id: %s" % dict(attrs).get('id')

if __name__ == "__main__":
    parser = ParseForms()
    parser.feed(data)
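
---

If the goal is to know which graphics belong to which main-section (a guess on my part, since the original question is snipped), the same handler can remember the most recent main-section number and group the form ids under it. This is only a sketch: it uses the Python 3 module name (html.parser rather than HTMLParser), and the names SectionForms, by_section and group_forms are made up for illustration.

```python
from html.parser import HTMLParser

data = """<main-section no="1">

<form id="graphic_1.tif">
<form id="graphic_2.tif">

<main-section no="2">

<form id="graphic_3.tif">

<main-section no="3">

<form id="graphic_4.tif">
<form id="graphic_5.tif">
<form id="graphic_6.tif">
"""


class SectionForms(HTMLParser):
    """Collect form ids, grouped by the enclosing main-section number."""

    def __init__(self):
        super().__init__()
        self.current = None    # "no" value of the last main-section seen
        self.by_section = {}   # section number -> list of form ids

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "main-section":
            # Start a new group for this section.
            self.current = attrs.get("no")
            self.by_section.setdefault(self.current, [])
        elif tag == "form" and attrs.get("id") is not None:
            # File the form under the section currently open.
            self.by_section.setdefault(self.current, []).append(attrs["id"])


def group_forms(text):
    """Parse text and return a dict of section number -> form ids."""
    parser = SectionForms()
    parser.feed(text)
    parser.close()
    return parser.by_section


if __name__ == "__main__":
    for no, ids in group_forms(data).items():
        print("main-section %s: %s" % (no, ", ".join(ids)))
```

Forms that appear before any main-section end up under the key None, which seemed safer than silently dropping them.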


