HTML / DOM

Fri Mar 28 19:41:30 EST 2003

Bo M. Maryniuck wrote:
> Hello, all.
> 
> Can anybody show me some real code for working with a DOM in _HTML_ that is 
> not even XHTML? I took a look at 4DOM, but unfortunately the documentation 
> there is too silly. :( For example, I have an HTML string:
> 
> 	<p>Text here <a name="foo">bar</a></p>
> 
> Now, how do I build a DOM from this chunk so that I can do the following:
> 	1. Somehow fetch the "name" attribute from the <a> tag
> 	2. Change it (the "foo" value, not "bar"!)
> 	3. Push it back into the same place
> 	4. Return the modified HTML as a string, without a doctype and so on.
> 
> Any ideas? I know how to work with XML, but HTML drives me crazy since 
> it is not XML. Yes, I've tried to RTFM and STFW, but now I give up: none 
> of it works the way I need.
> 
> Here is what I need it for. I have HTML files in which I need to find all 
> the <a name="foo"> tags that contain Unicode data in the "name" attribute, 
> urllib.quote() that value, and then return the HTML. But I have no idea how 
> to do that with a DOM in HTML, since this is not XML... :(
> 
> Thank you for any help and any *working* ideas and examples. :)
> 

This seems to work for me:

from HTMLParser import HTMLParser
import urllib

class ParseMe(HTMLParser):

     def __init__(self):
         HTMLParser.__init__(self)
         self.data = ''

     def add_data(self, text):
         #append verbatim; stripping here would eat the spaces around tags
         self.data = self.data + text

     def start_a(self, attr):
         #change your data here
         #implementation left as an exercise for you ;)
         if attr[0] == 'name' and attr[1] == 'foo':
             return (attr[0], 'bar',)
         else:
             return attr

     def handle_startendtag(self, tag, attr):
         self.add_data(self.get_starttag_text())

     def handle_starttag(self, tag, attrs):
         parts = [tag]
         for attr in attrs:
             #check here for tags you want to manipulate
             #call start_<tag> method (modeled after sgmllib)
             if tag == 'a':
                 attr = self.start_a(attr)
             if attr[1] is None:
                 #valueless attribute, e.g. <option selected>
                 parts.append(attr[0])
             else:
                 parts.append('%s="%s"' % (attr[0], attr[1],))
         self.add_data('<%s>' % ' '.join(parts))

     def handle_endtag(self, tag):
         self.add_data('</%s>' % tag)

     def handle_data(self, data):
         self.add_data(data)

     def handle_comment(self, comment):
         #no extra padding, so the comment comes back exactly as it went in
         self.add_data('<!--%s-->' % comment)

     def handle_charref(self, name):
         #numeric references need the '#': &#233; not &233;
         self.add_data('&#%s;' % name)

     def handle_entityref(self, name):
         self.add_data('&%s;' % name)

     def handle_decl(self, decl):
         self.add_data('<!%s>' % decl)

if __name__ == '__main__':
     ht = ParseMe()
     ht.feed(urllib.urlopen('http://www.soraia.com/index.php').read())
     ht.close()
     print ht.data
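
For anyone reading this on Python 3, here is a rough sketch of the same idea applied to the task from the question: percent-quote the "name" attribute of every <a> tag and hand the rest of the HTML back as it came in. It is a sketch under assumptions, not a definitive implementation. The class name AnchorNameQuoter and the helper quote_anchor_names are made up for illustration; html.parser and urllib.parse.quote are the Python 3 homes of the modules used above. It also assumes that lowercased tag and attribute names and double-quoted attribute values are acceptable in the output, since the parser does not preserve the original spelling.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import quote


class AnchorNameQuoter(HTMLParser):
    """Re-emit HTML, percent-quoting the name attribute of <a> tags."""

    def __init__(self):
        # convert_charrefs=False keeps &amp; and &#233; as separate events,
        # so they can be written back the way they appeared.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _rebuild(self, tag, attrs, closing='>'):
        parts = [tag]
        for name, value in attrs:
            if value is None:
                # valueless attribute, e.g. <option selected>
                parts.append(name)
                continue
            if tag == 'a' and name == 'name':
                # quote() percent-encodes the UTF-8 bytes of the value
                value = quote(value)
            # the parser unescapes attribute values, so escape them again
            parts.append('%s="%s"' % (name, escape(value, quote=True)))
        return '<' + ' '.join(parts) + closing

    def handle_starttag(self, tag, attrs):
        self.out.append(self._rebuild(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.out.append(self._rebuild(tag, attrs, ' />'))

    def handle_endtag(self, tag):
        self.out.append('</%s>' % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_comment(self, data):
        self.out.append('<!--%s-->' % data)

    def handle_entityref(self, name):
        self.out.append('&%s;' % name)

    def handle_charref(self, name):
        self.out.append('&#%s;' % name)

    def handle_decl(self, decl):
        self.out.append('<!%s>' % decl)


def quote_anchor_names(html_text):
    """Return html_text with every <a name="..."> value percent-quoted."""
    parser = AnchorNameQuoter()
    parser.feed(html_text)
    parser.close()
    return ''.join(parser.out)


if __name__ == '__main__':
    print(quote_anchor_names('<p>Text here <a name="f\u00f6\u00f6">bar</a></p>'))
```

Feeding it the string from the question gives the same string back, because "foo" contains nothing that needs quoting; a name with non-ASCII characters or spaces comes back percent-encoded. This is still an event-driven rewrite rather than a real DOM, so it only suits edits that can be made one tag at a time.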