HTML / DOM
Joe Francia
usenet at soraia.com
Fri Mar 28 19:41:30 EST 2003
Bo M. Maryniuck wrote:
> Hello, all.
>
> Can anybody give me some real code showing how to work with the DOM in _HTML_
> that is not even XHTML? I took a look at 4DOM, but unfortunately the
> documentation there is too silly. :( Well, for example, I have an HTML string:
>
> <p>Text here <a name="foo">bar</a></p>
>
> Now, how do I build a DOM from this chunk to do the following:
> 1. Fetch the "name" attribute from the "<A/>" tag
> 2. Change it (not the "bar" text, but the "foo" value!)
> 3. Put it back in the same place
> 4. Return the modified HTML as a string, without a doctype and so on.
>
> Any ideas? I know how to work with XML, but HTML stuff drives me crazy since
> it is not XML. Yes, I've tried to RTFM and STFW, but now I have given up --
> none of it works the way I need.
>
> Here is what I need to do. I have HTML files in which I need to find all the
> <a name="foo"> tags that contain Unicode data in the "name" attribute,
> urllib.quote() it, and then return the HTML. But how to do that with a DOM in
> HTML, I have no idea, since this is not XML... :(
>
> Thank you for any help and any *working* ideas and examples. :)
>
This seems to work for me:
from HTMLParser import HTMLParser
import urllib

class ParseMe(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.data = ''

    def add_data(self, text):
        # append as-is; stripping here would eat the spaces around tags
        self.data = self.data + text

    def start_a(self, attr):
        #change your data here
        #implementation left as an exercise for you ;)
        if attr[0] == 'name' and attr[1] == 'foo':
            return (attr[0], 'bar',)
        else:
            return attr

    def handle_startendtag(self, tag, attrs):
        # self-closing tags (<br/>) are passed through untouched
        self.add_data(self.get_starttag_text())

    def handle_starttag(self, tag, attrs):
        temp = ''
        for attr in attrs:
            #check here for tags you want to manipulate
            #call start_<tag> method (modeled after sgmllib)
            if tag == 'a':
                attr = self.start_a(attr)
            if attr[1] is None:
                # attribute without a value, e.g. <option selected>
                temp = temp + '%s ' % attr[0]
            else:
                temp = temp + '%s="%s" ' % (attr[0], attr[1],)
        self.add_data('<%s>' % (tag + ' ' + temp).strip())

    def handle_endtag(self, tag):
        self.add_data('</%s>' % tag)

    def handle_data(self, data):
        self.add_data(data)

    def handle_comment(self, comment):
        # the comment text already carries its own whitespace
        self.add_data('<!--%s-->' % comment)

    def handle_charref(self, name):
        # name is the number only, e.g. '62' for &#62;
        self.add_data('&#%s;' % name)

    def handle_entityref(self, name):
        self.add_data('&%s;' % name)

    def handle_decl(self, decl):
        self.add_data('<!%s>' % decl)

if __name__ == '__main__':
    ht = ParseMe()
    ht.feed(urllib.urlopen('http://www.soraia.com/index.php').read())
    ht.close()
    print ht.data
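
For anyone reading this on Python 3: the same idea can be sketched with html.parser and urllib.parse.quote, the modern homes of the HTMLParser and urllib.quote used above. This is only a rough sketch under a few assumptions, not a tested drop-in: the names AnchorNameQuoter and quote_anchor_names are made up for illustration, the input is assumed to be an already-decoded str, and quote() is left at its default UTF-8 encoding. It quotes the "name" attribute of every <a> tag and writes everything else back as close to the input as the parser allows.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import quote


class AnchorNameQuoter(HTMLParser):
    """Re-emit HTML, percent-quoting the name attribute of every <a> tag.

    Hypothetical helper class, named here for illustration only.
    """

    def __init__(self):
        # convert_charrefs=False reports &amp; / &#62; in text as separate
        # events, so they can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def _render(self, tag, attrs, closing='>'):
        out = ['<' + tag]
        for name, value in attrs:
            if value is None:
                # bare attribute, e.g. <option selected>
                out.append(' ' + name)
                continue
            if tag == 'a' and name == 'name':
                value = quote(value)  # UTF-8 percent-encoding by default
            # the parser unescapes attribute values, so escape them again
            out.append(' %s="%s"' % (name, escape(value, quote=True)))
        return ''.join(out) + closing

    def handle_starttag(self, tag, attrs):
        self.parts.append(self._render(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self._render(tag, attrs, ' />'))

    def handle_endtag(self, tag):
        self.parts.append('</%s>' % tag)

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append('&%s;' % name)

    def handle_charref(self, name):
        self.parts.append('&#%s;' % name)

    def handle_comment(self, comment):
        self.parts.append('<!--%s-->' % comment)

    def handle_decl(self, decl):
        self.parts.append('<!%s>' % decl)


def quote_anchor_names(html_text):
    """Return html_text with every <a name="..."> value percent-quoted."""
    parser = AnchorNameQuoter()
    parser.feed(html_text)
    parser.close()
    return ''.join(parser.parts)


if __name__ == '__main__':
    print(quote_anchor_names('<p>Text here <a name="f\u00f6o bar">bar</a></p>'))
```

Run on the example string from the question with a non-ASCII name, this should print <p>Text here <a name="f%C3%B6o%20bar">bar</a></p>. Note that tag and attribute names come back lowercased and attribute quoting is normalized to double quotes, because the parser does not keep the original spelling; if that matters, get_starttag_text() gives the raw tag text instead.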