Help with libxml2dom

Wed Aug 19 07:55:02 EDT 2009

I have just started using libxml2dom to read HTML files, and I have some 
questions that I hope you can answer.

The page I am working on (teste.htm):
<html>
  <head>
    <title>
      Title
    </title>
  </head>
  <body bgcolor = 'FFFFF'>
    <table>
      <tr bgcolor="#EEEEEE">
        <td nowrap="nowrap">
          <font size="2" face="Tahoma, Arial"> <a name="1375048"></a> 
</font>
        </td>
        <td nowrap="nowrap">
          <font size="-2" face="Verdana"> 8/15/2009</font>
        </td>
      </tr>
    </table>
  </body>
</html>

 >>> import libxml2dom
 >>> foo = open('teste.htm', 'r')
 >>> str1 = foo.read()
 >>> doc = libxml2dom.parseString(str1, html=1)
 >>> html = doc.firstChild
 >>> html.nodeName
u'html'
 >>> head = html.firstChild
 >>> head.nodeName
u'head'
 >>> title = head.firstChild
 >>> title.nodeName
u'title'
 >>> body = head.nextSibling
 >>> body.nodeName
u'body'
 >>> table = body.firstChild
 >>> table.nodeName
u'text' # Why? Shouldn't it be a table? (1)
 >>> table = body.firstChild.nextSibling # Why does this work? Is there a hidden text node? (2)
 >>> table.nodeName
u'table'
 >>> tr = table.firstChild
 >>> tr.nodeName
u'tr'
 >>> td = tr.firstChild
 >>> td.nodeName
u'td'
 >>> font = td.firstChild
 >>> font.nodeName
u'text' # (1)
 >>> font = td.firstChild.nextSibling # (2)
 >>> font.nodeName
u'font'
 >>> a = font.firstChild
 >>> a.nodeName
u'text' #(1)
 >>> a = font.firstChild.nextSibling #(2)
 >>> a.nodeName
u'a'

It seems that text nodes are sometimes 'hidden' between the elements. 
This is probably standard DOM behaviour that I am simply not familiar 
with, and I would very much appreciate it if someone could explain it 
to me.
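
For comparison, here is a minimal sketch that reproduces the same effect 
with xml.dom.minidom from the standard library instead of libxml2dom. It 
rests on my assumption that the two behave alike in this respect. The 
fragment and the element_children helper are made up for illustration and 
are not part of either library. Note that minidom names these nodes 
'#text' rather than 'text'. If the assumption holds, the 'hidden' nodes 
are simply the newlines and indentation between the tags, kept as 
whitespace-only text nodes.

```python
from xml.dom import minidom

# Small well-formed fragment modelled on teste.htm.
# Assumption: minidom (an XML parser) stands in for libxml2dom here.
SNIPPET = """<body>
    <table>
      <tr><td>8/15/2009</td></tr>
    </table>
</body>"""


def element_children(node):
    """Hypothetical helper: return only the element children of a node,
    skipping text nodes (including whitespace-only ones)."""
    return [child for child in node.childNodes
            if child.nodeType == child.ELEMENT_NODE]


doc = minidom.parseString(SNIPPET)
body = doc.documentElement

# The first child is the newline and indentation that precede <table>.
first = body.firstChild
print(first.nodeName, repr(first.data))

# Filtering by node type gives the table without relying on nextSibling.
table = element_children(body)[0]
print(table.nodeName)
```

Filtering on nodeType like this would avoid the firstChild.nextSibling 
workaround, which depends on how the file happens to be indented.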

Thanks.