[Tutor] ElementTree, TidyHTMLTreeBuilder, find

Wed Dec 14 13:08:01 CET 2005

Bob Tanner wrote:
> I'm having a problem understanding how find() works.
> 
> The html file I'm using is attached.
> 
> Python 2.4.2 (#2, Nov 20 2005, 17:04:48)
> 
> >>> from elementtidy import TidyHTMLTreeBuilder
> >>> doc = TidyHTMLTreeBuilder.parse('048229.html')
> >>> root = doc.getroot()
> >>> print root.find('html/body')
> 
> None
> 
> >>> print root.find('body')
> 
> None
> 
> 
> Viewing the HTML with Firefox's DOM tool:
> 
> -#document
>   -HTML
>     +HEAD
>         #TEXT
>     +BODY
> 
> Not sure how to use find().

Let's try it at the interpreter prompt to see what is going on:

  >>> from elementtidy import TidyHTMLTreeBuilder as Tidy
  >>> doc = Tidy.parse(r'D:\WUTemp\temp.html')
  >>> doc
<elementtree.ElementTree.ElementTree instance at 0x00A4D4E0>
  >>> doc.find('body')
  >>> doc.find('BODY')
  >>> doc.find('//BODY')

OK, that doesn't work :-) but you knew that!

Let's just look at the root element:
  >>> doc.getroot()
<Element {http://www.w3.org/1999/xhtml}html at a55620>

Ah, that explains it! TidyHTMLTreeBuilder puts the elements in the XHTML namespace, so
each tag name is stored as '{namespace-uri}tag' rather than as the bare tag. That means
you have to include the namespace as part of the search string for find:

  >>> doc.find('{http://www.w3.org/1999/xhtml}body')
<Element {http://www.w3.org/1999/xhtml}body at a557b0>

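That works! Note that find() searches relative to the element it is called on (for a
tree, that is the root element), so 'html/body' would fail even with the namespace
added: the root element is itself <html>, and it has no <html> child.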
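
If typing the namespace every time gets tedious, a small helper can build the qualified
names for you. This is only a minimal sketch: it assumes the xml.etree.ElementTree
module as a stand-in for elementtidy (both use the same '{namespace}tag' notation), and
the sample document and the qname()/ns_path() helpers are made up for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical sample standing in for the tidied HTML file.
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Test</title></head>"
    "<body><p>Hello</p></body>"
    "</html>"
)


def qname(tag, ns=XHTML_NS):
    """Return the tag in ElementTree's '{namespace}tag' form."""
    return "{%s}%s" % (ns, tag)


def ns_path(path, ns=XHTML_NS):
    """Qualify every step of a simple slash-separated path."""
    return "/".join(qname(step, ns) for step in path.split("/"))


root = ET.fromstring(SAMPLE)

# An unqualified name does not match a namespaced element.
plain_body = root.find("body")

# The qualified name matches the <body> child of the root.
body = root.find(qname("body"))

# Every step of a longer path needs the namespace as well.
para = root.find(ns_path("body/p"))

# Paths are relative to the root <html> element, so this finds nothing.
html_body = root.find(ns_path("html/body"))
```

Here plain_body and html_body come back as None, while body and para are the elements
you were looking for.
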
Kent