(htmllib) How to capture text that includes tags?

John J. Lee jjl at pobox.com
Wed Nov 5 15:46:51 EST 2003


jennyw <jennyw at dangerousideas.com> writes:

> On Wed, Nov 05, 2003 at 11:23:36AM +0100, Peter Otten wrote:
[...]
> Thanks! What are the main advantages of HTMLParser over htmllib?

It won't choke on XHTML.
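
To illustrate, here is a minimal sketch of an HTMLParser subclass that
records tags alongside text, including XHTML-style empty tags such as
<br />.  It assumes a current Python 3, where the module lives at
html.parser (in 2003 it was the top-level HTMLParser module); the
names TagEcho and parse_events are made up for this example.

```python
from html.parser import HTMLParser


class TagEcho(HTMLParser):
    """Record tags and text as a flat list of (kind, value) events."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_startendtag(self, tag, attrs):
        # Called for XHTML-style empty elements such as <br />.
        self.events.append(("empty", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only runs between tags.
        if data.strip():
            self.events.append(("text", data.strip()))


def parse_events(markup):
    """Feed markup to a fresh parser and return the recorded events."""
    parser = TagEcho()
    parser.feed(markup)
    parser.close()
    return parser.events
```

Calling parse_events('<p>one<br />two</p>') should report the <br />
as a single "empty" event between the two text events.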


[...]
> It would be easier if I could find a way to view the HTML as a tree ...
> as a side note, are there any good utils to do this?

Not that I know of (google for it), but a DOM is probably the easiest
way to build one.  DOM libraries often have a prettyprint function
that prints DOM nodes as text (e.g. 4DOM from PyXML), which I've found
quite useful, though of course the output is just a chunk of the HTML
nicely reformatted as XHTML.  Alternatively, you could use something
like graphviz / dot and some DOM-traversing code to draw graphical
trees.  Unfortunately, if this is HTML 'as deployed' (i.e. unparseable
junk), you may have to run it through HTMLTidy before it goes into
your DOM parser (use mxTidy or uTidylib).
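
A rough sketch of the DOM-traversing idea, assuming the input is
already well-formed XHTML (i.e. it has been through a tidy step).  It
uses xml.dom.minidom from the standard library as a stand-in for 4DOM;
tree_lines and dot_source are hypothetical helper names, and the dot
output is only the text you would hand to graphviz yourself.

```python
from xml.dom import minidom


def tree_lines(node, depth=0):
    """Return an indented outline of element names and non-blank text."""
    lines = []
    if node.nodeType == node.ELEMENT_NODE:
        lines.append("  " * depth + node.tagName)
        for child in node.childNodes:
            lines.extend(tree_lines(child, depth + 1))
    elif node.nodeType == node.TEXT_NODE and node.data.strip():
        lines.append("  " * depth + repr(node.data.strip()))
    return lines


def dot_source(root):
    """Return graphviz 'dot' text with one graph node per DOM element."""
    lines = ["digraph dom {"]
    counter = [0]  # mutable counter shared with the nested function

    def visit(node, parent_id):
        if node.nodeType != node.ELEMENT_NODE:
            return  # text and comments are left out of the graph
        my_id = "n%d" % counter[0]
        counter[0] += 1
        lines.append('  %s [label="%s"];' % (my_id, node.tagName))
        if parent_id is not None:
            lines.append("  %s -> %s;" % (parent_id, my_id))
        for child in node.childNodes:
            visit(child, my_id)

    visit(root, None)
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    doc = minidom.parseString(
        "<html><body><p>hi <b>there</b></p></body></html>")
    print("\n".join(tree_lines(doc.documentElement)))
    print(dot_source(doc.documentElement))
```

For the plain reformatted-markup view, minidom nodes also have a
toprettyxml() method.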


John



