HTML parsing abnormal html pages

Mon Mar 19 14:24:39 EST 2001

You wrote:
>
>Tim Roberts wrote:
>> ... If not, is anybody aware of a fair HTML cleaner-upper?
>
>You could use the python-xml code to slurp the HTML into a DOM, and then
>format it using HtmlLineariser:
>
>>>> from xml.dom.writer import HtmlLineariser
>>>> from xml.dom.html_builder import HtmlBuilder
>>>> builder = HtmlBuilder()
>>>> builder.ignore_mismatched_end_tags = 1   # make less fussy 
>>>> html_text = open('public_html/index.html').read()
>>>> builder.feed(html_text)
>>>> pretty_printed = HtmlLineariser().linearise(builder.document)
>
>pretty_printed is now a nicely indented version of html_text.
>
>It's not the fastest thing in the world, but it might help you.

Thanks for taking the time to reply.  Maybe I'm a bonehead, but I can't find the imports you've mentioned.  I downloaded PyXML 0.6.4 (and 0.6.2 just to check), but HtmlLineariser, HtmlBuilder, and html_builder.py do not seem to exist.  The documentation refers to them, and one of the test routines (test_htmlb.py) calls them, but they aren't in the xml/dom tree anywhere.

Has this interface been completely replaced?  It looks to me like this:

   from xml.dom.ext.reader import HtmlLib
   from xml.dom.ext import PrettyPrint
   pretty_printed = PrettyPrint( HtmlLib.FromStream(...) )

performs the same function.  Have I missed a clue somewhere?
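
For comparison, here is a rough sketch of the same "slurp it in, ignore mismatched end tags, write it back out indented" idea, using only the standard library's html.parser module from current Python rather than PyXML. The names TolerantIndenter, VOID_TAGS and pretty_print are made up for illustration; none of them are part of PyXML or of the interfaces discussed above, and this is not a drop-in replacement for HtmlLineariser. It drops comments and doctype declarations, and it treats script and style content as ordinary text.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take an end tag in HTML (illustrative, not exhaustive).
VOID_TAGS = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta"}


class TolerantIndenter(HTMLParser):
    """Hypothetical helper: re-indent HTML, dropping unmatched end tags."""

    def __init__(self):
        super().__init__()      # convert_charrefs defaults to True
        self.open_tags = []     # stack of currently open element names
        self.lines = []         # indented output lines

    def _emit(self, text):
        # Indent by the current nesting depth.
        self.lines.append("  " * len(self.open_tags) + text)

    def handle_starttag(self, tag, attrs):
        self._emit(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: emit it, but open nothing.
        self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return              # mismatched end tag: ignore it
        # Close anything left open inside the element being closed.
        while self.open_tags:
            closed = self.open_tags.pop()
            self._emit("</%s>" % closed)
            if closed == tag:
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Character references were decoded by the parser; re-escape them.
            self._emit(escape(text, quote=False))


def pretty_print(html_text):
    """Return an indented copy of html_text, one tag or text run per line."""
    parser = TolerantIndenter()
    parser.feed(html_text)
    parser.close()
    # Close whatever the input left open at the end.
    while parser.open_tags:
        parser._emit("</%s>" % parser.open_tags.pop())
    return "\n".join(parser.lines)
```

Something like pretty_print(open('public_html/index.html').read()) would then return the indented text. Putting every text run on its own line can change inter-word whitespace in inline markup, so this is only good for eyeballing the structure of a page.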
--
- Tim Roberts, timr at probo.com
  Providenza & Boekelheide, Inc.