HTML parsing abnormal HTML pages

Tim Roberts timr at probo.com
Sun Mar 18 00:58:10 EST 2001


Mark Pilgrim <f8dy at diveintopython.org> wrote:
>
>In fact, the BaseHTMLProcessor class I define in my book can be used to
>properly quote all attribute values, since it works by breaking down the
>entire HTML (via sgmllib) and building up equivalent HTML with proper quotes
>around attribute values.

I have been searching for an HTML pretty-printer: something to which I can
feed an arbitrary page and get back a more structured, indented view.  I wrote a
simple one myself, based on sgmllib; it does a fair job, but it is easily
confused by such common offenses as omitted </p> tags.  It sounds like your
BaseHTMLProcessor might be such a thing.  Is it available yet?

If not, is anybody aware of a fair HTML cleaner-upper?
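
To make the request concrete, here is a rough sketch of the kind of tool
described above: it breaks the page into tags and text, then rebuilds it one
item per line, indented, with every attribute value quoted.  This is not
BaseHTMLProcessor, and it does not use sgmllib (which was removed in
Python 3); it assumes the standard html.parser module instead.  The names
IndentingPrinter and pretty, the VOID_TAGS set, and the rule that a new <p>
closes an open <p> are all illustrative assumptions, not a complete repair
strategy.

```python
from html import escape
from html.parser import HTMLParser

# Assumed (incomplete) set of elements that never take a closing tag,
# so they must not increase the indent level.
VOID_TAGS = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta"}


class IndentingPrinter(HTMLParser):
    """Rebuild HTML one item per line, indented, with quoted attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []   # output lines collected so far
        self.stack = []   # tags that are currently open

    def emit(self, text):
        # Indent by two spaces per currently open tag.
        self.lines.append("  " * len(self.stack) + text)

    def handle_starttag(self, tag, attrs):
        # Illustrative repair rule: a new <p> closes a <p> that is still open.
        if tag == "p" and self.stack and self.stack[-1] == "p":
            self.handle_endtag("p")
        parts = [tag]
        for name, value in attrs:
            if value is None:
                parts.append(name)  # bare attribute such as "checked"
            else:
                parts.append('%s="%s"' % (name, escape(value, quote=True)))
        self.emit("<%s>" % " ".join(parts))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag: ignore it
        # Close everything left open inside the element being closed.
        while self.stack:
            open_tag = self.stack.pop()
            self.emit("</%s>" % open_tag)
            if open_tag == tag:
                break

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text:
            self.emit(escape(text, quote=False))


def pretty(html):
    """Return an indented rendering of html with quoted attribute values."""
    printer = IndentingPrinter()
    printer.feed(html)
    printer.close()
    while printer.stack:  # close anything the page never closed
        printer.emit("</%s>" % printer.stack.pop())
    return "\n".join(printer.lines)


print(pretty("<body><p class=intro>one<p>two</body>"))
```

Keeping a stack of open tags is what lets the sketch survive an omitted
</p>: the indent level comes from the stack, not from the closing tags that
happen to be present in the page.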
--
- Tim Roberts, timr at probo.com
  Providenza & Boekelheide, Inc.
