HTML parsing abnormal HTML pages
Tim Roberts
timr at probo.com
Sun Mar 18 00:58:10 EST 2001
Mark Pilgrim <f8dy at diveintopython.org> wrote:
>
>In fact, the BaseHTMLProcessor class I define in my book can be used to
>properly quote all attribute values, since it works by breaking down the
>entire HTML (via sgmllib) and building up equivalent HTML with proper quotes
>around attribute values.
I have been searching for an HTML pretty-printer: something I can feed an
arbitrary page and get back a more structured, indented view. I wrote a
simple one myself, based on sgmllib; it does a fair job, but it is easily
confused by such common offenses as omitted </p> tags. It sounds like your
BaseHTMLProcessor might be such a thing. Is it available yet?
If not, is anybody aware of a fair HTML cleaner-upper?
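
To make the request concrete, here is a minimal sketch of the kind of tool
I mean. It follows the break-down-and-rebuild idea described above, but it is
an illustration only, not the BaseHTMLProcessor code: it uses html.parser
from the current Python standard library instead of sgmllib, and the names
Indenter, VOID and pretty are made up for this example. It quotes every
attribute value, indents by nesting depth, and treats a new <p> as closing a
<p> that was left open.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag (a partial list, for illustration).
VOID = {"br", "hr", "img", "input", "meta", "link"}


class Indenter(HTMLParser):
    """Re-emit HTML one item per line, indented, with quoted attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []
        self.stack = []  # tags that are currently open

    def emit(self, text):
        self.lines.append("  " * len(self.stack) + text)

    def handle_starttag(self, tag, attrs):
        # Simplifying assumption: a new <p> closes any <p> still open,
        # along with anything opened inside it.
        if tag == "p" and "p" in self.stack:
            self.handle_endtag("p")
        parts = [tag]
        for name, value in attrs:
            if value is None:  # bare attribute such as "checked"
                parts.append(name)
            else:
                parts.append('%s="%s"' % (name, escape(value, quote=True)))
        self.emit("<%s>" % " ".join(parts))
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag: ignore it
        # Close everything opened since the matching start tag.
        while self.stack:
            open_tag = self.stack.pop()
            self.emit("</%s>" % open_tag)
            if open_tag == tag:
                break

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text:
            self.emit(escape(text, quote=False))


def pretty(html):
    parser = Indenter()
    parser.feed(html)
    parser.close()
    while parser.stack:  # close whatever is still open at end of input
        open_tag = parser.stack.pop()
        parser.emit("</%s>" % open_tag)
    return "\n".join(parser.lines)


print(pretty("<div class=box><p>one<p>two</div>"))
```

Comments, doctypes and <pre> blocks are not handled here, and collapsing
whitespace is wrong for preformatted text, so a real cleaner would need more
care than this.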
--
- Tim Roberts, timr at probo.com
Providenza & Boekelheide, Inc.