htmllib.py and parsing malformed HTML

John J. Lee jjl at pobox.com
Tue Sep 2 17:59:03 EDT 2003


KC <nskhcarlso at bellsouth.net> writes:

> Thomas Güttler wrote:
> > Hi,
> > You could use tidy (http://www.w3.org/People/Raggett/tidy/) before
> > you parse the html.
> 
> I appreciate the suggestion but unfortunately this will not work well
> for me as the parser runs as part of a cron job.  I wouldn't be able
> to review the tidy error log in a timely fashion if there was a
> problem.
[...]

So, what about *your* code's error log (or its equivalent, presumably
an unhandled traceback)?  It's not obvious that your solution (in a
later post) will be any more robust than just piping everything
through HTMLTidy.  In fact, since you will find a great variety of
nonsense in 'HTML as deployed', it seems likely that HTMLTidy will do
the better job.
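
To make "piping everything through HTMLTidy" concrete for a cron job,
here is a rough sketch rather than a tested recipe. It assumes a
`tidy` executable on the PATH that accepts the `-q` and `-asxhtml`
options, and it uses Python 3's `html.parser` as a stand-in for
htmllib. The names `LinkCollector`, `tidy_html` and `extract_links`
are made up for illustration. The point is only that tidy's
complaints can go to the same log as your own parser's, and that a
tidy failure need not kill the run.

```python
import logging
import subprocess
from html.parser import HTMLParser

log = logging.getLogger("tidy_sketch")


class LinkCollector(HTMLParser):
    """Hypothetical parser: collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def tidy_html(raw, tidy_cmd=("tidy", "-q", "-asxhtml")):
    """Pipe raw HTML through tidy; return the raw input if tidy fails."""
    try:
        proc = subprocess.run(
            list(tidy_cmd),
            input=raw,
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # tidy is missing or hung: log it and carry on with the raw page.
        log.error("could not run tidy: %s", exc)
        return raw
    # tidy exits with 0 (clean), 1 (warnings) or 2 (errors).
    if proc.returncode not in (0, 1) or not proc.stdout:
        log.error("tidy failed (exit %d): %s",
                  proc.returncode, proc.stderr.strip())
        return raw
    if proc.stderr:
        log.warning("tidy warnings: %s", proc.stderr.strip())
    return proc.stdout


def extract_links(raw, tidy_cmd=("tidy", "-q", "-asxhtml")):
    """Clean the page with tidy if possible, then parse it."""
    parser = LinkCollector()
    parser.feed(tidy_html(raw, tidy_cmd))
    parser.close()
    return parser.links
```

Falling back to the raw page when tidy fails is a design choice, not
a requirement: the job then behaves no worse than it would without
tidy, and the log records why.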


John



