Parsing complex web pages safely with htmllib.HTMLParser

Fri Jan 25 02:33:21 EST 2002

Preprocessing the HTML page with tidy does the trick.  The Python
interface to tidy, mxTidy:
http://www.lemburg.com/files/python/mxTidy.html
was all I needed to install.  It installs into \python21\mx and you
use it like this:

from mx.Tidy import tidy
# item 2 of the returned tuple is the cleaned-up HTML
cleanhtmltext = tidy(htmltext)[2]

The tidy function returns a tuple: items 0 and 1 are the error and
warning counts, and item 2 is the clean HTML.
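
For anyone who wants to check the counts rather than just grab item 2,
here is a minimal sketch.  The names clean_html and stub_tidy are my own
inventions for illustration and are not part of mxTidy.  stub_tidy only
mimics the tuple shape described above so the snippet runs without
mxTidy installed.  Treating a nonzero error count as fatal is my
assumption, not something mxTidy requires.

```python
def clean_html(htmltext, tidy_func):
    """Return the cleaned HTML produced by tidy_func.

    tidy_func is assumed to behave like mx.Tidy.tidy as described
    above: it returns a tuple whose items 0 and 1 are the error and
    warning counts and whose item 2 is the cleaned HTML.
    """
    result = tidy_func(htmltext)
    nerrors, nwarnings, cleaned = result[0], result[1], result[2]
    if nerrors:
        # Assumption: with errors reported, the output is not trusted.
        raise ValueError("tidy reported %d error(s)" % nerrors)
    return cleaned


def stub_tidy(htmltext):
    # Hypothetical stand-in for mx.Tidy.tidy; it does no real cleanup,
    # it just returns (error count, warning count, text).
    return (0, 1, htmltext.strip())


cleanhtmltext = clean_html("  <p>hello</p>  ", stub_tidy)
```

With mxTidy installed you would pass the real function instead:
clean_html(htmltext, tidy) after "from mx.Tidy import tidy".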

Thanks for the helpful responses!
Andy Bulka
www.atug.com/andypatterns