Help w/ HTMLParser lib

Richard Brodie R.Brodie at rl.ac.uk
Fri May 21 04:47:02 EDT 2004


"Kevin T. Ryan" <kevryan0701 at yahoo.com> wrote in message
news:40ad7619$0$3114$61fed72c at news.rcn.com...

> I'm somewhat new to python (about 1 year), and I'm trying to write a program
> that opens a file like object w/ urllib.urlopen, and then parse the data by
> passing it to a class that subclasses HTMLParser.HTMLParser.  On the web
> page, however, there is javascript - and I think that is causing an error
> in parsing the data.

The trouble is that there is so much junk HTML on the web that only vaguely
follows the syntax. If you are feeding your program a wide variety of pages,
I would recommend sanitising each page with Tidy or uTidylib before parsing it.
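
For illustration only, here is a rough sketch of the kind of parser subclass
described in the question, with a slot where a clean-up step could go. It
assumes a current Python 3, where the module is html.parser rather than
HTMLParser. The names LinkCollector, collect_links and the sanitise argument
are made up for this example; sanitise stands in for whatever Tidy wrapper you
end up using and is not part of any library.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags and visible text outside <script>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script bodies arrive here as plain data; skip them.
        if not self._in_script and data.strip():
            self.text.append(data.strip())


def collect_links(html_text, sanitise=None):
    """Return the hrefs found in html_text.

    sanitise is a hypothetical hook: any callable that takes the raw
    page as a string and returns cleaned-up HTML as a string.
    """
    if sanitise is not None:
        html_text = sanitise(html_text)
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links
```

Typical use would be to read the page into a string (decoding the bytes from
urlopen first) and call collect_links(page). Markup that appears inside a
script body is handed to handle_data rather than treated as tags, so it does
not end up in the link list.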







