HTMLParser bug?

Grzegorz Adam Hankiewicz gradha at titanium.sabren.com
Fri May 9 12:45:50 EDT 2003


On 2003-05-08, Anand Pillai <pythonguy at Hotpop.com> wrote:
> I do agree with you. But my requirement is a really robust
> parser which does not fail even if the html code contains some
> invalid HTML. I have seen many pages with this kind of code (my
> own homepage for example ;-)). My program should not fail if it
> encounters such a page.

In the end, mxTidy *is* the robust parser you are looking for; you
just need an extra processing step on its output.

You can look at this problem from two points of view: the user's or
the programmer's.  In the programmer's case, yes, you should go and
write a better parser, if that is your objective.  But I dare say
your objective is writing a web crawler, not an HTML parser.
Writing the parser would mean:

 Reinventing the wheel. You are going to end up with an embedded
 mxTidy in your own code, or maybe the next version of Mozilla.

 A maintainability nightmare. Just look at what happened to IExplorer
 when they found out about that recent <input type blah> bug. Do you
 want to risk your future free time trying to patch every possible
 broken case yourself? This comes back to the previous point.

 Slowing down your development.

In the user's case, the program is not going to fail if you first
clean up the markup and then parse it.
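
As a rough sketch of that clean-first pipeline (only an illustration,
assuming the Python 2 of the era: the mx.Tidy.tidy() call below is
written from memory, so treat its signature and return tuple as
assumptions and check the mxTidy documentation; LinkGrabber is just a
made-up example parser):

    # Sketch: run the raw page through mxTidy first, then feed the
    # cleaned markup to a small HTMLParser subclass.
    from mx import Tidy
    from HTMLParser import HTMLParser

    def clean_html(raw):
        # Assumed return order: (nerrors, nwarnings, cleaned, errlog)
        nerrors, nwarnings, cleaned, errlog = Tidy.tidy(raw)
        return cleaned

    class LinkGrabber(HTMLParser):
        # Illustrative parser: collect the href of every <a> tag.
        def __init__(self):
            HTMLParser.__init__(self)
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                for name, value in attrs:
                    if name == 'href':
                        self.links.append(value)

    parser = LinkGrabber()
    parser.feed(clean_html(open('page.html').read()))
    parser.close()
    print parser.links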

> Thanks for the suggestion but I think I will modify HTMLParser
> code for my purpose than cleaning html using another module. That
> will take time and will slow down the sucking.

You can always try to parse with HTMLParser first and retry with
cleaned-up code if a parsing exception is raised. Besides, for web
crawlers the bottleneck is usually bandwidth, not CPU. Using threads
correctly would make the CPU impact negligible from the point of
view of the user. If the CPU is the bottleneck, something is
probably wrong on the hardware side of your setup.
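
A minimal sketch of that parse-first, clean-on-failure approach
(again Python 2 of the era; clean_html() is the helper sketched above
and parser_class is any HTMLParser subclass, e.g. LinkGrabber):

    # Sketch: try the raw page first, and only pay for the clean-up
    # pass when HTMLParser chokes on the markup.
    from HTMLParser import HTMLParser, HTMLParseError

    def parse_page(raw, parser_class):
        parser = parser_class()
        try:
            parser.feed(raw)
            parser.close()
        except HTMLParseError:
            # Malformed markup: clean it (e.g. with mxTidy) and retry
            # with a fresh parser instance.
            parser = parser_class()
            parser.feed(clean_html(raw))
            parser.close()
        return parser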

-- 
 Please don't send me private copies of your public answers. Thanks.




