Unexpected behaviour with HTMLParser...

Just Another Victim of the Ambient Morality ihatespam at hotmail.com
Tue Oct 9 17:07:45 EDT 2007


    HTMLParser is behaving in what I find to be strange ways, and I would 
like to better understand what it is doing and why.

    First, it doesn't appear to translate HTML escape characters.  I don't 
know the actual terminology but things like &amp; don't get translated into 
& as one would like.  Furthermore, not only does HTMLParser not translate it 
properly, it seems to omit it altogether!  This prevents me from even doing 
the translation myself, so I can't even work around the issue.
    Why is it doing this?  Is there some mode I need to set?  Can anyone 
else duplicate this behaviour?  Is it a bug?

    Secondly, HTMLParser often calls handle_data() consecutively, without 
any calls to handle_starttag() in between.  I did not expect this.  In HTML, 
you either have text or you have tags.  Why split up my text into successive 
handle_data() calls?  This makes no sense to me.  At the very least, it does 
this in response to text with &amp;-like escape sequences (or whatever 
they're called), so that it may successfully avoid those translations.
    Again, why is it doing this?  Is there some mode I need to set?  Can 
anyone else duplicate this behaviour?  Is it a bug?
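
    For anyone trying to duplicate this, here is a minimal sketch of the kind 
of test I mean.  The names LoggingParser and parse_events, and the sample 
input, are made up for illustration.  It is written against the Python 3 
module name (html.parser), and it passes convert_charrefs=False on the 
assumption that this is what gives the older behaviour I am describing.  It 
only overrides the tag and data handlers, which is all I was doing.

```python
from html.parser import HTMLParser  # Python 2 spelling: from HTMLParser import HTMLParser


class LoggingParser(HTMLParser):
    """Hypothetical subclass that records which callbacks fire, in order."""

    def __init__(self):
        # Assumption: convert_charrefs=False asks Python 3 for the older
        # behaviour, where references like &amp; are not folded into the text.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag))

    def handle_endtag(self, tag):
        self.events.append(("endtag", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def parse_events(html):
    """Feed one HTML string through a fresh parser and return the event log."""
    parser = LoggingParser()
    parser.feed(html)
    parser.close()
    return parser.events


if __name__ == "__main__":
    # Sample input is illustrative only.
    for event in parse_events("<p>Fish &amp; chips</p>"):
        print(event)
```

    With that input, the log shows two data events in a row, "Fish " and 
" chips", with no tag event between them and nothing at all for the &amp; 
that sat between the two pieces of text.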

    These are serious problems for me and I would greatly appreciate a 
deeper understanding of these issues.
    Thank you...






