Unexpected behaviour with HTMLParser...

Andrew Durdin adurdin at gmail.com
Wed Oct 10 09:19:58 EDT 2007


On 10/9/07, Just Another Victim of the Ambient Morality
<ihatespam at hotmail.com> wrote:
>
> "Diez B. Roggisch" <deets at nospam.web.de> wrote in message
> news:5n2avjFfh6h8U1 at mid.uni-berlin.de...
> >
> > Without code, that's hard to determine. But you are aware of e.g.
> >
> > handle_entityref(name)
> > handle_charref(ref)
> >
> > ?
>
>     Actually, I am not aware of these methods but I will certainly look into
> them!
>     I was hoping that the issue would be known or simple before I committed
> to posting code, something that is, to my chagrin, not easily done with my
> news client...

For example, here's something simple/simplistic you can do to handle
character and entity references:

from htmlentitydefs import name2codepoint

...

    def handle_charref(self, ref):
        try:
            if ref[:1] in ('x', 'X'):  # hex refs arrive as 'x41' or 'X41'
                char = unichr(int(ref[1:], 16))
            else:
                char = unichr(int(ref))
        except (TypeError, ValueError, OverflowError):
            char = ' '
        # Do something with char

    def handle_entityref(self, ref):
        try:
            char = unichr(name2codepoint[ref])
        except (KeyError, ValueError):
            char = ' '
        # Do something with char
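
For something you can paste and run, here is a rough sketch of the same
idea as a complete parser. It makes a few assumptions that are not in the
fragment above: it targets Python 3, where the modules are spelled
html.parser and html.entities and unichr is spelled chr; it passes
convert_charrefs=False so that the parser actually calls the two handlers;
and the names TextCollector and extract_text are made up for illustration.
"Do something with char" here just means appending it to a list.

```python
from html.parser import HTMLParser
from html.entities import name2codepoint


class TextCollector(HTMLParser):
    """Hypothetical subclass: collects text, decoding references by hand."""

    def __init__(self):
        # Assumption: Python 3. With convert_charrefs=False the parser
        # calls handle_charref/handle_entityref instead of decoding
        # the references itself.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_charref(self, ref):
        # ref is e.g. '65', 'x41' or 'X41' (no leading '&#', no ';').
        try:
            if ref[:1] in ('x', 'X'):
                char = chr(int(ref[1:], 16))
            else:
                char = chr(int(ref))
        except (ValueError, OverflowError):
            char = ' '  # out-of-range or malformed reference
        self.parts.append(char)

    def handle_entityref(self, ref):
        # ref is the entity name, e.g. 'amp' or 'eacute'.
        try:
            char = chr(name2codepoint[ref])
        except (KeyError, ValueError):
            char = ' '  # unknown entity name
        self.parts.append(char)

    def text(self):
        return ''.join(self.parts)


def extract_text(markup):
    """Hypothetical helper: feed markup and return the decoded text."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return parser.text()
```

With that, extract_text('<p>caf&eacute; &amp; &#65;&#x42;</p>') should
come back as 'café & AB', and an unknown name such as &bogus; is replaced
by a single space rather than raising.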


A.


