Special chars with HTMLParser

Wed Aug 5 08:28:15 EDT 2009

>>>>> Fafounet <fafounet at gmail.com> (F) wrote:

>F> Hello,
>F> I am parsing a web page with special chars such as &#xE9; (which
>F> stands for é).
>F> I know I can have the unicode character é from unicode
>F> ("\xe9","iso-8859-1")
>F> but with those extra characters I don't know.

>F> I tried to implement handle_charref within HTMLParser without success.
>F> Furthermore, if I have the data ab&#xE9;cd, handle_data will get "ab",
>F> handle_charref will get xe9 and then handle_data doesn't have the end
>F> of the string ("cd").

The character references indicate Unicode ordinals, not iso-8859-1
characters. In your example decoding as iso-8859-1 gives the proper
character because iso-8859-1 coincides with the first 256 Unicode
ordinals, but for characters outside of iso-8859-1 it will fail.
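
To make the difference concrete, here is a small sketch. It is written
for Python 3, where chr() plays the role of Python 2's unichr(); the
helper name charref_to_char is made up for this illustration.

```python
def charref_to_char(name):
    """Turn the body of a numeric reference ('xE9' or '233') into a character.

    Hypothetical helper, not part of HTMLParser.
    """
    if name[:1] in ("x", "X"):          # hexadecimal form, e.g. &#xE9;
        return chr(int(name[1:], 16))
    return chr(int(name, 10))           # decimal form, e.g. &#233;

e_acute = charref_to_char("xE9")
euro = charref_to_char("x20AC")

# U+00E9 lies inside iso-8859-1, so decoding the byte gives the same character.
same_as_latin1 = (e_acute == b"\xe9".decode("iso-8859-1"))

# U+20AC (euro sign) has no iso-8859-1 equivalent, so that route cannot work.
try:
    euro.encode("iso-8859-1")
    euro_in_latin1 = True
except UnicodeEncodeError:
    euro_in_latin1 = False
```

Here same_as_latin1 ends up true and euro_in_latin1 false, which is the
"works by coincidence" situation described above.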

This should give you an idea:

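from HTMLParser import HTMLParser
from htmlentitydefs import name2codepoint

class CharParser(HTMLParser):   # class name is only an illustration
    def handle_charref(self, name):
        # name is the text between '&#' and ';', e.g. 'xE9' or '233'
        if name[:1] in ('x', 'X'):   # the hex marker may be upper case
            num = int(name[1:], 16)
        else:
            num = int(name, 10)
        print 'char:', repr(unichr(num))

    def handle_entityref(self, name):
        # name is the entity name without '&' and ';', e.g. 'eacute'
        print 'char:', repr(unichr(name2codepoint[name]))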

If your HTML may be illegal (unknown entity names, numbers that are not
valid Unicode ordinals) you should add some exception handling.
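
As a rough sketch of that exception handling, and of how to get the
whole string "abécd" back in one piece, the handlers can append to a
list instead of printing. The class name TextCollector and its text()
method are invented for this example. The import fallback and the
convert_charrefs line are there only so the same code also runs on
Python 3, where the modules were renamed and handle_charref is otherwise
bypassed by default; on Python 2 that attribute is simply ignored.

```python
try:                                    # Python 2 module names, as in the post
    from HTMLParser import HTMLParser
    from htmlentitydefs import name2codepoint
except ImportError:                     # Python 3 renamed both modules
    from html.parser import HTMLParser
    from html.entities import name2codepoint
    unichr = chr                        # Python 3 has no unichr()

class TextCollector(HTMLParser):        # hypothetical name
    def __init__(self):
        HTMLParser.__init__(self)
        # Python 3 only: keep handle_charref/handle_entityref being called.
        self.convert_charrefs = False
        self.parts = []

    def handle_data(self, data):
        # Called once per run of plain text, so "ab" and "cd" arrive separately.
        self.parts.append(data)

    def handle_charref(self, name):
        try:
            if name[:1] in ('x', 'X'):
                num = int(name[1:], 16)
            else:
                num = int(name, 10)
            self.parts.append(unichr(num))
        except (ValueError, OverflowError):
            # Not a valid Unicode ordinal: keep the reference as written.
            self.parts.append('&#' + name + ';')

    def handle_entityref(self, name):
        try:
            self.parts.append(unichr(name2codepoint[name]))
        except KeyError:
            # Unknown entity name: keep the reference as written.
            self.parts.append('&' + name + ';')

    def text(self):
        return ''.join(self.parts)

parser = TextCollector()
parser.feed('ab&#xE9;cd &eacute; &bogus; x')
parser.close()
result = parser.text()
```

The "cd" part is not lost: it arrives in a second handle_data call after
handle_charref, so the pieces have to be joined by your own code.
Leaving bad references in the output unchanged is just one possible
policy; you could also drop them or log them.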
-- 
Piet van Oostrum <piet at cs.uu.nl>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: piet at vanoostrum.org