[XML-SIG] sgmlop and html parsing

"Martin v. Löwis" martin at v.loewis.de
Wed Jan 14 13:55:52 EST 2004


Alexandre Fayolle wrote:

> +            prev_encoding = self.getProperty(handler.property_encoding)
> +            self.setProperty(handler.property_encoding, 'utf-8')
> +            self.handle_data(unichar.encode('utf-8'))
> +            self.setProperty(handler.property_encoding, prev_encoding)

I think you should avoid setting property_encoding if you can.
Instead, first try to encode the character in self._encoding.
Converting to UTF-8 would then be necessary only as a fallback when
that fails; alternatively, invoke unknown_charref in that case.
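
To make the suggestion concrete, here is a minimal sketch of the
"encode in self._encoding, else call unknown_charref" variant. The class
CharrefHandler and its two list attributes are hypothetical stand-ins for
illustration; they are not the actual parser class from the patch, and the
sketch uses present-day Python string semantics.

```python
class CharrefHandler:
    """Hypothetical stand-in for the parser class touched by the patch."""

    def __init__(self, encoding):
        self._encoding = encoding
        self.data = []     # byte strings passed to handle_data
        self.unknown = []  # code points passed to unknown_charref

    def handle_data(self, data):
        self.data.append(data)

    def unknown_charref(self, codepoint):
        self.unknown.append(codepoint)

    def handle_charref(self, codepoint):
        try:
            # chr() raises ValueError outside the Unicode range;
            # encode() raises UnicodeEncodeError (a ValueError subclass)
            # if the target encoding cannot represent the character.
            encoded = chr(codepoint).encode(self._encoding)
        except ValueError:
            self.unknown_charref(codepoint)
        else:
            self.handle_data(encoded)


handler = CharrefHandler("latin-1")
handler.handle_charref(233)     # representable in Latin-1
handler.handle_charref(0x20AC)  # euro sign, not representable in Latin-1
```

With this shape the encoding property never has to be changed and
restored around a single character.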

Also, it is questionable whether the character reference really *does*
denote a Unicode character. In SGML, the SGML declaration that
accompanies the DTD determines the document character set, and it
could be anything.
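
As a rough illustration of why this matters, the same reference number can
denote different characters depending on the assumed document character
set. The helper resolve_charref below is hypothetical, and it only handles
single-byte character sets that Python ships a codec for; it is a sketch,
not how an SGML parser actually resolves references.

```python
def resolve_charref(number, document_charset=None):
    """Return the character that reference number `number` denotes.

    With no document_charset, the number is taken as a Unicode code
    point. Otherwise it is taken as a position in the named single-byte
    character set (an assumption made for this sketch).
    """
    if document_charset is None:
        return chr(number)
    # bytes([number]) raises ValueError for numbers above 255
    return bytes([number]).decode(document_charset)


as_unicode = resolve_charref(128)           # a C1 control character
as_cp1252 = resolve_charref(128, "cp1252")  # the euro sign
```

So handing chr(number) to the application is only correct if the
document character set really is Unicode (or a subset of it, such as
Latin-1).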

Of course, if you happen to know that you are parsing HTML, then the
document character set would be Latin-1 (at least up to HTML 3.2;
HTML 4 declares ISO 10646, i.e. Unicode). I don't know what it is for
bookmarks (probably Unicode).

Regards,
Martin



