[XML-SIG] sgmlop and html parsing

"Martin v. Löwis" martin at v.loewis.de
Tue Jan 13 15:31:36 EST 2004


Alexandre Fayolle wrote:

> I've looked in the code, and I'm not sure how I can handle this, because
> encoding issues in drv_sgmlop.py only seem to be handled in the callback
> methods, and this problem occurs before the callbacks get called.

This should happen only if self->unicode is false. This is XML parsing,
right? If so, you should enable self->unicode, and it will give you
a unicode character (in handle_data).

If you want to fix it in sgmlop instead of in the application, you could
do what the comment suggests: encode the charref as UTF-8, and pass a
byte string. This is error-prone, though: the application may not expect
UTF-8.
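
To illustrate the UTF-8 option, here is a minimal sketch in Python rather
than in sgmlop's C source. The function name charref_to_utf8 is
hypothetical and not part of sgmlop. The sketch assumes the reference body
is the text between "&#" and ";", in decimal or "x"-prefixed hexadecimal
form, and it is written for Python 3, where chr() covers all of Unicode.

```python
def charref_to_utf8(ref):
    """Return the UTF-8 bytes for a numeric character reference body.

    `ref` is assumed to be the text between "&#" and ";",
    for example "233" (decimal) or "xE9" (hexadecimal).
    """
    if ref[:1] in ("x", "X"):
        codepoint = int(ref[1:], 16)
    else:
        codepoint = int(ref, 10)
    # chr() yields the Unicode character; encode it as a byte string.
    return chr(codepoint).encode("utf-8")


utf8_bytes = charref_to_utf8("233")          # b"\xc3\xa9"
# An application that assumes Latin-1 misreads the two bytes:
misread = utf8_bytes.decode("latin-1")       # "Ã©" instead of "é"
```

The last two lines show why this is error-prone: the bytes are only
correct for an application that decodes them as UTF-8.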

As another alternative, in the application, you could activate the
handle_charref callback - it is actually consulted *before* sgmlop
tries to deal with the character reference itself.
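
A rough sketch of such a handler follows. The class name
CharrefAwareHandler is hypothetical, and the sketch assumes that the
callback receives the reference body as a string (for example "233"), as
sgmllib-style handlers do. It is driven by hand here instead of by a real
sgmlop parser, and it is written for Python 3.

```python
class CharrefAwareHandler:
    """Collects character data, resolving numeric charrefs itself."""

    def __init__(self):
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def handle_charref(self, ref):
        # Assumed input: the text between "&#" and ";".
        if ref[:1] in ("x", "X"):
            codepoint = int(ref[1:], 16)
        else:
            codepoint = int(ref, 10)
        # Forward the resolved character as ordinary data.
        self.handle_data(chr(codepoint))

    def text(self):
        return "".join(self.chunks)


# Simulate the calls a parser might make for "caf&#233;".
handler = CharrefAwareHandler()
handler.handle_data("caf")
handler.handle_charref("233")
```

Because the handler resolves the reference itself, the parser never has to
decide how to represent a non-ASCII character.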

I'm not quite sure why drv_sgmlop creates a SGMLParser though -
shouldn't it rather create an XMLParser?

If not, implementing handle_charref would be the way to go - but
only if there are convincing arguments why drv_sgmlop needs to continue
favouring SGML.

Regards,
Martin



