[XML-SIG] sgmlop and html parsing
"Martin v. Löwis"
martin at v.loewis.de
Tue Jan 13 15:31:36 EST 2004
Alexandre Fayolle wrote:
> I've looked in the code, and I'm not sure how I can handle this, because
> encoding issues in drv_sgmlop.py only seem to be handled in the callback
> methods, and this problem occurs before callbacks get called.
This should happen only if self->unicode is false. This is XML parsing,
right? If so, you should enable self->unicode, and it will give you
a unicode character (in handle_data).
If you want to fix it in sgmlop instead of in the application, you could
do what the comment suggests: encode the charref as UTF-8, and pass a
byte string. This is error-prone, though: the application may not expect
UTF-8.
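
To illustrate the UTF-8 approach and its risk, here is a minimal sketch in
modern Python 3 (not the sgmlop C code itself). The helper name
charref_to_utf8 is hypothetical, and it assumes the reference arrives as the
text between "&#" and ";":

```python
def charref_to_utf8(ref):
    """Encode the body of a numeric character reference as UTF-8 bytes.

    `ref` is assumed to be the text between '&#' and ';',
    for example '233' (decimal) or 'xE9' (hexadecimal).
    """
    if ref[:1] in ("x", "X"):
        codepoint = int(ref[1:], 16)
    else:
        codepoint = int(ref, 10)
    return chr(codepoint).encode("utf-8")


utf8_bytes = charref_to_utf8("233")  # U+00E9 becomes two bytes in UTF-8

# An application that assumes Latin-1 would misread those two bytes
# as two separate characters instead of one.
misread = utf8_bytes.decode("latin-1")
```

The last line shows the failure mode: a consumer that does not expect UTF-8
gets two characters where the document contained one.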
As another alternative, in the application, you could activate the
handle_charref callback - it is actually considered *before* sgmlop
tries to deal with the character reference itself.
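
A rough sketch of such a handler, in modern Python 3, might look like the
following. The class name CharrefHandler is hypothetical, and I am assuming
the callback receives the reference body as a string ("233" or "xE9"), as
sgmllib-style handlers do. The parser is not imported here; the calls at the
bottom simulate by hand what a parser would do for the input "caf&#233;":

```python
class CharrefHandler:
    """Collects character data, resolving numeric charrefs itself."""

    def __init__(self):
        self.chunks = []

    def handle_data(self, data):
        # Ordinary character data is stored unchanged.
        self.chunks.append(data)

    def handle_charref(self, ref):
        # Assumption: `ref` is the text between '&#' and ';'.
        if ref[:1] in ("x", "X"):
            codepoint = int(ref[1:], 16)
        else:
            codepoint = int(ref, 10)
        self.chunks.append(chr(codepoint))

    def text(self):
        return "".join(self.chunks)


# Simulated callback sequence for the input "caf&#233;".
handler = CharrefHandler()
handler.handle_data("caf")
handler.handle_charref("233")
result = handler.text()
```

Because the application resolves the reference, it decides the resulting
string type and encoding rather than relying on the parser's fallback.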
I'm not quite sure why drv_sgmlop creates a SGMLParser though -
shouldn't it rather create an XMLParser?
If not, implementing handle_charref would be the way to go - but
only if there are convincing arguments why drv_sgmlop needs to continue
favouring SGML.
Regards,
Martin