[XML-SIG] "Character reference too large" error with HtmlLib.Reader()

Martin v. Loewis martin@v.loewis.de
31 Jul 2002 08:55:35 +0200


Lars Marius Garshol <larsga@garshol.priv.no> writes:

> This sounds like an obvious bug. I suggest you make the smallest
> document you can that reproduces the error, and then report this as a
> bug in the PyXML Sourceforge project (it seems to be in sgmlop, which
> I don't think is part of Python proper), attaching the file to it.

It turns out that the bug is not that obvious. sgmlop cannot return a
Unicode string, since, in SGML mode, it would have to know what the
character set for character references is. Instead, this was a bug in
xml.dom.reader.SgmlOp.HtmlParser, which failed to implement
handle_charref (sgmlop only tries to interpret the character
references itself if handle_charref is not implemented).
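
To illustrate the handle_charref hook, here is a minimal sketch. It is not
the SgmlOp.py fix itself. It uses the standard library's html.parser as a
stand-in for sgmlop, on the assumption that the callback contract is
similar: the handler receives the text between "&#" and ";". The names
CharrefCollector and decode are hypothetical and only serve this example.

```python
from html.parser import HTMLParser


class CharrefCollector(HTMLParser):
    """Collects text, resolving numeric character references itself."""

    def __init__(self):
        # convert_charrefs=False makes the parser call handle_charref
        # instead of silently converting references inside text.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_charref(self, name):
        # name is e.g. "8364" (decimal) or "x20AC" (hexadecimal).
        if name[:1] in ("x", "X"):
            code = int(name[1:], 16)
        else:
            code = int(name)
        # The handler decides how to map the number to a character;
        # here it is simply treated as a Unicode code point.
        self.parts.append(chr(code))


def decode(text):
    # Hypothetical helper: parse text and return it with references resolved.
    parser = CharrefCollector()
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)
```

Because the handler performs the conversion, the low-level parser never
has to guess which character set a reference refers to.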

This will be fixed in PyXML 0.8; the fix is in SgmlOp.py 1.10.

Regards,
Martin