[XML-SIG] "Character reference too large" error with HtmlLib.Reader()

Lars Marius Garshol larsga@garshol.priv.no
30 Jul 2002 23:35:48 +0200


* Douglas Bates
|
| The HTML returned from this URI occasionally has character references
| such as
| 		<dt>Alberto Luce&#241;o and Jaime Puig-Pey</dt>
| and I get errors of
| [...]
| ValueError: character reference too large

This sounds like an obvious bug. I suggest you make the smallest
document you can that reproduces the error, and then report this as a
bug in the PyXML SourceForge project (it seems to be in sgmlop, which
I don't think is part of Python proper), attaching the file to it.
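
As a rough sketch of what such a minimal document could look like, the
snippet below keeps only the one element with the character reference
and runs it through a different parser as a control. It uses the
html.parser module from current Python 3, not sgmlop or HtmlLib, and
the names MINIMAL_DOC, TextCollector and parse_text are made up for
this illustration. If the control parser accepts the document, that
points at the original parser rather than at the document.

```python
from html.parser import HTMLParser

# Hypothetical minimal test case: only the element carrying the numeric
# character reference from the report is kept.
MINIMAL_DOC = "<dl><dt>Luce&#241;o</dt></dl>"


class TextCollector(HTMLParser):
    """Collects text content, with character references already resolved."""

    def __init__(self):
        # convert_charrefs=True makes the parser resolve &#241; itself.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def parse_text(doc):
    """Parse doc with the control parser and return its text content."""
    parser = TextCollector()
    parser.feed(doc)
    parser.close()
    return "".join(parser.chunks)


text = parse_text(MINIMAL_DOC)
```

The string in MINIMAL_DOC, saved to a file, would be the attachment for
the bug report.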

| As I understand the code in Sgmlop.py the default characterset is
| ISO-8859-1 and &#241; should be
|
| small n, tilde                       ñ    &#241; --> ñ    &ntilde; --> ñ
|
| in ISO-8859-1.

Actually, character references are based on the document character
set, not on the encoding, and for HTML that's declared in the SGML
declaration, and declared to be Unicode. For &#241; that gives the
same result, since the first 256 Unicode code points are the same as
ISO-8859-1. (So that might actually be another bug...)
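
To make the point about the document character set concrete, here is a
small sketch in current Python 3 (standard library only, nothing from
PyXML or sgmlop). The helper name resolve_numeric_reference is
hypothetical. It treats the number in a decimal reference as a Unicode
code point and compares the result with decoding byte 241 as
ISO-8859-1 and with the named entity.

```python
import html


def resolve_numeric_reference(ref):
    """Resolve a decimal reference such as '&#241;' to its character.

    The number is interpreted as a Unicode code point, i.e. as a position
    in the document character set, independent of the file's encoding.
    """
    if not (ref.startswith("&#") and ref.endswith(";")):
        raise ValueError("not a decimal character reference: %r" % ref)
    return chr(int(ref[2:-1]))


# Code point 241 in Unicode.
as_unicode = resolve_numeric_reference("&#241;")

# Byte value 241 (0xF1) interpreted as ISO-8859-1.
as_latin1 = bytes([241]).decode("iso-8859-1")

# The named entity for the same character.
as_entity = html.unescape("&ntilde;")

# A reference above 255 is still a valid code point, although it has no
# single-byte ISO-8859-1 equivalent.
euro = resolve_numeric_reference("&#8364;")
```

For values below 256 the two interpretations agree, so the difference
only shows up for larger references such as &#8364;.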

-- 
Lars Marius Garshol, Ontopian         <URL: http://www.ontopia.net >
ISO SC34/WG3, OASIS GeoLang TC        <URL: http://www.garshol.priv.no >