[issue13612] xml.etree.ElementTree says unknown encoding of a regular encoding

Amaury Forgeot d'Arc report at bugs.python.org
Fri Dec 16 12:26:45 CET 2011


Amaury Forgeot d'Arc <amauryfa at gmail.com> added the comment:

Actually, this fails on 2.6 and 2.7 with wide unicode builds, and passes with narrow unicode builds (on my 64-bit Linux box).

In pyexpat.c, PyUnknownEncodingHandler accesses 256 characters of a unicode buffer without checking its length... which happens to be 192 chars long.
So the read runs past the end of the buffer.  The function has a comment "supports only 8bit encodings"; indeed.
Versions 3.2 and 3.3 happen to pass the test, probably by pure luck.
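
The length mismatch can be illustrated from Python.  This is only a sketch: it assumes the handler builds its table by decoding the 256 single-byte values in one call with "replace" error handling, and the helper name decoded_table_length is made up for the illustration.  An 8-bit codec such as latin-1 gives back exactly 256 characters, while a multibyte codec may consume several bytes per character and give back fewer.

```python
def decoded_table_length(encoding):
    """Decode the 256 single-byte values in one call and count the characters.

    Hypothetical helper, not part of pyexpat; it only mimics the assumed
    behaviour of the unknown-encoding handler.
    """
    all_bytes = bytes(range(256))
    # "replace" substitutes U+FFFD for undecodable input instead of raising.
    return len(all_bytes.decode(encoding, "replace"))


if __name__ == "__main__":
    # Compare an 8-bit codec with two multibyte codecs.
    for name in ("latin-1", "gbk", "shift_jis"):
        print(name, decoded_table_length(name))
```

Any result below 256 means that code indexing the decoded buffer up to position 255 reads past its end.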

Supporting multibyte codecs won't be easy: pyexpat has to fill an array that specifies the number of bytes needed by each start byte (for example, in utf-8, 0xc3 starts a 2-byte sequence, 0xef starts a 3-byte sequence).  Our codecs framework does not provide this information, and some codecs (gb18030 for example) need the second byte to determine whether the character will take 4 bytes.
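
To make the start-byte array concrete, here is a rough sketch for utf-8, where the table can be written down by hand.  It assumes expat's convention for the 256-entry map (a code point for a single-byte character, -1 for a byte that cannot start a character, -n for the first byte of an n-byte sequence); the function name utf8_lead_byte_map is invented for the example.  The last lines show why the same approach does not work for gb18030: two encodings of different lengths share the same first byte.

```python
def utf8_lead_byte_map():
    """Build a 256-entry table describing each possible first byte in UTF-8.

    Hypothetical illustration: entry >= 0 is a single-byte character,
    -1 is a byte that cannot start a sequence, -n starts an n-byte sequence.
    """
    table = []
    for b in range(256):
        if b < 0x80:
            table.append(b)        # ASCII: one byte, code point == byte value
        elif 0xC2 <= b <= 0xDF:
            table.append(-2)       # e.g. 0xc3 starts a 2-byte sequence
        elif 0xE0 <= b <= 0xEF:
            table.append(-3)       # e.g. 0xef starts a 3-byte sequence
        elif 0xF0 <= b <= 0xF4:
            table.append(-4)
        else:
            table.append(-1)       # continuation byte or invalid lead byte
    return table


table = utf8_lead_byte_map()

# gb18030: the first byte alone does not determine the sequence length.
two = "\u4e02".encode("gb18030")    # a 2-byte sequence
four = "\u0080".encode("gb18030")   # a 4-byte sequence with the same first byte

if __name__ == "__main__":
    print(table[0xC3], table[0xEF])
    print(two.hex(), four.hex())
```

A table indexed only by the first byte has no slot for "2 or 4, depending on the next byte", which is the information the codecs framework would have to supply.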

----------
nosy: +amaury.forgeotdarc

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue13612>
_______________________________________

