[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created
Adam Olsen
report at bugs.python.org
Sun Oct 4 07:44:05 CEST 2009
Adam Olsen <rhamph at gmail.com> added the comment:
I've traced the biggest problem to decode_unicode in ast.c. It
needs to convert everything into escapes so the source becomes pure
ASCII, which is then evaluated back into a unicode object.
Unfortunately, it uses UTF-16-BE to do so, which always splits
non-BMP characters into surrogate pairs. Switching it to UTF-32-BE is
fairly straightforward, and works even on UTF-16 (or "narrow") builds.
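To illustrate the difference, here is a rough sketch in Python (not the actual C code in ast.c) of what escaping via each encoding produces for a non-BMP character. The variable names are made up for this example, and it assumes a modern Python 3 interpreter.

```python
import struct

ch = "\U0001010F"  # a character outside the Basic Multilingual Plane

# UTF-16-BE stores a non-BMP code point as two 16-bit units (a surrogate pair).
utf16_units = struct.unpack(">2H", ch.encode("utf-16-be"))

# UTF-32-BE stores the same code point as a single 32-bit unit.
utf32_units = struct.unpack(">1I", ch.encode("utf-32-be"))

# Escaping unit by unit, as a pure-ASCII intermediate form:
escaped16 = "".join("\\u%04x" % u for u in utf16_units)  # two \uXXXX escapes
escaped32 = "".join("\\U%08x" % u for u in utf32_units)  # one \UXXXXXXXX escape

print(escaped16)
print(escaped32)

# The single \U escape evaluates back to the original character.
roundtrip = escaped32.encode("ascii").decode("unicode_escape")
```

Going through UTF-16-BE yields two separate \u escapes for the two surrogates, while UTF-32-BE yields one \U escape for the whole code point.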
Incidentally, there's no point using the surrogatepass error handler
once we actually support surrogates.
Unfortunately there's a second problem in repr().
'\U0001010F'.isprintable() returns True on UTF-32 builds and False on
UTF-16 builds. This causes repr() to escape it unnecessarily on UTF-16
builds. repr() at least joins surrogate pairs before its internal
printable test (unlike .isprintable() or any other str method), but it
turns out all of the APIs in unicodectype.c only accept a single 16-bit
int in UTF-16 builds anyway. That'll be a bigger patch than the first part.
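For reference, joining a surrogate pair into a single code point is just arithmetic. This is a minimal sketch in Python, not the code repr() actually uses, and join_surrogates is a hypothetical helper name.

```python
def join_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # Each surrogate carries 10 bits of the offset above U+10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# The pair that UTF-16 uses for '\U0001010F':
code_point = join_surrogates(0xD800, 0xDD0F)
print(hex(code_point))

# A property lookup then needs to accept the full code point,
# not a single 16-bit unit.
print(chr(code_point).isprintable())
```

The catch described above is that the unicodectype.c lookups on a UTF-16 build take only a 16-bit value, so the joined code point has nowhere to go without changing those APIs.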
----------
_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue3297>
_______________________________________