[issue1552880] Unicode Imports
Kristján Valur Jónsson
report at bugs.python.org
Wed Sep 1 03:23:15 CEST 2010
Kristján Valur Jónsson <kristjan at ccpgames.com> added the comment:
I confess that I didn't follow the UTF-8/surrogate discussion.
But the UTF-8 encoding can encode all valid Unicode characters:
"UTF-8 may only legally be used to encode valid Unicode scalar values. According to the Unicode standard the high and low surrogate halves used by UTF-16 (U+D800 through U+DFFF) and values above U+10FFFF are not legal Unicode values, and the UTF-8 encoding of them is an invalid byte sequence and should be treated as described above." (from Wikipedia)
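To make the quoted rule concrete, here is a minimal sketch in Python 3 syntax (not the 2.x code under discussion). The helper name is_unicode_scalar is my own, purely for illustration. The second half shows that the strict UTF-8 codec in Python 3 refuses a lone surrogate half.

```python
def is_unicode_scalar(cp):
    """Return True if cp is a Unicode scalar value.

    Hypothetical helper: a scalar value is any code point in
    0..0x10FFFF that is not a surrogate half (0xD800..0xDFFF).
    """
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


# A lone surrogate half is not a scalar value, so the strict
# UTF-8 codec in Python 3 raises instead of emitting bytes.
try:
    "\ud800".encode("utf-8")
    strict_rejects = False
except UnicodeEncodeError:
    strict_rejects = True
```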
If we encounter surrogate halves when encoding (unicode) to UTF-8, it means that we are really trying to decode UTF-16 and re-encode it as UTF-8 (and that Python is using 16 bits for its unicode chars). The UTF-8 codec should be smart enough to merge the surrogates into a UTF-32 char, and encode that.
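A rough sketch of the merging step, written in Python 3 syntax for readability. This is not the patch itself and not what any shipped codec does; merge_surrogates is a hypothetical name, and leaving unpaired halves untouched is my own assumption about how to handle them.

```python
def merge_surrogates(text):
    """Combine UTF-16 surrogate pairs in text into single code points.

    Hypothetical helper. Unpaired surrogate halves are left as-is.
    """
    out = []
    i = 0
    while i < len(text):
        hi = ord(text[i])
        # A high surrogate followed by a low surrogate forms one pair.
        if 0xD800 <= hi <= 0xDBFF and i + 1 < len(text):
            lo = ord(text[i + 1])
            if 0xDC00 <= lo <= 0xDFFF:
                cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
                out.append(chr(cp))
                i += 2
                continue
        out.append(text[i])
        i += 1
    return "".join(out)


paired = "\ud83d\ude00"            # two surrogate halves for U+1F600
merged = merge_surrogates(paired)  # one code point
encoded = merged.encode("utf-8")   # four bytes: f0 9f 98 80
```

Encoding the merged string gives the same four bytes you would get from decoding the original UTF-16 data properly and encoding that as UTF-8.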
Anyway, as you remark, my approach is a _patch_, designed to make Python (2.x) work in a Unicode environment, with the least amount of code change, for those willing to commit such a patch. In 3.x you may want to do things differently.
----------
_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue1552880>
_______________________________________