[issue12742] Add support for CESU-8 encoding
Adal Chiriliuc
report at bugs.python.org
Mon Aug 29 13:50:12 CEST 2011
Adal Chiriliuc <adal.chiriliuc at gmail.com> added the comment:
It's an internal web API at the place I work for.
To be able to use it from Python in some form, I wrote a workaround that just strips everything outside the BMP:
# replace characters outside BMP with 'REPLACEMENT CHARACTER' (U+FFFD)
def cesu8_to_utf8(text):
    # text is a Python 2 byte string
    result = ""
    index = 0
    length = len(text)
    while index < length:
        if text[index] < "\xf0":
            # not the lead byte of a 4-byte sequence: copy it unchanged
            result += text[index]
            index += 1
        else:
            # lead byte of a 4-byte sequence: emit U+FFFD and skip all 4 bytes
            result += "\xef\xbf\xbd" # u"\ufffd".encode("utf8")
            index += 4
    return result
Now that I look at the workaround again, I'm not even sure it's about CESU-8 (it strips Unicode characters encoded as 4 bytes, not as pairs of 3-byte surrogates).
However, I can see why there would be little interest in adding this encoding.
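For comparison, here is a rough sketch of what converting real CESU-8 (each supplementary character stored as two 3-byte surrogate sequences) to UTF-8 might look like. It assumes Python 3.4 or later and bytes input, which is not what the workaround above targets. The function name cesu8_bytes_to_utf8 is made up for illustration. It relies on the standard "surrogatepass" error handler and is lenient: input that is already valid UTF-8 passes through unchanged.

```python
def cesu8_bytes_to_utf8(data):
    """Convert CESU-8 bytes to UTF-8 bytes (illustrative sketch, Python 3.4+)."""
    # 'surrogatepass' lets the UTF-8 codec accept the 3-byte surrogate
    # sequences and yields them as lone surrogate code points.
    text = data.decode("utf-8", "surrogatepass")
    # Round-tripping through UTF-16 joins each high/low surrogate pair
    # into the single supplementary code point it represents.
    # An unpaired surrogate makes the strict UTF-16 decode raise
    # UnicodeDecodeError.
    text = text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
    return text.encode("utf-8")


if __name__ == "__main__":
    # U+1F600 in CESU-8: surrogates D83D and DE00, 3 bytes each.
    cesu8 = b"\xed\xa0\xbd\xed\xb8\x80"
    print(cesu8_bytes_to_utf8(cesu8))
```

Unlike the workaround, this keeps the supplementary characters instead of replacing them with U+FFFD.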
----------
_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue12742>
_______________________________________