[issue2857] add codec for java modified utf-8
Tom Christiansen
report at bugs.python.org
Fri Aug 12 04:41:16 CEST 2011
Tom Christiansen <tchrist at perl.com> added the comment:
Please do not call this "utf-8-java". It is called "cesu-8" per UTR#26 at:
http://unicode.org/reports/tr26/
CESU-8 is *not* a valid Unicode Transformation Format and should not be called UTF-8. It is a real pain in the butt, caused by people who misunderstand Unicode mis-encoding UCS-2 into UTF-8, screwing it up. I understand the need to be able to read it, but call it what it is, please.
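To make the difference concrete, here is a minimal sketch, assuming Python 3.1 or later for the "surrogatepass" error handler. The function name encode_cesu8 is hypothetical and is not an existing codec. It encodes each UTF-16 code unit separately, so a supplementary character becomes two 3-byte sequences instead of the single 4-byte sequence real UTF-8 produces. Java's modified UTF-8 additionally encodes U+0000 as the two bytes C0 80, which this sketch does not do.

```python
def encode_cesu8(text: str) -> bytes:
    """Illustrative CESU-8 encoder (hypothetical helper, not a stdlib codec)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP code points are encoded exactly as in UTF-8.
            out += ch.encode("utf-8", "surrogatepass")
        else:
            # Supplementary code points are first split into a UTF-16
            # surrogate pair, then each surrogate is encoded on its own
            # as a 3-byte sequence.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            for unit in (high, low):
                out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


# U+10400 in real UTF-8: one 4-byte sequence.
utf8_bytes = "\U00010400".encode("utf-8")   # b'\xf0\x90\x90\x80'
# The same character in CESU-8: two 3-byte sequences (6 bytes).
cesu8_bytes = encode_cesu8("\U00010400")    # b'\xed\xa0\x81\xed\xb0\x80'
```

For text confined to the Basic Multilingual Plane the two encodings produce identical bytes, which is why the mismatch often goes unnoticed until supplementary characters appear.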
Despite the talk about Lucene, I note that the Perl port of Lucene uses real UTF-8, not CESU-8.
----------
nosy: +tchrist
_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue2857>
_______________________________________