[issue5127] UnicodeEncodeError - I can't even see license
Ezio Melotti
report at bugs.python.org
Mon Oct 5 13:16:29 CEST 2009
Ezio Melotti <ezio.melotti at gmail.com> added the comment:
>> We might keep the old public API for compatibility, but it should be
>> clearly marked as broken for non-BMP scalar values.
> That has always been the case. UCS2 doesn't support surrogates.
> However, we have been slowly moving into the direction of making
> the UCS2 storage appear like UTF-16 to the Python programmer.
UCS2 died long ago. Is there any reason why we keep using a UCS2 that
"appears" like UTF-16 instead of real UTF-16?
> This process is not yet complete and will likely never complete
> since it must still be possible to create things like lone
> surrogates for processing purposes, so care has to be taken
> when using non-BMP code points on narrow builds.
I don't exactly know all the details of the current implementation, but
-- from what I understand reading this (correct me if I'm wrong) -- it
seems that the implementation is half-UCS2 to allow things like the
processing of lone surrogates and half-UTF16 (or UTF-16-compatible) to
work with surrogate pairs and hence with chars outside the BMP.
What are the use cases for processing lone surrogates? Wouldn't it be
better to use UTF-16 and disallow them (since they are illegal), and
possibly provide some other way to deal with them (if that's really needed)?
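
To make the distinction concrete, here is a minimal sketch of what I mean
by a surrogate pair versus a lone surrogate. It is not how the interpreter
stores strings; the helper names to_surrogate_pair and is_strict_utf16 are
made up for illustration, and it assumes a Python 3 interpreter, where the
strict "utf-16-le" codec refuses lone surrogates.

```python
import struct


def to_surrogate_pair(code_point):
    """Split a non-BMP code point into UTF-16 (high, low) surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("expected a code point outside the BMP")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


def is_strict_utf16(text):
    """Return True if text encodes as well-formed UTF-16.

    Hypothetical helper; relies on the strict codec rejecting
    lone surrogates (assumption: Python 3 behaviour).
    """
    try:
        text.encode("utf-16-le")
    except UnicodeEncodeError:
        return False
    return True


# The arithmetic agrees with what the codec produces for U+10400.
pair = to_surrogate_pair(0x10400)
units = struct.unpack("<2H", chr(0x10400).encode("utf-16-le"))
assert pair == units

# A paired surrogate is fine as UTF-16; a lone one is not.
assert is_strict_utf16("\U00010400")
assert not is_strict_utf16("\ud800")
```

If lone surrogates really have to round-trip, the "surrogatepass" error
handler (for example "\ud800".encode("utf-16-le", "surrogatepass")) is one
existing way to opt in explicitly instead of allowing them by default.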
----------
_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue5127>
_______________________________________