[Python-Dev] Multilingual programming article on the Red Hat Developer blog

Sat Sep 13 00:16:30 CEST 2014

Jim, Stephen:

It seems like we're off topic here, but to answer all as briefly as 
possible:

1. Java does not really have a Unicode type, and therefore none that 
validates. It has a String type that is a sequence of UTF-16 code units. 
There are some String and Character methods that deal with code points 
represented as int. I can put any 16-bit values I like in a String.
2. With proper accounting for indices, and as long as surrogates appear 
in pairs, I believe operations like find or endswith give correct 
answers about the Unicode text when applied to its UTF-16 form. This is 
an attractive implementation option, and mostly what we do.
3. I'm fixing some bugs where we get it wrong beyond the BMP, and the 
fix involves banning lone surrogates completely. At present you can't 
type them in literals, but you can sneak them in from Java.
4. I think (with Antoine) that if Jython supported PEP 383 byte 
smuggling, it would have to do it the same way as CPython, since the 
representation is visible to Python code. It's not impossible (I think), 
but it is messy. Some people are strongly against it.
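
To make points 2 to 4 concrete, here is a minimal sketch written for 
CPython 3.4 or later. It is not Jython's code: the helpers utf16_units, 
find_units and unit_index_to_code_point_index are hypothetical names 
invented for this illustration, and the search is deliberately naive. 
It shows the index accounting needed when searching UTF-16 code units, 
how a lone surrogate can produce a false match inside a pair, and how 
PEP 383 smuggling is visible as a lone surrogate in the string.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    # surrogatepass lets lone surrogates through, as a Java String would.
    data = text.encode("utf-16-le", "surrogatepass")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def find_units(haystack, needle):
    """Naive search over code units; returns a code-unit index or -1."""
    h, n = utf16_units(haystack), utf16_units(needle)
    for i in range(len(h) - len(n) + 1):
        if h[i:i + len(n)] == n:
            return i
    return -1


def unit_index_to_code_point_index(units, unit_index):
    """Convert a code-unit index to a code-point index.

    Assumes surrogates appear only in well-formed pairs, so each low
    surrogate before the index is the second half of one code point.
    """
    low = sum(1 for u in units[:unit_index] if 0xDC00 <= u <= 0xDFFF)
    return unit_index - low


# Point 2: with paired surrogates, the search is right once indices
# are adjusted for characters beyond the BMP.
text = "a\U0001F600bc"
units = utf16_units(text)
unit_pos = find_units(text, "bc")
cp_pos = unit_index_to_code_point_index(units, unit_pos)

# Point 3: a lone surrogate as the needle matches half of a pair in
# UTF-16, although the code-point string does not contain it.
false_hit = find_units("\U0001F600", "\ude00")
true_answer = "\U0001F600".find("\ude00")

# Point 4: PEP 383 smuggles an undecodable byte as a lone surrogate,
# which is visible in the resulting str and round-trips on encoding.
smuggled = b"\xff".decode("utf-8", "surrogateescape")
round_trip = smuggled.encode("utf-8", "surrogateescape")
```

Here unit_pos counts UTF-16 code units while cp_pos agrees with 
text.find("bc"), and false_hit differs from true_answer, which is the 
kind of wrong answer that banning lone surrogates avoids.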

Jeff Allen

On 12/09/2014 16:37, Jim J. Jewett wrote:
>
>
> On September 11, 2014, Jeff Allen wrote:
>
>> ... "surrogateescape" is an error handler, not a codec.
> True, but I believe that is a CPython implementation detail.
>
> Other implementations (including jython) should implement the
> surrogateescape API, but I don't think it is important to use the
> same internal representation for the invalid bytes.
>
>> lone surrogates preclude a naive use of the platform string library
> Invalid input often causes problems.  Are you saying that there are
> situations where the platform string library could easily handle
> invalid characters in general, but has a problem with the specific
> case of lone surrogates?
>