[Python-Dev] Multilingual programming article on the Red Hat Developer blog

Mon Sep 15 20:35:01 CEST 2014

On Sat Sep 13 00:16:30 CEST 2014, Jeff Allen wrote:

> 1. Java does not really have a Unicode type, therefore not one that 
> validates. It has a String type that is a sequence of UTF-16 code units. 
> There are some String methods and Character methods that deal with code 
> points represented as int. I can put any 16-bit values I like in a String.

Including lone surrogates, and invalid characters in general?

> 2. With proper accounting for indices, and as long as surrogates appear 
> in pairs, I believe operations like find or endswith give correct 
> answers about the unicode, when applied to the UTF-16. This is an 
> attractive implementation option, and mostly what we do.

So use it.  The fact that you're having to smuggle bytes already
guarantees that your data is either invalid or misinterpreted, so
bug-free behavior isn't possible.

In terms of best-effort, it is reasonable to treat the smuggled bytes
as representing a character outside of your unicode repertoire -- so
it won't ever match entirely valid strings, except perhaps via a
wildcard.  And it should still work for
   .endswith(<the same invalid characters>).
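
As a rough illustration of that behavior, here is a small sketch using
CPython 3's "surrogateescape" error handler (the PEP 383 mechanism) as a
stand-in; the variable names are only illustrative, and Jython's internals
may well look different.

```python
# Bytes that are not valid UTF-8 (here, Latin-1 for "cafe" with an e-acute).
raw = b"caf\xe9"

# PEP 383: the undecodable byte 0xE9 is smuggled as the lone surrogate U+DCE9.
smuggled = raw.decode("utf-8", errors="surrogateescape")

# The same invalid tail, decoded the same way.
bad_tail = b"\xe9".decode("utf-8", errors="surrogateescape")

# The smuggled byte never matches an entirely valid string ...
matches_valid_text = smuggled == "caf\u00e9"        # False
ends_with_valid_e = smuggled.endswith("\u00e9")     # False

# ... but it still matches the same invalid characters.
ends_with_same_bad = smuggled.endswith(bad_tail)    # True

# And the original bytes can be recovered unchanged.
round_trips = smuggled.encode("utf-8", errors="surrogateescape") == raw  # True
```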

> 3. I'm fixing some bugs where we get it wrong beyond the BMP, and the 
> fix involves banning lone surrogates (completely).  At present you can't 
> type them in literals but you can sneak them in from Java.

So how will you ban them, and what will you do when some Java class
sends you an invalid sequence anyhow?  That is exactly the use case
for these smuggled bytes...

If you distinguish between a fully constructed PyString and a 
code-unit-sequence-that-could-be-made-into-a-PyString-later,
then you could always have your constructor return an InvalidPyString
subclass on the rare occasions when one is needed.
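
A minimal sketch of that idea, written in Python for brevity and assuming
the raw input arrives as a sequence of UTF-16 code units (ints).  The names
make_pystring and has_lone_surrogate are hypothetical, and this PyString is
a toy stand-in rather than Jython's actual class.

```python
class PyString:
    """Toy stand-in for a fully constructed string (hypothetical)."""

    def __init__(self, code_units):
        self.code_units = tuple(code_units)


class InvalidPyString(PyString):
    """Same storage, but flagged as containing a lone surrogate."""


def has_lone_surrogate(code_units):
    """Return True if any surrogate is not part of a high+low pair."""
    i, n = 0, len(code_units)
    while i < n:
        unit = code_units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed directly by a low one.
            if i + 1 < n and 0xDC00 <= code_units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return True
        if 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            return True
        i += 1
    return False


def make_pystring(code_units):
    """Factory: pick the subclass only on the rare invalid input."""
    units = list(code_units)
    cls = InvalidPyString if has_lone_surrogate(units) else PyString
    return cls(units)
```

Code that cares about validity can then check isinstance(s, InvalidPyString);
everything else keeps treating the result as an ordinary PyString.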

If you want to avoid invalid surrogates even then, just use the
replacement character and keep a separate list of "original
characters that got replaced in this string" -- a hassle, but no
worse than tracking indices for surrogates.
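
Roughly what I have in mind, sketched on a CPython 3 str (where any
surrogate code point is necessarily a lone one).  The function names and
the (index, original) side list are assumptions made for illustration,
not an existing API.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def replace_lone_surrogates(text):
    """Return (cleaned_text, replaced), where replaced lists (index, original)."""
    cleaned = []
    replaced = []
    for index, ch in enumerate(text):
        if 0xD800 <= ord(ch) <= 0xDFFF:
            # Remember what was here, then substitute U+FFFD.
            replaced.append((index, ch))
            cleaned.append(REPLACEMENT)
        else:
            cleaned.append(ch)
    return "".join(cleaned), replaced


def restore(cleaned, replaced):
    """Put the original characters back, e.g. before returning data to Java."""
    chars = list(cleaned)
    for index, ch in replaced:
        chars[index] = ch
    return "".join(chars)
```

The cleaned string is always valid, so ordinary operations (including
encoding to UTF-8) work on it; the side list is consulted only when the
original data has to be reproduced exactly.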

> 4. I think (with Antoine) if Jython supported PEP-383 byte smuggling, it 
> would have to do it the same way as CPython, as it is visible. It's not 
> impossible (I think), but is messy. Some are strongly against.

If you allow direct write access to the underlying charsequence
(as CPython does to C extensions), then you can't really ban
invalid sequences.  If callers have to go through an API -- even
something as minimal as  getBytes or getChars -- then you can use
whatever internal representation you prefer.  Hopefully, the vast
majority of strings won't actually have smuggled bytes.

-jJ

--

If there are still threading problems with my replies, please
email me with details, so that I can try to resolve them.  -jJ