Python's handling of unicode surrogates

Fri Apr 20 20:21:46 EDT 2007

> I don't believe this specific variant has been discussed.

Now that you have clarified it: no, it hasn't been discussed. I don't
find that surprising; the proposal is so strange and unnatural that
probably nobody dared to suggest it.

> s[5] does not exist.  You would get an IndexError indicating that it
> refers to the second half of a surrogate.
> 
[...]
> 
> len(s[k]) would be 2 if it involved a surrogate, yes.  One character,
> two code units.

Please consider the trade-offs. Study the advantages and disadvantages,
and compare them. Can you then seriously suggest that indexing should
have 'holes'? That accessing an index between 0 and len(s) should raise
an IndexError?
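
To make the 'holes' concrete, here is a minimal sketch of the quoted
proposal. The class name HoleyStr and the sample string are hypothetical
and exist only for illustration; this is not how any Python string type
behaves. It stores UTF-16 code units and refuses to index the second
half of a surrogate pair.

```python
import struct


class HoleyStr:
    """Hypothetical model of the quoted proposal; not a real Python type."""

    def __init__(self, text):
        # Store UTF-16 code units, so each non-BMP character takes two slots.
        data = text.encode("utf-16-le")
        self.units = list(struct.unpack("<%dH" % (len(data) // 2), data))

    def __len__(self):
        # Length is counted in code units, not characters.
        return len(self.units)

    def __getitem__(self, i):
        if not 0 <= i < len(self.units):
            raise IndexError("index out of range")
        unit = self.units[i]
        if 0xDC00 <= unit <= 0xDFFF:
            # The 'hole': a valid position that cannot be indexed.
            raise IndexError("index refers to the second half of a surrogate")
        if 0xD800 <= unit <= 0xDBFF:
            # One character, two code units.
            return self.units[i:i + 2]
        return self.units[i:i + 1]


# Assumed sample: four BMP characters, one non-BMP character, one BMP character.
s = HoleyStr("abcd\U00010400e")
for k in range(len(s)):
    try:
        print(k, len(s[k]))
    except IndexError as exc:
        print(k, "IndexError:", exc)
```

With this sample, index 5 lies strictly between 0 and len(s) and still
raises IndexError, so a plain loop over range(len(s)) no longer works
without a try/except.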

If you really think that support for non-BMP characters is necessary
in every program, then suggesting that Python use UCS-4 by default on
all systems has a better chance of finding acceptance than this proposal.
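
For comparison, a small sketch of what the UCS-4 alternative means for
indexing. The helper name is_wide_build and the sample string are
assumptions made for illustration. sys.maxunicode reports the largest
code point a string can hold in a single index.

```python
import sys


def is_wide_build():
    # 0x10FFFF when any code point fits in one index; 0xFFFF on a
    # narrow (UTF-16) build, where non-BMP characters need two.
    return sys.maxunicode > 0xFFFF


# Assumed sample: the same kind of string with one non-BMP character.
s = "abcd\U00010400e"

if is_wide_build():
    # Every index from 0 to len(s) - 1 is valid and yields one character.
    chars = [s[i] for i in range(len(s))]
    print(len(s), len(chars))
```

Here indexing has no holes: each index yields exactly one character,
at the cost of more memory per character.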

Regards,
Martin