Python's handling of unicode surrogates
"Martin v. Löwis"
martin at v.loewis.de
Fri Apr 20 20:21:46 EDT 2007
> I don't believe this specific variant has been discussed.
Now that you clarify it: no, it hasn't been discussed. I find that
not surprising - this proposal is so strange and unnatural that
probably nobody dared to suggest it.
> s[5] does not exist. You would get an IndexError indicating that it
> refers to the second half of a surrogate.
>
[...]
>
> len(s[k]) would be 2 if it involved a surrogate, yes. One character,
> two code units.
Please consider trade-offs. Study advantages and disadvantages. Compare
them. Can you then seriously suggest that indexing should have 'holes'?
That accessing s[k] with an index between 0 and len(s) can raise
an IndexError?
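
To make the objection concrete, here is a minimal sketch of the indexing
behaviour described in the quoted text. The class name HoleyString and its
tuple-of-code-units return value are hypothetical, invented only for this
illustration; nothing like it exists in Python. It assumes a string stored
as UTF-16 code units.

```python
import struct


class HoleyString:
    """Hypothetical model of the quoted proposal; not a real Python type."""

    def __init__(self, text):
        data = text.encode("utf-16-le")
        # Store one integer per 16-bit UTF-16 code unit.
        self.units = list(struct.unpack("<%dH" % (len(data) // 2), data))

    def __len__(self):
        # Length is counted in code units, as in the proposal.
        return len(self.units)

    def __getitem__(self, k):
        # Negative indices are not supported in this sketch.
        if not 0 <= k < len(self.units):
            raise IndexError("index out of range")
        unit = self.units[k]
        if 0xDC00 <= unit <= 0xDFFF:
            # The "hole": k points at the second half of a surrogate pair.
            raise IndexError("index %d is the second half of a surrogate" % k)
        if 0xD800 <= unit <= 0xDBFF:
            # One character, two code units.
            return tuple(self.units[k:k + 2])
        return (unit,)


s = HoleyString("abcd\U00010400e")   # U+10400 lies outside the BMP
print(len(s))       # 7 code units
print(len(s[4]))    # 2: the surrogate pair (0xD801, 0xDC00)
try:
    s[5]
except IndexError as exc:
    print("s[5] fails:", exc)
print(s[6])         # (101,), the code unit for "e"
```

Here len(s) is 7, yet s[5] raises even though 0 <= 5 < len(s), so a plain
loop over range(len(s)) no longer works without special-casing.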
If you really think that support for non-BMP characters is necessary
in every program, then suggesting that Python use UCS-4 by default on
all systems has a better chance of being accepted than this proposal.
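
As a rough illustration of why UCS-4 avoids the problem, the sketch below
counts code units for the same text in both encodings. It assumes a Python
version (3.3 or later) in which len() of a str counts code points; the
variable names are chosen only for this example.

```python
text = "a\U00010400"   # one BMP character plus one non-BMP character

# UTF-16 needs a surrogate pair for U+10400, so 3 code units for 2 characters.
utf16_units = len(text.encode("utf-16-le")) // 2

# UCS-4 (UTF-32) uses exactly one 32-bit code unit per character.
utf32_units = len(text.encode("utf-32-le")) // 4

# Assumption: Python 3.3+, where len() counts code points.
code_points = len(text)

print(utf16_units, utf32_units, code_points)   # 3 2 2
```

With one code unit per character, every index from 0 to len(s) - 1 is
valid and no holes arise.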
Regards,
Martin