Python's handling of unicode surrogates

Fri Apr 20 02:44:53 EDT 2007

(Sorry for the dupe, Martin.  Gmail made it look like your reply was
in private.)

On 4/19/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> > Thoughts, from all you readers out there?  For/against?
>
> See PEP 261. These things have all been discussed at that time,
> and an explicit decision against what I think (*) your proposal is
> was taken. If you want to, you can try to revert that
> decision, but you would need to write a PEP.

I don't believe this specific variant has been discussed.  The change
I propose would make indexes non-contiguous, making unicode
technically not a sequence.  I say that's a case for "practicality
beats purity".

Of course I'd appreciate any clarification before I bring it to
python-3000.

> Regards,
> Martin
>
> (*) I don't fully understand your proposal. You say that you
> want "gaps in [the string's] index", but I'm not sure what
> that means. If you have a surrogate pair on index 4, would
> it mean that s[5] does not exist, or would it mean that
> s[5] is the character following the surrogate pair? Is
> there any impact on the length of the string? Could it be
> that len(s[k]) is 2 for some values of s and k?

s[5] does not exist.  You would get an IndexError indicating that it
refers to the second half of a surrogate.

The length of the string will not be changed.  s[s.find(sub):] will
not be changed, so long as sub is a well-formed unicode string.
Nothing that properly handles unicode surrogates will be changed.
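
A minimal sketch of why s[s.find(sub):] is unaffected, assuming the
string is modelled as a list of UTF-16 code units.  The helpers
code_units, find_units and is_gap are hypothetical names used only for
illustration (a modern Python 3 str indexes by code point, so a list
of ints stands in for a narrow-build unicode object):

```python
def code_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def is_gap(units, k):
    """True if index k is the low half of a surrogate pair (a 'gap')."""
    return (k > 0
            and 0xDC00 <= units[k] <= 0xDFFF
            and 0xD800 <= units[k - 1] <= 0xDBFF)


def find_units(haystack, needle):
    """Naive code-unit search; returns the first match index or -1."""
    n = len(needle)
    for i in range(len(haystack) - n + 1):
        if haystack[i:i + n] == needle:
            return i
    return -1


s = code_units("ab\U00100000cd\U00100000ef")
sub = code_units("\U00100000e")      # well-formed: starts on a high surrogate
start = find_units(s, sub)
tail = s[start:]                     # slicing from a match is still fine
```

A well-formed sub begins with either an ordinary code unit or a high
surrogate, so a match can never start on a gap index.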

len(s[k]) would be 2 if it involved a surrogate, yes.  One character,
two code units.
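
The indexing rules above can be sketched as follows.  GappedString is
a hypothetical class, not an existing Python type; it stores UTF-16
code units and returns each item as a tuple of code units so that the
"one character, two code units" case is visible:

```python
def code_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def is_high(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_low(unit):
    return 0xDC00 <= unit <= 0xDFFF


class GappedString:
    """Hypothetical model of the proposal: length counts code units,
    but the second half of a surrogate pair is not a valid index."""

    def __init__(self, text):
        self.units = code_units(text)

    def __len__(self):
        # Length is unchanged: still the number of code units.
        return len(self.units)

    def __getitem__(self, k):
        units = self.units
        if k < 0:
            k += len(units)
        if not 0 <= k < len(units):
            raise IndexError("string index out of range")
        if is_low(units[k]) and k > 0 and is_high(units[k - 1]):
            raise IndexError(
                "index %d is the second half of a surrogate pair" % k)
        if is_high(units[k]) and k + 1 < len(units) and is_low(units[k + 1]):
            return tuple(units[k:k + 2])   # one character, two code units
        return (units[k],)


s = GappedString("abcd\U00100000e")
pair = s[4]                 # the whole surrogate pair
try:
    s[5]
    gap_error = None
except IndexError as exc:
    gap_error = str(exc)
```

Here the surrogate pair sits at index 4, so s[5] raises IndexError
while s[6] is the character following the pair.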

The only code that will be changed is that which doesn't handle
surrogates properly.  Some will start working properly.  Some (e.g.
random.choice(u'\U00100000\uFFFF')) will fail explicitly (rather than
silently).
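
To illustrate the random.choice case: random.choice picks an index
below len(seq) and returns seq[index], so under the proposed rule one
of the three possible indexes would raise.  The sketch below assumes
UTF-16 code units and uses the hypothetical helpers code_units and
is_gap; it enumerates the indexes rather than calling random.choice,
so the outcome is deterministic:

```python
def code_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def is_gap(units, k):
    """True if index k is the low half of a surrogate pair."""
    return (k > 0
            and 0xDC00 <= units[k] <= 0xDFFF
            and 0xD800 <= units[k - 1] <= 0xDBFF)


units = code_units("\U00100000\uFFFF")   # high surrogate, low surrogate, U+FFFF

# What each index random.choice could pick would do under the proposal.
outcomes = ["IndexError" if is_gap(units, k) else "ok"
            for k in range(len(units))]
```

Today the middle index silently yields a lone low surrogate; with
gaps it would be reported as an error instead.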

--
Adam Olsen, aka Rhamphoryncus