Python's handling of unicode surrogates

Rhamphoryncus rhamph at gmail.com
Fri Apr 20 22:34:34 EDT 2007


On Apr 20, 6:21 pm, "Martin v. Löwis" <mar... at v.loewis.de> wrote:
> > I don't believe this specific variant has been discussed.
>
> Now that you clarify it: no, it hasn't been discussed. I find that
> not surprising - this proposal is so strange and unnatural that
> probably nobody dared to suggest it.

Difficult problems sometimes need unexpected solutions.

That said, Guido seems to be relenting slightly on the O(1) indexing
requirement, so maybe we'll end up with an O(log n) solution (where n
is the number of surrogates, not the length of the string).
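
For illustration only, here is one way an O(log n) lookup could work.
It is a sketch, not anything that has been proposed or implemented:
the names CodePointIndex and to_units are made up, it simulates UTF-16
code units on top of a modern Python 3 str, and it assumes the text is
well formed (no lone surrogates).  The idea is to keep a sorted list of
the positions of non-BMP characters and bisect it, so the cost depends
on the number of surrogate pairs rather than on the string length.

```python
from bisect import bisect_left


def to_units(text):
    """Split a str into a list of UTF-16 code units (ints)."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


class CodePointIndex:
    """Hypothetical wrapper: code-point indexing over UTF-16 storage."""

    def __init__(self, text):
        self.units = to_units(text)
        # Sorted code-point positions of characters that need a
        # surrogate pair.  This is the only extra state kept.
        self.pairs = [i for i, ch in enumerate(text) if ord(ch) > 0xFFFF]

    def __len__(self):
        # Each surrogate pair is two code units but one character.
        return len(self.units) - len(self.pairs)

    def __getitem__(self, i):
        # Negative indices are not handled in this sketch.
        if not 0 <= i < len(self):
            raise IndexError("index out of range")
        # Every pair before position i shifts the code-unit offset by one.
        start = i + bisect_left(self.pairs, i)
        width = 2 if 0xD800 <= self.units[start] <= 0xDBFF else 1
        raw = b"".join(u.to_bytes(2, "little")
                       for u in self.units[start:start + width])
        return raw.decode("utf-16-le")


s = CodePointIndex("a\U00010000b\U0001F600c")
print(len(s), [s[i] == ch for i, ch in enumerate("a\U00010000b\U0001F600c")])
```

Building the list of pairs is still O(length), but it only has to be
done once per string; each lookup afterwards is a single bisect.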


> > s[5] does not exist.  You would get an IndexError indicating that it
> > refers to the second half of a surrogate.
>
> [...]
>
> > len(s[k]) would be 2 if it involved a surrogate, yes.  One character,
> > two code units.
>
> Please consider trade-offs. Study advantages and disadvantages. Compare
> them. Can you then seriously suggest that indexing should have 'holes'?
> That it will be an IndexError if you access with an index between 0
> and len(s)???????

If you pick an index at random you can get an IndexError.  If you
calculate the index using some combination of len, find, index, rfind,
and rindex, you will be unaffected by my change.  You can even assume
the length of a character, so long as you know it fits in 16 bits
(i.e. any '\uxxxx' escape).
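
A minimal sketch of the behaviour described above, again simulated on
a modern Python 3 str.  The class HoleyString, the helper utf16_units
and the sample string are all hypothetical, and well-formed input is
assumed.  Indices count UTF-16 code units, indexing the second half of
a surrogate pair raises IndexError, and find returns offsets that are
always safe to index with.

```python
def utf16_units(text):
    """Split a str into a list of UTF-16 code units (ints)."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


class HoleyString:
    """Hypothetical string type indexed by code unit, with 'holes'."""

    def __init__(self, text):
        self.units = utf16_units(text)

    def __len__(self):
        # Length is counted in code units, not characters.
        return len(self.units)

    def __str__(self):
        raw = b"".join(u.to_bytes(2, "little") for u in self.units)
        return raw.decode("utf-16-le")

    def __getitem__(self, i):
        # Negative indices are not handled in this sketch.
        if not 0 <= i < len(self.units):
            raise IndexError("index out of range")
        unit = self.units[i]
        if 0xDC00 <= unit <= 0xDFFF:
            raise IndexError(
                "index %d refers to the second half of a surrogate" % i)
        # A high surrogate pulls in its partner: one character, two units.
        width = 2 if 0xD800 <= unit <= 0xDBFF else 1
        piece = HoleyString("")
        piece.units = self.units[i:i + width]
        return piece

    def find(self, sub):
        # Naive search; returns a code-unit offset, or -1 if not found.
        needle = utf16_units(sub)
        for start in range(len(self.units) - len(needle) + 1):
            if self.units[start:start + len(needle)] == needle:
                return start
        return -1


s = HoleyString("abcd\U00010000e")
print(len(s))             # 7 code units for 6 characters
print(len(s[4]))          # 2: one character, two code units
print(s.find("e"))        # 6, which is safe to index with
try:
    s[5]
except IndexError as exc:
    print("IndexError:", exc)
```

Because the search string is itself well formed, a match can never
start on the second half of a pair, so offsets obtained from find
(and, by the same reasoning, from index, rfind and rindex in a fuller
version) never land in a hole.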

I'm unaware of any practical use cases that would be harmed by my
change, so that leaves only philosophical issues.  Considering the
difficulty of the problem, it seems like an okay trade-off to me.


> If you absolutely think support for non-BMP characters is necessary
> in every program, suggesting that Python use UCS-4 by default on
> all systems has a higher chance of finding acceptance (in comparison).

I wish to write software that supports Unicode.  Like it or not,
Unicode goes beyond the BMP, so I'd be lying if I claimed to support
Unicode while handling only the BMP.

--
Adam Olsen, aka Rhamphoryncus



