Python's handling of unicode surrogates

Sun Apr 22 15:48:21 EDT 2007

On Apr 20, 7:34 pm, Rhamphoryncus <rha... at gmail.com> wrote:
> On Apr 20, 6:21 pm, "Martin v. Löwis" <mar... at v.loewis.de> wrote:
> > If you absolutely think support for non-BMP characters is necessary
> > in every program, suggesting that Python use UCS-4 by default on
> > all systems has a higher chance of finding acceptance (in comparison).
>
> I wish to write software that supports Unicode.  Like it or not,
> Unicode goes beyond the BMP, so I'd be lying if I said I supported
> Unicode if I only handled the BMP.

Having the ability to iterate over code points doesn't mean you support
Unicode. For example, if you want to determine whether a string is one
word and you iterate over code points calling isalpha, you'll get an
incorrect result in some cases in some languages (to back up this
claim: it doesn't work in Russian, at least. Russian uses U+0301
COMBINING ACUTE ACCENT, which is not part of the alphabet but is an
element of the Russian writing system).
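
To make the isalpha problem concrete, here is a small sketch. It assumes
a current Python 3 interpreter, where str is already indexed by code
point; the sample word is my own choice and not taken from any real
program.

```python
import unicodedata

# "замок" written with an explicit stress mark on the second vowel.
# U+0301 is a separate code point that follows the letter it modifies.
word = "замо\u0301к"

# Inspect each code point: every one is a letter except the accent.
for ch in word:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} isalpha={ch.isalpha()}")

# The whole-string check fails because of the combining mark,
# even though a reader sees a single word.
print(word.isalpha())      # False
print("замок".isalpha())   # True (same word without the stress mark)
```

So a per-code-point isalpha check rejects a perfectly ordinary word.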

IMHO what is really needed is a bunch of high-level methods like:
.graphemes() - iterate over graphemes
.codepoints() - iterate over codepoints
.isword() - check if the string represents one word
etc...
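
A rough sketch of what two of these could look like, written as free
functions in current Python 3 rather than as methods on the string
type. The names graphemes and isword mirror the list above, but the
bodies are my own simplification: a grapheme is taken to be a base
character plus any following characters with a non-zero canonical
combining class. Real grapheme segmentation (Unicode Standard Annex
#29) has many more rules, so treat this as an illustration only.

```python
import unicodedata


def graphemes(text):
    """Yield approximate grapheme clusters: a base character followed by
    any combining marks (simplified; not the full UAX #29 algorithm)."""
    cluster = ""
    for ch in text:
        # A non-combining character starts a new cluster.
        if cluster and unicodedata.combining(ch) == 0:
            yield cluster
            cluster = ""
        cluster += ch
    if cluster:
        yield cluster


def isword(text):
    """Hypothetical check: non-empty, and every cluster starts with a letter."""
    clusters = list(graphemes(text))
    return bool(clusters) and all(c[0].isalpha() for c in clusters)


word = "замо\u0301к"            # six code points, five graphemes
print(list(graphemes(word)))
print(isword(word), isword("два слова"))
```

With this, the stressed vowel and its accent travel together as one
unit, and the one-word check no longer trips over the accent.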

Then you can actually support all Unicode characters in a UTF-16 build
of Python. Just make all existing unicode methods (except
unicode.__iter__) iterate over code points. Changing __iter__
to iterate over code points would make indexing weird, because
iteration would no longer agree with len() and s[i]. When the
programmer is *ready* to support Unicode he/she will explicitly
call .codepoints() or .graphemes(). As they say: You can lead
a horse to water, but you can't make it drink.
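
For illustration, here is roughly what .codepoints() would have to do
on a UTF-16 build: walk the 16-bit code units and combine each
surrogate pair into one code point. The sketch runs on current
Python 3 and simulates the code units by encoding to UTF-16; the
helper names utf16_units and codepoints are made up for this example.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def codepoints(units):
    """Yield code points from UTF-16 code units, joining surrogate pairs.
    An unpaired surrogate is passed through unchanged."""
    i, n = 0, len(units)
    while i < n:
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Combine high and low surrogate into one non-BMP code point.
            yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield unit
            i += 1


units = utf16_units("a\U0001D11E")   # 'a' plus U+1D11E MUSICAL SYMBOL G CLEF
print([hex(u) for u in units])       # three code units
print([hex(cp) for cp in codepoints(units)])  # two code points
```

Indexing and len() keep working on code units, while the explicit
call gives the programmer whole code points when asked for.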

  -- Leo