break unichr instead of fix ord?

Nobody nobody at nowhere.com
Sun Aug 30 12:42:40 EDT 2009


On Sun, 30 Aug 2009 06:54:21 +0200, Dieter Maurer wrote:

>> What you propose would break the property "unichr(i) always returns
>> a string of length one, if it returns anything at all".
> 
> But getting a "ValueError" in some builds (and not in others)
> is rather worse than getting unicode strings of different length....

Not necessarily. If the code assumes that unichr() always returns a
single-character string, it will silently produce bogus results when
unichr() returns a pair of surrogates. An exception is usually preferable
to silently producing bad data.

If unichr() returns a surrogate pair, what is e.g. unichr(i).isalpha()
supposed to do?
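
To make the question concrete, here is a rough sketch. It assumes Python 3, where chr() has replaced unichr() and always returns a single character, so the surrogate pair is built by hand. The helper name to_surrogate_pair is made up for illustration and is not part of any library.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return chr(high), chr(low)

whole = chr(0x1D400)                  # MATHEMATICAL BOLD CAPITAL A
high, low = to_surrogate_pair(0x1D400)
pair = high + low

print(len(whole), whole.isalpha())    # 1 True
print(len(pair), pair.isalpha())      # 2 False
```

The single character is a letter, but neither half of the pair is, so any code that asks character-level questions about the two-element string gets a different answer.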

Using surrogates is fine in an external representation (UTF-16), but it
doesn't make sense as an internal representation.

Think: why do people use wchar_t[] rather than a char[] encoded in UTF-8?
Because a wchar_t[] allows you to index *characters*, which you can't do
with a multi-byte encoding. You can't do it with a multi-*word* encoding
either.

UCS-2 and UTF-16 are superficially so similar that people forget that
they're completely different beasts. UCS-2 is fixed-length, UTF-16 is
variable-length. This makes UTF-16 semantically much closer to UTF-8 than
to UCS-2 or UCS-4.
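
A quick way to see this, assuming Python 3 and its standard codecs, is to count code units per character. Python has no UCS-2 codec, so UTF-32 stands in for the fixed-length case here; code_units is just an illustrative helper.

```python
def code_units(ch, encoding, unit_size):
    """Number of code units of unit_size bytes needed to encode ch."""
    return len(ch.encode(encoding)) // unit_size

# One ASCII letter, two other BMP characters, one character outside the BMP.
samples = ("A", "\u00e9", "\u20ac", "\U0001d400")

for ch in samples:
    print("U+%04X" % ord(ch),
          code_units(ch, "utf-8", 1),
          code_units(ch, "utf-16-le", 2),
          code_units(ch, "utf-32-le", 4))
# U+0041 1 1 1
# U+00E9 2 1 1
# U+20AC 3 1 1
# U+1D400 4 2 1
```

UTF-16 looks fixed-length only as long as you never leave the BMP. The last row is where it behaves like UTF-8.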

If your wchar_t is 16 bits, the only sane solution is to forgo support
for characters outside of the BMP.

The alternative is to process wide strings in exactly the same way that
you process narrow (mbcs) strings; e.g. extracting character N requires
iterating over the string from the beginning until you have counted N-1
characters. This provides no benefit over using narrow strings except for
a slight performance gain from halving the number of iterations. You still
end up with indexing being O(n) rather than O(1).
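
As a sketch of what that scan looks like, assuming Python 3 and a string held as a list of 16-bit code units (utf16_units and char_at are hypothetical names, and the index is 0-based):

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def char_at(units, n):
    """Return the code point of character n by scanning from the start: O(n)."""
    i = 0
    count = 0
    while i < len(units):
        unit = units[i]
        is_pair = (0xD800 <= unit <= 0xDBFF and i + 1 < len(units)
                   and 0xDC00 <= units[i + 1] <= 0xDFFF)
        if is_pair:
            # Recombine the high and low surrogates into one code point.
            code_point = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            width = 2
        else:
            code_point = unit
            width = 1
        if count == n:
            return code_point
        count += 1
        i += width
    raise IndexError("character index out of range")

units = utf16_units("a\U0001d400b")
print(hex(units[2]))           # 0xdc00, a lone low surrogate
print(hex(char_at(units, 2)))  # 0x62, the real third character
```

Plain indexing into the code units is O(1) but gives the wrong answer as soon as a pair appears earlier in the string. Getting the right answer means walking the string.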



