RE Module Performance

Thu Jul 25 01:58:34 EDT 2013

On Thu, Jul 25, 2013 at 3:49 PM, Serhiy Storchaka <storchaka at gmail.com> wrote:
> 24.07.13 21:15, Chris Angelico wrote:
>
>> To my mind, exposing UTF-16
>> surrogates to the application is a bug to be fixed, not a feature to
>> be maintained.
>
>
> Python 3 uses code points from U+DC80 to U+DCFF (which are in the surrogate
> range) to represent undecodable bytes with the surrogateescape error handler.

That's a deliberate and conscious use of the codepoints; that's not
what I'm talking about here. Suppose you read a UTF-8 stream of bytes
from a file, and decode them into your language's standard string
type. At this point, you should be working with a string of Unicode
codepoints:

"\22\341\210\264\360\222\215\205"

-->

"\x12\u1234\U00012345"

The incoming byte stream has a length of 8, the resulting character
stream has a length of 3. Now, if the language wants to use UTF-16
internally, it's free to do so:

0012 1234 d808 df45
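
As a rough sketch of the example above in Python 3 (assuming 3.3 or
later, where str is indexed by code point; the variable names are only
illustrative):

```python
import struct

# The 8 UTF-8 bytes from the example, written with the same octal escapes.
data = b"\22\341\210\264\360\222\215\205"

# Decoding yields a string of 3 code points.
text = data.decode("utf-8")

# Re-encode as big-endian UTF-16 (no BOM) and split into 16-bit code units.
utf16 = text.encode("utf-16-be")
units = struct.unpack(">%dH" % (len(utf16) // 2), utf16)

print(len(data), len(text))                 # 8 3
print(" ".join("%04x" % u for u in units))  # 0012 1234 d808 df45
```

The last two code units are the surrogate pair that stands in for
U+12345.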

When I referred to exposing surrogates to the application, this is
what I'm talking about. If decoding the above byte stream results in a
length 4 string where the last two are \ud808 and \udf45, then it's
exposing them. If it's a length 3 string where the last is \U00012345,
then it's hiding them. To be honest, I don't imagine I'll ever see a
language that stores strings in UTF-16 and then exposes them to the
application as UTF-32; there's very little point. But such a design
*is* possible, and if the language is working closely with libraries
that demand UTF-16, it might well make sense to do things that way.
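
To make the difference concrete, here is a small sketch (again assuming
Python 3.3+; to_utf16_units is a hypothetical helper written for this
post, not a standard library function) that computes the UTF-16 code
units by hand and compares the two lengths an application might see:

```python
def to_utf16_units(text):
    """Return the UTF-16 code units for text as a list of ints."""
    units = []
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Outside the BMP: split into a high and a low surrogate.
            cp -= 0x10000
            units.append(0xD800 + (cp >> 10))
            units.append(0xDC00 + (cp & 0x3FF))
        else:
            units.append(cp)
    return units

text = "\x12\u1234\U00012345"
units = to_utf16_units(text)

hidden_length = len(text)    # what a surrogate-hiding language reports
exposed_length = len(units)  # what a surrogate-exposing language reports

print(hidden_length, exposed_length)  # 3 4
print([hex(u) for u in units[-2:]])   # ['0xd808', '0xdf45']
```

A language that hides surrogates can still keep the list of code units
internally; it just has to answer length and indexing questions in
terms of code points.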

ChrisA