RE Module Performance

Chris Angelico rosuav at gmail.com
Wed Jul 24 18:19:21 EDT 2013


On Thu, Jul 25, 2013 at 8:09 AM, Terry Reedy <tjreedy at udel.edu> wrote:
> On 7/24/2013 2:15 PM, Chris Angelico wrote:
>> To my mind, exposing UTF-16 surrogates to the application is a bug
>> to be fixed, not a feature to be maintained.
>
> It is definitely not a feature, but a proper UTF-16 implementation would not
> expose them except to codecs, just as with the PEP 393 implementation. (In
> both cases, I am excluding the sys size function as 'exposing to the
> application'.)
>
>> But since we can get the best of both worlds with only
>> a small amount of overhead, I really don't see why anyone should be
>> objecting.
>
> I presume you are referring to the PEP 393 1-2-4 byte implementation. Given
> how well it has been optimized, I think it was the right choice for Python.
> But a language that now uses UCS-2 or defective UTF-16 on all platforms might
> find the auxiliary array an easier fix.
>

I'm referring here to objections like jmf's, and also to threads like this:

http://mozilla.6506.n7.nabble.com/Flexible-String-Representation-full-Unicode-for-ES6-td267585.html

According to the ECMAScript people, exposing UTF-16 surrogates to the
application is a critical feature to be maintained. I disagree.
But it's not my language, so I'm stuck with it. (I ended up writing a
little wrapper function in C that detects unpaired surrogates, but
that still doesn't deal with the possibility that character indexing
can create a new character that was never there to start with.)
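
To make that concrete, here is a minimal Python sketch of that kind of
check. It is not the C wrapper itself; the names utf16_units and
find_unpaired_surrogates are made up for illustration, and it assumes
Python 3.4 or later (for "surrogatepass" with the UTF-16 codec).

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    # "surrogatepass" lets lone surrogates through instead of raising.
    data = text.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def find_unpaired_surrogates(units):
    """Return the indexes of surrogate code units that lack a partner."""
    bad = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: valid only if a low surrogate follows.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            bad.append(i)
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            bad.append(i)
        i += 1
    return bad


# Cutting a pair in half by code-unit index is detectable...
units = utf16_units("a\U0001F600b")      # 0x61, 0xD83D, 0xDE00, 0x62
broken = units[:2]                       # ends on a lone high surrogate

# ...but halves of two different pairs can form a valid pair for a
# character that was in neither original string.
first = utf16_units("\U00010000")        # 0xD800, 0xDC00
second = utf16_units("\U0001F600")       # 0xD83D, 0xDE00
spliced = [second[0], first[1]]          # 0xD83D, 0xDC00 -> U+1F400
```

Here find_unpaired_surrogates(broken) reports index 1, but
find_unpaired_surrogates(spliced) reports nothing, even though U+1F400
appeared in neither input. That is one reading of the "new character"
problem: a pairing check alone cannot catch it.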

ChrisA


