RE Module Performance

Steven D'Aprano steve+comp.lang.python at pearwood.info
Thu Jul 25 03:15:44 EDT 2013


On Thu, 25 Jul 2013 04:15:42 +1000, Chris Angelico wrote:

> If nobody had ever thought of doing a multi-format string
> representation, I could well imagine the Python core devs debating
> whether the cost of UTF-32 strings is worth the correctness and
> consistency improvements... and most likely concluding that narrow
> builds get abolished. And if any other language (eg ECMAScript) decides
> to move from UTF-16 to UTF-32, I would wholeheartedly support the move,
> even if it broke code to do so.

Unfortunately, so long as most language designers are European-centric,
there is going to be a lot of push-back against any attempt to fix (say)
JavaScript or Java just for the sake of "a bunch of dead languages" in
the SMPs. Thank goodness for emoji. Wait until the young kids start
complaining that their emoticons and emoji are broken in JavaScript, and
eventually it will get fixed. It may take a decade for the young kids to
grow up and take over JavaScript from the old codgers, but it will happen.


> To my mind, exposing UTF-16 surrogates
> to the application is a bug to be fixed, not a feature to be maintained.

This, times a thousand.

It is *possible* to have non-buggy string routines using UTF-16, but the 
implementation is a lot more complex than most language developers can be 
bothered with. I'm not aware of any language that uses UTF-16 internally 
that doesn't give wrong results for surrogate pairs.
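
To make the surrogate-pair problem concrete, here is a minimal sketch,
assuming Python 3.3 or later (where len() counts code points). The helper
names utf16_code_units and to_surrogates are made up for illustration;
they are not part of any library. The sketch shows one astral character
occupying two UTF-16 code units, which is what a naive UTF-16 length or
index routine would report:

```python
def utf16_code_units(s):
    """Return how many 16-bit code units s occupies in UTF-16."""
    # utf-16-le emits no byte order mark, so bytes // 2 == code units.
    return len(s.encode("utf-16-le")) // 2


def to_surrogates(code_point):
    """Split a code point above U+FFFF into its (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


face = "\U0001F600"  # an emoji outside the Basic Multilingual Plane

print(len(face))               # code points, as seen by Python 3.3+
print(utf16_code_units(face))  # code units, as seen by a naive UTF-16 API
print([hex(u) for u in to_surrogates(ord(face))])
```

A language that exposes the code units directly would report a length of
2 for that single character, and slicing between the two units would
leave a lone surrogate behind.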



-- 
Steven


