RE Module Performance

Thu Jul 25 03:58:10 EDT 2013

On Thu, Jul 25, 2013 at 5:15 PM, Steven D'Aprano
<steve+comp.lang.python at pearwood.info> wrote:
> On Thu, 25 Jul 2013 04:15:42 +1000, Chris Angelico wrote:
>
>> If nobody had ever thought of doing a multi-format string
>> representation, I could well imagine the Python core devs debating
>> whether the cost of UTF-32 strings is worth the correctness and
>> consistency improvements... and most likely concluding that narrow
>> builds get abolished. And if any other language (eg ECMAScript) decides
>> to move from UTF-16 to UTF-32, I would wholeheartedly support the move,
>> even if it broke code to do so.
>
> Unfortunately, so long as most language designers are European-centric,
> there is going to be a lot of push-back against any attempt to fix (say)
> Javascript, or Java just for the sake of "a bunch of dead languages" in
> the SMPs. Thank goodness for emoji. Wait til the young kids start
> complaining that their emoticons and emoji are broken in Javascript, and
> eventually it will get fixed. It may take a decade, for the young kids to
> grow up and take over Javascript from the old-codgers, but it will happen.

I don't know that that'll happen like that. Emoticons aren't broken in
Javascript - you can use them just fine. You only start seeing
problems when you index into that string. People will start to wonder
why, for instance, a "500 character maximum" field deducts two from
the limit when an emoticon goes in. Example:

Type here:<br><textarea id=content oninput="showlimit(this)"></textarea>
<br>You have <span id=limit1>500</span> characters left (self.value.length).
<br>You have <span id=limit2>500</span> characters left (self.textLength).
<script>
function showlimit(self)
{
	document.getElementById("limit1").innerHTML=500-self.value.length;
	document.getElementById("limit2").innerHTML=500-self.textLength;
}
</script>

I've included an attribute documented here[1] as the "codepoint length
of the control's value", but in Chrome on Windows, it still counts
UTF-16 code units. However, I very much doubt that this will result in
language changes. People will just live with it. Chinese and Japanese
users will complain, perhaps, and the developers will write it off as
whinging, and just say "That's what the internet does". Maybe, if
you're really lucky, they'll acknowledge that "that's what JavaScript
does", but even then I doubt it'd result in language changes.

>> To my mind, exposing UTF-16 surrogates
>> to the application is a bug to be fixed, not a feature to be maintained.
>
> This, times a thousand.
>
> It is *possible* to have non-buggy string routines using UTF-16, but the
> implementation is a lot more complex than most language developers can be
> bothered with. I'm not aware of any language that uses UTF-16 internally
> that doesn't give wrong results for surrogate pairs.

The problem isn't the underlying representation, the problem is what
gets exposed to the application. Once you've decided to expose
codepoints to the app (abstracting over your UTF-16 underlying
representation), the change to using UTF-32, or mimicking PEP 393, or
some other structure, is purely internal and an optimization. So I
doubt any language will use UTF-16 internally and UTF-32 to the app.
It'd be needlessly complex.

ChrisA

[1] https://developer.mozilla.org/en-US/docs/Web/API/HTMLTextAreaElement