RE Module Performance

Thu Jul 25 05:22:17 EDT 2013

On Thu, 25 Jul 2013 17:58:10 +1000, Chris Angelico wrote:

> On Thu, Jul 25, 2013 at 5:15 PM, Steven D'Aprano
> <steve+comp.lang.python at pearwood.info> wrote:
>> On Thu, 25 Jul 2013 04:15:42 +1000, Chris Angelico wrote:
>>
>>> If nobody had ever thought of doing a multi-format string
>>> representation, I could well imagine the Python core devs debating
>>> whether the cost of UTF-32 strings is worth the correctness and
>>> consistency improvements... and most likely concluding that narrow
>>> builds get abolished. And if any other language (eg ECMAScript)
>>> decides to move from UTF-16 to UTF-32, I would wholeheartedly support
>>> the move, even if it broke code to do so.
>>
>> Unfortunately, so long as most language designers are European-centric,
>> there is going to be a lot of push-back against any attempt to fix
>> (say) Javascript, or Java just for the sake of "a bunch of dead
>> languages" in the SMPs. Thank goodness for emoji. Wait til the young
>> kids start complaining that their emoticons and emoji are broken in
>> Javascript, and eventually it will get fixed. It may take a decade, for
>> the young kids to grow up and take over Javascript from the
>> old-codgers, but it will happen.
> 
> I don't know that that'll happen like that. Emoticons aren't broken in
> Javascript - you can use them just fine. You only start seeing problems
> when you index into that string. People will start to wonder why, for
> instance, a "500 character maximum" field deducts two from the limit
> when an emoticon goes in.

I get that. I meant *JavaScript developers*, not end-users. The young 
kids today who become JavaScript developers tomorrow will grow up in a 
world where they expect to be able to write band names like
"▼□■□■□■" (yes, really, I didn't make that one up) and have it just work.
Okay, all those characters are in the BMP, but most emoji aren't, and I 
guarantee that even as we speak some new hipster band is trying to decide 
whether to name themselves "Smiling 😢" or "Crying 😊".

:-)
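
A rough way to see the difference, assuming Python 3.3 or later (where
str is indexed by code point): compare the number of code points in each
name with the number of UTF-16 code units it needs. The helper name
utf16_units is made up for this illustration.

```python
def utf16_units(text):
    """Number of 16-bit code units needed to encode text as UTF-16."""
    # "utf-16-le" emits no byte order mark, so bytes // 2 is the unit count.
    return len(text.encode("utf-16-le")) // 2

# The band name from above: every character is in the BMP.
bmp_name = "\u25bc\u25a1\u25a0\u25a1\u25a0\u25a1\u25a0"
# "Crying " followed by U+1F60A, which lies outside the BMP.
astral_name = "Crying \U0001F60A"

# Print (code points, code units) for each name.
print(len(bmp_name), utf16_units(bmp_name))
print(len(astral_name), utf16_units(astral_name))
```

The two numbers agree for the first name and differ by one for the
second, which is exactly the "deducts two from the limit" effect.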

>> It is *possible* to have non-buggy string routines using UTF-16, but
>> the implementation is a lot more complex than most language developers
>> can be bothered with. I'm not aware of any language that uses UTF-16
>> internally that doesn't give wrong results for surrogate pairs.
> 
> The problem isn't the underlying representation, the problem is what
> gets exposed to the application. Once you've decided to expose
> codepoints to the app (abstracting over your UTF-16 underlying
> representation), the change to using UTF-32, or mimicking PEP 393, or
> some other structure, is purely internal and an optimization. So I doubt
> any language will use UTF-16 internally and UTF-32 to the app. It'd be
> needlessly complex.

To be honest, I don't understand what you are trying to say.

What I'm trying to say is that it is possible to use UTF-16 internally 
but *not* assume that every code point (character) is represented by a 
single 2-byte code unit. For example, the len() of a UTF-16 string should 
not be calculated by counting the number of bytes and dividing by two. You 
actually need to walk the string, inspecting each 2-byte code unit:

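# calculate length, validating surrogate pairs along the way
# (buffer is assumed to yield one 16-bit code unit at a time)
def is_high_surrogate(unit):  # leading half of a pair
    return 0xD800 <= unit <= 0xDBFF

def is_low_surrogate(unit):  # trailing half of a pair
    return 0xDC00 <= unit <= 0xDFFF

count = 0
inside_surrogate = False
for unit in buffer:
    if inside_surrogate:
        # a high surrogate must be followed by a low surrogate
        if not is_low_surrogate(unit):
            raise ValueError("missing low surrogate")
        inside_surrogate = False
        count += 1  # the whole pair is one code point
    elif is_high_surrogate(unit):
        inside_surrogate = True
    elif is_low_surrogate(unit):
        raise ValueError("missing high surrogate")
    else:
        count += 1
if inside_surrogate:
    raise ValueError("missing low surrogate")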

Given immutable strings, you could validate the string once, on creation, 
and from then on assume it is well-formed:

# calculate length, assuming the string is well-formed:
count = 0
skip = False
for unit in buffer:  # one 16-bit code unit at a time
    if skip:
        # second half of a surrogate pair, already counted
        skip = False
        continue
    if 0xD800 <= unit <= 0xDFFF:  # any surrogate starts a pair here
        skip = True
    count += 1
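
To try the well-formed version on real data, here is a minimal sketch,
assuming Python 3 and a little-endian UTF-16 buffer with no byte order
mark. The function name count_code_points is only for illustration.

```python
import struct

def count_code_points(data):
    """Count code points in well-formed UTF-16-LE bytes (no BOM)."""
    # Unpack the bytes into 16-bit little-endian code units.
    units = struct.unpack("<%dH" % (len(data) // 2), data)
    count = 0
    skip = False
    for unit in units:
        if skip:
            # Trailing half of a surrogate pair, already counted.
            skip = False
            continue
        if 0xD800 <= unit <= 0xDBFF:  # high (leading) surrogate
            skip = True
        count += 1
    return count

text = "Smiling \U0001F622"
data = text.encode("utf-16-le")
print(len(data) // 2)           # number of code units
print(count_code_points(data))  # number of code points
```

On Python 3.3 and later the second number should match len(text), since
str there is indexed by code point.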

String operations such as slicing become much more complex once you can 
no longer assume a 1:1 relationship between code points and code units. 
That is the situation with 1-byte (UTF-8) and 2-byte (UTF-16) code units; 
only 4-byte (UTF-32) units guarantee it. Most (all?) language developers 
don't handle that complexity, and push responsibility for it back onto 
the coder using the language.
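
As a sketch of the extra work slicing needs, assuming well-formed input
held as a list of 16-bit code units and non-negative indexes: a code
point index has to be converted to a code unit offset by walking from the
start, so each slice costs O(n) instead of O(1). The names unit_offset
and slice_code_points are hypothetical.

```python
def unit_offset(units, index):
    """Code unit offset of code point number `index` (well-formed input)."""
    offset = 0
    for _ in range(index):
        if offset >= len(units):
            break  # index past the end: clamp, as Python slicing does
        # A high surrogate means this code point occupies two units.
        offset += 2 if 0xD800 <= units[offset] <= 0xDBFF else 1
    return offset

def slice_code_points(units, start, stop):
    """Return the code units covering code points start..stop-1."""
    return units[unit_offset(units, start):unit_offset(units, stop)]

# "a", then U+1F60A as a surrogate pair, then "b"
units = [0x0061, 0xD83D, 0xDE0A, 0x0062]
print(slice_code_points(units, 1, 2))  # the whole pair, not half of it
print(units[1:2])                      # naive slice: a lone surrogate
```

A real implementation would probably cache offsets or keep an index to
avoid rescanning from the start each time, which is part of the
complexity most languages skip.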

-- 
Steven