RE Module Performance

Steven D'Aprano steve+comp.lang.python at pearwood.info
Thu Jul 25 22:48:17 EDT 2013


On Thu, 25 Jul 2013 15:45:38 -0500, Ian Kelly wrote:

> On Thu, Jul 25, 2013 at 12:18 PM, Steven D'Aprano
> <steve+comp.lang.python at pearwood.info> wrote:
>> On Fri, 26 Jul 2013 01:36:07 +1000, Chris Angelico wrote:
>>
>>> On Fri, Jul 26, 2013 at 1:26 AM, Steven D'Aprano
>>> <steve+comp.lang.python at pearwood.info> wrote:
>>>> On Thu, 25 Jul 2013 14:36:25 +0100, Jeremy Sanders wrote:
>>>>> "To conserve memory, Emacs does not hold fixed-length 22-bit numbers
>>>>> that are codepoints of text characters within buffers and strings.
>>>>> Rather, Emacs uses a variable-length internal representation of
>>>>> characters, that stores each character as a sequence of 1 to 5 8-bit
>>>>> bytes, depending on the magnitude of its codepoint[1]. For example,
>>>>> any ASCII character takes up only 1 byte, a Latin-1 character takes
>>>>> up 2 bytes, etc. We call this representation of text multibyte.
>>>>
>>>> Well, you've just proven what Vim users have always suspected: Emacs
>>>> doesn't really exist.
>>>
>>> ... lolwut?
>>
>>
>> JMF has explained that it is impossible, impossible I say!, to write an
>> editor using a flexible string representation. Since Emacs uses such a
>> flexible string representation, Emacs is impossible, and therefore
>> Emacs doesn't exist.
>>
>> QED.
> 
> Except that the described representation used by Emacs is a variant of
> UTF-8, not an FSR.  It doesn't have three different possible encodings
> for the letter 'a' depending on what other characters happen to be in
> the string.
> 
> As I understand it, JMF would be perfectly happy if Python used UTF-8
> (or presumably the Emacs variant) as its internal string representation.


UTF-8 uses a flexible representation on a character-by-character basis. 
When parsing UTF-8, you need to inspect the leading byte of every 
character to decide how many bytes to read. In Python 3.3, the flexible 
representation is on a string-by-string basis: once Python has looked at 
the string header, it can tell whether the *entire* string takes 1, 2 or 
4 bytes per character, and the string is then fixed-width. You can't do 
that with UTF-8.
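You can observe the per-string width directly in CPython 3.3 or later: growing a string by one character costs 1, 2 or 4 extra bytes, depending on the widest codepoint already in the string. A small sketch (sys.getsizeof is CPython-specific, so the exact figures are an implementation detail):

```python
import sys

def bytes_per_char(ch, n=1000):
    # Marginal memory cost of one more character in a CPython string.
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)

print(bytes_per_char("a"))           # ASCII character: 1 byte each
print(bytes_per_char("\u20ac"))      # Euro sign (BMP): 2 bytes each
print(bytes_per_char("\U00010348")) # astral-plane character: 4 bytes each
```

On CPython 3.3+ this prints 1, 2 and 4 respectively: the whole string is stored at the width of its widest character.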

To put it in terms of pseudo-code:

# Python 3.3
def parse_string(astring):
    # Decision gets made once per string.
    if astring uses 1 byte:
        count = 1
    elif astring uses 2 bytes:
        count = 2
    else: 
        count = 4
    while not done:
        char = convert(next(count bytes))


# UTF-8
def parse_string(astring):
    while not done:
        b = next(1 byte)
        # Decision gets made for every single char
        if uses 1 byte:
            char = convert(b)
        elif uses 2 bytes:
            char = convert(b, next(1 byte))
        elif uses 3 bytes:
            char = convert(b, next(2 bytes))
        else:
            char = convert(b, next(3 bytes))
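The UTF-8 branch above can be made concrete. Here is a minimal decoder sketch in real Python, with no validation of continuation bytes or overlong sequences; the point is that the leading byte of every sequence must be inspected before you know where the next character starts:

```python
def decode_utf8(data: bytes) -> str:
    # Naive UTF-8 decoder: assumes well-formed input, no error handling.
    chars = []
    i = 0
    while i < len(data):
        b = data[i]
        # Decision gets made for every single character:
        if b < 0x80:        # 1-byte sequence (ASCII)
            cp, n = b, 1
        elif b < 0xE0:      # 2-byte sequence
            cp, n = b & 0x1F, 2
        elif b < 0xF0:      # 3-byte sequence
            cp, n = b & 0x0F, 3
        else:               # 4-byte sequence
            cp, n = b & 0x07, 4
        # Each continuation byte contributes its low 6 bits.
        for cont in data[i + 1:i + n]:
            cp = (cp << 6) | (cont & 0x3F)
        chars.append(chr(cp))
        i += n
    return "".join(chars)
```

Note also that you cannot jump to character k without decoding the k-1 characters before it, which is exactly why random indexing into UTF-8 is O(n) rather than O(1).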


So UTF-8 requires much more runtime overhead than Python 3.3's 
representation, and Emacs's variation can in fact require more bytes per 
character than either. (UTF-8 and Python 3.3 require at most four bytes 
per character, Emacs up to five.) 
I'm not surprised that JMF would prefer UTF-8 -- he is completely out of 
his depth, and is a fine example of the Dunning-Kruger effect in action. 
He is so sure he is right based on so little evidence.

One advantage of UTF-8 is that for some BMP characters, you can get away 
with only three bytes instead of the four a fixed-width encoding would 
need. For transmitting data over the wire, or storage on disk, that's 
potentially up to a 25% reduction in space, which is not to be sneezed 
at. (Although in practice it's usually much less than that, since the 
most common characters are encoded to 1 or 2 bytes, not 4.) But that 
comes at the cost of much more runtime overhead, which in my opinion 
makes UTF-8 a second-class string representation compared to fixed-width 
representations.
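The encoded sizes are easy to check. For example, the euro sign is a BMP character that UTF-8 stores in three bytes, versus four in a fixed 32-bit encoding:

```python
s = "\u20ac"  # EURO SIGN, a BMP character
print(len(s.encode("utf-8")))      # 3 bytes in UTF-8
print(len(s.encode("utf-32-le")))  # 4 bytes in a fixed-width encoding
print(len("a".encode("utf-8")))    # common ASCII characters need only 1
```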



-- 
Steven


