RE Module Performance

Steven D'Aprano steve+comp.lang.python at pearwood.info
Fri Jul 26 23:37:20 EDT 2013


On Thu, 25 Jul 2013 21:20:45 -0600, Ian Kelly wrote:

> On Thu, Jul 25, 2013 at 8:48 PM, Steven D'Aprano
> <steve+comp.lang.python at pearwood.info> wrote:
>> UTF-8 uses a flexible representation on a character-by-character basis.
>> When parsing UTF-8, one needs to look at EVERY character to decide how
>> many bytes you need to read. In Python 3, the flexible representation
>> is on a string-by-string basis: once Python has looked at the string
>> header, it can tell whether the *entire* string takes 1, 2 or 4 bytes
>> per character, and the string is then fixed-width. You can't do that
>> with UTF-8.
> 
> UTF-8 does not use a flexible representation.

I disagree, and so does Jeremy Sanders who first pointed out the 
similarity between Emacs' UTF-8 and Python's FSR. I'll quote from the 
Emacs documentation again:

"To conserve memory, Emacs does not hold fixed-length 22-bit numbers that
are codepoints of text characters within buffers and strings. Rather,
Emacs uses a variable-length internal representation of characters, that
stores each character as a sequence of 1 to 5 8-bit bytes, depending on
the magnitude of its codepoint. For example, any ASCII character takes
up only 1 byte, a Latin-1 character takes up 2 bytes, etc."

And the Python FSR:

"To conserve memory, Python does not hold fixed-length 21-bit numbers that
are codepoints of text characters within buffers and strings. Rather,
Python uses a variable-length internal representation of characters, that
stores each character as a sequence of 1, 2 or 4 8-bit bytes, depending on
the magnitude of the largest codepoint in the string. For example, any 
all-ASCII or all-Latin1 string takes up only 1 byte per character, an all-
BMP string takes up 2 bytes per character, etc."

See the similarity now? Both flexibly change the width used by code
points: UTF-8 based on the code point itself, regardless of the rest of
the string; Python based on the largest code point in the string.
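
For anyone who wants to see the two behaviours side by side, here is a
rough sketch. It assumes CPython 3.3 or later (sys.getsizeof reports an
implementation detail, not a language guarantee), and bytes_per_char is a
helper name I made up for the illustration. Subtracting the sizes of two
strings of different length cancels out the fixed object header.

```python
import sys

def bytes_per_char(unit, n=1000):
    """Estimate CPython's storage per character for a string built by
    repeating `unit`. The difference between the 2n-repeat and n-repeat
    sizes cancels the fixed object header (hypothetical helper)."""
    big = sys.getsizeof(unit * (2 * n))
    small = sys.getsizeof(unit * n)
    return (big - small) // (n * len(unit))

samples = {
    "ASCII": "a",
    "Latin-1": "\u00e9",       # e with acute accent
    "BMP": "\u20ac",           # euro sign
    "astral": "\U0001F600",    # a code point outside the BMP
}

for name, ch in samples.items():
    utf8_width = len(ch.encode("utf-8"))   # decided per character
    fsr_width = bytes_per_char(ch)         # decided per string
    print(f"{name:8} UTF-8: {utf8_width}   FSR: {fsr_width}")

# A single astral character widens every character in the string.
mixed_width = bytes_per_char("a" * 9 + "\U0001F600", n=100)
print("nine ASCII + one astral, FSR:", mixed_width)
```

The UTF-8 column depends only on the character; the FSR column depends on
the widest character anywhere in the string.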


[...]
> Anyway, my point was just that Emacs is not a counter-example to jmf's
> claim about implementing text editors, because UTF-8 is not what he (or
> anybody else) is referring to when speaking of the FSR or "something
> like the FSR".

Whether JMF can see the similarities between different implementations of 
strings or not is beside the point; those similarities do exist. As do 
the differences, of course, but in this case the differences are in 
favour of Python's FSR. Even if your string is entirely Latin1, a UTF-8 
implementation *cannot know that*, and still has to walk the string byte-
by-byte checking whether the current code point requires 1, 2, 3, or 4 
bytes, while a FSR implementation can simply record the fact that the 
string is pure Latin1 at creation time, and then treat it as fixed-width 
from then on.
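
To make the indexing difference concrete, here is a deliberately naive
sketch, not how Emacs or CPython actually implement it. The function
names (utf8_width, utf8_index, fixed_index) are invented for the example.
Finding the n-th code point in raw UTF-8 means inspecting each lead byte
on the way there, while a fixed-width buffer only needs a multiplication.

```python
def utf8_width(lead):
    """Number of bytes in the UTF-8 sequence that starts with `lead`."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")

def utf8_index(data, n):
    """Return the n-th code point of UTF-8 bytes by walking from the start."""
    pos = 0
    for _ in range(n):
        pos += utf8_width(data[pos])   # must look at every character
    width = utf8_width(data[pos])
    return data[pos:pos + width].decode("utf-8")

def fixed_index(data, n, width):
    """Return the n-th code point of a fixed-width little-endian buffer."""
    chunk = data[n * width:(n + 1) * width]   # offset is plain arithmetic
    return chr(int.from_bytes(chunk, "little"))

text = "na\u00efve caf\u00e9"              # pure Latin-1 text
as_utf8 = text.encode("utf-8")
as_latin1 = text.encode("latin-1")         # 1 byte per character

for i, expected in enumerate(text):
    assert utf8_index(as_utf8, i) == expected
    assert fixed_index(as_latin1, i, 1) == expected
```

Both give the same answers; the difference is that utf8_index does work
proportional to n, while fixed_index does not.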

JMF claims that FSR is "impossible" to use efficiently, and yet he 
supports encoding schemes which are *less* efficient. Go figure. He tells 
us he has no problem with any of the established UTF encodings, and yet 
the FSR internally uses UTF-16 and UTF-32. (Technically, it's UCS-2, not 
UTF-16, since there are no surrogate pairs. But the difference is 
insignificant.)
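
A quick check of the "no surrogate pairs" point, assuming Python 3.3 or
later: a code point outside the BMP is a single character in a str, but
two 16-bit code units once encoded as real UTF-16.

```python
ch = "\U0001F600"                                # outside the BMP
utf16_units = len(ch.encode("utf-16-le")) // 2   # 16-bit code units

print(len(ch), utf16_units)                      # prints: 1 2
```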

Having watched this issue from Day One when JMF first complained about 
it, I believe this is entirely about denying any benefit to ASCII users. 
Had Python implemented a system identical to the current FSR except that 
it added a fourth category, "all ASCII", which used an eight-byte 
encoding scheme (thus making ASCII strings twice as expensive as strings 
including code points from the supplementary planes), JMF 
would be the scheme's number one champion.

I cannot see any other rational explanation for why JMF prefers broken, 
buggy Unicode implementations, or implementations which are equally 
expensive for all strings, over one which is demonstrably correct, 
demonstrably saves memory, and for realistic, non-contrived benchmarks, 
demonstrably faster, except that he wants to punish ASCII users more than 
he wants to support Unicode users.


-- 
Steven


