RE Module Performance

Ian Kelly ian.g.kelly at gmail.com
Thu Jul 25 23:20:45 EDT 2013


On Thu, Jul 25, 2013 at 8:48 PM, Steven D'Aprano
<steve+comp.lang.python at pearwood.info> wrote:
> UTF-8 uses a flexible representation on a character-by-character basis.
> When parsing UTF-8, one needs to look at EVERY character to decide how
> many bytes you need to read. In Python 3, the flexible representation is
> on a string-by-string basis: once Python has looked at the string header,
> it can tell whether the *entire* string takes 1, 2 or 4 bytes per
> character, and the string is then fixed-width. You can't do that with
> UTF-8.

UTF-8 does not use a flexible representation.  A codec that is
encoding a string in UTF-8 and examining a particular character has
no choice in how to encode that character; there is exactly one
sequence of bytes that is the UTF-8 encoding of that character.
Further, for any given sequence of code points there is exactly one
sequence of bytes that is the UTF-8 encoding of those code points.  In
contrast, with the FSR there are as many as three different sequences
of bytes (at 1, 2 or 4 bytes per code point) that can encode a given
sequence of code points, with one of them (the shortest) being
canonical.  That's what makes it flexible.
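
To make the distinction concrete, here is a rough sketch. The helper names
(utf8_widths, fsr_width, fixed_width_forms) are made up for illustration
and are not part of CPython. It also assumes that latin-1, utf-16-le and
utf-32-le are fair stand-ins for the 1-, 2- and 4-byte forms; the real
in-memory layout uses native byte order and has a header, so treat this
as a model of the idea and not as the actual implementation.

```python
def utf8_widths(text):
    """Number of UTF-8 bytes used for each code point in text."""
    return [len(ch.encode("utf-8")) for ch in text]


def fsr_width(text):
    """Smallest fixed width (1, 2 or 4 bytes per code point) that fits text."""
    highest = max(map(ord, text), default=0)
    if highest < 0x100:
        return 1
    if highest < 0x10000:
        return 2
    return 4


def fixed_width_forms(text):
    """Map each fixed width that can hold text to the bytes of that form.

    Assumption: little-endian codecs stand in for the in-memory layout.
    Strings containing lone surrogates are not handled by this sketch.
    """
    forms = {}
    for width, codec in ((1, "latin-1"), (2, "utf-16-le"), (4, "utf-32-le")):
        if fsr_width(text) <= width:
            forms[width] = text.encode(codec)
    return forms


sample = "a\u00e9\u20ac\U0001F600"

# UTF-8: the width varies per code point, but each encoding is forced.
print(utf8_widths(sample))               # [1, 2, 3, 4]

# Fixed-width forms: one width for the whole string, and up to three
# candidate widths, of which the narrowest is the canonical one.
print(sorted(fixed_width_forms("abc")))  # [1, 2, 4]
print(fsr_width("abc"))                  # 1
print(fsr_width(sample))                 # 4
```

For "abc" all three forms are valid encodings of the same code points and
the 1-byte form is the canonical one, whereas "abc".encode("utf-8") can
only ever produce one result.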

Anyway, my point was just that Emacs is not a counter-example to jmf's
claim about implementing text editors, because UTF-8 is not what he
(or anybody else) is referring to when speaking of the FSR or
"something like the FSR".


