RE Module Performance

wxjmfauth at gmail.com wxjmfauth at gmail.com
Fri Jul 26 09:36:49 EDT 2013


On Friday, 26 July 2013 05:20:45 UTC+2, Ian wrote:
> On Thu, Jul 25, 2013 at 8:48 PM, Steven D'Aprano
> <steve+comp.lang.python at pearwood.info> wrote:
> > UTF-8 uses a flexible representation on a character-by-character basis.
> > When parsing UTF-8, one needs to look at EVERY character to decide how
> > many bytes you need to read. In Python 3, the flexible representation is
> > on a string-by-string basis: once Python has looked at the string header,
> > it can tell whether the *entire* string takes 1, 2 or 4 bytes per
> > character, and the string is then fixed-width. You can't do that with
> > UTF-8.
>
> UTF-8 does not use a flexible representation.  A codec that is
> encoding a string in UTF-8 and examining a particular character does
> not have any choice of how to encode that character; there is exactly
> one sequence of bits that is the UTF-8 encoding for the character.
> Further, for any given sequence of code points there is exactly one
> sequence of bytes that is the UTF-8 encoding of those code points.  In
> contrast, with the FSR there are as many as three different sequences
> of bytes that encode a sequence of code points, with one of them (the
> shortest) being canonical.  That's what makes it flexible.
>
> Anyway, my point was just that Emacs is not a counter-example to jmf's
> claim about implementing text editors, because UTF-8 is not what he
> (or anybody else) is referring to when speaking of the FSR or
> "something like the FSR".
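
To make the two senses of "flexible" in the quoted exchange concrete, here is a minimal Python sketch. It is not how CPython stores strings internally: the 1-, 2- and 4-byte fixed-width layouts are only emulated with the standard latin-1, utf-16-le and utf-32-le codecs, and the helper names are made up for illustration.

```python
# Sketch only: emulates fixed-width layouts with ordinary codecs.
# CPython's real internal storage is not exposed this way.

def utf8_widths(text):
    """Bytes needed by each character in UTF-8 (varies per character)."""
    return [len(ch.encode("utf-8")) for ch in text]

def fixed_width_layouts(text):
    """Map bytes-per-code-point -> bytes for every width that can hold text."""
    layouts = {}
    for width, codec in ((1, "latin-1"), (2, "utf-16-le"), (4, "utf-32-le")):
        try:
            data = text.encode(codec)
        except UnicodeEncodeError:
            continue  # some code point does not fit in this width
        if len(data) == width * len(text):  # rejects utf-16 surrogate pairs
            layouts[width] = data
    return layouts

def canonical_width(text):
    """Narrowest width that fits the whole string (the 'shortest' form)."""
    return min(fixed_width_layouts(text))

print(utf8_widths("a\u00e9\u20ac\U0001F600"))   # [1, 2, 3, 4]
print(sorted(fixed_width_layouts("abc")))       # [1, 2, 4]
print(canonical_width("a\u20ac"))               # 2
```

The first line of output shows the per-character variation in UTF-8; the second shows the "as many as three" fixed-width byte sequences for the same code points, of which the narrowest is the canonical one.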

--------


BTW, it is not necessary to use an endorsed Unicode coding
scheme (utf*); a string literal would also have been possible,
but then one runs into memory issues.

All these utf's follow the same basic coding scheme.

I repeat.
A coding scheme works with a unique set of characters,
and its implementation works with a unique set of
encoded code points (the utf's, in the case of Unicode).
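
As a small illustration of this point, here is a sketch using only Python's standard codecs. The sample string and the helper name are chosen here purely for illustration. One sequence of code points gives, under each utf, exactly one encoded sequence, and decoding it returns the same code points.

```python
def encoded_forms(text):
    """Map each utf to the hex form of the bytes it produces for text."""
    forms = {}
    for codec in ("utf-8", "utf-16-be", "utf-32-be"):
        data = text.encode(codec)
        # Round trip: the encoded form decodes back to the same code points.
        assert data.decode(codec) == text
        forms[codec] = data.hex()
    return forms

sample = "a\u20ac"  # LATIN SMALL LETTER A, EURO SIGN
print([hex(ord(ch)) for ch in sample])  # ['0x61', '0x20ac']
for codec, hexed in encoded_forms(sample).items():
    print(codec, hexed)
# utf-8 61e282ac
# utf-16-be 006120ac
# utf-32-be 00000061000020ac
```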

And again, that's why we live today with all these coding
schemes or, to take the problem from the other side,
it is because one has to work with a unique set of
encoded code points that all these coding schemes had to
be created.

The utf's were not created by newbies ;-)

jmf



