RE Module Performance

wxjmfauth at gmail.com wxjmfauth at gmail.com
Fri Jul 26 11:46:58 EDT 2013


On Friday, 26 July 2013 05:20:45 UTC+2, Ian wrote:
> On Thu, Jul 25, 2013 at 8:48 PM, Steven D'Aprano
> <steve+comp.lang.python at pearwood.info> wrote:
> > UTF-8 uses a flexible representation on a character-by-character basis.
> > When parsing UTF-8, one needs to look at EVERY character to decide how
> > many bytes you need to read. In Python 3, the flexible representation is
> > on a string-by-string basis: once Python has looked at the string header,
> > it can tell whether the *entire* string takes 1, 2 or 4 bytes per
> > character, and the string is then fixed-width. You can't do that with
> > UTF-8.
>
> UTF-8 does not use a flexible representation.  A codec that is
> encoding a string in UTF-8 and examining a particular character does
> not have any choice of how to encode that character; there is exactly
> one sequence of bits that is the UTF-8 encoding for the character.
> Further, for any given sequence of code points there is exactly one
> sequence of bytes that is the UTF-8 encoding of those code points.  In
> contrast, with the FSR there are as many as three different sequences
> of bytes that encode a sequence of code points, with one of them (the
> shortest) being canonical.  That's what makes it flexible.
>
> Anyway, my point was just that Emacs is not a counter-example to jmf's
> claim about implementing text editors, because UTF-8 is not what he
> (or anybody else) is referring to when speaking of the FSR or
> "something like the FSR".

-----

Let's be clear. I understand perfectly well what UTF-8 is,
and it is for precisely that reason that I put the "editor"
on the table as an example.

The FSR is not *a* coding scheme. It is more of a composite
coding scheme. (And from there come all the problems.)
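
As a rough illustration of the per-string widths discussed in the quoted
text, here is a minimal sketch. It assumes CPython 3.3 or later, where
sys.getsizeof reflects the internal string storage; the helper name
bytes_per_char is only illustrative and is not part of any library.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate the internal storage per character for a string made of ch."""
    # Comparing two repeat lengths cancels out the fixed object header.
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

# Columns: character, UTF-8 length in bytes, internal bytes per character.
for ch in ("a", "\u20ac", "\U0001F600"):
    print(repr(ch), len(ch.encode("utf-8")), bytes_per_char(ch))
```

Under that assumption the last column should come out as 1, 2 and 4,
while the UTF-8 lengths are 1, 3 and 4. Other Python implementations
may store strings differently.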

BTW, I'm pleased to read "sequence of bits" and not bytes.
Again, UTF transformers produce sequences of bits,
called Unicode Transformation Units, with lengths of
8/16/32 *bits*, hence the names UTF-8/16/32.
UCS transformers produce (produced) bytes, hence
the names UCS-2/4.
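
To make the 8/16/32-bit unit sizes concrete, here is a small sketch
using Python's standard codecs. The helper name code_units is only
illustrative; it simply counts how many units of the given width the
encoded form occupies.

```python
def code_units(text, bits):
    """Count the units of the given width in the UTF-8/16/32 form of text."""
    # The "-le" codecs are used so that no byte order mark is emitted.
    codec = {8: "utf-8", 16: "utf-16-le", 32: "utf-32-le"}[bits]
    return len(text.encode(codec)) * 8 // bits

# Columns: character, then unit counts for UTF-8, UTF-16 and UTF-32.
for ch in ("a", "\u00e9", "\u20ac", "\U0001F600"):
    print(repr(ch), [code_units(ch, bits) for bits in (8, 16, 32)])
```

For U+1F600, for example, this gives 4, 2 and 1 units respectively,
while "a" takes a single unit in all three forms.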

jmf





More information about the Python-list mailing list