A few questions about encoding

wxjmfauth at gmail.com wxjmfauth at gmail.com
Sun Jun 23 11:51:41 EDT 2013


On Thursday, 20 June 2013 19:17:12 UTC+2, MRAB wrote:
> On 20/06/2013 17:37, Chris Angelico wrote:
> > On Fri, Jun 21, 2013 at 2:27 AM,  <wxjmfauth at gmail.com> wrote:
> >> And all these coding schemes have something in common:
> >> they all work with a unique set of code points, more
> >> precisely a unique set of encoded code points (not
> >> the set of implemented code points (bytes)).
> >>
> >> That is just what the flexible string representation is not
> >> doing: it artificially divides Unicode into subsets and tries
> >> to handle each subset differently.
> >>
> >
> > UTF-16 divides Unicode into two subsets: BMP characters (encoded using
> > one 16-bit unit) and astral characters (encoded using two 16-bit units
> > in the D800::/5 netblock, or equivalent thereof). Your beloved narrow
> > builds are guilty of exactly the same crime as the hated 3.3.
> >
> UTF-8 divides Unicode into subsets which are encoded in 1, 2, 3, or 4
> bytes, and those who previously used ASCII still need only 1 byte per
> codepoint!

Sorry, but no, it does not work that way. This confuses
the set of encoded code points with the implementation
of those code points, the code units.

UTF-8: how many bytes to hold an "a" in memory?
One byte.
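
For anyone who wants to check the UTF-8 figures, here is a minimal sketch. The helper name utf8_length and the sample characters are only illustrative choices, not anything from the thread.

```python
# Sample characters are hypothetical, picked to cover 1-, 2-, 3- and
# 4-byte UTF-8 sequences.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]


def utf8_length(ch):
    """Return the number of bytes UTF-8 uses to encode one character."""
    return len(ch.encode("utf-8"))


for ch in samples:
    # Print the code point in hex rather than the character itself, so the
    # output does not depend on the terminal encoding.
    print("U+%04X -> %d byte(s)" % (ord(ch), utf8_length(ch)))
```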

Flexible string representation: how many bytes to
hold an "a" in memory? One byte? No, two.
(Funny, it consumes more memory to hold an ASCII char
than ASCII itself.)
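
A figure like this can be measured on whatever interpreter is at hand. The sketch below assumes CPython 3.3 or later. The helper name bytes_per_char is hypothetical. Note that sys.getsizeof also counts the fixed object header, so the sketch compares two strings that differ by one character in order to isolate the per-character increment from that header.

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate the storage one more copy of `ch` adds to a string.

    Hypothetical helper: it subtracts the size of an n-character string
    from the size of an (n + 1)-character string, so the fixed header
    reported by sys.getsizeof cancels out.
    """
    longer = sys.getsizeof(ch * (n + 1))
    shorter = sys.getsizeof(ch * n)
    return longer - shorter


for ch in ["a", "\u20ac", "\U0001F600"]:
    print("U+%04X: %d byte(s) per additional character"
          % (ord(ch), bytes_per_char(ch)))
```

The numbers are an implementation detail of the interpreter that runs the code and may differ between versions and builds.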


UTF-8: in a series of bytes implementing the encoded code
points of a string, picking a byte and finding which
encoded code point it belongs to is no problem.
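
The UTF-8 case can be sketched as follows, assuming the input is valid UTF-8. The function names are hypothetical. The sketch relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx, so stepping backwards over them reaches the lead byte.

```python
def code_point_start(data, i):
    """Return the index of the lead byte of the code point containing data[i].

    Assumes `data` is valid UTF-8. Continuation bytes match 10xxxxxx,
    so we step back until we reach a byte that is not one.
    """
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i


def code_point_at(data, i):
    """Return the character whose encoding contains the byte data[i]."""
    start = code_point_start(data, i)
    lead = data[start]
    # The lead byte tells us how long the sequence is.
    if lead < 0x80:
        length = 1
    elif lead >> 5 == 0b110:
        length = 2
    elif lead >> 4 == 0b1110:
        length = 3
    else:
        length = 4
    return data[start:start + length].decode("utf-8")


data = "a\u20acb".encode("utf-8")  # 1 byte + 3 bytes + 1 byte
for i in range(len(data)):
    print(i, "U+%04X" % ord(code_point_at(data, i)))
```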

Flexible string representation: in a series of bytes
implementing the encoded code points of a string,
picking a byte and finding which encoded code
point it belongs to is ... impossible!

This is one of the causes of the poor behaviour of this
flexible string representation.

These are the basics of any coding scheme, Unicode included.

jmf


