FSR and unicode compliance - was Re: RE Module Performance

Sun Jul 28 15:23:04 EDT 2013

On Sunday, 28 July 2013 17:52:47 UTC+2, Michael Torrie wrote:
> On 07/27/2013 12:21 PM, wxjmfauth at gmail.com wrote:
> > Good point. FSR, nice tool for those who wish to teach
> > Unicode. It is not every day, one has such an opportunity.
>
> I had a long e-mail composed, but decided to chop it down, but still too
> long.  so I ditched a lot of the context, which jmf also seems to do.
> Apologies.
>
> 1. FSR *is* UTF-32 so it is as unicode compliant as UTF-32, since UTF-32
> is an official encoding.  FSR only differs from UTF-32 in that the
> padding zeros are stripped off such that it is stored in the most
> compact form that can handle all the characters in string, which is
> always known at string creation time.  Now you can argue many things,
> but to say FSR is not unicode compliant is quite a stretch!  What
> unicode entities or characters cannot be stored in strings using FSR?
> What sequences of bytes in FSR result in invalid Unicode entities?
>
> 2. strings in Python *never change*.  They are immutable.  The +
> operator always copies strings character by character into a new string
> object, even if Python had used UTF-8 internally.  If you're doing a lot
> of string concatenations, perhaps you're using the wrong data type.  A
> byte buffer might be better for you, where you can stuff utf-8 sequences
> into it to your heart's content.
>
> 3. UTF-8 and UTF-16 encodings, being variable width encodings, mean that
> slicing a string would be very very slow, and that's unacceptable for
> the use cases of python strings.  I'm assuming you understand big O
> notation, as you talk of experience in many languages over the years.
> FSR and UTF-32 both are O(1) for slicing and lookups.  UTF-8, 16 and any
> variable-width encoding are always O(n).  A lot slower!
>
> 4. Unicode is, well, unicode.  You seem to hop all over the place from
> talking about code points to bytes to bits, using them all
> interchangeably.  And now you seem to be claiming that a particular byte
> encoding standard is by definition unicode (UTF-8).  Or at least that's
> how it sounds.  And also claim FSR is not compliant with unicode
> standards, which appears to me to be completely false.
>
> Is my understanding of these things wrong?
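
Point 3 of the quoted message can be made concrete with a small sketch.
It assumes the text is held as raw UTF-8 bytes with no auxiliary index;
the function name utf8_char_at is hypothetical and only illustrates why
reaching the n-th character means scanning from the start.

```python
def utf8_char_at(data, index):
    """Return the character at position `index` of UTF-8 encoded bytes.

    Illustrative only: every lookup walks the buffer from the beginning,
    because characters occupy between 1 and 4 bytes each.
    """
    if index < 0:
        raise IndexError("negative indexes are not handled in this sketch")
    count = -1      # index of the character currently being scanned
    start = None    # byte offset where the wanted character begins
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:        # not a continuation byte: new character
            count += 1
            if count == index:
                start = pos
            elif count == index + 1:   # next character begins: wanted one ended
                return data[start:pos].decode("utf-8")
    if start is not None:              # wanted character was the last one
        return data[start:].decode("utf-8")
    raise IndexError("index out of range")


encoded = "d\u20acx".encode("utf-8")   # 'd', EURO SIGN, 'x' -> 5 bytes
print(len(encoded))                    # 5
print(utf8_char_at(encoded, 1) == "\u20ac")   # True
```

With a fixed-width layout the same lookup is a single offset computation
(index times the width), which is the O(1) behaviour the quoted message
refers to.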

------

Compare these (a BDFL example, where I'm using a non-ASCII char)

Py 3.2 (narrow build)

>>> import sys, timeit
>>> timeit.timeit("a = 'hundred'; 'x' in a")
0.09897159682121348
>>> timeit.timeit("a = 'hundre€'; 'x' in a")
0.09079501961732461
>>> sys.getsizeof('d')
32
>>> sys.getsizeof('€')
32
>>> sys.getsizeof('dd')
34
>>> sys.getsizeof('d€')
34

Py 3.3

>>> import sys, timeit
>>> timeit.timeit("a = 'hundred'; 'x' in a")
0.12183182740848858
>>> timeit.timeit("a = 'hundre€'; 'x' in a")
0.2365732969632326
>>> sys.getsizeof('d')
26
>>> sys.getsizeof('€')
40
>>> sys.getsizeof('dd')
27
>>> sys.getsizeof('d€')
42

Tell me which one seems to be more "Unicode compliant"?
The goal of Unicode is to handle every char "equally".
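
For anyone who wants to repeat the comparison on their own interpreter,
here is a minimal sketch. The helper name compare and its parameters are
made up for this example; the statement being timed is the same one as in
the sessions above. Absolute numbers depend on the machine, the build and
the Python version, so no expected output is given.

```python
import sys
import timeit


def compare(ascii_text="hundred", mixed_text="hundre\u20ac", number=200000):
    """Time "'x' in a" and measure sys.getsizeof for two nearly equal strings.

    Returns a dict mapping a label to (best_seconds, size_in_bytes).
    """
    results = {}
    for label, text in (("ascii", ascii_text), ("non-ascii", mixed_text)):
        # Same statement as in the interactive sessions, built per string.
        stmt = "a = {!r}; 'x' in a".format(text)
        # Take the best of several runs to reduce timing noise.
        seconds = min(timeit.repeat(stmt, number=number, repeat=5))
        results[label] = (seconds, sys.getsizeof(text))
    return results


if __name__ == "__main__":
    print(sys.version)
    for label, (seconds, size) in sorted(compare().items()):
        print("{:10s} {:.4f} s  {} bytes".format(label, seconds, size))
```

Run it once under each interpreter you want to compare and put the two
outputs side by side.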

Now, the problem: memory. Do not forget that an FSR-style
mechanism is *irrelevant* for a non-ASCII user. As
soon as one uses a single non-ASCII char, the ASCII advantage
is lost. (That is why we have all these dedicated coding
schemes, UTFs included.)

>>> sys.getsizeof('abc' * 1000 + 'z')
3026
>>> sys.getsizeof('abc' * 1000 + '\U00010010')
12044
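
The jump between those two sizes follows from the width rule described in
PEP 393: one, two or four bytes per character, chosen from the widest code
point in the string. Below is a rough sketch of that rule. The names
bytes_per_char and payload_bytes are hypothetical, and the estimate covers
only the character data, not the per-object header, which varies by build.

```python
def bytes_per_char(text):
    """Storage width (1, 2 or 4 bytes) implied by the widest code point."""
    widest = max(map(ord, text), default=0)
    if widest < 0x100:       # fits in Latin-1
        return 1
    if widest < 0x10000:     # fits in the Basic Multilingual Plane
        return 2
    return 4                 # needs the full code point range


def payload_bytes(text):
    """Estimated bytes of character data only (object header excluded)."""
    return len(text) * bytes_per_char(text)


print(payload_bytes("abc" * 1000 + "z"))            # 3001
print(payload_bytes("abc" * 1000 + "\U00010010"))   # 12004
```

The figures reported above (3026 and 12044) are these payloads plus the
object overhead of the build that produced them.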

A little secret: the larger a repertoire of characters
is, the more bits you need.
Secret #2: you cannot escape from this.

jmf