FSR and unicode compliance - was Re: RE Module Performance

Sun Jul 28 15:55:50 EDT 2013

Op 28-07-13 21:23, wxjmfauth at gmail.com schreef:
> Le dimanche 28 juillet 2013 17:52:47 UTC+2, Michael Torrie a écrit :
>> On 07/27/2013 12:21 PM, wxjmfauth at gmail.com wrote:
>>
>>> Good point. FSR, nice tool for those who wish to teach
>>
>>> Unicode. It is not every day, one has such an opportunity.
>>
>>
>>
>> I had a long e-mail composed, but decided to chop it down, but still too
>>
>> long.  so I ditched a lot of the context, which jmf also seems to do.
>>
>> Apologies.
>>
>>
>>
>> 1. FSR *is* UTF-32 so it is as unicode compliant as UTF-32, since UTF-32
>>
>> is an official encoding.  FSR only differs from UTF-32 in that the
>>
>> padding zeros are stripped off such that it is stored in the most
>>
>> compact form that can handle all the characters in string, which is
>>
>> always known at string creation time.  Now you can argue many things,
>>
>> but to say FSR is not unicode compliant is quite a stretch!  What
>>
>> unicode entities or characters cannot be stored in strings using FSR?
>>
>> What sequences of bytes in FSR result in invalid Unicode entities?
>>
>>
>>
>> 2. strings in Python *never change*.  They are immutable.  The +
>>
>> operator always copies strings character by character into a new string
>>
>> object, even if Python had used UTF-8 internally.  If you're doing a lot
>>
>> of string concatenations, perhaps you're using the wrong data type.  A
>>
>> byte buffer might be better for you, where you can stuff utf-8 sequences
>>
>> into it to your heart's content.
>>
>>
>>
>> 3. UTF-8 and UTF-16 encodings, being variable width encodings, mean that
>>
>> slicing a string would be very very slow, and that's unacceptable for
>>
>> the use cases of python strings.  I'm assuming you understand big O
>>
>> notation, as you talk of experience in many languages over the years.
>>
>> FSR and UTF-32 both are O(1) for slicing and lookups.  UTF-8, 16 and any
>>
>> variable-width encoding are always O(n).  A lot slower!
>>
>>
>>
>> 4. Unicode is, well, unicode.  You seem to hop all over the place from
>>
>> talking about code points to bytes to bits, using them all
>>
>> interchangeably.  And now you seem to be claiming that a particular byte
>>
>> encoding standard is by definition unicode (UTF-8).  Or at least that's
>>
>> how it sounds.  And also claim FSR is not compliant with unicode
>>
>> standards, which appears to me to be completely false.
>>
>>
>>
>> Is my understanding of these things wrong?
>
> ------
>
> Compare these (a BDFL exemple, where I'using a non-ascii char)
>
> Py 3.2 (narrow build)
>
>>>> timeit.timeit("a = 'hundred'; 'x' in a")
> 0.09897159682121348
>>>> timeit.timeit("a = 'hundre€'; 'x' in a")
> 0.09079501961732461
>>>> sys.getsizeof('d')
> 32
>>>> sys.getsizeof('€')
> 32
>>>> sys.getsizeof('dd')
> 34
>>>> sys.getsizeof('d€')
> 34
>
>
> Py3.3
>
>>>> timeit.timeit("a = 'hundred'; 'x' in a")
> 0.12183182740848858
>>>> timeit.timeit("a = 'hundre€'; 'x' in a")
> 0.2365732969632326
>>>> sys.getsizeof('d')
> 26
>>>> sys.getsizeof('€')
> 40
>>>> sys.getsizeof('dd')
> 27
>>>> sys.getsizeof('d€')
> 42
>
> Tell me which one seems to be more "unicode compliant"?

Cant tell, you give no relevant information on which one can decide
this question.

> The goal of Unicode is to handle every char "equaly".

Not to this kind of detail, which is looking at irrelevant
implementation details.

> Now, the problem: memory. Do not forget that à la "FSR"
> mechanism for a non-ascii user is *irrelevant*. As
> soon as one uses one single non-ascii, your ascii feature
> is lost. (That why we have all these dedicated coding
> schemes, utfs included).

So? Why should that trouble me? As far as I understand
whether I have an ascii string or not is totally irrelevant
to the application programmer. Within the application I
just process strings and let the programming environment
keep track of these details in a transparant way unless
you start looking at things like getsizeof, which gives
you implementation details that are mostly irrelevant
in deciding whether the behaviour is compliant or not.

>>>> sys.getsizeof('abc' * 1000 + 'z')
> 3026
>>>> sys.getsizeof('abc' * 1000 + '\U00010010')
> 12044
>
> A bit secret. The larger a repertoire of characters
> is, the more bits you needs.
> Secret #2. You can not escape from this.

And totally unimportant for deciding complyance.

-- 
Antoon Pardon