FSR and unicode compliance - was Re: RE Module Performance

Sun Jul 28 15:44:17 EDT 2013

On 28/07/2013 20:23, wxjmfauth at gmail.com wrote:
[snip]
>
> Compare these (a BDFL example, where I'm using a non-ASCII char)
>
> Py 3.2 (narrow build)
>
Why are you using a narrow build of Python 3.2? It doesn't treat all
codepoints equally (those outside the BMP can't be stored in one code
unit) and, therefore, it isn't "Unicode compliant"!
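
To make that concrete, here is a minimal sketch that counts the 16-bit
code units a string needs in UTF-16, which is the storage a narrow build
uses. The helper name utf16_code_units is my own invention for
illustration, not something from the standard library, and the sketch
assumes Python 3.3 or later.

```python
def utf16_code_units(text):
    """Count the 16-bit code units needed to store text as UTF-16."""
    # "utf-16-le" emits no byte order mark, so each code unit is exactly 2 bytes.
    return len(text.encode("utf-16-le")) // 2


# One ASCII char, one BMP char (the euro sign), one char outside the BMP.
for ch in ("d", "\u20ac", "\U00010010"):
    print("U+%04X needs %d code unit(s)" % (ord(ch), utf16_code_units(ch)))
```

The first two characters fit in one code unit each, while U+10010 needs a
surrogate pair, i.e. two code units.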

>>>> timeit.timeit("a = 'hundred'; 'x' in a")
> 0.09897159682121348
>>>> timeit.timeit("a = 'hundre€'; 'x' in a")
> 0.09079501961732461
>>>> sys.getsizeof('d')
> 32
>>>> sys.getsizeof('€')
> 32
>>>> sys.getsizeof('dd')
> 34
>>>> sys.getsizeof('d€')
> 34
>
>
> Py 3.3
>
>>>> timeit.timeit("a = 'hundred'; 'x' in a")
> 0.12183182740848858
>>>> timeit.timeit("a = 'hundre€'; 'x' in a")
> 0.2365732969632326
>>>> sys.getsizeof('d')
> 26
>>>> sys.getsizeof('€')
> 40
>>>> sys.getsizeof('dd')
> 27
>>>> sys.getsizeof('d€')
> 42
>
> Tell me which one seems to be more "unicode compliant"?
> The goal of Unicode is to handle every char "equally".
>
> Now, the problem: memory. Do not forget that an à la "FSR"
> mechanism is *irrelevant* for a non-ASCII user. As
> soon as one uses a single non-ASCII char, the ASCII advantage
> is lost. (That is why we have all these dedicated coding
> schemes, UTFs included).
>
>>>> sys.getsizeof('abc' * 1000 + 'z')
> 3026
>>>> sys.getsizeof('abc' * 1000 + '\U00010010')
> 12044
>
> A bit secret. The larger a repertoire of characters
> is, the more bits you need.
> Secret #2. You cannot escape from this.
>
>
> jmf
>
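
For anyone who wants to check the memory figures themselves, the sketch
below estimates how many bytes each extra character costs by growing a
string by one character and taking the difference in sys.getsizeof. The
helper name bytes_per_char is hypothetical, and the results are specific
to CPython 3.3 or later (the flexible string representation of PEP 393);
other implementations may report different numbers.

```python
import sys


def bytes_per_char(ch, n=10):
    """Estimate per-character storage by growing a string of ch by one char."""
    # The fixed object overhead cancels out in the difference, leaving only
    # the width CPython chose for this string's widest code point.
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)


# ASCII, BMP (euro sign), and a code point outside the BMP.
for ch in ("d", "\u20ac", "\U00010010"):
    print("U+%04X: %d byte(s) per char" % (ord(ch), bytes_per_char(ch)))
```

On CPython 3.3+ I would expect 1, 2 and 4 bytes respectively, which
matches the quoted figures: the single U+10010 at the end forces all
3001 characters of that string into 4 bytes each.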