trying to strip out non ascii.. or rather convert non ascii

wxjmfauth at gmail.com
Tue Oct 29 15:16:40 EDT 2013


On Tuesday, 29 October 2013 16:52:49 UTC+1, Tim Chase wrote:
> On 2013-10-29 08:38, wxjmfauth at gmail.com wrote:
> > >>> import timeit
> > >>> timeit.timeit("a = 'hundred'; 'x' in a")
> > 0.12621293837694095
> > >>> timeit.timeit("a = 'hundreij'; 'x' in a")
> > 0.26411553466961735
>
> That reads to me as "If things were purely UCS4 internally, Python
> would normally take 0.264... seconds to execute this test, but core
> devs managed to optimize a particular (lower 127 ASCII characters
> only) case so that it runs in less than half the time."
>
> Is this not what you intended to demonstrate?  'cuz that sounds
> like a pretty awesome optimization to me.
>
> -tkc

--------

That's very naive. In fact, what happens is just the opposite:
the "best case" with the FSR is worse than the "worst case"
without the FSR.

And this is without even counting the time this poor Python
spends switching from one internal representation to another,
nor the fact that the representation has to be checked every
time. The more Unicode manipulation one applies, the more time
it demands.
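A minimal sketch of the kind of representation change being discussed
(the sample strings below are illustrative, not from this thread): under
PEP 393, CPython 3.3+ stores each str in the narrowest form that can hold
its widest code point, which you can observe with sys.getsizeof.

    import sys

    # PEP 393 (the FSR): 1, 2 or 4 bytes per character, depending on
    # the widest code point in the string.
    samples = [
        ("ascii",   "hundred"),           # all code points < 128
        ("latin-1", "hundred\u00e9"),     # e-acute: still 1 byte/char
        ("bmp",     "hundred\u20ac"),     # euro sign: 2 bytes/char
        ("astral",  "hundred\U0001f600"), # emoji: 4 bytes/char
    ]

    for name, s in samples:
        print("{:8} len={:2}  bytes={}".format(name, len(s), sys.getsizeof(s)))

    # Concatenating a wide character onto an ASCII string forces the
    # result into a wider representation.
    a = "hundred"
    print(sys.getsizeof(a), sys.getsizeof(a + "\u20ac"))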

Two tasks that come to mind: re and normalization.
It is very interesting to observe what happens when one
normalizes Latin text and polytonic Greek text, both with
plenty of diacritics.
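Purely as an illustration of that kind of measurement (the sample texts
and the repeat count are arbitrary choices, not taken from this thread),
one way to time normalization with unicodedata and timeit:

    import timeit
    import unicodedata

    # Illustrative texts: Latin with many diacritics, and polytonic
    # Greek; any comparable texts would do.
    latin = "déjà vu, naïve façade, tête-à-tête, cœur, élève " * 20
    greek = "Ἐν ἀρχῇ ἦν ὁ λόγος, καὶ ὁ λόγος ἦν πρὸς τὸν θεόν " * 20

    for name, text in (("latin", latin), ("greek", greek)):
        for form in ("NFC", "NFD"):
            # globals= requires Python 3.5+
            t = timeit.timeit(
                "unicodedata.normalize(form, text)",
                globals={"unicodedata": unicodedata,
                         "form": form, "text": text},
                number=10000,
            )
            print("{:5} {}: {:.3f}s".format(name, form, t))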

----

Something different, based on my previous example.

What is a European user supposed to think when she/he sees
she/he can be "penalized" by such an amount, simply for using
non-ASCII characters in a product that is supposed to be
"Unicode compliant"?

jmf



