trying to strip out non ascii.. or rather convert non ascii

Chris Angelico rosuav at gmail.com
Tue Oct 29 22:17:21 EDT 2013


On Wed, Oct 30, 2013 at 2:56 AM, Mark Lawrence <breamoreboy at yahoo.co.uk> wrote:
> You've stated above that logically unicode is badly handled by the fsr.  You
> then provide a trivial timing example.  WTF???

His idea of bad handling is "oh how terrible, ASCII and BMP have
optimizations". He hates the idea that it could be faster in some
areas instead of having even timings across the board. But the FSR
actually has some distinct benefits even in the areas he's citing -
watch this:

>>> import timeit
>>> timeit.timeit("a = 'hundred'; 'x' in a")
0.3625614428649451
>>> timeit.timeit("a = 'hundreij'; 'x' in a")
0.6753936603674484
>>> timeit.timeit("a = 'hundred'; 'ģ' in a")
0.25663261671525106
>>> timeit.timeit("a = 'hundreij'; 'ģ' in a")
0.3582399439035271

The first two examples are his, rerun on my computer, so you can see
how all four figures compare. Note how testing for the presence of a
non-Latin1 character in an 8-bit string is very fast. The same goes
for testing for a non-BMP character in a 16-bit string. The
difference gets even larger if the string is longer:

>>> timeit.timeit("a = 'hundred'*1000; 'x' in a")
10.083378194714726
>>> timeit.timeit("a = 'hundreij'*1000; 'x' in a")
18.656413035735
>>> timeit.timeit("a = 'hundreij'*1000; 'ģ' in a")
18.436268855399135
>>> timeit.timeit("a = 'hundred'*1000; 'ģ' in a")
2.8308718007456264

Wow! The FSR speeds up searches immensely! It's obviously the best
thing since sliced bread!
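
For anyone who wants to see the storage widths behind those numbers,
here's a rough sketch. It assumes CPython 3.3 or later (PEP 393), where
sys.getsizeof reflects the string's internal width. The names
bytes_per_char and can_skip_scan are made up for illustration, and the
second one is only a model of the shortcut, not the interpreter's
actual code.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate bytes stored per code point (assumes CPython 3.3+)."""
    # Subtracting two sizes cancels the fixed object header,
    # leaving only the cost of the n extra characters.
    short = ch * n
    longer = ch * (2 * n)
    return (sys.getsizeof(longer) - sys.getsizeof(short)) // n

def can_skip_scan(needle, haystack):
    """Hypothetical model of the fast path: a single-character needle
    that is wider than the haystack's storage cannot occur in it.
    Expects a non-empty haystack."""
    # The widest character in a string decides its storage width.
    return bytes_per_char(needle) > bytes_per_char(max(haystack))

if __name__ == "__main__":
    for ch in ("x", "\u00e9", "\u0123", "\U0001F600"):
        print(ascii(ch), bytes_per_char(ch))
    print(can_skip_scan("\u0123", "hundred"))
```

On CPython this should report widths of 1, 1, 2 and 4 bytes for the
ASCII, Latin-1, BMP and non-BMP characters respectively, which is why
looking for U+0123 in a pure-ASCII string can be answered without
scanning it.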

ChrisA


