[Python-Dev] PEP 393 decode() oddity

martin at v.loewis.de
Sun Mar 25 22:55:13 CEST 2012


> Anyone can test.
>
> $ ./python -m timeit -s 'enc = "latin1"; import codecs; d =  
> codecs.getdecoder(enc); x = ("\u0020" * 100000).encode(enc)' 'd(x)'
> 10000 loops, best of 3: 59.4 usec per loop
> $ ./python -m timeit -s 'enc = "latin1"; import codecs; d =  
> codecs.getdecoder(enc); x = ("\u0080" * 100000).encode(enc)' 'd(x)'
> 10000 loops, best of 3: 28.4 usec per loop
>
> The results are fairly stable (±0.1 µsec) from run to run. This
> looks odd.

This is not surprising. When decoding Latin-1, the decoder needs to
determine whether the string is pure ASCII or not, so that it can pick
the right PEP 393 representation (pure ASCII strings get their own
compact layout). If the string is not pure ASCII, it must be Latin-1:
every byte value is a valid Latin-1 character, so nothing else needs
to be checked.

For a pure ASCII string, this means scanning the entire string in
search of a non-ASCII character. Since there is none, the scan cannot
stop early; only after inspecting every byte can the decoder allocate
the result and copy the data into it, which is a second pass.

In your example, the first character is already above 127, so the
search for the maximum character can stop immediately; the string then
needs to be traversed only once, for the copy.
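As a rough sketch (the actual scan is C code inside CPython; the
function below is just a made-up Python model of it), the logic is:

    def scan_max_char(data):
        # Hypothetical model of the decoder's max-character scan.
        # Once a byte >= 0x80 shows up, 0xFF is the only possible
        # maximum for Latin-1 input, so the scan can stop right away.
        for b in data:
            if b >= 0x80:
                return 0xFF
        # Pure ASCII: every byte had to be inspected to learn that.
        return 0x7F

For pure ASCII input the loop runs to the very end, and the decoder
still has to copy the bytes afterwards: two passes. For your input it
returns on the first byte, leaving only the copy: one pass. That fits
the roughly 2:1 ratio in your timings.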

Try '\u0020' * 999999 + '\u0080': that is a non-ASCII string, but it
should still decode as slowly (per character) as the pure ASCII
string, because the scan cannot stop before the very last character.
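Mirroring your commands above (untested here, and absolute timings
will of course differ per machine):

$ ./python -m timeit -s 'enc = "latin1"; import codecs; d =
codecs.getdecoder(enc); x = ("\u0020" * 999999 + "\u0080").encode(enc)' 'd(x)'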

Regards,
Martin



