Performance of int/long in Python 3

jmfauth wxjmfauth at gmail.com
Tue Apr 2 03:24:21 EDT 2013


On 2 avr, 01:43, Neil Hodgson <nhodg... at iinet.net.au> wrote:
> Mark Lawrence:
>
> > You've given many examples of the same type of micro benchmark, not many
> > examples of different types of benchmark.
>
>     Trying to work out what jmfauth is on about, I found what appears to
> be a performance regression with '<' string comparisons on Windows
> 64-bit. It's around 30% slower on a 25-character string that differs in
> the last character, and 70-100% slower on a 100-character string that
> differs at the end.
>
>     Can someone else please try this to see if it's reproducible? Linux
> doesn't show this problem.
>
>  >c:\python32\python -u "charwidth.py"
> 3.2 (r32:88445, Feb 20 2011, 21:30:00) [MSC v.1500 64 bit (AMD64)]
> a=['C:/Users/Neil/Documents/b','C:/Users/Neil/Documents/z']176
> [0.7116295577956576, 0.7055591343157613, 0.7203483026429418]
>
> a=['C:/Users/Neil/Documents/λ','C:/Users/Neil/Documents/η']176
> [0.7664397841378787, 0.7199902325464409, 0.713719289812504]
>
> a=['C:/Users/Neil/Documents/b','C:/Users/Neil/Documents/η']176
> [0.7341851791817691, 0.6994205901833599, 0.7106807593741005]
>
> a=['C:/Users/Neil/Documents/𠀀','C:/Users/Neil/Documents/𠀁']180
> [0.7346812372666784, 0.6995411113377914, 0.7064768417728411]
>
>  >c:\python33\python -u "charwidth.py"
> 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)]
> a=['C:/Users/Neil/Documents/b','C:/Users/Neil/Documents/z']108
> [0.9913326076446045, 0.9455845241056282, 0.9459076605341776]
>
> a=['C:/Users/Neil/Documents/λ','C:/Users/Neil/Documents/η']192
> [1.0472289217234318, 1.0362342484091207, 1.0197109728048384]
>
> a=['C:/Users/Neil/Documents/b','C:/Users/Neil/Documents/η']192
> [1.0439643704533834, 0.9878581050301687, 0.9949265834034335]
>
> a=['C:/Users/Neil/Documents/𠀀','C:/Users/Neil/Documents/𠀁']312
> [1.0987483965446412, 1.0130257167690004, 1.024832248526499]
>
>     Here is the code:
>
> # encoding:utf-8
> import os, sys, timeit
> print(sys.version)
> examples = [
> "a=['$b','$z']",
> "a=['$λ','$η']",
> "a=['$b','$η']",
> "a=['$\U00020000','$\U00020001']"]
> baseDir = "C:/Users/Neil/Documents/"
> #~ baseDir = "C:/Users/Neil/Documents/Visual Studio 2012/Projects/Sigma/QtReimplementation/HLFKBase/Win32/x64/Debug"
> for t in examples:
>      t = t.replace("$", baseDir)
>      # Using os.write as a simple way to get UTF-8 to stdout
>      os.write(sys.stdout.fileno(), t.encode("utf-8"))
>      print(sys.getsizeof(t))
>      print(timeit.repeat("a[0] < a[1]", t, number=5000000))
>      print()
>
>     For a more significant performance difference try replacing the
> baseDir setting with (may be wrapped):
> baseDir = "C:/Users/Neil/Documents/Visual Studio 2012/Projects/Sigma/QtReimplementation/HLFKBase/Win32/x64/Debug"
>
>     Neil

--------

Hi,

>c:\python32\pythonw -u "charwidth.py"
3.2.3 (default, Apr 11 2012, 07:15:24) [MSC v.1500 32 bit (Intel)]
a=['D:\jm\jmpy\py3app\stringbenchb','D:\jm\jmpy\py3app\stringbenchz']168
[0.8343414906182101, 0.8336184057396241, 0.8330473419738562]

a=['D:\jm\jmpy\py3app\stringbenchλ','D:\jm\jmpy\py3app\stringbenchη']168
[0.818378092261062, 0.8180854713107406, 0.8192279926793571]

a=['D:\jm\jmpy\py3app\stringbenchb','D:\jm\jmpy\py3app\stringbenchη']168
[0.8131353330542339, 0.8126985677326912, 0.8122744051977042]

a=['D:\jm\jmpy\py3app\stringbench𠀀','D:\jm\jmpy\py3app\stringbench𠀁']172
[0.8271094603211102, 0.82704053883214, 0.8265781741004083]

>Exit code: 0
>c:\Python33\pythonw -u "charwidth.py"
3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600 32 bit (Intel)]
a=['D:\jm\jmpy\py3app\stringbenchb','D:\jm\jmpy\py3app\stringbenchz']94
[1.3840254166697845, 1.3933888932429768, 1.391664674507438]

a=['D:\jm\jmpy\py3app\stringbenchλ','D:\jm\jmpy\py3app\stringbenchη']176
[1.6217970707185678, 1.6279369907932706, 1.6207041728220117]

a=['D:\jm\jmpy\py3app\stringbenchb','D:\jm\jmpy\py3app\stringbenchη']176
[1.5150522562729396, 1.5130369919353992, 1.5121890607025037]

a=['D:\jm\jmpy\py3app\stringbench𠀀','D:\jm\jmpy\py3app\stringbench𠀁']316
[1.6135375194801664, 1.6117739170366434, 1.6134331526540109]

>Exit code: 0

- Win7, 32-bit
- The file is encoded in UTF-8
- Do not be put off by this output; it is just a copy/paste from your
excellent editor, whose output pane is configured to use the locale
encoding.
- Of course, and as expected, the behaviour is the same from a console
(which, by the way, shows how good your application is).
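The getsizeof numbers in these runs come from Python 3.3's flexible string representation (PEP 393), which stores a string with 1, 2, or 4 bytes per character depending on the widest character it contains. A minimal sketch of that effect (exact byte counts vary by platform and CPython version, so only the ordering is asserted):

```python
import sys

# Three 25-character strings whose widest character forces the
# 1-byte (Latin-1), 2-byte (BMP) and 4-byte (astral) layouts.
ascii_s  = "a" * 25
bmp_s    = "a" * 24 + "\u03bb"      # ends in λ -> 2 bytes/char
astral_s = "a" * 24 + "\U00020000"  # ends in 𠀀 -> 4 bytes/char

for s in (ascii_s, bmp_s, astral_s):
    print(len(s), sys.getsizeof(s))

# Same length, growing memory footprint:
assert sys.getsizeof(ascii_s) < sys.getsizeof(bmp_s) < sys.getsizeof(astral_s)
```

This is why the 3.3 sizes above jump from 94 to 176 to 316 bytes for strings of the same length, while 3.2 (a wide/narrow build) reports nearly constant sizes.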

==========

Something different.

From a previous msg on this thread:

---

> Sure. And over a different set of samples, it is less compact. If you
> write a lot of Latin-1, Python will use one byte per character, while
> UTF-8 will use two bytes per character.

    I think you mean writing a lot of Latin-1 characters outside ASCII.
However, even people writing texts in, say, French will find that only a
small proportion of their text is outside ASCII, and so the cost of UTF-8
is correspondingly small.

    The counter-problem is that a French document that needs to include
one mathematical symbol (or emoji) outside Latin-1 will double in size
as a Python string.

---
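The tradeoff described in that quote can be checked directly. A small sketch (the sample sentence is illustrative; byte counts are platform-dependent, so only rough ratios are asserted):

```python
import sys

french = "Voici une idée très générale. " * 100  # Latin-1 only -> 1 byte/char
with_astral = french + "\U0001F600"              # add one emoji (astral plane)

# UTF-8 grows only where multi-byte characters occur:
print(len(french), len(french.encode("utf-8")))

# In memory, one astral character pushes the whole string to the
# 4-bytes-per-character layout:
print(sys.getsizeof(french), sys.getsizeof(with_astral))
assert sys.getsizeof(with_astral) > 3 * sys.getsizeof(french)
```

From the 1-byte Latin-1 layout the jump is actually a factor of about four; the quoted "double" applies to text already stored at 2 bytes per character.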

I already explained this.
It is, how to say, a misunderstanding of Unicode. What counts
is not the number of non-ASCII chars you have in a stream.
What is relevant is that every char is handled with the "same
algorithm", in this case UTF-8. Unicode takes you from the "char"
up to the Unicode transformation format; from there it is a
question of implementation.

This is exactly what you are doing in Scintilla (maybe without
realizing it deeply).

An editor illustrates the example I gave very well. You enter a
thousand ASCII chars, then - boom - as you enter one non-ASCII
char, your editor (assuming it uses a mechanism like the FSR)
has to internally re-encode everything!
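jmf's editor scenario can be sketched as follows. Python strings are immutable, so what is actually shown is that the new string built by the append stores every character, not just the new one, in the widest layout (sizes are platform-dependent; the assertion only checks the rough factor):

```python
import sys

buf = "a" * 1000000           # large, all-ASCII text -> 1 byte/char
widened = buf + "\U00020000"  # one astral char appended

print(sys.getsizeof(buf))      # roughly 1 MB
print(sys.getsizeof(widened))  # roughly 4 MB

# The whole text, not just the last char, now occupies 4 bytes/char:
assert sys.getsizeof(widened) > 3.9 * sys.getsizeof(buf)
```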

jmf



