From victor.stinner at gmail.com Wed Jun 1 21:19:32 2016
From: victor.stinner at gmail.com (Victor Stinner)
Date: Thu, 2 Jun 2016 03:19:32 +0200
Subject: [Speed] A new perf module: toolkit to write benchmarks
Message-ID:

Hi,

I started to write blog posts on stable benchmarks:

1) https://haypo.github.io/journey-to-stable-benchmark-system.html
2) https://haypo.github.io/journey-to-stable-benchmark-deadcode.html
3) https://haypo.github.io/journey-to-stable-benchmark-average.html

One important point is that the minimum is commonly used in Python
benchmarks, whereas it is a bad practice for getting stable benchmark
results.

I started to work on a toolkit to write benchmarks, the new "perf" module:

   http://perf.readthedocs.io/en/latest/
   https://github.com/haypo/perf

I used timeit as a concrete use case, since timeit is popular and badly
implemented. timeit currently uses 1 process running the microbenchmark
3 times and takes the minimum. timeit is *known* to be unstable, and the
common advice is to run it at least 3 times and again take the minimum
of the minimums. Examples of links about timeit being unstable:

* https://mail.python.org/pipermail/python-dev/2012-August/121379.html
* https://bugs.python.org/issue23693
* https://bugs.python.org/issue6422 (not directly related)

Moreover, the timeit module disables the garbage collector, which is
also wrong: it's rare to disable the GC in applications.

My goal for the perf module is to provide basic features and then reuse
it in existing benchmarks:

* mean() and stdev() to display results
* clock chosen for benchmarking
* result classes to store numbers
* etc.

Work in progress:

* new implementation of timeit using multiple processes
* perf.metadata module: collect various information about Python, the
  system, etc.
* file format to store numbers and metadata

I'm interested in the very basic perf.py internal text format: one
timing per line, that's all. But it's incomplete, the "loops"
information is not stored. Maybe a binary format is better? I don't
know yet. It should be possible to combine the files of multiple
processes.

I'm also interested in implementing a generic "rerun" command to add
more samples if a benchmark doesn't look stable enough.

perf.timeit looks more stable than timeit, and the CLI is basically the
same: replace "-m timeit" with "-m perf.timeit".

5 timeit outputs ("1000000 loops, best of 3: ... per loop"):

* 0.247 usec
* 0.252 usec
* 0.247 usec
* 0.251 usec
* 0.251 usec

It's disturbing to get 3 different "minimums" :-/

5 perf.timeit outputs ("Average: 25 runs x 3 samples x 10^6 loops: ..."):

* 250 ns +- 3 ns
* 250 ns +- 3 ns
* 251 ns +- 3 ns
* 251 ns +- 4 ns
* 251 ns +- 3 ns

Note: I also got "258 ns +- 17 ns" when I opened a webpage in Firefox
while the benchmark was running.

Note: I ran these benchmarks on a regular Linux system without any
specific tuning. ASLR is enabled, but the system was idle.

Victor

From solipsis at pitrou.net Thu Jun 2 03:17:18 2016
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Thu, 2 Jun 2016 09:17:18 +0200
Subject: [Speed] A new perf module: toolkit to write benchmarks
References:
Message-ID: <20160602091718.74562cb4@fsol>

On Thu, 2 Jun 2016 03:19:32 +0200
Victor Stinner wrote:
> I'm interested in the very basic perf.py internal text format: one
> timing per line, that's all. But it's incomplete, the "loops"
> information is not stored. Maybe a binary format is better? I don't
> know yet.

Just use a simple JSON format.

Regards

Antoine.
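As an illustration of what a simple JSON format could look like for this
use case, here is a minimal sketch using one JSON object per line so that
files written by multiple processes can simply be concatenated; the field
names below are hypothetical, not the format perf actually adopted:
---
import json

# Hypothetical schema: one JSON object per line ("JSON Lines"), so that
# result files from several worker processes can simply be concatenated.
run = {
    "name": "timeit 1+1",
    "loops": 10**7,
    "warmups": [18.3e-9],
    "samples": [18.3e-9, 18.3e-9, 18.2e-9],  # seconds per inner loop iteration
    "metadata": {"cpu_count": 4, "aslr": "enabled"},
}

with open("run.json", "a") as fp:
    fp.write(json.dumps(run) + "\n")

# Reading back: one run per line, possibly coming from several processes.
with open("run.json") as fp:
    runs = [json.loads(line) for line in fp]
print(len(runs), "run(s) loaded")
---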
From arigo at tunes.org Thu Jun 2 04:38:02 2016
From: arigo at tunes.org (Armin Rigo)
Date: Thu, 2 Jun 2016 10:38:02 +0200
Subject: [Speed] A new perf module: toolkit to write benchmarks
In-Reply-To:
References:
Message-ID:

Hi Victor,

On 2 June 2016 at 03:19, Victor Stinner wrote:
> 5 timeit outputs ("1000000 loops, best of 3: ... per loop"):
>
> * 0.247 usec
> * 0.252 usec
> * 0.247 usec
> * 0.251 usec
> * 0.251 usec
>
> 5 perf.timeit outputs ("Average: 25 runs x 3 samples x 10^6 loops: ..."):
>
> * 250 ns +- 3 ns
> * 250 ns +- 3 ns
> * 251 ns +- 3 ns
> * 251 ns +- 4 ns
> * 251 ns +- 3 ns

Looks good. IMHO the important bit is that `timeit` is simple to use,
readily available, and gives just a number, which makes it very
attractive to people. Your output would achieve the same result (with
the `+-` added, which is fine), assuming that it eventually replaces
`timeit` in the standard library.

I know there are many good reasons why getting just a single number is
not enough, but I'd say that we still need to achieve the best practical
results given that constraint. The results you posted above seem to show
that `perf.timeit` is better than `timeit` at doing that, and I believe
that it's a great step forward.

A bientôt,

Armin.

From victor.stinner at gmail.com Thu Jun 2 04:58:35 2016
From: victor.stinner at gmail.com (Victor Stinner)
Date: Thu, 2 Jun 2016 10:58:35 +0200
Subject: [Speed] A new perf module: toolkit to write benchmarks
In-Reply-To:
References:
Message-ID:

2016-06-02 10:38 GMT+02:00 Armin Rigo :
> Looks good. IMHO the important bit is that `timeit` is simple to use,
> readily available, and gives just a number, which makes it very
> attractive to people.

By default, min & max are hidden. You can show them using the -v option.

To make the output even simpler, maybe the standard deviation can be
displayed "in English". Something like:

* "Average: 250 ns +- 3 ns" => "Average: 250 ns (stable)", or just
  "Average: 250 ns"
* "Average: 250 ns +- 120 ns" => "Average: 250 ns (not reliable, try
  again on an idle system)"

Usually, timeit is used to compare two versions of Python. Maybe we
should focus on this use case, and check if the difference is
significant, as perf.py does? By default, perf.py does *not* display any
number if the difference is not significant. I like this behaviour, even
if it can be surprising the first time.

For the CLI, we can extend the timeit CLI to accept the path/name of two
Python binaries. Or we can use something like pybench to store results
into files and then load & compare two files.

Victor

From solipsis at pitrou.net Thu Jun 2 07:29:14 2016
From: solipsis at pitrou.net (Antoine Pitrou)
Date: Thu, 2 Jun 2016 13:29:14 +0200
Subject: [Speed] A new perf module: toolkit to write benchmarks
References:
Message-ID: <20160602132914.6b57c9f5@fsol>

On Thu, 2 Jun 2016 10:58:35 +0200
Victor Stinner wrote:
>
> Usually, timeit is used to compare two versions of Python.

timeit is used for many different things, including comparing two
versions of Python, but not only.

> For the CLI, we can extend the timeit CLI to accept the path/name of
> two Python binaries.

That sounds reasonable.

Regards

Antoine.

From arigo at tunes.org Thu Jun 2 07:53:07 2016
From: arigo at tunes.org (Armin Rigo)
Date: Thu, 2 Jun 2016 13:53:07 +0200
Subject: [Speed] A new perf module: toolkit to write benchmarks
In-Reply-To:
References:
Message-ID:

Hi Victor,

On 2 June 2016 at 10:58, Victor Stinner wrote:
> Usually, timeit is used to compare two versions of Python.

That's not the use case I'm focusing on here: timeit is also used by
Mr. Random Programmer to tweak their Python code to improve its
performance. (Often, it's the performance of non-representative
microbenchmarks, but well, better to have a tool that at least gets
saner results than the current timeit.)

A bientôt,

Armin.
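As an illustration of Victor's earlier idea of spelling out the standard
deviation "in English", a rough sketch using an arbitrary 10% threshold
on the relative standard deviation (not perf's actual behaviour):
---
import statistics

def format_average(samples, unit="ns", threshold=0.10):
    # Sketch only: flag a run as unreliable when the standard deviation
    # exceeds an arbitrary fraction (here 10%) of the mean.
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    if stdev > mean * threshold:
        return ("Average: %.0f %s (not reliable, try again on an idle system)"
                % (mean, unit))
    return "Average: %.0f %s" % (mean, unit)

print(format_average([250, 250, 251, 251, 251]))   # stable run
print(format_average([250, 130, 370, 260, 240]))   # noisy run
---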
From victor.stinner at gmail.com Thu Jun 2 09:22:28 2016
From: victor.stinner at gmail.com (Victor Stinner)
Date: Thu, 2 Jun 2016 15:22:28 +0200
Subject: [Speed] A new perf module: toolkit to write benchmarks
In-Reply-To: <20160602091718.74562cb4@fsol>
References: <20160602091718.74562cb4@fsol>
Message-ID:

2016-06-02 9:17 GMT+02:00 Antoine Pitrou :
> Just use a simple JSON format.

Yeah, Python 2.7 includes a JSON parser, and JSON is human readable
(though not really designed to be modified by a human).

I had a technical issue: I wanted to produce JSON output *and* keep the
nice human output at the same time. I found a nice trick: by default,
write the human output to stdout, but in JSON mode write the JSON to
stdout and the human output to stderr. At the end, you get a simple CLI:
---
$ python3 -m perf.timeit --json 1+1 > run.json
.........................
Average: 18.3 ns +- 0.3 ns (25 runs x 3 samples x 10^7 loops)

$ python3 -m perf < run.json
Average: 18.3 ns +- 0.3 ns (25 runs x 3 samples x 10^7 loops)
---

The JSON can contain metadata as well:
---
$ python3 -m perf.timeit --metadata --json 1+1 > run.json
Metadata:
- aslr: enabled
- cpu_count: 4
- (...)
.........................
Average: 18.2 ns +- 0.0 ns (25 runs x 3 samples x 10^7 loops)

$ python3 -m perf < run.json
Metadata:
- aslr: enabled
- cpu_count: 4
- (...)
Average: 18.2 ns +- 0.0 ns (25 runs x 3 samples x 10^7 loops)
---

There are two kinds of objects: a single run, or a result composed of
multiple runs. The format is one JSON object per line. Example of single
runs written to individual JSON files which are then combined:
---
$ python3 -m perf.timeit --raw --json 1+1 > run1.json
warmup 1: 18.3 ns
sample 1: 18.3 ns
sample 2: 18.3 ns
sample 3: 18.3 ns

$ python3 -m perf.timeit --raw --json 1+1 > run2.json
warmup 1: 18.2 ns
sample 1: 18.2 ns
sample 2: 18.2 ns
sample 3: 18.2 ns

$ python3 -m perf.timeit --raw --json 1+1 > run3.json
warmup 1: 18.2 ns
sample 1: 18.2 ns
sample 2: 18.2 ns
sample 3: 18.2 ns

$ python3 -m perf < run1.json  # single run
Average: 18.3 ns +- 0.0 ns (3 samples x 10^7 loops)

$ cat run1.json run2.json run3.json | python3 -m perf  # 3 runs
Average: 18.2 ns +- 0.0 ns (3 runs x 3 samples x 10^7 loops)
---

Victor
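A minimal sketch of the stdout/stderr trick described above; the report()
helper and the result fields are made up for the example, only the idea
of switching streams in JSON mode comes from the message:
---
import json
import sys

def report(result, json_mode=False):
    # The trick described above: human-readable output goes to stdout by
    # default, but moves to stderr when JSON is requested, so that
    # "python3 -m perf.timeit --json 1+1 > run.json" captures only the JSON.
    human = "Average: %.1f ns +- %.1f ns" % (result["avg"], result["stdev"])
    if json_mode:
        print(json.dumps(result))       # machine-readable, on stdout
        print(human, file=sys.stderr)   # still visible in the terminal
    else:
        print(human)

report({"avg": 18.3, "stdev": 0.3}, json_mode="--json" in sys.argv)
---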
From victor.stinner at gmail.com Tue Jun 7 09:03:46 2016
From: victor.stinner at gmail.com (Victor Stinner)
Date: Tue, 7 Jun 2016 15:03:46 +0200
Subject: [Speed] perf 0.2 released, perf fork of CPython benchmark suite
Message-ID:

Hi,

I completed the API of my small perf module and released version 0.2:

   https://perf.readthedocs.io/

It is supposed to provide the basic tools to collect samples, compute
the average, display the result, etc.

I started to work on JSON serialization to "easily" run multiple
processes. The idea is also to split the code that produces numbers from
the code that displays results. I expect that we can do better at
displaying results: see for example speed.python.org and speed.pypy.org,
which are nicer than perf.py's text output ;-)

I also started to hack the CPython benchmark suite (the benchmarks
repository) to use my perf module:

   https://hg.python.org/sandbox/benchmarks_perf

I should now stop NIH and see how to merge my work with the PyPy fork of
benchmarks ;-)

FYI I started to write the perf module because I started to write an
article about the impact of CPU speed on Python microbenchmarks, and I
wanted to have a smart timeit running multiple processes. Since it was
cool to work on such a project, I started to hack benchmarks, but maybe
I went too far and I should look at PyPy's benchmarks instead ;-)

Victor

From victor.stinner at gmail.com Fri Jun 10 06:50:25 2016
From: victor.stinner at gmail.com (Victor Stinner)
Date: Fri, 10 Jun 2016 12:50:25 +0200
Subject: [Speed] perf 0.3 released
Message-ID:

Hi,

I just released perf 0.3. Major changes:

- "python -m perf" CLI now has 3 commands: show, compare and compare_to.
  The compare commands say if the difference is significant (I copied
  the code from perf.py)
- TextRunner is now able to spawn child processes, parse command
  arguments, and more
- If TextRunner detects isolated CPUs, it automatically pins the worker
  processes to the isolated CPUs
- Add ``--json-file`` command line option
- Add TextRunner.bench_sample_func() method: the sample function is
  responsible for measuring the elapsed time, useful for microbenchmarks
- Enhance a lot of the documentation

Writing a benchmark now only takes one line:
"perf.text_runner.TextRunner().bench_func(func)"! Full example:
---
import time
import perf.text_runner

def func():
    time.sleep(0.001)

perf.text_runner.TextRunner().bench_func(func)
---

I looked at the PyPy benchmarks:

   https://bitbucket.org/pypy/benchmarks

Results can also be serialized to JSON there, but the serialization is
only done at the end: only the final result is serialized. It's not
possible to save each run in a JSON file. Running multiple processes is
not supported either.

With perf, the final JSON contains all data: all runs, all samples, even
warmup samples. perf now also collects metadata in each worker process,
so it is safer to compare runs since it's possible to check manually
when and how the worker executed the benchmark. For example, the CPU
affinity is now saved in the metadata, and "python -m perf.timeit" saves
the setup and statements in the metadata.

With perf 0.3, TextRunner also includes builtin calibration to compute
the number of outer loop iterations: repeat each sample so that it takes
between 100 ms and 1 sec (the min/max are configurable).

Victor

From contrebasse at gmail.com Sat Jun 11 18:20:18 2016
From: contrebasse at gmail.com (Joseph Martinot-Lagarde)
Date: Sat, 11 Jun 2016 22:20:18 +0000 (UTC)
Subject: [Speed] Performance comparison of regular expression engines
References: <56DB26EA.3070005@gmail.com>
Message-ID:

Serhiy Storchaka writes:

> The first column is the searched pattern. The second column is the
> number of found matches (for control, it should be the same with all
> engines and versions). The third column, under the "re" header, is the
> time in milliseconds. The column under the "str.find" header is the
> time of searching without using regular expressions.

It would be easier to read with a constant number of digits after the
comma, so that the numbers are better aligned.
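A minimal sketch of the fixed-precision formatting Joseph suggests, with
made-up patterns, match counts and timings:
---
# Made-up data; the point is the fixed "%.3f" precision and fixed column
# widths, so that the decimal points line up in the timing column.
results = [("Twain", 811, 0.905),
           ("[a-z]shing", 1540, 12.250),
           ("\\bwas\\b", 3141, 3.125)]
print("%-12s %8s %12s" % ("pattern", "matches", "re (ms)"))
for pattern, matches, ms in results:
    print("%-12s %8d %12.3f" % (pattern, matches, ms))
---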