[pypy-dev] New speed.pypy.org version

Miquel Torres tobami at googlemail.com
Sat Jun 26 09:16:52 CEST 2010


Hi Paolo,

well, you are right of course. I had forgotten about the real problem, which
you actually demonstrate quite well with your CPython and pypy-c case:
depending on the normalization you can make any stacked series look faster
than the others.

I will have a look at the literature and modify normalized stacked plots
accordingly.

Thanks for taking the time to explain things in such detail.

Regards,
Miquel


2010/6/25 Paolo Giarrusso <p.giarrusso at gmail.com>

> On Fri, Jun 25, 2010 at 19:08, Miquel Torres <tobami at googlemail.com>
> wrote:
> > Hi Paolo,
> >
> > I am aware of the problem with calculating benchmark means, but let me
> > explain my point of view.
> >
> > You are correct in that it would be preferable to have absolute times.
> > Well, you actually can, but see what happens:
> > http://speed.pypy.org/comparison/?hor=true&bas=none&chart=stacked+bars
>
> Ahah! I didn't notice that I could skip normalization! This does not
> fully invalidate my point, however.
>
> > Absolute values would only work if we had carefully chosen benchmark
> > runtimes to be very similar (for our cpython baseline). As it is, html5lib,
> > spitfire and spitfire_cstringio completely dominate the cummulative time.
>
> I acknowledge that (btw, it should be cumulative time, with one 'm',
> both here and on the website).
>
> > And not because the interpreter is faster or slower but because the
> > benchmark was arbitrarily designed to run that long. Any improvement in
> > the long-running benchmarks will carry much more weight than in the
> > short-running ones.
>
> > What is more useful is to have comparable slices of time so that the
> > improvements can be seen relatively over time.
>
> If you want to sum up times (but at this point, I see no reason for
> it), you should rather have externally derived weights, as suggested
> by the paper (in Rule 3).
> As soon as you take the weights from the data themselves, much of the
> maths you rely on stops working - that's generally true in statistics.
> And the only sensible way to get external weights is to gather them
> from real-world programs. Since that's not going to happen easily,
> just stick with the geometric mean. Or set an arbitrarily low weight,
> manually, without any math, so that the long-running benchmarks stop
> dominating the result. It's no fraud, since the current graph is less
> valid anyway.
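>
> A tiny Python sketch of what such Rule-3-style external weights would
> look like (all numbers below are invented, not taken from speed.pypy.org):
>
> # Hypothetical weights chosen from outside the benchmark data
> # (e.g. from a survey of real-world workloads), not derived from it.
> times   = {"html5lib": 12.0, "spitfire": 9.0, "ai": 0.5}   # seconds, invented
> weights = {"html5lib": 0.2,  "spitfire": 0.1, "ai": 0.7}   # external, invented
>
> weighted_total = sum(weights[b] * times[b] for b in times)
> print(weighted_total)   # ~3.65 with the numbers above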
>
> > Normalizing does that, I think.
>
> Not really.
>
> > It just says: we have 21 tasks which take 1 second to run each on
> > interpreter X (cpython in the default case). Then we see how other
> > executables compare to that. What would the geometric mean achieve here,
> > exactly, for the end user?
>
> You actually need the geomean to do that. Don't forget that the
> geomean is still a mean: it's a mean performance ratio which averages
> individual performance ratios.
> If PyPy's geomean is 0.5, it means that PyPy is going to run that task
> in 10.5 seconds instead of 21. To me, this sounds exactly like what
> you want to achieve. Moreover, it actually works, unlike what you use.
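>
> A minimal Python sketch of that reading of the geomean (the ratios are
> invented):
>
> import math
>
> # Invented per-benchmark performance ratios: pypy_time / cpython_time.
> ratios = [0.25, 0.5, 1.0]
> geomean = math.prod(ratios) ** (1.0 / len(ratios))   # 0.5 for these ratios
>
> # For the "21 tasks of 1 second each on CPython" workload, a geomean of
> # 0.5 predicts roughly 0.5 * 21 = 10.5 seconds on PyPy.
> print(geomean * 21)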
>
> For instance, ignore PyPy-JIT, and look only at CPython and pypy-c (no
> JIT). Then, change the normalization between the two:
>
> http://speed.pypy.org/comparison/?exe=2%2B35,3%2BL&ben=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21&env=1&hor=true&bas=2%2B35&chart=stacked+bars
>
> http://speed.pypy.org/comparison/?exe=2%2B35,3%2BL&ben=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21&env=1&hor=true&bas=3%2BL&chart=stacked+bars
> With the current data, you get that in one case cpython seems faster,
> and in the other pypy-c does.
> That can't happen with the geomean. This is the point of the paper.
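>
> A rough sketch of that effect with made-up timings (two benchmarks, two
> interpreters, nothing from the real data):
>
> import math
>
> # Invented raw times in seconds.
> cpython = {"ai": 10.0, "html5lib": 1.0}
> pypy_c  = {"ai":  2.0, "html5lib": 4.0}
>
> def stacked_total(times, base):
>     # What a normalized stacked-bar chart implicitly sums.
>     return sum(times[b] / base[b] for b in times)
>
> def geomean(times, base):
>     rs = [times[b] / base[b] for b in times]
>     return math.prod(rs) ** (1.0 / len(rs))
>
> # Normalized to CPython, CPython's total (2.0) beats pypy-c's (4.2) ...
> print(stacked_total(cpython, cpython), stacked_total(pypy_c, cpython))
> # ... normalized to pypy-c, pypy-c's total (2.0) beats CPython's (5.25).
> print(stacked_total(cpython, pypy_c), stacked_total(pypy_c, pypy_c))
>
> # The geomean ratio of pypy-c over cpython is ~0.89; normalizing both to
> # any common baseline first would give the same number.
> print(geomean(pypy_c, cpython))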
>
> I could even construct a normalization baseline $base such that
> CPython seems faster than PyPy-JIT. Such a base should be very fast
> on, say, ai (taking it as a benchmark where CPython is faster), so
> that $cpython.ai/$base.ai becomes 100 and $pypyjit.ai/$base.ai becomes
> 200, and very slow on the other benchmarks (so that they disappear in
> the sum).
>
> So, the only difference I see is that geomean works, arithm. mean
> doesn't. That's why Real Benchmarkers use geomean.
>
> Moreover, you are making a mistake quite common among non-physicists.
> What you say makes sense under the implicit assumption that dividing
> two times gives something you can use as a time. When you say "Pypy's
> runtime for a 1 second task", you actually want to talk about a
> performance ratio, not about the time. In the same way as when you say
> "this bird runs 3 meters long in one second", a physicist would sum
> that up as "3 m/s" rather than "3 m".
>
> > I am not really calculating any mean. You can see that I carefully
> > avoided displaying any kind of total bar, which would indeed incur the
> > problem you mention. That a stacked chart implicitly displays a total is
> > something you cannot avoid, and for that kind of chart I still think
> > normalized results are visually the best option.
>
> But on a stacked-bar graph, I'm not going to look at individual bars
> at all, just at the total: it's actually less convenient than "normal
> bars" for looking at the result of a particular benchmark.
>
> I hope I can find guidelines against stacked plots; I have a PhD
> colleague reading up on how to make graphs.
>
> Best regards
> --
> Paolo Giarrusso - Ph.D. Student
> http://www.informatik.uni-marburg.de/~pgiarrusso/
>