[pypy-dev] New speed.pypy.org version

Paolo Giarrusso p.giarrusso at gmail.com
Fri Jul 2 09:58:29 CEST 2010


On Fri, Jul 2, 2010 at 09:27, Miquel Torres <tobami at googlemail.com> wrote:
> Hi Paolo,
>
> hey! I think it is a great idea. With logs you get both: correct
> normalized totals AND the ability to display the individual stacked
> series, which necessarily add arithmetically. But it strikes me:
> hasn't anyone written a paper about that method already? Or at least
> documented it?

I guess the problem is that the graph is weird enough, and that you
need arbitrary constants a and b to make it work, since the logarithm
can become negative and arbitrarily large in magnitude (log 0 = -inf).
I still think that's fair and makes sense, but it's somewhat hard to
sell.
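
For concreteness, here is a minimal sketch of that transform; the
choice of a from the data range and all the numbers are my own
assumptions, just to show how a and b keep every bar positive:

    # Map log-ratios into positive bar heights via a + b*log(r).
    import math

    ratios = [0.5, 1.0, 2.0, 8.0]         # per-benchmark time ratios
    logs = [math.log(r) for r in ratios]  # negative whenever ratio < 1

    b = 1.0                               # display scale
    a = -b * min(logs) + 0.1              # shift smallest bar above zero
    heights = [a + b * x for x in logs]
    assert all(h > 0 for h in heights)
    # Stacked total: N*a + b*N*log(geomean) -- monotone in the geomean.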

> Anyway I need to check that the math is right (hopefully today), and
> then I would go and implement it.
> I'll tell you how it goes.
>
> Cheers,
> Miquel
>
>
>
> 2010/6/30 Paolo Giarrusso <p.giarrusso at gmail.com>:
>> Hi Miquel,
>> I'm quite busy (because of a paper deadline next Tuesday), sorry for
>> not answering earlier.
>>
>> I was just struck by an idea: there is a stacked bar plot where the
>> total bar is related to the geometric mean, such that it is
>> normalization-invariant. But this graph _is_ complicated.
>>
>> The idea is a stacked plot of _logarithms_ of performance ratios.
>> This way, the complete stacked bar shows the logarithm of the
>> product of the ratios rather than their sum, i.e. log((geometric
>> mean)^N) = N * log(geometric mean), instead of the arithmetic mean.
>>
>> Some simple maths (I didn't write it out, so please recheck!) seems
>> to show that plotting a + b*log(ratio), instead of log(ratio), still
>> gives a fair comparison: the total becomes N*a + b*N*log(geomean) =
>> Theta(log(geomean)). You need a and b because log(1) = 0 when the
>> ratio is 1, so a shifts the bars away from zero (b is just the
>> representation scale, which is always there anyway).
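>>
>> A quick numeric check of that identity (a throwaway sketch, with
>> made-up ratios):
>>
>>     # sum_i (a + b*log(r_i)) == N*a + b*N*log(geomean(r))
>>     import math
>>
>>     ratios = [0.8, 1.5, 3.0]
>>     a, b, n = 2.0, 0.5, len(ratios)
>>     total = sum(a + b * math.log(r) for r in ratios)
>>     geomean = math.prod(ratios) ** (1.0 / n)
>>     assert abs(total - (n * a + b * n * math.log(geomean))) < 1e-9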
>>
>> About your workaround: I would like a table with the geometric mean
>> of the ratios, where we get the real global performance ratio among
>> the interpreters. As long as the results of your solution do not
>> contradict that _real_ table, it should be a reasonable workaround
>> (but I would embed the check in the code - otherwise other projects
>> _will be_ bitten by that). Probably, I would like the website to
>> offer such a table to users, and I would like a graph of the overall
>> performance ratio over time (actually revisions).
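>>
>> Such a table entry is easy to compute; a sketch (the data layout and
>> numbers are invented for illustration):
>>
>>     # Overall ratio of an interpreter vs. a baseline: the geometric
>>     # mean of the per-benchmark time ratios.
>>     import math
>>
>>     baseline = {"ai": 10.0, "html5lib": 120.0, "spitfire": 90.0}
>>     candidate = {"ai": 4.0, "html5lib": 80.0, "spitfire": 30.0}
>>
>>     rs = [candidate[k] / baseline[k] for k in baseline]
>>     overall = math.prod(rs) ** (1.0 / len(rs))
>>     print(overall)  # < 1 means faster than the baseline overall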
>>
>> Finally, the docs of your web application should at the very least
>> reference the paper and this conversation (if there's a public archive
>> of the ML, as I think), and ideally explain the issue.
>>
>> Sorry for being too dense, maybe - if I was unclear, please tell me
>> and I'll answer next week.
>>
>> Best regards,
>> Paolo
>>
>> On Mon, Jun 28, 2010 at 11:21, Miquel Torres <tobami at googlemail.com> wrote:
>>> Hi Paolo,
>>>
>>> I read the paper, very interesting. It is perfectly clear that to
>>> calculate a normalized total only the geometric mean makes sense.
>>>
>>> However, a stacked bars plot shows the individual benchmarks so it
>>> implicitly is an arithmetic mean. The only solution (apart from
>>> removing the stacked charts and only offering total bars) is the
>>> weighted approach.
>>>
>>> External weights are not very practical though. Codespeed is used by
>>> other projects, so an extra option would need to be added to the
>>> settings to allow introducing arbitrary weights for benchmarks.
>>> A bit cumbersome. I have an idea that may work: take the weights from
>>> a defined baseline so that the run times become equal, which is the
>>> same as normalizing to that baseline. It would be the same as now,
>>> except that you can't choose the normalization; everything is
>>> weighted (normalized) according to the default baseline (which you
>>> can already configure in the settings).
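>>>
>>> In code the idea amounts to something like this (my sketch of the
>>> proposal, with invented numbers, not Codespeed's implementation):
>>>
>>>     baseline = {"ai": 10.0, "spitfire": 90.0}  # default baseline
>>>     candidate = {"ai": 4.0, "spitfire": 45.0}
>>>
>>>     normalized = {k: candidate[k] / baseline[k] for k in baseline}
>>>     print(normalized)  # {'ai': 0.4, 'spitfire': 0.5}
>>>     # Each stacked slice is now a unitless ratio; the total is
>>>     # still an arithmetic mean, but tied to this single baseline.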
>>>
>>> You may say that it is still an arithmetic mean, but there won't be
>>> conflicting results because there is only a single normalization. For
>>> PyPy that would be cpython, and everything would make sense.
>>> I know it is a workaround, not a solution. If you think it is a bad
>>> idea, the only other possibility is not to have stacked bars (as in
>>> "showing individual benchmarks"). But I find them useful. Yes, you
>>> can see the individual benchmark results better in the normal bars
>>> chart, but there you don't see visually which benchmarks take the
>>> biggest part of the pie, which helps show what parts of your program
>>> need the most improvement.
>>>
>>> What do you think?
>>>
>>> Regards,
>>> Miquel
>>>
>>>
>>> 2010/6/25 Paolo Giarrusso <p.giarrusso at gmail.com>:
>>>> On Fri, Jun 25, 2010 at 19:08, Miquel Torres <tobami at googlemail.com> wrote:
>>>>> Hi Paolo,
>>>>>
>>>>> I am aware of the problem with calculating benchmark means, but let me
>>>>> explain my point of view.
>>>>>
>>>>> You are correct in that it would be preferable to have absolute
>>>>> times. Well, you actually can, but see what happens:
>>>>> http://speed.pypy.org/comparison/?hor=true&bas=none&chart=stacked+bars
>>>>
>>>> Ahah! I didn't notice that I could skip normalization! This does not
>>>> fully invalidate my point, however.
>>>>
>>>>> Absolute values would only work if we had carefully chosen benchmark
>>>>> runtimes to be very similar (for our cpython baseline). As it is, html5lib,
>>>>> spitfire and spitfire_cstringio completely dominate the cummulative time.
>>>>
>>>> I acknowledge that (btw, it should be cumulative time, with one 'm',
>>>> both here and in the website).
>>>>
>>>>> And not because the interpreter is faster or slower but because the
>>>>> benchmark was arbitrarily designed to run that long. Any improvement in the
>>>>> long running benchmarks will carry much more weight than in the short
>>>>> running.
>>>>
>>>>> What is more useful is to have comparable slices of time so that
>>>>> improvements can be seen in relative terms over time.
>>>>
>>>> If you want to sum up times (but at this point, I see no reason for
>>>> it), you should rather have externally derived weights, as suggested
>>>> by the paper (in Rule 3).
>>>> As soon as you take weights from the data, much of the maths you
>>>> need stops working - that's generally true in many areas of
>>>> statistics.
>>>> And the only sensible way to obtain external weights is to gather
>>>> them from real-world programs. Since that's not going to happen
>>>> easily, just stick with the geometric mean. Or set an arbitrarily low
>>>> weight, manually, without any math, so that the long-running
>>>> benchmarks stop dominating the result. It's no fraud, since the
>>>> current graph is less valid anyway.
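>>>>
>>>> As a sketch, such manual weights would just be a weighted sum (all
>>>> numbers invented for illustration):
>>>>
>>>>     times = {"html5lib": 120.0, "ai": 2.0, "spitfire": 90.0}
>>>>     weights = {"html5lib": 0.1, "ai": 1.0, "spitfire": 0.1}
>>>>     total = sum(weights[k] * times[k] for k in times)
>>>>     print(total)  # 23.0: long runs no longer drown out "ai"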
>>>>
>>>>> Normalizing does that, I
>>>>> think.
>>>> Not really.
>>>>
>>>>> It just says: we have 21 tasks which take 1 second to run each on
>>>>> interpreter X (cpython in the default case). Then we see how other
>>>>> executables compare to that. What would the geometric mean achieve here,
>>>>> exactly, for the end user?
>>>>
>>>> You actually need the geomean to do that. Don't forget that the
>>>> geomean is still a mean: it's a mean performance ratio which averages
>>>> individual performance ratios.
>>>> If PyPy's geomean is 0.5, it means that PyPy is going to run that task
>>>> in 10.5 seconds instead of 21. To me, this sounds exactly like what
>>>> you want to achieve. Moreover, it actually works, unlike what you use.
>>>>
>>>> For instance, ignore PyPy-JIT, and look only at CPython and pypy-c
>>>> (no JIT). Then switch the normalization between the two:
>>>> http://speed.pypy.org/comparison/?exe=2%2B35,3%2BL&ben=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21&env=1&hor=true&bas=2%2B35&chart=stacked+bars
>>>> http://speed.pypy.org/comparison/?exe=2%2B35,3%2BL&ben=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21&env=1&hor=true&bas=3%2BL&chart=stacked+bars
>>>> with the current data, you get that in one case cpython appears
>>>> faster, in the other pypy-c does.
>>>> That can't happen with the geomean. This is the point of the paper.
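>>>>
>>>> The effect is easy to reproduce with made-up numbers (a
>>>> self-contained sketch, nothing to do with the real data):
>>>>
>>>>     # Two interpreters, two benchmarks: the arithmetic mean of
>>>>     # normalized times flips with the baseline; the geomean doesn't.
>>>>     import math
>>>>
>>>>     A = {"b1": 1.0, "b2": 10.0}   # invented times, in seconds
>>>>     B = {"b1": 2.0, "b2": 5.0}
>>>>
>>>>     def arith_vs(base, other):
>>>>         return sum(other[k] / base[k] for k in base) / len(base)
>>>>
>>>>     print(arith_vs(A, B))  # 1.25 -> B looks slower than A
>>>>     print(arith_vs(B, A))  # 1.25 -> A also looks slower than B!
>>>>     geo = math.prod(B[k] / A[k] for k in A) ** (1 / len(A))
>>>>     print(geo)             # 1.0 -> geomean: they are equivalent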
>>>>
>>>> I could even construct a normalization baseline $base such that
>>>> CPython seems faster than PyPy-JIT. Such a base should be very fast
>>>> on, say, ai (where CPython is slower), so that $cpython.ai/$base.ai
>>>> becomes 100 and $pypyjit.ai/$base.ai becomes 200, and be very slow on
>>>> other benchmarks (so that they disappear in the sum).
>>>>
>>>> So, the only difference I see is that geomean works, arithm. mean
>>>> doesn't. That's why Real Benchmarkers use geomean.
>>>>
>>>> Moreover, you are making a mistake quite common among non-physicists.
>>>> What you say makes sense under the implicit assumption that dividing
>>>> two times gives something you can use as a time. When you say "PyPy's
>>>> runtime for a 1 second task", you actually want to talk about a
>>>> performance ratio, not about a time. In the same way, when you say
>>>> "this bird runs 3 meters in one second", a physicist would sum that
>>>> up as "3 m/s" rather than "3 m".
>>>>
>>>>> I am not really calculating any mean. You can see that I carefully
>>>>> avoided displaying any kind of total bar, which would indeed run into
>>>>> the problem you mention. That a stacked chart implicitly displays a
>>>>> total is something you cannot avoid, and for that kind of chart I
>>>>> still think normalized results are visually the best option.
>>>>
>>>> But on a stacked bars graph, I'm not going to look at individual bars
>>>> at all, just at the total: it's actually less convenient than in
>>>> "normal bars" to look at the result of a particular benchmark.
>>>>
>>>> I hope I can find guidelines against stacked plots; I have a PhD
>>>> colleague reading up on how to make graphs.
>>>>
>>>> Best regards
>>>> --
>>>> Paolo Giarrusso - Ph.D. Student
>>>> http://www.informatik.uni-marburg.de/~pgiarrusso/
>>>>
>>>
>>
>>
>>
>> --
>> Paolo Giarrusso - Ph.D. Student
>> http://www.informatik.uni-marburg.de/~pgiarrusso/
>>
>



-- 
Paolo Giarrusso - Ph.D. Student
http://www.informatik.uni-marburg.de/~pgiarrusso/


